Text preparation strips Unicode characters

Created on 25 May 2023, almost 2 years ago

Problem/Motivation

When text is prepared for prompt input, it is filtered through a number of steps. The final step tries to remove unwanted characters, but it also removes Unicode characters (for example the Danish "ø" and "æ") that are both valid in other languages and understood by OpenAI. The prompt then contains mangled words, which leads to less usable results being returned.

Steps to reproduce

1. Install and configure Drupal and the OpenAI Content module
2. Create an article with the following content (in Danish):

Title: "Hvad er en ø"
Body: "En ø er et landområde helt omgivet af vand ved normalvandstand, der er mindre end et kontinent og større end en sten eller et skær. En lille ø kaldes for en holm eller småø.

Øer kan ses i havene, i oceaner, i søer eller i floder. Størrelsen varierer kraftigt fra små slam- og sandøer på få kvadratmeter til Grønland, som med sit areal på på 2.166.086 km², hvoraf 410.449 km² er isfrit, er Jordens største ø. Australien er større, men regnes for at være et kontinent."
Source: https://da.wikipedia.org/wiki/%C3%98

3. Edit the article and hit "Suggest taxonomy".
4. I get the following suggestions: Holm, Skr, Slam, Sand, Kontinent. I would expect the primary topic, "Ø", to be present, but it is not. Also, "Skr" is not a word: it is "skær" with the "æ" stripped.

Proposed resolution

Do not strip Unicode characters during text preparation.
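
The difference can be illustrated with a minimal sketch. The module's actual filter is not shown in this report, so both patterns below are assumptions: the first stands in for a filter that only allows ASCII letters, the second is one possible Unicode-aware alternative. The function names are hypothetical, and Python is used purely for illustration (the module itself is PHP, where a comparable approach would be a preg_replace pattern with the u modifier and \p{L}\p{N} classes).

```python
import re


def strip_ascii_only(text: str) -> str:
    """Hypothetical strict filter: keep ASCII letters, digits,
    whitespace and a little punctuation; drop everything else."""
    return re.sub(r"[^A-Za-z0-9\s.,!?-]", "", text)


def strip_unicode_aware(text: str) -> str:
    """Hypothetical alternative: \\w on a Python 3 str pattern also
    matches non-ASCII letters and digits, so "ø" and "æ" survive."""
    return re.sub(r"[^\w\s.,!?-]", "", text)


sample = "En ø er større end et skær."

# The ASCII-only filter deletes "ø" and "æ", turning "skær" into "skr".
ascii_result = strip_ascii_only(sample)

# The Unicode-aware filter leaves the Danish text untouched.
unicode_result = strip_unicode_aware(sample)

print(ascii_result)
print(unicode_result)
```

With the ASCII-only pattern the topic word "ø" disappears entirely and "skær" becomes "skr", which matches the suggestions seen in the steps to reproduce.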

Remaining tasks

None.

User interface changes

None.

API changes

None.

Data model changes

None.

🐛 Bug report
Status

Fixed

Version

1.0

Component

Code

Created by

🇩🇰Denmark kasperg
