Text preparation strips unicode characters


Created on 25 May 2023, almost 2 years ago

Problem/Motivation

When text is prepared for prompt input, it is filtered through a number of steps. The final step tries to remove unwanted characters, but it also removes characters that are valid in other languages and understood by OpenAI. This can make the returned results less usable.
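To illustrate the failure mode, here is a minimal sketch in Python. It is not the module's PHP code: the pattern `ASCII_ONLY` and the function `strip_non_ascii` are hypothetical, and they assume the final filtering step uses an ASCII-only allow-list similar to this one.

```python
import re

# Hypothetical ASCII-only allow-list; the module's real pattern is not
# shown in this issue, so this is only an assumption for illustration.
ASCII_ONLY = re.compile(r"[^A-Za-z0-9\s.,]")


def strip_non_ascii(text: str) -> str:
    """Remove every character that is not ASCII alphanumeric, whitespace, '.' or ','."""
    return ASCII_ONLY.sub("", text)


# Danish letters fall outside the allow-list and are silently dropped.
print(strip_non_ascii("skær"))  # skr
print(repr(strip_non_ascii("Hvad er en ø")))
```

With a filter like this, the letter "ø" disappears entirely and "skær" loses its "æ", which matches the symptoms described in the steps to reproduce.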

Steps to reproduce

1. Install and configure Drupal and the OpenAI Content module
2. Create an article with the following content (in Danish):

Title: "Hvad er en ø"
Body: "En ø er et landområde helt omgivet af vand ved normalvandstand, der er mindre end et kontinent og større end en sten eller et skær. En lille ø kaldes for en holm eller småø.

Øer kan ses i havene, i oceaner, i søer eller i floder. Størrelsen varierer kraftigt fra små slam- og sandøer på få kvadratmeter til Grønland, som med sit areal på på 2.166.086 km², hvoraf 410.449 km² er isfrit, er Jordens største ø. Australien er større, men regnes for at være et kontinent."
Source: https://da.wikipedia.org/wiki/%C3%98

3. Edit the article and hit "Suggest taxonomy".
4. I get the following suggestions: Holm, Skr, Slam, Sand, Kontinent. I would expect the primary topic, "Ø", to be present, but it is not. Also, "Skr" is not a word; it appears to be "skær" with the "æ" stripped.

Proposed resolution

Do not strip unicode characters.
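
A minimal sketch of what a Unicode-aware filter could look like, written in Python for illustration only. The names `UNICODE_AWARE` and `strip_unwanted` are hypothetical, and the allow-list is an assumption rather than the pattern committed to the module. In Python 3, `\w` in a `str` pattern matches Unicode word characters by default; in PHP the comparable approach would be a PCRE pattern using `\p{L}` together with the `u` modifier.

```python
import re

# Hypothetical allow-list: Unicode word characters, whitespace, '.' and ','.
# In Python 3, \w on str patterns is Unicode-aware by default.
UNICODE_AWARE = re.compile(r"[^\w\s.,]")


def strip_unwanted(text: str) -> str:
    """Remove characters outside the allow-list while keeping non-ASCII letters."""
    return UNICODE_AWARE.sub("", text)


# Danish letters survive; markup-like characters are still removed.
print(strip_unwanted("Hvad er en ø?"))  # Hvad er en ø
print(strip_unwanted("skær <b>"))       # skær b
```

The design choice is to describe allowed characters by category (letters, digits, whitespace) instead of by ASCII range, so text in other languages passes through unchanged.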

Remaining tasks

None.

User interface changes

None.

API changes

None.

Data model changes

None.

🐛 Bug report

Status

Fixed

Version

1.0

Component

Code

Created by

🇩🇰Denmark kasperg


Comments & Activities

Issue created by @kasperg
@kasperg opened merge request.
Status changed to Needs review almost 2 years ago, 6:46pm 25 May 2023
Comment almost 2 years ago →
🇩🇰Denmark kasperg
I have updated the regular expression to support unicode characters.

With this change the following taxonomy terms are suggested in the above example: ø, vand, landområde, kontinent, holm. Here "Ø" is present and none of the words are misspelled.
First commit to issue fork.
Comment almost 2 years ago →
🇺🇸United States kevinquillen
Thanks for that. I made a couple of additions plus a test case. Are there other text scenarios we could assert?

https://git.drupalcode.org/project/openai/-/merge_requests/40/diffs?comm...
Comment almost 2 years ago →
🇺🇸United States kevinquillen
It looks a little arcane, but it was the consensus among several threads about Unicode + PHP DOMDocument.
Comment almost 2 years ago →
System Message

kevinquillen → committed 55cf3006 on 1.0.x authored by kasperg →
Issue #3362773 by kevinquillen: Text preparation strips unicode...
Status changed to Fixed almost 2 years ago, 8:10pm 25 May 2023
Comment almost 2 years ago →
🇺🇸United States kevinquillen
Comment almost 2 years ago →
System Message
Automatically closed - issue fixed for 2 weeks with no activity.
