Tweak the summarize/taxonomy suggester input to get better output

Created on 23 February 2023, over 1 year ago
Updated 2 March 2023, over 1 year ago

Problem/Motivation

For better results, the text sent to OpenAI needs to be stripped of HTML and punctuation. Newlines, punctuation, and HTML can all reduce the accuracy of the results.

Proposed resolution

Prepare the text better before sending it to OpenAI.

Remaining tasks

Create a Utility class for cleaning text, with tests. Example code:

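    // Strip HTML and cap the length (Unicode is \Drupal\Component\Utility\Unicode).
    $body = Unicode::truncate(strip_tags(trim($body)), 3900, TRUE);
    // Replace real and escaped line breaks with a space so words do not run together.
    $body = str_replace(["\r\n", "\r", "\n", "\\r\\n", "\\r", "\\n"], ' ', $body);
    // Remove everything except letters, digits, and spaces.
    $body = preg_replace('/[^a-z0-9 ]/i', '', $body);
    // Collapse the runs of spaces left behind by the steps above.
    $body = preg_replace('/ {2,}/', ' ', $body);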

The body is truncated to 3900 characters so that there is room in the prompt for our other text (the question). Stripping punctuation can break words that depend on it, such as software version numbers, but that may not matter much.
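
As a rough starting point for the utility class, here is a minimal sketch in plain PHP with no Drupal dependency. The class name TextCleaner and the method prepareForPrompt() are hypothetical, not what the module uses. It also assumes a plain substr() cut is acceptable, so unlike Unicode::truncate() with the word-safe flag it can cut in the middle of a word.

```php
<?php

/**
 * Hypothetical helper; the class and method names are illustrative only.
 */
final class TextCleaner {

  /**
   * Reduces HTML to plain words separated by single spaces.
   */
  public static function prepareForPrompt(string $text, int $maxLength = 3900): string {
    // Put a space before every tag so "<p>a</p><p>b</p>" does not become "ab".
    $text = strip_tags(str_replace('<', ' <', $text));
    // Turn entities such as &amp; into the characters they stand for.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Replace real and escaped line breaks with spaces.
    $text = str_replace(["\r\n", "\r", "\n", '\r\n', '\r', '\n'], ' ', $text);
    // Keep ASCII letters, digits, and spaces only, as in the example code.
    $text = preg_replace('/[^a-z0-9 ]/i', '', $text);
    // Collapse runs of spaces and trim the ends.
    $text = trim(preg_replace('/ {2,}/', ' ', $text));
    // Only single-byte characters remain at this point, so substr() is safe.
    return rtrim(substr($text, 0, $maxLength));
  }

}
```

For example, TextCleaner::prepareForPrompt("<p>Drupal 10.1 is out!</p>\n<p>Tom &amp; Jerry</p>") returns "Drupal 101 is out Tom Jerry", which also shows the version-number caveat: "10.1" becomes "101".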

📌 Task
Status

Fixed

Version

1.0

Component

Code

Created by

🇺🇸 United States kevinquillen

Comments & Activities

  • Issue created by @kevinquillen
  • 🇺🇸 United States kevinquillen

    Committed some changes to dev. The results coming back are already 10x better than they were previously. Here is an example from my own content:

    It has been far more accurate with every article I have tried.

    It would be good to expand on this a bit here later and allow the user to select which longtext field to summarize, but for now this is working. I may be able to get to that part (selecting which field) next week.

  • 🇺🇸 United States kevinquillen

    It might be a good idea to also run the text through DOMDocument and delete any pre or code elements. I have noticed that code samples, even after being passed through strip_tags, aren't cleanly removed and can interfere with summaries.
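
    A minimal sketch of that DOMDocument pass, assuming it runs on the raw HTML before strip_tags. The function name strip_code_blocks() and the wrapper div are hypothetical, and this is not the code that was committed.

    ```php
    <?php

    /**
     * Hypothetical helper: returns the HTML with <pre> and <code> elements removed.
     */
    function strip_code_blocks(string $html): string {
      $doc = new DOMDocument();
      // Collect libxml parse warnings internally instead of emitting them.
      $previous = libxml_use_internal_errors(TRUE);
      // The XML declaration makes libxml read the fragment as UTF-8; the
      // wrapper div gives a single element to read the result back from.
      $doc->loadHTML('<?xml encoding="UTF-8"><div id="wrapper">' . $html . '</div>');
      libxml_clear_errors();
      libxml_use_internal_errors($previous);

      $xpath = new DOMXPath($doc);
      // Work on an array copy of the matches while removing nodes.
      foreach (iterator_to_array($xpath->query('//pre | //code')) as $node) {
        if ($node->parentNode !== NULL) {
          $node->parentNode->removeChild($node);
        }
      }

      $wrapper = $xpath->query('//div[@id="wrapper"]')->item(0);
      if ($wrapper === NULL) {
        // Parsing did not produce the wrapper; fall back to the input.
        return $html;
      }
      // Serialize the wrapper's children back to an HTML string.
      $out = '';
      foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
      }
      return $out;
    }
    ```

    The result can then go through strip_tags as before. Note that this also drops inline code elements inside sentences, which may or may not be what is wanted.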

  • Status changed to Needs work over 1 year ago
  • 🇺🇸 United States kevinquillen
  • 🇺🇸 United States kevinquillen

    So far so good. Still getting good results.

    I did notice that we may have an issue trying to remove special tokens like or ... may have to figure out how to handle that too.

  • @kevinquillen opened merge request.
  • 🇺🇸 United States d0t101101

    @kevinquillen - I can also confirm that these tweaks to the OpenAI queries made a dramatic improvement for summarization and taxonomy generation, which is part of the openai_content sub module (which now appears on the node edit pages). Tested this across 10 different nodes with varying subjects and lengths; working great.

    Well done, sir!

  • 🇺🇸 United States kevinquillen

    Ok, this is probably in a good enough position at the moment. I can go back and help other areas with the StringHelper (name subject to change) utility class, and implement the stopwords method that is currently in the queue worker and bring it all into one helper class.

  • Status changed to Fixed over 1 year ago
  • 🇺🇸 United States kevinquillen
  • 🇺🇸 United States d0t101101

    The latest 'Suggest Taxonomy' feature is great and quite powerful! Excellent English grammar skills over there with now requesting 'nouns and adjectives' only too :)

    I've noticed that if you run 'Suggest Taxonomy' repeatedly, the result is sometimes a numbered list and other times a comma-separated list. Ideally it should always be a comma-separated list so that it can be quickly copied and pasted into a Drupal Autocomplete Tags type of input. This prompt seems to do the trick:

    'Suggest five words to classify the following text. The words must be nouns or adjectives, comma separated:'
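
    If the model still returns a numbered list now and then, the response could also be normalized on the PHP side as a safeguard. This is only a sketch of that idea; the function name normalize_term_list() is hypothetical and nothing like it is confirmed to exist in the module.

    ```php
    <?php

    /**
     * Hypothetical helper: turns a numbered, bulleted, or comma separated
     * response into a single comma separated string.
     */
    function normalize_term_list(string $response): string {
      // Split on commas and line breaks, whichever the model used.
      $parts = preg_split('/[,\r\n]+/', $response);
      $terms = [];
      foreach ($parts as $part) {
        // Drop list markers such as "1.", "2)", "-" or "*".
        $term = preg_replace('/^\s*(?:\d+[.)]|[-*])\s*/', '', $part);
        // Remove surrounding whitespace and a trailing period.
        $term = trim($term, " \t.");
        if ($term !== '') {
          $terms[] = $term;
        }
      }
      return implode(', ', $terms);
    }
    ```

    With this, both "1. Drupal\n2. PHP\n3. OpenAI" and "Drupal, PHP, OpenAI." come out as "Drupal, PHP, OpenAI".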

  • Automatically closed - issue fixed for 2 weeks with no activity.
