Broken Byte-Pair Encoding (BPE)

Created on 21 August 2024, 3 months ago

Problem/Motivation

The text chunker, src/Utility/TextChunker.php, breaks Byte-Pair Encoding (BPE) by chunking text that has already been tokenized. Chunk boundaries must follow the tokenization rules.
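
To illustrate the failure mode, here is a minimal sketch that does not use Tiktoken at all. It assumes, purely for illustration, that fixed 2-byte slices stand in for byte-level BPE tokens. Real token boundaries differ, but they can likewise fall inside a multi-byte UTF-8 character. The helper name split_into_byte_pieces() is hypothetical, and the mbstring extension is required.

```php
<?php
// Hypothetical stand-in for a tokenizer: each "token" is a fixed-size byte
// slice, NOT real BPE output. It only shows where boundaries can land.
function split_into_byte_pieces(string $text, int $size): array {
  return str_split($text, $size);
}

// Two Chinese characters, three UTF-8 bytes each (six bytes in total).
$text = '中文';
$pieces = split_into_byte_pieces($text, 2);

foreach ($pieces as $piece) {
  // mb_check_encoding() returns FALSE when a piece starts or ends
  // in the middle of a multi-byte character.
  $valid = mb_check_encoding($piece, 'UTF-8') ? 'yes' : 'no';
  printf("%s valid UTF-8: %s\n", bin2hex($piece), $valid);
}

// Joined back together in order, the pieces reproduce the original text.
var_dump(implode('', $pieces) === $text);
```

Each piece on its own is malformed, even though the concatenation is intact. A chunker that cuts a token sequence at an arbitrary position and decodes each part separately can run into the same situation.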

Steps to reproduce

TextChunker returns malformed strings on many occasions. The easiest way to test is to iteratively chunk complex text, such as mathematical notation or Chinese.

Proposed resolution

We need to extend the Tiktoken Encoder with a method that builds on the functionality of its encodeInChunks().
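
One possible shape for such a boundary-aware step is sketched below. This is not Tiktoken's API and is not claimed to be how the module resolves the issue. The function chunk_on_valid_boundaries() is a hypothetical name, and 2-byte slices again stand in for byte-level tokens. The idea is to close a chunk only at a position where the accumulated bytes decode as valid UTF-8, which means a chunk may run slightly over the requested size.

```php
<?php
/**
 * Groups byte-level pieces into chunks of roughly $maxPieces pieces.
 *
 * Hypothetical helper for illustration only: a chunk is closed only when
 * the bytes collected so far form valid UTF-8, so no character is split.
 */
function chunk_on_valid_boundaries(array $pieces, int $maxPieces): array {
  $chunks = [];
  $current = '';
  $count = 0;
  foreach ($pieces as $piece) {
    $current .= $piece;
    $count++;
    // Only cut once the size target is reached AND the boundary is clean.
    if ($count >= $maxPieces && mb_check_encoding($current, 'UTF-8')) {
      $chunks[] = $current;
      $current = '';
      $count = 0;
    }
  }
  // Keep whatever is left over as the final chunk.
  if ($current !== '') {
    $chunks[] = $current;
  }
  return $chunks;
}

// Four characters, twelve bytes, sliced into six 2-byte stand-in "tokens".
$text = '中文分块';
$pieces = str_split($text, 2);
$chunks = chunk_on_valid_boundaries($pieces, 2);
var_dump($chunks);
```

With a target of 2 pieces, each chunk here ends up holding 3 pieces, because the second piece always ends mid-character. Overlap would need the same treatment: the start of the overlapping region must also sit on a clean boundary.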

🐛 Bug report
Status

Active

Version

1.0

Component

AI Search

Created by

🇬🇧United Kingdom seogow

Comments & Activities

  • Issue created by @seogow
  • 🇬🇧United Kingdom scott_euser

    I guess your proposed solution is a temporary one until it's fixed in https://github.com/yethee/tiktoken-php? We can add test cases to the unit tests to cover this I think + should raise as an issue there. Do you agree?

  • 🇬🇧United Kingdom seogow

    TikToken doesn't fail; our implementation of it does. The easiest temporary fix is to disable overlap (set it to 0) and use the averaging strategy. We need a better extension of TikToken that accounts for this.

  • 🇨🇳China fishfree

    @seogow I followed your tip for the temporary fix: I changed the ai_search Milvus server config, then cleared all indexes and re-indexed. This error occurred:

    Error: Typed property Drupal\ai_search\Base\EmbeddingStrategyPluginBase::$chunkMinOverlap must not be accessed before initialization in Drupal\ai_search\Plugin\EmbeddingStrategy\EmbeddingBase->getChunks() (line 254 of /var/www/html/drupal/web/modules/contrib/ai/modules/ai_search/src/Plugin/EmbeddingStrategy/EmbeddingBase.php).
    
  • 🇨🇳China fishfree

    Any update here? :-)

  • Status changed to Postponed: needs info 19 days ago
  • 🇬🇧United Kingdom scott_euser

    Can someone provide clear steps and failing example content so that this can be reproduced? When providing example content, please include:

    1. Just enough content to demonstrate the problem, e.g. text that should be 2 chunks but isn't being broken into chunks (if I understand the issue right)
    2. The chunk size to test with
    3. The current chunked results (if any)
    4. The expected chunks

    From @seogow's description, it sounds like the problem is at the chunker level, so here is some quick sample code you could start from to demonstrate the problem:

    $text = 'Your text here that is failing';
    /** @var \Drupal\ai\Utility\TextChunker $chunker */
    $chunker = \Drupal::service('ai.text_chunker');
    $chunks = $chunker->chunkText($text, 100, 0);
    
  • 🇨🇳China fishfree

    @scott I'm willing to help debug and test. I ran your code snippet on the /devel/php page of my site, and it showed this error: Typed property Drupal\ai\Utility\Tokenizer::$encoder must not be accessed before initialization

    Could you please provide a complete working code snippet that I can run with the devel_php module on my site? Many thanks!

  • 🇬🇧United Kingdom scott_euser

    You'll need to run it within something (e.g. an existing function or method) and then e.g. dvm() the result. Thanks!

  • 🇬🇧United Kingdom scott_euser

    Hmmm, actually let me look further; it could be that I need to update the code snippet.

  • 🇬🇧United Kingdom scott_euser

    Okay, here is working sample code (replace MYMODULE with the name of the module you drop this into):

    function MYMODULE_page_attachments_alter() {
      $text = 'Your text here that is failing';
      /** @var \Drupal\ai\Utility\TextChunker $chunker */
      $chunker = \Drupal::service('ai.text_chunker');
      $chunker->setModel('openai__gpt-3.5-turbo');
      $chunks = $chunker->chunkText($text, 100, 0);
      dvm($chunks);
    }
    

    Really, you can use any hook or drop it into an existing hook. I just chose this one at random because it is triggered once on every page load.

  • 🇨🇳China fishfree

    @scott Thank you! I ran your code snippet as below:

    function token_page_attachments_alter() {
      $text = '我是一名退休教师,退休前我很难抽出时间去运动和锻炼,那时感觉自己处在一种亚健康的状态。退休后,我的时间充裕了,经过几年的坚持锻炼,我现在的身体素质比退休前还要好。单位每年组织体检,我的各项指标基本可以达标。在这里,我呼吁老年朋友多运动,多锻炼。立冬时气温降低,老年人外出运动时一定要穿着保暖透气的衣服,运动前做好准备活动,先热身、后运动。选择运动方式时,一定要适合自己。在运动中释放自己,保持好心情。';
      /** @var \Drupal\ai\Utility\TextChunker $chunker */
      $chunker = \Drupal::service('ai.text_chunker');
      $chunker->setModel('openai__gpt-3.5-turbo');
      $chunks = $chunker->chunkText($text, 100, 0);
      dpm($chunks);
    }
    

    It printed out as below:

    array:3 [
      0 => "我是一名退休教师,退休前我很难抽出时间去运动和锻炼,那时感觉自己处在一种亚健康的状态。退休后,我的时间充裕了,经过几年的坚持锻炼"
      1 => ",我现在的身体素质比退休前还要好。单位每年组织体检,我的各项指标基本可以达标。在这里,我呼吁老年朋友多运动,多锻炼。立冬时气温降低"
      2 => ",老年人外出运动时一定要穿着保暖透气的衣服,运动前做好准备活动,先热身、后运动。选择运动方式时,一定要适合自己。在运动中释放自己,保持好心情。"
    ]
    

    The 3 output chunks have no overlap.

  • 🇬🇧United Kingdom scott_euser

    Thanks! And then I just need item 4 (the expected chunks) from comment #6, because I need to understand what's wrong with the chunking. To me, what you show looks fine: you asked for chunks of 100 tokens with no overlap and it gave you three chunks.

  • 🇬🇧United Kingdom scott_euser

    Actually, I see how to reproduce this: if I take your same example and add any overlap, I get gibberish in later chunks rather than proper chunks reflecting the original text.

  • Merge request !258 Resolve #3469392 "Broken byte pair encoding" → (Merged) created by scott_euser
  • 🇬🇧United Kingdom scott_euser

    Okay got it, here is how you can see the results:

      $text = '我是一名退休教师,退休前我很难抽出时间去运动和锻炼,那时感觉自己处在一种亚健康的状态。退休后,我的时间充裕了,经过几年的坚持锻炼,我现在的身体素质比退休前还要好。单位每年组织体检,我的各项指标基本可以达标。在这里,我呼吁老年朋友多运动,多锻炼。立冬时气温降低,老年人外出运动时一定要穿着保暖透气的衣服,运动前做好准备活动,先热身、后运动。选择运动方式时,一定要适合自己。在运动中释放自己,保持好心情。';
      /** @var \Drupal\ai\Utility\TextChunker $chunker */
      $chunker = \Drupal::service('ai.text_chunker');
      $chunker->setModel('openai__gpt-3.5-turbo');
      $chunks = $chunker->chunkText($text, 100, 10);
      dvm('overlap 10:');
      dvm($chunks);
    
      $chunks = $chunker->chunkText($text, 100, 0);
      dvm('overlap 0:');
      dvm($chunks);
    

    This lets you compare the results with overlap against those without overlap.

    Side note to anyone following along: I think this necessarily causes a minor shift in the overlap, but having reviewed the changes to the test coverage, the overlap is still sensible. I don't think it needs to trigger any reindexing, as the shift is minor and the vectors will be roughly the same (and since dimensions etc. don't change, a query against the index will continue to work fine).

  • 🇨🇳China fishfree

    @scott, after patching in your commit, it seems to be working now. Many thanks!

  • 🇬🇧United Kingdom scott_euser

    Great, thanks for confirming. I will see if any other maintainers have comments before merging.

  • Pipeline finished with Skipped
    13 days ago
    #336165
    • scott_euser committed 7e4d775b on 1.0.x
      Issue #3469392 by scott_euser, fishfree, seogow, marcus_johansson:...
    • scott_euser committed 7e4d775b on 3486953-add-structuredoutput-to
      Issue #3469392 by scott_euser, fishfree, seogow, marcus_johansson:...