- Issue created by @seogow
- 🇬🇧United Kingdom scott_euser
I guess your proposed solution is a temporary one until it's fixed in https://github.com/yethee/tiktoken-php? I think we can add unit test cases to cover this, and we should also raise it as an issue there. Do you agree?
- 🇬🇧United Kingdom seogow
TikToken doesn't fail; our implementation of it does. The easiest temporary fix is to disable overlap (set it to 0) and use the averaging strategy. We need a better extension of TikToken that accounts for this.
- 🇨🇳China fishfree
@seogow I followed your tip for the temporary fix: I changed the ai_search Milvus server config, then cleared the whole index and re-indexed, and got this error:
Error: Typed property Drupal\ai_search\Base\EmbeddingStrategyPluginBase::$chunkMinOverlap must not be accessed before initialization in Drupal\ai_search\Plugin\EmbeddingStrategy\EmbeddingBase->getChunks() (line 254 of /var/www/html/drupal/web/modules/contrib/ai/modules/ai_search/src/Plugin/EmbeddingStrategy/EmbeddingBase.php).
- Status changed to Postponed: needs info, 8:03pm 6 November 2024
- 🇬🇧United Kingdom scott_euser
Can someone provide clear steps and failing example content so that this can be reproduced? When providing example content, please include:
- Just enough content to demonstrate the problem, e.g. text that should be broken into 2 chunks but isn't (if I understand the issue right)
- The chunk size to test with
- The current chunked results (if any)
- The expected chunks
From @seogow's description, it sounds like the problem is at the chunker level, so here is some quick sample code you could start from to demonstrate the problem:
$text = 'Your text here that is failing';
/** @var \Drupal\ai\Utility\TextChunker $chunker */
$chunker = \Drupal::service('ai.text_chunker');
$chunks = $chunker->chunkText($text, 100, 0);
- 🇨🇳China fishfree
@scott I'm willing to help debug and test. I ran your code snippet on the /devel/php page of my site, and it showed this error:
Typed property Drupal\ai\Utility\Tokenizer::$encoder must not be accessed before initialization
Could you please provide a complete working code snippet that I can run with the devel_php module on my site? Many thanks!
- 🇬🇧United Kingdom scott_euser
You'll need to run it within something (e.g. an existing function or method) and then, for example, dvm() the result. Thanks!
- 🇬🇧United Kingdom scott_euser
Hmm, actually, let me look further. It could be that I need to update the code snippet.
- 🇬🇧United Kingdom scott_euser
Okay, here is working sample code (replace MYMODULE with the name of the module you drop this into):
function MYMODULE_page_attachments_alter() {
  $text = 'Your text here that is failing';
  /** @var \Drupal\ai\Utility\TextChunker $chunker */
  $chunker = \Drupal::service('ai.text_chunker');
  $chunker->setModel('openai__gpt-3.5-turbo');
  $chunks = $chunker->chunkText($text, 100, 0);
  dvm($chunks);
}
Really, you can use any hook or drop the code into an existing hook implementation. I chose this one arbitrarily because it is triggered once on every page load.
- 🇨🇳China fishfree
@scott Thank you! I ran your code snippet as below:
function token_page_attachments_alter() {
  $text = '我是一名退休教师,退休前我很难抽出时间去运动和锻炼,那时感觉自己处在一种亚健康的状态。退休后,我的时间充裕了,经过几年的坚持锻炼,我现在的身体素质比退休前还要好。单位每年组织体检,我的各项指标基本可以达标。在这里,我呼吁老年朋友多运动,多锻炼。立冬时气温降低,老年人外出运动时一定要穿着保暖透气的衣服,运动前做好准备活动,先热身、后运动。选择运动方式时,一定要适合自己。在运动中释放自己,保持好心情。';
  /** @var \Drupal\ai\Utility\TextChunker $chunker */
  $chunker = \Drupal::service('ai.text_chunker');
  $chunker->setModel('openai__gpt-3.5-turbo');
  $chunks = $chunker->chunkText($text, 100, 0);
  dpm($chunks);
}
It printed the following:
array:3 [
  0 => "我是一名退休教师,退休前我很难抽出时间去运动和锻炼,那时感觉自己处在一种亚健康的状态。退休后,我的时间充裕了,经过几年的坚持锻炼"
  1 => ",我现在的身体素质比退休前还要好。单位每年组织体检,我的各项指标基本可以达标。在这里,我呼吁老年朋友多运动,多锻炼。立冬时气温降低"
  2 => ",老年人外出运动时一定要穿着保暖透气的衣服,运动前做好准备活动,先热身、后运动。选择运动方式时,一定要适合自己。在运动中释放自己,保持好心情。"
]
The 3 output chunks have no overlap.
- 🇬🇧United Kingdom scott_euser
Thanks! Then I just need item 4 from comment #6, the expected chunks, because I need to understand what's wrong with the chunking. To me, what you show looks fine: you asked for chunks of 100 tokens with no overlap, and it gave you three chunks.
- 🇬🇧United Kingdom scott_euser
Actually, I see how to reproduce this. If I take your same example and add any overlap, I get gibberish in the later chunks rather than proper chunks reflecting the original text.
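For readers wondering how gibberish can appear at all: one plausible mechanism, which this thread does not confirm as the actual cause, is cutting multi-byte text at a position that falls inside a UTF-8 character. Byte-level BPE tokenizers can split a single CJK character across tokens, so slicing a token list at an arbitrary offset may have a similar effect. The sketch below uses only plain PHP string functions (not the `ai.text_chunker` service) to show the difference between a byte-based cut and a character-based cut:

```php
<?php
// Illustration only: this is not the chunker's code, just a demonstration
// of what happens when a cut lands inside a multi-byte character.
$text = '我是一名退休教师';

// Each of these CJK characters is 3 bytes in UTF-8, so a 4-byte cut
// ends one byte into the second character.
$byteSlice = substr($text, 0, 4);

// A character-aware cut keeps whole characters.
$charSlice = mb_substr($text, 0, 2, 'UTF-8');

var_dump(mb_check_encoding($byteSlice, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charSlice, 'UTF-8')); // bool(true)
```

A check such as `mb_check_encoding()` on each returned chunk is a cheap way to spot this class of problem when experimenting with overlap values.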
- 🇬🇧United Kingdom scott_euser
Okay, got it. Here is how you can see the results:
$text = '我是一名退休教师,退休前我很难抽出时间去运动和锻炼,那时感觉自己处在一种亚健康的状态。退休后,我的时间充裕了,经过几年的坚持锻炼,我现在的身体素质比退休前还要好。单位每年组织体检,我的各项指标基本可以达标。在这里,我呼吁老年朋友多运动,多锻炼。立冬时气温降低,老年人外出运动时一定要穿着保暖透气的衣服,运动前做好准备活动,先热身、后运动。选择运动方式时,一定要适合自己。在运动中释放自己,保持好心情。';
/** @var \Drupal\ai\Utility\TextChunker $chunker */
$chunker = \Drupal::service('ai.text_chunker');
$chunker->setModel('openai__gpt-3.5-turbo');
$chunks = $chunker->chunkText($text, 100, 10);
dvm('overlap 10:');
dvm($chunks);
$chunks = $chunker->chunkText($text, 100, 0);
dvm('overlap 0:');
dvm($chunks);
This lets you compare the results with overlap against the results without overlap.
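If eyeballing the two dumps is tedious, a small helper can report how much text two consecutive chunks actually share. This is only a sketch: `sketch_shared_overlap()` is a hypothetical name, not part of the AI module, and it counts characters rather than tokens, so its result will not equal the token overlap passed to `chunkText()`. The two sample chunks are made up for illustration.

```php
<?php
/**
 * Hypothetical helper: number of characters in the longest suffix of
 * $previous that is also a prefix of $next (0 if they share nothing).
 */
function sketch_shared_overlap(string $previous, string $next): int {
  $max = min(mb_strlen($previous, 'UTF-8'), mb_strlen($next, 'UTF-8'));
  // Try the longest candidate first and shrink until a match is found.
  for ($length = $max; $length > 0; $length--) {
    $suffix = mb_substr($previous, -$length, NULL, 'UTF-8');
    $prefix = mb_substr($next, 0, $length, 'UTF-8');
    if ($suffix === $prefix) {
      return $length;
    }
  }
  return 0;
}

// Made-up chunks that share the three characters "充裕了".
$exampleChunks = ['退休后,我的时间充裕了', '充裕了,经过几年的坚持锻炼'];
echo sketch_shared_overlap($exampleChunks[0], $exampleChunks[1]), "\n"; // 3
```

Running it over each adjacent pair of the returned chunks should give 0 for the run without overlap and a positive count for a healthy run with overlap. A garbled chunk would typically report 0 as well.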
Side note to anyone following along: I think this necessarily causes a minor shift in the overlap, but having reviewed the changes to the test coverage, it's still a sensible overlap. I don't think it needs to trigger reindexing, since the shift is minor and the vectors will be roughly the same (and, since the dimensions etc. don't change, a query against the index will continue to work fine).
- 🇨🇳China fishfree
@scott, after patching in your commit, it seems to be working now. Many thanks!
- 🇬🇧United Kingdom scott_euser
Great, thanks for confirming. I will see if any other maintainers have comments before merging.
-
scott_euser →
committed 7e4d775b on 1.0.x
Issue #3469392 by scott_euser, fishfree, seogow, marcus_johansson:...
-
scott_euser →
committed 7e4d775b on 3486953-add-structuredoutput-to
Issue #3469392 by scott_euser, fishfree, seogow, marcus_johansson:...