Which VDB Provider are you using? I'd suspect the limitations you're hitting relate to how filterable attributes are supported by each VDB provider's filtering (which varies wildly)...
Will take care of this in ✨ Allow to do vector search by user supplied query Active first to make it easier to handle and to satisfy other historical requests from other devs.
Hmmm, actually I can see the use, though slightly tweaked (updated issue summary and title). It would be much easier to do a similarity search and avoid an extra LLM call to generate an embedding if we could look up by ID, e.g. for ✨ Make an option to get embedding from the already indexed content to avoid going to LLM Active . In that case, in my module I would retrieve the embedding via the VDB provider and then use it to query the vector database.
- For your front-end error, I'm guessing it's this 🐛 Fatal error on render on front-end resulting in no related content Active
- Adding in a no LLM call option here ✨ Make an option to get embedding from the already indexed content to avoid going to LLM Active
If there are other things missing do let me know; otherwise will wait and see what you push, or wait for you to reach out in Drupal Slack to chat.
Thanks!
Of course I'm only seeing half the puzzle at the moment, but maybe before you take it out of dev to a full release I'll be able to see more and we can discuss better.
Hmmm, trying to think back; I imagine this came in to preserve the status quo of chunking adding a tiny bit of context (originally there was no control of contextual content and no way to index filterable attributes). It's hard to follow the git history as so much reworking happened to get it there. I expect the decision at that time was that the title should be the minimum contextual content in a chunk (but it's too long ago to remember/be sure).
Moving forward, I don't think we should move to a place where the title is not there as contextual content by default: we know that however clear we make the instructions, people will inevitably skip reading them. So whatever approach we take here should be an 'opt out' for those who specifically do not want the title in contextual content.
And yes, we should fix the duplicate title so it appears only once in what actually gets indexed. What actually gets indexed hopefully accurately matches what the Chunker Preview outputs, but I cannot check right now - i.e., AiSearchIndexFieldsForm::buildCheckerChunkTable()
From our past discussions and searching, we had come to the conclusion that LLMs do correctly understand markdown, hence the '# $title' (the conversion to h1 is only for the sake of the preview, and only if you have the optional php league package installed; the LLM is sent the markdown - again, unless I remember wrong).
In terms of the test, I believe it's testing against the chunker preview (I'd need to follow the code a bit more), but I don't think it actually uses the indexed content, just checks if the item is in the index (could be wrong). Looking at AiSearchIndexFieldsForm::getCheckerEmbeddings(), I believe it is generating a fresh embedding based on the current field indexing settings.
Thanks! Apologies that was in there in the first place; the module doesn't provide a drush command, so I removed it altogether.
scott_euser → made their first commit to this issue’s fork.
Personally would much prefer to keep to the issue summary here as we have spent a lot of time having a back and forth to agree approach and get test coverage in. The goal here is to break up the options so it's not an all or nothing but otherwise leave the status quo to keep the scope limited and in a place that is realistically mergeable.
Changing the UX and what the options are does sound like you're raising separate issues related to the user controls in general. I'm not a maintainer here of course, so would suggest waiting until they chime in :)
Do you want to consider making not using embeddings an option in there? Any chance you're on Drupal Slack to chat? It just feels a shame not to try to team up and make something more powerful overall.
Thanks for the work on this! It does look like a related test failure; worth checking what the browser output is there before and after.
For adding filterable attributes, however, assuming you are using AI Search, the Search API Indexing Fields configuration page is where to go to add new attributes to your metadata (with limited support via Views to filter on those).
We haven't attempted to normalise filters in VDB providers (their support is too inconsistent and not close to what Views filters offer), so this is more likely a question for the milvus/zilliz module (although it may also need a change here to allow milvus/zilliz to use it via whatever the chatbot code is). So this issue probably needs to be postponed on a milvus/zilliz issue first.
Note that methods exist in the vdb providers to accept queries but you'd need to follow milvus/zilliz documentation rather than AI module documentation (unless things have changed there that I've missed)
Oh yeah, that's not what I meant - definitely the functionality here is important and provides quite a bit more value; I'm more thinking of the UX changes only.
Had to dig a bit but this was an earlier release (not in stable version yet of AI module): https://drupal.slack.com/archives/C0803LX4536/p1751900254090099?thread_t... which is this mp4 video of using a button below an entity reference field →
- Updated issue summary to reflect that making this generic is addressed in #86.
- Merged 11.x into the branch locally to resolve conflicts; I think I got it right, but let's see test coverage. The PHP 8.5 test warnings come from 11.x, as they also occur on the daily pipeline.
There are a few more issues raised by @kim.pepper around refactoring that involve breaking changes requiring coordinated releases with the VDB provider modules (most of which we maintain); those have suggested we target AI 2.0.x, so perhaps that is the point we aim for stable. I think those should go into the 'must' area so we deal with them all at once, before we tie ourselves to stable and then struggle to get them in later due to BC.
I think we are probably pretty close; there are some minor issues in the issue queue, but all bugfixes. In terms of features we are probably rather complete, as this is of course essentially a bridge between TMGMT and the AI module. The patch to TMGMT core listed on the module homepage shouldn't be a blocker, as it's an upstream problem and the consequence without it is higher token usage; at least it continues to work.
Thank you! Makes sense to make it more protective like that given the unpredictable nature of the AI suggestions. Thanks, getting merged.
For reference for someone coming across this, this is the status quo UX of this module:
I think before spending too much time on a direction, I'd like to check in on the latest for Field Widget Actions in the AI module. Hoping we can leverage that somehow - it's merged into the 1.2.x branch there.
I also wonder if, while leveraging that, we'll see any design patterns there, e.g. a modal window or something with suggestions which could avoid a full form submission. With a lot of AI Initiative efforts happening there, aligning the UX as much as possible while keeping the suggestion levels and searching of existing content from this module would be great.
It may be hard, however, to avoid a page refresh unless we e.g. require the field to be Tagify like Drupal CMS uses (or at least remove support for autocomplete); otherwise, e.g. an entity reference field that results in 3 references being added when only 1 field is initially available doesn't work, of course.
FYI there are other test failures now from https://www.drupal.org/project/ai/issues/3525303#comment-16266934 ✨ Create Plugin Action for Recipes to check for installed default provider Active it seems, so worth noting when working on this to expect those if the fork gets updated with the latest 1.2.x.
I think this is adding some new test failures on phpunit next minor, e.g. https://git.drupalcode.org/project/ai/-/jobs/6550213
Error 0.723s testVdbServerSetup
Pass 0.728s testVdbServerNotSetup
Failure 10.775s *** Process execution output ***
PHPUnit 11.5.39 by Sebastian Bergmann and contributors.
Runtime: PHP 8.3.25
Configuration: /builds/project/ai/web/core/phpunit.xml.dist
DDDDDDDDDED 11 / 11 (100%)
Time: 00:10.553, Memory: 6.00 MB
Verify Setup Ai (Drupal\Tests\ai\Kernel\Plugin\ConfigAction\VerifySetupAi)
⚠ No provider setup
⚠ Provider setup
⚠ Provider does not exist
⚠ Provider exists
✔ Operation type has default model
✔ Operation type does not have default model
✔ Invalid value type
✔ Invalid array type
✔ Empty settings
✘ Vdb server setup
┐
├ InvalidArgumentException: The VDB provider 'echo_db' does not exist, so this recipe will not work.
│
│ /builds/project/ai/src/Plugin/ConfigAction/VerifySetupAi.php:113
│ /builds/project/ai/tests/src/Kernel/Plugin/ConfigAction/VerifySetupAiTest.php:178
┴
✔ Vdb server not setup
Could re-open this, or someone could make it a separate issue.
Thanks! I gave this a whirl:
- Checked UI and details shown
- Compared with VDB providers like Pinecone which have additional info added; Ping gets duplicated, but I can fix that in Pinecone as it makes sense in the base here
- Removed a collection in Milvus and the auto-recreate works well
scott_euser → made their first commit to this issue’s fork.
Thanks for flagging! Can probably start with (1), given it's a nice-to-have type feature provided by this module, to keep it simple. Thanks!
That sounds like a good plan to have a setup method trigger the cache dir creation. I think we can probably use:
```php
protected function setUp(): void {
  parent::setUp();
  // ...
}
```
and in KernelTestBase, which this class extends, Core already has an initFileCache() that does a similar thing it seems, so we can have ours call initTiktokenCache() to match.
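A minimal sketch of what that could look like (the method name initTiktokenCache() and the cache directory path are assumptions, modelled loosely on Core's initFileCache(), not the final API):

```php
/**
 * Sketch only: ensures the tiktoken cache directory exists before tests run.
 *
 * The method name and directory path here are assumptions.
 */
protected function initTiktokenCache(): void {
  // Hypothetical cache location; adjust to wherever the Tokenizer expects it.
  $cache_dir = $this->siteDirectory . '/files/tiktoken';
  if (!is_dir($cache_dir)) {
    mkdir($cache_dir, 0775, TRUE);
  }
}
```

Called from setUp() after parent::setUp(), this would mirror how Core primes its file cache before any test method runs.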
It's likely that once we have something, we will need to merge and monitor whether the issue recurs, as it'll be hard to know for sure if it solves it; but we can always revert if not.
Thanks for your persistence with it! Will close this one for now and continue in that thread, as that is the issue I saw failures for, e.g. after committing the above into 1.2.x.
It's definitely opinionated unfortunately, so it needs to be configurable.
scott_euser → changed the visibility of the branch 3519927-citations-are-not to hidden.
Okay, got some more automated test coverage in as well to ensure both with & without grouping work okay. Will leave this here for a little bit as it needs review, in case anyone has any concerns.
Thanks @dieterholvoet for the start on this!
We have been working on a giant publication with hundreds of footnotes, and we have also realised that when footnotes appear across multiple fields, not only is the grouping not shared for identical numbers, but identical text appearing in different fields is still not grouped either.
Working on some refactoring to share some of the FootnotesFilter code with FootnotesGroup - moved to a separate branch for now.
Seems to be working for now, but I will be adding test coverage to this as it's a problem that has come back too often.
scott_euser → made their first commit to this issue’s fork.
Refactored it; I think this is the easiest to read, but it has the downside of repeating the title & description. Could move those into variables I suppose, but maybe this keeps it simpler?
Even after the merge, the 1.2.x pipeline failed, then passed on retry. Wonder if it's something to do with the order pipeline jobs get picked up, or parallel test running. Open to help/suggestions!
Aha, so it's taking the Drupal Core version here, yet if the module is updated independently of core, it's potentially an issue. Okay, will merge, thanks!
Is this still an issue actually?
I tried to reproduce on 2.0.x:
- Added `$config['user.role.anonymous']['permissions'][9999] = 'access devel information';` to settings.local.php (confirmed with a die that it kicks in)
- Checked in the UI to see if devel is checked, and it's not (i.e., the override is not affecting the form)
- Checked in the original code here, adding a dvm, and can see that `access devel information` is not added to the array:
```php
foreach ($roles as $role_name => $role) {
  if (in_array(self::ALL_OPTIONS, $filter['roles']) || in_array($role_name, $filter['roles'])) {
    // Retrieve role names for columns.
    $role_names[$role_name] = $role->label();
    // Fetch permissions for the roles.
    $role_permissions[$role_name] = $role->getPermissions();
    $admin_roles[$role_name] = $role->isAdmin();
  }
}
dvm($role_permissions);
```
- I also tried saving afterwards anyway, in case it's an issue in the submit handler only.
- After save, config export does not result in any changes, i.e., devel is not added to anonymous
I think I need a little bit more help understanding how to reproduce or perhaps this is just outdated.
Seems like it could be unrelated, as there is a random test failure from TextChunkerTest, which passes on retry:
---- Drupal\Tests\ai\Kernel\Utility\TextChunkerTest ----
Status Duration Info
--------------------------------------------------------------------------------------------------------
Error 1.883s testTokenCount with data set #0
Pass 1.477s testTokenCount with data set #1
Pass 0.574s testTokenCount with data set #2
Pass 0.528s testTokenCount with data set #3
Pass 0.569s testTokenCount with data set #4
Pass 0.495s testTokenCount with data set #5
Pass 0.572s testTokenCount with data set #6
Pass 0.490s testTokenCount with data set #7
Pass 0.547s testTokenCount with data set #8
Pass 0.481s testTokenCount with data set #9
Pass 0.556s testTokenCount with data set #10
Pass 0.554s testTokenCount with data set #11
Log 11.252s *** Process execution output ***
PHPUnit 11.5.39 by Sebastian Bergmann and contributors.
Runtime: PHP 8.3.25
Configuration: /builds/project/ai/web/core/phpunit.xml.dist
E........... 12 / 12 (100%)
Time: 00:11.024, Memory: 8.00 MB
Text Chunker (Drupal\Tests\ai\Kernel\Utility\TextChunker)
✘ Token count with 0
┐
├ Yethee\Tiktoken\Exception\ParseError: Could not decode token "IHdpbm5p=" at line 11231
│
│ /builds/project/ai/vendor/yethee/tiktoken/src/Vocab/Vocab.php:93
│ /builds/project/ai/vendor/yethee/tiktoken/src/Vocab/Loader/DefaultVocabLoader.php:89
│ /builds/project/ai/vendor/yethee/tiktoken/src/EncoderProvider.php:177
│ /builds/project/ai/vendor/yethee/tiktoken/src/EncoderProvider.php:138
│ /builds/project/ai/src/Utility/Tokenizer.php:51
│ /builds/project/ai/src/Utility/TextChunker.php:25
│ /builds/project/ai/tests/src/Kernel/Utility/TextChunkerTest.php:52
┴
✔ Token count with 1
✔ Token count with 2
✔ Token count with 3
✔ Token count with 4
✔ Token count with 5
✔ Token count with 6
✔ Token count with 7
✔ Token count with 8
✔ Token count with 9
✔ Token count with 10
✔ Token count with 11
For now this separates out the indexing so a failure to reach 100% indexing in setup becomes visible, as otherwise the pipeline does not show what the failure is. Will merge it since it passes, but I suspect it's not the end of it if we are still getting random test failures from tiktoken like the above.
scott_euser → changed the visibility of the branch 3544206-ai-search-next-minor-attempt2 to hidden.
scott_euser → changed the visibility of the branch 3544206-ai-search-tests to hidden.
Hmmm, strange: tests sometimes pass and sometimes fail, as can be seen here https://git.drupalcode.org/project/ai/-/merge_requests/867/pipelines
Won't revert as this commit at least fixes part of the issue.
Needs further digging unfortunately
Thanks for looking into this and reviewing!
So what would be the proposal with getSetupData? That Gemini and other providers should suggest a default Tokenizer from Tiktoken PHP's list? https://github.com/yethee/tiktoken-php/blob/master/src/EncoderProvider.php
Personally I don't know enough about the other providers to know if there are noticeable differences in their counting compared to gpt-3.5 token counting.
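For reference, a minimal sketch of looking up an encoder via tiktoken-php's documented EncoderProvider (the model name here is just an example; which default each provider should suggest is exactly the open question above):

```php
use Yethee\Tiktoken\EncoderProvider;

$provider = new EncoderProvider();
// Resolve the encoder registered for a given model name.
$encoder = $provider->getForModel('gpt-3.5-turbo');
$tokens = $encoder->encode('Hello world');
// count($tokens) is the token count used for chunk sizing.
```

So getSetupData would presumably only need to return a model/encoding name that appears in that list for the counting to work.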
scott_euser → changed the visibility of the branch 3491446-solr-boost-of to hidden.
Probably needs an update hook as well to set the default empty value, but will wait for feedback first as to whether this direction is acceptable.
Couldn't see a good option in Core for word splitting; opted to borrow the one used in the Search API Database backend, and am therefore suggesting to move it into a Utility class as a result, but feel free to disagree of course.
That sounds very sensible, yeah - and at least to me it makes sense to have a very minimal page, e.g. like the Cloudflare managed challenge is.
Would you imagine a plugin system where we contribute plugins to the various captcha modules?
(And of course also contribute to them sending signals as you mentioned)
No reply for 5 months with request for more info, so closing. Feel free to reopen, thanks!
No reply for a year with request for more info, so closing. Feel free to reopen, thanks!
Hi! If you could please make this a merge request so test coverage runs, thanks!
Hi! If you could please make this a merge request so test coverage runs, thanks!
Hi! Thanks for this module, read about it in this blog https://www.drupal.org/blog/beyond-patching-drupal-association-and-crowd... →
Would captcha instead of a block be something for the roadmap here as well? It is mentioned on the module homepage as a potential future feature.
Thanks!
Thanks for flagging! Should work now on 10.x as well (I have only been using it on 11.x myself, sorry).
scott_euser → made their first commit to this issue’s fork.
Hi! As an FYI, the 11.3.x issue for AI Search is being addressed here 🐛 AI Search tests is failing on next minor Active
Thanks very much for the continued persistence with this. It's a really complicated chunk of work, hence the challenge to find big chunks of time to properly review it in my free time; but at least holiday season is over for me now, back into school term time!
Indexing progress bar
Okay, I had a bit of a dig into this to see what it would take to get the progress bar to show overall progress, as I still find it so confusing as a user when I run indexing like this. I wanted to understand if it's something we can easily do later in a follow-up, but I think it's not easy to do later, unfortunately. I think this is what would need to happen:
- The AI Search Tracker should do the chunking, rather than chunking happening on demand. I think this would be fine, as the chunking itself is not a heavy load since it makes no API calls.
- The tracker can then select the chunks that are not yet processed (we already have ai_search_chunks.status in the database table) and use that for all tracker methods, which include counts of the number of items to be processed
A single batch rather than sub-batches can then be used, and the indexing status overview (the index_progress variable as added in template_preprocess_search_api_index) simply needs to clarify what the count is. E.g. something like:
Before:
65 of 980 indexed
After:
65 of 980 chunks indexed (This is 12 of 210 items which have been broken into 980 chunks in the vector database)
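Under that approach, the tracker's count methods could be backed by something like the following (a sketch only: it assumes the existing ai_search_chunks table mentioned above, and that status is a flag marking processed chunks - the column semantics are an assumption):

```php
// Sketch: chunk-level counts for the progress bar, via Drupal's Database API.
$connection = \Drupal::database();
// Total chunks tracked for the index.
$total = $connection->select('ai_search_chunks', 'c')
  ->countQuery()->execute()->fetchField();
// Chunks not yet sent to the vector database (status flag assumed).
$remaining = $connection->select('ai_search_chunks', 'c')
  ->condition('status', 0)
  ->countQuery()->execute()->fetchField();
$indexed = $total - $remaining;
```

The item-level counts ("12 of 210 items") would come from the existing Search API tracker as now, so both numbers can be shown side by side.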
Other notes
- We still have more outstanding unresolved comments, some indicating breaking changes we need to work out how to deal with
- Given the complexity of things here, particularly the indexing/chunking, this would be hard to maintain without test coverage I think, but open to opinions
Thanks, that debugging was very helpful!
Not solved yet, but I can see as a result of `git bisect` that 📌 Deprecate $variables['page'] for node.html.twig Active is the issue. Namely, the change there means node.html.twig now checks if the view mode is full, and if so, does not output the URL.
So we are expecting to render the 'full' view mode and therefore the URL should not be expected after that commit.
We can update the tests to check that the titles of the found results are as expected, rather than title and link: e.g. either [Chocolate Cake]( (to keep it working in 11.2) or # Chocolate Cake (i.e., without the URL, for 11.3).
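A hypothetical version of that looser assertion (the variable name is a placeholder, not the actual test code):

```php
// Match on the title only, so the presence or absence of the URL
// between 11.2 and 11.3 does not matter.
$this->assertStringContainsString('Chocolate Cake', $response_text);
// Rather than the stricter link form, e.g.:
// $this->assertStringContainsString('[Chocolate Cake](', $response_text);
```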
Sounds like a good idea!
Re #8: whoops, sorry - retargeted the branch to 11.x
Re #9: do you mean what to do if your database doesn't support JSON? You need to make sure your database is one of the supported ones at https://www.drupal.org/docs/getting-started/system-requirements/database... →
Leaving as needs work per refactor note in #8
Hi! Can you try debugging in the Hooks folder at the ::handleFloodProtection() method? That's where all the logic that affects the View filters lives.
Thanks! I couldn't test at the moment of course, so added a couple of comments that might just be questions for you to confirm the behaviour will be as expected. Thanks!
Sorry, not very around at the moment, but maybe the first step is to enable the AI API Explorer to make sure you're getting results in the vector database test section there.
It's not necessary for the indexed fields to match at all; essentially the boost:
1) programmatically runs the free text search against the vector database to fetch entity IDs only
2) then does something like this pseudo-query against the database:
WHERE
  (keywords LIKE '%normal database search%' OR item_id IN (IDs retrieved from vector database))
  AND (other filters & exposed filters)
ORDER BY
  vector database retrieved item IDs DESC,
  relevance from database search DESC,
  any other orders you have applied DESC
You can e.g. add a dvm in the boost plugin code to check that item IDs are found in (1); that should be exactly what the AI API Explorer submodule shows, to make sure that part is working okay.
Hope that helps keep you going on the debugging.
Hi, thanks for the contribution. I believe @seanb was keen to avoid adding many more plugins here to the core functionality and to possibly use viewsreference_extras instead, but will check; since we have 'Hide header', this would match and may be sensible.
However, per 🐛 EntityUntranslatableFieldsConstraint can cause translation issues when content is created with an earlier version of viewsreference Needs work , adding plugins may need an update hook as well, as it has caused a comparison regression in multilingual it seems (though I have not looked very deeply into it yet).
scott_euser → made their first commit to this issue’s fork.
Will set to Needs Work per your comment in #3 in any case, but my guess is an update hook is probably the way to go instead.
Hi! Thanks for the contribution. Looks like it causes test failures for when JS is disabled. Test coverage should also be extended to avoid a future regression on this. Thanks!
scott_euser → made their first commit to this issue’s fork.
Thanks for the contribution! This sounds like it might be better as an update hook instead? Though I have not looked deeply at the cause.
For now I rebased latest into your MR, since tests pass on latest at the moment.
scott_euser → made their first commit to this issue’s fork.
scott_euser → created an issue.
Thanks! In the end I modified what you wrote to fit in the alternative/complementary approaches section of the module homepage: https://www.drupal.org/project/protect_views_flood_control#module-project--alternatives →
Thank you
I raised 💬 Provide more help understanding what triggers "Non-translatable fields can only be changed when updating the current revision" Active in core (needs consideration on best approach) to try to find a way to provide more information for developers to be able to understand the cause of the problem in their specific case, please do consider providing suggestions there on how to make this particular symptom more friendly to debug. Thanks!
scott_euser → created an issue.
I think the problem here is there are many ways to get to the same symptom so many people get to this issue via Google or other means.
Layout Builder asymmetric translations, for example, doesn't have this module as a dependency as far as I can see.
Ultimately we need to try to solve the problem described in the issue summary; if there are other problems, ideally those get raised as separate issues (potentially in other modules, or core, if unrelated to this), or it will be extremely difficult for this to ever get merged (and even better, ideally with failing tests to reproduce the issue).
For honeypot, I wonder if it's an accessibility issue having an extra hidden form field, if it's not then hidden from screen readers (in order to not be hidden from AI bots).
That's an interesting suggestion with the honeypot!
In https://capellic.com/blog/ai-bot-abuse-our-evolving-strategy-taming-perf... , which @smustgrave just posted, he goes into detail on what worked for them.
Despite my plea in #55 being ignored, the comments seem to indicate this is working well, and given we have test coverage for it as well, I'm hoping it's okay to move to RTBC even though I did some of the work here.
In any case, it's a sort of fallback for things that get past Cloudflare.