Account created on 22 July 2015, over 10 years ago

Recent comments

🇬🇧United Kingdom scott_euser

Haven't checked (I'll have a better look later), but if you have the Search API index, I think in 2.0.x you can get the Search API backend, which then gives you the VDB provider, maybe?

🇬🇧United Kingdom scott_euser

Yeah, those features for grouping don't exist in the experimental 1.x branches, only 2.0.x. I'm back at my laptop next week, so maybe just wait - but I definitely get unique results up to the quantity requested in our usage of it, and that usage definitely contains multiple highly relevant chunks from the same entity.

To your other comment: in general I don't see why Search API can't run a single query. But again, maybe wait until I'm back and can provide more details.

🇬🇧United Kingdom scott_euser

Hi! I could be wrong, but with Search API I believe in general (beyond AI Search) you would add language on the Search API Fields page. In AI Search you'd select it as 'Filterable Metadata'.

Let me know if that helps

🇬🇧United Kingdom scott_euser

scott_euser changed the visibility of the branch 3554899-language-metadata-is to hidden.

🇬🇧United Kingdom scott_euser

Before I recheck, are you using 2.0.x-dev for both?

But overall, shouldn't that all just be part of the VDB Provider, so each provider has its query optimised for its particular environment rather than bespoke querying? So when you use the Postgres VDB Provider you get your Postgres query, and when someone else uses Milvus or Solr or MariaDB or whatever, they get their best scenario. To me it still doesn't justify not working together to make the overall ecosystem better. Especially right now while it's all marked experimental, we can change things in 2.0.x for example and roll those changes out to the VDB Providers.

As to which VDB Provider is chosen, that should be up to the site builder of course. Supporting a wide range of managed hosts - some very restrictive about what can be installed - helps make sure we can cover the range of people on Pantheon, Acquia, Platform.sh, Cloudways, etc.

🇬🇧United Kingdom scott_euser

scott_euser created an issue.

🇬🇧United Kingdom scott_euser

Thanks for the feedback here (even if it's via your module homepage)! Personally I'd still prefer to work together... looking at your code, you're tying it directly to Postgres? Or is that just short term?

It is leveraging LLMs to do a vector search and that may incur significant costs if a site has a lot of content and plenty of users viewing the content. AI More Like This module uses only Vector DB to get a list of similar nodes.

This is a bit misleading, as it does not support getting the original vectors by default - but it's a good point: generating new embeddings as a fallback should be opt-in, not required. Will sort making that opt-in in 'Make on-demand generating of embedding an opt-in option' (Active).

With no RAG distance threshold setting, for a genuinely unique content AI Related Content may produce a list of completely unrelated nodes.

This is not true: you can set the threshold on the index via a processor (at least that's how I've done it), but yes, it's a good idea to let people configure it. Will create an issue + provide clearer instructions for the status quo on the module homepage.

For an Index with Enriched Embedding Strategy (default) AI Related Content looks for the N chunks closest to the current node vector, and if all N chunks belong to the same node the other nodes get ignored, and after grouping you may end up with only one related node. Using appropriate SQL query the AI More Like This module finds N distinct nodes that have chunks closest to the current node.

This is not true: A) the current node is excluded (if that was part of what this implies), and B) SearchApiAiSearchBackend::doSearch by default groups by entity when the provider supports grouping (like Milvus does); otherwise it does a level of recursion to get the specified number of results for providers that don't support grouping. But for the latter case there are also instructions: the site builder may instead want to use an index with a single embedding per entity, like the average pooling embedding strategy 'seogow' wrote.
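To sketch that recursion fallback for providers without grouping support - a language-agnostic illustration only, where `fetch_page` and the tuple shape are hypothetical stand-ins, not the module's API:

```python
def distinct_entities(fetch_page, n, page_size=10, max_pages=5):
    """Collect chunk hits until n distinct entities are found.

    fetch_page(offset, limit) returns chunk hits as (entity_id, distance)
    tuples ordered by ascending distance. When the vector store cannot
    group by entity, over-fetching pages and deduplicating client-side
    approximates server-side grouping.
    """
    seen = {}
    for page in range(max_pages):
        hits = fetch_page(page * page_size, page_size)
        if not hits:
            break
        for entity_id, distance in hits:
            # Keep only the best (closest) chunk per entity.
            if entity_id not in seen:
                seen[entity_id] = distance
        if len(seen) >= n:
            break
    # Return the n closest distinct entities.
    return sorted(seen.items(), key=lambda kv: kv[1])[:n]
```

So even if the top hits are all chunks of the same entity, the search keeps paging until it has the requested number of distinct entities (or runs out of results).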

🇬🇧United Kingdom scott_euser

Oh that's great!

🇬🇧United Kingdom scott_euser

Thanks, not sure what the normal process is: needs work again vs a separate issue.

🇬🇧United Kingdom scott_euser

Does that rely on the name changing though? E.g. I believe Gin is changing its name to get into core... Otherwise two modules of the same name can't be installed at the same time, right?

https://www.drupal.org/project/gin/issues/3552829 📌 Rename Gin Active - opposite direction here but I guess it still applies?

🇬🇧United Kingdom scott_euser

scott_euser created an issue.

🇬🇧United Kingdom scott_euser

Will retest myself probably next week

🇬🇧United Kingdom scott_euser

scott_euser created an issue.

🇬🇧United Kingdom scott_euser

As an FYI for anyone stumbling across this:

  • The Footnotes module is very different now from what is described here. It works fine across multiple fields (like Paragraphs), has its own rich text editing so it can support links, etc., via a modal window
  • The Bibcite module is an option to consider for more complex management of references. I believe there is an option to integrate the two
🇬🇧United Kingdom scott_euser

Hoping someone following this issue or stumbling across it can indicate what's the preferred direction for #159

🇬🇧United Kingdom scott_euser

Ah nice, thanks for the info!

🇬🇧United Kingdom scott_euser

scott_euser created an issue.

🇬🇧United Kingdom scott_euser

I'm pretty sure most of us who use it more seriously customise the chunking strategy to the site's needs (it's a plugin system) and optimise what gets indexed, but it's hard to make a one-size-fits-all, and using an LLM to chunk would of course incur a far higher cost as a default setup - hence things like overlap and context as default blunt instruments. Happy to take contributions of course, e.g. if you want to contribute more embedding strategy plugins that you think are not too project-specific and would be sensible for a wider range of sites.
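The "blunt instrument" of fixed-size chunks with overlap amounts to roughly this - an illustration only, not the module's actual chunker, and word-based for simplicity where the real strategies work on tokens:

```python
def chunk_with_overlap(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks where each chunk repeats the
    last `overlap` words of the previous one, so a sentence cut at a
    chunk boundary still carries some surrounding meaning into the
    embedding."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap trades some extra storage and embedding cost for better recall at chunk boundaries, which is why it works as a default even though a bespoke strategy usually does better.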

In terms of implementing the event + update hook to maintain the status quo for existing sites, are you able to contribute that? We are moving AI Search into a sub-module so we'll have a bit more control of the timeline; happy to delay the 2.0.x release for it if you think it's doable in the short to medium term.

🇬🇧United Kingdom scott_euser

Thanks for the work on this!

Sounds like the Views Ajax History module probably needs a follow-up issue to mark itself incompatible with the next release of core, or perhaps check core versions; otherwise sites might get two history changes for each Views AJAX request. Have I understood this right? Happy to contribute such a fix there if so.

🇬🇧United Kingdom scott_euser

BTW do you want that to be a separate issue or replace this one?

🇬🇧United Kingdom scott_euser

Thanks for the work on this! Added some feedback; in general it would be good to align a bit better with how core does it, for UX consistency, plus a couple of minor bug/edge cases.

🇬🇧United Kingdom scott_euser

scott_euser changed the visibility of the branch 2.0.x to hidden.

🇬🇧United Kingdom scott_euser

@wouters_f suggested this could also be a useful resource for future travellers who are willing to help on this issue https://github.com/woutersf/drupal-ai-workshop

🇬🇧United Kingdom scott_euser

I see what you mean - moved it back sorry!

So essentially the base class, if not fully overridden, is (or was) making assumptions about methods existing that are not part of the interfaces and therefore shouldn't be assumed to exist.

Will check if this problem still exists in 2.x. Should be an easy one to port over there too if so.

🇬🇧United Kingdom scott_euser

Yeah, that was essentially the quicker way to make sure people don't miss having it in context. Chunking just doesn't work without context; bodies of text don't get vectorised properly if you have no context.

The original code had it hardcoded, before I contributed to Search API to redo the fields system UI and then extended that to give us context and metadata control.

The checkbox to opt out was there to avoid having to tackle the update hook complexity yet; I considered it too high risk that people wouldn't read the docs and manually add title (at the very least) as context.

If you or someone else is willing to contribute a fix with an upgrade path so that what used to be hardcoded is maintained, I'm happy to receive that - but I already spend a ton of volunteer time on this (and other projects), so sometimes compromises need to be made when there is insufficient support to help out. It was a 'let's not let perfect be the enemy of good' situation - very willing to review an MR if you feel strongly about fixing the workaround.

But we cannot just turn off titles in context for all existing sites that have used AI Search since its early days.

And new indexes must have it turned on by default - there are already too many ways for people to set things up poorly and get poor results. Maybe some sort of hook on index creation that auto-configures it...
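The hardcoded behaviour under discussion - prepending the title as contextual content to every chunk before embedding - comes down to something like this (a simplified sketch, not the module's code; the function name and signature are hypothetical):

```python
def add_context(chunks, title, context_fields=None):
    """Prepend the entity title (and any other contextual metadata,
    such as a short description) to every chunk, so each embedding
    retains the overall semantic meaning of the source entity rather
    than just an isolated body fragment."""
    context = [title] + list(context_fields or [])
    prefix = "\n".join(context)
    return [f"{prefix}\n\n{chunk}" for chunk in chunks]
```

A chunk like "body part one" from an article titled "My article" then gets embedded as "My article\n\nbody part one", which is why turning titles off silently would degrade retrieval for existing sites.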

🇬🇧United Kingdom scott_euser

Added to merge train

🇬🇧United Kingdom scott_euser

scott_euser created an issue.

🇬🇧United Kingdom scott_euser

scott_euser created an issue.

🇬🇧United Kingdom scott_euser

scott_euser created an issue.

🇬🇧United Kingdom scott_euser

Thank you! Tested this out with a VDB Provider and after changing base class to the new search api one all worked fine, merging

🇬🇧United Kingdom scott_euser

This greatly depends on the VDB Provider, as they don't follow the same pattern as each other. I don't think we can sort it in the base class, other than potentially deciding to remove defaults from the base class to force VDB Providers to decide. The base class was originally written for Milvus, which is why its naming leans towards Milvus.

But at a glance, the VDB Providers are filtering by their equivalents of an index:

🇬🇧United Kingdom scott_euser

Thanks for raising! But I couldn't reproduce it (or at least not on 2.0.x). I tried setting it to entity_title and added it as contextual content:

  1. With getFieldIdentifier: added as contextual content when I check the "Preview content to be vectorized" on the add index fields screen
  2. With your change: no contextual content for title appears

Thinking it was probably sorted as part of 🐛 'Title is duplicated in chunk if present in the index as contextual content' (Needs review), given that is in 2.0.x and you've raised this against 1.2.1.

🇬🇧United Kingdom scott_euser

Looking at where Symfony is really we could go straight to using Symfony AI packages perhaps...

  1. Overview: https://github.com/symfony/ai
  2. Vector Store bridges: https://github.com/symfony/ai/blob/main/src/store/src/Bridge/Milvus/Stor...
  3. AI Provider operations like embeddings: https://github.com/symfony/ai/blob/main/src/platform/src/Bridge/OpenAi/E...

Really, they've got so far in the last few months that, if we are going to go this direction, we should have a bigger discussion about whether to require symfony/ai in the AI module in general, or at least in AI Search for VectorDocument, VectorStore and Vector.

🇬🇧United Kingdom scott_euser

Here as well: I know it's not solely AI Search, but with too many modules in AI I'm not seeing it unless it's in the AI Search component issue queue, so I hope it's okay that I move it there.

In terms of implementation, if you have already considered how/where, I'm happy to work on it as well.

🇬🇧United Kingdom scott_euser

I know it's not solely AI Search, but with too many modules in AI I'm not seeing it unless it's in the AI Search component issue queue, so I hope it's okay that I move it there.

🇬🇧United Kingdom scott_euser

I suppose this also needs implementation within the AI Search back-end and/or VDB Providers (haven't looked closely, so apologies if I misunderstood).

🇬🇧United Kingdom scott_euser

With AI Search moving out, I guess this is still appropriate, as the docs will still largely be a unified area?

🇬🇧United Kingdom scott_euser

Hmmm, in 📌 'Deprecate AiVdbProviderInterface::getVdbIds()' (Active) we moved this Milvus-specific code out to the Milvus module. But I don't quite understand the bug; deleteFromCollection just calls getClient()->deleteFromCollection() anyway, right?

  /**
   * {@inheritdoc}
   */
  public function deleteFromCollection(
    string $collection_name,
    array $ids,
    string $database = 'default',
  ): void {
    $this->getClient()->deleteFromCollection($collection_name, $ids, $database);
  }
🇬🇧United Kingdom scott_euser

scott_euser created an issue.

🇬🇧United Kingdom scott_euser

scott_euser created an issue.

🇬🇧United Kingdom scott_euser

With many issues merged (🌱 '[Meta] Bring AI Search, Assistants and Chatbot out of Experimental' (Active)), on its own this doesn't work any more. Suggesting we combine this more broadly so we hit all the changes/deprecations in one go.

🇬🇧United Kingdom scott_euser

Moving to AI Search module given there is no progress on this yet

🇬🇧United Kingdom scott_euser

scott_euser changed the visibility of the branch 2.0.x to hidden.

🇬🇧United Kingdom scott_euser

scott_euser changed the visibility of the branch 3487487-improve-ai-search-table to hidden.

🇬🇧United Kingdom scott_euser

Thank you!

🇬🇧United Kingdom scott_euser

FWIW I don't think anything stops this from happening at the moment; it's just that when an item is returned from an external system, the developer needs to decide how to render it if it's not a Drupal entity.

🇬🇧United Kingdom scott_euser

Changing from deprecate to remove, given we are still experimental here. Will add to the change log + update the VDB Providers I manage.

🇬🇧United Kingdom scott_euser

Thank you!

🇬🇧United Kingdom scott_euser

scott_euser changed the visibility of the branch 3550910-highlight-search-api to hidden.

🇬🇧United Kingdom scott_euser

scott_euser changed the visibility of the branch 3550910- to active.

🇬🇧United Kingdom scott_euser

scott_euser changed the visibility of the branch 3550910- to hidden.

🇬🇧United Kingdom scott_euser

scott_euser made their first commit to this issue’s fork.

🇬🇧United Kingdom scott_euser

Retested this, and did the updates within Pinecone and Milvus to go through it. Sorted merge conflicts particularly with 📌 Remove KeyRepositoryInterface from \Drupal\ai\Base\AiVdbProviderClientBase Active

Merge train is running

🇬🇧United Kingdom scott_euser

Hi @askibinski,

What's the state of this now - are you still after getting it in? If so, can you provide steps to test, please?

We'll be moving AI Search out of the AI module into its own namespace, so we are keen to sort out the issue queue where possible to avoid extra work moving merge requests over (though it could be we need to do that for this one).

🇬🇧United Kingdom scott_euser

Okay no longer postponed since 📌 Improve AI Search Module Indexing to Handle Long-Running Chunk Embedding Processes Needs work has been refactored with a different approach.

🇬🇧United Kingdom scott_euser

Thanks for confirming. Given there is no work on this yet, moving to the AI Search module so any work eventually goes into the separate module rather than the sub-module.

Personally, given competing priorities and the amount of work this would take to actually change everywhere with coordinated releases, it's not something I am likely to spend my personal time on unfortunately - but if someone feels strongly about pushing this forward, I won't stand in the way of course.

Also marking as minor per docs "Minor priority is most often used for cosmetic issues that do not inhibit the functionality or main purpose of the project." which I think reflects this.

🇬🇧United Kingdom scott_euser

Given this is a fairly small change at the moment, and really the bigger decision here is the path to take (I don't think the current path of forcing it for all existing sites is correct), moving this over to the AI Search module so any work eventually goes into the separate module rather than the sub-module.

Changing to feature request because it's not really a bug.

🇬🇧United Kingdom scott_euser

Marking as postponed to get more info to be able to decide on a solution.

Given there is no merge request for this yet, moving to the AI Search module so any work eventually goes into the separate module rather than the sub-module.

🇬🇧United Kingdom scott_euser

Just assigning to myself to deal with at first

🇬🇧United Kingdom scott_euser

It would be good to have a more complete stack trace so we can see in which case AI Search (or another AI sub-module) is calling the getOriginalObject method. Going to preemptively move this to 'maintainer needs more info' since it's quite old.

🇬🇧United Kingdom scott_euser

Yep, agreed. Given there is no work on this yet, moving to the AI Search module so any work eventually goes into the separate module rather than the sub-module.

Additionally setting to low priority, since it's just about helping developers not set things up wrong; nothing is really broken.

🇬🇧United Kingdom scott_euser

Given there is no work on this yet, moving to AI Search module so any work is eventually done into separate module rather than sub-module

🇬🇧United Kingdom scott_euser

Resolved conflicts and updated EchoDb VDB Provider to match

🇬🇧United Kingdom scott_euser

I think tests will fail at least until Search API's 'Make IndexBatchHelper a service so it can be overridden' (Needs review) is merged (+ probably some minor phpstan/phpcs things after that), but otherwise this is working.

Progress, for example, looks like this at 2 items per batch when entities 1 and 2 exceed the maximum: the progress bar slows down a bit for them as it takes multiple batch runs before they are completed:

> [notice] Processed 12.5% of items on Test milvus index (1 / 8 items).
> [notice] Processed 19.4% of items on Test milvus index (1 / 8 items).
> [notice] Processed 26.6% of items on Test milvus index (2 / 8 items).
> [notice] Processed 34.4% of items on Test milvus index (2 / 8 items).
> [notice] Processed 50.0% of items on Test milvus index (4 / 8 items).
> [notice] Processed 75.0% of items on Test milvus index (6 / 8 items).
> [notice] Processed 100.0% of items on Test milvus index (8 / 8 items).
> [notice] Message: Successfully indexed 8 items.

🇬🇧United Kingdom scott_euser

Thank you! This is perfect; I am able to override it fine and have AI Search handle the chunking now.

I did the rename to IndexingBatchHelper - makes sense to me.

🇬🇧United Kingdom scott_euser

Thanks! I added option 2 from comment #6 - does it still look okay to you?

🇬🇧United Kingdom scott_euser

Thanks! Will take a look asap, much appreciated!

Ultimately, the reasoning for the change is that e.g. a single entity can take multiple passes of the batch to complete when generating embeddings, as that can be very long running.

So indexing 5 items might take 8 passes in the batch if 1 item needs 3 extra passes before LLM has generated all embeddings.

The first step, which we do programmatically, is to chunk the content using an embedding strategy (usually a combination of breaking the text into chunks and re-adding things like the title and short description as contextual content, so the chunk doesn't lose its overall semantic meaning once converted to an embedding, i.e. a set of vectors). This is quick, so we quickly know that e.g. a giant article has 45 chunks, and we can estimate that roughly 10 chunks can be processed per batch.

So it means we need to revisit the progress and slow it down for that long running item, potentially something like this:

  • item 1 complete, less than 10 chunks: 20%
  • item 2 start, 45 chunks, first 10 done: 22.5%
  • item 2 continued, 45 chunks, 20 done: 25%
  • item 2 continued, 45 chunks, 30 done: 27.5%
  • etc...

Whereas currently the index helper's progress start/finish is determined and fixed at the start of the batch setup. So it's that method I'd want to extend to manipulate the progress specifically for AI Search.

We had originally tried a sub-batch, but the UI is very confusing, as the progress bar then keeps jumping from 0 to 100% and it's really hard as a user to understand how close to the overall total we actually are.
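The pacing described above - whole items completed, plus fractional chunk progress for the item currently in flight - could be computed along these lines (a sketch only; the function and argument names are hypothetical, not the batch helper's API):

```python
def batch_progress(items_done, items_total, chunks_done=0, chunks_total=0):
    """Overall indexing progress as a percentage: completed items plus
    the fractional chunk progress of the item currently being embedded.
    A long-running item then advances the bar smoothly instead of
    making it stall or jump from 0 to 100%."""
    if items_total == 0:
        return 100.0
    fraction = items_done / items_total
    if chunks_total:
        # The in-flight item contributes its chunk progress, scaled to
        # that single item's share of the total.
        fraction += (chunks_done / chunks_total) / items_total
    return round(fraction * 100, 1)
```

With 5 items where item 2 is a 45-chunk giant, the bar sits at 20% after item 1 and then creeps forward a few points per pass while item 2's chunks are embedded, which matches the slowed-down notices in the log above.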

🇬🇧United Kingdom scott_euser

Work in progress: screencast gif of work in progress opening modal and suggesting tags:

🇬🇧United Kingdom scott_euser

Thanks, I retested this manually as well and it works well!

🇬🇧United Kingdom scott_euser

scott_euser created an issue.

🇬🇧United Kingdom scott_euser

scott_euser made their first commit to this issue’s fork.

🇬🇧United Kingdom scott_euser

scott_euser created an issue.

🇬🇧United Kingdom scott_euser

scott_euser created an issue.

🇬🇧United Kingdom scott_euser

PHPStan will of course fail since the method doesn't exist yet; this relies on that being merged + targets 2.0.x.

🇬🇧United Kingdom scott_euser

Hmmm, I'm less convinced by it. While I think it's definitely useful, it's a very specific/opinionated instruction without an explanation of the why, to help the user understand their choices (and the why is actually explained in quite some detail in the configuration UI itself). So I'm suggesting we shift the majority of this text to a separate md file and make it clearer that it's an example setup, and instead have the main AI Search page give only the high level, like it does for other sections. It's also very specific about the chatbot/assistant, whereas AI Search is not that; it's more that it can be used by one (and that's just one way to use it).

Sorry, I know this is annoying after all the effort that has gone into it, but hopefully it's not too painful to shift over.

Thanks for all the work on it!

🇬🇧United Kingdom scott_euser

Just setting back to NW because tests are failing

🇬🇧United Kingdom scott_euser

Downside with doing it at this level is that indexed content also gets hash-cached... initial thoughts:

  1. Simplest: maybe index rebuild should show a message recommending the cache clear call, so the cache stays smaller
  2. Or the EmbeddingInput class could get a shouldCache() method with a new default arg of true in the constructor to set it; the AI Search back-end can then set it to false when indexing
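Option 2 might look roughly like this - a Python sketch of the idea only; the real EmbeddingInput is a PHP class in the AI module, and this signature is hypothetical:

```python
class EmbeddingInput:
    """Input to an embedding call with an opt-out for caching, so
    one-off indexing runs don't bloat the embedding cache while
    interactive queries still benefit from it."""

    def __init__(self, text, should_cache=True):
        self.text = text
        self._should_cache = should_cache

    def should_cache(self):
        """Whether the resulting embedding may be stored in cache."""
        return self._should_cache
```

The back-end doing the indexing would construct its inputs with `should_cache=False`, while search-time query embeddings keep the default and stay cached.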
🇬🇧United Kingdom scott_euser

Possibly - we could add the dimensions into the hash, but I believe dimensions have a 1:1 relationship with the model, right? Does anyone know of any models that give you the option of multiple dimensions?
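For illustration, a cache key that folds the dimensions in might look like this - a sketch, not the module's implementation. (Models with configurable output sizes do exist: OpenAI's text-embedding-3 models, for example, accept a `dimensions` parameter, so dimensions are not always 1:1 with the model.)

```python
import hashlib


def embedding_cache_key(text, model, dimensions=None):
    """Build a cache key for an embedding request. Including the model
    and, when the model supports configurable output sizes, the
    dimensions means a reconfigured index never reads stale vectors of
    the wrong length out of cache."""
    # Join with a separator that cannot appear in model names so
    # ("ab", "c") and ("a", "bc") never collide.
    raw = "\x1f".join([model, str(dimensions or ""), text]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()
```

The same text embedded by the same model at 256 vs 1536 dimensions then hashes to different keys, while repeated identical requests still hit the cache.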
