Improve the AI Search recursive retrieval of a specific quantity of results

Created on 24 May 2025

Problem/Motivation

At the moment, if you follow the code in Drupal\ai_search\Plugin\search_api\backend\SearchApiAiSearchBackend::$maxAccessRetries, we re-attempt the search up to 10 times to reach a specific count (limit) of results.

For some scenarios, like AI Related Content →, where large nodes have been broken into many smaller chunks, even this iteration may not be sufficient, especially if no access filter is applied to the query itself and the subsequent access checks exclude many nodes (e.g. a site driven largely by member-only content).
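
Roughly, the behaviour being described is a retry loop along these lines (a simplified, illustrative sketch, not the module's actual code):

  // Simplified sketch (not the actual implementation): each pass over-fetches
  // a batch of chunks, drops those the current user cannot access, and
  // recurses until the requested limit is reached or the retry cap is hit.
  protected function doSearchSketch(array $params, int $limit, array &$results, int $iteration = 0): array {
    foreach ($this->getClient()->vectorSearch(...$params) as $match) {
      // Chunks whose parent entity fails the access check are discarded,
      // which is why a single pass can come up short of $limit.
      if (!$this->checkEntityAccess($match['drupal_entity_id'])) {
        continue;
      }
      $results[] = $match;
      if (count($results) >= $limit) {
        return $results;
      }
    }
    // Give up after $maxAccessRetries passes, even if still short of $limit.
    if ($iteration >= $this->maxAccessRetries) {
      return $results;
    }
    // Move to the next batch and try again.
    $params['offset'] += $params['limit'];
    return $this->doSearchSketch($params, $limit, $results, $iteration + 1);
  }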

Steps to reproduce

  1. Have a site with many access-controlled nodes
  2. Have large content lengths with a small chunk size
  3. Attempt to search and retrieve a specific number of results

Proposed resolution

Improve the iteration by allowing Vector Database (VDB) providers to declare that they are either:

  1. A vector database that supports grouping or aggregation of some form, like https://milvus.io/docs/grouping-search.md, in which case we can group by drupal_long_id. This seems to be just Milvus at the moment (big win for Milvus!).
  2. A vector database that supports filtering by a NOT IN array of already-found drupal_long_id values. Most (if not all) VDB providers should be able to support this.

So I think this probably means some more changes to the VDB Provider interfaces. For (1) it's a pre-query change; for (2) it's a post-query condition set by the VDB provider on the recursive ::doSearch() call.
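
As a starting point for discussion, a rough sketch of what those capability flags might look like on the VDB provider side (method names here are assumptions for illustration, not a final API):

  /**
   * Illustrative sketch of possible VDB provider capability methods.
   *
   * Method names are placeholders for discussion, not the final API.
   */
  interface AiVdbProviderCapabilitiesInterface {

    /**
     * Whether the VDB supports grouping/aggregation (e.g. Milvus grouping
     * search), allowing the backend to group by drupal_long_id in a single
     * query instead of iterating.
     */
    public function supportsGrouping(): bool;

    /**
     * Whether the VDB supports a NOT IN filter on drupal_long_id, allowing
     * the backend to exclude already-found entities on each recursive
     * ::doSearch() call.
     */
    public function supportsNotInFiltering(): bool;

  }

The backend could then skip the retry loop entirely when grouping is supported, or pass the already-found drupal_long_id values as an exclusion filter on each recursive call.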

Remaining tasks

  1. Merge request to build in this functionality
  2. Decide when to implement, as it will probably be a breaking change and require a coordinated release of VDB providers. I suggest 2.0.x.

User interface changes

N/A

API changes

TBD

Data model changes

N/A

✨ Feature request
Status

Active

Version

2.0

Component

AI Search

Created by

🇬🇧United Kingdom scott_euser



Comments & Activities

  • Issue created by @scott_euser
  • 🇬🇧United Kingdom scott_euser
  • 🇬🇧United Kingdom scott_euser
  • 🇺🇸United States keiserjb

    I altered the doSearch method to track distinct results. It seems to work for me, but I don't understand the implications of what I've done.

  • 🇬🇧United Kingdom scott_euser

    Can you put it in a merge request so it can be tested/reviewed?

  • 🇪🇸Spain gxleano Cáceres
  • 🇪🇸Spain gxleano Cáceres

    The changes include logic to handle the limitations of recursive vector search in scenarios involving:

    • Large content split into many small chunks
    • Numerous access-controlled nodes
    • Insufficient retrieval due to the 10-iteration (maxAccessRetries) cap

    In the meantime, I've also added related changes in https://www.drupal.org/project/ai_vdb_provider_milvus/issues/3526393 ✨ Make use of Milvus' Grouping functionality (Active).

  • 🇪🇸Spain gxleano Cáceres
  • 🇬🇧United Kingdom scott_euser

    Thanks for all the work on this, and apologies for the delay. Keen to hear other opinions as I have been struggling to focus on this lately, but my general feeling is that we are repeating a lot, both in AiVdbProviderClientBase/Interface and in SearchApiAiSearchBackend.php.

    Looking here and at ✨ Make use of Milvus' Grouping functionality (Active), it seems that exclusion and grouping are both small modifications, but we are making separate methods and repeating ourselves, I guess to avoid a breaking change. Maybe we are adding a lot of complexity to avoid BC while we are still able to break it, and a coordinated release is the lesser of two evils.

    Then, knowing that a VDB supports grouping only means that SearchApiAiSearchBackend needs to skip the iteration attempts when chunks are not wanted && grouping is supported. And exclude_entity_ids is just added as a filter param if supported. I'm not actually sure we need supportsNotInFiltering(), because there are plenty of things not supported by filtering in VDBs, and ultimately, if it is supported, the VDB will simply get through the iteration quicker to reach the desired number of results.

    Then we should be able to avoid 3 separate doSearch() methods with a lot of repetition; instead, supportsGrouping() can just be what stops iteration, and we can always attempt to apply exclusions regardless of whether iteration happens or not (roughly sketched at the end of this comment). That should hopefully result in a lot less code change in SearchApiAiSearchBackend.

    I say this all without actually properly trying it, as perhaps it has been tried and it's not doable. But in any case, the first step, I think, is for us to agree on how much breaking change we are okay with.
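
    A rough sketch of that flow, purely for discussion (variable and method names such as $want_chunks, supportsGrouping() and the exclude_entity_ids param are illustrative, not the module's actual API):

    // Sketch only: grouping support is what stops the retry loop, and
    // exclusions are always attempted regardless of whether we iterate.
    $stop_after_first_pass = !$want_chunks && $this->getClient()->supportsGrouping();
    $found = [];
    $iteration = 0;

    do {
      // Always pass the entities we already have as an exclusion filter;
      // providers that cannot filter this way can ignore it and will simply
      // take a few more iterations to reach the limit.
      $params['exclude_entity_ids'] = array_keys($found);
      foreach ($this->getClient()->vectorSearch(...$params) as $match) {
        if ($this->checkEntityAccess($match['drupal_entity_id'])) {
          $found[$match['drupal_entity_id']][] = $match;
        }
      }
      $iteration++;
    } while (!$stop_after_first_pass
      && count($found) < $limit
      && $iteration <= $this->maxAccessRetries);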

  • 🇩🇪Germany marcus_johansson

    @scott_euser - regarding BC, AI Search is the one module where it is OK to break things, since it's still Experimental. The other option is to do it in 2.0.0.

  • Status changed to Needs review 3 days ago
  • 🇺🇸United States keiserjb

    Here is my altered doSearch that does the trick for me.

    protected function doSearch(QueryInterface $query, $params, $bypass_access, &$results, $start_limit, $start_offset, $iteration = 0, &$unique_entity_ids = []) {
      $params['database'] = $this->configuration['database_settings']['database_name'];
      $params['collection_name'] = $this->configuration['database_settings']['collection'];
    
      // Conduct the search.
      if (!$bypass_access) {
        // Over-fetch (5x the requested limit) to account for chunks dropped
        // by access checks and entity-level deduplication below.
        $params['limit'] = $start_limit * 5;
        $params['offset'] = $start_offset + ($iteration * $start_limit * 5);
      }
    
      $search_words = $query->getKeys();
      if (!empty($search_words)) {
        [$provider_id, $model_id] = explode('__', $this->configuration['embeddings_engine']);
        $embedding_llm = $this->aiProviderManager->createInstance($provider_id);
    
        if (!isset($params['vector_input'])) {
          if (is_array($search_words)) {
            unset($search_words['#conjunction']);
            $search_words = implode(' ', $search_words);
          }
          $input = new EmbeddingsInput($search_words);
          $params['vector_input'] = $embedding_llm->embeddings($input, $model_id)->getNormalized();
        }
    
        $params['query'] = $query;
        $response = $this->getClient()->vectorSearch(...$params);
      }
      else {
        $response = $this->getClient()->querySearch(...$params);
      }
    
      // Count chunks consumed from this batch; initialise $match so the
      // summary returns below remain safe if the response is empty.
      $i = 0;
      $match = NULL;
      foreach ($response as $match) {
        if (is_object($match)) {
          $match = (array) $match;
        }
        $i++;
    
        $entity_id = $match['drupal_entity_id'];
    
        // Access check.
        if (!$bypass_access && !$this->checkEntityAccess($entity_id)) {
          continue;
        }
    
        // Only count distinct entities towards the requested limit.
        if (!isset($unique_entity_ids[$entity_id])) {
          $unique_entity_ids[$entity_id] = TRUE;
        }

        // Keep every accessible chunk; the limit check below is on the
        // number of distinct entities found so far.
        $results[] = $match;
    
        if (count($unique_entity_ids) >= $start_limit) {
          return [
            'real_offset' => $start_offset + ($iteration * $start_limit * 5) + $i,
            'reason' => 'distinct_entity_limit',
            'vector_score' => $match['distance'] ?? 0,
          ];
        }
      }
    
      if ($iteration == $this->maxAccessRetries) {
        return [
          'real_offset' => $iteration * $start_limit * 5 + $i,
          'reason' => 'max_retries',
          'vector_score' => $match['distance'] ?? 0,
        ];
      }
    
      if (count($response) < $params['limit']) {
        return [
          'real_offset' => $iteration * $start_limit * 5 + $i,
          'reason' => 'reached_end',
          'vector_score' => $match['distance'] ?? 0,
        ];
      }
    
      // Recurse to next batch.
      return $this->doSearch($query, $params, $bypass_access, $results, $start_limit, $start_offset, $iteration + 1, $unique_entity_ids);
    }