- Issue created by @scott_euser
- 🇺🇸 United States keiserjb
I altered the doSearch method to track distinct results. It seems to work for me, but I don't understand the implications of what I've done.
- 🇬🇧 United Kingdom scott_euser
Can you put it in a merge request so it can be tested/reviewed?
- Merge request !717 Issue #3526390: Add advanced search capabilities with VDB grouping and exclusion filtering (Open) created by gxleano
- 🇪🇸 Spain gxleano Cáceres
Changes include the logic to handle the limitations of recursive vector search in scenarios involving:
- Large content split into many small chunks
- Numerous access-controlled nodes
- Insufficient retrieval due to the 10-iteration (maxAccessRetries) cap
In the meantime, I've also added related changes in https://www.drupal.org/project/ai_vdb_provider_milvus/issues/3526393 (✨ Make use of Milvus' Grouping functionality, Active).
- 🇬🇧 United Kingdom scott_euser
Thanks for all the work on this, and apologies for the delay. I'm keen to hear other opinions, as I have been struggling to focus on this lately, but my general feeling is that we are repeating a lot in both AiVdbProviderClientBase/Interface and SearchApiAiSearchBackend.php.
Looking here and at ✨ Make use of Milvus' Grouping functionality (Active), it seems exclusion and grouping are both small modifications, yet we are adding separate methods and repeating code, I guess to avoid a breaking change. Maybe we are adding a lot of complexity to preserve BC while we are still able to break it, and a coordinated release would be the lesser of two evils.
Then knowing that a VDB supports grouping only means that SearchApiAiSearchBackend needs to skip the iteration attempts when chunks are not wanted and grouping is supported, and exclude_entity_ids is just added as a filter param if supported. I'm not actually sure we need supportsNotInFiltering(), because there are plenty of things that VDB filtering does not support, and ultimately, if it is supported, the VDB will simply get through the iteration quicker to reach the desired number of results.
Then we should be able to avoid 3 separate doSearch methods that have a lot of repetition. Instead, supportsGrouping can simply be what stops iteration, and we can always attempt to apply exclusions regardless of whether iteration happens. Hopefully that results in a lot less code change in SearchApiAiSearchBackend.
I say all this without having properly tried it; perhaps it has been tried and it's not doable. In any case, I think the first step is agreeing together on how much breaking change we are okay with.
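To make the proposed control flow concrete, here is a rough, self-contained sketch. It is not code from the merge request: FakeVdbClient, its search() signature, and searchDistinctEntities() are hypothetical stand-ins, and access checks are left out. It only illustrates the idea that a single loop can serve both cases, with grouping support ending the iteration early and exclusions always being passed along.

```php
<?php

// Hypothetical stand-in for a VDB client; NOT the real AiVdbProviderClientBase API.
final class FakeVdbClient {

  public function __construct(
    private array $chunks,
    private bool $grouping,
  ) {}

  public function supportsGrouping(): bool {
    return $this->grouping;
  }

  // Returns one page of matches, skipping excluded entities. When grouping is
  // requested and supported, only the first chunk per entity is kept.
  public function search(int $limit, int $offset, array $excludeIds, bool $group): array {
    $seen = [];
    $rows = [];
    foreach ($this->chunks as $chunk) {
      $id = $chunk['drupal_entity_id'];
      if (in_array($id, $excludeIds, TRUE)) {
        continue;
      }
      if ($group && $this->grouping) {
        if (isset($seen[$id])) {
          continue;
        }
        $seen[$id] = TRUE;
      }
      $rows[] = $chunk;
    }
    return array_slice($rows, $offset, $limit);
  }

}

// Hypothetical single search loop: exclusions are always applied, and
// grouping support is simply what stops the iteration after one pass.
function searchDistinctEntities(FakeVdbClient $client, int $limit, array $excludeIds = [], int $maxRetries = 10): array {
  $group = $client->supportsGrouping();
  $found = [];
  for ($iteration = 0; $iteration <= $maxRetries; $iteration++) {
    $page = $client->search($limit, $iteration * $limit, $excludeIds, $group);
    foreach ($page as $match) {
      $found[$match['drupal_entity_id']] ??= $match;
      if (count($found) >= $limit) {
        break 2;
      }
    }
    // Grouped results already hold one row per entity, so one pass is enough;
    // a short page means there is nothing left to fetch.
    if ($group || count($page) < $limit) {
      break;
    }
  }
  return array_keys($found);
}
```

With five chunks where the first three belong to the same entity, both a grouping and a non-grouping client should end up with the same two distinct entities for a limit of 2; the non-grouping one just needs a second iteration to get there.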
- 🇩🇪 Germany marcus_johansson
@scott_euser - regarding BC, AI Search is the one module where breaking it is OK, since it's still Experimental. The other option is to do it in 2.0.0.
- Status changed to Needs review, 7:47pm 25 July 2025
- 🇺🇸 United States keiserjb
Here is my altered doSearch that does the trick for me.
protected function doSearch(QueryInterface $query, $params, $bypass_access, &$results, $start_limit, $start_offset, $iteration = 0, &$unique_entity_ids = []) {
  $params['database'] = $this->configuration['database_settings']['database_name'];
  $params['collection_name'] = $this->configuration['database_settings']['collection'];

  // Conduct the search.
  if (!$bypass_access) {
    // Fetch more to account for deduplication.
    $params['limit'] = $start_limit * 5;
    $params['offset'] = $start_offset + ($iteration * $start_limit * 5);
  }
  $search_words = $query->getKeys();
  if (!empty($search_words)) {
    [$provider_id, $model_id] = explode('__', $this->configuration['embeddings_engine']);
    $embedding_llm = $this->aiProviderManager->createInstance($provider_id);
    if (!isset($params['vector_input'])) {
      if (is_array($search_words)) {
        unset($search_words['#conjunction']);
        $search_words = implode(' ', $search_words);
      }
      $input = new EmbeddingsInput($search_words);
      $params['vector_input'] = $embedding_llm->embeddings($input, $model_id)->getNormalized();
    }
    $params['query'] = $query;
    $response = $this->getClient()->vectorSearch(...$params);
  }
  else {
    $response = $this->getClient()->querySearch(...$params);
  }

  $i = 0;
  foreach ($response as $match) {
    if (is_object($match)) {
      $match = (array) $match;
    }
    $i++;
    $entity_id = $match['drupal_entity_id'];

    // Access check.
    if (!$bypass_access && !$this->checkEntityAccess($entity_id)) {
      continue;
    }

    // Only count distinct entities.
    if (!isset($unique_entity_ids[$entity_id])) {
      $unique_entity_ids[$entity_id] = TRUE;
    }
    $results[] = $match;
    if (count($unique_entity_ids) >= $start_limit) {
      return [
        'real_offset' => $start_offset + ($iteration * $start_limit * 5) + $i,
        'reason' => 'distinct_entity_limit',
        'vector_score' => $match['distance'] ?? 0,
      ];
    }
  }

  if ($iteration == $this->maxAccessRetries) {
    return [
      'real_offset' => $iteration * $start_limit * 5 + $i,
      'reason' => 'max_retries',
      'vector_score' => $match['distance'] ?? 0,
    ];
  }

  if (count($response) < $params['limit']) {
    return [
      'real_offset' => $iteration * $start_limit * 5 + $i,
      'reason' => 'reached_end',
      'vector_score' => $match['distance'] ?? 0,
    ];
  }

  // Recurse to next batch.
  return $this->doSearch($query, $params, $bypass_access, $results, $start_limit, $start_offset, $iteration + 1, $unique_entity_ids);
}
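One implication that may be worth checking in review: the loop stops once it has seen $start_limit distinct entities, but every chunk that passes the access check is still appended to $results. The standalone sketch below mirrors only that counting logic; collectUntilDistinct() is a hypothetical helper for illustration, not part of the module.

```php
<?php

// Hypothetical helper mirroring only the counting logic of the loop above.
function collectUntilDistinct(array $matches, int $limit): array {
  $results = [];
  $uniqueEntityIds = [];
  foreach ($matches as $match) {
    // Track each entity once, but keep every chunk in the results.
    $uniqueEntityIds[$match['drupal_entity_id']] = TRUE;
    $results[] = $match;
    if (count($uniqueEntityIds) >= $limit) {
      break;
    }
  }
  return ['results' => $results, 'distinct' => count($uniqueEntityIds)];
}

$out = collectUntilDistinct([
  ['drupal_entity_id' => 'node:1'],
  ['drupal_entity_id' => 'node:1'],
  ['drupal_entity_id' => 'node:2'],
  ['drupal_entity_id' => 'node:3'],
], 2);
// $out['results'] holds 3 chunks (two from node:1, one from node:2),
// while $out['distinct'] is 2.
```

So with a limit of 2, callers can receive more than 2 rows. Whether that is desirable presumably depends on whether the caller wants chunks or entities.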