Why is this closed with no update? It would make this 1000 times more useful than something you have to do manually in the UI.
Yes, this is working pretty well for us. Along with that, you need a way of telling Search API whether to generate a new vector or not. I was working with something like this:
declare(strict_types=1);
namespace Drupal\search_api_solr_dense_vector\Plugin\search_api\processor;
use Drupal\Core\Logger\LoggerChannelInterface;
use Drupal\search_api_solr_dense_vector\Cache\KeyValue;
use Drupal\search_api_solr_dense_vector\VectorFieldTracker;
use Drupal\search_api\Plugin\search_api\data_type\value\TextValue;
use Drupal\search_api\Processor\ProcessorPluginBase;
use Symfony\Component\DependencyInjection\ContainerInterface;
/**
* Determines if a vector should be generated for an entity.
*
* @SearchApiProcessor(
* id = "determine_vector_change",
* label = @Translation("Determine Vector Change"),
* description = @Translation("Determines if the item being indexed needs the vector to be re-generated."),
* stages = {
* "preprocess_index" = 5,
* }
* )
*/
class DetermineVectorChange extends ProcessorPluginBase {
/**
* The key value cache service.
*/
protected KeyValue $keyValue;
/**
* The logger interface.
*/
protected LoggerChannelInterface $logger;
/**
* The vector field tracker service.
*/
protected VectorFieldTracker $vectorFieldTracker;
/**
* {@inheritdoc}
*/
public static function create(ContainerInterface $container, array $configuration, $plugin_id, $plugin_definition) {
$instance = new static($configuration, $plugin_id, $plugin_definition);
$instance->keyValue = $container->get('dp_search.key_value');
$instance->logger = $container->get('logger.factory')->get('search_api_solr_dense_vector');
$instance->vectorFieldTracker = $container->get('search_api_solr_dense_vector.vector_field_tracker');
return $instance;
}
/**
* {@inheritdoc}
*/
public function preprocessIndexItems(array $items) {
// Dynamically extract fields from solr_densevector field configuration.
$tracked_fields = $this->vectorFieldTracker->getTrackedFields('solr');
/** @var \Drupal\search_api\Item\Item $item */
foreach ($items as $item) {
// Reset per item so values from earlier items do not leak into this hash.
$field_values = [];
$entity = $item->getOriginalObject()->getEntity();
// Get the stored content in the index if it exists for comparison.
$query = $this->getIndex()->query(['limit' => 1])
->addCondition('task_id', $entity->id())
->execute();
if ($query->getResultCount() > 0) {
$results = $query->getResultItems();
/** @var \Drupal\search_api\Item\Item $result */
$result = reset($results);
$fields = array_intersect_key($result->getFields(), array_flip($tracked_fields));
foreach ($fields as $field) {
$values = $field->getValues();
foreach ($values as $value) {
$field_values[] = ($value instanceof TextValue) ? $value->getText() : $value;
}
}
}
if (!empty($field_values)) {
$original_hash = $this->keyValue->getHash($entity);
$compared_hash = hash('sha256', implode(" ", array_filter($field_values)));
$vectors = $this->keyValue->getVector($entity);
$item->setExtraData('skip_vector_generation', ($original_hash === $compared_hash && !empty($vectors)));
}
}
}
}
declare(strict_types=1);
namespace Drupal\search_api_solr_dense_vector;
use Drupal\Core\Entity\EntityTypeManagerInterface;
/**
* Obtains tracked fields for vector indexing.
*/
class VectorFieldTracker {
public function __construct(
protected EntityTypeManagerInterface $entityTypeManager,
) {}
/**
* Gets the tracked fields on a dense vector field type.
*
* @param string $id
* The index id.
*
* @return array
* The fields stored in the dense vector field type.
*/
public function getTrackedFields(string $id): array {
$tracked_fields = [];
$index = $this->entityTypeManager->getStorage('search_api_index')->load($id);
// Bail out early if the index does not exist.
if ($index === NULL) {
return $tracked_fields;
}
$fields = $index->getFields();
foreach ($fields as $field) {
if ($field->getType() === 'solr_densevector') {
$field_config = $field->getConfiguration();
$configured_fields = $field_config['fields'] ?? [];
foreach ($configured_fields as $field_path) {
$parts = explode('/', $field_path);
$field_name = end($parts);
if (!in_array($field_name, $tracked_fields)) {
$tracked_fields[] = $field_name;
}
}
}
}
return $tracked_fields;
}
}
In my case I am building the vector from an aggregated field (several concatenated string fields), so VectorFieldTracker likely only works in that scenario. Otherwise, it works to signal to the main processor whether the vector needs updating or not. Then, in an entity update hook:
/**
* Act on entities after they are updated.
*
* @param \Drupal\Core\Entity\EntityInterface $entity
* The entity being updated.
*/
public function __invoke(EntityInterface $entity): void {
if ($entity->getEntityTypeId() === 'node' && $entity->bundle() === 'performance_task') {
try {
$field_values = [];
$tracked_fields = $this->vectorFieldTracker->getTrackedFields('solr');
foreach ($tracked_fields as $field_name) {
if ($entity->hasField($field_name) && !$entity->get($field_name)->isEmpty()) {
$field_values[] = $entity->get($field_name)->value;
}
}
$hash = hash('sha256', implode(" ", array_filter($field_values)));
$this->keyValue->setHash($entity, $hash);
}
catch (\Exception $e) {
$this->loggerFactory->get('dp_search')->error('Error generating hash for node @nid: @message', [
'@nid' => $entity->id(),
'@message' => $e->getMessage(),
]);
}
}
}
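The comparison that the processor and the hook perform between them can be sketched in isolation. This is a minimal sketch, not the module's code: `vector_source_hash()` and `should_skip_vector()` are hypothetical helper names, and the stored hash and vector are passed in directly rather than read from the key value service.

```php
<?php
declare(strict_types=1);

/**
 * Builds a SHA-256 hash from a list of field values.
 *
 * Empty values are dropped first, mirroring the array_filter() call used in
 * the processor and the update hook above.
 */
function vector_source_hash(array $field_values): string {
  return hash('sha256', implode(' ', array_filter($field_values)));
}

/**
 * Decides whether vector generation can be skipped (hypothetical helper).
 *
 * Skip only when the stored hash matches the current one AND a vector is
 * already stored; otherwise a new vector has to be generated.
 */
function should_skip_vector(?string $stored_hash, string $current_hash, array $stored_vector): bool {
  return $stored_hash === $current_hash && !empty($stored_vector);
}

// Hash written by the update hook when the entity was last saved.
$stored = vector_source_hash(['Title', 'Body text']);

// Unchanged content with an existing vector: generation can be skipped.
var_dump(should_skip_vector($stored, vector_source_hash(['Title', 'Body text']), [0.1, 0.2]));
// Edited content: the hashes differ, so the vector must be regenerated.
var_dump(should_skip_vector($stored, vector_source_hash(['Title', 'Edited body']), [0.1, 0.2]));
```

Requiring a non-empty stored vector as well as a matching hash means that a missing vector always triggers generation, even when the content has not changed.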
I'm seeing this too. With system.site, for example, drush cex still exports system.site values and writes to the file in Drupal 11.2.x.
That should be all of it in the branch there, try that out. Note that I am removing all punctuation and case from the input; I figured that would be more efficient.
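For anyone curious what stripping punctuation and case could look like, here is a rough sketch. It is not the code from the branch: `normalize_vector_input()` is a hypothetical name, and the exact character classes are an assumption on my part.

```php
<?php
declare(strict_types=1);

/**
 * Normalizes text before it is sent off for embedding (hypothetical helper).
 */
function normalize_vector_input(string $text): string {
  // Lowercase first so "Solr" and "solr" produce the same input.
  $text = mb_strtolower($text);
  // Replace anything that is not a letter, number, or whitespace with a space.
  $text = preg_replace('/[^\p{L}\p{N}\s]+/u', ' ', $text) ?? '';
  // Collapse the runs of whitespace left behind by the replacement.
  return trim(preg_replace('/\s+/u', ' ', $text) ?? '');
}

echo normalize_vector_input("Hello, World! It's Solr 9.8."), PHP_EOL;
```

Replacing punctuation with a space rather than deleting it keeps words like "foo/bar" from being merged into one token.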
I have some code for this, been using it for a few weeks. Will drop in a branch.
You beat me to it, only I was trying this:
$vector_field = reset($vector_fields);
$solr_fields = $index->getServerInstance()->getBackend()->getSolrFieldNames($index);
$solr_field = $solr_fields[$vector_field->getFieldIdentifier()];
I'm sure there is a programmatic way to get the full id or prefix without doing that, I will have to check.
I am using an aggregated field with 5 text fields concatenated for dense vector.
I think this has to do with the way the form is rebuilt and a subform is injected along with the plugin form.
I have only seen knn and not the other - what kind of field were you making a dense vector type?
The meat of this PR is workable. You can develop numerous plugins with their own configuration and forms and it will swap back and forth and export on the index level.
Example plugin:
Updating status... I am not sure if there is anything to do here or if the warning is simply a warning.
I've upgraded my local to 9.8 and 9.9 and see this warning in the logs, but I think it's just a warning.
According to the official Solr docs, any integer value is fine:
https://solr.apache.org/guide/solr/9_8/query-guide/dense-vector-search.html
although it isn't entirely clear from the discussions since 2022:
https://github.com/apache/lucene/pull/12436
https://github.com/apache/lucene/issues/11507#issuecomment-1733223795
This will be resolved by the issue "Should plugins be configurable?".
One caveat with this setting is that 'infinity' or 'no option' is specified as either being left out of the query or '-Infinity'. In Drupal, we should interpret 0 as 'do not include' or add a checkbox that toggles this and minReturn on to include or not include.
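A rough sketch of the "0 means do not include" interpretation follows. It assumes the setting in question is the `minTraverse` parameter of Solr's `vectorSimilarity` query parser; the helper name is hypothetical and this is not the module's query-building code.

```php
<?php
declare(strict_types=1);

/**
 * Builds Solr local params for a vector similarity query (hypothetical helper).
 *
 * A value of 0 (or NULL) is treated as "not set": the parameter is left out
 * of the query so Solr falls back to its own default.
 */
function build_vector_similarity_params(string $field, ?float $min_return, ?float $min_traverse): string {
  $params = ['f=' . $field];
  if (!empty($min_return)) {
    $params[] = 'minReturn=' . $min_return;
  }
  if (!empty($min_traverse)) {
    $params[] = 'minTraverse=' . $min_traverse;
  }
  return '{!vectorSimilarity ' . implode(' ', $params) . '}';
}

// minTraverse is 0 here, so it is omitted from the local params.
echo build_vector_similarity_params('vector', 0.7, 0.0), PHP_EOL;
```

A checkbox would be more explicit than overloading 0, but treating 0 as "omit" avoids adding another form element.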
This should have made it into the last release, but the plugin now receives an array of the settings from the Dense Vector section of the index. After this the plugins should likely be configurable, but these are base settings all plugins should have.
This merge introduces a new interface and plugin manager as well as moves some of the settings from the processor to the index. A starter plugin is included (PureVector) - from here others are able to develop their own manipulations to fit the use cases they need.
This did not have as much effect as I thought. There are still hundreds of calls to node_access and taxonomy_index per page, even with just myself browsing the site on a lower environment.
Landed on this issue like others diagnosing what could be wrong. In my case, I noticed a few things.
permissions_by_term_node_access is called several times per page, as designed. However, the service it calls does not account for a few things:
- Whether the target page contains a field that is used by PBT (it only checks if it has a taxonomy term field)
- Whether the target page(s) in canUserAccessByNode have values for those field(s).
In our case, we only use one vocab for PBT and only on a few content types. A majority of pages that have the field have no value.
I am going to try the following:
- Limit isAnyTaxonomyTermFieldDefinedInNodeType to respecting the configuration of PBT:
/**
* Checks whether there are taxonomy fields defined in a given node type.
*/
public function isAnyTaxonomyTermFieldDefinedInNodeType(string $nodeType): bool {
$fieldDefinitons = $this->entityFieldManager->getFieldDefinitions('node', $nodeType);
$config = \Drupal::config('permissions_by_term.settings');
$pbt_target_bundles = $config->get('target_bundles') ?? [];
foreach ($fieldDefinitons as $fieldDefiniton) {
if ($fieldDefiniton->getType() === 'entity_reference' && is_numeric(strpos($fieldDefiniton->getSetting('handler'), 'taxonomy_term'))) {
// Default to an empty array when the field does not restrict target bundles.
$field_target_bundles = $fieldDefiniton->getSetting('handler_settings')['target_bundles'] ?? [];
if (array_intersect_key(array_flip($field_target_bundles), array_flip($pbt_target_bundles))) {
return TRUE;
}
}
}
return FALSE;
}
- Check if the $node has values for the taxonomy field in canUserAccessByNode - if it doesn't, then allow the user access:
$configPermissionMode = $this->configFactory->get('permissions_by_term.settings')->get('permission_mode');
$requireAllTermsGranted = $this->configFactory->get('permissions_by_term.settings')->get('require_all_terms_granted');
if (!$configPermissionMode && !$requireAllTermsGranted) {
$access_allowed = TRUE;
}
else {
$access_allowed = FALSE;
}
if ($node->hasField('field_access_restrictions') && $node->get('field_access_restrictions')->isEmpty()) {
return TRUE;
}
// ... rest of method
I am hoping to see some improved NewRelic reports from this - we are seeing a ton of activity on MySQL and taxonomy_index degrading performance and I believe 90% of it does not need to occur.
Circling back to this, I added #access FALSE, which should prevent it from being printed (at least in normal rendering). I updated the tests to ensure that trying to render it produces empty output. This may also be a case where there is markup outside of the Twig tag that checks `{% if content %}`, or a similar evaluation, to ensure it is not empty.
A more complete example may look something like this:
/**
* {@inheritdoc}
*/
public function postExtractResults(PostExtractResultsEvent $event): void {
$query = $event->getSearchApiQuery();
if ($query->getIndex()->isValidProcessor('solr_densevector')) {
try {
$processor = $query->getIndex()->getProcessor('solr_densevector');
$settings = $processor->getConfiguration();
if (!empty($settings['content_field'])) {
$results = $event->getSearchApiQuery()->getResults();
foreach ($results as $result) {
$field = $result->getField($settings['content_field']);
// Skip results that do not carry the configured content field.
if ($field === NULL) {
continue;
}
$field_values = [];
$values = $field->getValues();
foreach ($values as $value) {
$text = ($value instanceof TextValue) ? $value->getText() : $value;
$text = strip_tags($text);
$field_values[] = $text;
}
$result->setExtraData('content', implode(' ', $field_values));
}
}
}
catch (\Exception $e) {
// log here
}
}
}
kevinquillen → created an issue. See original summary → .
I see what you mean, but where would that be set? In a ProcessingResultsEvent event subscriber?
Also, I spent yesterday looking into this for the first time (RAG tool with AI Chatbot), and in cases where you want to use a View or different Solr options (parse mode, dismax/edismax, etc.), that is lost in the tool because it's called and executed directly. This was causing me to never get any results.
That caused me to write my own plugin like:
/**
* {@inheritdoc}
*/
public function execute() {
// Collect the context values.
$this->searchString = $this->getContextValue('search_string');
$end_results = [];
try {
$view = Views::getView('search');
$view->setDisplay('sitesearch');
$view->setExposedInput(['keyword' => $this->searchString]);
$view->execute();
$i = 1;
/** @var \Drupal\views\ResultRow $result */
foreach ($view->result as $result) {
/** @var \Drupal\node\NodeInterface $node */
$node = $result->_entity;
$url = $node->toUrl()->toString();
$end_results[] = "#$i: " . $node->getTitle() . " (<a href=\"$url\">visit page</a>)<br>";
$i++;
if ($i >= 6) {
break;
}
}
}
catch (\Exception $e) {
$this->setOutput("We're sorry, but I could not search the site. Please give me a few moments and try again.");
return;
}
if (count($end_results)) {
$output = "I was able to find some relevant content for you! Here are the top results based on what you asked:<br><br>";
$output .= implode("<br/><br/>", $end_results);
$output .= "<br/><br/>-----<br/>Didn't find what you were looking for? Try using our more robust <a href='/search'>site search</a>!";
$this->setOutput($output);
}
else {
$this->setOutput("No results were found when searching in the rag index " . $this->index . " for the following prompt: " . $this->searchString . ".\n");
}
}
Which produced results. I could also confirm that the event subscriber in this module was also fired, so it performed a RAG search against Solr. These are probably items to report back to the main ai_search module because it can be a little confusing when the RAG tool returns nothing. I see that ai_search module provides its own search backend that creates the content property.
If I add that event to the current query event subscriber:
public function postExtractResults(PostExtractResultsEvent $event): void {
$results = $event->getSearchApiQuery()->getResults();
foreach ($results as $result) {
$result->setExtraData('content', 'Content string here');
}
}
I can see the result on the FunctionTool plugin I made. Perhaps the shortest solution here is adding on the processor setting to ask the user "which" field should be returned as 'content' in this context.
Does this include where CKEditor can stream the response?
if ($response instanceof StreamedChatMessageIteratorInterface) {
return new StreamedResponse(function () use ($response) {
foreach ($response as $message) {
echo $message->getText();
ob_flush();
flush();
}
}, 200, [
'Cache-Control' => 'no-cache, must-revalidate',
'Content-Type' => 'text/event-stream',
'X-Accel-Buffering' => 'no',
]);
}
This patch completely breaks all functionality.
Looks like this was somewhat fixed in alpha2. I have updated the behavior name to match naming conventions, removed superfluous comments, and removed the CSS animations. The animations may not look good in different admin themes and can present some accessibility challenges.
Are you able to see the size property in the configuration files on Solr? It should reflect the size of the model (1536) on the processor settings. If not, you may need to upload that configset and reload the core. Otherwise, it may simply just be a warning from Lucene/Solr.
What version of Solr are you running? IIRC this dimension size was capped at 1024 until around Solr 9.3.
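As a quick sanity check, something like the following could flag a mismatch before indexing. It is only a sketch: the 9.3.0 threshold and the 1024 cap are taken from my recollection above and are passed in as defaults rather than treated as confirmed values, and `dimension_supported()` is a hypothetical name.

```php
<?php
declare(strict_types=1);

/**
 * Checks whether a vector dimension is expected to work on a Solr version.
 *
 * Hypothetical helper. Both defaults are assumptions, not confirmed limits.
 */
function dimension_supported(string $solr_version, int $dimension, string $uncapped_since = '9.3.0', int $legacy_cap = 1024): bool {
  // Versions at or above the threshold are assumed to have no cap.
  if (version_compare($solr_version, $uncapped_since, '>=')) {
    return TRUE;
  }
  // Older versions only accept dimensions up to the legacy cap.
  return $dimension <= $legacy_cap;
}

// A 1536-dimension model against an older and a newer Solr version.
var_dump(dimension_supported('9.2.1', 1536));
var_dump(dimension_supported('9.8.0', 1536));
```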
Sure, I'm interested in seeing a conceptual solution of old way (no HTMX) / new way with HTMX. Does it also work for Form AJAX?
I read the CRs, but could the examples be expanded on a bit?
A quick naive change I made was to replace any '/' in a model name with '+' so the URL didn't break, then rewrite the '+' back to '/' for the form and title method(s).
This works, even though Ollama models cannot be (at least this one) edited:
Either way, I think such a change is necessary to make these screens function or alternatively handling model_id in the route differently so it works for all cases.
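The swap itself is tiny. A minimal sketch, with hypothetical function names, of what I mean:

```php
<?php
declare(strict_types=1);

/**
 * Makes a model ID safe to use as a single route parameter.
 */
function encode_model_id(string $model_id): string {
  return str_replace('/', '+', $model_id);
}

/**
 * Restores the original model ID from the route value.
 */
function decode_model_id(string $route_value): string {
  return str_replace('+', '/', $route_value);
}

$encoded = encode_model_id('hf.co/org/model:latest');
echo $encoded, PHP_EOL;
echo decode_model_id($encoded), PHP_EOL;
```

This is naive on purpose: a model name that already contains a '+' would not survive the round trip, which is part of why handling model_id differently in the route may be the better fix.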
This is because the event subscriber fires when the module is enabled, and the event subscriber loads the processor from the index and runs operations. Now there is a check first to see that solr_densevector is a valid processor before continuing. However, if you remove this processor, you should also remove any 'Dense Vector' field configured on the index, because it will fail at index time trying to store a non-vectorized value into the vector field in Solr.
Not sure if this is related to this issue, but I wound up in the same area of code that the patch is addressing but for another reason.
My case: I have a Views REST display with a path of api/v1/foo/bar/%node. It has two contextual filters. The first one uses the URL value to load a node for the contextual filter. The second one uses the currently logged in user.
The admin View UI preview works fine by just passing a node id in "Preview with contextual filters". However, if I am trying to assemble the URL to pass along for a decoupled React app, Views requires the two arguments:
$view_url = $this->request->getSchemeAndHttpHost() .
$view->getUrl(
[
$node->id(),
\Drupal::currentUser()->id(),
]
)->toString();
If I visit api/v1/foo/bar/(node id) directly, I get results without needing the additional argument in the URL. If I curl the URL (with an authenticated cookie value from my session) I get a value.
I can alternatively do this:
if ($display_id == 'my_view_id' && !empty($node->id())) {
$view_url = $this->request->getSchemeAndHttpHost() . Url::fromUserInput('/' . str_replace('%node', $node->id(), $view->getPath()))->toString();
}
but that is not so great to read or maintain. Should the URL be the equivalent of the path? How is that interpreted?
Placeholder isn't clear. The field defaults to 11434 and has a note in the field description. Setting to NR.
For DDEV and typical Docker-based services, the service name is enough (Docker will resolve this behind the scenes). http://ollama will work, similar to running Solr in DDEV.
kevinquillen → made their first commit to this issue’s fork.
That is correct. We were told it was because of the use of $_SESSION to pass state and values around. It does not persist, or is not guaranteed to be valid when read, in all cases (it always worked locally in DDEV).
mably → credited kevinquillen → .
Unavoidable at the time. Open a new issue; it should be an easy fix. In the long run we should probably figure out a better way to handle this, even though new numeric series models for OpenAI are not too frequent.
Just as an update here:
1. On first index creation, go to the Processors tab.
2. Enable the DenseVector processor.
3. Configure the processor - first select the provider and save, then reload and pick the AI model (this will be improved later).
4. Save.
This should fix it. Why it errors on first creation of an index I am not sure of yet.
In that case this change should likely apply to both the variables and token section, not just the token section.
I think this is because of submit buttons not having a unique #name value. If I give Remove a unique value like 'remove_token_$i' then extract that element in the ajax callback, the issue goes away.
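Roughly what I mean, as a sketch outside of the actual form class: each Remove button gets its own #name, and the callback reads the index back out of the triggering element. The helper names are hypothetical; only the 'remove_token_$i' naming comes from what I tried.

```php
<?php
declare(strict_types=1);

/**
 * Builds one Remove button per token, each with a unique #name.
 */
function build_remove_buttons(int $count): array {
  $buttons = [];
  for ($i = 0; $i < $count; $i++) {
    $buttons[$i] = [
      '#type' => 'submit',
      '#value' => 'Remove',
      // Without a unique #name, every button reports the same trigger.
      '#name' => 'remove_token_' . $i,
    ];
  }
  return $buttons;
}

/**
 * Extracts the token index from the triggering element, or NULL if none.
 */
function removed_index_from_trigger(array $triggering_element): ?int {
  if (preg_match('/^remove_token_(\d+)$/', $triggering_element['#name'] ?? '', $matches)) {
    return (int) $matches[1];
  }
  return NULL;
}

$buttons = build_remove_buttons(3);
var_dump(removed_index_from_trigger($buttons[2]));
```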
I also cannot see the Close button in Gin, but I can see it in the source of the modal. I had to write a lot of CSS overrides to get this to work for me in Gin:
html .ui-dialog .ui-dialog-titlebar .ui-dialog-titlebar-close {
margin: 12px 5px 0 0 !important;
padding: 0 !important;
opacity: 1 !important;
inline-size: 25px;
background: none;
}
html .ui-dialog .ui-dialog-titlebar .ui-dialog-titlebar-close .ui-icon.ui-icon-closethick {
background: #fff !important;
}
Now I can see the close icon.
This seems like it needs a reroll for 11.2.
Quick question, how were you able to get multiple search terms passed?
Please read the README for Search API Solr on how to generate and update XML configuration sets for Solr. It will do the necessary work and not require manual XML editing.
It is expected you follow Search API Solr 4.3+ setup instructions, yes, on configuring Drupal for Solr and generating the necessary configuration for it.
As for Solr itself, I can't speak to that (how anyone installs it). I used DDEV's Solr instance and it worked perfectly fine.
kristen pol → credited kevinquillen → .