Recent comments

πŸ‡ΊπŸ‡ΈUnited States DiegoPino

This bug is still around (6 years!) in 11.x, but both patches are failing because the tests are not passing once the actual "excluded" elements are no longer part of the raw data and thus removed from the URLs.

e.g. here:
https://git.drupalcode.org/project/drupal/-/blob/11.x/core/modules/views...

and here at
ExposedFormRenderTest::testExposedFormRawInput

The Layout Builder failure might just be Layout Builder failing in 10.1, so everything needs to be rebased onto 11.x-dev.
@quietone since you are already in this, do you want to tackle that, or are you OK with me giving it a shot?

πŸ‡ΊπŸ‡ΈUnited States DiegoPino

@mkalkbrenner thanks for your quick reply.

Our way of producing vectors (embedding extraction) is for sure not the standard way. We have a chainable and configurable post-processor plugin system for our custom type of fields/data that runs as a set of "extractors", from OCR, to file transforms, to vectors; in our case these are pushed into a background processing queue and then injected into custom datasources. The number of moving parts is kind of huge, and it does not feel like the type of project you would want to mimic for this.

But, going back to the idea of plugins: I believe that people (users and devs) using your module would be more comfortable using the existing Search API processor idea. Since indexing already happens (most of the time at least) via cron or via Drush, the overhead of calling an external service (in our case it is external to Drupal, but not external in the sense of a commercial API) would not be huge. I mean, we enqueue and have workers for everything, but that is a choice. So why an extra plugin on top of just a new processor?

Because you want to reuse the "processing/remote API call -> return as vector" logic at query time too. So a Views filter would need to be able to call the same logic used to index a given vector, using the same API. Vectors are opinionated: a vector generated by X won't make any sense in relation to one generated by Y. Vector dimension is also key here: fixed, never variable. And lastly, depending on the comparison algorithm, you might want to provide a normalized unit vector so you can use the faster dot_product instead of cosine (which, again, is a fixed setting for the field type).
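To illustrate the point about normalization (a minimal, self-contained sketch, not code from either module): once vectors are scaled to unit length at index time and query time, the dot product of two vectors equals their cosine similarity, so the cheaper dot_product comparison can replace cosine.

```python
import math

def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 0.5, 1.0])

# For unit vectors, the dot product IS the cosine similarity,
# which is why dot_product can be fixed as the comparison algorithm
# in the field type when you guarantee normalized input.
assert abs(dot(a, b) - cosine(a, b)) < 1e-12
```

This is why the normalization has to live in the shared embedding logic rather than in the processor alone: the same guarantee must hold for the query-time vector.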

So, to sum up (my 25 cents): a processor (e.g. like the aggregated field one, or the entity renderer one) that takes as argument/config another type of very opinionated plugin. These plugins would have standard methods (but opinionated internal logic) to call APIs with an input (in this case the same input a normal processor would get) and return a vector (array), plus fixed annotations with vector size, etc. That way devs can write their own plugins that talk to/understand/provide the logic needed for each remote service (which will vary a LOT), and the same logic (which needs to be available outside of the processor itself) can be plugged in at query time to transform the input into a vector.

I see what you are doing on search_api_clir and it is very interesting.

πŸ‡ΊπŸ‡ΈUnited States DiegoPino

Hi @mkalkbrenner, our project has a need for this and I'm willing to give it a try, but to align with your roadmap I need some pointers.
First, a bit of background. We already have tons of external (Drupal and non-Drupal) supporting code and some good experience altering/acting on events in this wonderful module to use custom Solr types, custom datasources, JOINs, etc., e.g. the way we alter highlighting, allowing us to use fields that are driven by external Solr plugins which require different query arguments, etc.

1.- So, from the perspective of actual implementation, first we need to put the data in :)
Because the Dense Vector types are pre-set with a fixed comparison algorithm and a fixed vector size per type, we are right now defining 4 types with vector sizes of 384 (BERT/text embeddings), 512 (Apple Vision image fingerprint), 576 (YOLO embeddings) and 1024 (MobileNet embeddings). I believe a 384 one should be sufficient as part of a release, and anyone else could then extend this by providing their own.
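For reference, one of these fixed-size types could look like the following in a Solr managed schema. This is an illustrative sketch, not this module's generated config; the type and field names (`knn_vector_384`, `embedding_bert`) are made up, while `solr.DenseVectorField` with `vectorDimension` and `similarityFunction` is the upstream Solr 9 syntax:

```xml
<!-- Hypothetical schema snippet: one fixed-size dense vector type per dimension -->
<fieldType name="knn_vector_384" class="solr.DenseVectorField"
           vectorDimension="384" similarityFunction="cosine"/>

<!-- The field must be single-valued, even though PHP hands Solr an array of floats -->
<field name="embedding_bert" type="knn_vector_384"
       indexed="true" stored="true" multiValued="false"/>
```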

The first issue is the mismatch between cardinality and field generation. A vector, when passed from PHP to Solr, is an array (so multivalued, with a fixed size based on the field type config), but it always goes into a single-valued field in Solr (multiValued=false); the dynamic field generation in \Drupal\search_api_solr\Entity\SolrFieldType::getDynamicFields is blind to this need.
The question is (or what would you suggest):
- Add a new field type config setting, e.g. like $this->custom_code, a $this->cardinality, allowing a field type to "ask" for no dynamic fields outside of what its type allows (in the case of a vector, of course, single-valued only). This could be useful for future types/other fields driven by custom Solr plugins that have that need. It could also be a full Solr field settings override, where a field type could "ask" to handle completely, via config, how the field is generated.
- OR a fixed method like getSpellcheckField() (e.g. getDenseVectorField()) that targets Dense Vectors specifically
- OR an event that allows any external module to alter the dynamic fields (delegating the actual support and extra configs to anyone willing to write an event subscriber)

Second issue: let's say we now have a dynamic, single-valued field for one of these custom field types, and I want to setValue() for the field.
The data type at the PHP level will be an array (multivalued), mismatching the data type at the backend. So the question is:
- Do we need a new @SearchApiDataType that allows a vector? Any other workarounds?

I think how one generates/populates the vectors, both at index time and query time, is beyond a first implementation in this module. We, for example, have a Docker container that processes images and generates a custom datasource populated with this data (and NLP, hOCR). But that will vary a lot between users. Some might want to add this type of field via a processor.

At query time:
Our hack for custom queries has been to set EDISMAX dynamically via a custom Views filter and add a custom option to the query; EDISMAX because it is the current parser that alters the least/is the least opinionated of them all. Then we intercept everything in a PostConvertedQueryEvent subscriber, check if the given option was passed, and if so we remove the edismax component from the Solarium query and add all our custom logic. This has allowed us in the past to do subqueries, JOINs, etc. But for an official implementation, I wonder if having a custom parse plugin would be ideal. The only issue I see with that (and Views integration) is that it will have to interact with a normal filter/facets but use them as a pre-filter in a !knn query. Solr also recommends 3 different options: pre-filtering, re-ranking, and a "must" compound query. And this custom parser makes no sense when used in an exposed filter in a Views. Ideas?
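To make the pre-filter interaction concrete, a raw Solr request for this kind of query could look like the sketch below. The field name and vector values are illustrative (the real vector would need to match the type's fixed dimension, e.g. 384 floats); `{!knn ...}` is the upstream Solr knn query parser, and in recent Solr 9.x an `fq` alongside it is applied as a pre-filter before the vector search:

```text
# Hypothetical request params; embedding_bert and the filter field are made up
q={!knn f=embedding_bert topK=10}[0.12, -0.03, 0.57, ...]
fq=ss_type:image
```

The tension described above is exactly here: a normal Views filter or facet would normally contribute to `q`, but for !knn it has to land in `fq` (or a re-rank/compound query) instead.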

That is what I have so far. I think the issue is not really coding this (testing might be a challenge, but your current tests are excellent; most of what I have learned from this module comes from reading your tests) but knowing what is worth tapping into: to what degree this module needs to cover everything, or whether it should just allow the flexibility to override some things and provide the basics.

Thanks

πŸ‡ΊπŸ‡ΈUnited States DiegoPino

Hi @Chi, you are totally right. Please feel free to close this issue. This was my mistake: I had another root composer dependency blocking the upgrade, and the composer messages confused me. ^10.0 is correct in the sense that 10.1 and 10.2 are allowed. But really, semantically speaking, 10.1 has breaking changes compared to 10.0, so in that sense (unrelated to this module) it is not. Again, sorry, and thanks.
