- Issue created by @larowlan
- 🇦🇹Austria drunken monkey Vienna, Austria
Thanks for creating this issue!
Your reasoning makes sense: having the same word lots of times in a single document doesn’t just linearly make it more “relevant” to a search for that word. That’s one of the many shortcomings of the current DB backend implementation, which of course often has to lag behind the functionality of a dedicated search engine like Solr. Solr, for instance, uses something called the “inverse document frequency” (idf) to penalize search terms that appear in lots of documents when calculating their scores.
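To make the idf idea concrete, here is a minimal sketch in Python of the textbook formula, log(N / df). It is only an illustration: the function name and sample documents are made up, and neither Solr nor the DB backend is claimed to use exactly this variant.

```python
import math


def inverse_document_frequency(term, documents):
    """Textbook idf: log(N / df), where df is the number of documents
    that contain the term at least once. Hypothetical helper, not the
    formula of any particular search engine."""
    doc_count = len(documents)
    containing = sum(1 for words in documents if term in words)
    if containing == 0:
        # A term that appears nowhere cannot contribute to any score.
        return 0.0
    return math.log(doc_count / containing)


# Made-up corpus: each document is reduced to its set of distinct words.
documents = [
    {"search", "api", "module"},
    {"search", "solr", "backend"},
    {"search", "database", "backend"},
    {"search", "views", "integration"},
]

common = inverse_document_frequency("search", documents)  # in every document
rare = inverse_document_frequency("solr", documents)      # in one document
print(common, rare)
```

A term that occurs in every document gets a weight of zero, while a rare term gets a higher weight. Note that this penalizes terms by how many documents contain them, not by how often they occur inside a single document.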
In principle, I think something like this, or even in general mimicking Solr’s scoring more closely, wouldn’t even be that hard to implement using the database, if we’re prepared to accept the “cost” of added complexity for already pretty complex code. (Really, the only thing that’s been keeping this afloat is the excellent test coverage.) In general, though, I tend to suggest that people just switch to Solr, or a similar dedicated search backend, if they want to improve their search experience. I don’t like to let this “simple” backend get too ambitious, mainly because I’m the one who has to maintain it in the end, and I can’t do that properly when I don’t really know what it does anymore.
Anyways, I realize that this isn’t very much on topic for this issue, which is a far more limited proposal. However, my point is that I do think such a small change, of having a term that appears lots of times in an item not just contribute linearly to that item’s score, would be something we could just implement in the database backend itself, without the need for another processor. It would also simplify the implementation a lot; see the attached patch as an example. (Not yet a finished patch, of course, but something like this would be the idea. The exact formula for how the influence of the term count on the term score should degrade would still need to be determined, probably through literature research or comparison to other OSS search engines.)
Irrespective of where this is implemented, I think my approach would have two advantages over your proposed processor:
- I don’t think making this configuration dependent on the specific words makes much sense. After all, it seems the same problem would occur for almost any word that appears very often in some body text. You cannot effectively counteract such a problem with an exhaustive list of words for which it should be fixed.
- On the other hand, I don’t think just capping the token count at, for instance, ten occurrences is the best solution. After all, an item containing the word 20 times is probably more relevant than one only containing it ten times (all else being equal), just not twice as relevant. I think degrading the contribution of each new occurrence, while still keeping it positive, would be the best approach. (On reflection, I don’t think my example patch actually ticks that “keeping them positive” box, as soon as you have tokens with different scores. Something to consider.)
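The difference between the three options (linear, capped, damped) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, every token is assumed to have the same score, and the damping formula 1 + ln(count) is one common sublinear choice, not the formula from the attached patch.

```python
import math


def linear_score(count, token_score=1.0):
    """Every occurrence contributes the full token score."""
    return count * token_score


def capped_score(count, token_score=1.0, cap=10):
    """Occurrences beyond the cap contribute nothing at all."""
    return min(count, cap) * token_score


def damped_score(count, token_score=1.0):
    """Sublinear damping, 1 + ln(count): each extra occurrence still
    adds something positive, but less than the previous one.
    Assumes a single, uniform token score (hypothetical simplification)."""
    if count <= 0:
        return 0.0
    return (1.0 + math.log(count)) * token_score


for count in (1, 10, 20, 164):
    print(count, linear_score(count), capped_score(count),
          round(damped_score(count), 2))
```

With the cap, 10 and 20 occurrences score identically. With the damped variant, 20 occurrences still score higher than 10, but far less than twice as high, and even a very large count no longer dominates the result. The sketch does not address the case of tokens with different scores mentioned above.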
Of course, making such a change to the basic score calculation would probably need a bit more input and feedback from others, to make sure this is an overall positive change. (We might also want to make it optional, at least in some way, so we don’t break anyone’s site without warning or an easy workaround.) Also, as said, it would take some work to come up with a good formula.
If that sounds like too much work, or you’re convinced of your approach, I can totally understand that. Then I suggest you just implement this in a custom or separate contributed module for now. If enough people declare interest in it here (or are using your module), we might still add it to the Search API module later, but right now it appears a bit too specific to be integrated. I don’t like newcomers to be overwhelmed when going to the Processors tab (if it’s not far too late for that), and this does sound like a processor that would be difficult to explain to beginners.
If you are interested in working on this, though, that would be great, as we could potentially improve the search experience on thousands of sites, at least by a bit. In that case, please tell me what you think of my suggestions and what your thoughts are regarding the specific formula to use, whether to make this optional, etc.
- 🇦🇺Australia larowlan 🇦🇺🏝.au GMT+10
That looks much simpler.
I'll give it a go and try to add some test coverage for it.
- 🇦🇺Australia larowlan 🇦🇺🏝.au GMT+10
In testing, this didn't help our scenario, as the word is not common across many documents, just across some documents.
I'll have a go at my proposal and see if it works for our scenario, but from the sound of it, our scenario sounds uncommon.
So feel free to repurpose this issue for an IDF implementation per your patch.
- last update over 1 year ago: 527 pass, 1 fail
- 🇦🇺Australia larowlan 🇦🇺🏝.au GMT+10
FWIW I have a working version of my approach. Happy to share here or create a new issue if you want to use this for IDF.
In the extreme case, there was one document with 164 instances of one of the search terms, so it was always showing first for any search that included that term.