Term Frequency limit processor feature

Created on 12 May 2023, over 1 year ago
Updated 31 May 2023, over 1 year ago

Problem/Motivation

We're working on a project where certain words can be very prevalent in some documents and hence skew the search results for any query containing those terms.

For example, the word 'document' can appear in some content 50+ times.

But when searching for a specific title phrase, e.g. 'Clinical service document' (which appears in a title field), even with a boost of 21 on the title field, the content with really dense instances of 'document' appears first.

At present, the score calculation keeps adding additional score for each instance.

We've looked at making 'document' a stop word, but it isn't really the right use for that feature.

We also looked into the logic of the module and found that it does apply a 'decay' to words that appear later in the document (this code):

// Taken from core search to reflect less importance of words later
// in the text.
// Focus is a decaying value in terms of the amount of unique words
// up to this point. From 100 words and more, it decays, to (for
// example) 0.5 at 500 words and 0.3 at 1000 words.
$score *= min(1, .01 + 3.5 / (2 + count($unique_tokens) * .015));
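The decay above can be sketched in Python to check how the factor actually falls off (the function name is ours). With these constants the factor stays at 1 up to 100 unique words, then decays to roughly 0.38 at 500 words and 0.22 at 1,000, slightly below the approximate values quoted in the comment:

```python
def decay_factor(unique_token_count):
    """Mirror of the PHP decay: min(1, .01 + 3.5 / (2 + count * .015))."""
    return min(1.0, 0.01 + 3.5 / (2 + unique_token_count * 0.015))

# No decay at all for the first 100 unique words, then a steady fall-off.
print(decay_factor(100))   # 1.0
print(decay_factor(500))   # ~0.38
print(decay_factor(1000))  # ~0.22
```

Note this decays based on position in the document, so it does nothing to limit the cumulative score of a single very frequent term.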

We're wondering if you'd support a feature along the lines of a 'Term frequency limit' processor (name to be confirmed), which would take its configuration as a set of word/count pairs.

It would remove tokens for a given word once its count had been exceeded.

For example, in this instance we would configure `document:10` and it would only consider the first 10 instances of 'document' when calculating the score.
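The core token-capping logic might look like this (a Python sketch of the idea only, not the actual Search API processor API; all names are ours):

```python
def cap_term_frequency(tokens, limits):
    """Drop occurrences of a word beyond its configured limit.

    tokens: list of word tokens extracted from an indexed item.
    limits: dict mapping word -> maximum occurrences to keep.
    """
    counts = {}
    kept = []
    for token in tokens:
        if token in limits:
            counts[token] = counts.get(token, 0) + 1
            if counts[token] > limits[token]:
                continue  # past the cap: drop this occurrence
        kept.append(token)
    return kept

# With `document:10`, only the first 10 of 50 occurrences survive,
# while unconfigured words are left untouched.
tokens = ['document'] * 50 + ['clinical', 'service']
capped = cap_term_frequency(tokens, {'document': 10})
```

A real processor would apply this while preprocessing index items, so the extra occurrences never contribute to the stored score.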

Proposed resolution

Add a processor plugin per the above.
The name 'term frequency' made sense because that's the term used in Solr parlance when calculating scores.

Remaining tasks

Decide if this feature would be acceptable for the module
If so, we're willing to work on development
If not, we'll likely build it as a custom module or separate module

Thanks for taking the time to consider this

Feature request
Status

Active

Version

1.0

Component

Database backend

Created by

🇦🇺Australia larowlan 🇦🇺🏝.au GMT+10


Comments & Activities

  • Issue created by @larowlan
  • 🇦🇹Austria drunken monkey Vienna, Austria

    Thanks for creating this issue!

    Your reasoning makes sense, having the same word lots of times in a single document doesn’t just linearly make it more “relevant” to a search for that word. That’s one of the many shortcomings of the current DB backend implementation, which of course often has to lag behind the functionality of a dedicated search engine like Solr. Solr, for instance, uses something called the “inverted document frequency” (idf) to penalize search terms that just appear in lots of documents when calculating their scores.

    In principle, I think something like this, or even in general mimicking Solr’s scoring more closely, wouldn’t even be that hard to implement using the database, if we’re prepared to accept the “cost” of added complexity for already pretty complex code. (Really, the only thing that’s been keeping this afloat is the excellent test coverage.) In general, though, I tend to rather suggest to people to just switch to Solr, or a similar dedicated search backend, if they want to improve their search experience. I don’t like to let this “simple” backend get too ambitious for the main reason that I’m the one who has to maintain it in the end, and I can’t do that properly when I don’t really know what it does anymore.

    Anyways, I realize that this isn’t very much on topic for this issue, which is a far more limited proposal. However, my point is that I do think such a small change, of having a term that appears lots of times in an item not just contribute linearly to that item’s score, would be something we could just implement in the database backend itself, without the need for another processor. It would also simplify the implementation a lot – see the attached patch as an example. (Not yet a finished patch, of course, but something like this would be the idea. The exact formula for how the influence of the term count on the term score should degrade would still need to be determined, probably through literature research or comparison to other OSS search engines.)

    Irrespective of where this is implemented, I think my approach would have two advantages over your proposed processor:

    1. I don’t think making this configuration dependent on the specific words makes much sense. After all, it seems the same problem would occur for almost any word that appears very often in some body text. You cannot effectively counteract such a problem with a fixed, exhaustive list of words for which it should be fixed.
    2. On the other hand, I don’t think just capping the token count at, for instance, ten occurrences is the best solution. After all, an item containing the word 20 times is probably more relevant than one only containing it ten times (all else being equal) – just not twice as relevant. I think degrading the contribution of each new occurrence, while still keeping it positive, would be the best approach. (On reflection, I don’t think my example patch actually ticks that “keeping them positive” box, as soon as you have tokens with different scores. Something to consider.)
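    The difference between the two points can be sketched in Python (the square root is just one possible degradation, borrowed from classic Lucene tf scoring, not a decided formula):

```python
import math

def capped_tf(freq, cap=10):
    """Hard cap, as in the original proposal: occurrences past the cap add nothing."""
    return min(freq, cap)

def sublinear_tf(freq):
    """Degrading contribution: each extra occurrence still helps, but ever less."""
    return math.sqrt(freq)

# 20 occurrences beats 10, but is nowhere near twice as relevant,
# whereas a hard cap makes 10 and 20 occurrences indistinguishable.
print(sublinear_tf(10), sublinear_tf(20))  # ~3.16, ~4.47
print(capped_tf(10), capped_tf(20))        # 10, 10
```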

    Of course, making such a change to the basic score calculation would probably need a bit more input and feedback from others, to make sure this is an overall positive change. (We might also want to make it optional, at least in some way, so we don’t break anyone’s site without warning or easy workaround.) Also, as said, some work coming up with a good formula.

    If that sounds like too much work, or you’re convinced of your approach, I can totally understand that. Then I suggest you just implement this in a custom or separate contributed module for now. If enough people declare interest in it here (or are using your module), we might still add it to the Search API module later, but right now it appears a bit too specific to be integrated. I don’t like newcomers to be overwhelmed when going to the Processors tab (if it’s not far too late for that), and this does sound like a processor that would be difficult to explain to beginners.

    If you are interested in working on this, though, that would be great, as we could potentially improve the search experience on thousands of sites, at least by a bit. In that case, please tell me what you think of my suggestions and what your thoughts are regarding the specific formula to use, whether to make this optional, etc.

  • 🇦🇺Australia larowlan 🇦🇺🏝.au GMT+10

    That looks much simpler.

    I'll give it a go and try to add some test coverage for it.

  • 🇦🇺Australia larowlan 🇦🇺🏝.au GMT+10

    In testing, this didn't help our scenario, as the word is not common across many documents, just across some documents.

    I'll have a go at my proposal and see if it works for our scenario, but it sounds like our scenario is uncommon.

    So feel free to repurpose this issue for an IDF implementation per your patch.

  • Test run: Core 10.1.x + Environment: PHP 8.2 & MySQL 8 — 527 pass, 1 fail (last update over 1 year ago)
  • Test run: Core 10.1.x + Environment: PHP 8.1 & MySQL 8 — 527 pass, 1 fail (last update over 1 year ago)
  • 🇦🇺Australia larowlan 🇦🇺🏝.au GMT+10

    FWIW I have a working version of my approach. Happy to share here or create a new issue if you want to use this for IDF.

    In the extreme case, there was one document with 164 instances of one of the search terms, so it was always shown first for any search that included that term.
