Implement auto-tagging of glossary entries, without DFN

Created on 13 June 2023, over 1 year ago
Updated 4 July 2023, over 1 year ago

Problem/Motivation

A recurring request from webmasters using this module is the ability to automatically tag content matching glossary entries,
without having to resort to manual tagging with <dfn> elements.

Doing so raises a number of issues:

  • a naive multimatch, as performed by str_replace in array format, has O(m^2) complexity in the number of entries m, assuming a constant-length source message.
  • a replacement based on preg_replace brings the issues involved in generating regexps from plain text, as well as possible unidentified issues with regex performance and compiled-size limits on large glossaries.
  • auto-tagging conflicts may happen when multiple matches can be found at the same position. Consider for example a text containing the string "ASCII", and a glossary referencing both the ASCI supercomputers and the ASCII standard: what should be tagged? The shortest or the longest match (ungreedy/greedy)?
  • as glossaries grow, the likelihood of false positives increases. Consider for example a glossary containing an entry for the legacy HP Machine experimental architecture: auto-tagging would irrelevantly match that entry in any definition containing "machine", a word which is fairly common without being a typical stop word.
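
To make the regexp and conflict issues above concrete, here is a minimal sketch. It is written in Python purely as an illustration and is not the module's PHP code. The names build_pattern and tag, the "C++" term, and the [[...]] marker standing in for whatever markup auto-tagging would emit are all assumptions. Word boundaries and case folding are deliberately ignored.

```python
import re

# Hypothetical glossary terms: "ASCI" and "ASCII" come from the example
# above, "C++" is an assumed entry containing regexp metacharacters.
TERMS = ["ASCI", "ASCII", "C++"]


def build_pattern(terms, longest_first=True):
    """Build one alternation regexp from plain-text glossary terms."""
    if longest_first:
        # Alternation is tried left to right, so putting longer terms
        # first makes "ASCII" win over its prefix "ASCI".
        terms = sorted(terms, key=len, reverse=True)
    # re.escape makes characters such as "+" match literally.
    return re.compile("|".join(re.escape(term) for term in terms))


def tag(text, terms, longest_first=True):
    """Wrap every match in a placeholder marker."""
    pattern = build_pattern(terms, longest_first)
    return pattern.sub(lambda m: "[[" + m.group(0) + "]]", text)


if __name__ == "__main__":
    print(tag("ASCII and C++", TERMS))
    # With terms kept in glossary order, the shorter entry matches first.
    print(tag("ASCII", ["ASCI", "ASCII"], longest_first=False))
```

Even with escaping and ordering handled, the pattern still grows with the glossary, which is where the regexp performance and compiled-size concerns come from.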

Steps to reproduce

n.a.

Proposed resolution

  • implement auto-tagging, using an Aho-Corasick matcher for linear complexity
  • support a list of stop words; define whether this should be:
    • a specific field on G2 entries tagging them as not eligible for automatic tagging (needs a report page listing those entries)
    • a configuration list (needs a form reminder on the node edit forms and/or admin view of these entries)
  • maintain the availability of manual tagging even with auto-tagging enabled, to support manual tagging of stop words
  • on multiple matches starting at the same position, auto-tag the longest match, which mechanically has the lower likelihood of being a false positive in most cases. Authors can still manually tag the short match to override this.
  • add the extra reporting needed for this
  • indirectly related: consider providing a CKEditor plugin for such manual tagging
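
The matching part of this proposal could be sketched as follows. This is a minimal illustration in Python, not the module's PHP implementation: the function names, the stop_words parameter modeled as a plain collection, and the [[...]] marker standing in for the generated markup are all assumptions. It is case-sensitive, ignores word boundaries, and does not skip existing markup such as manually placed <dfn> elements.

```python
from collections import deque


def build_automaton(terms):
    """Build Aho-Corasick tables: transitions, failure links, outputs."""
    goto, fail, out = [{}], [0], [[]]
    for term in terms:
        node = 0
        for ch in term:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(len(term))  # remember the length of the match
    # Breadth-first pass: depth-1 nodes keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit the terms that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_matches(text, automaton):
    """Return (start, end) spans of every term occurrence in text."""
    goto, fail, out = automaton
    node, matches = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for length in out[node]:
            matches.append((i - length + 1, i + 1))
    return matches


def select_longest(matches):
    """Keep the longest match per start position, dropping overlaps."""
    chosen, last_end = [], 0
    for start, end in sorted(matches, key=lambda m: (m[0], -m[1])):
        if start >= last_end:
            chosen.append((start, end))
            last_end = end
    return chosen


def auto_tag(text, terms, stop_words=()):
    """Tag eligible terms in text with a placeholder [[...]] marker."""
    eligible = [t for t in terms if t and t not in stop_words]
    spans = select_longest(find_matches(text, build_automaton(eligible)))
    parts, pos = [], 0
    for start, end in spans:
        parts.append(text[pos:start])
        parts.append("[[" + text[start:end] + "]]")
        pos = end
    parts.append(text[pos:])
    return "".join(parts)


if __name__ == "__main__":
    glossary = ["ASCI", "ASCII", "machine"]
    print(auto_tag("ASCII on a machine", glossary, stop_words={"machine"}))
```

The scan itself is a single pass over the text. The selection step sorts the matches it found, so in this sketch only the matching phase is linear. Filtering stop words before building the automaton corresponds to either of the two stop-list options, since both end up producing a set of entries excluded from automatic tagging.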

Remaining tasks

All of it.

User interface changes

  • new reporting and/or configuration page or details in an existing page
  • possibly a checkbox field for the stop list on G2 entry edit forms
  • <dfn> elements no longer being the only proof of an existing match

API changes

n.a.

Data model changes

New configuration keys or content around stop words and auto-tagging activation.

Feature request
Status

Fixed

Version

1.0

Component

Code

Created by

fgm (Paris, France)
