Implement auto-tagging of glossary entries, without DFN

Created on 13 June 2023, about 2 years ago

Updated 4 July 2023, about 2 years ago

Problem/Motivation

Webmasters using this module regularly ask for the ability to automatically tag content matching glossary entries,
without having to resort to manual tagging using <dfn> elements.

This has a number of issues:

a naive multi-pattern match, as performed by str_replace with array arguments, has O(m^2) complexity in the number of entries m, assuming a constant-length source message.
a replacement based on preg_replace introduces the issues related to generating regexps from plain text (escaping of metacharacters), as well as possible unidentified issues with regex performance and compiled size limits on large glossaries
auto-tagging conflicts may happen when multiple matches can be found at the same position. Consider for example a text containing the string "ASCII", and a glossary referencing both the ASCI supercomputers and the ASCII standard: what should be tagged? The shortest or the longest match (ungreedy/greedy)?
as glossaries increase their number of entries, the likelihood of false positives increases. Consider for example a glossary containing an entry for the legacy HP Machine experimental architecture: it would irrelevantly tag that entry on any definition containing "machine", which is fairly common without being a typical stop word.
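
The same-position conflict can be illustrated with a minimal sketch in Python. It imitates a sequential multi-replace of the kind str_replace performs with array arguments; the function name `naive_tag` and the use of <dfn> as the wrapper are assumptions made for illustration only, not the module's code.

```python
def naive_tag(text, entries):
    """Hypothetical naive tagger: one full pass over the text per entry."""
    for entry in entries:
        # Each pass also sees the markup inserted by the previous passes.
        text = text.replace(entry, f"<dfn>{entry}</dfn>")
    return text


# Shorter entry first: the longer entry can no longer match.
print(naive_tag("The ASCII standard", ["ASCI", "ASCII"]))
# The <dfn>ASCI</dfn>I standard

# Longer entry first: the shorter entry is tagged inside the first tag.
print(naive_tag("The ASCII standard", ["ASCII", "ASCI"]))
# The <dfn><dfn>ASCI</dfn>I</dfn> standard
```

In both orders the result is wrong, which is why the order of entries alone cannot resolve the conflict.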

Steps to reproduce

n.a.

Proposed resolution

implement auto-tagging, using an Aho-Corasick matcher for linear complexity
support a list of stop words: define whether this should be
- a specific field on G2 entries tagging them as not eligible for automatic tagging (needs a report page listing those entries)
- a configuration list (needs a form reminder on the node edit forms and/or admin view of these entries)
maintain the availability of manual tagging even with auto-tagging enabled, to support manual tagging of stop words
on multiple matches starting at the same position, auto-tag the longest match, which mechanically has the lower likelihood of being a false positive in most cases. Authors can still manually tag the short match to override this.
add the extra reporting needed for this
indirectly related: consider providing a CKEditor plugin for such manual tagging
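
As a rough sketch of how the matching part of this proposal could fit together, the Python below builds an Aho-Corasick automaton, scans the text once, keeps the longest match at each start position, and skips entries on a stop list. All names (`build_automaton`, `find_matches`, `auto_tag`, `stop_words`) are hypothetical, and the sketch assumes case-sensitive matching with no word-boundary handling; it is not the module's implementation.

```python
from collections import deque


def build_automaton(entries):
    """Build goto transitions, failure links and outputs for the entries."""
    goto, fail, out = [{}], [0], [[]]
    for entry in entries:
        state = 0
        for ch in entry:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(entry)
    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the entries that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_matches(text, automaton):
    """Scan the text once, returning (start, entry) for every occurrence."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for entry in out[state]:
            matches.append((i - len(entry) + 1, entry))
    return matches


def auto_tag(text, entries, stop_words=()):
    """Return non-overlapping (start, entry) pairs, longest match first."""
    eligible = [e for e in entries if e and e not in stop_words]
    longest = {}
    for start, entry in find_matches(text, build_automaton(eligible)):
        if len(entry) > len(longest.get(start, "")):
            longest[start] = entry
    result, end = [], 0
    for start in sorted(longest):
        if start >= end:  # skip matches overlapping an earlier one
            result.append((start, longest[start]))
            end = start + len(longest[start])
    return result


print(auto_tag("The ASCII standard", ["ASCI", "ASCII"]))
# [(4, 'ASCII')]
print(auto_tag("A machine reads ASCII", ["machine", "ASCII"], stop_words={"machine"}))
# [(16, 'ASCII')]
```

The scan visits each character of the text once, plus the work of reporting matches, whatever the number of glossary entries. Manual tagging of a stop word or of a shorter match would be handled outside this sketch.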

Remaining tasks

All of it.

User interface changes

new reporting and/or configuration page or details in an existing page
possibly a checkbox field for the stop list on G2 entry edit forms
<dfn> elements no longer being the only proof of an existing match

API changes

n.a.

Data model changes

New configuration keys or content around stop words and auto-tagging activation.

✨ Feature request

Status

Fixed

Version

1.0

Component

Code

Created by

🇫🇷France fgm Paris, France


Comments & Activities

Issue created by @fgm
Comment about 2 years ago →
🇫🇷France fgm Paris, France
Status changed to Fixed about 2 years ago, 8:43pm 4 July 2023
Comment about 2 years ago →
🇫🇷France fgm Paris, France
This is actually just more detail on an ancient issue fixed in alpha2: "automatically detect dictionary terms in nodes" (Fixed).
Comment about 2 years ago →
🇫🇷France fgm Paris, France
Comment about 2 years ago →
System Message
Automatically closed - issue fixed for 2 weeks with no activity.
