OMDocumentFragment::appendXML(): Entity: line 1: parser error : EntityRef: expecting ';' in Drupal\glossify\GlossifyBase->parseTooltipMatch()

Created on 30 March 2023, almost 2 years ago

Hi,
I found an issue related with special characters.

When the user have a taxonomy like "I&O" and have the option:
" First match only Match and link only the first occurance per field." selected configurated at Text formats and editors.

Error Message:
Warning: DOMDocumentFragment::appendXML(): Entity: line 1: parser error : EntityRef: expecting ';' in Drupal\glossify\GlossifyBase->parseTooltipMatch() (line 147 of /modules/contrib/glossify/src/GlossifyBase.php)

πŸ› Bug report
Status

Active

Version

2.0

Component

Code

Created by

πŸ‡΅πŸ‡ΉPortugal Emanuel Moreira

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

  • Issue created by @Emanuel Moreira
  • πŸ‡΅πŸ‡ΉPortugal Emanuel Moreira

    I created a patch to convert special characters to HTML entities.

  • Status changed to Needs review over 1 year ago
  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 9.5.x + Environment: PHP 8.1 & MariaDB 10.3.22
    last update over 1 year ago
    Patch Failed to Apply
  • πŸ‡©πŸ‡ͺGermany Anybody Porta Westfalica

    Patch #2 has security implications, so this should be checked with care.
    Thank you for the patch so far! :)

  • πŸ‡©πŸ‡ͺGermany Anybody Porta Westfalica

    It seems like this can be reproduced by creating a Glossify term containing an ampersand ("&"). And then running the replacement.

    Could someone please try to reproduce this and then write a test?
    Maybe also for other special characters?

  • Status changed to Needs work 5 months ago
  • πŸ‡©πŸ‡ͺGermany Anybody Porta Westfalica

    Okay I at least found a way to fix the errors, but I think the replacement may not work propely in all cases now, but that's less important.

    Hotfix attached in MR. For further implementation, this should first have tests.

  • Merge request !26Escape term titles β†’ (Open) created by Anybody
  • πŸ‡©πŸ‡ͺGermany Anybody Porta Westfalica

    Okay I at least found a way to fix the errors, but I think the replacement may not work propely in all cases now, but that's less important.

    Hotfix attached in MR. For further implementation, this should first have tests.

  • Merge request !27Only escape array keys and keep the value β†’ (Merged) created by Anybody
  • Status changed to Needs review 5 months ago
  • πŸ‡©πŸ‡ͺGermany Anybody Porta Westfalica

    Green tests are not enough here due to missing test coverage, but I think if the hotfix from MR!27 works, we should merge it already to improve the risk for people running into this.

    We should keep this major and add tests next to reproduce it.

    In my case it happened on taxonomy terms with the title "Lorem & ipsum". It resulted in user session being discarded due to the errors.

  • Pipeline finished with Skipped
    5 months ago
    #273742
    • anybody β†’ committed 459f3f19 on 3.x
      Issue #3351345 by anybody, Emanuel Moreira: DOMDocumentFragment::...
  • πŸ‡©πŸ‡ͺGermany Anybody Porta Westfalica

    Merged the hotfix! See #9 for further steps.

  • Assigned to Grevil
  • Status changed to Needs work 5 months ago
  • πŸ‡©πŸ‡ͺGermany Anybody Porta Westfalica
  • πŸ‡©πŸ‡ͺGermany Anybody Porta Westfalica

    I started with https://git.drupalcode.org/issue/glossify-3351345/-/tree/3351345-test-only

    that needs to be completed, but even then it's not yet sure it will fail (but it's definitely a good test to add).
    But for the test-only we'll need to revert the fix there!

  • Has this been resolved? Is there a patch one can apply?

  • πŸ‡ΊπŸ‡ΈUnited States bkosborne New Jersey, USA

    I just experienced this too. It can cause text to be completely left out in a scenario where you have two glossify terms in one text segment, and the second term has an ampersand in it. All the text after the first term is not output at all :(

    I'll look at the code, though it does seem pretty complicated.

  • πŸ‡ΊπŸ‡ΈUnited States bkosborne New Jersey, USA

    So, on 3.0.3, with the hotfix from this issue, I have problems with parsing this scenario:

    • Glossify term "testing"
    • Glossify term "PB&J"
    • Text that reads "I am testing a sentence with PB&J in it"

    I get output "I am testing". The rest is cut off entirely.

    When I downgrade to 3.0.2, everything works as expected, and I don't get the original PHP warning that this issue posted about. I'm on PHP 8.1.x if that matters at all?

    Can we have clearer steps to reproduce the problem? As it stands now, the hotfix in 3.0.3 didn't fix anything for my situation, but broke it. Hmm.

  • πŸ‡ΊπŸ‡ΈUnited States bkosborne New Jersey, USA

    bkosborne β†’ changed the visibility of the branch 3351345-omdocumentfragmentappendxml-entity-line to hidden.

  • πŸ‡ΊπŸ‡ΈUnited States bkosborne New Jersey, USA

    I created a new MR that reverts the previous hotfix and implements what I think is simpler solution. The problem is that we're using ->appendXML to attach HTML to the DOM Document. This isn't really correct, because what's valid in HTML may not be valid in XML. That's why we're running into issues with &, which is perfectly valid in HTML but it must be escaped to & in XML.

    I think the real solution here is to find a way to properly attach HTML to the DOM Document instead of using ->appendXML. Short term, we can cover most use cases by simply escaping the invalid XML characters. I say this is short term because we're using ->appendXML to append the rendered tooltip, and I don't think we can really mess with the rendered output by escaping it.

    Note that for the test, I fixed the mock render code to actually perform a simple version of twig autoescaping.

  • πŸ‡ΊπŸ‡ΈUnited States bkosborne New Jersey, USA
  • πŸ‡ΊπŸ‡ΈUnited States bkosborne New Jersey, USA
  • Merge request !31Fixes for & character in terms β†’ (Open) created by bkosborne
  • Pipeline finished with Success
    12 days ago
    Total: 141s
    #399053
  • Hi, I tested with a special character like "O&O," and noticed that the output is incorrect. Some text is entirely missing. The issue occurs because the & in "O&O" is treated as an unescaped entity in XML. Merge Request (MR) #31 resolves this issue for me.

  • πŸ‡©πŸ‡ͺGermany Anybody Porta Westfalica

    @kushagra.goyal could you please add another test for that in MR!31? Thanks!

    @bkosborne which other tests might be helpful to ensure we don't break something in this complexity?

  • πŸ‡ΊπŸ‡ΈUnited States bkosborne New Jersey, USA

    The test I added in MR !31 already covers what kushagra.goyal ran into.

    I'm not really sure what other tests could be added. I think that if people currently have an & in their glossary term, things are already broken for them, and this should fix it. They may not even know it's broken if they're not closely paying attention to their output.

  • πŸ‡©πŸ‡ͺGermany Anybody Porta Westfalica

    @bkosborne thanks for your feedback! So what would you say, how far are we away from RTBC?
    The tests are green, so if the tests are fine, it should be fine too, but yeah this so complex, so I'm still afraid to merge it, even if I tested it manually.

  • Issue was unassigned.
  • Status changed to Needs review 1 day ago
  • πŸ‡©πŸ‡ͺGermany Anybody Porta Westfalica
  • πŸ‡©πŸ‡ͺGermany Anybody Porta Westfalica
  • πŸ‡©πŸ‡ͺGermany Grevil

    One remaining question, otherwise this LGTM!

Production build 0.71.5 2024