Improve linebreak handling for imported MD files

Created on 1 July 2024, 5 months ago
Updated 3 July 2024, 5 months ago

Problem/Motivation

When importing MD files, the contained line breaks would normally be OK in HTML. However, the text filter on project pages converts line breaks into <br> elements which results in bad formatting of project pages on d.o while the output looks as expected on GitLab.

Let's try to remove the unnecessary line breaks right after converting MD to HTML.

πŸ“Œ Task
Status

RTBC

Version

3.0

Component

Code

Created by

πŸ‡©πŸ‡ͺGermany jurgenhaas Gottmadingen

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

  • Issue created by @jurgenhaas
  • Merge request !265Update drupalorg.module β†’ (Merged) created by jurgenhaas
  • Status changed to Needs review 5 months ago
  • πŸ‡©πŸ‡ͺGermany jurgenhaas Gottmadingen

    MR!265 seems to be doing the job, please have a look.

  • Status changed to Needs work 5 months ago
  • πŸ‡ͺπŸ‡ΈSpain fjgarlin

    Thanks for giving it a go at this. I left a review in the MR.

  • Status changed to Needs review 5 months ago
  • πŸ‡©πŸ‡ͺGermany jurgenhaas Gottmadingen

    Oops, worked on that but not sure I caught it yet. Left a comment in the MR as well.

  • Status changed to Needs work 5 months ago
  • πŸ‡ͺπŸ‡ΈSpain fjgarlin

    Back to you with a suggestion.

  • Status changed to Needs review 5 months ago
  • πŸ‡©πŸ‡ͺGermany jurgenhaas Gottmadingen

    Good old D7 times ;-)

  • Status changed to Needs work 5 months ago
  • πŸ‡ͺπŸ‡ΈSpain fjgarlin

    Thanks for the changes.

    I just tested and it seems to be adding the wrapping htm/body tags.

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body>
    <p><strong>ECA is the n
    ...
    nauer</a></p>
    </body></html>
    
  • Status changed to Needs review 5 months ago
  • πŸ‡©πŸ‡ͺGermany jurgenhaas Gottmadingen

    Ah, those are a problem? Better to be on the safe side, agreed. And thanks for the link. I've tested the usage of those 2 options and they're doing what it's saying.

  • Status changed to Needs work 5 months ago
  • πŸ‡ͺπŸ‡ΈSpain fjgarlin

    Tested the code on:
    - https://fjgarlin-drupal.dev.devdrupal.org/project/eca (reading from https://git.drupalcode.org/project/eca/-/blob/2.0.x/README.md)
    - https://fjgarlin-drupal.dev.devdrupal.org/project/config_notify (reading from https://git.drupalcode.org/project/config_notify)

    See the second one:

    <h2>CONTENTS OF THIS FILE<ul>
    <li>Introduction</li>
    ...
    

    and

    If I try on the current www.drupal.org for that second module, I get valid markup:

    <h2>CONTENTS OF THIS FILE</h2>
    <ul>
    <li>Introduction</li>
    

    So, whatever it is, it is introduced by this MR. Maybe the regex or maybe something else, not sure.
    If you want to test, this is the raw content of the file: https://git.drupalcode.org/project/config_notify/-/raw/8.x-1.x/README.md...

  • πŸ‡©πŸ‡ͺGermany jurgenhaas Gottmadingen

    That's strange, I ran this locally and received the expected output:

    <h2><a href="#contents-of-this-file" aria-hidden="true" class="anchor" id="user-content-contents-of-this-file"></a>CONTENTS OF THIS FILE</h2><ul><li>Introduction</li>.....
    

    Not sure where this went wrong.

    The local script looks like this (3rd line starts the extra code):

    // Html entities appearing in the markup are encoded.
    $html = html_entity_decode($html);
    // Remove data-sourcepos attributes.
    $html = preg_replace('/(<[^>]+) data-sourcepos=".*?"/si', '$1', $html);
    
    // Remove whitespace between tags.
    $html = preg_replace('~>\s+<~', '><', $html);
    
    // This code block replaces remaining linebreaks with spaces, except within
    // code or pre elements, where we need to keep the linebreaks.
    $html = str_replace("\n", '###TEMP###', $html);
    $document = new \DomDocument('1.0', 'UTF-8');
    $document->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    foreach (['code', 'pre'] as $tagName) {
      /** @var \DOMElement $element */
      foreach ($document->getElementsByTagName($tagName) as $element) {
        $element->textContent = str_replace('###TEMP###', "\n", $element->textContent);
      }
    }
    $html = $document->saveHTML($document);
    $html = str_replace('###TEMP###', ' ', $html);
    

    Don't know how to reproduce this. Have I missed something og the code in the MR? I don't see any differences, do you?

  • πŸ‡ͺπŸ‡ΈSpain fjgarlin

    My code with debugging:

      ...
      $html = str_replace("\n", '###TEMP###', $html);
      var_dump($html);
      $document = new \DomDocument('1.0', 'UTF-8');
      $document->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
      foreach (['code', 'pre'] as $tagName) {
        /** @var \DOMElement $element */
        foreach ($document->getElementsByTagName($tagName) as $element) {
          $element->textContent = str_replace('###TEMP###', "\n", $element->textContent);
        }
      }
      $html = $document->saveHTML($document);
      var_dump($html);
    

    Results:

    /var/www/dev/fjgarlin-drupal.dev.devdrupal.org/htdocs/sites/all/modules/drupalorg/drupalorg/drupalorg.module:3849:string '<h2>CONTENTS OF THIS FILE</h2><ul><li>Introduction</li><li>Requirements</li><li>Installation</li><li>Configuration</li><li>Maintainers</li></ul><h2>INTRODUCTION</h2><p>The module lets you trigger notifications depending on the current status of###TEMP###your configuration in production.</p><p>Notifications can be triggered via cron or instantly. Slack and email are###TEMP###currently supported.</p><ul><li><p>For a full description of the module visit:###TEMP###<a href="https://www.drupal.org/project/config_'... (length=1789)
    
    /var/www/dev/fjgarlin-drupal.dev.devdrupal.org/htdocs/sites/all/modules/drupalorg/drupalorg/drupalorg.module:3859:string '<h2>CONTENTS OF THIS FILE<ul>
    <li>Introduction</li>
    <li>Requirements</li>
    

    As we can see, the closing h2 disappears somehow in between loading and saving the HTML.

    I even went ahead and tested the following:

    var_dump($html);
      $document = new \DomDocument('1.0', 'UTF-8');
      $document->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
      $html = $document->saveHTML($document);
      var_dump($html);
    

    And the output is identical to the above one, without the closing h2.
    I suspect that it might be due to the PHP version we're running on πŸ™ˆ (better don't ask which).

    I'm not sure about what to try now...

  • πŸ‡©πŸ‡ͺGermany jurgenhaas Gottmadingen

    Looks like your test code is missing the first instruction:

    // Remove whitespace between tags.
    $html = preg_replace('~>\s+<~', '><', $html);
    

    Maybe that's the reason?

  • πŸ‡ͺπŸ‡ΈSpain fjgarlin

    No, it's there. I just removed the part that wasn't relevant with the ...

    I just added the relevant part where we still have the closing "h2" and then we don't anymore.

  • πŸ‡ͺπŸ‡ΈSpain fjgarlin

    Oh wait, I just saw where the closing "h2" is...

    ...
    <h2>MAINTAINERS</h2>
    <ul>
    <li>Fran Garcia-Linares (fjgarlin) - <a href="https://www.drupal.org/u/fjgarlin">https://www.drupal.org/u/fjgarlin</a>
    </li>
    </ul>
    </h2>
    

    I think I know why it's happening. Let me do a quick test.

  • πŸ‡ͺπŸ‡ΈSpain fjgarlin

    I made a suggestion in the MR that fixes the issue and explains why it was happening.

    Code tested here: https://fjgarlin-drupal.dev.devdrupal.org/project/config_notify

    I think once that change is in place it might be good for RTBC.

  • Status changed to Needs review 5 months ago
  • πŸ‡©πŸ‡ͺGermany jurgenhaas Gottmadingen

    I've applied your suggestion, this is genius.

  • Status changed to RTBC 5 months ago
  • πŸ‡ͺπŸ‡ΈSpain fjgarlin

    Reviewed the MR and tested in "eca", "api" and "config_notify" modules and it all looks good.
    Also, this feels pretty safe as it's just working on the suggested markup, but not saving anything.

    I'm marking it RTBC. Thanks so much for the work here!

  • πŸ‡©πŸ‡°Denmark ressa Copenhagen

    So great how fast you got this fixed, thanks! I was first thinking that this was for Drupal core, but then saw it's for he drupalorg module ...

    Do you know if a similar issue exists for Drupal core? I ask because I am importing Markdown content with Migrate module, and the line breaks are stripped during import, so lists break down ... My work-around is to insert an extra line break, for each line, which is not ideal ... I am using the Markdown Easy module for Markdown support in CKEditor.

    Here's the migration, if I don't have the extra line breaks, all the text is on one line, and the Markdown breaks down.

    id: sub_nodes
    label: Subject migration, based on github.com/dinarcon/ud_migrations
    source:
      plugin: embedded_data
      data_rows:
        -
          nid: 7021
          title: 'How to use Drupal'
          body: 'you can use Drupal for web site building.
    
          Here are some benefits:
    
          * open source
    
          * Big community
    
          * A third reason
    
          So as you can see, there are many reasons to use Drupal'
      ids:
        nid:
          type: integer
    process:
      nid: nid
      title: title
      body/value: body
      body/format:
        plugin: default_value
        default_value: markdown
    
    destination:
      plugin: entity:node
      default_bundle: article
    migration_dependencies:
      optional: []
    

    I would hope that this could work, but it doesn't:

          body: 'you can use Drupal for web site building.
          Here are some benefits:
    
          * open source
          * Big community
          * A third reason
    
          So as you can see, there are many reasons to use Drupal'

    I'll create a Drupal core issue, but thought I'd ask you guys first, in case you perhaps have some ideas, or maybe seen it before ...

  • πŸ‡ͺπŸ‡ΈSpain fjgarlin

    What does the format "markdown" look for you? Do you perhaps have the Convert line breaks into HTML (i.e. <br> and <p>) filter enabled in your text format? Any other filters or processors linked to the "markdown" format?

  • Pipeline finished with Skipped
    5 months ago
    #213297
  • First commit to issue fork.
  • πŸ‡©πŸ‡°Denmark ressa Copenhagen

    Thanks for the suggestion @jgarlin, it could have been the solution ... sadly, it made no difference, after I removed both filters Limit allowed HTML tags and correct faulty HTML and Convert line breaks into HTML (i.e. <br> and <p>). It may be Migrate or Migrate Plus then? I'll create an issue in one those projects, and hope for the best :-)

  • πŸ‡ΊπŸ‡ΈUnited States drumm NY, US

    Looks good, deployed. Thanks!

  • πŸ‡©πŸ‡ͺGermany jurgenhaas Gottmadingen

    Thank you so much @fjgarlin for working together on this and thanks @drumm for merging and deploying it. I've tested it on the ECA module page and it makes a huge difference. The description looks so much better now.

  • πŸ‡©πŸ‡°Denmark ressa Copenhagen

    Great!

    I created πŸ› Preserve line breaks, to support importing Markdown format Active for my Migrate / Markdown challenge.

Production build 0.71.5 2024