Pasting from Google Docs doesn't preserve some formatting

Created on 17 November 2023, about 1 year ago
Updated 1 July 2024, 6 months ago

Problem/Motivation

When pasting from a Google Doc, tags that are nested more than two deep are removed. This is not an issue with content copied from Word.

Steps to reproduce


1. Create a new Google Doc with a paragraph that is bold and has one italic word, such as:
This is a new paragraph that is bold.
2. Copy the paragraph
3. Paste the rich content into CKEditor 5 that has the Paste filter enabled with only the following filter/replacement pattern:
Search expression
(<[^>]*) (style="[^"]*")
Replacement
$1
4. The expected outcome would be bold and italic text.

The filter used above is a default, and all other filters have been disabled.

Markup samples

This is a new paragraph that is bold.

Markup result (pasting without filtering)

<p style="line-height:1.38;margin-bottom:0pt;margin-top:0pt;" dir="ltr">
    <span style="background-color:transparent;color:#000000;font-family:Arial,sans-serif;font-size:11pt;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;"><strong>This is a new&nbsp;</strong></span><em><span style="background-color:transparent;color:#000000;font-family:Arial,sans-serif;font-size:11pt;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;"><strong>paragraph</strong></span></em><span style="background-color:transparent;color:#000000;font-family:Arial,sans-serif;font-size:11pt;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre-wrap;"><strong> that is bold.</strong></span>
</p>

Markup result (pasting with filtering)

<!-- Test with "Filter pasted content" in the text format configuration checked -->
<p dir="ltr">
    <span>This is a new&nbsp;paragraph that is bold.</span>
</p>

Expected markup result

<p dir="ltr">
    <span><strong>This is a new&nbsp;</strong></span><em><span><strong>paragraph</strong></span></em><span><strong> that is bold.</strong></span>
</p>

Proposed resolution

Remaining tasks

User interface changes

API changes

Data model changes

🐛 Bug report
Status

Closed: won't fix

Version

1.0

Component

Code

Created by

🇺🇸 United States GBlicharz


Comments & Activities

  • Issue created by @GBlicharz
  • Status changed to Closed: won't fix about 1 year ago
  • 🇨🇦 Canada star-szr

    Thank you for the detailed bug report and for making use of the issue summary template!

    I was able to reproduce this behaviour. The root cause is how Google Docs sends its content to the pasteboard, and I don't think much can be done about it within the scope of this module. You may be able to get it to work as you expect with some creative custom filters that preserve more of the formatting based on the original markup coming out of Google Docs, or by creating a custom CKEditor 5 plugin.

    In short, our module and plugin are working as expected, but CKEditor 5 has its own magic that makes it seem like it should work differently. I'll break down the details below.

    Google Docs uses <span> tags with style attributes to convey all its formatting, rather than semantic tags (<strong>/<b> or <em>/<i>) as we might expect.

    If we leave the span tags and attributes as-is by not filtering them out (your "without filtering" example), CKEditor 5 sees the span tags with specific style attributes from Google Docs and turns them into the corresponding semantic tags. For example, font-style: italic should be converted to <em>. Since our default style removal filter removes all the style attributes before this transformation happens, CKEditor 5 no longer has the data it needs to create the semantic tags and therefore does not preserve the formatting.
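
As a rough illustration of the ordering problem described above, the sketch below applies the default search expression from the issue summary to a shortened Google Docs fragment, then repeats the run with two extra patterns applied first. The `g` flag, the function names, and the `BOLD_SPAN`/`ITALIC_SPAN` patterns are assumptions made for this example; they are not filters shipped with the module.

```javascript
// The default search expression from the issue; the "g" flag is an assumption.
const STYLE_FILTER = /(<[^>]*) (style="[^"]*")/g;

// Hypothetical pre-filters (not part of the module): convert bold and italic
// spans to semantic tags while the style attribute is still present.
const BOLD_SPAN = /<span[^>]*font-weight:\s*(?:700|bold)[^>]*>(.*?)<\/span>/g;
const ITALIC_SPAN = /<span[^>]*font-style:\s*italic[^>]*>(.*?)<\/span>/g;

// Strip every style attribute, as the default filter does.
function defaultFilter(html) {
  return html.replace(STYLE_FILTER, "$1");
}

// Convert styled spans first, then strip whatever style attributes remain.
function preserveThenFilter(html) {
  const semantic = html
    .replace(BOLD_SPAN, "<strong>$1</strong>")
    .replace(ITALIC_SPAN, "<em>$1</em>");
  return defaultFilter(semantic);
}

const pasted =
  '<span style="font-weight:400;">plain </span>' +
  '<span style="font-weight:700;">bold</span>';

console.log(defaultFilter(pasted));
// → <span>plain </span><span>bold</span>
console.log(preserveThenFilter(pasted));
// → <span>plain </span><strong>bold</strong>
```

Once the style attribute is gone, nothing distinguishes the bold span from the plain one. The regex-based pre-filters are fragile: they do not handle nested spans, and a span that is both bold and italic would only become `<strong>`.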

    The example I'm testing with uses the following text in a Google doc:

    This is a new paragraph that has bold and italic text.

    Below is the markup as seen at the ClipboardPipeline#inputTransformation event, which is the same event our module acts on to filter the pasted content. It has been run through prettier to make it a bit easier to read.

    <p
      style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt"
      dir="ltr"
      id="docs-internal-guid-21dbe10f-7fff-36cd-034d-5b586bd69b9f"
    >
      <span
        style="
          background-color: transparent;
          color: #000000;
          font-family: Arial, sans-serif;
          font-size: 11pt;
          font-style: normal;
          font-variant: normal;
          font-weight: 400;
          text-decoration: none;
          vertical-align: baseline;
          white-space: pre-wrap;
        "
        >This is a new paragraph that has&nbsp;</span
      ><span
        style="
          background-color: transparent;
          color: #000000;
          font-family: Arial, sans-serif;
          font-size: 11pt;
          font-style: normal;
          font-variant: normal;
          font-weight: 700;
          text-decoration: none;
          vertical-align: baseline;
          white-space: pre-wrap;
        "
        >bold</span
      ><span
        style="
          background-color: transparent;
          color: #000000;
          font-family: Arial, sans-serif;
          font-size: 11pt;
          font-style: normal;
          font-variant: normal;
          font-weight: 400;
          text-decoration: none;
          vertical-align: baseline;
          white-space: pre-wrap;
        "
      >
        and&nbsp;</span
      ><span
        style="
          background-color: transparent;
          color: #000000;
          font-family: Arial, sans-serif;
          font-size: 11pt;
          font-style: italic;
          font-variant: normal;
          font-weight: 400;
          text-decoration: none;
          vertical-align: baseline;
          white-space: pre-wrap;
        "
        >italic</span
      ><span
        style="
          background-color: transparent;
          color: #000000;
          font-family: Arial, sans-serif;
          font-size: 11pt;
          font-style: normal;
          font-variant: normal;
          font-weight: 400;
          text-decoration: none;
          vertical-align: baseline;
          white-space: pre-wrap;
        "
      >
        text.</span
      >
    </p>
    
  • 🇨🇴 Colombia jedihe

    I got GPT4 to generate a simple POC that uses proper HTML parsing in order to preserve bold/italics as strong, em tags:

    function processHtml(html) {
        // Parse the HTML string into a separate document using DOMParser
        var parser = new DOMParser();
        var doc = parser.parseFromString(html, "text/html");

        // Loop through every span element in the document
        doc.querySelectorAll("span").forEach(function(span) {
            var style = span.getAttribute("style") || "";

            // Allow optional whitespace around the colon, so both
            // "font-weight:700" and "font-weight: 700" are recognised
            var isBold = /font-weight\s*:\s*(700|bold)/i.test(style);
            var isItalic = /font-style\s*:\s*italic/i.test(style);

            // Move the span's children into a fragment so they can be re-wrapped
            var content = doc.createDocumentFragment();
            while (span.firstChild) {
                content.appendChild(span.firstChild);
            }

            // Wrap in <em> and/or <strong>; a span that is both bold and italic gets both
            if (isItalic) {
                var em = doc.createElement("em");
                em.appendChild(content);
                content = em;
            }
            if (isBold) {
                var strong = doc.createElement("strong");
                strong.appendChild(content);
                content = strong;
            }

            // Replace the span with the wrapped content, or with its bare children
            // when the span carries no bold or italic styling
            span.parentNode.replaceChild(content, span);
        });

        // Return the processed HTML string
        return doc.body.innerHTML;
    }
    
    

    Thinking about security implications:
    - GPT4 said DOMParser doesn't execute script tags (verification needed).
    - I assume CKE5 will perform some final cleanup before inserting the HTML.

    Prompt:

    Starting from a string containing html like:

    (paste HTML that CKE5 gets during the paste action)

    How can I process using native browser APIs so that:
    - span tags with font-weight: 700/bold get converted to <strong> (with proper closing)
    - span tags with font-style: italic get converted to <em> (with proper closing)
    - Any other span tags get fully removed from the markup

  • 🇨🇦 Canada star-szr

    Thanks for sharing. Using a custom CKEditor 5 plugin that executes before our paste filter plugin is a viable approach (or just your custom plugin if you don't need any other paste filtering, no judgment!).

    In a perfect world I would love to use the DOM instead of regular expressions for all of our paste filtering, but maintaining the same level of customization we currently have with the UI seems nearly impossible, and it would also make this module significantly more complex.

    For specific use cases like this, at this time I am recommending that folks create a custom CKEditor 5 plugin.
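
A minimal sketch of how such a custom plugin might hook into the paste flow, assuming the `ClipboardPipeline#inputTransformation` event mentioned earlier in this thread. The function name `registerPasteTransform`, the `transform` parameter, and the `"highest"` priority are assumptions for illustration; whether that priority actually runs ahead of the paste filter plugin would need to be checked against the module's own listener.

```javascript
// Sketch only: `transform` is any string-to-string function, for example a
// DOM-based converter like the processHtml() POC earlier in this thread.
function registerPasteTransform(editor, transform) {
  const pipeline = editor.plugins.get("ClipboardPipeline");

  pipeline.on(
    "inputTransformation",
    (evt, data) => {
      // Read the raw HTML that the browser put on the clipboard.
      const html = data.dataTransfer.getData("text/html");
      if (!html) {
        return; // Nothing to convert, e.g. a plain-text paste.
      }
      // Replace the pasted content with a view fragment built from the
      // transformed HTML, so later listeners see the semantic tags.
      data.content = editor.data.htmlProcessor.toView(transform(html));
    },
    // Assumption: "highest" is enough to run before the paste filter.
    { priority: "highest" }
  );
}
```

In a real plugin this would typically be called from the plugin's `init()` method with the editor instance. The converted markup still passes through any later paste filtering and CKEditor 5's own schema handling.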
