[meta] Input filters and text formats

Created on 24 May 2010, over 15 years ago

Updated 21 July 2025, 3 months ago

There's been some exciting brainstorming about the possible future of input filters and formats in "Specify what the line break converter should do and rewrite it in DOM" and #653988: Line break filter corrupts existing XHTML. Here's the thinking so far.

Just to clarify where I'm coming from on this: Pathologic, an input filter, was my first contrib module, back in the D5 days, and it's still going strong.

The D7 HTML Corrector filter basically loads the HTML into a PHP DOMDocument object, which appears to be pretty flexible about parsing tag soup, then serializes it back out. The basic idea in reforming filters for D8 is that we keep that object around for a bit and pass it to other filters to work on before we serialize it. This will be a great help for filters which would rather navigate and modify a DOM than parse with regular expressions.

The biggest sticking point with this sort of approach is that it will no longer be possible to run other filters on HTML without also "correcting" it, since correction is inherent in creating and serializing the DOMDocument. We could possibly offer some sort of "passthrough" approach which just doesn't run any filters at all, but as soon as text hits the filter system, it's gonna be "corrected." Is this a good idea? Let the debate commence! I personally am of the mind that the net benefit of letting filters like Pathologic fiddle with things using the DOM far outweighs other concerns.
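To make the "correction is inherent" point concrete, here's a minimal sketch. It's in Python rather than PHP, and it assumes `xml.etree.ElementTree` as a stand-in for DOMDocument (unlike DOMDocument's HTML loading, ElementTree only accepts well-formed input). The `round_trip` name is hypothetical. It only shows that parsing and re-serializing, with no filters in between, can already change the markup.

```python
import xml.etree.ElementTree as ET

def round_trip(markup):
    """Parse markup into a tree, then serialize it with no filters in between."""
    # Assumption: the input is well-formed; a real DOMDocument is more forgiving.
    root = ET.fromstring(markup)
    return ET.tostring(root, encoding="unicode", method="html")

original = "<p class='note'>One<br/>Two</p>"
result = round_trip(original)

print(result)              # quoting and the <br/> form are normalized
print(result == original)  # False: the text changed although no filter ran
```

So even a format with zero DOM filters enabled would hand back slightly different markup than it was given, which is why a true passthrough would have to skip the DOM step entirely.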

Let's back up a bit. The full filtering process will look like this: The text will pass through preprocess filters. This is where filters which convert Markdown, Textile, BBcode, etc. to HTML will go. The text (hopefully all HTML at this point) then gets loaded into a PHP DOMDocument, which is passed around to "mid-process" filters to work on. Once those are done, the DOMDocument is serialized back to HTML, which can then be passed through post-process filters, for those which need to work on an HTML string instead of a DOMDocument for whatever reason.
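The three stages described above could be sketched roughly like this. Again this is Python, not PHP, with `xml.etree.ElementTree` standing in for DOMDocument (so the sketch assumes the preprocess stage produces well-formed markup). Every function name, the example base URL, and the wrapper `<div>` trick are assumptions made for illustration, not a proposed API.

```python
import re
import xml.etree.ElementTree as ET

def bbcode_bold(text):
    """Preprocess filter (hypothetical): turn [b]...[/b] into <strong>."""
    return re.sub(r"\[b\](.*?)\[/b\]", r"<strong>\1</strong>", text, flags=re.S)

def absolutize_links(root):
    """DOM filter (hypothetical): prefix site-relative hrefs with a base URL."""
    for a in root.iter("a"):
        href = a.get("href", "")
        if href.startswith("/"):
            a.set("href", "http://example.com" + href)

def append_marker(html):
    """Postprocess filter (hypothetical): works on the serialized string."""
    return html + "<!-- filtered -->"

def run_filters(text, preprocess, dom_filters, postprocess):
    # Stage 1: string in, string out (Markdown, Textile, BBcode and friends).
    for f in preprocess:
        text = f(text)
    # Parse once. The wrapper element gives the fragment a single root.
    root = ET.fromstring("<div>" + text + "</div>")
    # Stage 2: every DOM filter mutates the same tree in place.
    for f in dom_filters:
        f(root)
    # Serialize once, then strip the wrapper again.
    html = ET.tostring(root, encoding="unicode", method="html")
    html = html[len("<div>"):-len("</div>")]
    # Stage 3: string in, string out, for filters that want plain HTML.
    for f in postprocess:
        html = f(html)
    return html

sample = '<p>[b]Hi[/b] <a href="/node/1">link</a></p>'
output = run_filters(sample, [bbcode_bold], [absolutize_links], [append_marker])
print(output)
```

The point of the shape is that parsing and serializing each happen exactly once per text, no matter how many DOM filters are enabled.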

This three-stage approach will mean that the filter rearranging page can be (debatably) done away with. We can make sure that filters like BBcode will run before filters like Pathologic by virtue of the fact that the former will be a preprocess filter and the latter will be a DOM (or possibly postprocess) filter. I think this will be a wonderful usability boon for novice users. Possibly, filters can carry their own weights if they need to run before or after other filters in a particular stage, but the user never needs to see that, just as they never need to see module weights in {system}.
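Here's one way the "stage first, hidden weight second" ordering could work, as a small Python sketch. The `FilterInfo` structure, the stage names, and the example filters and weights are all hypothetical; the only idea taken from the text above is that a filter's stage decides its position and an optional weight breaks ties within a stage.

```python
from dataclasses import dataclass

# Stages always run in this order; the user never arranges them by hand.
STAGE_ORDER = {"preprocess": 0, "dom": 1, "postprocess": 2}

@dataclass
class FilterInfo:
    name: str
    stage: str
    weight: int = 0  # optional, declared by the filter, not shown in the UI

def execution_order(filters):
    """Sort by stage, then weight, then name so the order is deterministic."""
    return sorted(filters, key=lambda f: (STAGE_ORDER[f.stage], f.weight, f.name))

enabled = [
    FilterInfo("pathologic", "dom", weight=10),
    FilterInfo("legacy_filter", "postprocess"),
    FilterInfo("bbcode", "preprocess"),
    FilterInfo("url_to_link", "dom"),
]

names = [f.name for f in execution_order(enabled)]
print(names)
```

With this, BBcode lands ahead of Pathologic simply because of its stage, whatever order the filters were enabled in.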

Maintainers of many contrib filters will be able to roll a D8 version easily, without rewriting the filter to use the DOM, simply by making it a postprocess filter, so it still gets standard HTML as input. Eventually, if it makes sense to do so, they can create a new major release which uses the DOM instead.

And while we're reinventing wheels, #226963: Context-aware text filters (provide more meta information to the filter system) needs to happen too.

I've never been a major kitten killer as of yet, but I'm maybe possibly volunteering myself to take a major role in this, pending community feedback.

Type: Task
Status: Active
Version: 11.0
Component: filter.module
Created by: Garrett Albright (United States)
