HTML to Markdown abstraction

Created on 4 July 2025, 29 days ago

Problem/Motivation

At the moment, HTML to markdown conversion is implemented with the league/html-to-markdown package. It does its job well, but it has a lot more to offer: there are settings that control how the conversion is done, it is possible to add more tag converters, etc. Currently the package is used "as is" with its default (reasonable) settings.

Even though the league package is good, someone might want to use another tool or configure the league package differently. And when web components are used, not all HTML tags can be converted properly to markdown, simply because the package is not aware of them and their purpose.

Proposed resolution

Create an abstraction layer that allows any HTML to markdown conversion tool to be used, according to the user's preference. For example, there are already modules like https://www.drupal.org/project/markdownify that expose the HTML to markdown feature as a service with a pluggable structure, so that any tool can be used to convert markup to markdown through a common interface. That module ships with the league package out of the box. I assume there are other such modules as well.
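To make the "common interface" idea concrete, here is a minimal, language-neutral sketch in Python (a real Drupal implementation would be PHP). All names here (`HtmlToMarkdownConverter`, `TinyConverter`, `TagStrippingConverter`, `to_markdown`) are hypothetical and only illustrate the shape of such a layer; `TinyConverter` is a toy stand-in for a real backend such as the league package, not a model of its API.

```python
import re
from html.parser import HTMLParser
from typing import Protocol


class HtmlToMarkdownConverter(Protocol):
    """Hypothetical common interface that every conversion tool would implement."""

    def convert(self, html: str) -> str: ...


class _TinyParser(HTMLParser):
    """Toy parser: understands only <h1>, <p>, <strong> and <em>."""

    MARKERS = {"strong": "**", "em": "*"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.parts.append("# ")
        elif tag in self.MARKERS:
            self.parts.append(self.MARKERS[tag])

    def handle_endtag(self, tag):
        if tag in ("h1", "p"):
            self.parts.append("\n\n")
        elif tag in self.MARKERS:
            self.parts.append(self.MARKERS[tag])

    def handle_data(self, data):
        self.parts.append(data)


class TinyConverter:
    """Stand-in for a full-featured backend (assumption: not the league API)."""

    def convert(self, html: str) -> str:
        parser = _TinyParser()
        parser.feed(html)
        parser.close()
        return "".join(parser.parts).strip()


class TagStrippingConverter:
    """Alternative backend: drops all tags and keeps only the text."""

    def convert(self, html: str) -> str:
        return re.sub(r"<[^>]+>", "", html).strip()


def to_markdown(html: str, converter: HtmlToMarkdownConverter) -> str:
    """Calling code depends only on the interface, never on a concrete tool."""
    return converter.convert(html)
```

Because the caller only sees `convert()`, swapping `TinyConverter()` for `TagStrippingConverter()` (or any other backend) requires no change to the calling code.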

Remaining tasks

Discuss how the abstraction layer should be done:

  • possibly use markdownify or some other module
  • or a plugin manager, so that other modules can implement plugins for the AI module and take over the conversion process
  • ....
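
The plugin manager option from the list above could look roughly like the following sketch. It is written in Python purely for illustration and under stated assumptions: `ConverterPluginManager`, the plugin IDs, and the `<my-alert>` web component are all hypothetical, and `default_converter` is a toy stand-in rather than the league package.

```python
import re
from typing import Callable, Dict

Converter = Callable[[str], str]


class ConverterPluginManager:
    """Hypothetical registry: modules register converters, config picks one."""

    def __init__(self, default_id: str) -> None:
        self._plugins: Dict[str, Converter] = {}
        self._active_id = default_id

    def register(self, plugin_id: str, converter: Converter) -> None:
        self._plugins[plugin_id] = converter

    def set_active(self, plugin_id: str) -> None:
        # Refuse unknown IDs so a typo in configuration fails loudly.
        if plugin_id not in self._plugins:
            raise KeyError(f"Unknown converter plugin: {plugin_id}")
        self._active_id = plugin_id

    def convert(self, html: str) -> str:
        return self._plugins[self._active_id](html)


def default_converter(html: str) -> str:
    """Toy default: turns <b>/<strong> into ** and drops every other tag."""
    html = re.sub(r"</?(?:b|strong)>", "**", html)
    return re.sub(r"<[^>]+>", "", html)


def web_component_converter(html: str) -> str:
    """Plugin from another module: knows the hypothetical <my-alert> element."""
    html = re.sub(r"<my-alert>(.*?)</my-alert>", r"> \1", html, flags=re.S)
    # Delegate everything it does not handle itself to the default converter.
    return default_converter(html)


manager = ConverterPluginManager(default_id="default")
manager.register("default", default_converter)
manager.register("web_components", web_component_converter)
```

With the default plugin active, the unknown `<my-alert>` tag is simply dropped; after `manager.set_active("web_components")` the same markup is rendered as a markdown blockquote, which is the kind of takeover the list item describes.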

Feature request

Status

Active

Version

2.0

Component

AI Core module

Created by

🇩🇪Germany a.dmitriiev

Comments & Activities

  • Issue created by @a.dmitriiev
  • 🇩🇪Germany marcus_johansson

    It would be great to have this in the AI Core, and I think the idea of abstracting the process away, so that anyone can plug in to override the default league package, would be great - I think we should offer an easily pluggable form that can be used in third-party settings or added to configuration forms, so that you can attach a specific configuration to a Search API index, for instance.

  • 🇩🇪Germany marcus_johansson

    Actually, thinking about it, it should probably be done as part of this issue: "Create Document Loader Normalization Layer" (Active).

    Even if it's less likely that an HTML to markdown converter will be an external service, having a built-in service and the option to switch over to other services sounds like it fits right into document loaders.
