Move TextExtractor service to AI core

Created on 18 July 2025, 3 months ago

Problem/Motivation

Follow-up to Attempt to find nested/referenced text field values for content suggestions Active : AI content suggestion and AI translate (probably automators as well) have very similar requirement to extract and then insert textual metadata from/into content entities.

Such a service exists in ai_translate namespace (TextTranslator service + FieldTextExtractorInterface plugins), but it should be moved into core to be available to other submodules and contrib

📌 Task
Status

Active

Version

1.2

Component

AI Core module

Created by

🇧🇬Bulgaria valthebald Sofia

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

  • Issue created by @valthebald
  • First commit to issue fork.
  • 🇧🇪Belgium svendecabooter Gent

    Added a MR to refactor this.
    Marked the existing ai_translate classes / services deprecated, so contrib modules extending them, have time to update their dependencies.

  • Pipeline finished with Failed
    about 1 month ago
    Total: 236s
    #588740
  • Pipeline finished with Failed
    about 1 month ago
    #588748
  • 🇧🇪Belgium svendecabooter Gent

    TODO:

    - \Drupal\ai\TextExtractor::shouldExtract() still contains translation-specific logic. This should be moved to the module making use of the Text Extractor service, in this case ai_translate. Not sure how this should be set up.

  • Pipeline finished with Success
    about 1 month ago
    #588766
  • 🇧🇪Belgium svendecabooter Gent

    I probably was a bit too enthusiastic with my MR.

    I identified multiple problems with just moving the FieldTextExtractor plugins from ai_translate to ai core:

    1. The logic in shouldExtract() for each plugin is tightly coupled to translation logic. In it's easiest form, it checks if a field is translatable. In more complex cases, e.g. for Entity References, it checks configuration settings to determine whether to extract a given entity reference.
    2. The logic in setValue() is also tightly coupled to translation logic. It assumes the plugin that extracted the field value, will store it again on the exact field same location (but on the translated version of an entity)

    The scenario's where text extraction could be useful outside of translation logic, seem to be:

    • For ai_content_suggestion: this just needs the extraction part, not the logic for setting the value again on the same field!
    • For ai_automators: might be useful to provide the extracted values as tokens to automators? There is already something in place like that, but using the extractor plugins might extend that logic? Haven't looked to deeply into this. But this would also only need the extraction logic I assume?
    • ... other places?

    Refactoring the FieldTextExtractor plugins to be more generic, while keeping backwards compatibility, seems rather challenging.
    I was thinking we could set it up as follows:

    New plugin type: AiFieldTextExtractor

    • Resides in the Drupal\ai namespace
    • Provides implementations for various field types, same as current ai_translate FieldTextExtractor plugins
    • Only implements the extract() method + shouldExtract() method / no value setting logic
    • Does it need to be limited to extracting text? If not, name it AiFieldExtractor? Haven't thought this through yet :)

    Existing plugin type: FieldTextExtractor

    • Resides in the Drupal\ai_translate namespace
    • Already has implementations for various field types: they will be refactored to extend their AiFieldTextExtractor equivalent
    • Only provides the setValue() method.
    • extract() and shouldExtract() could be overridden to provide translation-specific logic that is not in the parent AiFieldTextExtractor plugin counterpart
    • Could potentially be deprecated in 2.0.0 in favor of a better named AiTranslateFieldTextExtractor plugin / namespace, but that might be nitpicking on my part...

    ai_translate and contrib modules extending this module would keep on using the FieldTextExtractor plugins.
    Other AI modules could use just the extract logic from the AiFieldTextExtractor plugins, and run with it.

    Does that make sense, or is it overly complex and unnecessary? Thanks for giving it some thought.

  • 🇧🇪Belgium svendecabooter Gent

    So for example:

    \Drupal\ai\AiFieldTextExtractor\TextFieldExtractor:

    #[AiFieldTextExtractor(
      id: "text",
      label: new TranslatableMarkup('Text'),
      field_types: [
        'title',
        'text',
        'text_long',
        'string',
        'string_long',
      ],
    )]
    class TextFieldExtractor {
    
      /**
       * {@inheritdoc}
       */
      public function getColumns(): array {
        return ['value'];
      }
    
      /**
       * {@inheritdoc}
       */
      public function extract(ContentEntityInterface $entity, string $fieldName): array {
        // IMPLEMENTATION.
      }
    
      /**
       * {@inheritdoc}
       */
      public function shouldExtract(ContentEntityInterface $entity, FieldConfigInterface $fieldDefinition): bool {
        return TRUE;
      }
    
    }
    

    (some of these methods would be in a base class obviously, but for illustration purposes I added them here)

    and

    \Drupal\ai_translate\FieldTextExtractor\TextFieldExtractor:

    #[FieldTextExtractor(
      id: "text",
      label: new TranslatableMarkup('Text'),
      field_types: [
        'title',
        'text',
        'text_long',
        'string',
        'string_long',
      ],
    )]
    class TextFieldExtractor extends \Drupal\ai\AiFieldTextExtractor\TextFieldExtractor: {
    
      /**
       * {@inheritdoc}
       */
      public function setValue(ContentEntityInterface $entity, string $fieldName, array $textMeta) : void {
        // IMPLEMENTATION.
      }
    
      /**
       * {@inheritdoc}
       */
      public function shouldExtract(ContentEntityInterface $entity, FieldConfigInterface $fieldDefinition): bool {
        return $fieldDefinition->isTranslatable();
      }
    
    }
    
Production build 0.71.5 2024