Add a reliable entity-usage system to core

Created on 4 June 2023, about 1 year ago
Updated 15 March 2024, 3 months ago

Problem/Motivation

At present Drupal has a file usage API. This is critical to things like private file access and garbage collection of files.
Unfortunately, it relies on modules recording and removing usage entries and cannot be relied upon to provide a source of truth.

There are a large number of bugs in core that pertain to invalid file usage data, and we've had security issues and critical data loss issues as a result.

Some examples include:

Automatic file deletion in core had to be disabled due to persistent and impossible-to-resolve data loss issues:

🌱 Dealing with unexpected file deletion due to incorrect file usage Active (see also all the linked and related issues of that issue, many of which are unresolved).

In addition, this API is limited to file usage, but Drupal's data model allows much richer entity relationships.

For example, the file usage API may record that a media entity makes use of a file. But content editors need to know if any other content entities make use of that media entity before they can decide if the file is in fact no longer in use.

Proposed resolution

Adapt the entity usage module for core as a low-level API.
It provides the following features.

  • A configurable usage API allowing an entity type to be flagged as a source or target of a usage record
  • Support for revisions
  • Support for translations
  • Calculated at save time so cheap to query at run time
  • A plugin based API allowing usage to be determined in a myriad of ways - e.g. via an entity reference, via a link in HTML, via an image tag, via an inline block in layout builder and many more
  • The ability to reload the entire usage dataset via batch (à la node access rebuild) to ensure the data is accurate
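
As a rough illustration of the plugin-based approach in the list above, the following Python sketch runs a set of trackers over an entity's field values at save time and records one usage row per match. It is not the entity_usage module's API (the module is PHP): the `Usage` record, the tracker functions, the field names, and the `href="/type/id"` link format are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Usage:
    source: str  # hypothetical ID format, e.g. "node:4"
    target: str  # e.g. "media:2"
    method: str  # name of the tracker that found the usage


def entity_reference_tracker(fields: dict) -> Iterable[str]:
    """Targets listed in an (assumed) 'references' field."""
    return list(fields.get("references", []))


def html_link_tracker(fields: dict) -> Iterable[str]:
    """Targets linked from HTML in an (assumed) 'body' field."""
    body = fields.get("body", "")
    # Assumes links look like href="/node/7"; purely illustrative.
    return [f"{etype}:{eid}" for etype, eid in re.findall(r'href="/(\w+)/(\d+)"', body)]


# Registry of trackers; new detection methods are added here.
TRACKERS = {
    "entity_reference": entity_reference_tracker,
    "html_link": html_link_tracker,
}


def calculate_usages(source_id: str, fields: dict) -> list:
    """Run every tracker once, at save time, so reads stay cheap."""
    return [
        Usage(source_id, target, name)
        for name, tracker in TRACKERS.items()
        for target in tracker(fields)
    ]
```

Because every row is derived from the saved content alone, a full rebuild in this model amounts to calling `calculate_usages` again for each stored entity.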

With Entity Usage, a content editor can traverse the entity relationship to ascertain that e.g. image file A is attached to media entity B which is referenced from block content C which is used inline in the layout of node D. This allows the content editor to get meaningful usage data.
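
The traversal described above can be sketched in Python as a walk over (source, target) usage rows. This is a minimal illustration, not the module's implementation: the `USAGES` list, the ID strings, and the `usage_chains` helper are hypothetical names chosen to mirror the file A / media B / block content C / node D example.

```python
from collections import defaultdict

# Hypothetical usage rows as (source, target) pairs.
USAGES = [
    ("media:B", "file:A"),
    ("block_content:C", "media:B"),
    ("node:D", "block_content:C"),
]


def usage_chains(target, usages):
    """Return every chain from `target` up to an entity nothing else uses."""
    sources_by_target = defaultdict(list)
    for source, tgt in usages:
        sources_by_target[tgt].append(source)

    chains = []

    def walk(entity, path):
        # Skip entities already on the path so reference cycles terminate.
        parents = [s for s in sources_by_target.get(entity, []) if s not in path]
        if not parents:
            chains.append(path)
            return
        for parent in parents:
            walk(parent, path + [parent])

    walk(target, [target])
    return chains


print(usage_chains("file:A", USAGES))
# [['file:A', 'media:B', 'block_content:C', 'node:D']]
```

A file with no usage rows yields a single chain containing only itself, which is the case an editor would treat as "no longer in use".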

Remaining tasks

Agree this would be a useful feature to add to core, move to the core issue queue

User interface changes

API changes

Data model changes

🌱 Plan
Status

Active

Component

Idea

Created by

🇦🇺 Australia larowlan 🇦🇺.au GMT+10


Comments & Activities

  • Issue created by @larowlan
  • 🇺🇸 United States phenaproxima Massachusetts

    +1 for this. It would help solve some tricky, long-standing problems with media.

  • 🇪🇸 Spain marcoscano Barcelona, Spain

    Adding 📌 Track media usage and present it to the site builder (in the media library, media view, on media deletion confirmation, etc.) Active since it has some background discussion that might be helpful here.

  • 🇪🇸 Spain marcoscano Barcelona, Spain

    After re-reading the related issue I see that back in the day I expressed that this was a "hard problem" to solve in core but didn't expand too much on that.

    I still think it's non-trivial, especially if we want a solution that would both 1) register the data, and 2) present the data in a meaningful manner to end users. To me the trickiest aspects are:
    - Depending on the content model, there may be "intermediate entities" (e.g. paragraphs, inline blocks, etc.) that don't have a standalone representation (in other words, they "don't mean anything on their own to end users"), but on a technical level they are just as important as any other entity. So we need to track them the same way because they are part of the chain, but possibly display them differently (for example omitting them) in the UI. This causes complexity and overhead in the code that displays usages to users.
    - In content models that have a very large relationship tree (nested entities), calculating usage on entity save is very expensive. In entity usage we introduced a mechanism to allow updating the usage table as a background process (using a @destructable service) but that introduces its own complexities and inaccuracies to handle in certain scenarios.

    Having said that, I too face this need in almost every project I work on, so despite the challenges, I am +1 for trying to find the most reasonable way to have this type of functionality in core.

  • 🇦🇺 Australia larowlan 🇦🇺.au GMT+10

    At the risk of going into implementation details, I think we can resolve some of those issues with entity handlers.
    I had a similar issue with filter format audit, where I have a default handler plus special cases for other entity types like paragraphs and inline blocks.

  • 🇬🇧 United Kingdom catch

    I'm not familiar with entity_usage module at all, but a general +1 to ripping the current file_usage system out and replacing it with something completely different - IMO the file usage API as it currently stands is unfixable.

    Added 🌱 Dealing with unexpected file deletion due to incorrect file usage Active to the issue summary which is the current meta for how broken it currently is.

    > - Depending on the content model, there may be "intermediate entities" (eg paragraphs, inline blocks, etc) that don't have a standalone representation (in other words, they "don't mean anything on their own to end users"), but on a technical level they are just as important as any other entity.

    This seems a similar issue to usages for entities that you don't have view access to. Doesn't seem insurmountable.

    > In content models that have a very large relationship tree (nested entities), calculating usage on entity save is very expensive. In entity usage we introduced a mechanism to allow updating the usage table as a background process

    This is definitely worth an implementation issue if we decide to add this to core. I'm sure we could figure something out with an overall 'system is catching up to itself' flag that could indicate data is being rebuilt, whether via a full rebuild, a queue, or on destruct. We'd need some ability to disable it happening directly on save for migrations (i.e. you'd often want the initial migration to run as fast as possible, then rebuild the usage tables when all the entities and their revisions are in).
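
One way to picture the "catching up" flag and the migration switch from the comment above is the Python sketch below. It is only an illustration under stated assumptions: the `UsageTracker` class, its `defer` switch, and the `catching_up` property are hypothetical and do not correspond to anything in core or in the entity_usage module.

```python
from collections import deque


class UsageTracker:
    """Hypothetical tracker that can postpone usage calculation."""

    def __init__(self, calculate):
        self.calculate = calculate  # callable: entity -> set of target IDs
        self.usages = {}            # source ID -> set of target IDs
        self.pending = deque()      # saves waiting to be processed
        self.defer = False          # set to True during e.g. a migration

    def on_save(self, entity_id, entity):
        if self.defer:
            # Keep the save fast; usage is calculated later.
            self.pending.append((entity_id, entity))
        else:
            self.usages[entity_id] = self.calculate(entity)

    @property
    def catching_up(self):
        """True while usage data may be stale."""
        return bool(self.pending)

    def process_pending(self, limit=None):
        """Process queued saves, optionally in bounded chunks."""
        done = 0
        while self.pending and (limit is None or done < limit):
            entity_id, entity = self.pending.popleft()
            self.usages[entity_id] = self.calculate(entity)
            done += 1
        return done
```

In this model, anything that acts on usage data (a deletion check, for instance) would consult `catching_up` first and treat the data as incomplete while it is true.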

  • 🇦🇺 Australia kim.pepper 🏄‍♂️🇦🇺 Sydney, Australia

    Sounds like a great idea.

  • 🇨🇭 Switzerland Berdir Switzerland

    Generally +1, entity usage tracking does seem to fit into core, also as a replacement for file_usage. Being able to track revisions would vastly simplify issues caused by the current limitations there.

    And yes, composite entities like paragraphs and inline blocks are a challenge. No, it's not the same as inaccessible entities. You still want and need to see them, but in a way that's actually useful as an editor, which is how you view and edit them. So you want to see the node that contains the paragraph/inline block, with maybe some information on which element it is. This is especially an issue with revisions: if you stop using an inline block or paragraph in a new revision, the composite entity will remain the default revision and look relevant, but it's not. There is an issue/plan for a 3.x version of entity_usage that would explicitly track usage on its host entity.

    Another challenging issue that entity_usage needs to deal with is string IDs on both the source and target, which result in annoyingly large indexes and complicated queries (entity_usage tends to be one of the biggest tables in our projects).

  • 🇳🇱 Netherlands Lendude Amsterdam

    Just want to give a general +1 on this, it's something that is needed or being requested on most projects we do these days.

  • 🇨🇭 Switzerland Berdir Switzerland

    Replying partially to the comment from @catch in issues like 🐛 Deleting an entity with revisions and file/image field does not release file usage of non-default revisions, causing files to linger Postponed:
    > This should really be postponed on (this issue), the file_usage system is broken beyond repair, there is no way to rebuild file usage, so as soon as it's off by one or more, it's like that forever.

    I'm not entirely convinced that we should postpone those issues. True, most of them haven't moved in years. But having them fixed in some capacity would also result in test coverage that we can build on.

    Also, there are a few things to consider in regards to the rebuild usage scenario and also using it for replacing file_usage:

    * entity_usage is limited to entity-to-entity usages, file_usage is not. There are valid cases where files are used in non-entities, for example in core that's the theme logo. Rebuilding with entities is one thing: they are a known thing and we can loop over them all and eventually our data is rebuilt. But rebuilding those other things is going to be more complex if we want to keep support for that. This would need to be built into the plugin system and data model in some way.
    * On sites that are large enough, with millions of entities, the rebuild feature becomes so slow that it becomes almost theoretical and unusable. At least that is the case for the current implementation in the entity_usage module; better options may be possible. Since it tracks revisions of both targets *and* sources, it means looping over every revision of every entity type you have it enabled on, loading them, and running every plugin on each revision. The module can do it either on the fly in a batch (but you have to restart if it fails for any reason) or through the queue.
    * The current functionality is opt-in for both source and target entity type, which makes a lot of sense as there are plenty of references you don't want to track, for example orders/items/profiles on ecommerce sites are most likely not useful/needed. However, file_usage decides on whether or not files get deleted, so tracking for it has to be mandatory for all source entity types that might use it as well.

    Again, I think entity_usage is awesome, we use it on all projects, but it's a complex problem space and entity_usage in its current form isn't really designed to replace file_usage.

  • 🇦🇺 Australia acbramley

    Big +1 to get something like this into core.

    Wrt. the entity_usage module it seems there are currently 2 different rewrites in progress in the 8.x-3.x and 8.x-4.x branches so it'd be good to figure out which of those is likely to be the solution going forward.

  • I think this is pretty important if it's required to fix file usage and files not getting deleted when a node is deleted if the files are used by a non-current node revision.

  • 🇬🇧 United Kingdom catch

    > I'm not entirely convinced that we should postpone those issues. True, most of them haven't moved in years. But having them fixed in some capacity would also result in test coverage that we can build on.

    Yeah the test coverage and knowing that we need to cover the case is I think fair enough, however we're never, ever going to be able to turn automatic file deletion based on file_usage on again, which makes the data in there essentially worthless, and it's going to be like that regardless of how many issues we fix due to the impossibility of an upgrade path. So as an exercise yes, but only at best indirectly leading to fixing the actual issue we want to fix.

  • > however we're never, ever going to be able to turn automatic file deletion based on file_usage on again, which makes the data in there essentially worthless

    It would be possible to have automated deletion for files created after the fixes. For other files, though, you would need to manually review the files and determine if they can be deleted or not. It would be really awesome if we could have an update hook "rebuild" usage data.

    In any case, that's all the more reason to fix it and have data correct going forward, even if all legacy files have incorrect data. That way, new websites will at least be creating correct data.

  • The problem is that the file_usage table only tracks the entity ID, not the revision ID for that entity. So if you remove the file for the current version of the entity, it just decrements the count column value, but there is no way to know from the file_usage table what revision is referencing it.

    Ideally, we would remove the count column and use a separate row for each reference and revision. However, this could cost a lot of space if an entity has a lot of files F and a lot of revisions R. There would be F times R rows for that entity. Maybe there's a better solution?

  • Honestly, this seems like a "revisioning" problem. Currently, nodes save the entire node for each revision. Ideally, only the "diff" would be saved. The same could be done for file usage, noting which revision(s) added/removed a reference to the file. It's sort of a Git-like problem.
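
The two storage ideas in the comments above (one row per file per revision versus recording only where a reference was added or removed) can be compared with a small Python sketch. This is an illustration only: neither file_usage nor entity_usage is claimed to store data this way, and the function names, the `revisions` mapping, and the `(file, first_revision, last_revision)` row shape are assumptions made for the example.

```python
def rows_per_revision(revisions):
    """One row per (file, revision): up to F x R rows for one entity."""
    return {(file, rev) for rev, files in revisions.items() for file in files}


def rows_as_intervals(revisions):
    """One row per continuous run of use: (file, first_rev, last_rev).

    last_rev is None while the file is still used in the newest revision.
    """
    intervals = []
    open_runs = {}   # file -> first revision of its current run
    previous = None  # revision processed in the previous iteration
    for rev in sorted(revisions):
        files = revisions[rev]
        # Close runs for files that this revision stopped referencing.
        for file in list(open_runs):
            if file not in files:
                intervals.append((file, open_runs.pop(file), previous))
        # Open runs for files that this revision started referencing.
        for file in files:
            open_runs.setdefault(file, rev)
        previous = rev
    intervals.extend((file, first, None) for file, first in open_runs.items())
    return sorted(intervals, key=lambda row: (row[0], row[1]))


# Hypothetical entity: revision ID -> files referenced by that revision.
revisions = {
    1: {"a.png", "b.png"},
    2: {"a.png", "b.png"},
    3: {"a.png"},
    4: {"a.png"},
}

print(len(rows_per_revision(revisions)))  # 6
print(rows_as_intervals(revisions))
# [('a.png', 1, None), ('b.png', 1, 2)]
```

The interval form stays small when references rarely change between revisions, at the cost of a range comparison when asking whether a given revision uses a file.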

  • 🇬🇧 United Kingdom catch

    > It would be really awesome if we could have an update hook "rebuild" usage data.

    That's what this issue is about - replacing the entire system with entity_usage that tracks each individual usage in a rebuildable way instead of a count.

  • I took a look at it; however, it doesn't fix the issue of files not being marked temporary because it doesn't integrate at that level yet. It also doesn't have a nice report or anything to view all entity usage for each entity type, which I would want. I think that for now I might just have to use a workaround 🐛 File not marked temporary and usage not updated if only used in past revisions when node is deleted Active to delete entity usage for revisions.
