Unpublish already imported entities that are no more in the server channel

Created on 22 November 2021, over 2 years ago
Updated 14 November 2023, 8 months ago

Problem/Motivation

When previously imported entity is no more in the server channel (deleted entity or entity that no more satisfies the channel filters), we cant detect this in the client

Proposed resolution

For all entries in the entity_import_status table which have the last_import lower than entity_share_cron.cron_last_run, do :

  1. Unpublish the corresponding entities
  2. Delete these import statuses (or keep them ? In this case for each cron run the same entities will be unpublished.)
✨ Feature request
Status

Needs work

Version

3.0

Component

Code

Created by

πŸ‡«πŸ‡·France alexandre.todorov

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • πŸ‡¨πŸ‡­Switzerland Berdir Switzerland

    I'm reopening this for now, adding a setting to make it optional and chose between deletion and unpublishing on the client.

    I'm restructured things a bit, the previous implementation assumed that the queue item it creates would not be called before this one finishes, which could actually happen if you use an alterantive queue system with multiple runners, this also simplifies it a bit as we don't need another fake-pass through into the same plugin to then create another plugin. More cleanup is necessary, this really should be an entity query for example.

    But I've hit a bit of a snag now, it is very much tied to the entity_share_cron flow and the import timestamps. So it would be tricky to implement it in a different module as those modules have no way of knowing that entity_share_cron is done.

    That said, even within this module, there is a conflict between this implementation and the SkipImported plugin, which seems crucial to efficiently handle even a medium amount of entities. That is, the last_import flag of an import status entity isn't updated if it is skipped, and then this patch would delete/unpublish it. For this to work, we need something else, we need to know if/when an entity was last *attempted* to be synced.

    I have roughly two approaches in mind:

    a) We introduce a new last_import_attempt or something field for EntityImportStatus, that is always updated, before SkipImported as a chance to skip it (or we need a way for plugins to react to an entity being skipped), preferably a bulk update to avoid hundreds of import status entity updates. Would require changes in entity_share as well and seems nicer but also a lot more work.

    b) We put something together just for this module, what I have in mind is some storage per remote/channel, at the start we add ids for of all import statuses for the given remote/channel, and then as we work through it delete those entries(), the leftovers are then what we have to delete. I was hoping we could just put it in a key value collection, but while the API has a setMultiple() method, the database storage doesn't implement it, so it would be single INSERT queries.

    The current patch is half-way done and not property tested at all. No interdiff, I moved and renamed most things in it anyway.

    I'll also post an update to ✨ Support a entity delete policy Active for the bigger picture.

  • πŸ‡§πŸ‡ͺBelgium kriboogh

    Been reading up again on this as we also probably still need to be able to detect server side deleted entities and delete them on the clients.

    @Berdir, your approach looks like it might also solve our situation (we don't use SkipImported). If you do, as a workaround, you might create a custom client side process plugin that runs before the SkipImported plugin, which sets the last_import date of an entity that is about to get skipped. That being said, maybe SkipImported should set the last_import date always (even if it skips an entity).

  • πŸ‡¨πŸ‡­Switzerland Berdir Switzerland

    Yes, I had similar thoughts on custom process plugins. and always updating last_import instead of a new field is an option, either would need to be done in the entity_share main project as a prerequisite for this and I'd imagine this is something that would only be accepted for a new major version.

    And if we do that, I'd really prefer to have a bulk update through a custom storage handler for example, so it's a single update where id IN() condition, that would be a lot faster if you have hundreds/thousands of entities to check compared to a full entity save with hooks and everything.

    That said, I'm currently strongly considering to follow the approach with DeletedEntity entity type (that I mentioned in the referenced issue) that can be fetched from the client with an efficient condition on the date. I'll likely to a proof of concept in our project with that and then share that some, a bit unsure how that fits into the architecture of entity_share (what of it is in the main project, what's in the cron module ...)

  • πŸ‡¨πŸ‡­Switzerland Berdir Switzerland

    I implemented that entity type now custom in my project, it's pretty straight forward, just not sure how exactly it fits into the architecture of this module. If we get some input on that, then maybe I or someone else can convert that into a patch/modules:

    What you need for that (FOO = module name on sites providing content, BAR is module on sites pulling content, can be the same if you you do bidirectional):
    1. A basic entity type with id, uuid, created (for efficient queries/sync) and entity_type_id and entity_uuid:

    /**
     * Defines the deleted entity class.
     *
     * @ContentEntityType(
     *   id = "FOO_deleted",
     *   label = @Translation("Deleted entity"),
     *   label_collection = @Translation("Deleted entities"),
     *   handlers = {
     *     "views_data" = "Drupal\views\EntityViewsData",
     *     "access" = "\Drupal\FOO\DeletedAccessControlHandler",
     *   },
     *   base_table = "FOO_deleted",
     *   entity_keys = {
     *     "id" = "id",
     *     "label" = "id",
     *     "uuid" = "uuid"
     *   }
     * )
     */
    class FOODeletedEntity extends ContentEntityBase {
    
      /**
       * {@inheritdoc}
       */
      public function getCreatedTime() {
        return $this->get('created')->value;
      }
    
      /**
       * {@inheritdoc}
       */
      public function setCreatedTime($timestamp) {
        $this->set('created', $timestamp);
        return $this;
      }
    
      /**
       * {@inheritdoc}
       */
      public static function baseFieldDefinitions(EntityTypeInterface $entity_type) {
    
        $fields = parent::baseFieldDefinitions($entity_type);
    
        $fields['created'] = BaseFieldDefinition::create('created')
          ->setLabel(t('Created'));
    
        $fields['entity_uuid'] = BaseFieldDefinition::create('string')
          ->setLabel(t('Entity UUID'))
          ->setDescription(t('The UUID of the deleted entity.'));
    
        $fields['entity_type_id'] = BaseFieldDefinition::create('string')
          ->setLabel(t('Entity type'))
          ->setDescription(t('The  entity type of the deleted entity.'));
    
        return $fields;
      }
    
    }
    

    2. Minimal access control for just view access with either a custom permission or something you already have. You could also use an admin permission, then step 4 won't be necessary, but I prefer to lock access down, even if there is no UI.

    class DeletedAccessControlHandler extends EntityAccessControlHandler {
    
      /**
       * {@inheritdoc}
       */
      protected function checkAccess(EntityInterface $entity, $operation, AccountInterface $account) {
        if ($operation == 'view') {
          return AccessResult::allowedIfHasPermission($account, 'access FOO deleted entity information');
        }
        return parent::checkAccess($entity, $operation, $account);
      }
    
    }
    

    3. Implement hook_entity_delete() for the entity types that you care about, in a reusable solution this would be config.

    function FOO_entity_delete(EntityInterface $entity) {
      if (in_array($entity->getEntityTypeId(), ['node', 'media', 'taxonomy_term', '...'])) {
        $deleted = FOODeletedEntity::create([
          'entity_type_id' => $entity->getEntityTypeId(),
          'entity_uuid' => $entity->uuid(),
        ]);
        $deleted->save();
      }
    }
    

    4. Without an admin permission, jsonapi will silently break query filters (no idea why it doesn't just give an error, it's super verbose on any other access issue), so you need to implement its filter access hook.

    function FOO_jsonapi_FOO_deleted_filter_access(EntityTypeInterface $entity_type, AccountInterface $account) {
      return ([
        JSONAPI_FILTER_AMONG_ALL => AccessResult::allowedIfHasPermission($account, 'access FOO deleted entity information'),
      ]);
    }
    

    Deploy this to all sites that provide channels.

    5. Final step is a hook_cron() implemention on any site that pulls content.

    function BAR_cron() {
      $last_delete_check = \Drupal::state()->get('BAR_last_delete', 0);
    
      $entity_type_manager = \Drupal::entityTypeManager();
      /** @var \Drupal\Core\Entity\EntityRepositoryInterface $entity_repository */
      $entity_repository = \Drupal::service('entity.repository');
    
      // Build the initial filter query, only fetch deletions that happened since
      // the last cron run, 100 results per page.
      $query = [
        'filter[created][condition][path]' => 'created',
        'filter[created][condition][operator]' => '>',
        'filter[created][condition][value]' => $last_delete_check,
        'sort' => 'created',
        'page[limit]' => 100,
      ];
      foreach (Remote::loadMultiple([]) as $remote) {
        try {
    
          // Build the initial URL, this is repeated for each page we find.
          $url = 'jsonapi/FOO_deleted/FOO_deleted?' . http_build_query($query);
          $client = $remote->getHttpClient(TRUE);
          do {
            $response = $client->get($url);
    
            $data = Json::decode($response->getBody());
    
            foreach ($data['data'] as $deleted) {
              $entity_type_id = $deleted['attributes']['entity_type_id'];
              $entity_uuid = $deleted['attributes']['entity_uuid'];
              if ($entity_type_manager->hasDefinition($entity_type_id)) {
                $entity = $entity_repository->loadEntityByUuid($entity_type_id, $entity_uuid);
                if ($entity) {
                  \Drupal::logger('FOO')->notice('Deleting removed @type @label (#@id) from remote @remote_label', [
                    '@type' => $entity->getEntityType()->getSingularLabel(),
                    '@label' => $entity->label(),
                    '@id' => $entity->id(),
                    '@remote_label' => $remote->label(),
                  ]);
                  $entity->delete();
                }
              }
            }
    
            $url = $data['links']['next']['href'] ?? NULL;
          } while ($url);
        }
        catch (\Exception $e) {
          watchdog_exception('FOO', $e);
        }
      }
    
      // Update the last delete flag, so we only check new deletions. Subtract
      // one second to make sure we're not missing anything happening at exactly
      // the same time.
      \Drupal::state()->set('BAR_last_delete', \Drupal::time()->getRequestTime() - 1);
    }
    
  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    Thanks for sharing @Berdir - I'm working for a client with a similar requirement. What are your thoughts on collaborating on making #8 a contrib project?

  • πŸ‡¨πŸ‡­Switzerland Berdir Switzerland

    If there's no interest to have it in the main project then I'm open to that.

  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    I think in ✨ Support a entity delete policy Active @Grimreaper indicated he didn't want it in the main module. But I'll ping him on slack to confirm

  • πŸ‡«πŸ‡·France Grimreaper France πŸ‡«πŸ‡·

    Hi,

    Entity deleted does not keep track of translation deletion. But maybe it will be too much ide effects prone too, because Entity Share does not care about maintaining translation source, so I guess not possible to be able to delete a specific translation on client website. If this translation had been imported first on the client website, it will be the translation source so not possible to delete it with deleting the whole entity.

    Thanks to have introduced configuration.

    About the unpublish feature, if the client website is using Content Moderation, I suppose that it should be a state that should be configured?

    So from what I understand:
    - changes are needed into entity_share directly
    - having the work done here into a separated module from entity_share_cron will even more complicated the amount of work required

    If this will be a configurable behavior (in entity_share_cron and/or as import policy) I am more opened to have changed introduced in entity_share and entity_share_cron. I don't want a behavior (especially data deletion) forced like patch in comment 2.

    Of course this would need to have tests.

    Also regarding ✨ Evaluate if possible to share users Active , if this lands in 4.0.x, I think adaptation on the deletion behavior will be needed.

  • πŸ‡¨πŸ‡­Switzerland Berdir Switzerland

    Translation is a good point, it might make sense to support that, the deleted entity type would need a langcode column then so it can keep track which translations have been deleted. translations can repear, so we'll have to make sure to also remove that information again to avoid race conditions with translations that quickly get deleted and recreated.

    > About the unpublish feature, if the client website is using Content Moderation, I suppose that it should be a state that should be configured?

    Do you mean to unpublish on the client if the server deletes that, per current title. Yes, I suppose the client could support a setting that is either delete or unpublish or if content moderation is active on that entity type/bundle, use a given moderation state.

    > So from what I understand:
    > - changes are needed into entity_share directly
    > - having the work done here into a separated module from entity_share_cron will even more complicated the amount of work required

    not sure I understand this. nothing *has* to go into the main entity share client/server modules. But, having them elsewhere requires two new modules, one to be installed everywhere where you also have server and one on all clients. So having this feature directly in the main modules, optional and configurable of course, has some benefits. Possibly just as an API/manual sync operation or something like that. And then cron/pubsub/... modules that automate the sync could build on that.

    I guess it could be beneficial to start with a patch against entity_share directly, see how that would work and look and then make a decision to keep it or move it to a separate project.

  • πŸ‡«πŸ‡·France Grimreaper France πŸ‡«πŸ‡·

    > Do you mean to unpublish on the client if the server deletes that, per current title.

    Per current issue title? Yes

    Sorry I thought that in the patch there was already an unpublish setting, but it is only a delete one. I confused with something else.

    > not sure I understand this. nothing *has* to go into the main entity share client/server modules.

    From comment 5:
    - > So it would be tricky to implement it in a different module as those modules have no way of knowing that entity_share_cron is done.
    - option a): > ... Would require changes in entity_share as well and seems nicer but also a lot more work.

    So more difficult to implement if new projects are introduced?

    > I guess it could be beneficial to start with a patch against entity_share directly, see how that would work and look and then make a decision to keep it or move it to a separate project.

    Yes please. With MRs.

  • πŸ‡¨πŸ‡­Switzerland Berdir Switzerland

    > So more difficult to implement if new projects are introduced?

    The approach we used that I outlined in #8 is completely standalone and does not require any integration with entity_share_cron. Mostly because the previous approach is not compatible with SkipImported and it would be complex to handle that efficiently with a large amount of entities.

    It's a separate API that runs parallel to the entity_share channels (which sync new and updated content), it just queries new FOO_deleted entities through the standard jsonapi collection.

    In theory this entity type too could be synced as a channel, I decided against that, because there will never be any updates, there will never be any referenced entities, the FOO_deleted entities do not need to be persisted on the client. That allows for a more efficient sync API, it includes a condition to only get new entities since the last sync, deletes (or unpublishes when that option is added) them locally if they exist, done.

    It might struggle in some edge case scenarios, like a massive bulk deletion of content on the server, a queue could be used to handle that.

    The main reason to include that in the main modules would be to have it available as an official feature that is more tightly integrated with the channels (channel configuration could be used to track which entity types/bundles/languages need to be handled, I didn't need that complexity in my custom solution). And then the only reason to involve entity_share_cron() would be to just call that API, since entity_share currently does not do any cron operation, so it would make sense to have that there as well.

  • πŸ‡«πŸ‡·France Grimreaper France πŸ‡«πŸ‡·

    Ok. So the solution to workaround SkipImported is put aside.

    Then I think the approach from comment 15 is the best start.

Production build 0.69.0 2024