Create optimised username and email sanitisers based on user ID

Created on 23 May 2022, almost 3 years ago
Updated 9 May 2025, 8 days ago

Problem/Motivation

Running GDPR dump on a database with a lot of users takes a long time. This is because the relevant anonymisers use the Faker library with the option to keep usernames and email addresses unique, which is, of course, a requirement for Drupal. However, because Faker has no context, the only way it can guarantee uniqueness is to keep a record of the values it has already issued. This makes the process slower and slower as more values are generated. For a database with a few dozen users, this is not an issue. For a database with thousands of users, it quickly becomes one (note that the latter is also more likely to cause GDPR issues if an unsanitized copy of the database were to leak).
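To illustrate the kind of slowdown described above, here is a toy model in Python. It is not Faker's implementation: the `issue_unique` function, the pool size, and the retry-on-collision strategy are all assumptions made for this sketch. It only shows that a generator which relies on a record of issued values needs more and more attempts per value as that record fills up.

```python
import random


def issue_unique(pool_size, count, seed=0):
    """Draw `count` distinct values from `pool_size` candidates.

    Hypothetical stand-in for a context-free generator: it remembers
    every value it has issued and retries whenever it draws a duplicate.
    Returns the issued values and the number of attempts each one took.
    """
    rng = random.Random(seed)
    seen = set()
    attempts_per_value = []
    for _ in range(count):
        attempts = 0
        while True:
            attempts += 1
            candidate = rng.randrange(pool_size)
            if candidate not in seen:
                seen.add(candidate)
                break
        attempts_per_value.append(attempts)
    return seen, attempts_per_value


issued, attempts = issue_unique(pool_size=1000, count=900)
early = sum(attempts[:100])   # cost of the first 100 values
late = sum(attempts[-100:])   # cost of the last 100 values
print(f"attempts for first 100 values: {early}")
print(f"attempts for last 100 values:  {late}")
```

In this model the later values cost several times as many attempts as the early ones, and the record of issued values keeps growing in memory for the whole run.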

Steps to reproduce

Given a development site with a handful of users:

  • Configure GDPR dump to sanitize usernames and email addresses
  • Create a GDPR dump
  • Notice that the speed is OK
  • Generate 80 0000 users (e.g. with devel_generate), or import a database that has a similar number of users (the database I encountered this issue with has that number of users; somewhat fewer will probably also serve to illustrate the point)
  • Repeat the creation of the GDPR dump
  • Notice how this takes a lot of time.

Proposed resolution

Pass row data to anonymizer plugins (or allow a plugin to opt in to receiving it), and create plugins for username and email sanitization that use the new mechanism to derive their values from the user ID.
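The idea of deriving values from the user ID can be sketched as follows. This is a minimal, language-neutral illustration in Python, not the module's code: the function names, the `user_<uid>` pattern, and the `example.com` domain are assumptions chosen for the example. Because user IDs are already unique, the derived values are unique too, without keeping any record of what has been issued.

```python
def sanitise_username(uid: int) -> str:
    """Derive a replacement username from the (already unique) user ID."""
    return f"user_{uid}"


def sanitise_email(uid: int, domain: str = "example.com") -> str:
    """Derive a replacement email address from the user ID."""
    return f"user_{uid}@{domain}"


# Hypothetical rows, shaped loosely like a users table.
rows = [
    {"uid": 1, "name": "alice", "mail": "alice@real.test"},
    {"uid": 2, "name": "bob", "mail": "bob@real.test"},
    {"uid": 42, "name": "carol", "mail": "carol@real.test"},
]

# Each row is sanitised independently of every other row.
sanitised = [
    {**row, "name": sanitise_username(row["uid"]), "mail": sanitise_email(row["uid"])}
    for row in rows
]

for row in sanitised:
    print(row["uid"], row["name"], row["mail"])
```

Each value costs the same to produce regardless of how many users there are, so the total time grows linearly with the number of rows.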

Remaining tasks

Agree on solution.
Create merge request
Review
Merge

User interface changes

New anonymiser plugins will be available.

API changes

We will need a way to pass row context into anonymiser plugins.
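One possible shape for such an API is an opt-in flag on the plugin, so that existing plugins keep working unchanged. The sketch below is in Python purely for illustration; every name in it (`Anonymizer`, `wants_row`, `apply_plugin`, the two example plugins) is hypothetical and does not reflect the module's actual plugin interface.

```python
from typing import Any, Mapping, Optional


class Anonymizer:
    """Hypothetical plugin base class; row context is opt-in."""

    wants_row = False  # plugins set this to True to receive the full row

    def anonymize(self, value: Any, row: Optional[Mapping[str, Any]] = None) -> Any:
        raise NotImplementedError


class RedactAnonymizer(Anonymizer):
    """Context-free plugin: ignores the row entirely."""

    def anonymize(self, value, row=None):
        return "redacted"


class UidEmailAnonymizer(Anonymizer):
    """Context-aware plugin: derives the value from the row's user ID."""

    wants_row = True

    def anonymize(self, value, row=None):
        if row is None:
            raise ValueError("this plugin requires row context")
        return f"user_{row['uid']}@example.com"


def apply_plugin(plugin: Anonymizer, column: str, row: Mapping[str, Any]) -> Any:
    """Call the plugin, passing the row only if the plugin asked for it."""
    context = row if plugin.wants_row else None
    return plugin.anonymize(row[column], context)


row = {"uid": 7, "name": "dave", "mail": "dave@real.test"}
print(apply_plugin(RedactAnonymizer(), "name", row))
print(apply_plugin(UidEmailAnonymizer(), "mail", row))
```

The design choice here is that the caller decides what to pass based on a declared capability, so plugins that never needed context are not affected by the change.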

Data model changes

None.

Feature request
Status

Active

Version

3.0

Component

Code

Created by

eelkeblok 🇳🇱 Netherlands
