Batchify and optimize field scan (dangerous tags in content)

Created on 21 February 2024, 10 months ago
Updated 12 June 2024, 6 months ago

Problem/Motivation

The field scan goes over all text fields and queries every database column in turn to inspect its content (including columns that are not displayed, such as the format column; it is not clear whether this can be prevented without running the risk of missing things). This all happens in a single run. Although I haven't seen this break a scan yet, it can take a long time when there is a lot of content, so the progress reporting added in 📌 Move batch functionality into check plugin (Fixed) could be used to turn this into a batch.

Proposed resolution

Re-implement the scan using the batch mechanism by overriding the run() method of the base plugin. Opportunities to optimize the code a bit may present themselves in the process.

Remaining tasks

  • Create merge request
  • Review
  • Merge

User interface changes

The user will now receive some rudimentary progress reporting during the execution of the field scan.

API changes

None.

Data model changes

None.

📌 Task
Status

Fixed

Version

3.0

Component

Code

Created by

🇳🇱 Netherlands eelkeblok


Merge Requests

Comments & Activities

  • Issue created by @eelkeblok
  • 🇳🇱 Netherlands eelkeblok

    Pushed some work in progress, not functional ATM.

  • Merge request !62 Implement field scan in a batch → (Closed) created by eelkeblok
  • Pipeline finished with Success
    10 months ago
    Total: 238s
    #116357
  • Status changed to Needs review 10 months ago
  • 🇳🇱 Netherlands eelkeblok

    This refactors the field scan into a batch process that handles the fields we want to scan 1000 rows at a time.

    I've combined the database queries so that all columns are fetched at once. The scan now also asks the field processing method whether it would like to scan the ID column, but that seems a small price to pay for the efficiency, since the method returns quickly because the ID is not a text column.
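
    To illustrate the idea outside of Drupal, here is a minimal Python sketch of scanning in slices of 1000 rows while inspecting all columns of a row in one pass. It is not the code from the merge request: the `DANGEROUS` pattern, the function names and the row layout (dicts with an `id` key) are assumptions made up for the example.

```python
import re

BATCH_SIZE = 1000  # rows handled per batch step, as described above

# Hypothetical pattern for illustration only; the module's real rules may differ.
DANGEROUS = re.compile(r"<\s*(script|iframe|object|embed)\b", re.IGNORECASE)


def scan_batch(rows, offset, limit=BATCH_SIZE):
    """Scan one slice of rows; return (findings, next_offset or None)."""
    chunk = rows[offset:offset + limit]
    findings = []
    for row in chunk:
        # All columns of the row are inspected in one pass.
        for column, value in row.items():
            # Non-text values such as the ID are skipped cheaply.
            if isinstance(value, str) and DANGEROUS.search(value):
                findings.append((row["id"], column))
    more = offset + limit < len(rows)
    return findings, (offset + limit if more else None)


def scan_all(rows, limit=BATCH_SIZE):
    """Drive the batches until every row has been scanned."""
    offset, findings = 0, []
    while offset is not None:
        found, offset = scan_batch(rows, offset, limit)
        findings.extend(found)
    return findings
```

    In a real batch each call to something like `scan_batch` would be one step, with the returned offset stored between steps so progress can be reported after every slice.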

    The progress reporting is a bit wonky, as it counts every entity type equally, as well as every field within each entity; the progress is calculated as a simple fraction of the total numbers. This means that an entity without any scannable fields counts as heavily as an entity with many scannable fields. In practice, this makes it quite choppy: it can make huge jumps when it gets a bunch of entities that have nothing of interest, and then seem to get stuck for a while when scanning a text field that has a lot of data (the percentage with a decimal position we added for the individual scan progress does help there). It would be more accurate to find out up front which fields are scannable and how many rows there are to scan, and then keep a grand total of scanned rows. Still, this is a huge improvement for my "site of interest", which has a lot of user-generated content.
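
    The difference between the two ways of measuring progress can be shown with a small Python sketch. The function names and the row counts are invented for the example; only the two formulas (fraction of fields done versus fraction of rows scanned) come from the description above.

```python
def equal_weight_progress(row_counts, fields_done):
    """Current approach: every field counts the same, whatever its size."""
    total = len(row_counts)
    return fields_done / total if total else 1.0


def row_weighted_progress(row_counts, rows_scanned):
    """Suggested approach: progress is the share of all rows scanned so far."""
    total = sum(row_counts)
    return rows_scanned / total if total else 1.0


# Hypothetical site: three fields without rows, then two large text fields.
row_counts = [0, 0, 0, 90_000, 10_000]

# After the three empty fields, the equal-weight figure has already jumped,
# while the row-weighted figure has not moved.
jump = equal_weight_progress(row_counts, 3)
flat = row_weighted_progress(row_counts, 0)

# Halfway through the first large field, the equal-weight figure is stuck
# at the same value, while the row-weighted figure advances steadily.
stuck = equal_weight_progress(row_counts, 3)
steady = row_weighted_progress(row_counts, 45_000)
```

    The row-weighted variant needs the row counts before the scan starts, which is the extra up-front work mentioned above.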

  • 🇳🇱 Netherlands eelkeblok

    BTW, I don't think this is a must-have for 3.0, could easily wait for a 3.1.

  • 🇺🇸 United States smustgrave

    Tested locally and still appears to be functional. Thanks!

  • Status changed to Fixed 7 months ago
  • Automatically closed - issue fixed for 2 weeks with no activity.
