[META] Improve validation and indexing process speed for large data

Created on 26 January 2022, about 3 years ago
Updated 1 June 2023, over 1 year ago

Problem/Motivation

Validating and indexing large data items (35K records) currently takes a very long time. The UI cannot handle data this large at all, because the batch process attempts validation and indexing in one go. That is why I created #3259685: Allow skipping indexing on Dataset save.

If indexing takes 15 minutes, a re-index is likely to take about as long, which means roughly 15 minutes of downtime for any feature that relies on that index.

Proposed resolution

  • To fix validation, move data validation out of the Dataset entity, since validation only changes the status of the entity.
  • Add a new data validation subsystem that runs in batch and communicates the status change back to the entity.
  • Queuing the Dataset entity for indexing gains nothing, because the current functionality indexes only one entity per queue. Queue DatasetData for indexing instead of the Dataset entity.
  • Add an indexing status to DatasetData, just like the migration row status. If a Dataset entity is set to re-index, mark all of its DatasetData for re-indexing and index them in the queue.
  • Change the indexing logic so that re-indexing no longer deletes the whole index. Store a data hash (just like a migration row) in DatasetData so that outdated data can be removed from the index and updated data added to it.
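
The last two points can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the module's actual code: the names DatasetDataRow, IndexStatus, mark_for_reindex and process_queue are hypothetical, and the search index is stood in for by a plain dict. It assumes each DatasetData row keeps an indexing status and the hash of the values last sent to the index, so that a re-index touches only rows whose hash changed and removes only entries that no longer have a matching row.

```python
import hashlib
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class IndexStatus(Enum):
    NEEDS_INDEX = "needs_index"
    INDEXED = "indexed"


@dataclass
class DatasetDataRow:
    """Hypothetical stand-in for one DatasetData record."""
    row_id: str
    values: dict
    indexed_hash: Optional[str] = None  # hash of the values last sent to the index
    status: IndexStatus = IndexStatus.NEEDS_INDEX


def row_hash(values: dict) -> str:
    """Stable hash of a row's values; key order does not affect the result."""
    payload = json.dumps(values, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def mark_for_reindex(rows: list) -> None:
    """Setting the Dataset to re-index flags every one of its rows."""
    for row in rows:
        row.status = IndexStatus.NEEDS_INDEX


def process_queue(rows: list, index: dict) -> dict:
    """Update the index in place instead of deleting and rebuilding it."""
    counts = {"updated": 0, "skipped": 0, "removed": 0}

    # Remove index entries whose source row no longer exists.
    current_ids = {row.row_id for row in rows}
    for stale_id in set(index) - current_ids:
        del index[stale_id]
        counts["removed"] += 1

    # Re-index only the flagged rows whose hash differs from the stored one.
    for row in rows:
        if row.status is not IndexStatus.NEEDS_INDEX:
            continue
        new_hash = row_hash(row.values)
        if new_hash == row.indexed_hash and row.row_id in index:
            counts["skipped"] += 1
        else:
            index[row.row_id] = dict(row.values)
            row.indexed_hash = new_hash
            counts["updated"] += 1
        row.status = IndexStatus.INDEXED
    return counts
```

With this shape, the index stays available during a re-index: unchanged rows are skipped, changed rows are overwritten, and only orphaned entries are deleted. Whether the hash should live on DatasetData itself or in a separate map is an open design choice for the follow-ups.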

Remaining tasks

Discuss, finalize the approach, and open the follow-ups.

User interface changes

None.

API changes

The changes are to the validation and indexing process.

Data model changes

None.

🌱 Plan
Status

Fixed

Version

1.0

Component

Code

Created by

🇨🇦 Canada jibran Toronto, Canada
