Don't store full extracted file content data in the database

Created on 1 October 2019, over 5 years ago
Updated 26 September 2023, over 1 year ago

After running a full migration on a project, I noticed my database went from 200 MB to almost 3 GB.

I ran a query to find the largest tables, and the growth was almost entirely in the key_value table, at 2.6 GB. For every content item whose PDF attachment Solr indexes, the entire extracted text dump is stored in a key_value record, which leads to the ever-increasing size.

This will not scale very well, as just 25,000 items, each with one PDF attachment, created such a large increase in overall size.
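
As a rough back-of-the-envelope check of the figures above, the snippet below works out the average stored text per item and a linear projection. It assumes decimal units (1 GB = 10^9 bytes) and that the whole 2.6 GB is extracted text, which is a simplification; the 100,000-item figure is a hypothetical site size, not one reported here.

```python
# Back-of-the-envelope estimate using the figures reported above.
# Assumptions: decimal units, and the entire key_value table is extracted text.
TABLE_BYTES = 2.6e9   # key_value table size reported above (2.6 GB)
ITEMS = 25_000        # content items, each with one PDF attachment

bytes_per_item = TABLE_BYTES / ITEMS


def projected_gb(item_count: int) -> float:
    """Linearly project the key_value size (in GB) for a given item count."""
    return item_count * bytes_per_item / 1e9


print(f"{bytes_per_item / 1e3:.0f} KB per item")            # 104 KB per item
print(f"{projected_gb(100_000):.1f} GB at 100,000 items")   # 10.4 GB at 100,000 items
```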

I am using the built-in Solr Extractor with this module.

🐛 Bug report
Status

Needs review

Version

9.0

Component

Code

Created by

🇺🇸 United States kevinquillen


Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • #41 - Has this been tested on D10? Is there any plan to release this patch in 9.0? I am currently using the module on D10, and the key_value table is 3 GB+, which is affecting the overall performance of the site.

  • I've added a new feature: the ability to change the output file location.

  • 🇫🇷 France mably

    It could be worth opening a merge request now.

  • 🇫🇷 France izus

    We just need to let users choose whether to keep storing in the database or in files (depending on their needs and possibilities), so an option for that would be great.

  • 🇨🇭 Switzerland berdir Switzerland

    This is only about where to store the extracted text from files for indexing. This is a global, system/admin-level decision; it cannot be per user.

  • 🇫🇷 France izus

    Yes, of course, this is what I meant: have an option to let the admin decide whether to store in the database (as currently) or in the file system (as done by the patch).

  • 🇯🇴 Jordan rahaf albawab Amman

    Reroll #57

  • 🇺🇸 United States lhridley

    @rahaf-albawab Please provide an interdiff of the patch on #57.

  • 🇯🇴 Jordan oways23

    Re-roll patch #57

  • Pipeline finished with Success
    8 days ago
    Total: 158s
    #402783
  • 🇨🇭 Switzerland berdir Switzerland

    This was a rough one to reroll. I'm not sure exactly which branches the rerolls were against, but even the last one had several complicated conflicts for me on 9.0.x due to DI changes. I updated the issue fork and created an MR for it, but haven't tested it at all yet.

    Re #64:
    > yes of course, this is what i meant, have an option to let the admin decide if they want to store on database (as currently) or in the file system (as done by the patch)

    I don't get this. The patch introduces configuration and a UI to configure the desired cache implementation.

    I noticed that there are now 10.0.x tags, but the branch is still 9.0.x. @ixus, note that you do _not_ need to create new major versions just to update the Drupal core requirement. That only needs a minor update.

  • Pipeline finished with Success
    8 days ago
    Total: 156s
    #402801
  • Pipeline finished with Success
    8 days ago
    Total: 150s
    #402866
  • Pipeline finished with Success
    8 days ago
    Total: 159s
    #403108
  • Pipeline finished with Success
    8 days ago
    Total: 151s
    #403183
  • 🇫🇷 France mably

    Tested successfully on Drupal 11.0.10 / PHP 8.3 / Solr 8.11.3.

    Nice work everybody!
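
To illustrate the idea discussed in this thread (one site-wide setting that selects where extracted text is stored, plus a configurable output location for the file variant), here is a minimal Python sketch. It is not the patch's code and does not reflect the module's actual API: every class, function, and key name below is hypothetical, and an in-memory SQLite table merely stands in for the site's key_value table.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path
from typing import Optional, Protocol


class ExtractionCache(Protocol):
    """Hypothetical interface: a store for text extracted from files."""

    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, text: str) -> None: ...


class DatabaseCache:
    """Keeps extracted text in a table (stand-in for key_value storage)."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS extracted "
            "(name TEXT PRIMARY KEY, value TEXT)"
        )

    def get(self, key: str) -> Optional[str]:
        row = self.conn.execute(
            "SELECT value FROM extracted WHERE name = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def set(self, key: str, text: str) -> None:
        self.conn.execute(
            "REPLACE INTO extracted (name, value) VALUES (?, ?)", (key, text)
        )


class FileCache:
    """Keeps extracted text as one file per key in a configurable directory."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        directory.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        # Hash the key so any key maps to a safe file name.
        name = hashlib.sha256(key.encode("utf-8")).hexdigest() + ".txt"
        return self.directory / name

    def get(self, key: str) -> Optional[str]:
        path = self._path(key)
        return path.read_text(encoding="utf-8") if path.exists() else None

    def set(self, key: str, text: str) -> None:
        self._path(key).write_text(text, encoding="utf-8")


def make_cache(backend: str, directory: Path) -> ExtractionCache:
    """Pick the storage from a single global setting, not per user."""
    if backend == "database":
        return DatabaseCache(sqlite3.connect(":memory:"))
    if backend == "files":
        return FileCache(directory)
    raise ValueError(f"Unknown backend: {backend!r}")


# Usage: the admin-level setting selects the backend once for the whole site.
cache = make_cache("files", Path(tempfile.mkdtemp()) / "extracted")
cache.set("file:42", "text extracted from a PDF")
print(cache.get("file:42"))
```

With the file backend, the database only ever needs to hold the setting itself, so its size no longer grows with the amount of extracted text.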
