Download and parse tgz files as a data source

Created on 31 May 2012, about 13 years ago

Updated 27 May 2024, about 1 year ago

Problem/Motivation

If we are going to maintain many more (or all) contrib modules and themes on api.drupal.org, we need a simpler way to manage their files. Currently, we have scripts that maintain our projects/branches as Git clones, but this is time-consuming to set up. Instead, it would be easier if the API module (or a sub-module) could get the files from TGZ downloads. This would work in conjunction with "Grab project list and packages from Drupal database or XML" (Closed: outdated) to make a system that could automatically maintain the project/branch lists from metadata and download the necessary files for the API module to parse.

Proposed resolution

A few things would need to change:

a) File update times are not reliable from TGZ files. So, instead of deciding a given file needs to be reparsed based on the file update time, we would need to switch to using a hash or checksum of the file instead, at least for projects/branches being managed via TGZ files.

b) Ideally, if the TGZ had not been updated at all, we could skip checking the hash/checksum of individual files in the branch when doing a branch update, because we would know that nothing needed to be updated.

c) The API module could probably have a couple of new hooks that would ask "Does this project/branch need to be fully checked?" and "Get me the files for this project/branch". A new submodule could implement these hooks using TGZ files for certain branches. Alternatively, code in the main module could notice that a branch is TGZ-managed and use this method, while regular files branches keep working the old way.

d) Unzipped files would have a time-to-live and could be cleaned up once the API module is done looking at them.
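Points a) and b) can be illustrated with a minimal sketch. It is written in Python purely for illustration, not in the API module's PHP, and the function names, the choice of SHA-256, and the shape of the stored hashes are all assumptions rather than anything the module defines. The idea is to hash the whole archive first and only fall back to per-file hashes when the archive hash has changed.

```python
import hashlib
import io
import tarfile


def sha256_hex(data: bytes) -> str:
    """Hex digest used as the change-detection checksum (hypothetical choice)."""
    return hashlib.sha256(data).hexdigest()


def changed_files(tgz_bytes, previous_archive_hash, previous_file_hashes):
    """Return (archive_hash, file_hashes, paths_to_reparse).

    previous_archive_hash and previous_file_hashes stand in for whatever
    the module would have stored after the last branch update.
    """
    archive_hash = sha256_hex(tgz_bytes)
    if archive_hash == previous_archive_hash:
        # Point b): the archive is byte-identical, so skip per-file checks.
        return archive_hash, dict(previous_file_hashes), []

    # Point a): compare content hashes instead of file modification times.
    file_hashes = {}
    with tarfile.open(fileobj=io.BytesIO(tgz_bytes), mode="r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            file_hashes[member.name] = sha256_hex(archive.extractfile(member).read())

    to_reparse = sorted(
        path
        for path, digest in file_hashes.items()
        if previous_file_hashes.get(path) != digest
    )
    return archive_hash, file_hashes, to_reparse
```

One caveat with this approach: a gzip header carries its own timestamp, so a re-packaged archive with identical contents can still produce a different archive hash. In that case the sketch simply falls through to the per-file comparison, which then reports nothing to reparse.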

Remaining tasks

TBD

User interface changes

TBD

API changes

TBD

Data model changes

TBD

Original issue report....

Directly using tgz files downloaded from Drupal.org, or anywhere else, will greatly reduce the setup for each project. This is needed for "Grab project list and packages from Drupal database or XML" (Closed: outdated).

This could either be a new branch type, next to files, or changes to the files branch type. Some code will likely be shared.

Two strategies I can think of are:

1. Extract the tgz to the files directory and treat it the same as a files branch.
2. Extract the tgz in memory with http://pear.php.net/package/Archive_Tar/docs/latest/Archive_Tar/Archive_... or equivalent. Store in api_documentation's code column for queued full parsing.

I like the in-memory approach because it avoids the filesystem, and the permissions problems that come with it, entirely. In-memory extraction might take a lot of memory, but we already use a lot of memory on parsing.
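As a rough illustration of the in-memory strategy, here is a minimal sketch using Python's standard tarfile module as a stand-in for Archive_Tar. The function name and the list of file extensions are hypothetical, and the returned dictionary only stands in for what would be stored in api_documentation's code column. Nothing is written to disk.

```python
import io
import tarfile


def read_sources_in_memory(tgz_bytes, extensions=(".php", ".module", ".inc")):
    """Map each matching path in the archive to its raw contents.

    The extensions tuple is an assumption for this sketch; the real set of
    parsed file types is up to the API module.
    """
    sources = {}
    with tarfile.open(fileobj=io.BytesIO(tgz_bytes), mode="r:gz") as archive:
        for member in archive:
            # Skip directories, links, and files the parser would not read.
            if member.isfile() and member.name.endswith(extensions):
                sources[member.name] = archive.extractfile(member).read()
    return sources
```

Because every selected file is held in the dictionary at once, peak memory grows with the size of the project, which is the trade-off mentioned above.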

Localize.drupal.org uses Archive_Tar to extract to the filesystem: http://drupalcode.org/project/l10n_server.git/blob/refs/heads/7.x-1.x:/c....

✨ Feature request

Status

Closed: outdated

Version

1.0

Component

Parser

Created by

🇺🇸United States drumm NY, US

api.drupal.org contrib

Incomplete comments

Comments & Activities

Comment about 1 year ago →
🇪🇸Spain fjgarlin
The new D10 version of the module only needs one input: the git https URL. Optionally you can say which branches you want.

Then it will clone the repo, create a folder for each branch, and then do a git pull automatically on cron to get new files and then reparse them.

I think this issue is now outdated.