Add csv generation and filtering

Created on 19 March 2022, over 2 years ago
Updated 22 April 2023, about 1 year ago

Problem/Motivation

We have some large datasets being returned on https://www.va.gov/ for various tags, and we need to export them to different analysis tools. We also need to filter these large data sets for analysis. This pr provides a csv generator that can be accessed via a button added to the top of dataset tables, as well as 4 filters that allow for winnowing results by field name, entity type, bundle and text string (latter two available only for nodes and paragraphs).

Testing

  1. Go to admin/reports/html_tag_usage and select one of the count items (e.g.: Rich text -> a -> href: count)
  2. On subsequent page, select node entity and visually verify bundle and search string filters appear
  3. Visit an entity listed in the table, and copy a snippet of text from the field that is highlighted as a match in the table
  4. Return to the filter form, enter the name of the field in the Name of the field to filter filter, choose node for entity type, enter the node bundle in the bundle filter, and paste the snippet in the Search by string filter
  5. Click Apply filters and verify your node is returned (may be more than one result if item appears on other pages)
  6. Choose paragraph for the entity type filter
  7. Visually verify 4 filters appear
  8. Choose an entity type that is not node or paragraph and visually verify only two filters appear
✨ Feature request
Status

Needs work

Version

1.0

Component

Code

Created by

πŸ‡ΊπŸ‡ΈUnited States EthanT Sarasota, Florida

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • πŸ‡©πŸ‡ͺGermany FeyP

    Thank you very much for working on this and sorry for the delayed review.

    Adding some kind of filter and export functionality was actually on my roadmap for the module. I think this would be a great new feature.

    I've looked at the patch and I think there is still some work to do until we could commit this. Let me know, if you're still interested to continue working on this or if you'd rather want me to take it from here.

    html_tag_usage.install
    
    +      'bundle' => [
    +        'type' => 'varchar',
    +        'description' => 'Node bundle',
    +        'length' => 50,
    +        'not null' => FALSE,
    +      ],
    

    Testing the patch the bundle filter seems to work only for nodes and the column is NULL for other entity types with bundles, e.g. taxonomy terms. I think we should support entity types without bundles, but also provide this feature for all entity types with bundles, so let's rename the description to "The entity bundle." (could also be "Entity bundle" or "Bundle", but this is for consistency with the other descriptions.

    For entity types without bundles, we could either use the entity type id as the value of the bundle column (which might be favorable, see further below) or use NULL. If we do the former, not null could be TRUE.

    Also, the maximum length of a bundle is only 32 chars (see \Drupal\Core\Entity\EntityTypeInterface::BUNDLE_MAX_LENGTH where this is defined), so let's update the length accordingly.

    Let's move the new bundle field further up in the definition to follow the entity_type field.

    I also wonder, if we should add the bundle field to the primary key. For that, it would have to be not null, though.

    +      'field_text' => [
    +        'type' => 'text',
    +        'description' => 'Field text',
    +        'size' => 'big',
    +        'not null' => FALSE,
    +      ],
    

    For reasons stated above, let's rename this to "The field text."

    I'm not really sure if I like that the field text is duplicated for each entry. This will produce a lot of duplicated data, especially for very large installations. Given that the report should reflect the state at the time of generation, I guess we can't join the field text from the field tables and storing it for the report makes some sense, but maybe we should store the field text once per analyzed delta of a field of an entity in a language and not once per tag/attribute combination. We could add a separate table to do that.

    +function html_tag_usage_update_9001() {
    

    We need to update the schema for the new fields here as well.

    +  $schema = \Drupal::service('database')->schema();
    +  $schema->addField('html_tag_usage', 'bundle', $bundle);
    +  $schema->addField('html_tag_usage', 'field_text', $field_text);
    

    We shouldn't make any assumptions in update hooks, so we should check, if the fields already exist and only add them, if they don't. This can be done using Schema::fieldExists().

    We also need to think about what to do with existing reports. I tried the patch on an existing report and the filters obviously didn't work as expected, because some of the required data was missing.

    We could either:

    1. discard the current report and force users to regenerate it.
    2. egenerate the report.

    The first option would probably be easier to do in an update hook. For the second option I think we would need to add a post update hook.

    Then again, users might not expect their report to be regenerated. We could add a flag for this into state and warn users that they must regenerate the report if they want to use the new feature, but that's maybe a bit overkill.

    Currently I lean towards option 1) and just warn users in the release notes that any existing reports will be removed during the update. Since we're still in beta, that should be acceptable.

    html_tag_usage.routing.yml
    
    +  requirements:
    +    _access: 'TRUE'
    

    It would be great, if we could introduce a separate permission that would allow CSV export and add this as a requirement. If users of the module really didn't want any access checks, they could still give this permission to anonymous users. I don't think we could leave it like that since it would probably count as an information disclosure security vulnerability. Also, users might want more fine-grained control over who would be allowed to export data.

    Another thing that would be great would be, if we could add some basic validation for the route parameters (see Validate route parameters β†’ ).

    +++ b/html_tag_usage.services.yml
    
           - '@datetime.time'
    +      - '@form_builder'
    +      - '@renderer'
    +      - '@request_stack'
    

    I don't think we need to inject the renderer into the analyzer. See further below.

    +  html_tag_usage.filters_form:
    +    class: Drupal\html_tag_usage\Form\FiltersForm
    +    arguments:
    +      - '@entity_type.bundle.info'
    +      - '@entity_type.manager'
    +      - '@request_stack'
    

    I'm not sure why you added the form as a service. As far as I can see we don't need it. If it's just for the injection of the dependencies, the container should be able to handle it just fine without this entry as long as we implement the right interface for the form (\Drupal\Core\DependencyInjection\ContainerInjectionInterface). Your form extends FormBase, which already implements this interface and looking at the code of the form, you're already doing everything that's required. So unless I'm missing something I think this service definition can be removed.

    +      - '@request_stack'
    +
    

    I think we need to remove that last empty line from the services.yml, it violates Drupal Coding Standards.

    src/Analyzer.php
    
    +    $build['#prefix'] = $this->render->render($csv_link);
    

    Instead of using '#prefix' here, couldn't we just add a separate section to the render array? I think something like $build['csv_link'] = $csv_link; should work just fine. With the new permission, we might also want to check for access somewhere around here, so something like $build['csv_link'] = $csv_url->access() ? $csv_link : [];?

    That way we wouldn't need to render the renderable array in the analyzer. For correct bubbling up of cache metadata and using Drupal's caching functionality to its fullest potential, it's always a good idea, if we refrain from rendering something early and let the template layer do the actual rendering.

    +    $build['filters']['#markup'] = $this->render->render($filters_form);
    

    Likewise $filters_form is a render array, so why not just use $build['filters'] = $filters_form;?

    Then we wouldn't need to inject the renderer into the analyzer at all.

    +    if ($entity->getEntityTypeId() === 'node' || $entity->getEntityTypeId() === 'paragraph') {
    +      $bundle = $entity->bundle();
    +    }
    

    I don't see the need to restrict this nice feature only to nodes or paragraphs. Why not check, if the the entity uses bundles? We can use EntityTypeInterface::getBundleEntityType() for that. So something like $bundle = $entity->getEntityType()->getBundleEntityType() ? $entity->bundle() : NULL; should do the trick.

    Maybe just using the entity type id as the bundle id could also be an option? Then we could just use the return value of $entity->bundle() as is and we could also set the field to be not null in the field definition in the schema, which would be required when we want to add it as the primary key.

    src/Controller/CsvController.php
    

    First some general remarks:

    • It looks like some of the query building logic and parts of generating the data array are very similar or even the same as other logic already existing in the service. I wonder, if we should let the service handle the bulk of the programming logic so that we can leverage protected methods for less code duplication?
    • There is no sanitizing of the data to prevent CSV injection. Right now, we're not exporting the field data, which would be the most likely way to exploit this. With the nature of the current data it is somewhat unlikely that this would be exploited, but it is not impossible. So I guess we need to properly sanitize the data or it might be considered a security vulnerability. Especially an exploit via the entity label would be in-scope for a security advisory, when I create a stable release at some point.
    • CSV generation: I wonder if it would be a better idea to use the Serialization API provided by Symfony/Drupal Core to generate the CSV file. To do this, the drupal:serialization module would be an optional dependency that would be required to use the CSV feature. Unfortunately, even though Symfony comes with a CSV encoder/decoder by default, Drupal Core doesn't support CSV out of the box. So we have two options:
      1. We could add the contributed csv_serialization module as another optional dependency that would be required for CSV export. This module provides a CSV encoder that we could use. For the data generation phase, we would only generate an array of the $data arrays. Then we would call $csv_data = $serializer->encode($csv, 'csv'); to generate the CSV file.

      2. If we don't want to require modules outside of Drupal Core, we could just add the Symfony CSV encoder to the container in services.yml. Only then, I would use another name for the format so that we don't conflict with the implementation in contrib:

          html_tag_usage.encoder.csv:
            class: Symfony\Component\Serializer\Encoder\CsvEncoder
            tags:
              - { name: encoder, format: html_tag_usage_csv }
              

        To build the CSV, we would need to use the CsvEncoder, so that we can enable sanitization against CSV injection. Then, it would look something like this:

               $csv_data = $serializer->encode($csv, 'html_tag_usage_csv', [
                 CsvEncoder::ESCAPE_FORMULAS_KEY => TRUE,
               ]);
              

        If you have requirements that would exclude sanitization against CSV injection, I think we could also add a setting for that that is on by
        default, that you could then switch off.

    So much for the general stuff, now on to more concrete remarks:

    'entity_type' => $entity->getEntityType()->getSingularLabel()->__toString(),
    

    getSingularLabel() can return either a TranslatableMarkupInterface or a plain string. For the latter, I think the magic __toString() value wouldn't be available, because there is no object. To make it work for both cases, just remove
    the magic method and cast the return value to string instead.

    'entity' => is_array($entity_label) ? $this->render->render($entity_label) : '',
    

    You might end up with an array here, because you call ->toRenderable() on the link object above. You could just call ->toString() instead, so that you always have a string. That would also remove the need to use/inject the renderer. Also, I don't think we want an empty string, if the label is not an array, because then we're missing the label, if the entity type has no edit form template.

    Another thing I wondered was, if we should really export a link here. If we care for the Url, maybe a separate column for the edit url would be better?

    $response = new Response();
    

    I wonder, if we should use a CacheableResponse instead. This could be tied to the generation timestamp of the report in state. That way, we wouldn't regenerate CSV data, if the data didn't change.

    src/Form/FiltersForm.php
    

    Some general observations:

    I think we could remove ReportController::getDialogContent() and have the route just point to the FilterForm instead. We could then call Analyzer::buildInspectionTable() in FilterForm::buildForm(). Since it already returns a renderable array it should be pretty straight forward. That way, we don't need all the request logic in submitForm() and in the Analyzer and could just pass on the filter values to the Analyzer directly from form state.

    It would be nice, if we used human readable labels in the select field for field name, entity type and bundle. If you think that might be useful, e.g. for fields, we could add the machine name in brackets as a suffix.

    Ideally, we would only display options for selection, that are actually present in the data for a tag/bundle combination.

    Currently, the bundle filter is only available for nodes and paragraphs. If we used Ajax for the bundle filter we could get the bundles for the selected entity type, which would eliminate the need to have different fields for different entity types and we wouldn't have to prepare all possible options in advance.

    In my test, I wasn't offered the option to filter by description field of a Taxonomy term, even though there was data for this field in the table. The reason might be that it is a base field? We should look into making this available.

    I'm not sure, why the "Search by string" filter is only available for nodes and paragraphs. I don't see a reason to restrict this feature to just those two entity types, since I think we collect the data for all entity types anyway.

    If we implement some of these changes, the code in buildForm() will probably look very different, so just a few remarks for now, which I think will also apply then:

        $field_options['all'] = '-- Select --';
    

    We would have to translate this string using $this->t(), but I think we don't even need to add this to the options, if you set '#empty_value' => 'all', for the select field. If you don't like - None - as the string for the option, you could set #empty_option. I would then suggest to use - Select - though, which is sufficiently similar, because that's the default for required fields and will already be translated for most languages.

        $form['#token'] = FALSE;
    

    What does this do? Why is it needed?

    +  public function submitForm(array &$form, FormStateInterface $form_state) {
    +    $path = $form_state->getUserInput()['path'];
    +    if (!empty($form_state->getUserInput()['fields']) && $form_state->getUserInput()['fields'] !== 'all') {
    +      $path .= '&field_name=' . $form_state->getUserInput()['fields'];
    +    }
    +    if (!empty($form_state->getUserInput()['types']) && $form_state->getUserInput()['types'] !== 'all') {
    +      $path .= '&entity_name=' . $form_state->getUserInput()['types'];
    +    }
    +    if (!empty($form_state->getUserInput()['bundle']) && $form_state->getUserInput()['bundle'] !== 'all') {
    +      $path .= '&bundle_name=' . $form_state->getUserInput()['bundle'];
    +    }
    +    if (!empty($form_state->getUserInput()['paragraph_bundle']) && $form_state->getUserInput()['paragraph_bundle'] !== 'all') {
    +      $path .= '&paragraph_bundle_name=' . $form_state->getUserInput()['paragraph_bundle'];
    +    }
    +    if (!empty($form_state->getUserInput()['string_name']) && $form_state->getUserInput()['string_name'] !== '') {
    +      $path .= '&string_name=' . $form_state->getUserInput()['string_name'];
    +    }
    +    $response = new RedirectResponse($path);
    +    $response->send();
    +  }
    

    If we point the route directly at the form we probably don't need most of this logic anymore, but still...

    I wonder why you used FormStateInterface->getUserInput() and chose to build the URL and query string yourself? This might have security implications. It would be better to use FormStateInterface->getValues() or FormStateInterface->getValue().

    Then you could use UrlHelper::buildQuery() to generate a query string from an array of the parameter values keyed by the parameter names and pass that to Url::fromUserInput(). You can use a ? as the first character to indicate that the Url is intended to point at the current page. No need to get the path.

    This should make sure that Url we redirect to is sufficiently "secure".

    For the response object I would use \Drupal\Core\Routing\LocalRedirectResponse since we don't ever want to redirect to an external Url.

    Finally, use FormStateInterface::setResponse() to "execute" the response.

Production build 0.69.0 2024