Adding GIN and GIST indexes to PostgreSQL databases

Created on 30 October 2023, about 1 year ago
Updated 23 May 2024, 6 months ago

Problem/Motivation

The PostgreSQL database has indexes types that are only on PostgreSQL. Those indexes are called GIN and GIST. They add query improvements that are needed by Drupal in Add "json" as core data type to Schema and Database API Needs work and 🐛 [PP-1] Performance issues with path alias generated queries on PostgreSQL Postponed .

Proposed resolution

The comment at #2988018-97: [PP-1] Performance issues with path alias generated queries on PostgreSQL proposes an interesting solution:

Yet another way would be to use an object with array access to denote it is a gist index - but this is a way bigger API change. That said at some point table specifications should move away from ArrayPI and this at least gives us a requirement.

This is what we ended up doing.

This issue is split out to address these competing options on how to allow indexes to still be specified in a driver-agnostic manner, with as little BC-break and as much forward-compatibility as possible.

There has been some bikeshedding in how those PostgreSQL specific indexes should be created, dropped and check if they exist.

Specifically, the core schema API currently provides no way for ancillary data to be communicated about an index, e.g. what Postgres index type should be used.

An alternative proposal is to provide for an additional DB schema spec key, e.g. gists. This would maybe be workable if we were only talking about GIST types, however if you also need GIN then you'd need to support a second key, and it gets rather messy. It also injects Postgres-specific new keys into the spec array shape.

Remaining tasks

Most (all?) of the decision-making centers around the above questions on the method of specifying the type of the index to be created.

Done

User interface changes

None

API changes

New value object to wrap existing index definitions and also convey additional information for db drivers (e.g., Postgres index type.)

Data model changes

This doesn't change the data model but the new config object is backwards-compatible with ArrayPI.

Release notes snippet

Draft release note done.

📌 Task
Status

Needs work

Version

11.0 🔥

Component
PostgreSQL driver 

Last updated about 2 months ago

No maintainer
Created by

🇳🇱Netherlands daffie

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

  • Issue created by @daffie
  • 🇺🇸United States bradjones1 Digital Nomad Life

    Minor typo fix in IS and title.

  • 🇺🇸United States bradjones1 Digital Nomad Life
  • 🇺🇸United States bradjones1 Digital Nomad Life

    A bit of rubber-ducking on the possibility of using an object implementing ArrayAccess for a value object that would allow for additional data on the index (beyond just an array of keys.) As proposed by @alexpott.

    If I understand this properly, the idea is to be able to specify indexes something like (pseudocode):

    $schema[$table]['indexes']['indexName'] = new DbIndexSpec($keys, $postgresIndexType);
    

    If DbIndexSpec implements both ArrayAccess and Traversable, then it could be crafted in a BC-compatible way for any instances where it is foreach()'ed over currently. There is an instance of the Postgres driver doing a typecheck with is_array(), which would not be compatible here, but that code would be changed anyway because this is the whole reason for this conversation.

    Setting the is_array() issue aside, the value object would need to be BC-compatible for nested foreach loops such as found in the mysql driver:

      protected function getNormalizedIndexes(array $spec) {
        $indexes = $spec['indexes'] ?? [];
        foreach ($indexes as $index_name => $index_fields) {
          foreach ($index_fields as $index_key => $index_field) {
            // Get the name of the field from the index specification.
            $field_name = is_array($index_field) ? $index_field[0] : $index_field;
            // Check whether the field is defined in the table specification.
            if (isset($spec['fields'][$field_name])) {
              // Get the MySQL type from the processed field.
              $mysql_field = $this->processField($spec['fields'][$field_name]);
              if (in_array($mysql_field['mysql_type'], $this->mysqlStringTypes)) {
                // Check whether we need to shorten the index.
                if ((!isset($mysql_field['type']) || $mysql_field['type'] != 'varchar_ascii') && (!isset($mysql_field['length']) || $mysql_field['length'] > 191)) {
                  // Limit the index length to 191 characters.
                  $this->shortenIndex($indexes[$index_name][$index_key]);
                }
              }
            }
            else {
              throw new SchemaException("MySQL needs the '$field_name' field specification in order to normalize the '$index_name' index");
            }
          }
        }
        return $indexes;
      }
    

    That said, since there are contrib DB drivers out there, is this just "too much" to try and attempt?

  • 🇺🇸United States bradjones1 Digital Nomad Life

    Got some feedback and background in Slack on "ArrayPI" - the name appears to be a Drupalism (and a fun one at that). There are similar discussions around trying to chip away at particularly complex or DX-unfriendly structures, e.g. #3380145-19: ViewsData should not cache by language . Seems @catch is optimistic about being able to refactor some of these with limited BC impact.

    While poking around the related code I also discovered that an index's "key column specifier" can either be an array of column names OR arrays which contain a column name and prefix length. Database drivers can elect to respect the prefix or just use the full column value. (IOW, they must support the syntax but not necessarily create a functional index as a result.) I _think_ this potentially opens us up to using this array syntax to also express driver-specific options, e.g. index type. I don't love it, but it is potentially better than the BC concerns around a value object.

    So we could rephrase that docblock to say something like:

     * A key column specifier is either a string naming a column or an array of two
     * elements, column name and length, specifying a prefix of the named column.
     * Note that some DBMS drivers may opt to ignore the prefix length configuration
     * and still use the whole field value for the key. Code should therefore not
     * rely on this functionality.
     * A key column specifier may also contain a key, 'driver_options', which is an array
     * keyed by database module name (e.g., 'pgsql') which is an associative array of
     * keys and values to set driver-specific index options. Drivers supporting additional
     * options should respect a value of NULL for the prefix length, if it is unnecessary or
     * incompatible given the driver-specific options.
    

    This shouldn't BC-break any code that accesses the numeric-indexed members.

  • 🇺🇸United States bradjones1 Digital Nomad Life
  • Merge request !5839Postgres - Support GIN and GIST → (Open) created by bradjones1
  • Pipeline finished with Failed
    11 months ago
    Total: 201s
    #64634
  • Pipeline finished with Success
    11 months ago
    #64638
  • Status changed to Needs review 11 months ago
  • 🇺🇸United States bradjones1 Digital Nomad Life

    So I _think_ I might have found a way to implement a value object for the index specification that will allow for this kind of configuration to be conveyed. The only issue would be if a non-core database driver does an explicit is_array() check on the index definitions. This would only pass that kind of very strict (and not very useful) type checking if it changed to is_iterable(), which the value object would pass.

    Given the talk in Slack and on related issues on this subject, I think that should be OK? And we really, really don't seem to have any other way to include this outside of sticking it in a very awkward new schema definition key like indexes_config or something, which is not very ergonomic and shards this information from the index definitions.

  • Pipeline finished with Success
    11 months ago
    #64818
  • 🇺🇸United States bradjones1 Digital Nomad Life
  • 🇺🇸United States bradjones1 Digital Nomad Life

    I've added a draft CR to help explain how this works and start the conversation on how to message this, since it's the first time we're doing a value object inside ArrayPI.

    I think it could be interesting to encapsulate this into the coding standards re: is_array() is is_iterable() on type checks for ArrayPI members.

  • Pipeline finished with Success
    11 months ago
    Total: 690s
    #64822
  • 🇮🇹Italy mondrake 🇮🇹

    A comment on the general approach here if I may.

    I think the idea of having value objects to represent DB objects is a great step forward. Doctrine DBAL has done something similar and dropped usage of arrays to describe DB structures.

    That said, and given this will set a precedent in Drupal DB API, I would suggest:

    1. To have a Schema subfolder where to store both the value object and the enum.
    2. With that, the value object name could easily be Index instead of IndexSpecification.
    3. Why final? The value object could be extended by drivers exactly to cover DB specific features like here.
    4. Testing of DB specifics should be done in the driver extensions of DriverSpecificSchemaTestBase, not in the base class.

  • 🇺🇸United States bradjones1 Digital Nomad Life

    Thanks for the vote of confidence and feedback. On points 1-3 that's the kind of feedback I needed because there is no existing pattern to follow so I was just winging it.

    As for #4, I do think that we need to have testing of this in the base class, to detect compatibility breaks. Basically, we need to ensure that all drivers work when they don't necessarily support the extra data and successfully just treat it as an array. Perhaps annotating the test to state as such could make it clearer why it's there. The alternative would be having a base test that has just the object and the pgsql version has the pgsql-specific semantics.

  • Pipeline finished with Success
    11 months ago
    #65073
  • 🇺🇸United States bradjones1 Digital Nomad Life

    I realize I missed addressing the question about making the value object final. My justification is that the index definition object itself is driver-agnostic; it's the configuration that is specific to each flavor. So there is no legitimate reason to fork this off as PostgresIndex or whatever because it's not a postgres index - it's an index definition with some postgres metadata attached.

  • Status changed to Needs work 11 months ago
  • 🇳🇱Netherlands daffie

    @bradjones1: The solution looks good to me.

    We are missing some CRUD operations with the GIN/GIST indexes. We should also add functionality and testing to:
    - to check that the GIN/GIST index exists;
    - to add GIN/GIST index is created on an existing field;
    - to remove a GIN/GIST index;
    - when we change an existing field we can also add/remove a GIN/GIST index.

    Add some testing that when we create a new table with a GIN/GIST index that the index exists and is a GIN/GIST index. Please create a query with the explain to check that we have created a GIN/GIST index and that the query is using it.

  • Pipeline finished with Failed
    7 months ago
    #157889
  • Pipeline finished with Success
    7 months ago
    Total: 560s
    #157892
  • Status changed to Needs review 7 months ago
  • 🇺🇸United States bradjones1 Digital Nomad Life

    Rebased.

    Re: The requests in #16 -

    to check that the GIN/GIST index exists

    GIN/GIST indexes are cataloged by Postgres similar to other indexes, so checking they exist is already possible. To determine their type, you must dig a little deeper, so there is now an additional column of information about indexes (using pg_get_indexdef()) which we're also using in the new tests.

    to add GIN/GIST index is created on an existing field
    to remove a GIN/GIST index

    Test coverage added.

    when we change an existing field we can also add/remove a GIN/GIST index

    I'm not sure we need to do this, and if we do, this amounts to a new feature for the driver/DBAL overall and nothing specific to GIN/GIST. There is a lot of logic already in ::changeField() but none of it currently touches indexes. In that respect GIN/GiST aren't special, and I think it's incumbent upon the caller to determine if the existing indexes will be compatible. I suggest addressing this in a separate issue. (GIN/GIST do introduce some interesting new considerations around index type support for changing field types, but I still don't think that means we must build a new feature into ::changeField() to get support in.)

    Add some testing that when we create a new table with a GIN/GIST index that the index exists and is a GIN/GIST index.

    This is in the test coverage. Note that we are in a cart-and-horse situation with respect to testing GIN - its test is almost identical to GIST, we just need a supported field type to test with. So I put in a skipped test as a placeholder and once we commit this, we can enable that test coverage with/after JSON data types go in.

  • Pipeline finished with Failed
    7 months ago
    Total: 166s
    #157937
  • Pipeline finished with Canceled
    7 months ago
    Total: 54s
    #157942
  • Pipeline finished with Failed
    7 months ago
    Total: 199s
    #157944
  • Pipeline finished with Success
    7 months ago
    Total: 475s
    #157946
  • Pipeline finished with Success
    7 months ago
    Total: 736s
    #164372
  • Status changed to Needs work 6 months ago
  • 🇳🇱Netherlands daffie

    The testbot is failing on PostgreSQL on your added code.

    The CR needs examples on how the use the added GIN and GIST indexes.

  • Status changed to Needs review 6 months ago
  • 🇺🇸United States bradjones1 Digital Nomad Life

    Ah, how embarassing, I forget you need to manually trigger the PgSQL tests. Fixed with some updated coverage of the introspection method.

    Updating the CR now.

  • Pipeline finished with Canceled
    6 months ago
    Total: 521s
    #167594
  • Status changed to Needs work 6 months ago
  • 🇺🇸United States bradjones1 Digital Nomad Life

    Missed that there are also new MR comments.

  • Pipeline finished with Success
    6 months ago
    Total: 708s
    #167604
  • 🇸🇰Slovakia poker10

    Thanks for great work @bradjones1! I have added some other comments to the MR.

  • 🇮🇹Italy mondrake 🇮🇹

    Some comments inline.

  • Pipeline finished with Failed
    6 months ago
    Total: 175s
    #167781
  • Pipeline finished with Failed
    6 months ago
    Total: 176s
    #167795
  • Pipeline finished with Failed
    6 months ago
    Total: 181s
    #167806
  • Pipeline finished with Success
    6 months ago
    Total: 725s
    #167810
  • Status changed to Needs review 6 months ago
  • 🇺🇸United States bradjones1 Digital Nomad Life

    Replied to MR comments and made a few changes as a result. Back to NR.

  • Pipeline finished with Success
    6 months ago
    Total: 3103s
    #168371
  • Pipeline finished with Success
    6 months ago
    Total: 568s
    #168396
  • Pipeline finished with Success
    6 months ago
    Total: 708s
    #169135
  • 🇺🇸United States mradcliffe USA

    I think this pretty good to commit. Everything makes sense based on the changes from daffie's review.

    I updated the change record.

    I think we can update the issue summary with the proposed resolution now. Maybe moving some of the initial resolution into the problem/motivation?

  • 🇺🇸United States bradjones1 Digital Nomad Life
  • 🇺🇸United States bradjones1 Digital Nomad Life
  • 🇺🇸United States bradjones1 Digital Nomad Life

    @mradcliffe would you mind RTBC'ing per your review?

  • Status changed to RTBC 6 months ago
  • 🇺🇸United States mradcliffe USA

    Agreed. We'll see if we need to do anything else :-)

  • Status changed to Needs work 6 months ago
  • 🇳🇿New Zealand quietone

    I read the Issue Summary, the CR and then the MR. I reviewed the comments and the MR. I left comments in the MR and suggestions. At least one of the suggestions is incomplete, that is it still needs work.

    The change record reads well but I was surprised to find that new enum is only mentioned at the end. I think that is important information and should be at the start. There is a code example for the 'after' situation. Can one be added for the 'before' case? And I don't follow this sentence, "The only backwards-compatibility break would be if a driver is explicitly checking the index definition with is_array(), which is currently unnecessary.' It is not clear to me when this is unnecessary, is it before this change or after. Maybe it is better to change this to something like this, "Sites that are using a driver that is explicitly checking the index definition with is_array() will need to ....'..

Production build 0.71.5 2024