Optimize joins and table selection in SQL entity query implementation

Created on 2 May 2017, over 8 years ago
Updated 2 March 2023, over 2 years ago

Problem/Motivation

Quite a long time ago, before Drupal 8.0, we made the decision to make Views more intelligent about which entity table to select from by default. Before, we always started of the base table and then almost always joined the data table if it exists, because that's where all the data is (except the UUID).

Unfortunately, we never updated entity query accordingly and it is actually far worse there:

* It always picks the base table first, that was the same as views
* It always adds a join as soon as you start to add conditions. *even* if that happens to be the UUID, then it just joins itself or an entity type without a data table.
* It always joins other entity types through the base table and then adds yet another join even if they have no data table. Because reasons.

Proposed resolution

This is an attempt at making things more sane and possibly considerably more performant.

Just like views, it starts of at the data table if possible, I'm also avoiding the duplicated join by initializing the entity tables (really, joins) mapping table. And I'm trying to improve actual joining as well by skipping the initial join as well.

I've only tested it with a simple example queries, didn't even run the tests yet, I just wanted to start this before I forget about this again (Had the idea of doing this probably a few times already).

My test script:


echo "\nQuery 1:\n";
$ids = \Drupal::entityQuery('node')
  ->execute();
print_r($ids);

echo "\nQuery 2:\n";
$ids = \Drupal::entityQuery('node')
  ->condition('title', 'First node')
  ->execute();
print_r($ids);

echo "\nQuery 3:\n";
$ids = \Drupal::entityQuery('node')
  ->condition('uuid', '20478baa-64e4-4b01-bf68-5ea34e3db78b')
  ->execute();
print_r($ids);

echo "\nQuery 4:\n";
$ids = \Drupal::entityQuery('node')
  ->condition('uid.entity.uid', 1)
  ->execute();
print_r($ids);

(conditions only work on my specific nodes of course). I also added a 'echo $this->sqlQuery . "\n";' so I can see the generated SQL query.

HEAD:

Query 1:
SELECT base_table.vid AS vid, base_table.nid AS nid
FROM 
{node} base_table
Array
(
    [1] => 1
    [2] => 2
    [3] => 3
    [4] => 4
)

Query 2:
SELECT base_table.vid AS vid, base_table.nid AS nid
FROM 
{node} base_table
INNER JOIN {node_field_data} node_field_data ON node_field_data.nid = base_table.nid
WHERE node_field_data.title LIKE :db_condition_placeholder_0 ESCAPE '\\'
Array
(
    [3] => 3
)

Query 3:
SELECT base_table.vid AS vid, base_table.nid AS nid
FROM 
{node} base_table
INNER JOIN {node} node ON node.nid = base_table.nid
WHERE node.uuid LIKE :db_condition_placeholder_0 ESCAPE '\\'
Array
(
    [1] => 1
)

Query 4:
SELECT base_table.vid AS vid, base_table.nid AS nid
FROM 
{node} base_table
INNER JOIN {node_field_data} node_field_data ON node_field_data.nid = base_table.nid
LEFT OUTER JOIN {users} users ON users.uid = node_field_data.uid
INNER JOIN {users_field_data} users_field_data ON users_field_data.uid = users.uid
WHERE users_field_data.uid = :db_condition_placeholder_0
Array
(
    [4] => 4
)

With my patch.

Query 1:
SELECT base_table.vid AS vid, base_table.nid AS nid
FROM 
{node_field_data} base_table
Array
(
    [1] => 1
    [2] => 2
    [3] => 3
    [4] => 4
)

Query 2:
SELECT base_table.vid AS vid, base_table.nid AS nid
FROM 
{node_field_data} base_table
WHERE base_table.title LIKE :db_condition_placeholder_0 ESCAPE '\\'
Array
(
    [3] => 3
)

Query 3:
SELECT base_table.vid AS vid, base_table.nid AS nid
FROM 
{node_field_data} base_table
INNER JOIN {node} node ON node.nid = base_table.nid
WHERE node.uuid LIKE :db_condition_placeholder_0 ESCAPE '\\'
Array
(
    [1] => 1
)

Query 4:
SELECT base_table.vid AS vid, base_table.nid AS nid
FROM 
{node_field_data} base_table
INNER JOIN {users_field_data} users_field_data ON users_field_data.uid = base_table.uid
WHERE users_field_data.uid = :db_condition_placeholder_0
Array
(
    [4] => 4
)

Comparison:
1. Very similar except that my select is on the node_field_data table.
2. No join necessary with my version.
3. This actually requires a join in the new implementation, and this query is currently unfortunately very common (loadByUuid()). But the current implementation joins itself which is likely not much better?
4. A single join instead of *3*

Remaining tasks

* Figure out things that dont' work yet, pretty sure that translations will be one such case as I'm possibly skipping the language condition now in at least some cases on the base table.
* Is this an API change or not?
* Answer #39
* Answer #41

User interface changes

API changes

Generated queries change, which might break query alters?

Data model changes

๐Ÿ“Œ Task
Status

Needs work

Version

10.1 โœจ

Component
Entityย  โ†’

Last updated about 4 hours ago

  • Maintained by
  • ๐Ÿ‡ฌ๐Ÿ‡งUnited Kingdom @catch
  • ๐Ÿ‡จ๐Ÿ‡ญSwitzerland @berdir
  • ๐Ÿ‡ฉ๐Ÿ‡ชGermany @hchonov
Created by

๐Ÿ‡จ๐Ÿ‡ญSwitzerland berdir Switzerland

Live updates comments and jobs are added and updated live.
  • Performance

    It affects performance. It is often combined with the Needs profiling tag.

Sign in to follow issues

Merge Requests

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • ๐Ÿ‡บ๐Ÿ‡ธUnited States smustgrave

    Seems there are still some open questions to answer before review.

    #39 and #41 should be answered (added to remaining tasks)

  • ๐Ÿ‡ท๐Ÿ‡บRussia Chi

    Faced this issue with "single-table" entity type. Entity query joined base table to itself which caused bad performance. Patch #16 works well on Drupal 10.0.

  • ๐Ÿ‡ท๐Ÿ‡บRussia Chi

    Patch #16 works well on Drupal 10.0.

    Actually it does not. EFQ with entity references produces wrong SQL join. See comment #30.

  • ๐Ÿ‡ณ๐Ÿ‡ฑNetherlands spadxiii

    We have been using the mr in #37 for a while, but there are some issues with it: when using multiple conditions on the same column in an entity-query, the same table is joined several times.

    I've fixed this with another if-statement in the patch that checks if the table is already joined (with the same type).

  • @spadxiii Please make the change to the merge request, rather than submitting a patch.

  • ๐Ÿ‡ณ๐Ÿ‡ฑNetherlands spadxiii

    I seem to have attached the wrong patch. Here's the correct one, that works.

  • ๐Ÿ‡ณ๐Ÿ‡ฑNetherlands spadxiii

    @solideogloria the mr is quite old and needs to be rebased :\

    and when I push, I get an error that: remote: You are not allowed to push code to this project.
    So I cannot update the mr.

  • You have to click the "Get Push Access" button at the top of this page.

  • First commit to issue fork.
  • Pipeline finished with Failed
    11 months ago
    Total: 698s
    #293174
  • ๐Ÿ‡ฎ๐Ÿ‡ณIndia arunkumark Coimbatore

    arunkumark โ†’ changed the visibility of the branch 2875033-optimize-joins-and to hidden.

  • ๐Ÿ‡ฎ๐Ÿ‡ณIndia arunkumark Coimbatore

    arunkumark โ†’ changed the visibility of the branch 2875033-optimize-joins-and to hidden.

  • ๐Ÿ‡ฎ๐Ÿ‡ณIndia arunkumark Coimbatore
  • ๐Ÿ‡ฎ๐Ÿ‡ณIndia arunkumark Coimbatore
  • ๐Ÿ‡ฎ๐Ÿ‡ณIndia mrinalini9 New Delhi

    Hi,

    I have tried to create MR for the changes mentioned in patch #48 but was unable to do so because the MR points to branch 9.5.x instead of 11.x. Also, I have tried to create a new branch from 11.x but getting the below error:

    Thanks & Regards,
    Mrinalini

  • First commit to issue fork.
  • Merge request !10783Optimize joins โ†’ (Open) created by ptmkenny
  • ๐Ÿ‡ฏ๐Ÿ‡ตJapan ptmkenny

    To run the tests, I created an MR of patch #48.

  • Pipeline finished with Failed
    8 months ago
    Total: 481s
    #384838
  • ๐Ÿ‡ซ๐Ÿ‡ทFrance Nixou Toulon

    Thanks for this !

    Attach is the patch from #48 (2875033-46.patch) rerolled for Drupal 10.3.x and 10.4.x

  • The changes need to be applied to the merge request.

  • ๐Ÿ‡บ๐Ÿ‡ธUnited States pwolanin

    patch #61 is failing for my colleague when filtering with jsonapi on the value of a referenced entity referenced by the main entity.

    It's writing the WHERE clause such that it's filtering the main entity to the node ID of the referenced entity.

    example:

    • main entity house.
    • house references a city node
    • city references a state node

    if I filter houses in jsonapi by state, the SQL where clause is filtering the house node ID by the state node ID.

  • ๐Ÿ‡บ๐Ÿ‡ฆUkraine HitchShock Ukraine

    Hi all.
    First of all, I want to thank everyone who is working on this task. It solves the performance problem for big data entities in certain cases.

    But I also found a possible way to make it better for some scenarios.
    We can remove `$type === 'INNER'` from the condition.
    If the table is the same, then it doesn't matter which type of join is used. Anyway, the same table will be used.
    Removing this condition can be useful for big data queries with base fields of the entity, which are stored in the same table if the data table is the same as the base table for an entity.

    For example,
    - we have a `custom_entity`
    - we send a simple entity query to get IDs sorted by uuid
    - the default query will be

    SELECT base_table.id AS id, base_table.id AS base_table_id, custom_entity.uuid AS uuid
    FROM
    custom_entity base_table
    LEFT JOIN custom_entity custom_entity ON custom_entity.id = base_table.id
    ORDER BY custom_entity.uuid ASC
    LIMIT 20 OFFSET 0
    

    What is the problem? If custom_entity is a big data entity, then we are trying to join big data to big data, which will take much more time than without 'join'. This impacts performance a lot

    If we remove `$type === 'INNER'` part of the condition, the issue will be solved, because the query will be generated like

    SELECT base_table.id AS id, base_table.id AS base_table_id, base_table.uuid AS uuid
    FROM
    custom_entity base_table
    ORDER BY base_table.uuid ASC
    LIMIT 20 OFFSET 0
    

    I added a hidden patch with this fix.

    P.S. Please let me know if my opinion is correct or if it has obvious flaws in the context of Drupal core

  • First commit to issue fork.
  • ๐Ÿ‡ฌ๐Ÿ‡งUnited Kingdom Driskell

    I've attempted to work on this a little and produced the following based on 11.x:
    https://git.drupalcode.org/issue/drupal-2875033/-/commit/41ac87795f50465...

    (Up front disclosure - I tested this as a patch on 10.4.x and this is just a cherry pick for analysis)

    Here is an outline:
    * For testing, a clone of a Query will now properly clone the conditions and update the attached query on the Conditions. Happy to take this out of this issue into another if it's desired. Essentially at the moment __toString on a Query is going to work with Conditions that still refer to the original Query. It causes problems in debugging mostly.
    * getTables on Query was creating a new tables instance, not returning the existing property value. This meant tables added via separate Condition instances (such as nested in an AND or an OR) were not deduplicated - they were effectively creating new joins for every branch. This change keeps getTables for BC but it should be now unused, and it introduces getSqlTables that returns the same object after creating it so all conditions share the same instance, preventing attaching the same table twice
    * Specifying different langcode in conditions for a shared table field previously could cause issues as it would reuse the same table join even though it would refer to a different langcode - now it joins separately for each langcode
    * When following a reference field, the tables were never shared at all by addNextBaseTable, and this introduces a nextBaseTables property to track and reuse these
    * When there is a data table, we don't just forcefully set simple_query to FALSE. Warrants some eyes and testing but it feels this is unnecessary as the table joining code that attaches the data table already sets it to FALSE at the point the table might get attached. In this change it actually only sets simple_query to FALSE if the join of a data table is not going to use a langcode. If a langcode is used, then the data table will provide a single row, simplifying the query.
    * Added ability to specify langcode as "__default__" to make it join the data table on default_langcode=1 with simple_query remaining. For me this looks to massively improve query performance and allow us to add indexes in several places on the data table for massive gain. The API here might warrant discussion.

    A note worth mentioning is that if a table joins as INNER JOIN and the next request is for a LEFT JOIN it will not change it. It will remain as INNER. This is how it was previously when deduplicating of table join was working. This seems fine to me as if it's INNER there's a restriction so loosening it serves no purpose. For the inverse, where it is initially added as LEFT or LEFT OUTER, it also won't change it to INNER if a request comes in. It seems fine to me this as I think from what I can see every join that uses INNER will have a WHERE addition anyway, so the fact it's LEFT doesn't make much difference. It could be "nice" to clean it up to INNER but then you're modifying already added tables in the query and it feels the API change stretches too far.

    Generally it looks like the improvements from this experiment are:
    * Tables are no longer duplicated in all cases (from what I can see)
    * Where multiple shared data table fields are used in conditions, indexes on the data table can now get properly used as they use the same table join, and the optimiser can kick in
    * Grouping is now completely avoided for basic queries that don't involve the shared data table
    * Sorting by a shared data field that is untranslated can also be massively sped up with a quick index addition and specifying the langcode "__default__" as this ensures it keeps simple query and works on the default langcode, which inherently will have the untranslated value for that field. For me personally this is what I was trying to get to to resolve some slow queries.

    Would be great to get some feedback while I do some further testing and look at adding some tests etc.

Production build 0.71.5 2024