Missing or cross-page duplicate content via Entity Query

Created on 5 September 2017, over 7 years ago
Updated 18 December 2023, about 1 year ago

Problem/Motivation

It is possible to build a seemingly straightforward Entity Query which, if used to "page" through all items, will randomly exclude some nodes, and randomly duplicate others.

Pulling together the elements of this query from the different layers of a class hierarchy, it looks a bit like this:

$query = $this->entityTypeManager->getStorage('node')->getQuery()
     ->addTag('modulename_pluginname')
    ->condition('type', 'article')
    ->condition('status', 1)
    ->sort('changed', 'DESC')
    ->range($offset, $limit);

if (!empty($title)) {
  $query->condition('title', $this->request->get('titleContains'), 'CONTAINS');
}    

This generates queries such as:

SELECT base_table.vid AS vid, base_table.nid AS nid, max(node_field_data_2.changed) AS expression
    FROM @node base_table
    INNER JOIN @node_field_data node_field_data ON node_field_data.nid = base_table.nid
    LEFT JOIN @node_field_data node_field_data_2 ON node_field_data_2.nid = base_table.nid
    WHERE
        (node_field_data.type = :db_condition_placeholder_0)
        AND (node_field_data.status = :db_condition_placeholder_1)
    GROUP BY base_table.vid, base_table.nid
    ORDER BY expression DESC
    LIMIT 100
    OFFSET 0

Note that for purposes of my query, revisions are irrelevant.

This query powers a customized REST plugin to generate lists of nodes. A script crawling through the entire "collection" by incrementing the offset and checking the data found this output based on a title string search to focus on a more easily measured subset of the data:

INFO :: Searching for 'X1234'
Node 21403 "X1234: Title A": 1 instances.
    ----===> Position 64 on Page 1
Node 20677 "X1234: Title B": 2 instances.
    ----===> Position 24 on Page 9
    ----===> Position 10 on Page 10
Node 20490 "X1234: Title C": 2 instances.
    ----===> Position 33 on Page 11
    ----===> Position 8 on Page 12
Node 19680 "X1234: Title D": 1 instances.
    ----===> Position 33 on Page 20
Node 19554 "X1234: Title E": 2 instances.
    ----===> Position 90 on Page 20
    ----===> Position 53 on Page 21
Node 19283 "X1234: Title F": 1 instances.
    ----===> Position 86 on Page 23
Node 18919 "X1234: Title G": 2 instances.
    ----===> Position 88 on Page 27
    ----===> Position 31 on Page 28
Node 18397 "X1234: Title H": 1 instances.
    ----===> Position 5 on Page 33
Node 17989 "X1234: Title I": 1 instances.
    ----===> Position 54 on Page 37

--------------------
MISSING ITEMS
=> Node 18483 "X1234: Title J": NOT FOUND.
=> Node 19191 "X1234: Title K": NOT FOUND.

Note that missing items were identified by separately querying the database for all nodes with the matching "X1234" string, which would be expected in this query as published items.

When I remove the ordering condition by simply deleting the sort, it changes the query to:

SELECT base_table.vid AS vid, base_table.nid AS ned
    FROM @node base_table
    INNER JOIN @node_field_data node_field_data ON node_field_data.nid = base_table.nid
    WHERE
        (node_field_data.type = 'cruise_offer')
        AND (node_field_data.status = '1')
    GROUP BY base_table.vid, base_table.nid
    LIMIT 100
    OFFSET 0

After making that change, I tested my crawling script against the resulting resource 3 times, and had another developer cross-check me 3 more times on their machine. (In fact, the developer that suggested removing ->sort() might help). While this problem is erratic, I had not seen it vanish completely before, let alone 6 times in a row on two environments.

As a data integrity problem driven by straightforward developer usage if the API, this seems like it should be a major. Leaving that for the experts to decide. While this is filed in 8.2, since I could not find anything meaningful in the issue queue about this it is reasonable to assume it remains a problem and should be tested against later minor releases.

Proposed resolution

Correct how the sort is applied in Entity Queries such that data consistency is restored.

Remaining tasks

Create a test to demonstrate this bug.
Identify a path forward.

User interface changes

N/A

API changes

The query result of adding a sort might change.

Data model changes

N/A

πŸ› Bug report
Status

Active

Version

11.0 πŸ”₯

Component
EntityΒ  β†’

Last updated about 6 hours ago

Created by

πŸ‡ΊπŸ‡ΈUnited States Grayside

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • πŸ‡ΊπŸ‡¦Ukraine vlad.dancer Kyiv

    I have the same problem (Drupal 9.5.11). Thank you @Grayside for identifying it!
    For the moment I removed sorting on paged query to avoid duplicates.

Production build 0.71.5 2024