Incremental builds and fastbuilds include data for unnecessary nodes

Created on 3 February 2021, almost 4 years ago
Updated 4 May 2023, over 1 year ago

Problem/Motivation

When using gatsby_instantpreview to do builds, or when using gatsby_fastbuilds with log_published enabled, the buildRelationshipJson function does a recursive crawl through all related entities that the module is configured to send to Preview/Build. This can become problematic in sites which make heavy use of entity references. In the worst case, where content is highly connected, it's possible for a single-node update to include hundreds or even thousands of entities alongside the entity which is actually being updated. This is harmful to both Drupal performance and build performance.
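To illustrate how a recursive crawl fans out, here is a minimal sketch in Python. It is a toy model, not the module's actual PHP code: the entity IDs, the `REFERENCES` graph, and the `collect_related` helper are all hypothetical and exist only to show the shape of the problem.

```python
# Toy entity graph (hypothetical IDs): each key is an entity, each value
# lists the entities it references through entity reference fields.
REFERENCES = {
    "node:1": ["term:10", "media:20"],
    "term:10": ["node:2", "node:3"],
    "node:2": ["term:10", "media:21"],
    "node:3": ["media:22"],
    "media:20": [],
    "media:21": [],
    "media:22": [],
}


def collect_related(entity_id, references, seen=None):
    """Return the set of all entities reachable from entity_id, itself included."""
    if seen is None:
        seen = set()
    if entity_id in seen:
        # Already visited; this also guards against reference cycles.
        return seen
    seen.add(entity_id)
    for target in references.get(entity_id, []):
        collect_related(target, references, seen)
    return seen


# Updating only node:1 still pulls in every connected entity.
payload = collect_related("node:1", REFERENCES)
print(sorted(payload))
```

In this small graph, a single update to `node:1` drags in all seven entities, because the taxonomy term it references points back to other nodes.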

This may sound like a very theoretical issue, but just recently we saw an incremental build take 5-10 times longer than normal, and upon looking into it we noticed that over 100 entities were included when in practice, just the node would've been sufficient. But the node has a taxonomy reference, and the taxonomy term has other references, and before long we're sending a large amount of data which hasn't changed since Gatsby last heard from Drupal.

From what I can tell, the inclusion of related entities is just a way to get certain entities to Gatsby without triggering a build for each of them. If you create 5 taxonomy terms and upload 3 images in preparation for creating a node, there's probably no benefit in sending those to Gatsby yet. Holding them back until they're needed can substantially cut down on the number of superfluous builds, which is why they are included when sending over a node that uses them.

Steps to reproduce

  1. Enable gatsby_instantpreview and gatsby_fastbuilds
  2. Create a content type with an entity reference field to other nodes
  3. Create a node A which references another node B
  4. Check /gatsby-fastbuilds/sync/[recent timestamp] and confirm that the insert log for node A also includes the data for node B

Proposed resolution

Avoid sending entities which Gatsby is already aware of unless they've actually changed. We don't want to actually track the entities which have been sent (several Gatsby builds could be pulling data from the same Drupal instance), but if it's possible to know which entities Gatsby should already be aware of, we can avoid sending those.

As it's currently set up, in order to reach the recursive buildRelationshipJson function, the original entity being created/updated/deleted must be a node. While we can't easily be certain that an arbitrary entity has been sent to Gatsby, it is safe to assume that all published nodes have been sent. So when traversing entity relationships to include things Gatsby isn't yet aware of, if the referenced entity is a node, just skip it.
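The proposed rule can be sketched as follows. This is a toy Python model under stated assumptions, not a patch for the module: the `REFERENCES` graph, the `"node:"` ID prefix convention, and the `collect_related_skipping_nodes` helper are hypothetical, and the sketch simply assumes every referenced node is published and already known to Gatsby.

```python
# Toy entity graph (hypothetical IDs): keys reference the listed entities.
REFERENCES = {
    "node:1": ["term:10", "media:20"],
    "term:10": ["node:2", "node:3"],
    "node:2": ["term:10", "media:21"],
    "node:3": ["media:22"],
    "media:20": [],
    "media:21": [],
    "media:22": [],
}


def collect_related_skipping_nodes(root_id, references):
    """Collect entities reachable from root_id without following referenced nodes.

    The root is always included. Any referenced entity whose ID starts with
    "node:" is skipped, on the assumption that Gatsby already has it.
    """
    included = {root_id}
    stack = [root_id]
    while stack:
        current = stack.pop()
        for target in references.get(current, []):
            if target.startswith("node:"):
                # Assumed to be a published node Gatsby has already received.
                continue
            if target not in included:
                included.add(target)
                stack.append(target)
    return included


trimmed = collect_related_skipping_nodes("node:1", REFERENCES)
print(sorted(trimmed))
```

For this graph the payload shrinks to `node:1`, `term:10`, and `media:20`, whereas following every reference would reach all seven entities. Chains that run only through non-node entities are still followed in full, which matches the caveat that this is not a comprehensive fix.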

To be clear, this is not a comprehensive fix. If other entity types are extensively interconnected, it's possible to bring in a lot of unnecessary data. But this feels like a straightforward improvement which can substantially cut down on sending unnecessary entities.

Remaining tasks

Create patch and/or discuss.

πŸ› Bug report
Status

Closed: outdated

Version

2.0

Component

Code

Created by

🇺🇸 United States iansholtys

Comments & Activities

  • 🇺🇸 United States apmsooner

    Version 2 requires fastBuilds to be set in gatsby-source-drupal, so Drupal no longer sends whole entity JSON objects to the front end. Instead, it pings the front end, which then requests recent logs via the provided API. Essentially, the approach changed from push to pull.