When we have a huge number of nodes, the cron job does not work (504 Gateway Timeout)

Created on 14 July 2020, over 4 years ago
Updated 20 June 2023, over 1 year ago

Repeatable: Always

Steps to repeat:

  • We have more than 1 million nodes
  • /admin/config/system/google-analytics-counter (these settings are sketched in code after this list)
    • Minimum time to wait before fetching Google Analytics data (in minutes) = 20
    • Number of items to fetch from Google Analytics in one request = 1000
    • Maximum GA API requests per day = 50000
    • Google Analytics query cache (in hours) = 24
    • Queue Time (in seconds) = 600
  • After I run cron, I get a 504 Gateway Timeout error
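
For reference, a minimal sketch of applying the settings above programmatically through Drupal's config API. The config object name and every key name below are assumptions for illustration, not verified against the module:

    <?php
    // Hypothetical config keys; check the module's settings form for
    // the real names before relying on this.
    \Drupal::configFactory()
      ->getEditable('google_analytics_counter.settings')
      ->set('cron_interval', 20)     // minutes to wait before fetching
      ->set('chunk_to_fetch', 1000)  // items to fetch per GA request
      ->set('api_dayquota', 50000)   // maximum GA API requests per day
      ->set('cache_length', 24)      // query cache, in hours
      ->set('queue_time', 600)       // queue time, in seconds
      ->save();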

Expected Results:
Page views are fetched and the counter field is updated.

Actual Results:
Cron does not run successfully and does not pull the data.
504 Gateway Timeout error

πŸ› Bug report
Status

Fixed

Version

4.0

Component

Code

Created by

🇯🇴 Jordan ammar qala


Comments & Activities


  • Status changed to Needs work over 1 year ago
  • 🇸🇰 Slovakia kaszarobert

    The cause of these timeouts is that the module refreshes pageviews for every node. The module does the processing as follows (a sketch of the non-scalable step appears after this list):

    1. Cron starts
    2. First, if there are no queue items left, it queries the number of URLs from GA4
    3. Based on the chunk size you have set (default is 1000), it creates as many queue items as needed to cover every result, one chunk per queue item. For example, if there are 2500 URLs total, it creates 3 queue items for 3 chunks: 1-1000, 1001-2000, 2001-2500
    4. Then comes the currently non-scalable code: it creates a queue item for every published node on the site. With a huge number of nodes this can absolutely lead to a timeout, because the whole queue is built in a single process and that one process times out.
    5. Queue processing starts
    6. The 3 queue items created earlier update the URL-pageview counts in the database. We save that information to the google_analytics_counter table.
    7. Next come the queue items that collect the content URL and URL alias for each and every node. These should only run once all 2500 URLs have been collected from GA; otherwise the wrong numbers will be calculated from stale data.
    8. Queue processing ends
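
    To make step 4 concrete, here is a minimal sketch of that pattern, assuming Drupal's entity query and Queue APIs; the queue name and the item shape are illustrative, not the module's actual code:

        <?php
        // One queue item per published node: with 1m+ nodes this loop
        // issues over a million database writes inside a single cron
        // request, which is what times out.
        $nids = \Drupal::entityQuery('node')
          ->condition('status', 1)
          ->accessCheck(FALSE)
          ->execute();

        // Hypothetical queue name and item structure.
        $queue = \Drupal::queue('google_analytics_counter_worker');
        foreach ($nids as $nid) {
          $queue->createItem(['type' => 'count', 'nid' => $nid]);
        }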

    I pushed one possible solution for the timeouts to branch 4.0.x, with 2 settings that let you limit the size of the queue (sketched after the list):
    - Update pageviews for content created in the last X days
    - Update pageviews for the last X content
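
    A sketch of how those two limits could bound the node query before the queue is built; the setting keys here are illustrative, not necessarily the names used in the branch:

        <?php
        // Bound the number of nodes that get queued (hypothetical keys).
        $config = \Drupal::config('google_analytics_counter.settings');
        $query = \Drupal::entityQuery('node')
          ->condition('status', 1)
          ->accessCheck(FALSE)
          ->sort('created', 'DESC');

        // "Update pageviews for content created in the last X days".
        if ($days = $config->get('queue_limit_days')) {
          $cutoff = \Drupal::time()->getRequestTime() - $days * 86400;
          $query->condition('created', $cutoff, '>=');
        }
        // "Update pageviews for the last X content".
        if ($max = $config->get('queue_limit_items')) {
          $query->range(0, $max);
        }
        $nids = $query->execute();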

    This will do fine as a hotfix, but the proper solution would be a major rewrite. Approaches I can think of:
    - Don't build the whole queue with millions of items; give the user a setting to limit it. This way, the cron process also needs to reliably track how many nodes were put into the queue and how many still need to be added.
    - Don't use queues at all. Save the last processed nid, and just process the following X nodes at each cron run (sketched after this list).
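
    A sketch of that queue-less idea (the state key and batch size are illustrative): persist the last processed nid with the State API and handle the next slice directly in cron:

        <?php
        /**
         * Implements hook_cron() (queue-less variant, illustrative only).
         */
        function google_analytics_counter_cron() {
          $state = \Drupal::state();
          $last_nid = $state->get('google_analytics_counter.last_nid', 0);

          // Fetch the next X published nodes after the last processed one.
          $nids = \Drupal::entityQuery('node')
            ->condition('status', 1)
            ->condition('nid', $last_nid, '>')
            ->accessCheck(FALSE)
            ->sort('nid')
            ->range(0, 500)
            ->execute();

          foreach ($nids as $nid) {
            // ... refresh the stored pageview count for $nid ...
          }

          // Remember where we stopped; wrap around once every node is done.
          $state->set('google_analytics_counter.last_nid', $nids ? max($nids) : 0);
        }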

    I don't know if we should backport this solution to 8.x-3.x, as July 1st, 2023 is so close: Universal Analytics will stop working then, which makes 8.x-3.x useless, and in fall 2023 Universal Analytics data will no longer be available.

  • Status changed to Fixed over 1 year ago
  • 🇸🇰 Slovakia kaszarobert

    I decided to go with a solution similar to the one I advised back then in ✨ Make google_analytics_counter_cron() faster (Fixed). So from now on, during a cron run:
    - if the queue is not empty, skip and let the queue processors finish processing every queue item
    - if the queue is empty, get the number of URLs from GA4
    - save ceil(number of URLs / chunk setting) "fetch" queue items; for example, 4500 URLs with a 1000 chunk setting means creating 5 "fetch" queue items: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-4500
    - Now comes the change I made: instead of collecting all the published node IDs into separate queue items, we create exactly 1 "count" queue item, for the first node only (the pattern is sketched after this list).
    - Then queue processing starts. When the "count" queue item finishes processing, instead of exiting it creates the next 1 "count" queue item. The queue processor immediately sees that there is still 1 queue item and, if there is time left (the default is 120 seconds), processes it; that in turn creates the next 1 "count" queue item, and so on while there are nodes left to process. That way we don't make an undesirable number of writes to the database and won't hit the PHP process timeout. The downside of this solution is that it can't be run in parallel, because there will only ever be 1 "count" queue item, but I think it has more advantages than disadvantages considering scaling the site.
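
    A simplified sketch of that self-rescheduling pattern as a Drupal QueueWorker plugin; the plugin ID, item shape, and the counting logic are illustrative, not the module's actual implementation:

        <?php

        namespace Drupal\google_analytics_counter\Plugin\QueueWorker;

        use Drupal\Core\Queue\QueueWorkerBase;

        /**
         * Processes one node per "count" item, then enqueues the next one.
         *
         * @QueueWorker(
         *   id = "google_analytics_counter_worker",
         *   title = @Translation("Google Analytics Counter worker"),
         *   cron = {"time" = 120}
         * )
         */
        class GoogleAnalyticsCounterWorker extends QueueWorkerBase {

          public function processItem($data) {
            if ($data['type'] !== 'count') {
              // "fetch" item handling is omitted from this sketch.
              return;
            }
            // ... recalculate and save the pageview count for $data['nid'] ...

            // Find the next published node after the one just processed.
            $next = \Drupal::entityQuery('node')
              ->condition('status', 1)
              ->condition('nid', $data['nid'], '>')
              ->accessCheck(FALSE)
              ->sort('nid')
              ->range(0, 1)
              ->execute();

            if ($next) {
              // Re-enqueue exactly one item: the cron queue runner sees it
              // immediately and keeps processing until its time budget
              // (120 seconds by default) runs out.
              \Drupal::queue('google_analytics_counter_worker')
                ->createItem(['type' => 'count', 'nid' => reset($next)]);
            }
          }

        }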

  • Automatically closed - issue fixed for 2 weeks with no activity.
