Queue race condition leads to inconsistent available key list

Created on 18 October 2021, over 2 years ago
Updated 3 May 2024, about 2 months ago

Problem/Motivation

On our very busy server sometimes "empty" elements remain in the redis queue resulting in troublesome errors (queue workers starting to spit errors due to missing / expected properties.

Steps to reproduce

πŸ€·β€β™‚οΈ ... we couldn't. It seems to happen once in a while.

Proposed resolution

Harden the garbageCollection function for PhpRedis (uncertain about predis though).

πŸ› Bug report
Status

Needs review

Version

1.0

Component

Code

Created by

πŸ‡¦πŸ‡ΉAustria attisan Vienna

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • πŸ‡ΊπŸ‡ΈUnited States neclimdul Houston, TX

    I might be high-jacking this issue but it accurately describes the problem I've been tracking for a while now.

    Quick tangent:

    (queue workers starting to spit errors due to missing / expected properties.

    This actually sounds like πŸ› Use item_id instead of qid for Queue item Needs review which actually happens when queue items are being processed. The lists being inconsistent should trigger claimItem to start returning a lot of false values which clients should interpret as a failed queue fetch and skip processing.

    Theory of the problem

    So I the problem comes down to how the garbage collection gets handled. GC is not atomic and can happen across multiple clients at the same time.

    To set a scenario that might be able to reproduce this lets imagine a queue talking to an API and that API having some downtime. All those requests are failing. If the worker is smart enough to throw a RequeueException, those items would get immediately but if they crash the exception gets logged and the item is left to expire.

    Later the items expire and GC kicks in. Lets imagine 2 workers trigger at close to the same time(they're hot looping claim item so this is likely) and get basically the same list of claimed items. They then loop through checking the leases. They're both going to see them same expired items and remove the claim and importantly _both_ push the item back onto the available items. This is a list not a set so we now have duplicate entries in the list which starts causing problems.

    The best case is that 2 workers will be able to claim the duplicate item ids and work through them without a problem.

    More likely later workers will either fail triggering the the same process or be unable to claim the item because it no longer exists.

    Now the kicker that's been causing me headaches. Once a worker is able to process them item, claimItem will never be able to get the actual item because hget failures just get ignored meaning an expiring never gets set leaving the entry to be GC forever but also feeding back into the race condition.

    Proposal

    Making some things like the item list sets that can't have duplicates would probably be handy but I'm not sure how that would get implemented, if there are reasons for using a list over a set etc. So a couple of fixes can probably solve this.

    Also, having two parallel lists of the available items is probably bound to lead to these sort of inconsistencies and this theory around race conditions might not be the only cause. So some sort of GC to keep it consistent is probably a good idea.

    1. Implement a firewall so GC can't push items back into the available list if they no longer exist.
    2. Have claim item clear inconsistent entries?
    3. Implement some sort of cleanup like the exist patch that keeps the list consistent? The current approach looks a bit expensive for busy queues, should probably use exist instead of hget, and it might introduce its own race conditions so I'm on the fence for this.

  • πŸ‡©πŸ‡ͺGermany Daniel Kulbe Berlin

    I also find this very expensive iterating the whole list, if only one job is already known to be "empty".
    (For reference, the Error thrown by Redis in this case is "WRONGTYPE Operation against a key holding the wrong kind of value")
    When we try to claim the item, we already know the ID, which is empty. So only this one should be removed at this point.

    So I am adding my solution here, asking for review.

Production build 0.69.0 2024