
Reliability: Wrap retry logic and persisting failed jobs from redis queue failure#1402

Merged
beesaferoot merged 3 commits into main from fail-tolerant-redis-queue
Mar 30, 2026

Conversation

@beesaferoot
Contributor

We currently run Redis on k8s and to a large extent cannot guarantee that it will always remain available without temporary failures or restarts.

This PR adds more resilience to the default queues by handling two major challenges:

  1. Retry on transient failures — catch connection errors and retry briefly until Redis comes back
  2. Buffer jobs during extended outages — store jobs in MySQL when Redis is fully down, replay them when it recovers
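
The two mechanisms above could be sketched roughly as follows. This is a hypothetical illustration in Python rather than the project's actual PHP backend; the names (`RedisDown`, `push_job`, `replay_buffer`) and the list-based queue/buffer stand-ins are assumptions for the sketch, not the real implementation:

```python
import time

class RedisDown(Exception):
    """Stand-in for a Redis connection error (hypothetical)."""

def push_job(queue, job, buffer, delays=(1, 1, 1)):
    """Try to enqueue; retry briefly, then buffer the job for later replay."""
    for delay in delays:
        try:
            queue.append(job)   # stand-in for the real Redis push
            return "queued"
        except RedisDown:
            time.sleep(delay)   # transient failure: wait for Redis to come back
    buffer.append(job)          # extended outage: persist the job (MySQL in the PR)
    return "buffered"

def replay_buffer(queue, buffer):
    """Once Redis recovers, drain buffered jobs back onto the queue."""
    while buffer:
        queue.append(buffer.pop(0))
```

The key property is that a job is never silently dropped: it is either on the Redis queue or in the durable buffer awaiting replay.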

Brief summary of the change made

Are there any other side effects of this change that we should be aware of?

Describe how you tested your changes?

Pull Request checklist

Please confirm you have completed the necessary steps below.

  • Meaningful Pull Request title and description
  • Changes tested as described above
  • Added appropriate documentation for the change.
  • Created GitHub issues for any relevant followup/future enhancements if appropriate.

@beesaferoot beesaferoot force-pushed the fail-tolerant-redis-queue branch from d5821f6 to 5e85071 Compare March 25, 2026 20:35
Member

@dmohns dmohns left a comment


I like both ideas. But maybe it's a bit too much doing them at once?

The exponential retry one seems straightforward and simple. However, 3*1s seems a bit low. I would be thinking about 1s, 5s, 10s, 60s => fail.

The PendingJob approach seems a bit like last-resort. It's certainly going to work, but I'm not sure if the added complexity justifies it.

WDYT: Let's try just the backoff strategy first and see if the errors continue to exist. Then we can look into PendingJob approach?
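
The suggested 1s, 5s, 10s, 60s schedule could look something like the sketch below. This is not the actual change in the PR; it is a generic Python illustration where the operation, the error predicate, and the schedule are all placeholders:

```python
import time

# Schedule suggested in the review: wait 1s, 5s, 10s, 60s between attempts.
BACKOFF_SCHEDULE = (1, 5, 10, 60)

def with_backoff(operation, schedule=BACKOFF_SCHEDULE,
                 is_transient=lambda e: isinstance(e, ConnectionError)):
    """Run `operation`; on a transient error, sleep per schedule and retry.

    After the schedule is exhausted, make one final attempt and let any
    error propagate (=> fail).
    """
    for delay in schedule:
        try:
            return operation()
        except Exception as exc:
            if not is_transient(exc):
                raise           # non-transient errors are not retried
            time.sleep(delay)
    return operation()          # final attempt: fail loudly if it still errors
```

With four delays this gives five attempts in total before the caller sees the failure.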

@beesaferoot
Contributor Author

> I like both ideas. But maybe it's a bit too much doing them at once?
>
> The exponential retry one seems straightforward and simple. However, 3*1s seems a bit low. I would be thinking about 1s, 5s, 10s, 60s => fail.
>
> The PendingJob approach seems a bit like last-resort. It's certainly going to work, but I'm not sure if the added complexity justifies it.
>
> WDYT: Let's try just the backoff strategy first and see if the errors continue to exist. Then we can look into PendingJob approach?

That is a pragmatic approach, given we already reduced the Redis restarts on k8s. I will make the necessary changes.

@beesaferoot beesaferoot requested a review from dmohns March 26, 2026 13:07
Comment thread src/backend/bootstrap/app.php Outdated
Member

@dmohns dmohns left a comment


Small comment, other than that LGTM

@beesaferoot beesaferoot requested a review from dmohns March 27, 2026 07:26
Member

@dmohns dmohns left a comment


Thanks!! :shipit:

@beesaferoot beesaferoot merged commit a1d4d5e into main Mar 30, 2026
17 checks passed
@beesaferoot beesaferoot deleted the fail-tolerant-redis-queue branch March 30, 2026 15:28

2 participants