
Reliability: Wrap retry logic and persisting failed jobs from redis queue failure#1402

Merged
beesaferoot merged 3 commits into main from fail-tolerant-redis-queue
Mar 30, 2026

Conversation

@beesaferoot
Contributor

We currently run Redis on k8s and to a large extent cannot guarantee that it will always remain available without temporary failures or restarts.

This PR adds more resilience to the default queues by handling two major challenges:

  1. Retry on transient failures — catch connection errors and retry briefly until Redis comes back
  2. Buffer jobs during extended outages — store jobs in MySQL when Redis is fully down, replay them when it recovers
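
The two mechanisms above could be sketched roughly as follows. This is a hypothetical illustration in Python rather than the project's actual PHP backend; the names (`RedisDown`, `push_job`, `replay_buffer`) and the list-based queue/buffer stand-ins are assumptions for the sketch, not the real implementation:

```python
import time

class RedisDown(Exception):
    """Stand-in for a Redis connection error (hypothetical)."""

def push_job(queue, job, buffer, delays=(1, 1, 1)):
    """Try to enqueue; retry briefly, then buffer the job for later replay."""
    for delay in delays:
        try:
            queue.append(job)   # stand-in for the real Redis push
            return "queued"
        except RedisDown:
            time.sleep(delay)   # transient failure: wait for Redis to come back
    buffer.append(job)          # extended outage: persist the job (MySQL in the PR)
    return "buffered"

def replay_buffer(queue, buffer):
    """Once Redis recovers, drain buffered jobs back onto the queue."""
    while buffer:
        queue.append(buffer.pop(0))
```

The key property is that a job is never silently dropped: it is either on the Redis queue or in the durable buffer awaiting replay.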

Brief summary of the change made

Are there any other side effects of this change that we should be aware of?

Describe how you tested your changes?

Pull Request checklist

Please confirm you have completed the necessary steps below.

  • Meaningful Pull Request title and description
  • Changes tested as described above
  • Added appropriate documentation for the change.
  • Created GitHub issues for any relevant followup/future enhancements if appropriate.

@beesaferoot beesaferoot force-pushed the fail-tolerant-redis-queue branch from d5821f6 to 5e85071 Compare March 25, 2026 20:35
Member

@dmohns dmohns left a comment


I like both ideas. But maybe it's a bit too much doing them at once?

The exponential retry one seems straightforward and simple. However, 3*1s seems a bit low. I would be thinking about 1s, 5s, 10s, 60s => fail.

The PendingJob approach seems a bit like last-resort. It's certainly going to work, but I'm not sure if the added complexity justifies it.

WDYT: Let's try just the backoff strategy first and see if the errors continue to exist. Then we can look into PendingJob approach?
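
The suggested 1s, 5s, 10s, 60s schedule could look something like the sketch below. This is not the actual change in the PR; it is a generic Python illustration where the operation, the error predicate, and the schedule are all placeholders:

```python
import time

# Schedule suggested in the review: wait 1s, 5s, 10s, 60s between attempts.
BACKOFF_SCHEDULE = (1, 5, 10, 60)

def with_backoff(operation, schedule=BACKOFF_SCHEDULE,
                 is_transient=lambda e: isinstance(e, ConnectionError)):
    """Run `operation`; on a transient error, sleep per schedule and retry.

    After the schedule is exhausted, make one final attempt and let any
    error propagate (=> fail).
    """
    for delay in schedule:
        try:
            return operation()
        except Exception as exc:
            if not is_transient(exc):
                raise           # non-transient errors are not retried
            time.sleep(delay)
    return operation()          # final attempt: fail loudly if it still errors
```

With four delays this gives five attempts in total before the caller sees the failure.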

@beesaferoot
Contributor Author

> I like both ideas. But maybe it's a bit too much doing them at once?
>
> The exponential retry one seems straightforward and simple. However, 3*1s seems a bit low. I would be thinking about 1s, 5s, 10s, 60s => fail.
>
> The PendingJob approach seems a bit like last-resort. It's certainly going to work, but I'm not sure if the added complexity justifies it.
>
> WDYT: Let's try just the backoff strategy first and see if the errors continue to exist. Then we can look into PendingJob approach?

That is a pragmatic approach, given we already reduced the Redis restarts on k8s. I will make the necessary changes.

@beesaferoot beesaferoot requested a review from dmohns March 26, 2026 13:07
Comment thread src/backend/bootstrap/app.php Outdated
Member

@dmohns dmohns left a comment


Small comment, other than that LGTM

@beesaferoot beesaferoot requested a review from dmohns March 27, 2026 07:26
Member

@dmohns dmohns left a comment


Thanks!! :shipit:

@beesaferoot beesaferoot merged commit a1d4d5e into main Mar 30, 2026
17 checks passed
@beesaferoot beesaferoot deleted the fail-tolerant-redis-queue branch March 30, 2026 15:28

2 participants