Reliability: Wrap retry logic and persisting failed jobs from redis queue failure #1402

beesaferoot merged 3 commits into main
Conversation
Force-pushed d5821f6 to 5e85071
dmohns left a comment:
I like both ideas. But maybe it's a bit too much doing them at once?
The exponential retry one seems straightforward and simple. However, 3*1s seems a bit low. I would be thinking about 1s, 5s, 10s, 60s => fail.
The PendingJob approach seems a bit like a last resort. It's certainly going to work, but I'm not sure the added complexity justifies it.
WDYT: let's try just the backoff strategy first and see if the errors persist. Then we can look into the PendingJob approach?
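The backoff schedule suggested above (1s, 5s, 10s, 60s, then fail) could be sketched as a small retry wrapper. Note this is an illustrative sketch, not the PR's actual code: the `enqueue_with_backoff` name is invented, and the builtin `ConnectionError` stands in for whatever exception the project's Redis client raises.

```python
import time

# Illustrative schedule from the review comment: 1s, 5s, 10s, 60s, then fail.
BACKOFF_SCHEDULE = [1, 5, 10, 60]

def enqueue_with_backoff(enqueue, payload, schedule=BACKOFF_SCHEDULE, sleep=time.sleep):
    """Call `enqueue(payload)`, retrying on connection errors.

    Makes one initial attempt plus one retry per backoff step, and
    re-raises the last error if every attempt fails.
    """
    last_error = None
    for delay in [0] + list(schedule):  # 0 = the initial attempt, no wait
        if delay:
            sleep(delay)
        try:
            return enqueue(payload)
        except ConnectionError as err:  # stand-in for the client's error class
            last_error = err
    raise last_error
```

Injecting `sleep` keeps the wrapper testable without actually waiting out the schedule.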
It is a pragmatic approach, given we already reduced the redis restarts on k8s. I will make the necessary changes.
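For context, a last-resort PendingJob store like the one discussed above might look roughly like the following. Everything here is hypothetical: the `PendingJobStore` class, the SQLite backing, and the table layout are assumptions for illustration, not the PR's implementation.

```python
import json
import sqlite3
import time

class PendingJobStore:
    """Hypothetical fallback: persist jobs that could not be enqueued,
    so a background task can replay them once Redis is reachable again."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pending_jobs "
            "(id INTEGER PRIMARY KEY, queue TEXT, payload TEXT, created REAL)"
        )

    def save(self, queue, payload):
        # Record the failed job with its target queue and a timestamp.
        self.conn.execute(
            "INSERT INTO pending_jobs (queue, payload, created) VALUES (?, ?, ?)",
            (queue, json.dumps(payload), time.time()),
        )
        self.conn.commit()

    def replay(self, enqueue):
        """Re-enqueue every saved job via `enqueue(queue, payload)`,
        deleting rows that were handed off successfully."""
        rows = self.conn.execute(
            "SELECT id, queue, payload FROM pending_jobs"
        ).fetchall()
        for job_id, queue, payload in rows:
            enqueue(queue, json.loads(payload))
            self.conn.execute("DELETE FROM pending_jobs WHERE id = ?", (job_id,))
        self.conn.commit()
```

The added moving part (a second store plus a replay loop) is exactly the complexity the review weighs against trying the simpler backoff strategy first.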
dmohns left a comment:
Small comment, other than that LGTM
We currently run Redis on k8s and, to a large extent, cannot guarantee that it will always remain available without temporary failures or restarts.
This PR tries to add more resilience to the default queues by handling two major challenges: