fix: skip locked retry claim rows#150
Conversation
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
WalkthroughRewrites the retry batch-claim to a single multi-CTE query that atomically locks and updates up to the batch size of distinct webhook IDs using per-webhook oldest-first selection and FOR UPDATE ... SKIP LOCKED; docs, postgres implementation, tests, and e2e helper are updated accordingly. ChangesRetry Claim Batch Atomicity and Concurrency
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/storage/postgres/retry_test.go (1)
448-489: ⚡ Quick winAdd a same-webhook due/future regression test.
Great contention coverage here. Please add one more case for a single webhook with two
to retryattempts (one due, one future) and assert only the due one becomesretryingafter claim.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/storage/postgres/retry_test.go` around lines 448 - 489, Add a new unit test (e.g., TestClaimPrefersDueAttemptSameWebhook) that inserts a single webhook with two attempts via insertConfigAndAttempt using the same webhook ID: one with pastTime (due) and one with futureTime (not due). Call store.FindWebhookIDsToRetry(ctx, 1) and assert the webhook ID is returned once, then query attempts for that webhook (using db.NewSelect / Model((*webhooks.Attempt)(nil)) / Where("webhook_id = ?", webhookID)) and assert exactly one attempt has status webhooks.StatusAttemptRetrying (the due one) and the other remains webhooks.StatusAttemptToRetry (the future one); use the existing helper names (insertConfigAndAttempt, FindWebhookIDsToRetry, webhooks.Attempt, StatusAttemptRetrying, StatusAttemptToRetry) so the test follows the same pattern as TestClaimSkipsAttemptsLockedByAnotherWorker.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/storage/postgres/retry_test.go`:
- Around line 448-489: Add a new unit test (e.g.,
TestClaimPrefersDueAttemptSameWebhook) that inserts a single webhook with two
attempts via insertConfigAndAttempt using the same webhook ID: one with pastTime
(due) and one with futureTime (not due). Call store.FindWebhookIDsToRetry(ctx,
1) and assert the webhook ID is returned once, then query attempts for that
webhook (using db.NewSelect / Model((*webhooks.Attempt)(nil)) /
Where("webhook_id = ?", webhookID)) and assert exactly one attempt has status
webhooks.StatusAttemptRetrying (the due one) and the other remains
webhooks.StatusAttemptToRetry (the future one); use the existing helper names
(insertConfigAndAttempt, FindWebhookIDsToRetry, webhooks.Attempt,
StatusAttemptRetrying, StatusAttemptToRetry) so the test follows the same
pattern as TestClaimSkipsAttemptsLockedByAnotherWorker.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 67b5dfdf-9950-4915-abc4-4acfe6dcecbe
📒 Files selected for processing (3)
docs/retry-mechanism.mdpkg/storage/postgres/postgres.gopkg/storage/postgres/retry_test.go
fguery
left a comment
There was a problem hiding this comment.
Head hurts a bit 😂
Looks good, but I'm a bit worried about perfs, specially if/when we have an endpoint that's failing on the receiving end for a long time: we'll then have thousands of attempts right? Unless we have something to limit it? Did you run a pg plan to check we're ok on that front?
gfyrag
left a comment
There was a problem hiding this comment.
Review — Scalability concerns and regression risk
Preamble: should we even scale this service?
Before diving into the SQL details, I want to raise a higher-level concern: the webhooks service should not need to be scaled beyond a single replica.
Its job is to deliver HTTP calls to external endpoints on a best-effort basis with retries. The bottleneck is always the remote endpoint, not our service. A single instance with a bounded worker pool (which we already have via pond) can saturate any reasonable delivery throughput. Attempting to make the retry mechanism horizontally scalable introduces distributed coordination problems (locking, claim contention, partial claims) that are fundamentally harder than the problem we're solving.
The deadlock that motivated this PR (SQLSTATE 40P01) only occurs because multiple retrier loops compete on the same rows. With a single replica, there is no deadlock. The correct fix may simply be to not run multiple retrier instances — which is operationally simpler, easier to reason about, and eliminates an entire class of bugs.
If throughput is genuinely a concern (thousands of webhooks backing up), the right lever is tuning --retry-batch-size and the pond pool size on a single instance, not adding replicas that fight over the same database rows.
That said, here's the analysis of the PR as-is.
Bug: next_retry_after < NOW() in attempts_to_claim causes duplicate sends
The old claimed CTE updated all to retry attempts for a webhook, regardless of next_retry_after:
-- OLD
UPDATE attempts SET status = 'retrying', updated_at = NOW()
WHERE webhook_id IN (SELECT webhook_id FROM to_claim)
AND status = 'to retry'
-- no next_retry_after filterThe new attempts_to_claim CTE adds a temporal filter:
-- NEW
SELECT attempts.id FROM attempts ...
WHERE webhook_id IN (SELECT webhook_id FROM to_claim)
AND status = 'to retry'
AND attempts.next_retry_after < NOW() -- ← only past-due
FOR UPDATE OF attempts SKIP LOCKEDThis is a regression. Here's the scenario:
-
Webhook W1 has failed several times. State after N retries:
- A1:
to retry,next_retry_after= 10 min ago (past-due, from earlier cycle) - A2:
to retry,next_retry_after= 5 min from now (future, latest backoff)
- A1:
-
Retrier claims W1:
to_claimselects A1 (oldest past-due), locks it ✓attempts_to_claimfinds only A1 (A2 is future → excluded by the filter)claimedupdates only A1 →retrying- A2 remains
to retry
-
Worker processes W1 → HTTP call → success
UpdateAttemptsStatussets A1 (retrying) →success- A2 is
to retry, notretrying→ not touched
-
5 minutes later, A2 becomes past-due → claimed → HTTP call → duplicate delivery of an already-successful webhook.
The old code avoided this because claiming all to retry attempts (including future ones) meant they were all transitioned together on success.
This happens naturally: each retry cycle creates a new attempt with a future next_retry_after, while UpdateAttemptsStatus resets the old ones to to retry without changing their (already past-due) next_retry_after. After several failures, there's a growing set of attempts with staggered next_retry_after values.
TestClaimDoesNotClaimFutureAttemptForSameWebhook validates the new behavior but doesn't test the downstream consequence (duplicate send after success).
Fix: remove AND attempts.next_retry_after < NOW() from the attempts_to_claim CTE. Once a webhook_id is selected for processing, all its to retry attempts should be claimed together — matching the old semantics.
Minor: SKIP LOCKED in attempts_to_claim enables partial claims
If any external transaction holds a lock on a to retry row for a claimed webhook (unlikely in normal operation, but possible), SKIP LOCKED silently skips it. This creates a partial claim: some attempts are retrying, others stay to retry. The worker processes the partial set, and the skipped attempts are claimed later — potentially after a successful delivery.
This is a theoretical concern (no normal code path locks these rows externally), but worth noting as a correctness hole that wouldn't exist without SKIP LOCKED in this CTE.
Performance: NOT EXISTS vs DISTINCT ON
The old query used DISTINCT ON (webhook_id) which PostgreSQL optimizes well with an index scan. The new NOT EXISTS correlated subquery runs for each candidate row — O(N) per webhook × M webhooks. For endpoints that have been failing for a long time (thousands of accumulated attempts per webhook), this could become expensive. An index on (webhook_id, status, next_retry_after, id) would help, but the pattern is inherently more costly than DISTINCT ON.
TL;DR
- The
SKIP LOCKEDinto_claimcorrectly prevents deadlocks between concurrent workers. - But
next_retry_after < NOW()inattempts_to_claimis a regression that causes duplicate webhook deliveries after successful retries. Remove it. - More fundamentally: this complexity exists only to support multiple retrier replicas. A single-replica deployment has none of these problems, and I don't believe horizontal scaling is the right approach for this service.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/storage/postgres/postgres.go (1)
160-202:⚠️ Potential issue | 🟠 MajorAdd indexes to support the new retry-claim query access pattern.
The query now drives off per-
webhook_idanti-joins into_claimand a second pass onwebhook_id IN (...)inattempts_to_claim, but the existing indexes do not support this pattern. The currentidx_attempts_retry_pendingonly indexesnext_retry_after, which is insufficient. Add a partial index onattempts (webhook_id, next_retry_after, id)WHEREstatus = 'to retry'to efficiently support both the anti-join filtering and the second-pass lookup.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/storage/postgres/postgres.go` around lines 160 - 202, The new retry-claim query (the CTEs to_claim / attempts_to_claim and the UPDATE that sets status to webhooks.StatusAttemptRetrying) performs anti-joins and IN(...) lookups by attempts.webhook_id but the existing idx_attempts_retry_pending only covers next_retry_after; add a partial index to support these access patterns by creating an index on attempts(webhook_id, next_retry_after, id) WHERE status = 'to retry' (i.e. the same value as webhooks.StatusAttemptToRetry) so the planner can efficiently satisfy the per-webhook anti-join and the second-pass webhook_id IN(...) lookup.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@pkg/storage/postgres/postgres.go`:
- Around line 160-202: The new retry-claim query (the CTEs to_claim /
attempts_to_claim and the UPDATE that sets status to
webhooks.StatusAttemptRetrying) performs anti-joins and IN(...) lookups by
attempts.webhook_id but the existing idx_attempts_retry_pending only covers
next_retry_after; add a partial index to support these access patterns by
creating an index on attempts(webhook_id, next_retry_after, id) WHERE status =
'to retry' (i.e. the same value as webhooks.StatusAttemptToRetry) so the planner
can efficiently satisfy the per-webhook anti-join and the second-pass webhook_id
IN(...) lookup.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 09fb2b59-9431-4971-9985-8a2ca0a133a5
📒 Files selected for processing (4)
docs/retry-mechanism.mdpkg/storage/postgres/postgres.gopkg/storage/postgres/retry_test.gotest/e2e/trigger_test.go
Summary
FOR UPDATE OF attempts SKIP LOCKEDso concurrent workers skip rows already locked by another worker instead of waiting on them.webhook_idprocessing model by claiming the oldest eligible retry per webhook, then marking all currently available eligible attempts for those claimed webhooks asretrying.Root Cause
The previous claim query selected candidate
webhook_ids first, then updated matchingattemptsrows. With multiple retrier workers, two transactions could select overlapping candidates and then block while trying to update rows held by the other transaction. Under contention this can surface asclaiming webhook IDs to retry: ERROR: deadlock detected (SQLSTATE 40P01)and prevents horizontal scaling from draining backlog efficiently.Fix Details
The new query uses two CTEs before the update:
to_claimfinds the oldest eligible retry perwebhook_id, orders candidates globally bynext_retry_afterandid, and locks candidateattemptsrows withFOR UPDATE OF attempts SKIP LOCKED. Rows already held by another worker are skipped immediately.attempts_to_claimlocks the eligible attempts for the selected webhook IDs, again usingSKIP LOCKED, and the final update only changes those locked attempt IDs toretrying.This keeps workers independent: if one worker is processing or claiming a row, another worker can move on to other pending retries instead of blocking.
Testing
context deadline exceededwhile waiting on a locked attempt.go test ./pkg/storage/postgres -count=1go test ./... -count=1