fix: skip locked retry claim rows by flemzord · Pull Request #150 · formancehq/webhooks

flemzord · 2026-05-17T09:05:59Z

Summary

Rework the retry claim query to use FOR UPDATE OF attempts SKIP LOCKED so concurrent workers skip rows already locked by another worker instead of waiting on them.
Preserve the current webhook_id processing model by claiming the oldest eligible retry per webhook, then marking all currently available eligible attempts for those claimed webhooks as retrying.
Add a deterministic Postgres test that locks one eligible attempt in another transaction and verifies the retrier can still claim a different available attempt.
Update the retry mechanism documentation to describe the non-blocking claim behavior.

Root Cause

The previous claim query selected candidate webhook_ids first, then updated matching attempts rows. With multiple retrier workers, two transactions could select overlapping candidates and then block while trying to update rows held by the other transaction. Under contention this can surface as claiming webhook IDs to retry: ERROR: deadlock detected (SQLSTATE 40P01) and prevents horizontal scaling from draining backlog efficiently.

Fix Details

The new query uses two CTEs before the update:

to_claim finds the oldest eligible retry per webhook_id, orders candidates globally by next_retry_after and id, and locks candidate attempts rows with FOR UPDATE OF attempts SKIP LOCKED. Rows already held by another worker are skipped immediately.
attempts_to_claim locks the eligible attempts for the selected webhook IDs, again using SKIP LOCKED, and the final update only changes those locked attempt IDs to retrying.

This keeps workers independent: if one worker is processing or claiming a row, another worker can move on to other pending retries instead of blocking.

Testing

Confirmed the new test fails before the fix with context deadline exceeded while waiting on a locked attempt.
go test ./pkg/storage/postgres -count=1
go test ./... -count=1

coderabbitai · 2026-05-17T09:06:11Z

Warning

Rate limit exceeded

@flemzord has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 50 minutes and 25 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 77ad5b75-3325-43ba-8002-cd4082c58682

📥 Commits

Reviewing files that changed from the base of the PR and between 1397122 and 751f8c4.

📒 Files selected for processing (2)

docs/retry-mechanism.md
pkg/storage/migrations.go

Walkthrough

Rewrites the retry batch-claim to a single multi-CTE query that atomically locks and updates up to the batch size of distinct webhook IDs using per-webhook oldest-first selection and FOR UPDATE ... SKIP LOCKED; docs, postgres implementation, tests, and e2e helper are updated accordingly.

Changes

Retry Claim Batch Atomicity and Concurrency

Layer / File(s)	Summary
Claim query atomicity with per-webhook prioritization and row locking `docs/retry-mechanism.md`, `pkg/storage/postgres/postgres.go`, `pkg/storage/postgres/retry_test.go`, `test/e2e/trigger_test.go`	Documents and implements a single multi-CTE claim flow (`to_claim` → `attempts_to_claim` → `claimed`) that uses a `NOT EXISTS` anti-join to pick the oldest eligible attempt per `webhook_id`, joins `configs` scoped to `active = true`, filters by `next_retry_after < NOW()` and `status = 'to retry'`, locks candidate rows with `FOR UPDATE ... SKIP LOCKED`, updates matched attempts to `status = 'retrying'`, and returns distinct webhook IDs. Adds tests verifying locked attempts are skipped and that only eligible attempts per webhook are claimed; updates e2e test to wait until all pending retry attempts drain to zero and adds a DB helper to count pending attempts.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A query reborn in CTE grace,
Per-webhook locks claim their proper place,
SKIP LOCKED whispers "go around,"
While atomic CTEs hold their ground,
Retries now hop back into race.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: a fix to make retry claim rows skip locked rows when multiple workers contend for them.
Description check	✅ Passed	The description is directly related to the changeset, providing clear context about the retry claim query rework, root cause analysis, fix details, and testing approach.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/skip-locked-retry-claims

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

pkg/storage/postgres/retry_test.go (1)

448-489: ⚡ Quick win

Add a same-webhook due/future regression test.

Great contention coverage here. Please add one more case for a single webhook with two to retry attempts (one due, one future) and assert only the due one becomes retrying after claim.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/storage/postgres/retry_test.go` around lines 448 - 489, Add a new unit
test (e.g., TestClaimPrefersDueAttemptSameWebhook) that inserts a single webhook
with two attempts via insertConfigAndAttempt using the same webhook ID: one with
pastTime (due) and one with futureTime (not due). Call
store.FindWebhookIDsToRetry(ctx, 1) and assert the webhook ID is returned once,
then query attempts for that webhook (using db.NewSelect /
Model((*webhooks.Attempt)(nil)) / Where("webhook_id = ?", webhookID)) and assert
exactly one attempt has status webhooks.StatusAttemptRetrying (the due one) and
the other remains webhooks.StatusAttemptToRetry (the future one); use the
existing helper names (insertConfigAndAttempt, FindWebhookIDsToRetry,
webhooks.Attempt, StatusAttemptRetrying, StatusAttemptToRetry) so the test
follows the same pattern as TestClaimSkipsAttemptsLockedByAnotherWorker.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@pkg/storage/postgres/retry_test.go`:
- Around line 448-489: Add a new unit test (e.g.,
TestClaimPrefersDueAttemptSameWebhook) that inserts a single webhook with two
attempts via insertConfigAndAttempt using the same webhook ID: one with pastTime
(due) and one with futureTime (not due). Call store.FindWebhookIDsToRetry(ctx,
1) and assert the webhook ID is returned once, then query attempts for that
webhook (using db.NewSelect / Model((*webhooks.Attempt)(nil)) /
Where("webhook_id = ?", webhookID)) and assert exactly one attempt has status
webhooks.StatusAttemptRetrying (the due one) and the other remains
webhooks.StatusAttemptToRetry (the future one); use the existing helper names
(insertConfigAndAttempt, FindWebhookIDsToRetry, webhooks.Attempt,
StatusAttemptRetrying, StatusAttemptToRetry) so the test follows the same
pattern as TestClaimSkipsAttemptsLockedByAnotherWorker.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 67b5dfdf-9950-4915-abc4-4acfe6dcecbe

📥 Commits

Reviewing files that changed from the base of the PR and between 60270aa and 38f57d8.

📒 Files selected for processing (3)

docs/retry-mechanism.md
pkg/storage/postgres/postgres.go
pkg/storage/postgres/retry_test.go

fguery

Head hurts a bit 😂

Looks good, but I'm a bit worried about perfs, specially if/when we have an endpoint that's failing on the receiving end for a long time: we'll then have thousands of attempts right? Unless we have something to limit it? Did you run a pg plan to check we're ok on that front?

gfyrag

Review — Scalability concerns and regression risk

Preamble: should we even scale this service?

Before diving into the SQL details, I want to raise a higher-level concern: the webhooks service should not need to be scaled beyond a single replica.

Its job is to deliver HTTP calls to external endpoints on a best-effort basis with retries. The bottleneck is always the remote endpoint, not our service. A single instance with a bounded worker pool (which we already have via pond) can saturate any reasonable delivery throughput. Attempting to make the retry mechanism horizontally scalable introduces distributed coordination problems (locking, claim contention, partial claims) that are fundamentally harder than the problem we're solving.

The deadlock that motivated this PR (SQLSTATE 40P01) only occurs because multiple retrier loops compete on the same rows. With a single replica, there is no deadlock. The correct fix may simply be to not run multiple retrier instances — which is operationally simpler, easier to reason about, and eliminates an entire class of bugs.

If throughput is genuinely a concern (thousands of webhooks backing up), the right lever is tuning --retry-batch-size and the pond pool size on a single instance, not adding replicas that fight over the same database rows.

That said, here's the analysis of the PR as-is.

Bug: `next_retry_after < NOW()` in `attempts_to_claim` causes duplicate sends

The old claimed CTE updated all to retry attempts for a webhook, regardless of next_retry_after:

-- OLD
UPDATE attempts SET status = 'retrying', updated_at = NOW()
WHERE webhook_id IN (SELECT webhook_id FROM to_claim)
  AND status = 'to retry'
  -- no next_retry_after filter

The new attempts_to_claim CTE adds a temporal filter:

-- NEW
SELECT attempts.id FROM attempts ...
WHERE webhook_id IN (SELECT webhook_id FROM to_claim)
  AND status = 'to retry'
  AND attempts.next_retry_after < NOW()   -- ← only past-due
FOR UPDATE OF attempts SKIP LOCKED

This is a regression. Here's the scenario:

Webhook W1 has failed several times. State after N retries:
- A1: to retry, next_retry_after = 10 min ago (past-due, from earlier cycle)
- A2: to retry, next_retry_after = 5 min from now (future, latest backoff)
Retrier claims W1:
- to_claim selects A1 (oldest past-due), locks it ✓
- attempts_to_claim finds only A1 (A2 is future → excluded by the filter)
- claimed updates only A1 → retrying
- A2 remains to retry
Worker processes W1 → HTTP call → success
- UpdateAttemptsStatus sets A1 (retrying) → success
- A2 is to retry, not retrying → not touched
5 minutes later, A2 becomes past-due → claimed → HTTP call → duplicate delivery of an already-successful webhook.

The old code avoided this because claiming all to retry attempts (including future ones) meant they were all transitioned together on success.

This happens naturally: each retry cycle creates a new attempt with a future next_retry_after, while UpdateAttemptsStatus resets the old ones to to retry without changing their (already past-due) next_retry_after. After several failures, there's a growing set of attempts with staggered next_retry_after values.

TestClaimDoesNotClaimFutureAttemptForSameWebhook validates the new behavior but doesn't test the downstream consequence (duplicate send after success).

Fix: remove AND attempts.next_retry_after < NOW() from the attempts_to_claim CTE. Once a webhook_id is selected for processing, all its to retry attempts should be claimed together — matching the old semantics.

Minor: SKIP LOCKED in `attempts_to_claim` enables partial claims

If any external transaction holds a lock on a to retry row for a claimed webhook (unlikely in normal operation, but possible), SKIP LOCKED silently skips it. This creates a partial claim: some attempts are retrying, others stay to retry. The worker processes the partial set, and the skipped attempts are claimed later — potentially after a successful delivery.

This is a theoretical concern (no normal code path locks these rows externally), but worth noting as a correctness hole that wouldn't exist without SKIP LOCKED in this CTE.

Performance: NOT EXISTS vs DISTINCT ON

The old query used DISTINCT ON (webhook_id) which PostgreSQL optimizes well with an index scan. The new NOT EXISTS correlated subquery runs for each candidate row — O(N) per webhook × M webhooks. For endpoints that have been failing for a long time (thousands of accumulated attempts per webhook), this could become expensive. An index on (webhook_id, status, next_retry_after, id) would help, but the pattern is inherently more costly than DISTINCT ON.

TL;DR

The SKIP LOCKED in to_claim correctly prevents deadlocks between concurrent workers.
But next_retry_after < NOW() in attempts_to_claim is a regression that causes duplicate webhook deliveries after successful retries. Remove it.
More fundamentally: this complexity exists only to support multiple retrier replicas. A single-replica deployment has none of these problems, and I don't believe horizontal scaling is the right approach for this service.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/storage/postgres/postgres.go (1)
160-202: ⚠️ Potential issue | 🟠 Major

Add indexes to support the new retry-claim query access pattern.

The query now drives off per-webhook_id anti-joins in to_claim and a second pass on webhook_id IN (...) in attempts_to_claim, but the existing indexes do not support this pattern. The current idx_attempts_retry_pending only indexes next_retry_after, which is insufficient. Add a partial index on attempts (webhook_id, next_retry_after, id) WHERE status = 'to retry' to efficiently support both the anti-join filtering and the second-pass lookup.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/storage/postgres/postgres.go` around lines 160 - 202, The new retry-claim
query (the CTEs to_claim / attempts_to_claim and the UPDATE that sets status to
webhooks.StatusAttemptRetrying) performs anti-joins and IN(...) lookups by
attempts.webhook_id but the existing idx_attempts_retry_pending only covers
next_retry_after; add a partial index to support these access patterns by
creating an index on attempts(webhook_id, next_retry_after, id) WHERE status =
'to retry' (i.e. the same value as webhooks.StatusAttemptToRetry) so the planner
can efficiently satisfy the per-webhook anti-join and the second-pass webhook_id
IN(...) lookup.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@pkg/storage/postgres/postgres.go`:
- Around line 160-202: The new retry-claim query (the CTEs to_claim /
attempts_to_claim and the UPDATE that sets status to
webhooks.StatusAttemptRetrying) performs anti-joins and IN(...) lookups by
attempts.webhook_id but the existing idx_attempts_retry_pending only covers
next_retry_after; add a partial index to support these access patterns by
creating an index on attempts(webhook_id, next_retry_after, id) WHERE status =
'to retry' (i.e. the same value as webhooks.StatusAttemptToRetry) so the planner
can efficiently satisfy the per-webhook anti-join and the second-pass webhook_id
IN(...) lookup.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 09fb2b59-9431-4971-9985-8a2ca0a133a5

📥 Commits

Reviewing files that changed from the base of the PR and between 38f57d8 and 1397122.

📒 Files selected for processing (4)

docs/retry-mechanism.md
pkg/storage/postgres/postgres.go
pkg/storage/postgres/retry_test.go
test/e2e/trigger_test.go

gfyrag

blind approve

fix: skip locked retry claim rows

38f57d8

flemzord requested a review from a team as a code owner May 17, 2026 09:06

flemzord changed the title ~~Fix retry claim contention with SKIP LOCKED~~ fix: retry claim contention with SKIP LOCKED May 17, 2026

coderabbitai Bot reviewed May 17, 2026

View reviewed changes

Comment thread pkg/storage/postgres/postgres.go

flemzord changed the title ~~fix: retry claim contention with SKIP LOCKED~~ fix: skip locked retry claim rows May 17, 2026

flemzord added 2 commits May 17, 2026 11:11

test: cover future retry claim timing

3c02d3c

test: wait for retry queue drain

1b1dc2a

fguery reviewed May 18, 2026

View reviewed changes

gfyrag reviewed May 18, 2026

View reviewed changes

test: wait for in-flight retry attempts

1397122

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

perf: add retry claim indexes

751f8c4

gfyrag reviewed May 21, 2026

View reviewed changes

Comment thread test/e2e/trigger_test.go

gfyrag approved these changes May 21, 2026

View reviewed changes

Comment thread test/e2e/trigger_test.go

flemzord merged commit 5043fd5 into main May 21, 2026
10 of 11 checks passed

flemzord deleted the feat/skip-locked-retry-claims branch May 21, 2026 13:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: skip locked retry claim rows#150

fix: skip locked retry claim rows#150
flemzord merged 5 commits into
mainfrom
feat/skip-locked-retry-claims

flemzord commented May 17, 2026

Uh oh!

coderabbitai Bot commented May 17, 2026 •

edited

Loading

Rate limit exceeded

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

fguery left a comment

Uh oh!

gfyrag left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

gfyrag left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

flemzord commented May 17, 2026

Summary

Root Cause

Fix Details

Testing

Uh oh!

coderabbitai Bot commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

fguery left a comment

Choose a reason for hiding this comment

Uh oh!

gfyrag left a comment

Choose a reason for hiding this comment

Review — Scalability concerns and regression risk

Preamble: should we even scale this service?

Bug: next_retry_after < NOW() in attempts_to_claim causes duplicate sends

Minor: SKIP LOCKED in attempts_to_claim enables partial claims

Performance: NOT EXISTS vs DISTINCT ON

TL;DR

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gfyrag left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

coderabbitai Bot commented May 17, 2026 •

edited

Loading

Bug: `next_retry_after < NOW()` in `attempts_to_claim` causes duplicate sends

Minor: SKIP LOCKED in `attempts_to_claim` enables partial claims