Skip to content

fix(lock): jitter + tunable retry in distributed acquire (F9d)#153

Merged
Hazzng merged 2 commits into
mainfrom
fix/f9d-acquire-jitter
Jun 20, 2026
Merged

fix(lock): jitter + tunable retry in distributed acquire (F9d)#153
Hazzng merged 2 commits into
mainfrom
fix/f9d-acquire-jitter

Conversation

@Hazzng

@Hazzng Hazzng commented Jun 14, 2026

Copy link
Copy Markdown
Owner

Summary

Closes #141 (F9d — distributed acquire fairness, severity:low).

The distributed exec lock (distributed-lock.ts) and the distributed RW lock (distributed-rw-lock.ts) polled Redis on a flat acquireRetryMs (default 50 ms) setTimeout/sleep. Across replicas the poll cycles stay phase-aligned, so a cross-replica writer can be repeatedly passed over by a peer that polls a hair earlier each cycle — bounded by acquireTimeoutMs (300 s) then a 503, never infinite.

This adds bounded jitter to every acquire/drain poll and makes the retry interval tunable. The ZSET FIFO ticket queue (the "Later" half of the issue's fix) is explicitly deferred.

What changed

  • jitteredDelayMs(retryMs) added to distributed-lock.ts (exported): returns retryMs/2 + random()*retryMs/2, i.e. a delay in [retryMs/2, retryMs]. Lower bound is retryMs/2 so we never poll Redis harder than ~2x the configured rate.
  • Applied to all four contention loops:
    • distributed-lock.ts — legacy single-key acquire loop.
    • distributed-rw-lock.tsacquireShared, acquireExclusive (Phase-A flag set), waitReadersDrained (reader drain).
    • Heartbeat/renew timers are intentionally left flat — they are not the contention loops F9d targets.
  • REDIS_EXEC_LOCK_ACQUIRE_RETRY_MS (default 50) wired through server.ts execLockOptions (it was already a first-class option on both lock types and threaded via session-manager.ts; only the env wiring was missing). Documented in the CLAUDE.md env table.
  • Circuit breaker / error-budget paths (F5, F5: Redis circuit breaker — stop the 300s availability fuse #134) untouched — jitter only reshapes the sleep between polls; success/contention/thrown-error accounting and breaker fast-fail are unchanged.

Deferred (follow-up)

TTL-reaped ZSET ticket queue inside acquireExclusive for true cross-replica FIFO admission — only worth it if multi-writer-per-sandbox becomes a real pattern; a naive LIST ticket would regress crash-safety.

Testing

Unit — new src/api/tests/unit/distributed-acquire-jitter.test.ts:

  • jitteredDelayMs stays within [retryMs/2, retryMs] across 10k samples; hits lower bound at random()=0; approaches upper bound as random()→1; scales with retryMs.
  • Configured acquireRetryMs honored: acquire still succeeds once the lock frees; still times out at acquireTimeoutMs (≥110 ms for a 120 ms timeout) when it never does.
  • Circuit-breaker fast-fail on persistent thrown acquire errors unaffected.

Full gate green: pnpm typecheck && pnpm lint:fix && pnpm test:unit981 passed, 4 skipped. Existing lock + circuit-breaker tests unchanged and passing.

Live (local server on :8133 vs Neon + Redis, vf9d- sandboxes):

  • Created a sandbox, ran 2 sequential execs — second observed the first's filesystem state (/tmp/a.txt → 2 lines), both exit 0.
  • Launched 2 overlapping execs (sleep 1 each) on the same sandbox — both exit 0, output not interleaved (A and B each emitted start+end as a block), confirming the distributed lock still serializes correctly with jitter in place.
  • Torn down: delete → 204, sandbox list empty, server stopped.

True fairness is statistical and de-synchronization can't be asserted from a single-process live run — it's covered by the Step-3 unit bounds test; the live run confirms no regression to acquire/serialize behavior.

Reviewer manual steps

  1. pnpm test -- src/api/tests/unit/distributed-acquire-jitter.test.ts
  2. Optionally set REDIS_EXEC_LOCK_ACQUIRE_RETRY_MS=100 and confirm acquire still works against a real Redis.

@coderabbitai

coderabbitai Bot commented Jun 14, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: d55f6dc8-c0f7-4520-bc52-e3cc9a4ca17d

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/f9d-acquire-jitter

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Hazzng

Hazzng commented Jun 20, 2026

Copy link
Copy Markdown
Owner Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5ef6f99344

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/api/server.ts Outdated
…k use

parseNonNegativeInt accepted 0 for REDIS_EXEC_LOCK_ACQUIRE_RETRY_MS, but
both distributed lock validators require > 0. Add parsePositiveInt and use
it so bad config fails clearly at boot.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 9 files

Reply with feedback, questions, or to request a fix.

Fix all with cubic | Re-trigger cubic

Comment thread .changeset/fix-f9d-acquire-jitter.md
@Hazzng Hazzng merged commit 4393e6a into main Jun 20, 2026
6 checks passed
@Hazzng Hazzng deleted the fix/f9d-acquire-jitter branch June 20, 2026 07:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

F9d: Distributed acquire fairness — jitter + tunable retry

1 participant