fix(lock): jitter + tunable retry in distributed acquire (F9d) #153
Conversation
…e writer starvation (#141)
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ef6f99344
…k use parseNonNegativeInt accepted 0 for REDIS_EXEC_LOCK_ACQUIRE_RETRY_MS, but both distributed lock validators require > 0. Add parsePositiveInt and use it so bad config fails clearly at boot.
1 issue found across 9 files
Summary
Closes #141 (F9d — distributed acquire fairness, severity:low).
The distributed exec lock (distributed-lock.ts) and the distributed RW lock (distributed-rw-lock.ts) polled Redis on a flat acquireRetryMs (default 50 ms) setTimeout/sleep. Across replicas the poll cycles stay phase-aligned, so a cross-replica writer can be repeatedly passed over by a peer that polls a hair earlier each cycle. The starvation is bounded by acquireTimeoutMs (300 s), after which the request gets a 503; it is never infinite.

This adds bounded jitter to every acquire/drain poll and makes the retry interval tunable. The ZSET FIFO ticket queue (the "Later" half of the issue's fix) is explicitly deferred.
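As a rough illustration of the jittered poll described above, here is a minimal sketch, not the code in this PR. The delay formula retryMs/2 + random()*retryMs/2 and the names jitteredDelayMs, acquireRetryMs and acquireTimeoutMs come from the description; the injectable random parameter, the sleep helper, pollUntilAcquired and the tryAcquire callback are assumptions made for the example.

```typescript
// Sketch only: the optional `random` parameter is an assumption added so the
// bounds can be checked deterministically; the PR describes jitteredDelayMs(retryMs).
function jitteredDelayMs(retryMs: number, random: () => number = Math.random): number {
  const half = retryMs / 2;
  // Delay lands between retryMs/2 and retryMs, so Redis is never polled
  // harder than roughly 2x the configured rate.
  return half + random() * half;
}

// Hypothetical helper: promise-based sleep.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical acquire loop: `tryAcquire` stands in for one Redis attempt.
// Returns true once acquired, false once acquireTimeoutMs has elapsed.
async function pollUntilAcquired(
  tryAcquire: () => Promise<boolean>,
  acquireRetryMs = 50,
  acquireTimeoutMs = 300_000,
): Promise<boolean> {
  const deadline = Date.now() + acquireTimeoutMs;
  while (true) {
    if (await tryAcquire()) return true;
    const remaining = deadline - Date.now();
    if (remaining <= 0) return false;
    // Each wait is re-randomized, so replicas drift out of phase over time.
    await sleep(Math.min(jitteredDelayMs(acquireRetryMs), remaining));
  }
}
```

In this sketch a caller would map a false return to the 503 mentioned above; how the real lock surfaces the timeout is not shown here.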
What changed
- jitteredDelayMs(retryMs) added to distributed-lock.ts (exported): returns retryMs/2 + random()*retryMs/2, i.e. a delay in [retryMs/2, retryMs]. The lower bound is retryMs/2 so we never poll Redis harder than ~2x the configured rate.
- distributed-lock.ts: legacy single-key acquire loop.
- distributed-rw-lock.ts: acquireShared, acquireExclusive (Phase-A flag set), waitReadersDrained (reader drain).
- REDIS_EXEC_LOCK_ACQUIRE_RETRY_MS (default 50) wired through server.ts execLockOptions (it was already a first-class option on both lock types and threaded via session-manager.ts; only the env wiring was missing). Documented in the CLAUDE.md env table.

Deferred (follow-up)
TTL-reaped ZSET ticket queue inside acquireExclusive for true cross-replica FIFO admission. This is only worth it if multi-writer-per-sandbox becomes a real pattern; a naive LIST ticket would regress crash-safety.

Testing
Unit: new src/api/tests/unit/distributed-acquire-jitter.test.ts:
- jitteredDelayMs stays within [retryMs/2, retryMs] across 10k samples; hits the lower bound at random()=0; approaches the upper bound as random()→1; scales with retryMs.
- acquireRetryMs honored: acquire still succeeds once the lock frees; still times out at acquireTimeoutMs (≥110 ms for a 120 ms timeout) when it never does.

Full gate green:
pnpm typecheck && pnpm lint:fix && pnpm test:unit → 981 passed, 4 skipped. Existing lock + circuit-breaker tests unchanged and passing.

Live (local server on :8133 vs Neon + Redis, vf9d-sandboxes):
- … /tmp/a.txt → 2 lines), both exit 0.
- … sleep 1 each) on the same sandbox: both exit 0, output not interleaved (A and B each emitted start+end as a block), confirming the distributed lock still serializes correctly with jitter in place.

True fairness is statistical and de-synchronization can't be asserted from a single-process live run; it's covered by the Step-3 unit bounds test. The live run confirms no regression to acquire/serialize behavior.
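To make the statistical claim concrete, a small offline simulation can compare flat and jittered polling. This is a hedged sketch under a simplified model, not part of the PR's test suite: two pollers start 1 ms apart, the lock frees at a random time, and whoever polls first afterwards wins. The function names, the seeded PRNG (mulberry32) and the trial parameters are all assumptions chosen for illustration.

```typescript
// Small seeded PRNG (mulberry32) so the simulation is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fraction of trials in which poller A (which starts 1 ms ahead of B)
// is the first to poll after the lock is released.
function shareWonByA(
  nextDelay: (rng: () => number) => number,
  trials: number,
  seed: number,
): number {
  const rng = mulberry32(seed);
  let winsA = 0;
  for (let i = 0; i < trials; i++) {
    const releaseAt = 500 + rng() * 1000; // lock frees at a random time (ms)
    let a = 0; // A's current poll time
    let b = 1; // B's current poll time, 1 ms behind A
    while (a < releaseAt) a += nextDelay(rng);
    while (b < releaseAt) b += nextDelay(rng);
    if (a <= b) winsA++;
  }
  return winsA / trials;
}

const retryMs = 50;
// Flat interval: the 1 ms phase offset never changes.
const flatShare = shareWonByA(() => retryMs, 2000, 1);
// Jittered interval: retryMs/2 + random()*retryMs/2, as in the description.
const jitterShare = shareWonByA((rng) => retryMs / 2 + rng() * (retryMs / 2), 2000, 1);
console.log({ flatShare, jitterShare });
```

Under this model the flat schedule lets the earlier poller win almost every trial, while the jittered schedule should bring the split close to even. It says nothing about real Redis latency or clock skew.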
Reviewer manual steps
- pnpm test -- src/api/tests/unit/distributed-acquire-jitter.test.ts
- … REDIS_EXEC_LOCK_ACQUIRE_RETRY_MS=100 and confirm acquire still works against a real Redis.
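When trying other values for REDIS_EXEC_LOCK_ACQUIRE_RETRY_MS, note that the follow-up commit in this PR says the lock validators require a value > 0 and that parsePositiveInt was added so bad config fails at boot. The real helper's signature is not shown in this conversation, so the following is only a sketch of what such a parser might look like; the parameter list, the error message and the env stand-in object are assumptions.

```typescript
// Hypothetical shape for a strictly positive integer parser.
// Unset or blank values fall back to the default; 0, negatives,
// fractions and non-numeric strings throw.
function parsePositiveInt(name: string, raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}

// Stand-in for process.env so the sketch has no Node typings dependency.
const env: Record<string, string | undefined> = {
  REDIS_EXEC_LOCK_ACQUIRE_RETRY_MS: "100",
};

const acquireRetryMs = parsePositiveInt(
  "REDIS_EXEC_LOCK_ACQUIRE_RETRY_MS",
  env.REDIS_EXEC_LOCK_ACQUIRE_RETRY_MS,
  50, // default from the PR description
);
console.log(acquireRetryMs);
```

Rejecting 0 at parse time keeps the failure next to the misconfigured variable instead of surfacing later inside a lock validator.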