fix(coherence): background-drain stranded version publishes (F3)#156
Merged
Conversation
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes F3 (#132): a committed write whose Redis
INCRfails leaves other replicas serving the pre-write tree indefinitely. The bump was previously hostage to a later mutating write on the same replica — and if the reaper idle-evicted the session first, the in-memorypublishPendingflag (and the bump) was discarded. Data stays durable in Postgres; only cross-replica visibility was stranded for an unbounded window.This adds in-process healing that flushes the stranded bump independent of further client traffic.
What changed (
src/api/session-manager.ts)pendingPublishes: Map<sessionKey, {tenantId, sandboxId}>— populated in theINCR-failure branch ofpublishVersionIfDirty; removed on a successful publish, on poison-suppression, and ondestroy.drainPendingPublishes()— an unref'dsetInterval(10s, tied to the reaper lifecycle instartReaper, cleared instopReaper/shutdown). Per pending key it takessession.lock.runExclusive(() => publishVersionIfDirty(...))and re-checkssession.publishPendingunder the lock — this serializes againstensureFreshCache(which clears#dirtyon the-1reload) and a concurrent exec turn, preventing a doubleINCR(V+2). A still-failing publish stays enqueued for the next tick (fixed-interval backoff). A missing/closing session is dropped (the reaper handles that case).runReapermakes onepublishVersionIfDirtyattempt under the lock (FS still connected) before disconnecting apublishPendingsession, then drops the entry.INCRfailure domain).Session.publishPendingdoc-comment.Plan:
thoughts/shared/plans/2026-06-13_f3-publish-drainer.md.Verification
Unit (
src/api/tests/unit/session-manager.f3-publish-drainer.test.ts) — 5 new tests, all greenenqueues a stranded publish on INCR failure and drains it once Redis recovers(idempotent second tick)does not double-INCR when a reload clears dirty before the drainer runs(no V+2)makes a best-effort publish when a publishPending session is reapeddrops the pending entry for a session that vanished before the drainer ranleaves the entry enqueued when the retry INCR also failsExisting
session-manager.version-counter.test.tsunchanged. Full suite: 979 passed | 4 skipped.pnpm typecheck && pnpm lint:fix && pnpm test:unitall clean.Live (Postgres + Redis, server on :8131)
The
INCR-failure → drainer recovery path cannot be triggered cleanly against a live Redis, so it is covered by the Step-3 unit tests above and called out here.Reviewer manual steps
.unref()-ed and cleared on bothstopReaperandshutdown(clean process exit verified live).drainPendingPublishesre-checkspublishPendingunder the lock (double-INCR guard) — covered by the V+2 regression test.Closes #132