fix(coherence): background-drain stranded version publishes (F3) #156

Merged: Hazzng merged 1 commit into main from fix/f3-publish-drainer on Jun 20, 2026

Conversation

@Hazzng Hazzng commented Jun 14, 2026

Owner

Summary

Fixes F3 (#132): a committed write whose Redis INCR fails leaves other replicas serving the pre-write tree indefinitely. The bump was previously hostage to a later mutating write on the same replica — and if the reaper idle-evicted the session first, the in-memory publishPending flag (and the bump) was discarded. Data stays durable in Postgres; only cross-replica visibility was stranded for an unbounded window.

This adds in-process healing that flushes the stranded bump independent of further client traffic.
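
A minimal sketch of the failure branch described above, for readers who have not opened the diff. It is not the code from src/api/session-manager.ts: the `Session` and `VersionStore` shapes, the `dirty` field, and the `vfs:<tenant>:ver:<sandbox>` key layout (inferred from the live log below) are assumptions made for illustration.

```typescript
// Hypothetical stand-in for the Redis client; only INCR matters here.
interface VersionStore {
  incr(key: string): Promise<number>;
}

// Hypothetical session shape; the real Session type is not shown in this PR.
interface Session {
  tenantId: string;
  sandboxId: string;
  dirty: boolean; // a committed write has not been published yet
  publishPending: boolean; // an earlier INCR attempt failed
}

type PendingEntry = { tenantId: string; sandboxId: string };

const pendingPublishes = new Map<string, PendingEntry>();

async function publishVersionIfDirty(
  sessionKey: string,
  session: Session,
  store: VersionStore,
): Promise<boolean> {
  if (!session.dirty && !session.publishPending) return false; // nothing to publish

  // Assumed key layout, modelled on "vfs:default:ver:<id>" from the live check.
  const key = `vfs:${session.tenantId}:ver:${session.sandboxId}`;
  try {
    await store.incr(key);
    session.dirty = false;
    session.publishPending = false;
    pendingPublishes.delete(sessionKey); // successful publish clears the entry
    return true;
  } catch {
    // INCR failed: the write is durable, but other replicas cannot see it yet.
    // Record the bump so a background drainer can retry without client traffic.
    session.publishPending = true;
    pendingPublishes.set(sessionKey, {
      tenantId: session.tenantId,
      sandboxId: session.sandboxId,
    });
    return false;
  }
}
```

The point of the map is that the retry no longer depends on the session seeing another mutating write.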

What changed (src/api/session-manager.ts)

  • pendingPublishes: Map<sessionKey, {tenantId, sandboxId}> — populated in the INCR-failure branch of publishVersionIfDirty; removed on a successful publish, on poison-suppression, and on destroy.
  • drainPendingPublishes() — an unref'd setInterval (10s, tied to the reaper lifecycle in startReaper, cleared in stopReaper/shutdown). Per pending key it takes session.lock.runExclusive(() => publishVersionIfDirty(...)) and re-checks session.publishPending under the lock — this serializes against ensureFreshCache (which clears #dirty on the -1 reload) and a concurrent exec turn, preventing a double INCR (V+2). A still-failing publish stays enqueued for the next tick (fixed-interval backoff). A missing/closing session is dropped (the reaper handles that case).
  • Reap-time best-effort publish: runReaper makes one publishVersionIfDirty attempt under the lock (FS still connected) before disconnecting a publishPending session, then drops the entry.
  • No DEL-key fallback (shares the INCR failure domain).
  • Corrected the now-false Session.publishPending doc-comment.
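
The drain loop in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the merged implementation: `Mutex` is a hypothetical stand-in for `session.lock`, the `closing` flag and the injected `publish` callback are invented for the example, and `publish` is assumed to clear `publishPending` itself on success.

```typescript
// Hypothetical promise-chain mutex standing in for session.lock.
class Mutex {
  private tail: Promise<void> = Promise.resolve();

  runExclusive<T>(fn: () => Promise<T>): Promise<T> {
    const run = this.tail.then(fn);
    // Keep the chain alive whether fn resolves or rejects.
    this.tail = run.then(
      () => undefined,
      () => undefined,
    );
    return run;
  }
}

// Hypothetical session shape for the sketch.
interface Session {
  lock: Mutex;
  publishPending: boolean;
  closing: boolean;
}

type PendingEntry = { tenantId: string; sandboxId: string };

async function drainPendingPublishes(
  pending: Map<string, PendingEntry>,
  sessions: Map<string, Session>,
  publish: (session: Session) => Promise<boolean>, // true when INCR succeeded
): Promise<void> {
  // Snapshot the keys so entries can be deleted while iterating.
  for (const key of Array.from(pending.keys())) {
    const session = sessions.get(key);
    if (!session || session.closing) {
      pending.delete(key); // missing or closing: left to the reaper
      continue;
    }
    await session.lock.runExclusive(async () => {
      // Re-check under the lock: a reload or an exec turn may already have
      // handled the bump, and publishing again would produce V+2.
      if (!session.publishPending) {
        pending.delete(key);
        return;
      }
      if (await publish(session)) pending.delete(key);
      // On failure the entry stays enqueued for the next tick.
    });
  }
}
```

Taking the lock before reading `publishPending` is what makes the check meaningful; reading the flag first and locking afterwards would reopen the double-INCR window.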

Plan: thoughts/shared/plans/2026-06-13_f3-publish-drainer.md.

Verification

Unit (src/api/tests/unit/session-manager.f3-publish-drainer.test.ts) — 5 new tests, all green

  • enqueues a stranded publish on INCR failure and drains it once Redis recovers (idempotent second tick)
  • does not double-INCR when a reload clears dirty before the drainer runs (no V+2)
  • makes a best-effort publish when a publishPending session is reaped
  • drops the pending entry for a session that vanished before the drainer ran
  • leaves the entry enqueued when the retry INCR also fails

Existing session-manager.version-counter.test.ts unchanged. Full suite: 979 passed | 4 skipped. pnpm typecheck && pnpm lint:fix && pnpm test:unit all clean.

Live (Postgres + Redis, server on :8131)

POST /v1/sandboxes                      -> 201 (id 333ed2ff-…)
exec "echo hi > /tmp/a"                 -> exit 0
exec "cat /tmp/a"                       -> stdout "hi\n"   (read-your-writes)
redis-cli get vfs:default:ver:<id>      -> 1               (counter incremented)
DELETE /v1/sandboxes/<id>               -> 204
redis-cli get vfs:default:ver:<id>      -> (nil)           (version key deleted)
SIGTERM                                 -> shutdown_begin → shutdown_complete, port freed in 1s (no force-exit / hang from the new interval)

The INCR-failure → drainer recovery path cannot be triggered cleanly against a live Redis, so it is covered by the Step-3 unit tests above and called out here.
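
For reviewers who want to reproduce the recovery path locally, one way to simulate it is a counter stub whose INCR fails a fixed number of times before succeeding. The sketch below is illustrative only: `FlakyCounter` and `drainUntilPublished` are hypothetical names, and the actual unit tests may stub Redis differently.

```typescript
// Hypothetical in-memory stand-in for the Redis version counter.
class FlakyCounter {
  private values = new Map<string, number>();
  private failuresLeft: number;

  constructor(failuresLeft: number) {
    this.failuresLeft = failuresLeft;
  }

  async incr(key: string): Promise<number> {
    if (this.failuresLeft > 0) {
      this.failuresLeft -= 1;
      throw new Error("simulated Redis outage");
    }
    const next = (this.values.get(key) ?? 0) + 1;
    this.values.set(key, next);
    return next;
  }

  get(key: string): number | undefined {
    return this.values.get(key);
  }
}

// Retries once per simulated tick; returns the tick that published,
// or -1 if every attempt failed.
async function drainUntilPublished(
  counter: FlakyCounter,
  key: string,
  maxTicks: number,
): Promise<number> {
  for (let tick = 1; tick <= maxTicks; tick++) {
    try {
      await counter.incr(key);
      return tick;
    } catch {
      // Still failing: leave the bump pending and try again next tick.
    }
  }
  return -1;
}
```

With `new FlakyCounter(2)` and `maxTicks = 5`, the bump lands on the third tick and the counter reads 1, i.e. exactly one increment despite two failed attempts.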

Reviewer manual steps

  • Confirm the drainer interval is .unref()-ed and cleared on both stopReaper and shutdown (clean process exit verified live).
  • Sanity-check that drainPendingPublishes re-checks publishPending under the lock (double-INCR guard) — covered by the V+2 regression test.
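
As a reference point for the first check, the timer lifecycle being reviewed looks roughly like this in Node.js. The `DrainTimer` class and its method names are hypothetical; in the PR the interval is wired into startReaper, stopReaper, and shutdown.

```typescript
// Hypothetical wrapper illustrating an unref'd interval with explicit cleanup.
class DrainTimer {
  private timer: ReturnType<typeof setInterval> | undefined;

  start(tick: () => void, intervalMs = 10_000): void {
    if (this.timer !== undefined) return; // idempotent start
    this.timer = setInterval(tick, intervalMs);
    // unref() lets the process exit even while the interval is scheduled.
    this.timer.unref();
  }

  stop(): void {
    if (this.timer === undefined) return;
    clearInterval(this.timer);
    this.timer = undefined;
  }

  isRunning(): boolean {
    return this.timer !== undefined;
  }
}
```

`unref()` alone only prevents the timer from holding the event loop open; clearing it in both stop paths is still needed so that no tick fires against sessions that are already being torn down.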

Closes #132

coderabbitai Bot commented Jun 14, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@Hazzng Hazzng merged commit 90d826a into main Jun 20, 2026
4 checks passed
@Hazzng Hazzng deleted the fix/f3-publish-drainer branch June 20, 2026 06:43

Development

Successfully merging this pull request may close these issues.

F3: Heal stranded replicas — durable background publish drainer
