Skip to content

feat(observability): event-loop lag monitoring for Redis leases (F8) (#137)#157

Merged
Hazzng merged 1 commit into
mainfrom
fix/event-loop-observe
Jun 20, 2026
Merged

feat(observability): event-loop lag monitoring for Redis leases (F8) (#137)#157
Hazzng merged 1 commit into
mainfrom
fix/event-loop-observe

Conversation

@Hazzng

@Hazzng Hazzng commented Jun 20, 2026

Copy link
Copy Markdown
Owner

Closes #137.

Problem

The three Redis leases — the exec-lock writer lease (distributed-lock.ts), the RW-lock writer flag, and the RW-lock reader ZSET scores (distributed-rw-lock.ts) — are each kept alive by a setTimeout heartbeat that renews well before expiry (renewMs 20s < leaseMs 60s). They silently assume the timer fires on schedule. A long event-loop stall (a V8 GC pause or a pathological synchronous bash stretch) fires the renewal too late; Redis has already expired the key / reaped the reader entry, another replica can step in, and the next renew returns 0 → LockLostError.

Lock 3 (PG advisory + per-tx serialization) keeps Postgres consistent, so this is an observability gap, not a correctness bug — but nothing measured how close the process ran to the lease floor (the issue's "the actual ask").

Fix (Primary only — purely observational)

New src/api/event-loop-monitor.ts:

  • A perf_hooks.monitorEventLoopDelay histogram started at boot, sampled every EVENT_LOOP_MONITOR_INTERVAL_MS (default 10s) and logged as event:"event_loop_lag" (p50Ms/p99Ms/maxMs/meanMs), then reset.
  • recordHeartbeatGap() wired into all three lease heartbeats: each reports actual-minus-expected fire time as event:"heartbeat_gap" at severity:"warn" (gap > renewMs) or "critical" (gap > leaseMs), tagged with the lock kind (exec/rw-writer/rw-reader) and key. Silent below renewMs; safe to call when the monitor was never started.
  • eventLoopLagSnapshot() exposes the live histogram for a future health endpoint.

Boot/shutdown wired in server.ts. Alert thresholds documented in DEVELOPER.md ("Lock observability"); env var added to CLAUDE.md.

No behavior change to any lock path — observability only.

Deferred per discussion: the secondary publishVersionIfDirty encode-deferral (only relevant when REDIS_PATH_SNAPSHOT_ENABLED=true, default off).

Tests

  • Unit smoke (heartbeat-gap.smoke.test.ts): reproduces a >lease sync stall on each of the three leases via FakeRedis, asserting the real failure (LockLostError for exec/writer) and a critical heartbeat_gap with the right lock tag + gapMs ≥ leaseMs. No-stall control asserts zero gap events.
  • Module unit (event-loop-monitor.test.ts): classifier, emitter, lifecycle idempotency, and a warmed-histogram test (real busy-loop).
  • Layer-2 integration (heartbeat-gap.redis.integration.test.ts, skipIf(!REDIS_URL)): verified end-to-end against a real Redis (Docker, redis:7-alpine). Real LockLostError + critical heartbeat_gap + key genuinely lapsed in Redis; a peer connection acquires the lapsed lease.

Verification

REDIS_URL=redis://localhost:6380 → 3 passed
unset REDIS_URL                  → 3 skipped (clean)
pnpm typecheck ✓   pnpm lint:fix ✓   pnpm test:unit ✓ (1026 passed)

Real-Redis log evidence (lease=800ms, renew=250ms, 1.5s stall):

{"event":"heartbeat_gap","severity":"critical","lock":"exec","gapMs":1250,"renewMs":250,"leaseMs":800}
{"event":"event_loop_lag","p50Ms":1507,"p99Ms":1507,"maxMs":1507,"meanMs":1506,"windowMs":500}

Summary by cubic

Adds event-loop lag observability for Redis leases to surface stalls that can delay heartbeats and void leases. Emits event_loop_lag metrics and heartbeat_gap warn/critical logs; no lock behavior changes.

  • New Features

    • New src/api/event-loop-monitor.ts: process-wide monitorEventLoopDelay histogram, sampled every EVENT_LOOP_MONITOR_INTERVAL_MS (default 10s) and logged as event:"event_loop_lag" (p50Ms, p99Ms, maxMs, meanMs).
    • Heartbeats now call recordHeartbeatGap() for exec lock, RW writer, and RW reader: emits event:"heartbeat_gap" at severity:"warn" when gapMs > renewMs and severity:"critical" when gapMs > leaseMs, tagged with lock and key; silent otherwise.
    • Boot/shutdown wiring in server.ts (timer is unref()'d); eventLoopLagSnapshot() exposed for future health checks. Docs updated with alert guidance.
  • Migration

    • No action required; observability only.
    • Optional: tune EVENT_LOOP_MONITOR_INTERVAL_MS and add alerts for heartbeat_gap criticals and rising event_loop_lag near renewMs.

Written for commit b48066b. Summary will update on new commits.

Review in cubic

…137)

The exec-lock writer lease, the RW-lock writer flag, and the RW-lock reader
ZSET scores are kept alive by setTimeout heartbeats that silently assume timers
fire on schedule. A long event-loop stall (V8 GC pause or pathological sync
bash stretch) can fire a renewal past the lease, voiding it. Lock 3 keeps
Postgres consistent, so this was always an observability gap, not a correctness
bug — but nothing measured how close the process ran to the lease floor.

New src/api/event-loop-monitor.ts (purely observational, no behavior change):
- perf_hooks.monitorEventLoopDelay histogram started at boot, sampled every
  EVENT_LOOP_MONITOR_INTERVAL_MS (default 10s) and logged as event:"event_loop_lag"
  (p50/p99/max/mean ms), then reset.
- recordHeartbeatGap wired into all three lease heartbeats: each reports
  actual-minus-expected fire time as event:"heartbeat_gap" at severity "warn"
  (gap > renewMs) or "critical" (gap > leaseMs), tagged with lock kind and key.

Alert thresholds documented in DEVELOPER.md ("Lock observability"). Unit smoke
tests reproduce a >lease stall on each lease (FakeRedis) and a real-Redis
Layer-2 integration smoke test (skipped unless REDIS_URL) verifies the critical
heartbeat_gap fires end-to-end, with a no-stall control proving no false positives.
@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 077828b5-ba03-4867-828a-69bc8fcd704d

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/event-loop-observe

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 10 files

Tip: cubic could auto-approve low-risk PRs like this, if it thinks it's safe to merge. Learn more

Re-trigger cubic

@Hazzng Hazzng merged commit 6a4848e into main Jun 20, 2026
5 checks passed
@Hazzng Hazzng deleted the fix/event-loop-observe branch June 20, 2026 08:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

F8: Event-loop lag observability — measure heartbeat-gap vs lease

1 participant