feat(observability): event-loop lag monitoring for Redis leases (F8) (#137)#157
Conversation
…137) The exec-lock writer lease, the RW-lock writer flag, and the RW-lock reader ZSET scores are kept alive by setTimeout heartbeats that silently assume timers fire on schedule. A long event-loop stall (V8 GC pause or pathological sync bash stretch) can fire a renewal past the lease, voiding it. Lock 3 keeps Postgres consistent, so this was always an observability gap, not a correctness bug — but nothing measured how close the process ran to the lease floor. New src/api/event-loop-monitor.ts (purely observational, no behavior change): - perf_hooks.monitorEventLoopDelay histogram started at boot, sampled every EVENT_LOOP_MONITOR_INTERVAL_MS (default 10s) and logged as event:"event_loop_lag" (p50/p99/max/mean ms), then reset. - recordHeartbeatGap wired into all three lease heartbeats: each reports actual-minus-expected fire time as event:"heartbeat_gap" at severity "warn" (gap > renewMs) or "critical" (gap > leaseMs), tagged with lock kind and key. Alert thresholds documented in DEVELOPER.md ("Lock observability"). Unit smoke tests reproduce a >lease stall on each lease (FakeRedis) and a real-Redis Layer-2 integration smoke test (skipped unless REDIS_URL) verifies the critical heartbeat_gap fires end-to-end, with a no-stall control proving no false positives.
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
No issues found across 10 files
Tip: cubic could auto-approve low-risk PRs like this, if it thinks it's safe to merge. Learn more
Re-trigger cubic
Closes #137.
Problem
The three Redis leases — the exec-lock writer lease (
distributed-lock.ts), the RW-lock writer flag, and the RW-lock reader ZSET scores (distributed-rw-lock.ts) — are each kept alive by asetTimeoutheartbeat that renews well before expiry (renewMs20s <leaseMs60s). They silently assume the timer fires on schedule. A long event-loop stall (a V8 GC pause or a pathological synchronous bash stretch) fires the renewal too late; Redis has already expired the key / reaped the reader entry, another replica can step in, and the next renew returns 0 →LockLostError.Lock 3 (PG advisory + per-tx serialization) keeps Postgres consistent, so this is an observability gap, not a correctness bug — but nothing measured how close the process ran to the lease floor (the issue's "the actual ask").
Fix (Primary only — purely observational)
New
src/api/event-loop-monitor.ts:perf_hooks.monitorEventLoopDelayhistogram started at boot, sampled everyEVENT_LOOP_MONITOR_INTERVAL_MS(default 10s) and logged asevent:"event_loop_lag"(p50Ms/p99Ms/maxMs/meanMs), then reset.recordHeartbeatGap()wired into all three lease heartbeats: each reports actual-minus-expected fire time asevent:"heartbeat_gap"atseverity:"warn"(gap > renewMs) or"critical"(gap > leaseMs), tagged with the lock kind (exec/rw-writer/rw-reader) and key. Silent belowrenewMs; safe to call when the monitor was never started.eventLoopLagSnapshot()exposes the live histogram for a future health endpoint.Boot/shutdown wired in
server.ts. Alert thresholds documented inDEVELOPER.md("Lock observability"); env var added toCLAUDE.md.No behavior change to any lock path — observability only.
Tests
heartbeat-gap.smoke.test.ts): reproduces a >lease sync stall on each of the three leases via FakeRedis, asserting the real failure (LockLostErrorfor exec/writer) and acriticalheartbeat_gapwith the rightlocktag +gapMs ≥ leaseMs. No-stall control asserts zero gap events.event-loop-monitor.test.ts): classifier, emitter, lifecycle idempotency, and a warmed-histogram test (real busy-loop).heartbeat-gap.redis.integration.test.ts,skipIf(!REDIS_URL)): verified end-to-end against a real Redis (Docker,redis:7-alpine). RealLockLostError+criticalheartbeat_gap + key genuinely lapsed in Redis; a peer connection acquires the lapsed lease.Verification
Real-Redis log evidence (lease=800ms, renew=250ms, 1.5s stall):
{"event":"heartbeat_gap","severity":"critical","lock":"exec","gapMs":1250,"renewMs":250,"leaseMs":800} {"event":"event_loop_lag","p50Ms":1507,"p99Ms":1507,"maxMs":1507,"meanMs":1506,"windowMs":500}Summary by cubic
Adds event-loop lag observability for Redis leases to surface stalls that can delay heartbeats and void leases. Emits
event_loop_lagmetrics andheartbeat_gapwarn/critical logs; no lock behavior changes.New Features
src/api/event-loop-monitor.ts: process-widemonitorEventLoopDelayhistogram, sampled everyEVENT_LOOP_MONITOR_INTERVAL_MS(default 10s) and logged asevent:"event_loop_lag"(p50Ms,p99Ms,maxMs,meanMs).recordHeartbeatGap()for exec lock, RW writer, and RW reader: emitsevent:"heartbeat_gap"atseverity:"warn"whengapMs > renewMsandseverity:"critical"whengapMs > leaseMs, tagged withlockandkey; silent otherwise.server.ts(timer isunref()'d);eventLoopLagSnapshot()exposed for future health checks. Docs updated with alert guidance.Migration
EVENT_LOOP_MONITOR_INTERVAL_MSand add alerts forheartbeat_gapcriticals and risingevent_loop_lagnearrenewMs.Written for commit b48066b. Summary will update on new commits.