fix(gastown): restore agent working status on heartbeat after dispatch timeout race#1359
Code Review Summary — Status: No Issues Found | Recommendation: Merge. Files reviewed: 6.
Reviewed by gpt-5.4-20260305 · 1,023,247 tokens
…h timeout race (#1358)

Three compounding fixes for the 5-minute bead reset cycle caused by a timing race between startAgentInContainer's 60s timeout and slow cold starts:

1. touchAgent restores idle→working on heartbeat — a heartbeat is proof the agent is alive in the container, regardless of its recorded status.
2. reconcileBeads Rule 3 checks last_activity_at freshness — defense in depth, so an agent with a recent heartbeat is never treated as lost, even if its status field is wrong.
3. dispatchAgent's !started path no longer sets the agent to idle — it stays working, so the reconciler doesn't reset the bead. reconcileAgents still catches truly dead agents after 90s of missing heartbeats.

Closes #1358
The container status pre-phase polls /agents/:id/status on every alarm tick. During a cold start (git clone + worktree), the agent hasn't registered in the process manager yet, so the container returns 404. This was immediately setting the agent to idle, undoing the dispatch timeout fix. Add a 3-minute grace period for not_found status: if the agent was dispatched recently (last_activity_at < 3 min ago), ignore the 404. Truly dead agents are still caught by reconcileAgents after 90s of missing heartbeats.
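The grace-period check described above can be sketched as a small predicate. This is an illustrative model, not the actual gastown code: the `AgentRow` shape, the `shouldIgnoreNotFound` name, and the constant name are assumptions; only the 3-minute window and the `last_activity_at` comparison come from the description.

```typescript
// Hypothetical sketch of the cold-start grace period for `not_found`
// container status. Field and function names are illustrative.
type AgentRow = {
  status: "idle" | "working";
  last_activity_at: string; // ISO 8601, e.g. "2026-03-21T05:55:50.000Z"
};

const COLD_START_GRACE_MS = 3 * 60 * 1000;

// A 404 from /agents/:id/status during a cold start does not mean the agent
// is dead -- it may simply not have registered in the process manager yet.
// Ignore not_found while the agent showed activity within the grace window.
function shouldIgnoreNotFound(agent: AgentRow, nowMs: number): boolean {
  const lastActivityMs = Date.parse(agent.last_activity_at);
  return nowMs - lastActivityMs < COLD_START_GRACE_MS;
}
```

Truly dead agents fall out of the grace window after 3 minutes and are then handled by the normal 90-second heartbeat reconciliation.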
… bead recovery
reconcileBeads Rule 3 compared ISO 8601 timestamps (2026-03-21T05:55:50Z)
against SQLite datetime() output (2026-03-21 05:55:50). Since 'T' (ASCII
84) > ' ' (ASCII 32), the comparison last_activity_at > datetime('now',
'-90 seconds') was ALWAYS TRUE — the heartbeat check never expired. Rule 3
thought every hooked agent had a fresh heartbeat and never recovered stuck
in_progress beads.
Fix: use strftime('%Y-%m-%dT%H:%M:%fZ', ...) to produce ISO 8601 format
matching the stored timestamps.
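The failure mode is plain lexicographic string comparison, which can be demonstrated without SQLite at all. The sketch below uses hard-coded example strings (the dates are illustrative); SQLite compares TEXT values byte-by-byte exactly as JavaScript's `<`/`>` operators do here.

```typescript
// For same-date comparisons, the ISO string has 'T' (0x54) at the position
// where datetime() output has ' ' (0x20), so the ISO string always sorts
// AFTER the datetime() string -- even when it represents an earlier time.
const storedIso = "2026-03-21T05:55:50Z"; // format written by the app
const sqliteDatetime = "2026-03-21 23:59:59"; // datetime('now', ...) output, HOURS later

// Lexicographic comparison, as SQLite does for TEXT columns:
const looksFresh = storedIso > sqliteDatetime; // true, despite being older

// The fix: generate the threshold in the same ISO 8601 shape, e.g.
//   strftime('%Y-%m-%dT%H:%M:%fZ', 'now', '-90 seconds')
// so both operands share a format and compare chronologically.
const isoThreshold = "2026-03-21T23:59:59.000Z";
const correctlyStale = storedIso < isoThreshold; // true: heartbeat expired
```

Since the reconciler's threshold was only 90 seconds in the past, the two strings virtually always shared a date prefix, so the comparison reduced to `'T' > ' '` and the check was effectively always true.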
Also: move invariant violation logging from console.error (spamming Workers
logs every 5s per town) to analytics events for observability dashboards.
Closes #1361
…tart immediately

The refinery's gt_done path unhooks the agent but doesn't set it to idle. The refinery stays 'working' with no hook until agentCompleted fires (when the container process exits, which can take 10-30s after gt_done). During that time, processReviewQueue sees the refinery as non-idle and won't pop the next MR bead.

Set the refinery to idle immediately after unhooking in agentDone. The container process continues running, but the DO knows the refinery is available for new reviews.
Working agents with fresh heartbeats but no hook are running in the container doing nothing — gt_done already ran and unhooked them, or the hook was cleared by another path. Without this fix, the refinery stays 'working' indefinitely (heartbeats keep it alive), blocking processReviewQueue from dispatching it for the next review.

Also skip the mayor in the working-agent check (mayors are always working with no hook — that's normal). This eliminates the invariant 7 false positive from #1364.
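The "working but unhooked" rule above can be modeled as a pure function. This is a minimal sketch under assumed names — `Agent`, `hook_bead_id`, and the role set are illustrative, not the real gastown schema; the mayor exemption and the working/no-hook condition are from the description.

```typescript
// Illustrative model of the working-agent reconciliation check.
type Agent = {
  id: string;
  role: "mayor" | "refinery" | "polecat"; // role names are assumptions
  status: "idle" | "working";
  hook_bead_id: string | null;
};

// A working agent with no hook has already finished (gt_done unhooked it),
// so return it to idle for re-dispatch. Mayors are exempt: they are always
// working with no hook by design.
function reconcileWorkingAgents(agents: Agent[]): Agent[] {
  return agents.map((a) =>
    a.role !== "mayor" && a.status === "working" && a.hook_bead_id === null
      ? { ...a, status: "idle" }
      : a,
  );
}
```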
…hed agents

agentCompleted unconditionally set the agent to idle, which could clobber a live dispatch if the agent was re-hooked and dispatched for new work between gt_done and the container's completion callback. Add a guard: don't set to idle if the agent is working AND has a hook (re-dispatched). Only set to idle if the agent is working with no hook (gt_done completed, waiting for process exit) or already idle.
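The guard's three cases can be written as one boolean. A hedged sketch — the `AgentState` shape and function name are assumptions; the decision table mirrors the commit message.

```typescript
// Sketch of the agentCompleted guard against clobbering a re-dispatch.
type AgentState = {
  status: "idle" | "working";
  hook_bead_id: string | null;
};

// working + hook    => re-dispatched between gt_done and process exit: keep.
// working + no hook => gt_done finished, waiting on process exit: set idle.
// already idle      => setting idle again is harmless.
function shouldSetIdleOnCompletion(agent: AgentState): boolean {
  return !(agent.status === "working" && agent.hook_bead_id !== null);
}
```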
Summary
Fixes a 5-minute bead reset cycle caused by a timing race between `startAgentInContainer`'s 60-second HTTP timeout and slow container cold starts (git clone + worktree setup). When the timeout fired, the agent was set to `idle` even though the container had successfully started the agent. The reconciler then saw an `in_progress` bead with no `working` agent and reset it to `open`, triggering a re-dispatch cycle every 5 minutes.

Four complementary fixes (defense-in-depth):

1. `touchAgent` restores `idle`→`working` on heartbeat (agents.ts): A heartbeat is definitive proof the agent is alive. If the agent's status is `idle` due to the timeout race, the heartbeat corrects it via a SQL `CASE` expression.
2. `reconcileBeads` Rule 3 checks `last_activity_at` freshness (reconciler.ts): Even if an agent's status is wrong, a heartbeat within the last 90 seconds proves the agent is alive, preventing the bead from being reset.
3. `dispatchAgent` `!started` path no longer sets the agent to `idle` (scheduling.ts): When `startAgentInContainer` returns `false` (timeout), the agent is left as `working`. If the agent truly didn't start, `reconcileAgents` catches it after 90s of missing heartbeats. The catch block (thrown exceptions) still sets `idle`, since that indicates a genuine failure.
4. Cold start grace period for `container_status: not_found` (reconciler.ts): The pre-phase container status poll queries `/agents/:id/status` every alarm tick (5s). During a cold start, the container returns 404 because the agent hasn't registered in the process manager yet. This was immediately setting the agent to `idle`, undoing fix #3. Now, `not_found` is ignored for agents dispatched within the last 3 minutes (covers the 60s timeout plus typical cold start time).

Closes #1358
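The first fix is a one-way status transition: a heartbeat may promote `idle` back to `working`, never the reverse. The real change is a SQL `CASE` expression inside `touchAgent`; the sketch below just models the same decision in TypeScript for clarity (the function name is illustrative).

```typescript
// Models: CASE WHEN status = 'idle' THEN 'working' ELSE status END
// A heartbeat is proof of life, so it can only restore an agent that the
// timeout race wrongly marked idle -- it never demotes a working agent.
type Status = "idle" | "working";

function statusAfterHeartbeat(current: Status): Status {
  return current === "idle" ? "working" : current;
}
```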
Verification
- `pnpm typecheck` — all packages pass
- `pnpm --filter cloudflare-gastown test:integration` — reconciler.test.ts: 16/16 pass (3 new tests for this fix)
- `pnpm format:check` — passes
- `pnpm lint` — passes

Visual Changes
N/A
Reviewer Notes
- …`idle` via `not_found` before heartbeats could arrive.
- The agents.ts diff looks larger than the semantic change due to the formatter — the only meaningful change is the `CASE WHEN status = 'idle' THEN 'working'` addition in `touchAgent`.
- rig-do.test.ts and http-api.test.ts are unrelated to this change.