Skip to content

fix(gastown): restore agent working status on heartbeat after dispatch timeout race#1359

Open
jrf0110 wants to merge 8 commits intomainfrom
1358-heartbeat-restores-working-status
Open

fix(gastown): restore agent working status on heartbeat after dispatch timeout race#1359
jrf0110 wants to merge 8 commits intomainfrom
1358-heartbeat-restores-working-status

Conversation

@jrf0110
Copy link
Contributor

@jrf0110 jrf0110 commented Mar 21, 2026

Summary

Fixes a 5-minute bead reset cycle caused by a timing race between startAgentInContainer's 60-second HTTP timeout and slow container cold starts (git clone + worktree setup). When the timeout fired, the agent was set to idle even though the container had successfully started the agent. The reconciler then saw an in_progress bead with no working agent and reset it to open, triggering a re-dispatch cycle every 5 minutes.

Four complementary fixes (defense-in-depth):

  1. touchAgent restores idleworking on heartbeat (agents.ts): A heartbeat is definitive proof the agent is alive. If the agent's status is idle due to the timeout race, the heartbeat corrects it via a SQL CASE expression.

  2. reconcileBeads Rule 3 checks last_activity_at freshness (reconciler.ts): Even if an agent's status is wrong, a heartbeat within the last 90 seconds proves the agent is alive, preventing the bead from being reset.

  3. dispatchAgent !started path no longer sets agent to idle (scheduling.ts): When startAgentInContainer returns false (timeout), the agent is left as working. If the agent truly didn't start, reconcileAgents catches it after 90s of missing heartbeats. The catch block (thrown exceptions) still sets idle since that indicates a genuine failure.

  4. Cold start grace period for container_status: not_found (reconciler.ts): The pre-phase container status poll queries /agents/:id/status every alarm tick (5s). During a cold start, the container returns 404 because the agent hasn't registered in the process manager yet. This was immediately setting the agent to idle, undoing fix # 3. Now, not_found is ignored for agents dispatched within the last 3 minutes (covers the 60s timeout + typical cold start time).

Closes #1358

Verification

  • pnpm typecheck — all packages pass
  • pnpm --filter cloudflare-gastown test:integration — reconciler.test.ts: 16/16 pass (3 new tests for this fix)
  • pnpm format:check — passes
  • pnpm lint — passes
  • Pre-push hooks (format + lint + typecheck) — all pass

Visual Changes

N/A

Reviewer Notes

  • Fix # 4 (cold start grace period) was added after the initial deploy because the first 3 fixes were insufficient — the container status pre-phase was resetting agents to idle via not_found before heartbeats could arrive.
  • The agents.ts diff looks larger than the semantic change due to the formatter — the only meaningful change is the CASE WHEN status = 'idle' THEN 'working' addition in touchAgent.
  • The four fixes are intentionally layered: # 3 prevents the bad state from occurring, AppBuilder - Fix model selector not showing saved model on reload #4 prevents the container status poll from undoing # 3, # 1 self-heals on the first heartbeat, and AppBuilder - Make entire project card clickable #2 is a safety net in the reconciler.
  • Pre-existing test failures in rig-do.test.ts and http-api.test.ts are unrelated to this change.

@kilo-code-bot
Copy link
Contributor

kilo-code-bot bot commented Mar 21, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (6 files)
  • cloudflare-gastown/src/dos/Town.do.ts
  • cloudflare-gastown/src/dos/town/agents.ts
  • cloudflare-gastown/src/dos/town/reconciler.ts
  • cloudflare-gastown/src/dos/town/review-queue.ts
  • cloudflare-gastown/src/dos/town/scheduling.ts
  • cloudflare-gastown/test/integration/reconciler.test.ts

Reviewed by gpt-5.4-20260305 · 1,023,247 tokens

jrf0110 added 8 commits March 21, 2026 21:19
…h timeout race (#1358)

Three compounding fixes for the 5-minute bead reset cycle caused by a
timing race between startAgentInContainer's 60s timeout and slow cold
starts:

1. touchAgent restores idle→working on heartbeat — a heartbeat is proof
   the agent is alive in the container regardless of its recorded status.

2. reconcileBeads Rule 3 checks last_activity_at freshness — defense in
   depth so an agent with a recent heartbeat is never treated as lost,
   even if its status field is wrong.

3. dispatchAgent !started path no longer sets agent to idle — leaves it
   working so the reconciler doesn't reset the bead. reconcileAgents
   catches truly dead agents after 90s of missing heartbeats.

Closes #1358
The container status pre-phase polls /agents/:id/status on every alarm
tick. During a cold start (git clone + worktree), the agent hasn't
registered in the process manager yet, so the container returns 404.
This was immediately setting the agent to idle, undoing the dispatch
timeout fix.

Add a 3-minute grace period for not_found status: if the agent was
dispatched recently (last_activity_at < 3 min ago), ignore the 404.
Truly dead agents are still caught by reconcileAgents after 90s of
missing heartbeats.
… bead recovery

reconcileBeads Rule 3 compared ISO 8601 timestamps (2026-03-21T05:55:50Z)
against SQLite datetime() output (2026-03-21 05:55:50). Since 'T' (ASCII
84) > ' ' (ASCII 32), the comparison last_activity_at > datetime('now',
'-90 seconds') was ALWAYS TRUE — the heartbeat check never expired. Rule 3
thought every hooked agent had a fresh heartbeat and never recovered stuck
in_progress beads.

Fix: use strftime('%Y-%m-%dT%H:%M:%fZ', ...) to produce ISO 8601 format
matching the stored timestamps.

Also: move invariant violation logging from console.error (spamming Workers
logs every 5s per town) to analytics events for observability dashboards.

Closes #1361
…tart immediately

The refinery's gt_done path unhooks the agent but doesn't set it to
idle. The refinery stays 'working' with no hook until agentCompleted
fires (when the container process exits, which can take 10-30s after
gt_done). During that time processReviewQueue sees the refinery as
non-idle and won't pop the next MR bead.

Set the refinery to idle immediately after unhooking in agentDone.
The container process continues running but the DO knows the refinery
is available for new reviews.
Working agents with fresh heartbeats but no hook are running in the
container doing nothing — gt_done already ran and unhooked them, or the
hook was cleared by another path. Without this, the refinery stays
'working' indefinitely (heartbeats keep it alive), blocking
processReviewQueue from dispatching it for the next review.

Also skip the mayor in the working-agent check (mayors are always
working with no hook — that's normal). This eliminates the invariant 7
false positive from #1364.
…hed agents

agentCompleted unconditionally set the agent to idle, which could
clobber a live dispatch if the agent was re-hooked and dispatched
for new work between gt_done and the container's completion callback.

Add a guard: don't set to idle if the agent is working AND has a
hook (re-dispatched). Only set to idle if the agent is working with
no hook (gt_done completed, waiting for process exit) or already idle.
@jrf0110 jrf0110 force-pushed the 1358-heartbeat-restores-working-status branch from e31cd95 to 726e2ea Compare March 22, 2026 02:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Polecat heartbeats don't restore working status after dispatch timeout race

1 participant