fix(gastown): restore agent working status on heartbeat after dispatch timeout race by jrf0110 · Pull Request #1359 · Kilo-Org/cloud

jrf0110 · 2026-03-21T05:30:44Z

Summary

Fixes a 5-minute bead reset cycle caused by a timing race between startAgentInContainer's 60-second HTTP timeout and slow container cold starts (git clone + worktree setup). When the timeout fired, the agent was set to idle even though the container had successfully started the agent. The reconciler then saw an in_progress bead with no working agent and reset it to open, triggering a re-dispatch cycle every 5 minutes.

Four complementary fixes (defense-in-depth):

1. touchAgent restores idle → working on heartbeat (agents.ts): A heartbeat is definitive proof the agent is alive. If the agent's status is idle due to the timeout race, the heartbeat corrects it via a SQL CASE expression.
2. reconcileBeads Rule 3 checks last_activity_at freshness (reconciler.ts): Even if an agent's status is wrong, a heartbeat within the last 90 seconds proves the agent is alive, preventing the bead from being reset.
3. dispatchAgent !started path no longer sets agent to idle (scheduling.ts): When startAgentInContainer returns false (timeout), the agent is left as working. If the agent truly didn't start, reconcileAgents catches it after 90s of missing heartbeats. The catch block (thrown exceptions) still sets idle since that indicates a genuine failure.
4. Cold start grace period for container_status: not_found (reconciler.ts): The pre-phase container status poll queries /agents/:id/status every alarm tick (5s). During a cold start, the container returns 404 because the agent hasn't registered in the process manager yet. This was immediately setting the agent to idle, undoing fix #3. Now, not_found is ignored for agents dispatched within the last 3 minutes (covers the 60s timeout + typical cold start time).
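
The heartbeat rules in the first two fixes can be sketched in TypeScript against an in-memory agent record. This is an illustrative model only, not the code in agents.ts or reconciler.ts: the AgentRecord type and the function names applyHeartbeat and hasFreshHeartbeat are hypothetical, the two-value status type is a simplification, and the real touchAgent applies the status change in a SQL CASE expression. Only the idle → working rule and the 90-second window are taken from the description.

```typescript
// Hypothetical in-memory model; the real change lives in SQL inside touchAgent.
type AgentStatus = "idle" | "working"; // simplified: other statuses are not modeled

interface AgentRecord {
  status: AgentStatus;
  lastActivityAt: string; // ISO 8601 timestamp of the most recent heartbeat
}

const HEARTBEAT_FRESHNESS_MS = 90_000; // 90-second window from the PR description

// Fix 1: a heartbeat proves the agent is alive, so idle is promoted to working.
// Any other status is left unchanged.
function applyHeartbeat(agent: AgentRecord, now: Date): AgentRecord {
  return {
    status: agent.status === "idle" ? "working" : agent.status,
    lastActivityAt: now.toISOString(),
  };
}

// Fix 2: treat the agent as alive when its last heartbeat falls inside the
// window, regardless of what the status field says.
function hasFreshHeartbeat(agent: AgentRecord, now: Date): boolean {
  const last = Date.parse(agent.lastActivityAt);
  return !Number.isNaN(last) && now.getTime() - last <= HEARTBEAT_FRESHNESS_MS;
}
```

In this model, an agent marked idle by the timeout race returns to working on its first heartbeat, and a reconciler that consults hasFreshHeartbeat before resetting a bead would skip the reset while heartbeats keep arriving.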

Closes #1358

Verification

pnpm typecheck — all packages pass
pnpm --filter cloudflare-gastown test:integration — reconciler.test.ts: 16/16 pass (3 new tests for this fix)
pnpm format:check — passes
pnpm lint — passes
Pre-push hooks (format + lint + typecheck) — all pass

Visual Changes

N/A

Reviewer Notes

Fix #4 (cold start grace period) was added after the initial deploy because the first three fixes were insufficient: the container status pre-phase was resetting agents to idle via not_found before heartbeats could arrive.
The agents.ts diff looks larger than the semantic change because of the formatter; the only meaningful change is the CASE WHEN status = 'idle' THEN 'working' addition in touchAgent.
The four fixes are intentionally layered: #3 prevents the bad state from occurring, #4 prevents the container status poll from undoing #3, #1 self-heals on the first heartbeat, and #2 is a safety net in the reconciler.
Pre-existing test failures in rig-do.test.ts and http-api.test.ts are unrelated to this change.

kilo-code-bot · 2026-03-21T05:33:51Z

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (6 files)

cloudflare-gastown/src/dos/Town.do.ts
cloudflare-gastown/src/dos/town/agents.ts
cloudflare-gastown/src/dos/town/reconciler.ts
cloudflare-gastown/src/dos/town/review-queue.ts
cloudflare-gastown/src/dos/town/scheduling.ts
cloudflare-gastown/test/integration/reconciler.test.ts

Reviewed by gpt-5.4-20260305 · 1,023,247 tokens

…h timeout race (#1358)

Three compounding fixes for the 5-minute bead reset cycle caused by a timing race between startAgentInContainer's 60s timeout and slow cold starts:

1. touchAgent restores idle→working on heartbeat: a heartbeat is proof the agent is alive in the container regardless of its recorded status.
2. reconcileBeads Rule 3 checks last_activity_at freshness: defense in depth so an agent with a recent heartbeat is never treated as lost, even if its status field is wrong.
3. dispatchAgent !started path no longer sets agent to idle: leaves it working so the reconciler doesn't reset the bead. reconcileAgents catches truly dead agents after 90s of missing heartbeats.

Closes #1358

The container status pre-phase polls /agents/:id/status on every alarm tick. During a cold start (git clone + worktree), the agent hasn't registered in the process manager yet, so the container returns 404. This was immediately setting the agent to idle, undoing the dispatch timeout fix. Add a 3-minute grace period for not_found status: if the agent was dispatched recently (last_activity_at < 3 min ago), ignore the 404. Truly dead agents are still caught by reconcileAgents after 90s of missing heartbeats.
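
The grace-period rule can be sketched as a small predicate. This is a sketch under stated assumptions, not the code in reconciler.ts: the function name shouldIgnoreNotFound, the ContainerStatus type, and the handling of unparseable timestamps are hypothetical. Only the 3-minute window and the use of last_activity_at as the dispatch time come from the commit message.

```typescript
const COLD_START_GRACE_MS = 3 * 60 * 1000; // 3 minutes, per the commit message

// Hypothetical subset of the statuses the container can report.
type ContainerStatus = "running" | "not_found";

// Returns true when a not_found result should be ignored because the agent
// was dispatched (last_activity_at) less than 3 minutes ago.
function shouldIgnoreNotFound(
  status: ContainerStatus,
  lastActivityAt: string,
  now: Date,
): boolean {
  if (status !== "not_found") return false;
  const last = Date.parse(lastActivityAt);
  // Assumption: an unparseable timestamp gets no grace period.
  if (Number.isNaN(last)) return false;
  return now.getTime() - last < COLD_START_GRACE_MS;
}
```

A poll that sees not_found 30 seconds after dispatch would be ignored under this rule, while the same result 4 minutes after the last activity would still be acted on.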

… bead recovery

reconcileBeads Rule 3 compared ISO 8601 timestamps (2026-03-21T05:55:50Z) against SQLite datetime() output (2026-03-21 05:55:50). The comparison is lexicographic, and 'T' (ASCII 84) > ' ' (ASCII 32), so last_activity_at > datetime('now', '-90 seconds') was true for every timestamp dated on or after the cutoff's calendar day, whatever its time of day. The heartbeat check therefore effectively never expired within a day: Rule 3 thought every hooked agent had a fresh heartbeat and never recovered stuck in_progress beads.

Fix: use strftime('%Y-%m-%dT%H:%M:%fZ', ...) to produce ISO 8601 format matching the stored timestamps.

Also: move invariant violation logging from console.error (spamming Workers logs every 5s per town) to analytics events for observability dashboards.

Closes #1361
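
The format mismatch is easy to reproduce outside SQLite, because JavaScript's string comparison orders these ASCII strings the same way a byte-wise comparison does. The sketch below uses made-up timestamps and a hypothetical helper name (isAfter) to show a 10-minute-old heartbeat passing the check against a datetime()-style cutoff and failing it against an ISO 8601 cutoff.

```typescript
// Plain lexicographic comparison, as a text comparison in SQL would do.
function isAfter(timestamp: string, cutoff: string): boolean {
  return timestamp > cutoff;
}

// A stored heartbeat that is well over 90 seconds older than the cutoff.
const staleHeartbeat = "2026-03-21T05:45:50Z";

// Shape of datetime('now', '-90 seconds'): space separator, no trailing Z.
const sqliteCutoff = "2026-03-21 05:54:20";

// Same instant in the shape strftime('%Y-%m-%dT%H:%M:%fZ', ...) produces.
const isoCutoff = "2026-03-21T05:54:20.000Z";

// The strings agree up to index 10, where 'T' (84) beats ' ' (32),
// so the stale heartbeat wrongly looks fresh.
const buggyIsFresh = isAfter(staleHeartbeat, sqliteCutoff); // true

// With matching formats the comparison reaches the minutes ("45" vs "54")
// and the heartbeat is correctly reported as stale.
const fixedIsFresh = isAfter(staleHeartbeat, isoCutoff); // false
```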

…tart immediately

The refinery's gt_done path unhooks the agent but doesn't set it to idle. The refinery stays 'working' with no hook until agentCompleted fires (when the container process exits, which can take 10-30s after gt_done). During that time processReviewQueue sees the refinery as non-idle and won't pop the next MR bead.

Set the refinery to idle immediately after unhooking in agentDone. The container process continues running but the DO knows the refinery is available for new reviews.

Working agents with fresh heartbeats but no hook are running in the container doing nothing — gt_done already ran and unhooked them, or the hook was cleared by another path. Without this, the refinery stays 'working' indefinitely (heartbeats keep it alive), blocking processReviewQueue from dispatching it for the next review. Also skip the mayor in the working-agent check (mayors are always working with no hook — that's normal). This eliminates the invariant 7 false positive from #1364.

…hed agents

agentCompleted unconditionally set the agent to idle, which could clobber a live dispatch if the agent was re-hooked and dispatched for new work between gt_done and the container's completion callback.

Add a guard: don't set to idle if the agent is working AND has a hook (re-dispatched). Only set to idle if the agent is working with no hook (gt_done completed, waiting for process exit) or already idle.
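
The guard can be sketched as a pure state transition. This is an illustration rather than the actual agentCompleted implementation: the AgentState type, the hookedBeadId field, and the function name onAgentCompleted are hypothetical stand-ins for however the Durable Object records an agent's status and hook.

```typescript
interface AgentState {
  status: "idle" | "working";
  hookedBeadId: string | null; // hypothetical name for the agent's current hook
}

// Returns the agent state after the container's completion callback fires.
function onAgentCompleted(agent: AgentState): AgentState {
  // Working with a hook means the agent was re-dispatched after gt_done;
  // the late callback must not clobber that live dispatch.
  const redispatched = agent.status === "working" && agent.hookedBeadId !== null;
  if (redispatched) return agent;

  // Working with no hook (waiting for process exit) or already idle: set idle.
  return { ...agent, status: "idle" };
}
```

Modeling the callback as a function from state to state keeps the three cases (re-dispatched, unhooked and working, already idle) easy to test in isolation.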

kilo-code-bot bot reviewed Mar 21, 2026

cloudflare-gastown/src/dos/town/reconciler.ts

kilo-code-bot bot reviewed Mar 21, 2026

cloudflare-gastown/src/dos/town/reconciler.ts

kilo-code-bot bot reviewed Mar 22, 2026

cloudflare-gastown/src/dos/town/reconciler.ts

cloudflare-gastown/src/dos/town/review-queue.ts

jrf0110 added 8 commits March 21, 2026 21:19

style: run oxfmt formatter

de3ada6

style: format review-queue.ts

726e2ea

jrf0110 force-pushed the 1358-heartbeat-restores-working-status branch from e31cd95 to 726e2ea on March 22, 2026 02:20

jrf0110 wants to merge 8 commits into main from 1358-heartbeat-restores-working-status
