Skip to content

DO NOT MERGE: serverless restart race condition#4997

Draft
NathanFlurry wants to merge 1 commit into
rivetkit-shutdown/exit-pid1-after-drainfrom
05-07-do_not_merge_serverless_restart_race_condition
Draft

DO NOT MERGE: serverless restart race condition#4997
NathanFlurry wants to merge 1 commit into
rivetkit-shutdown/exit-pid1-after-drainfrom
05-07-do_not_merge_serverless_restart_race_condition

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app Bot commented May 7, 2026

🚅 Deployed to the rivet-pr-4997 environment in rivet-frontend

Service Status Web Updated (UTC)
website ◻️ Removed (View Logs) Web May 9, 2026 at 7:34 am
mcp-hub ◻️ Removed (View Logs) Web May 9, 2026 at 7:34 am
frontend-inspector ◻️ Removed (View Logs) Web May 9, 2026 at 7:34 am
kitchen-sink ◻️ Removed (View Logs) Web May 9, 2026 at 7:34 am
frontend-cloud ◻️ Removed (View Logs) Web May 9, 2026 at 7:34 am
ladle ◻️ Removed (View Logs) Web May 9, 2026 at 7:34 am

@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-4997 May 7, 2026 09:39 Destroyed
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 7, 2026

PR Review: Engine Restart / Serverless Race Investigation

This is a diagnostic/investigation branch — not a production fix. It adds a standalone Node orchestration harness and fixture actor to reproduce and time the gateway routing race after engine restart. The investigation methodology is sound and the timing findings are clearly documented.

Notes below are intended to guide cleanup before the findings become a real fix PR.


Issues

1. tick action in probes resets sleep deadline (CLAUDE.md)

In tests/driver/engine-restart-serverless.ts:

probeActor("same-handle-tick", () => input.actorHandle.tick(8192)),

CLAUDE.md states: "Driver tests that wait for actor sleep must not poll actor actions while waiting; each action counts as activity and can reset the sleep deadline." The probe set includes tick, which writes to SQLite. If the test is intentionally stressing the actor while it sleeps, add a comment explaining why tick is included rather than a read-only probe.

2. Bare ROLLBACK in tick action swallows the original error

In tests/fixtures/engine-restart-serverless-runtime.ts:

} catch (error) {
    await database.run("ROLLBACK");  // if this throws, original error is lost
    throw error;
}

If ROLLBACK itself throws (e.g., after engine disconnect), the rollback error replaces the original. The runHeartbeatSql and commitDuringEngineRestart functions correctly wrap rollback in a try/catch — the tick action should do the same.

3. runHeartbeatSql reads outside the transaction boundary

await database.run("COMMIT");

const rows = await database.query(   // outside the transaction
    "SELECT count FROM restart_heartbeat WHERE id = ?",
    [1],
);

The SELECT runs after COMMIT, so a concurrent engine reconnect could interleave and return a stale count. Move the SELECT inside the transaction before COMMIT, or add a comment accepting the potential off-by-one in this diagnostic context.

4. Dead code: rawSqlDatabaseProvider stub

const rawSqlDatabaseProvider = {
    createClient: async () => ({ execute: async () => [], close: async () => {} }),
    onMigrate: async () => {},
};

This stub is registered as db: but the actor immediately uses ctx.sql directly via the native SQLite runtime. Either remove db: rawSqlDatabaseProvider or add a comment explaining why both are present.

5. disableMetadataLookup: true lacks a comment

client = createClient({ ..., disableMetadataLookup: true });

Add a one-line comment explaining this is needed because the test manages runner config directly — otherwise it reads as an unexplained suppression.

6. waitForRunnerConfigReady uses polling loop with sleep(250)

CLAUDE.md discourages loop { check; sleep } patterns. For this standalone diagnostic harness it may be unavoidable (no event source for runner config readiness). Add a comment acknowledging this if so.

7. Investigation doc belongs in .agent/research/, not tests/driver/

Per CLAUDE.md: ".agent/research/ — research documents on external systems, prior art, and design analysis." engine-restart-serverless.md should live at .agent/research/engine-restart-serverless-race.md and not be committed to the test directory.


Minor

  • The as number cast on getCount() suggests the registry type may not be propagating through createClient<typeof registry> correctly — worth verifying.
  • ctx.vars as HeartbeatVars and ctx.sql as SqliteDatabase casts throughout the fixture are an existing pattern; not a new violation but worth cleaning up if the types can be inferred.

Summary

Status DO NOT MERGE (draft) — investigation harness only
Blocker before any real fix PR Move .md to .agent/research/; fix bare ROLLBACK error swallow in tick; remove or explain rawSqlDatabaseProvider
Minor Post-commit read outside transaction; missing comments on disableMetadataLookup and tick probe rationale
No issues Port allocation, process lifecycle, heartbeat state machine, WebSocket ping-pong harness, gateway sweep infrastructure

@NathanFlurry NathanFlurry changed the base branch from main to graphite-base/4997 May 9, 2026 05:56
@NathanFlurry NathanFlurry force-pushed the 05-07-do_not_merge_serverless_restart_race_condition branch from aa455c7 to 154af5d Compare May 9, 2026 05:57
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-4997 May 9, 2026 05:57 Destroyed
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4997 to runtime-options/preserve-bridge-errors May 9, 2026 05:57
@NathanFlurry NathanFlurry force-pushed the runtime-options/preserve-bridge-errors branch from 7d0b73c to d754df5 Compare May 9, 2026 06:24
@NathanFlurry NathanFlurry force-pushed the 05-07-do_not_merge_serverless_restart_race_condition branch from 154af5d to 011798b Compare May 9, 2026 06:24
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-4997 May 9, 2026 06:24 Destroyed
@NathanFlurry NathanFlurry changed the base branch from runtime-options/preserve-bridge-errors to graphite-base/4997 May 9, 2026 07:33
@NathanFlurry NathanFlurry reopened this May 9, 2026
@MasterPtato MasterPtato force-pushed the 05-07-do_not_merge_serverless_restart_race_condition branch from 011798b to 673ed87 Compare May 9, 2026 23:19
@MasterPtato MasterPtato force-pushed the graphite-base/4997 branch from d754df5 to 72cbc7c Compare May 9, 2026 23:19
@MasterPtato MasterPtato changed the base branch from graphite-base/4997 to main May 9, 2026 23:19
@MasterPtato MasterPtato force-pushed the 05-07-do_not_merge_serverless_restart_race_condition branch from 673ed87 to e1d5a39 Compare May 9, 2026 23:21
@NathanFlurry NathanFlurry changed the base branch from main to graphite-base/4997 May 12, 2026 18:02
@MasterPtato MasterPtato force-pushed the 05-07-do_not_merge_serverless_restart_race_condition branch from aeb5da6 to e960beb Compare May 14, 2026 17:26
@MasterPtato MasterPtato force-pushed the rivetkit-shutdown/exit-pid1-after-drain branch from a20fe59 to e344619 Compare May 14, 2026 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant