Skip to content

fix(rivetkit): use engine actor stop threshold for shutdown#5049

Closed
NathanFlurry wants to merge 1 commit into
05-07-do_not_merge_serverless_restart_race_conditionfrom
rivetkit-shutdown/use-engine-stop-threshold
Closed

fix(rivetkit): use engine actor stop threshold for shutdown#5049
NathanFlurry wants to merge 1 commit into
05-07-do_not_merge_serverless_restart_race_conditionfrom
rivetkit-shutdown/use-engine-stop-threshold

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 12, 2026

Code Review: fix(rivetkit): use engine actor stop threshold for shutdown

Overview

This PR replaces the hardcoded ENVOY_STOP_WAIT_MS = 15_000 constant with the engine-provided actorStopThreshold from Pegboard's protocol metadata, so the shutdown grace period is coordinated with the engine's own hard cutoff for actors. The implementation threads this value from Rust (envoy-client protocol metadata) through NAPI into the TypeScript layer. A comprehensive SIGTERM + onSleep smoke test is added alongside a new sigtermSleepProbe kitchen-sink actor.

The core motivation is correct and well-executed. A few issues worth addressing below.


Issues

1. 30-minute fallback is dangerously large

// rivetkit-typescript/packages/rivetkit/src/registry/index.ts
const gracePeriodMs =
    config.shutdown?.gracePeriodMs ??
    (await this.#actorStopThresholdMs(configuredRegistryPromise)) ??
    30 * 60 * 1000;  // 30 minutes

The previous default was 30 s. The new fallback is 30 minutes. If the envoy never connects (engine down, network partition, startup failure), actor_stop_threshold_ms() returns None and shutdown waits 30 minutes before the OS kills the process. This can turn any engine-unreachable scenario into a 30-minute zombie. A smaller fallback (e.g., the original 30 s or the Kubernetes terminationGracePeriodSeconds default) would be safer.

2. Redundant ?? undefined in napi-runtime.ts

return (
    (await asNativeRegistry(registry).actorStopThresholdMs()) ??
    undefined
);

null ?? undefined evaluates to undefined, so the ?? undefined is a no-op. Either drop it or add a brief comment explaining the null-to-undefined coercion intent.

3. Magic number in readyState check (sigterm-sleep-probe.ts)

// used twice in onSleep
if (websocket.readyState !== 1) continue;

1 is WebSocket.OPEN. The named constant is available and used elsewhere in the smoke-test script (ws.readyState === WebSocket.OPEN). Prefer the named constant for consistency.

4. Type coercion masking in #envoyStopWaitMs()

const actorStopThreshold = Number(
    this.#envoy.getProtocolMetadata()?.actorStopThreshold,
);

Number(undefined) yields NaN so the fallback fires correctly, but the Number() wrapper hides whether actorStopThreshold is already typed as a number in the protocol metadata type. If it is, a direct optional-chain access with an explicit typeof guard reads cleaner and makes the type contract explicit.

5. Smoke test exit code on failure

// on-sleep-sigterm-smoke.ts, last line
main().then(() => process.exit(0));

There is no .catch() handler. A thrown error causes an unhandled promise rejection which exits non-zero on Node.js >= 15, but prints without a clean [test] FAIL banner. A .catch((err) => { console.error(err); process.exit(1); }) makes CI output easier to diagnose.


Positive Observations

  • Exhaustive match in registry.rs — all RegistryState variants are enumerated explicitly with no _ => catch-all, matching the CLAUDE.md convention.
  • Sync lock guard not held across awaitsself.serving_envoy.lock().clone() drops the guard immediately before any async work, which is correct for a parking_lot lock in an async context.
  • Error isolation in #actorStopThresholdMs — wrapping the threshold fetch in a try/catch so a flaky engine connection degrades gracefully to the fallback is the right call.
  • Smoke test cleanup — the finally block in main() properly closes WebSockets and kills both runners, and the .catch(() => undefined) on runner1ExitPromise prevents unhandled rejection when it is awaited later.
  • migrations.js fix — adding with { type: 'json' } is the correct modern import-attributes syntax for JSON imports under Node's --experimental-strip-types.
  • Consistent threshold propagation — the threshold surfaces through both the Serving (direct envoy) and Serverless (via active_envoy_actor_stop_threshold_ms) code paths, which is necessary for correctness.

Summary

The main concern is the 30-minute fallback: it risks turning any startup failure or engine-unreachable scenario into a very long-running zombie process. A fallback on the order of the original 30 s (or at most the expected Kubernetes terminationGracePeriodSeconds) would be safer. The other items are minor nits.

@NathanFlurry NathanFlurry changed the base branch from rivetkit-shutdown/exit-pid1-after-drain to graphite-base/5049 May 12, 2026 18:02
@NathanFlurry NathanFlurry force-pushed the rivetkit-shutdown/use-engine-stop-threshold branch from 837e20f to f492c44 Compare May 12, 2026 18:02
@NathanFlurry NathanFlurry changed the base branch from graphite-base/5049 to 05-07-do_not_merge_serverless_restart_race_condition May 12, 2026 18:02
@MasterPtato MasterPtato force-pushed the 05-07-do_not_merge_serverless_restart_race_condition branch from aeb5da6 to e960beb Compare May 14, 2026 17:26
@MasterPtato MasterPtato force-pushed the rivetkit-shutdown/use-engine-stop-threshold branch from f492c44 to 5102269 Compare May 14, 2026 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant