[world-postgres, world-local] Fix TOCTOU races in entity state transitions#1434
[world-postgres, world-local] Fix TOCTOU races in entity state transitions#1434
Conversation
The step_started UPDATE had no conditional guard on step status, allowing a concurrent execution to revert a completed step back to 'running'. This caused duplicate step_completed events, triggering CORRUPTED_EVENT_LOG. Add notInArray guard to match the existing pattern on step_completed and the DynamoDB conditional expression used in the Vercel world. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
🦋 Changeset detectedLatest commit: 6f0f64a The changes in this PR will be included in the next version bump. This PR includes changesets to release 18 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
🧪 E2E Test Results❌ Some tests failed Summary
❌ Failed Tests▲ Vercel Production (2 failed)astro (1 failed):
hono (1 failed):
💻 Local Development (60 failed)astro-stable (5 failed):
express-stable (5 failed):
fastify-stable (5 failed):
hono-stable (5 failed):
nextjs-turbopack-canary (5 failed):
nextjs-turbopack-stable (5 failed):
nextjs-webpack-canary (5 failed):
nextjs-webpack-stable (5 failed):
nitro-stable (5 failed):
nuxt-stable (5 failed):
sveltekit-stable (5 failed):
vite-stable (5 failed):
📦 Local Production (60 failed)astro-stable (5 failed):
express-stable (5 failed):
fastify-stable (5 failed):
hono-stable (5 failed):
nextjs-turbopack-canary (5 failed):
nextjs-turbopack-stable (5 failed):
nextjs-webpack-canary (5 failed):
nextjs-webpack-stable (5 failed):
nitro-stable (5 failed):
nuxt-stable (5 failed):
sveltekit-stable (5 failed):
vite-stable (5 failed):
🪟 Windows (5 failed)nextjs-turbopack (5 failed):
🌍 Community Worlds (56 failed)mongodb (3 failed):
redis (2 failed):
turso (51 failed):
📋 Other (10 failed)e2e-local-dev-nest-stable (5 failed):
e2e-local-prod-nest-stable (5 failed):
Details by Category❌ ▲ Vercel Production
❌ 💻 Local Development
❌ 📦 Local Production
✅ 🐘 Local Postgres
❌ 🪟 Windows
❌ 🌍 Community Worlds
❌ 📋 Other
❌ Some E2E test jobs failed:
Check the workflow run for details. |
📊 Benchmark Results
workflow with no steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 1 step💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 10 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 25 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 50 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Promise.all with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) | Nitro Promise.all with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) | Nitro Promise.all with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) | Nitro Promise.race with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) | Express Promise.race with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Promise.race with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Stream Benchmarks (includes TTFB metrics)workflow with stream💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) SummaryFastest Framework by WorldWinner determined by most benchmark wins
Fastest World by FrameworkWinner determined by most benchmark wins
Column Definitions
Worlds:
|
There was a problem hiding this comment.
Pull request overview
Fixes a concurrency bug in the Postgres world’s event-sourcing storage where step_started could race with step_completed and revert a terminal step back to running, leading to duplicated step events and replay failures (CORRUPTED_EVENT_LOG).
Changes:
- Adds a terminal-state guard to the
step_startedstep UPDATE (status NOT IN ('completed','failed')) to prevent reverting completed/failed steps. - Improves in-code documentation explaining the TOCTOU scenario and why the conditional UPDATE is required.
- Adds a changeset to release the fix as a patch for
@workflow/world-postgres.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| packages/world-postgres/src/storage.ts | Adds a conditional WHERE clause to step_started UPDATE to prevent reverting terminal steps; expands comments and adds a fallback existence/terminal-state check when the UPDATE affects 0 rows. |
| .changeset/fix-step-started-race-condition.md | Patch changeset documenting the race-condition fix and its rationale. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| .where( | ||
| and( | ||
| eq(Schema.steps.runId, effectiveRunId), | ||
| eq(Schema.steps.stepId, data.correlationId!) | ||
| eq(Schema.steps.stepId, data.correlationId!), | ||
| // Only update if not already in terminal state (prevents TOCTOU race) | ||
| notInArray(Schema.steps.status, ['completed', 'failed']) | ||
| ) |
| '@workflow/world-postgres': patch | ||
| --- | ||
|
|
||
| Fix race condition in `step_started` that could corrupt the event log. The `UPDATE` for `step_started` now includes a conditional guard (`status NOT IN ('completed', 'failed')`) to prevent a concurrent step execution from reverting a completed step back to running. This matches the existing guard on `step_completed` and the DynamoDB conditional expression used in the Vercel world. |
There was a problem hiding this comment.
Unnecessary context.
| Fix race condition in `step_started` that could corrupt the event log. The `UPDATE` for `step_started` now includes a conditional guard (`status NOT IN ('completed', 'failed')`) to prevent a concurrent step execution from reverting a completed step back to running. This matches the existing guard on `step_completed` and the DynamoDB conditional expression used in the Vercel world. | |
| Fix race condition in `step_started` that could corrupt the event log. The `UPDATE` for `step_started` now includes a conditional guard (`status NOT IN ('completed', 'failed')`) to prevent a concurrent step execution from reverting a completed step back to running. |
Add conditional WHERE clauses to match the Vercel world's DynamoDB conditional expressions, preventing TOCTOU races where concurrent requests could bypass pre-validation and write invalid state transitions. Changes: - step_started: add NOT IN (completed, failed, cancelled) guard - step_retrying: add terminal-state guard (was unguarded) - step_completed/step_failed: add cancelled to guard - run_completed/run_failed/run_cancelled: add terminal-state guards - isStepTerminal: include cancelled status Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Local world fixes: - step_completed/step_failed: use writeExclusive lock to prevent concurrent duplicate terminal transitions - step_started: check for terminal lock file before allowing start - wait_completed: use writeExclusive lock (port from PR #1388) - isStepTerminal: include cancelled status Tests: - Concurrent step_completed race (exactly one succeeds, one gets 409) - Concurrent step_failed race - step_started rejection after concurrent step_completed - Concurrent wait_completed race Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| Add atomic terminal-state guards to all entity update operations across both postgres and local worlds to match the Vercel world's DynamoDB conditional expressions. | ||
|
|
||
| **Postgres world**: Add conditional WHERE clauses to prevent TOCTOU races where concurrent requests bypass pre-validation and corrupt the event log. | ||
|
|
||
| **Local world**: Use `writeExclusive` lock files to atomically prevent concurrent terminal state transitions for steps and waits. Add `cancelled` to `isStepTerminal`. |
There was a problem hiding this comment.
split these into 2 separate changelings, 1 for each package
Summary
Add atomic terminal-state guards to all entity update operations across both postgres and local worlds to match the Vercel world's DynamoDB conditional expressions. Previously, several updates had no atomic guard on status, allowing concurrent requests to bypass TOCTOU pre-validation and corrupt the event log.
Root cause
The postgres world's entity UPDATEs were not atomic with respect to state validation. Pre-validation would SELECT the current status, check it, then proceed to an unconditional UPDATE — a classic TOCTOU (time-of-check-time-of-use) race.
When multiple webhook
hook_receivedevents triggered concurrent workflow continuations that each queued the same step, two Graphile Worker jobs could execute simultaneously:step_started→ execute step →step_completed(sets status tocompleted)step_startedvalidation reads step asrunning(passes) → unconditional UPDATE reverts status fromcompletedback torunning→ executes step again →step_completedsucceeds (status wasrunning)step_completedevents for one step → second is unconsumed →CORRUPTED_EVENT_LOGPostgres world changes
Add conditional WHERE clauses to all entity UPDATEs:
step_startedNOT IN (completed, failed, cancelled)step_completedcancelledNOT IN (completed, failed, cancelled)step_failedcancelledNOT IN (completed, failed, cancelled)step_retryingNOT IN (completed, failed, cancelled)run_completedNOT IN (completed, failed, cancelled)run_failedNOT IN (completed, failed, cancelled)run_cancelledNOT IN (completed, failed, cancelled)wait_completedLocal world changes
step_completed/step_failed: UsewriteExclusivelock file to atomically prevent concurrent duplicate terminal transitionsstep_started: Check for terminal lock file before allowing startwait_completed: UsewriteExclusivelock (same pattern as PR Fix concurrent wait_completed race condition in world-local #1388)isStepTerminal: IncludecancelledstatusTests added
step_completedrace (exactly one succeeds, one gets 409)step_failedracestep_startedrejection after concurrentstep_completedwait_completedraceTest plan
pnpm buildpasseswebhookWorkflowtest)Closes #1388
🤖 Generated with Claude Code