feat: enforce max queue deliveries in handlers with graceful failure #1344

Draft
pranaygp wants to merge 1 commit into pgp/semantic-world-errors from pgp/handler-max-deliveries

pranaygp (Collaborator) commented Mar 12, 2026

Summary

Replaces VQS maxDeliveries: 64 cap with handler-level enforcement. Handlers now gracefully fail runs/steps after excessive queue redeliveries, preventing "phantom stuck" runs.

Stacked on #1342, #1340

Problem

When infrastructure is down (OOMs, network outages), VQS retries messages up to maxDeliveries: 64 times at 5s intervals. After exhausting retries, VQS drops the message — the run stays in running status forever with no error, no failure event.

Solution

  1. Remove maxDeliveries from VQS config — allow infinite retries at queue level
  2. Keep retryAfterSeconds: 5 — VQS owns retry timing (works even after SIGKILL/OOM)
  3. Handlers check metadata.attempt — when > MAX_QUEUE_DELIVERIES (64), fail gracefully with MAX_DELIVERIES_EXCEEDED error code
  4. If even failure event creation fails — log detailed error and consume the message (no point retrying further)
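
The handler-side check in step 3 could look roughly like the sketch below. Only `metadata.attempt`, `MAX_QUEUE_DELIVERIES` (64), the `MAX_DELIVERIES_EXCEEDED` code, and the error message format come from this PR. The `QueueMetadata` and `DeliveryDecision` types and the `checkDeliveryLimit` helper are hypothetical names used for illustration, not the actual handler code.

```typescript
// Illustrative sketch only; helper and type names are assumptions.
const MAX_QUEUE_DELIVERIES = 64;

interface QueueMetadata {
  // Delivery count reported by the queue for this message.
  attempt: number;
}

type DeliveryDecision =
  | { action: 'run' }
  | { action: 'fail'; errorCode: 'MAX_DELIVERIES_EXCEEDED'; message: string };

// Decide whether the handler should process the message or fail the run/step.
function checkDeliveryLimit(metadata: QueueMetadata): DeliveryDecision {
  // Strictly greater than: delivery 64 still runs, delivery 65 fails.
  if (metadata.attempt > MAX_QUEUE_DELIVERIES) {
    return {
      action: 'fail',
      errorCode: 'MAX_DELIVERIES_EXCEEDED',
      message: `Workflow exceeded maximum queue deliveries (${metadata.attempt}/${MAX_QUEUE_DELIVERIES})`,
    };
  }
  return { action: 'run' };
}
```

Keeping the decision a pure function of `metadata.attempt` makes the boundary easy to unit test, which matches the `attempt: 64` and `attempt: 65` cases exercised in the new step handler tests.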

Queue error log examples (before → after)

Before (dumped full body, no run context):

[local world] Failed to queue message {
  queueName: '__wkf_step_...',
  text: '"WorkflowAPIError: Injected 5xx"',
  status: 500,
  headers: { ... },
  body: '{"workflowName":"...","workflowRunId":"wrun_01KKF...",
    "stepId":"step_01KKF...","traceCarrier":{...}}'
}

After (structured, includes run/step IDs, separates HTTP status from handler error):

[world-local] Queue message failed (attempt 3, HTTP 500) {
  queueName: '__wkf_step_...',
  messageId: 'msg_01KKF...',
  runId: 'wrun_01KKF...',
  stepId: 'step_01KKF...',
  handlerError: '"WorkflowAPIError: Injected 5xx"'
}
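
One way to produce the structured "After" shape is to parse the message body for IDs rather than dumping it. The sketch below assumes the body is JSON carrying `workflowRunId` and `stepId`, as in the "Before" log. `extractIds` and `buildQueueFailureLog` are hypothetical helper names, not functions from `packages/world-local/src/queue.ts`.

```typescript
// Illustrative sketch only; helper names are assumptions.
interface QueueFailureLog {
  queueName: string;
  messageId: string;
  runId: string | undefined;
  stepId: string | undefined;
  handlerError: string;
}

// Pull run/step IDs out of the JSON body instead of logging the whole body.
function extractIds(body: string): { runId: string | undefined; stepId: string | undefined } {
  const empty = { runId: undefined, stepId: undefined };
  try {
    const parsed: unknown = JSON.parse(body);
    if (typeof parsed !== 'object' || parsed === null) return empty;
    const record = parsed as Record<string, unknown>;
    return {
      runId: typeof record.workflowRunId === 'string' ? record.workflowRunId : undefined,
      stepId: typeof record.stepId === 'string' ? record.stepId : undefined,
    };
  } catch {
    // Unparseable body: log without IDs rather than throw from the logger.
    return empty;
  }
}

// Build the headline (attempt + HTTP status) and the structured details object.
function buildQueueFailureLog(args: {
  queueName: string;
  messageId: string;
  attempt: number;
  httpStatus: number;
  body: string;
  handlerError: string;
}): { headline: string; details: QueueFailureLog } {
  const { runId, stepId } = extractIds(args.body);
  return {
    headline: `[world-local] Queue message failed (attempt ${args.attempt}, HTTP ${args.httpStatus})`,
    details: {
      queueName: args.queueName,
      messageId: args.messageId,
      runId,
      stepId,
      handlerError: args.handlerError,
    },
  };
}
```

The headline and details can then be passed to the logger as two arguments, for example `console.error(log.headline, log.details)`.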

Local world queue

  • Removed hardcoded 3-retry cap → 1000 safety limit (handler enforces the real limit at 64)
  • Matches production VQS behavior
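
A simplified, synchronous sketch of that loop shape follows. Only `MAX_LOCAL_SAFETY_LIMIT = 1000` and the `for` loop bound come from this PR. `DeliveryResult`, `deliverUntilConsumed`, and the 1-based attempt number passed to the handler are assumptions for illustration; the real local queue is asynchronous and waits between retries.

```typescript
// Illustrative sketch only; names other than MAX_LOCAL_SAFETY_LIMIT are assumptions.
const MAX_LOCAL_SAFETY_LIMIT = 1000;

type DeliveryResult = 'consumed' | 'retry';

// Redeliver until the handler consumes the message or the safety limit is hit.
// Returns the attempt number that consumed the message, or null if none did.
function deliverUntilConsumed(
  handler: (attempt: number) => DeliveryResult
): number | null {
  for (let attempt = 0; attempt < MAX_LOCAL_SAFETY_LIMIT; attempt++) {
    // Assumption: the handler sees a 1-based delivery count.
    if (handler(attempt + 1) === 'consumed') {
      return attempt + 1;
    }
  }
  // Only reached if the handler never consumes; the handler-level limit of 64
  // is expected to end the loop long before this point.
  return null;
}
```

With a handler that keeps asking for a retry until its own limit of 64 is exceeded, the loop ends on delivery 65, well short of the 1000 safety cap.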

Test plan

  • 3 new unit tests for step handler max delivery enforcement
  • All core tests pass
  • All world-local tests pass
  • E2E: persistent failure → failed with MAX_DELIVERIES_EXCEEDED
  • E2E: transient failure → normal completion

🤖 Generated with Claude Code

changeset-bot bot commented Mar 12, 2026

🦋 Changeset detected

Latest commit: a6fdf01

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 20 packages
Name Type
@workflow/errors Patch
@workflow/core Patch
@workflow/world-local Patch
@workflow/builders Patch
@workflow/cli Patch
workflow Patch
@workflow/world-postgres Patch
@workflow/world-vercel Patch
@workflow/next Patch
@workflow/nitro Patch
@workflow/vitest Patch
@workflow/web-shared Patch
@workflow/world-testing Patch
@workflow/astro Patch
@workflow/nest Patch
@workflow/rollup Patch
@workflow/sveltekit Patch
@workflow/vite Patch
@workflow/ai Patch
@workflow/nuxt Patch


vercel bot commented Mar 12, 2026

The latest updates on your projects.

Project Deployment Actions Updated (UTC)
example-nextjs-workflow-turbopack Ready Preview, Comment Mar 18, 2026 2:29am
example-nextjs-workflow-webpack Ready Preview, Comment Mar 18, 2026 2:29am
example-workflow Ready Preview, Comment Mar 18, 2026 2:29am
workbench-astro-workflow Ready Preview, Comment Mar 18, 2026 2:29am
workbench-express-workflow Ready Preview, Comment Mar 18, 2026 2:29am
workbench-fastify-workflow Ready Preview, Comment Mar 18, 2026 2:29am
workbench-hono-workflow Ready Preview, Comment Mar 18, 2026 2:29am
workbench-nitro-workflow Ready Preview, Comment Mar 18, 2026 2:29am
workbench-nuxt-workflow Ready Preview, Comment Mar 18, 2026 2:29am
workbench-sveltekit-workflow Ready Preview, Comment Mar 18, 2026 2:29am
workbench-vite-workflow Ready Preview, Comment Mar 18, 2026 2:29am
workflow-docs Ready Preview, Comment, Open in v0 Mar 18, 2026 2:29am
workflow-nest Ready Preview, Comment Mar 18, 2026 2:29am
workflow-swc-playground Ready Preview, Comment Mar 18, 2026 2:29am

github-actions bot commented Mar 12, 2026

🧪 E2E Test Results

Some tests failed

Summary

Category Passed Failed Skipped Total
❌ ▲ Vercel Production 755 3 67 825
✅ 💻 Local Development 727 0 98 825
✅ 📦 Local Production 782 0 118 900
❌ 🐘 Local Postgres 781 1 118 900
✅ 🪟 Windows 72 0 3 75
❌ 🌍 Community Worlds 118 56 15 189
✅ 📋 Other 198 0 27 225
Total 3433 60 446 3939

❌ Failed Tests

▲ Vercel Production (3 failed)

fastify (1 failed):

  • customSerializationWorkflow - custom class serialization with WORKFLOW_SERIALIZE/WORKFLOW_DESERIALIZE

nitro (1 failed):

  • hookWithSleepWorkflow - hook payloads delivered correctly with concurrent sleep

sveltekit (1 failed):

  • error handling error propagation step errors cross-file step error preserves message and function names in stack

🐘 Local Postgres (1 failed)

nitro-stable (1 failed):

  • webhookWorkflow

🌍 Community Worlds (56 failed)

mongodb (3 failed):

  • hookWorkflow is not resumable via public webhook endpoint
  • webhookWorkflow
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously

redis (2 failed):

  • hookWorkflow is not resumable via public webhook endpoint
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously

turso (51 failed):

  • addTenWorkflow
  • addTenWorkflow
  • wellKnownAgentWorkflow (.well-known/agent)
  • should work with react rendering in step
  • promiseAllWorkflow
  • promiseRaceWorkflow
  • promiseAnyWorkflow
  • importedStepOnlyWorkflow
  • hookWorkflow
  • hookWorkflow is not resumable via public webhook endpoint
  • webhookWorkflow
  • sleepingWorkflow
  • parallelSleepWorkflow
  • nullByteWorkflow
  • workflowAndStepMetadataWorkflow
  • fetchWorkflow
  • promiseRaceStressTestWorkflow
  • error handling error propagation workflow errors nested function calls preserve message and stack trace
  • error handling error propagation workflow errors cross-file imports preserve message and stack trace
  • error handling error propagation step errors basic step error preserves message and stack trace
  • error handling error propagation step errors cross-file step error preserves message and function names in stack
  • error handling retry behavior regular Error retries until success
  • error handling retry behavior FatalError fails immediately without retries
  • error handling retry behavior RetryableError respects custom retryAfter delay
  • error handling retry behavior maxRetries=0 disables retries
  • error handling catchability FatalError can be caught and detected with FatalError.is()
  • hookCleanupTestWorkflow - hook token reuse after workflow completion
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously
  • hookDisposeTestWorkflow - hook token reuse after explicit disposal while workflow still running
  • stepFunctionPassingWorkflow - step function references can be passed as arguments (without closure vars)
  • stepFunctionWithClosureWorkflow - step function with closure variables passed as argument
  • closureVariableWorkflow - nested step functions with closure variables
  • spawnWorkflowFromStepWorkflow - spawning a child workflow using start() inside a step
  • health check (queue-based) - workflow and step endpoints respond to health check messages
  • pathsAliasWorkflow - TypeScript path aliases resolve correctly
  • Calculator.calculate - static workflow method using static step methods from another class
  • AllInOneService.processNumber - static workflow method using sibling static step methods
  • ChainableService.processWithThis - static step methods using this to reference the class
  • thisSerializationWorkflow - step function invoked with .call() and .apply()
  • customSerializationWorkflow - custom class serialization with WORKFLOW_SERIALIZE/WORKFLOW_DESERIALIZE
  • instanceMethodStepWorkflow - instance methods with "use step" directive
  • crossContextSerdeWorkflow - classes defined in step code are deserializable in workflow context
  • stepFunctionAsStartArgWorkflow - step function reference passed as start() argument
  • cancelRun - cancelling a running workflow
  • cancelRun via CLI - cancelling a running workflow
  • pages router addTenWorkflow via pages router
  • pages router promiseAllWorkflow via pages router
  • pages router sleepingWorkflow via pages router
  • hookWithSleepWorkflow - hook payloads delivered correctly with concurrent sleep
  • sleepInLoopWorkflow - sleep inside loop with steps actually delays each iteration
  • sleepWithSequentialStepsWorkflow - sequential steps work with concurrent sleep (control)

Details by Category

❌ ▲ Vercel Production
App Passed Failed Skipped
✅ astro 68 0 7
✅ example 68 0 7
✅ express 68 0 7
❌ fastify 67 1 7
✅ hono 68 0 7
✅ nextjs-turbopack 73 0 2
✅ nextjs-webpack 73 0 2
❌ nitro 67 1 7
✅ nuxt 68 0 7
❌ sveltekit 67 1 7
✅ vite 68 0 7
✅ 💻 Local Development
App Passed Failed Skipped
✅ astro-stable 66 0 9
✅ express-stable 66 0 9
✅ fastify-stable 66 0 9
✅ hono-stable 66 0 9
✅ nextjs-turbopack-stable 72 0 3
✅ nextjs-webpack-canary 55 0 20
✅ nextjs-webpack-stable 72 0 3
✅ nitro-stable 66 0 9
✅ nuxt-stable 66 0 9
✅ sveltekit-stable 66 0 9
✅ vite-stable 66 0 9
✅ 📦 Local Production
App Passed Failed Skipped
✅ astro-stable 66 0 9
✅ express-stable 66 0 9
✅ fastify-stable 66 0 9
✅ hono-stable 66 0 9
✅ nextjs-turbopack-canary 55 0 20
✅ nextjs-turbopack-stable 72 0 3
✅ nextjs-webpack-canary 55 0 20
✅ nextjs-webpack-stable 72 0 3
✅ nitro-stable 66 0 9
✅ nuxt-stable 66 0 9
✅ sveltekit-stable 66 0 9
✅ vite-stable 66 0 9
❌ 🐘 Local Postgres
App Passed Failed Skipped
✅ astro-stable 66 0 9
✅ express-stable 66 0 9
✅ fastify-stable 66 0 9
✅ hono-stable 66 0 9
✅ nextjs-turbopack-canary 55 0 20
✅ nextjs-turbopack-stable 72 0 3
✅ nextjs-webpack-canary 55 0 20
✅ nextjs-webpack-stable 72 0 3
❌ nitro-stable 65 1 9
✅ nuxt-stable 66 0 9
✅ sveltekit-stable 66 0 9
✅ vite-stable 66 0 9
✅ 🪟 Windows
App Passed Failed Skipped
✅ nextjs-turbopack 72 0 3
❌ 🌍 Community Worlds
App Passed Failed Skipped
✅ mongodb-dev 3 0 2
❌ mongodb 52 3 3
✅ redis-dev 3 0 2
❌ redis 53 2 3
✅ turso-dev 3 0 2
❌ turso 4 51 3
✅ 📋 Other
App Passed Failed Skipped
✅ e2e-local-dev-nest-stable 66 0 9
✅ e2e-local-postgres-nest-stable 66 0 9
✅ e2e-local-prod-nest-stable 66 0 9

Some E2E test jobs failed:

  • Vercel Prod: failure
  • Local Dev: failure
  • Local Prod: success
  • Local Postgres: failure
  • Windows: success

Check the workflow run for details.

pranaygp commented Mar 12, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite.

vercel bot left a comment

Additional Suggestion:

The SvelteKit package hardcodes maxDeliveries: 64 on its queue triggers, so VQS silently drops messages before the handler can gracefully fail the run or step.

pranaygp force-pushed the pgp/handler-max-deliveries branch from d34edf9 to c2ed3e7 on March 17, 2026 19:35
pranaygp force-pushed the pgp/semantic-world-errors branch from 90bd273 to 0a9c6f7 on March 17, 2026 19:35
pranaygp force-pushed the pgp/handler-max-deliveries branch from c2ed3e7 to 085a05a on March 17, 2026 22:37

Copilot AI left a comment

Pull request overview

This PR moves enforcement of the queue delivery cap from the Vercel Queue trigger configuration into the workflow/step runtime handlers, and updates local-queue behavior/logging to support the new approach.

Changes:

  • Add a shared MAX_QUEUE_DELIVERIES constant and enforce it in both workflow and step handlers with graceful failure (run_failed / step_failed + requeue workflow for step).
  • Remove maxDeliveries from queue trigger definitions in @workflow/builders.
  • Improve world-local queue logging with runId/stepId context and add a local retry safety limit.
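
As a rough illustration of the step-side flow described in the first bullet (record `step_failed`, re-queue the workflow, and consume the message even if recording fails), consider the sketch below. The `StepDeliveryDeps` interface, the outcome strings, and `enforceStepDeliveryLimit` are hypothetical; the actual handler calls `world.events.create` directly and, per the new tests, consumes the message silently on an `EntityConflictError`, a distinction this sketch does not model.

```typescript
// Illustrative sketch only; dependency and function names are assumptions.
const MAX_QUEUE_DELIVERIES = 64;

interface StepDeliveryDeps {
  createStepFailedEvent: (runId: string, stepId: string, message: string) => Promise<void>;
  requeueWorkflow: (runId: string) => Promise<void>;
  logError: (message: string, context: Record<string, unknown>) => void;
}

type StepDeliveryOutcome = 'proceed' | 'failed-gracefully' | 'consumed-after-error';

async function enforceStepDeliveryLimit(
  runId: string,
  stepId: string,
  attempt: number,
  deps: StepDeliveryDeps
): Promise<StepDeliveryOutcome> {
  // Under or at the limit: let the normal step handler run.
  if (attempt <= MAX_QUEUE_DELIVERIES) return 'proceed';

  const message = `Step exceeded max deliveries (${attempt}/${MAX_QUEUE_DELIVERIES})`;
  try {
    // Record the failure, then wake the workflow so it can observe it.
    await deps.createStepFailedEvent(runId, stepId, message);
    await deps.requeueWorkflow(runId);
    deps.logError(message, { workflowRunId: runId, stepId });
    return 'failed-gracefully';
  } catch (err) {
    // Recording the failure itself failed: log the details and consume the
    // message anyway, since further redeliveries would not help.
    deps.logError('Could not record step failure; consuming message', {
      workflowRunId: runId,
      stepId,
      attempt,
      cause: String(err),
    });
    return 'consumed-after-error';
  }
}
```

Returning normally in every branch is what tells the queue the message was consumed; only a thrown error or non-2xx response would trigger another redelivery.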

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

File Description
packages/world-local/src/queue.ts Adds structured identifiers to logs and replaces the old retry counter with a fixed safety-loop cap.
packages/errors/src/error-codes.ts Introduces MAX_DELIVERIES_EXCEEDED run error code.
packages/core/src/runtime/step-handler.ts Enforces max deliveries for steps and adjusts event creation/logging behavior.
packages/core/src/runtime/step-handler.test.ts Adds test coverage for step max-deliveries behavior.
packages/core/src/runtime/constants.ts Defines MAX_QUEUE_DELIVERIES.
packages/core/src/runtime.ts Enforces max deliveries for workflow handler and records run_failed with a specific error code.
packages/builders/src/constants.ts Removes VQS maxDeliveries from trigger constants.
.changeset/handler-max-deliveries.md Changeset describing the behavior shift from VQS config to handler enforcement.

Comment on lines +172 to +176
const startResult = await world.events.create(workflowRunId, {
  eventType: 'step_started',
  specVersion: SPEC_VERSION_CURRENT,
  correlationId: stepId,
});

Comment on lines +120 to +131
try {
  const world = getWorld();
  await world.events.create(runId, {
    eventType: 'run_failed',
    specVersion: SPEC_VERSION_CURRENT,
    eventData: {
      error: {
        message: `Workflow exceeded maximum queue deliveries (${metadata.attempt}/${MAX_QUEUE_DELIVERIES})`,
      },
      errorCode: RUN_ERROR_CODES.MAX_DELIVERIES_EXCEEDED,
    },
  });

Comment on lines +117 to +121
+ // Safety limit to prevent infinite loops in the local queue.
+ // The actual max delivery enforcement happens in the workflow/step handlers.
+ const MAX_LOCAL_SAFETY_LIMIT = 1000;
  try {
-   let defaultRetriesLeft = 3;
-   for (let attempt = 0; defaultRetriesLeft > 0; attempt++) {
-     defaultRetriesLeft--;
+   for (let attempt = 0; attempt < MAX_LOCAL_SAFETY_LIMIT; attempt++) {

Comment on lines +529 to +568
it('should post step_failed and re-queue workflow when delivery count exceeds max', async () => {
  const result = await capturedHandler(
    createMessage(),
    { ...createMetadata('myStep'), attempt: 65 }
  );

  expect(result).toBeUndefined();
  expect(mockEventsCreate).toHaveBeenCalledWith(
    'wrun_test123',
    expect.objectContaining({
      eventType: 'step_failed',
      correlationId: 'step_abc',
    })
  );
  expect(mockQueueMessage).toHaveBeenCalled();
  expect(mockRuntimeLogger.error).toHaveBeenCalledWith(
    expect.stringContaining('exceeded max deliveries'),
    expect.objectContaining({ workflowRunId: 'wrun_test123' })
  );
});

it('should consume message silently when step_failed fails with EntityConflictError', async () => {
  mockEventsCreate.mockRejectedValue(
    new EntityConflictError('Step already completed')
  );

  const result = await capturedHandler(
    createMessage(),
    { ...createMetadata('myStep'), attempt: 65 }
  );

  expect(result).toBeUndefined();
  expect(mockStepFn).not.toHaveBeenCalled();
});

it('should not trigger max deliveries check when under limit', async () => {
  const result = await capturedHandler(
    createMessage(),
    { ...createMetadata('myStep'), attempt: 64 }
  );

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>