feat: enforce max queue deliveries in handlers with graceful failure #1344
pranaygp wants to merge 1 commit into pgp/semantic-world-errors
🦋 Changeset detected. Latest commit: a6fdf01. The changes in this PR will be included in the next version bump. This PR includes changesets to release 20 packages.
🧪 E2E Test Results: ❌ Some tests failed
❌ Failed Tests
- ▲ Vercel Production (3 failed): fastify (1 failed), nitro (1 failed), sveltekit (1 failed)
- 🐘 Local Postgres (1 failed): nitro-stable (1 failed)
- 🌍 Community Worlds (56 failed): mongodb (3 failed), redis (2 failed), turso (51 failed)
Details by Category
- ❌ ▲ Vercel Production
- ✅ 💻 Local Development
- ✅ 📦 Local Production
- ❌ 🐘 Local Postgres
- ✅ 🪟 Windows
- ❌ 🌍 Community Worlds
- ✅ 📋 Other
❌ Some E2E test jobs failed. Check the workflow run for details.
Warning: This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite.
3ff6de8 to 9929b57
d34edf9 to c2ed3e7
90bd273 to 0a9c6f7
c2ed3e7 to 085a05a
Pull request overview
This PR moves enforcement of the queue delivery cap from the Vercel Queue trigger configuration into the workflow/step runtime handlers, and updates local-queue behavior/logging to support the new approach.
Changes:
- Add a shared `MAX_QUEUE_DELIVERIES` constant and enforce it in both workflow and step handlers with graceful failure (`run_failed`/`step_failed` + requeue workflow for step).
- Remove `maxDeliveries` from queue trigger definitions in `@workflow/builders`.
- Improve `world-local` queue logging with `runId`/`stepId` context and add a local retry safety limit.
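As a rough illustration of the handler-level check listed above, the following TypeScript sketch compares a delivery count against the shared cap and turns an over-limit delivery into a `run_failed` payload. The cap of 64 and the message format come from this PR; the string value of the error code, the `RunFailedEvent` shape, and the helper names `exceedsMaxDeliveries` and `buildMaxDeliveriesFailure` are assumptions made for this example, and `specVersion` is omitted for brevity.

```typescript
// Hypothetical stand-ins for names mentioned in the PR. The real definitions
// live in packages/core and packages/errors and may differ.
const MAX_QUEUE_DELIVERIES = 64;
const RUN_ERROR_CODES = {
  // Assumed string value; only the key name appears in the PR.
  MAX_DELIVERIES_EXCEEDED: 'MAX_DELIVERIES_EXCEEDED',
} as const;

// Assumed, simplified event shape (specVersion left out).
interface RunFailedEvent {
  eventType: 'run_failed';
  eventData: {
    error: { message: string };
    errorCode: string;
  };
}

/** True once the queue has delivered this message more often than the cap allows. */
function exceedsMaxDeliveries(attempt: number): boolean {
  return attempt > MAX_QUEUE_DELIVERIES;
}

/**
 * Returns the failure event a workflow handler could record, or null when
 * the delivery is still within the cap and normal processing should continue.
 */
function buildMaxDeliveriesFailure(attempt: number): RunFailedEvent | null {
  if (!exceedsMaxDeliveries(attempt)) return null;
  return {
    eventType: 'run_failed',
    eventData: {
      error: {
        message: `Workflow exceeded maximum queue deliveries (${attempt}/${MAX_QUEUE_DELIVERIES})`,
      },
      errorCode: RUN_ERROR_CODES.MAX_DELIVERIES_EXCEEDED,
    },
  };
}
```

Keeping the comparison in a small pure function makes the boundary easy to test: delivery 64 is still processed, delivery 65 is failed.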
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| packages/world-local/src/queue.ts | Adds structured identifiers to logs and replaces the old retry counter with a fixed safety-loop cap. |
| packages/errors/src/error-codes.ts | Introduces MAX_DELIVERIES_EXCEEDED run error code. |
| packages/core/src/runtime/step-handler.ts | Enforces max deliveries for steps and adjusts event creation/logging behavior. |
| packages/core/src/runtime/step-handler.test.ts | Adds test coverage for step max-deliveries behavior. |
| packages/core/src/runtime/constants.ts | Defines MAX_QUEUE_DELIVERIES. |
| packages/core/src/runtime.ts | Enforces max deliveries for workflow handler and records run_failed with a specific error code. |
| packages/builders/src/constants.ts | Removes VQS maxDeliveries from trigger constants. |
| .changeset/handler-max-deliveries.md | Changeset describing the behavior shift from VQS config to handler enforcement. |
    const startResult = await world.events.create(workflowRunId, {
      eventType: 'step_started',
      specVersion: SPEC_VERSION_CURRENT,
      correlationId: stepId,
    });

    try {
      const world = getWorld();
      await world.events.create(runId, {
        eventType: 'run_failed',
        specVersion: SPEC_VERSION_CURRENT,
        eventData: {
          error: {
            message: `Workflow exceeded maximum queue deliveries (${metadata.attempt}/${MAX_QUEUE_DELIVERIES})`,
          },
          errorCode: RUN_ERROR_CODES.MAX_DELIVERIES_EXCEEDED,
        },
      });

    // Safety limit to prevent infinite loops in the local queue.
    // The actual max delivery enforcement happens in the workflow/step handlers.
    const MAX_LOCAL_SAFETY_LIMIT = 1000;
    try {
      let defaultRetriesLeft = 3;
      for (let attempt = 0; defaultRetriesLeft > 0; attempt++) {
        defaultRetriesLeft--;

      for (let attempt = 0; attempt < MAX_LOCAL_SAFETY_LIMIT; attempt++) {
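To make the role of the safety limit concrete, here is a minimal TypeScript sketch of a bounded redelivery loop. It assumes a synchronous `handle` callback that returns `true` once the message is consumed. The real local queue is asynchronous, and what it does when the limit is reached is not shown in this PR excerpt, so the thrown error and the name `deliverWithSafetyLimit` are illustrative assumptions only.

```typescript
// Value taken from the PR snippet; everything else here is a simplified assumption.
const MAX_LOCAL_SAFETY_LIMIT = 1000;

/**
 * Redelivers a message until the handler reports it as consumed.
 * Returns the number of deliveries used.
 * Hypothetical helper: throwing at the limit is an assumed behavior.
 */
function deliverWithSafetyLimit(
  handle: (attempt: number) => boolean,
  limit: number = MAX_LOCAL_SAFETY_LIMIT
): number {
  for (let attempt = 0; attempt < limit; attempt++) {
    // The handler owns the real delivery cap; this loop only guards
    // against a handler that never consumes the message.
    if (handle(attempt)) return attempt + 1;
  }
  throw new Error(`Local queue safety limit reached after ${limit} deliveries`);
}
```

Because the handlers fail the run or step once their own cap is exceeded, a loop like this would normally stop long before the safety limit; the limit matters only if a handler misbehaves.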
    it('should post step_failed and re-queue workflow when delivery count exceeds max', async () => {
      const result = await capturedHandler(
        createMessage(),
        { ...createMetadata('myStep'), attempt: 65 }
      );

      expect(result).toBeUndefined();
      expect(mockEventsCreate).toHaveBeenCalledWith(
        'wrun_test123',
        expect.objectContaining({
          eventType: 'step_failed',
          correlationId: 'step_abc',
        })
      );
      expect(mockQueueMessage).toHaveBeenCalled();
      expect(mockRuntimeLogger.error).toHaveBeenCalledWith(
        expect.stringContaining('exceeded max deliveries'),
        expect.objectContaining({ workflowRunId: 'wrun_test123' })
      );
    });

    it('should consume message silently when step_failed fails with EntityConflictError', async () => {
      mockEventsCreate.mockRejectedValue(
        new EntityConflictError('Step already completed')
      );

      const result = await capturedHandler(
        createMessage(),
        { ...createMetadata('myStep'), attempt: 65 }
      );

      expect(result).toBeUndefined();
      expect(mockStepFn).not.toHaveBeenCalled();
    });

    it('should not trigger max deliveries check when under limit', async () => {
      const result = await capturedHandler(
        createMessage(),
        { ...createMetadata('myStep'), attempt: 64 }
      );
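The behavior these tests exercise can be sketched roughly as follows. This is not the PR's step handler: `StepDeps`, `handleStepDelivery`, and the local `EntityConflictError` class are hypothetical stand-ins, logging is left out, and skipping the workflow requeue after a conflict is an assumption, since the tests above only check that the step function is not called.

```typescript
const MAX_QUEUE_DELIVERIES = 64;

// Hypothetical stand-in for the real error class.
class EntityConflictError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'EntityConflictError';
  }
}

// Checked by name so the guard also works when classes are downleveled.
function isEntityConflict(err: unknown): boolean {
  return err instanceof Error && err.name === 'EntityConflictError';
}

// Assumed minimal dependencies; the real world/queue APIs are richer.
interface StepDeps {
  createEvent(
    runId: string,
    event: { eventType: 'step_failed'; correlationId: string }
  ): Promise<void>;
  requeueWorkflow(runId: string): Promise<void>;
  runStep(): Promise<void>;
}

async function handleStepDelivery(
  deps: StepDeps,
  runId: string,
  stepId: string,
  attempt: number
): Promise<void> {
  if (attempt > MAX_QUEUE_DELIVERIES) {
    try {
      await deps.createEvent(runId, { eventType: 'step_failed', correlationId: stepId });
    } catch (err) {
      // Assumption: a conflict means the step already finished,
      // so the message is consumed without further action.
      if (isEntityConflict(err)) return;
      throw err;
    }
    // Wake the workflow so it can observe the failed step.
    await deps.requeueWorkflow(runId);
    return;
  }
  await deps.runStep();
}
```

Returning without throwing in every over-limit path is what lets the queue treat the message as consumed instead of redelivering it again.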
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Summary
Replaces VQS `maxDeliveries: 64` cap with handler-level enforcement. Handlers now gracefully fail runs/steps after excessive queue redeliveries, preventing "phantom stuck" runs.
Stacked on #1342 → #1340
Problem
When infrastructure is down (OOMs, network outages), VQS retries messages up to `maxDeliveries: 64` times at 5s intervals. After exhausting retries, VQS drops the message, and the run stays in `running` status forever with no error and no failure event.
Solution
- Remove `maxDeliveries` from VQS config: allow infinite retries at queue level
- `retryAfterSeconds: 5`: VQS owns retry timing (works even after SIGKILL/OOM)
- `metadata.attempt`: when > `MAX_QUEUE_DELIVERIES` (64), fail gracefully with `MAX_DELIVERIES_EXCEEDED` error code
Queue error log examples (before → after)
Before (dumped full body, no run context):
After (structured, includes run/step IDs, separates HTTP status from handler error):
Local world queue
Test plan
`failed` with `MAX_DELIVERIES_EXCEEDED`
🤖 Generated with Claude Code