.NET: Fix RequestInfoEvent lost when resuming workflow from checkpoint#4955
Open
Pull request overview
Fixes a checkpoint-resume regression where pending RequestInfoEvents were republished before any event-stream subscriber was attached, causing resumed runs to silently drop those events and remain stuck (e.g., WatchStreamAsync yields nothing and status stays NotStarted).
Changes:
- Split checkpoint restore into “state restore” vs. “deferred republish” for initial resumes; republish now occurs after event stream subscription.
- Added a republish hook (ISuperStepRunner.RepublishPendingEventsAsync) and invoked it from both off-thread and lockstep event streams at subscription time.
- Added/updated state persistence for subworkflow/forwarding scenarios (WorkflowHostExecutor pending response port mapping, RequestInfoExecutor wrapped requests) and introduced new regression tests.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| dotnet/tests/Microsoft.Agents.AI.Workflows.UnitTests/CheckpointResumeTests.cs | Adds regression coverage for resume/restore with pending requests across OffThread/Lockstep and subworkflow forwarding. |
| dotnet/src/Microsoft.Agents.AI.Workflows/WorkflowSession.cs | Uses an internal resume path to suppress republishing to avoid duplicate consumer-visible events in session scenarios. |
| dotnet/src/Microsoft.Agents.AI.Workflows/Specialized/WorkflowHostExecutor.cs | Persists request-id → original port mapping so qualified/unqualified response routing survives checkpoint restore. |
| dotnet/src/Microsoft.Agents.AI.Workflows/Specialized/RequestInfoExecutor.cs | Persists wrapped-request mapping across checkpoints so forwarded external requests can be correctly rewired after restore. |
| dotnet/src/Microsoft.Agents.AI.Workflows/InProc/InProcessRunner.cs | Defers pending-request republish on initial resume via a gated flag; keeps runtime restore republish behavior. |
| dotnet/src/Microsoft.Agents.AI.Workflows/InProc/InProcessExecutionEnvironment.cs | Adds an internal resume API that can suppress pending-event republish. |
| dotnet/src/Microsoft.Agents.AI.Workflows/Execution/StreamingRunEventStream.cs | Calls RepublishPendingEventsAsync immediately after subscribing to runner events. |
| dotnet/src/Microsoft.Agents.AI.Workflows/Execution/LockstepRunEventStream.cs | Buffers events across subscription timing and adds early-drain behavior for “pending requests only” resumes. |
| dotnet/src/Microsoft.Agents.AI.Workflows/Execution/ISuperStepRunner.cs | Adds the RepublishPendingEventsAsync contract for event streams. |
| dotnet/src/Microsoft.Agents.AI.Workflows/Execution/AsyncRunHandle.cs | Signals the run loop on resume when there are unserviced requests; clears buffered events for lockstep restores. |
| dotnet/src/Microsoft.Agents.AI.Workflows/Checkpointing/ICheckpointingHandle.cs | Clarifies runtime-restore vs. initial-resume expectations around replaying pending requests. |
…before Started event is emitted.
Motivation and Context
When resuming a workflow from a JSON checkpoint, RestoreCheckpointAsync restored state and republished pending RequestInfoEvents immediately, before any event stream subscriber was attached. The events were sent to the EventSink, but no observer was listening yet, so they were silently dropped. This caused WatchStreamAsync to never yield the pending requests, and GetStatusAsync to remain NotStarted.
Fixes #2485
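The failure mode described in this section comes down to publishing before anyone has subscribed. The following is a minimal, self-contained sketch of that ordering problem, not the library's code: `UnbufferedSink`, `DroppedEventDemo`, and the string payload are hypothetical stand-ins, and the real EventSink may behave differently in detail.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-in for an event sink that does not buffer:
// events raised while nobody is subscribed are simply lost.
public sealed class UnbufferedSink
{
    public event Action<string>? EventRaised;

    // With no subscribers the delegate is null and the call is a no-op.
    public void Publish(string evt) => EventRaised?.Invoke(evt);
}

public static class DroppedEventDemo
{
    // Returns what a consumer observes when the restore step publishes
    // the pending request before the consumer has subscribed.
    public static List<string> RestoreThenSubscribe()
    {
        var sink = new UnbufferedSink();
        var observed = new List<string>();

        // "Restore": the pending request is republished immediately.
        sink.Publish("RequestInfoEvent(request-1)");

        // The consumer attaches only afterwards, so it missed the event.
        sink.EventRaised += observed.Add;

        return observed;
    }
}
```

In this sketch `RestoreThenSubscribe()` returns an empty list, which mirrors the symptom of a stream that never yields the pending request.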
Description
This fix splits checkpoint restoration into a state-only restore step and a deferred republish step, so pending events are now republished only after the event stream subscribes, ensuring they are always delivered to the consumer. The fix covers both OffThread and Lockstep execution modes, runtime mid-flight restores, and subworkflow scenarios with qualified request ports.
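The two-step shape described here can be sketched as follows. This is an illustration under stated assumptions rather than the PR's implementation: `SketchRunner`, `SketchEventStream`, and the synchronous `RepublishPendingEvents` are hypothetical names (the actual hook is the asynchronous ISuperStepRunner.RepublishPendingEventsAsync), and the gating flag is modeled as a plain boolean.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical runner: restore only records pending requests; the event
// stream asks for them to be republished once it has subscribed.
public sealed class SketchRunner
{
    private readonly List<string> _pendingRequests = new();
    private bool _republishDeferred;

    public event Action<string>? EventRaised;

    // Step 1: state-only restore. Nothing is published here.
    public void RestoreState(IEnumerable<string> pendingRequests)
    {
        _pendingRequests.Clear();
        _pendingRequests.AddRange(pendingRequests);
        _republishDeferred = true;
    }

    // Step 2: deferred republish, gated so it runs at most once per restore.
    public void RepublishPendingEvents()
    {
        if (!_republishDeferred) return;
        _republishDeferred = false;
        foreach (var request in _pendingRequests)
            EventRaised?.Invoke(request);
    }
}

// Hypothetical event stream: subscribes first, then requests the replay.
public sealed class SketchEventStream
{
    public List<string> Observed { get; } = new();

    public SketchEventStream(SketchRunner runner)
    {
        runner.EventRaised += Observed.Add;
        runner.RepublishPendingEvents();
    }
}
```

Because the replay is triggered by the subscriber itself, the ordering problem cannot recur in this sketch, and the flag keeps a second call from delivering duplicates.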
Contribution Checklist