fix(3383): loss of pod causes stdio transport to stop#3384
fix(3383): loss of pod causes stdio transport to stop#3384
Conversation
Signed-off-by: Jeremy Drouillard <jeremy@stacklok.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3384 +/- ##
==========================================
+ Coverage 64.79% 64.98% +0.19%
==========================================
Files 375 375
Lines 36626 36630 +4
==========================================
+ Hits 23730 23804 +74
+ Misses 11024 10947 -77
- Partials 1872 1879 +7 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR fixes a critical bug where the stdio transport enters a "zombie" state when Docker becomes unavailable during re-attachment. The transport would appear running (IsRunning() returns true) but could not communicate with the container, preventing automatic restart. The fix ensures proper shutdown by calling Stop() when re-attachment fails.
Changes:
- Added
Stop()call inprocessStdoutafter failed re-attachment to properly shut down the zombie transport - Refactored monitoring and cleanup logic in
Runner.Run()into testable helper functions (monitorTransportandhandleTransportStopped) - Added comprehensive test coverage for the zombie state bug and refactored monitoring/cleanup logic
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/transport/stdio.go | Added Stop() call after failed re-attachment to prevent zombie state |
| pkg/transport/stdio_test.go | Added TestTransport_ZombieStateWhenDockerUnavailable and updated existing tests with new mock expectations for Stop() and StopWorkload() calls |
| pkg/runner/runner.go | Extracted monitoring loop into monitorTransport and cleanup logic into handleTransportStopped for better testability |
| pkg/runner/runner_test.go | Added tests for monitorTransport and handleTransportStopped to verify monitoring and restart decision logic |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
|
|
||
| // transportStoppedDeps holds the dependencies for handleTransportStopped. | ||
| // transportStoppedDeps holds the dependencies for handleTransportStopped. |
There was a problem hiding this comment.
Duplicate comment found. The comment "transportStoppedDeps holds the dependencies for handleTransportStopped." appears on both lines 700 and 701. Remove the duplicate on line 701.
| // transportStoppedDeps holds the dependencies for handleTransportStopped. |
|
Closing because I was not able to reproduce the underlying issue that motivates this change |
Summary
Fixes #3383
When the Docker daemon becomes unavailable while a stdio transport proxy is running, the transport would enter a "zombie" state where
IsRunning()returnstruebut it cannot communicate with the container. This prevented the automatic restart mechanism from triggering. The fix ensures the transport properly shuts down when re-attachment fails.Changes
The fix here is actually quite small, but I found the existing recovery code hard to understand. Consequently, I made slight refactors to backfill unit tests to document the actual relationship between the transport and the runner's monitoring behavior.
Fix: Call
Stop()after failed re-attachment (pkg/transport/stdio.go)Added a call to
t.Stop(ctx)inprocessStdoutafter re-attachment fails. Previously, the function would log "Container stdout closed - exiting read loop" and return without shutting down the transport, leaving it in a zombie state where:processMessagescontinued runningIsRunning()returnedtrueRefactor: Extract monitoring loop into
monitorTransport(pkg/runner/runner.go)Extracted the transport monitoring goroutine from
Run()into a standalonemonitorTransportfunction with atransportCheckerfunction type alias. This improves testability and reduces code in the mainRun()method.Refactor: Extract transport stopped handling into
handleTransportStopped(pkg/runner/runner.go)Extracted the
case <-doneChhandling logic into a testablehandleTransportStoppedfunction with atransportStoppedDepsstruct for dependency injection. This allows unit testing the restart decision logic without complex integration setup.Test updates (
pkg/transport/stdio_test.go)TestTransport_ZombieStateWhenDockerUnavailableto verify the fixTestProcessStdout_EOFWithFailedReattachment,TestConcurrentReattachment, andTestStdinRaceConditionwith additional mock expectations forStop()andStopWorkload()callsStop()is calledTesting
TestTransport_ZombieStateWhenDockerUnavailable- verifies transport shuts down properly when Docker is unavailable during re-attachmentTest_monitorTransport- verifies monitoring loop detects stopped transport, continues polling on errors, and respects context cancellationTestHandleTransportStopped- verifies restart decision logic (restart when workload exists, no restart when removed, restart on check failure)