Skip to content

fix #539: surface SVT-AV1 EOS-flush watchdog trip as node failure#615

Merged
streamer45 merged 5 commits into
mainfrom
devin/1782066896-svt-av1-watchdog-failure
Jun 25, 2026
Merged

fix #539: surface SVT-AV1 EOS-flush watchdog trip as node failure#615
streamer45 merged 5 commits into
mainfrom
devin/1782066896-svt-av1-watchdog-failure

Conversation

@staging-devin-ai-integration

@staging-devin-ai-integration staging-devin-ai-integration Bot commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Summary

Pseudocode of the new contract:

let outcome = codec_forward_loop(/* ... */).await;       // Completed | WatchdogAbandoned | CodecPanicked
finalize_codec_run(outcome, &ctx.state_tx, &node, label) // -> Result<(), StreamKitError>
//  Completed                       => emit Stopped("input_closed"); Ok(())
//  WatchdogAbandoned | CodecPanicked => emit Failed { reason }; Err(Codec(reason))

Review & Validation

  • Confirm the watchdog branch is the only path that yields WatchdogAbandoned, and the panic branch the only one yielding CodecPanicked — a normal codec/input/downstream close or shutdown must stay Completed (i.e. don't regress clean EOS into a spurious failure).
  • Confirm the fix(nodes): bound SVT-AV1 flush + codec loop against deadlock #602/SVT-AV1 EOS-flush watchdog trip should surface as an explicit node failure #539 leak tradeoff is intact: no detached encoder still in get_packet is dropped.
  • cargo test -p streamkit-nodes (870 passed) and default-feature lint (fmt + clippy) pass locally. Tests added: codec_utils outcome assertions on the existing watchdog regression tests, finalize_codec_run mapping (Completed→Stopped/Ok, Abandoned→Failed/Err, Panicked→Failed/Err), a loop-level forward_loop_reports_panic_as_degraded, and an end-to-end run_encoder_surfaces_watchdog_trip_as_failure.

Notes

  • A codec-task panic now also surfaces as a degraded run (CodecPanickedFailed/Err), closing the same false-success gap as the watchdog trip rather than leaving panics reported as a clean stop.
  • The failure messages say "output is truncated" (not "encoded output"), since decoders and the pixel-format converter route through the same helper.
  • Known limitation — oneshot HTTP (follow-up): in oneshot mode the response is committed as 200 and the node task handles are never awaited for the response (apps/skit/src/server/oneshot.rs returns the body stream as soon as the graph is spawned). A node returning Err only ends the output channel, so a truncated oneshot encode still returns HTTP 200 with a short body — the new Failed state is observable only via state/stats subscribers. SVT-AV1 EOS-flush watchdog trip should surface as an explicit node failure #539's "callers can tell a degraded encode from a complete one" is therefore fully met for the dynamic/subscriber path but not for oneshot HTTP clients; surfacing it there (await/collect node results before finalizing the response) is left as a separate change.
  • The HW-codec call sites (vaapi, nvcodec, vulkan_video, dav1d) received the same mechanical change but could not be compiled locally (missing libva/CUDA/Vulkan/dav1d system libs) — relying on the feature-gated CI jobs to validate them.

Closes #539

Link to Devin session: https://staging.itsdev.in/sessions/7a7aa10058014ad5a38ef182be036fb0
Requested by: @streamer45


Devin Review

Status Commit
🟢 Reviewed 022de6a
Open in Devin Review (Staging)

When the bounded EOS-flush idle watchdog abandons a wedged codec task,
`codec_forward_loop` now reports a `WatchdogAbandoned` outcome instead of
silently finalizing. Encoder/decoder nodes map that outcome via the shared
`finalize_codec_run` helper to a terminal `NodeState::Failed` and an `Err`
return, so a truncated/degraded encode is programmatically distinguishable
from a clean `Stopped("input_closed")` + `Ok(())`.

The deliberate handle-leak tradeoff from #602/#539 is preserved: the
detached worker may still be in a blocking FFI call, so nothing it owns is
dropped.

Refs #539

Signed-off-by: streamkit-devin <devin@streamkit.dev>
@staging-devin-ai-integration

Copy link
Copy Markdown
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment, CI, and merge conflict monitoring

staging-devin-ai-integration[bot]

This comment was marked as resolved.

@codecov

codecov Bot commented Jun 21, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 99.27536% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 84.99%. Comparing base (cadb70e) to head (022de6a).

Files with missing lines Patch % Lines
crates/nodes/src/video/encoder_trait.rs 97.05% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##             main     #615    +/-   ##
========================================
  Coverage   84.99%   84.99%            
========================================
  Files         248      247     -1     
  Lines       74624    74733   +109     
  Branches     2381     2378     -3     
========================================
+ Hits        63423    63523   +100     
- Misses      11196    11205     +9     
  Partials        5        5            
Flag Coverage Δ
backend 85.00% <99.27%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
core 85.64% <ø> (ø)
engine 83.68% <ø> (ø)
api 91.14% <ø> (ø)
nodes 84.61% <99.27%> (+0.03%) ⬆️
server 84.78% <ø> (ø)
plugin-native 84.79% <ø> (ø)
plugin-wasm 95.41% <ø> (ø)
ui-services 86.29% <ø> (+0.02%) ⬆️
ui-components 71.96% <ø> (ø)
Files with missing lines Coverage Δ
crates/nodes/src/audio/codecs/opus.rs 92.82% <ø> (ø)
crates/nodes/src/codec_utils.rs 97.58% <100.00%> (+1.96%) ⬆️
crates/nodes/src/video/av1.rs 83.17% <ø> (ø)
crates/nodes/src/video/pixel_convert.rs 93.31% <ø> (ø)
crates/nodes/src/video/vp9.rs 86.76% <ø> (ø)
crates/nodes/src/video/encoder_trait.rs 86.59% <97.05%> (+1.15%) ⬆️

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@staging-devin-ai-integration staging-devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 new potential issues.

Open in Devin Review (Staging)
Debug

Playground

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Info: Mechanical transformation verified complete across all codec nodes

Confirmed via grep that every file calling codec_forward_loop (opus encoder/decoder, av1 encoder/decoder, dav1d decoder, vp9 encoder/decoder, nv_av1 decoder, vaapi_av1 decoder, vaapi_h264 decoder, vulkan_video h264 encoder/decoder, pixel_convert, and all EncoderNodeRunner implementors via run_encoder) now captures the CodecLoopOutcome and passes it to finalize_codec_run. No call site was missed. SVT-AV1 is covered through the shared run_encoder path in encoder_trait.rs:253.

Open in Devin Review (Staging)

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

Comment on lines +36 to +47
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[must_use]
pub enum CodecLoopOutcome {
/// The loop finalized cleanly: the codec finished, input or the downstream
/// channel closed, or a shutdown was requested.
Completed,
/// The idle watchdog abandoned a wedged codec worker mid-stream or
/// mid-flush. Any output already forwarded is truncated, and the worker's
/// native handle was intentionally leaked (it may still be in a blocking
/// FFI call), so the run is degraded, not successful.
WatchdogAbandoned,
}

@staging-devin-ai-integration staging-devin-ai-integration Bot Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Info: CodecLoopOutcome is pub, expanding the crate's public API surface

CodecLoopOutcome at codec_utils.rs:38 and finalize_codec_run at codec_utils.rs:62 are both pub (not pub(crate)). Since codec_forward_loop is also pub, this is consistent. However, CODEC_IDLE_TIMEOUT remains pub(crate). If these are only intended for internal use within the nodes crate, pub(crate) would be more restrictive. Currently the nodes crate doesn't appear to be consumed as a library by external code, so this is a minor API hygiene point rather than a real concern.

Open in Devin Review (Staging)

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

A panicking codec task left the forward loop's outcome at its default
Completed, so finalize_codec_run emitted Stopped("input_closed") + Ok(())
for a panic-truncated stream \u2014 the same false success #539 set out to
eliminate. Add a CodecPanicked outcome that maps to Failed + Err.

Also generalize the abandonment message from "encoded output is
truncated" to "output is truncated", since finalize_codec_run is shared
by decoders and the pixel-convert node, not just encoders.

Signed-off-by: streamkit-devin <devin@streamkit.dev>
Signed-off-by: streamkit-devin <devin@streamkit.dev>
staging-devin-ai-integration[bot]

This comment was marked as resolved.

After main merged the NodeStateSender refactor (#606), NodeContext.state_tx
is a NodeStateSender, not a raw mpsc::Sender<NodeStateUpdate>. Correct the
finalize_codec_run parameter type so it matches every call site and the
state_helpers::emit_* calls, and wrap the raw channels in the unit tests
with NodeStateSender::new.

Signed-off-by: streamkit-devin <devin@streamkit.dev>

@staging-devin-ai-integration staging-devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 3 new potential issues.

Open in Devin Review (Staging)
Debug

Playground

Comment on lines +37 to +38
#[must_use]
pub enum CodecLoopOutcome {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Info: The #[must_use] on CodecLoopOutcome prevents silent ignoring of degraded outcomes

The #[must_use] attribute on CodecLoopOutcome (codec_utils.rs:37) is a well-chosen compile-time safety net: if any future caller of codec_forward_loop forgets to pass the outcome to finalize_codec_run, the compiler will emit a warning. This is the kind of annotation that prevents the exact regression this PR fixes (silently treating a degraded run as successful).

Open in Devin Review (Staging)

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

Comment on lines +87 to +89
tracing::error!("{reason}");
state_helpers::emit_failed(state_tx, node_id, reason.clone());
Err(StreamKitError::Codec(reason))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Info: Minor: reason string is cloned where it could be avoided

In finalize_codec_run, reason.clone() at line 88 creates a copy of the error string so it can be passed to both emit_failed and StreamKitError::Codec. Since emit_failed takes impl Into<String>, you could pass &reason (via a &strString conversion inside emit_failed) or reorder to pass the owned reason to Err(StreamKitError::Codec(reason)) first and a reference to emit_failed. This is a negligible allocation on an error path, so not worth changing — just noting for completeness.

Open in Devin Review (Staging)

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

Comment on lines +67 to +90
pub fn finalize_codec_run(
outcome: CodecLoopOutcome,
state_tx: &NodeStateSender,
node_id: &str,
label: &str,
) -> Result<(), StreamKitError> {
let reason = match outcome {
CodecLoopOutcome::Completed => {
state_helpers::emit_stopped(state_tx, node_id, "input_closed");
tracing::info!("{label} finished");
return Ok(());
},
CodecLoopOutcome::WatchdogAbandoned => format!(
"{label} abandoned a wedged codec worker after {CODEC_IDLE_TIMEOUT:?} of output \
idleness; output is truncated"
),
CodecLoopOutcome::CodecPanicked => {
format!("{label} codec task panicked mid-stream; output is truncated")
},
};
tracing::error!("{reason}");
state_helpers::emit_failed(state_tx, node_id, reason.clone());
Err(StreamKitError::Codec(reason))
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Behavioral change: codec nodes now return Err on degraded runs where they previously returned Ok

Before this PR, all codec nodes unconditionally called emit_stopped(... "input_closed") and returned Ok(()) regardless of whether the codec was abandoned by the watchdog or panicked. After this PR, watchdog-abandoned and panicked codec runs return Err(StreamKitError::Codec(...)) and emit a Failed state. This is an intentional behavioral change (per #539), but callers that previously relied on codec run() always returning Ok (e.g., error handling in the engine/pipeline executor) should be verified to handle the new Err gracefully. The engine likely already handles node run() errors, but worth confirming no upstream code treats a codec Err differently from other node errors in a way that could cause issues (e.g., retry loops, error propagation to clients).

Open in Devin Review (Staging)

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

@streamer45 streamer45 merged commit 7c399b9 into main Jun 25, 2026
30 checks passed
@streamer45 streamer45 deleted the devin/1782066896-svt-av1-watchdog-failure branch June 25, 2026 11:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SVT-AV1 EOS-flush watchdog trip should surface as an explicit node failure

2 participants