Skip to content

fix: Add deterministic error clearing for FAIL_AFTER_EXECUTABLE_NODES_COMPLETE policy#6896

Closed
andreahlert wants to merge 1 commit intoflyteorg:masterfrom
andreahlert:fix/etcd-state-error-clearing-determinism
Closed

fix: Add deterministic error clearing for FAIL_AFTER_EXECUTABLE_NODES_COMPLETE policy#6896
andreahlert wants to merge 1 commit intoflyteorg:masterfrom
andreahlert:fix/etcd-state-error-clearing-determinism

Conversation

@andreahlert
Copy link
Copy Markdown

Summary

This PR addresses the determinism concern raised by @hamersaw in #4624 regarding error clearing behavior when multiple nodes fail with FAIL_AFTER_EXECUTABLE_NODES_COMPLETE policy.

The Problem

When multiple nodes fail, we want to preserve only one error to reduce etcd state size. The original implementation in #4624 would clear errors during iteration, but didn't account for the fact that Go map iteration order is non-deterministic. This could cause ALL errors to be cleared if iteration order varied between reconciliation rounds:

  1. Round 1: Iterate [node-a, node-b] - clear node-a's error, keep node-b's
  2. Round 2: Iterate [node-b, node-a] - node-b has error, clear it, then node-a has no error anymore

The Solution

Add a nil check before considering a node as having a "new" failure:

if nodeStatus.GetExecutionError() != nil {
    if nodeStatusWithError != nil && !c.enableCRDebugMetadata {
        nodeStatusWithError.ClearExecutionError()
    }
    nodeStatusWithError = nodeStatus
}

Once an error is cleared, that node won't be re-processed in subsequent rounds, ensuring determinism regardless of iteration order.

Changes

  • Add ClearExecutionError() method to MutableNodeStatus interface
  • Implement ClearExecutionError() in NodeStatus struct
  • Add nil check in handleDownstream before error clearing
  • Add comprehensive unit tests validating the deterministic behavior

Test Plan

  • Unit tests for ClearExecutionError() method
  • Test that only the last error is preserved when multiple failures occur
  • Test that clearing already-nil errors is a no-op
  • All existing node controller tests pass

Related Issues

@welcome
Copy link
Copy Markdown

welcome Bot commented Feb 5, 2026

Thank you for opening this pull request! 🙌

These tips will help get your PR across the finish line:

  • Most of the repos have a PR template; if not, fill it out to the best of your knowledge.
  • Sign off your commits (Reference: DCO Guide).

…_COMPLETE

This commit addresses the determinism concern raised in PR flyteorg#4624 by ensuring
that error clearing behavior is consistent across reconciliation rounds.

The issue: When multiple nodes fail with FAIL_AFTER_EXECUTABLE_NODES_COMPLETE
policy, only one error should be preserved to reduce etcd state size. However,
the previous implementation could clear ALL errors if iteration order varied
between reconciliation rounds.

The fix: Before clearing an error from a node status, check if the error is
still present (not nil). Once an error is cleared, that node won't be
considered as a "new" failure in subsequent rounds, ensuring determinism
regardless of map iteration order.

Changes:
- Add ClearExecutionError() method to MutableNodeStatus interface
- Implement ClearExecutionError() in NodeStatus struct
- Add nil check in handleDownstream before error clearing
- Add comprehensive unit tests for the new behavior

Fixes part of flyteorg#4569
Related to flyteorg#4624

Signed-off-by: André Ahlert <andre@aex.partners>
@andreahlert
Copy link
Copy Markdown
Author

Hi @hamersaw, friendly ping! This addresses the determinism concern you raised in #4624 regarding error clearing with non-deterministic map iteration. CI is green - would love your review when you get a chance. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant