fix: Add deterministic error clearing for FAIL_AFTER_EXECUTABLE_NODES_COMPLETE policy#6896
Closed
andreahlert wants to merge 1 commit intoflyteorg:masterfrom
Closed
Conversation
|
Thank you for opening this pull request! 🙌 These tips will help get your PR across the finish line:
|
…_COMPLETE This commit addresses the determinism concern raised in PR flyteorg#4624 by ensuring that error clearing behavior is consistent across reconciliation rounds. The issue: When multiple nodes fail with FAIL_AFTER_EXECUTABLE_NODES_COMPLETE policy, only one error should be preserved to reduce etcd state size. However, the previous implementation could clear ALL errors if iteration order varied between reconciliation rounds. The fix: Before clearing an error from a node status, check if the error is still present (not nil). Once an error is cleared, that node won't be considered as a "new" failure in subsequent rounds, ensuring determinism regardless of map iteration order. Changes: - Add ClearExecutionError() method to MutableNodeStatus interface - Implement ClearExecutionError() in NodeStatus struct - Add nil check in handleDownstream before error clearing - Add comprehensive unit tests for the new behavior Fixes part of flyteorg#4569 Related to flyteorg#4624 Signed-off-by: André Ahlert <andre@aex.partners>
ebe452f to
17248a6
Compare
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR addresses the determinism concern raised by @hamersaw in #4624 regarding error clearing behavior when multiple nodes fail with
FAIL_AFTER_EXECUTABLE_NODES_COMPLETEpolicy.The Problem
When multiple nodes fail, we want to preserve only one error to reduce etcd state size. The original implementation in #4624 would clear errors during iteration, but didn't account for the fact that Go map iteration order is non-deterministic. This could cause ALL errors to be cleared if iteration order varied between reconciliation rounds:
The Solution
Add a nil check before considering a node as having a "new" failure:
Once an error is cleared, that node won't be re-processed in subsequent rounds, ensuring determinism regardless of iteration order.
Changes
ClearExecutionError()method toMutableNodeStatusinterfaceClearExecutionError()inNodeStatusstructhandleDownstreambefore error clearingTest Plan
ClearExecutionError()methodRelated Issues