fix: sort partitions after filtering for clustering planning#18092
fix: sort partitions after filtering for clustering planning#18092danny0405 merged 1 commit intoapache:masterfrom
Conversation
|
@hudi-bot run azure |
| */ | ||
| public Pair<List<String>, List<String>> filterPartitionPaths(HoodieWriteConfig writeConfig, List<String> partitions) { | ||
| return ClusteringPlanPartitionFilter.filter(partitions, getWriteConfig()); | ||
| Pair<List<String>, List<String>> result = ClusteringPlanPartitionFilter.filter(partitions, getWriteConfig()); |
There was a problem hiding this comment.
which can lead to non-deterministic behavior in clustering operations.
is it because the following up filtering like the file group numbers limits in #generateClusteringPlan, we you think we should move the sorting into where these limits takes place(like in ##generateClusteringPlan)? do we also got issues for compaction.
There was a problem hiding this comment.
Thanks for the review!
Why sort here vs generateClusteringPlan():
The sorting is intentionally placed in filterPartitionPaths() rather than near the limit in generateClusteringPlan() because:
filterPartitionPaths()is the source of partition paths for clustering - sorting here ensures consistency regardless of how partitions are used downstream- The
clusteringMaxNumGroupslimit ingenerateClusteringPlan()operates on clustering groups (not partitions directly), so sorting partitions there would be too late - Sorting at the filter level is more defensive and ensures deterministic behavior in all code paths that consume the filtered partitions
Compaction:
Looking at the compaction code, the partition-aware strategies already handle sorting:
BoundedPartitionAwareCompactionStrategy.filterPartitionPaths()sorts withComparator.reverseOrder()(line 73-74)UnBoundedPartitionAwareCompactionStrategy.filterPartitionPaths()also sorts in reverse order
So compaction doesn't have this issue - the bounded strategies already include sorting as part of their logic.
There was a problem hiding this comment.
Thanks for the answer, LGTM.
nsivabalan
left a comment
There was a problem hiding this comment.
LGTM
but since danny has a question, I will let him stamp and land.
|
@danny0405 Thanks for the review! Regarding your questions:
Let me know if you have any other concerns! |
|
@hudi-bot run azure |
1 similar comment
|
@hudi-bot run azure |
|
The test failure seems not related: [ERROR] Failures:
[ERROR] org.apache.hudi.utilities.sources.helpers.TestProtoConversionUtil.allFieldsSet_wellKnownTypesAndTimestampsAsRecords |
|
@hudi-bot run azure |
4b5a080 to
613b681
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #18092 +/- ##
============================================
- Coverage 57.29% 57.28% -0.01%
+ Complexity 18632 18629 -3
============================================
Files 1955 1955
Lines 106995 106998 +3
Branches 13241 13241
============================================
- Hits 61302 61298 -4
- Misses 39901 39907 +6
- Partials 5792 5793 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Describe the issue this Pull Request addresses
This PR addresses HUDI-6912. The clustering plan partition filter does not guarantee a consistent sort order for filtered partitions, which can lead to non-deterministic behavior in clustering operations.
Summary and Changelog
Sort partitions after filtering in
PartitionAwareClusteringPlanStrategy.filterPartitionPaths()to ensure consistent ordering.Changes:
result.getLeft().sort(String::compareTo)after callingClusteringPlanPartitionFilter.filter()to ensure partitions are always sorted in ascending orderImpact
No public API changes. Ensures deterministic partition ordering in clustering plan generation.
Risk Level
low - This is a minor behavioral change that ensures consistent ordering. The sorting is applied after existing filtering logic and does not change what partitions are selected, only their order.
Documentation Update
none
Contributor's checklist