Skip to content

fix: sort partitions after filtering for clustering planning#18092

Merged
danny0405 merged 1 commit intoapache:masterfrom
prashantwason:pw_oss_commit_porting_47
Mar 11, 2026
Merged

fix: sort partitions after filtering for clustering planning#18092
danny0405 merged 1 commit intoapache:masterfrom
prashantwason:pw_oss_commit_porting_47

Conversation

@prashantwason
Copy link
Copy Markdown
Member

Describe the issue this Pull Request addresses

This PR addresses HUDI-6912. The clustering plan partition filter does not guarantee a consistent sort order for filtered partitions, which can lead to non-deterministic behavior in clustering operations.

Summary and Changelog

Sort partitions after filtering in PartitionAwareClusteringPlanStrategy.filterPartitionPaths() to ensure consistent ordering.

Changes:

  • Added result.getLeft().sort(String::compareTo) after calling ClusteringPlanPartitionFilter.filter() to ensure partitions are always sorted in ascending order
  • Updated log message to indicate sorting was applied
  • Updated unit tests to verify sorted output

Impact

No public API changes. Ensures deterministic partition ordering in clustering plan generation.

Risk Level

low - This is a minor behavioral change that ensures consistent ordering. The sorting is applied after existing filtering logic and does not change what partitions are selected, only their order.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label Feb 4, 2026
@prashantwason
Copy link
Copy Markdown
Member Author

@hudi-bot run azure

*/
public Pair<List<String>, List<String>> filterPartitionPaths(HoodieWriteConfig writeConfig, List<String> partitions) {
return ClusteringPlanPartitionFilter.filter(partitions, getWriteConfig());
Pair<List<String>, List<String>> result = ClusteringPlanPartitionFilter.filter(partitions, getWriteConfig());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which can lead to non-deterministic behavior in clustering operations.

is it because the following up filtering like the file group numbers limits in #generateClusteringPlan, we you think we should move the sorting into where these limits takes place(like in ##generateClusteringPlan)? do we also got issues for compaction.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review!

Why sort here vs generateClusteringPlan():
The sorting is intentionally placed in filterPartitionPaths() rather than near the limit in generateClusteringPlan() because:

  • filterPartitionPaths() is the source of partition paths for clustering - sorting here ensures consistency regardless of how partitions are used downstream
  • The clusteringMaxNumGroups limit in generateClusteringPlan() operates on clustering groups (not partitions directly), so sorting partitions there would be too late
  • Sorting at the filter level is more defensive and ensures deterministic behavior in all code paths that consume the filtered partitions

Compaction:
Looking at the compaction code, the partition-aware strategies already handle sorting:

  • BoundedPartitionAwareCompactionStrategy.filterPartitionPaths() sorts with Comparator.reverseOrder() (line 73-74)
  • UnBoundedPartitionAwareCompactionStrategy.filterPartitionPaths() also sorts in reverse order

So compaction doesn't have this issue - the bounded strategies already include sorting as part of their logic.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the answer, LGTM.

Copy link
Copy Markdown
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
but since danny has a question, I will let him stamp and land.

@prashantwason
Copy link
Copy Markdown
Member Author

@danny0405 Thanks for the review!

Regarding your questions:

  1. Why sort in filterPartitionPaths() vs generateClusteringPlan():
    The sorting is intentionally placed in filterPartitionPaths() rather than near the limit in generateClusteringPlan() because:

    • filterPartitionPaths() is the source of partition paths for clustering - sorting here ensures consistency regardless of how partitions are used downstream
    • The clusteringMaxNumGroups limit in generateClusteringPlan() operates on clustering groups (not partitions directly), so sorting partitions there would be too late
    • Sorting at the filter level is more defensive and ensures deterministic behavior in all code paths that consume the filtered partitions
  2. Similar issues in compaction:
    Looking at the compaction code, the partition-aware strategies already handle sorting:

    • BoundedPartitionAwareCompactionStrategy.filterPartitionPaths() sorts with Comparator.reverseOrder() (line 73-74)
    • UnBoundedPartitionAwareCompactionStrategy.filterPartitionPaths() also sorts in reverse order

    So compaction doesn't have this issue - the bounded strategies already include sorting as part of their logic.

Let me know if you have any other concerns!

@prashantwason
Copy link
Copy Markdown
Member Author

@hudi-bot run azure

1 similar comment
@prashantwason
Copy link
Copy Markdown
Member Author

@hudi-bot run azure

@danny0405
Copy link
Copy Markdown
Contributor

The test failure seems not related:

[ERROR] Failures: 
[ERROR] org.apache.hudi.utilities.sources.helpers.TestProtoConversionUtil.allFieldsSet_wellKnownTypesAndTimestampsAsRecords

@danny0405 danny0405 changed the title fix: sort partitions after filtering fix: sort partitions after filtering for clustering planning Feb 24, 2026
@prashantwason
Copy link
Copy Markdown
Member Author

@hudi-bot run azure

@prashantwason prashantwason force-pushed the pw_oss_commit_porting_47 branch from 4b5a080 to 613b681 Compare March 10, 2026 22:47
@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.28%. Comparing base (da244e1) to head (613b681).

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18092      +/-   ##
============================================
- Coverage     57.29%   57.28%   -0.01%     
+ Complexity    18632    18629       -3     
============================================
  Files          1955     1955              
  Lines        106995   106998       +3     
  Branches      13241    13241              
============================================
- Hits          61302    61298       -4     
- Misses        39901    39907       +6     
- Partials       5792     5793       +1     
Flag Coverage Δ
hadoop-mr-java-client 45.21% <100.00%> (+0.06%) ⬆️
spark-java-tests 47.46% <100.00%> (-0.02%) ⬇️
spark-scala-tests 45.57% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...strategy/PartitionAwareClusteringPlanStrategy.java 92.85% <100.00%> (+0.17%) ⬆️

... and 8 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@danny0405 danny0405 merged commit b01ae22 into apache:master Mar 11, 2026
72 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:S PR with lines of changes in (10, 100]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants