[SPARK-56306][SQL] Fix collation-aware PIVOT#55109
[SPARK-56306][SQL] Fix collation-aware PIVOT#55109aknayar wants to merge 7 commits intoapache:masterfrom
Conversation
ab18720 to
b0bd784
Compare
stefankandic
left a comment
There was a problem hiding this comment.
Left one comment, otherwise LGTM!
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PivotFirst.scala
Outdated
Show resolved
Hide resolved
f3fe268 to
5800197
Compare
|
@cloud-fan Please take a look when you get a chance. Thanks! |
cloud-fan
left a comment
There was a problem hiding this comment.
Summary
Prior state and problem: PivotFirst maps pivot column values to array indices using either a HashMap (for AtomicType) or a TreeMap (for non-atomic types like structs). The HashMap path relies on UTF8String.equals(), which does byte-level comparison — ignoring collation semantics. For non-binary collations (e.g., UTF8_LCASE, UNICODE_CI), strings that should match under the collation fail to find their pivot index, producing NULL instead of the correct aggregate.
Design approach: Switch from HashMap to TreeMap (with a collation-aware Ordering) when the pivot column's data type does not support binary equality. The decision is centralized in a new usesTreeMap boolean derived from TypeUtils.typeWithProperEquals.
Key design decisions:
- Reuses the existing
TreeMap+getInterpretedOrderingpattern already used for non-atomic types, extending it to cover non-binary-stable atomic types. - The
typeWithProperEqualscheck correctly captures all cases whereHashMapwould fail (non-binary collated strings, binary type, non-atomic types).
Implementation sketch: Two changed methods in PivotFirst:
usesTreeMap/pivotIndex: selectsTreeMap(with collation-aware ordering) vsHashMap(with identity-based equality).findPivotIndex: guards null lookups on theTreeMappath to avoid NPE.
Three new tests cover UTF8_LCASE, UNICODE_CI, and null handling with collated string columns.
What changes were proposed in this pull request?
This PR fixes PivotFirst to respect pivot value collation. Below is a minimalistic repro for the bug:
Given the
UTF8_LCASEcollation, both PIVOT queries should return aSUM(amount)of 150. However, since Scala's ImmutableHashMap does a byte-level comparison of the keys inpivotIndexwith incomingpivotColumnvalues,SUM(amount)for'SALES'returns150andNULLforsales. This is because the HashMap does a byte-by-byte comparison between'sales'and the resolved group-key ('SALES').The fix is to fall back to
TreeMap(already used on nested types) for non-binary-stable string types (e.g.,UTF8_LCASE,UNICODE_CI).Why are the changes needed?
There is currently a correctness bug.
PIVOTcompares atomic strings at the byte level, completely ignoring specified collation and silently returningNULL.Does this PR introduce any user-facing change?
Yes, collation-aware pivot values will now adhere to the specified collation. This will result in fewer false negative pivot column value matches and more meaningful results.
How was this patch tested?
Unit tests testing
UTF8_LCASEandUNICODE_CIcollation with nulls.Was this patch authored or co-authored using generative AI tooling?
Generated-by: claude-4.6-opus-high