[SPARK-56306][SQL] Fix collation-aware PIVOT by aknayar · Pull Request #55109 · apache/spark

aknayar · 2026-03-31T04:26:18Z

What changes were proposed in this pull request?

This PR fixes PivotFirst to respect pivot value collation. Below is a minimalistic repro for the bug:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PivotCollationTest").getOrCreate()

data = [(1, "SALES", 100), (1, "sales", 50)]
df = spark.createDataFrame(data, ["emp_id", "dept", "amount"])
df.createOrReplaceTempView("df")

COLLATED = "(SELECT emp_id, COLLATE(dept, 'UTF8_LCASE') AS dept, amount FROM df)"

print("Pivot with 'SALES' - expects 150, gets 150")
spark.sql(f"""
    SELECT * FROM {COLLATED}
    PIVOT (SUM(amount) FOR dept IN ('SALES'))
""").show()

print("Pivot with 'sales' - expects 150, gets NULL")
spark.sql(f"""
    SELECT * FROM {COLLATED}
    PIVOT (SUM(amount) FOR dept IN ('sales'))
""").show()

spark.stop()

Given the UTF8_LCASE collation, both PIVOT queries should return a SUM(amount) of 150. However, since Scala's ImmutableHashMap does a byte-level comparison of the keys in pivotIndex with incoming pivotColumn values, SUM(amount) for 'SALES' returns 150 and NULL for sales. This is because the HashMap does a byte-by-byte comparison between 'sales' and the resolved group-key ('SALES').

The fix is to fall back to TreeMap (already used on nested types) for non-binary-stable string types (e.g., UTF8_LCASE, UNICODE_CI).

Why are the changes needed?

There is currently a correctness bug. PIVOT compares atomic strings at the byte level, completely ignoring specified collation and silently returning NULL.

Does this PR introduce any user-facing change?

Yes, collation-aware pivot values will now adhere to the specified collation. This will result in fewer false negative pivot column value matches and more meaningful results.

How was this patch tested?

Unit tests testing UTF8_LCASE and UNICODE_CI collation with nulls.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: claude-4.6-opus-high

stefankandic

Left one comment, otherwise LGTM!

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PivotFirst.scala

stefankandic · 2026-03-31T22:01:36Z

@cloud-fan Please take a look when you get a chance. Thanks!

cloud-fan

Summary

Prior state and problem: PivotFirst maps pivot column values to array indices using either a HashMap (for AtomicType) or a TreeMap (for non-atomic types like structs). The HashMap path relies on UTF8String.equals(), which does byte-level comparison — ignoring collation semantics. For non-binary collations (e.g., UTF8_LCASE, UNICODE_CI), strings that should match under the collation fail to find their pivot index, producing NULL instead of the correct aggregate.

Design approach: Switch from HashMap to TreeMap (with a collation-aware Ordering) when the pivot column's data type does not support binary equality. The decision is centralized in a new usesTreeMap boolean derived from TypeUtils.typeWithProperEquals.

Key design decisions:

Reuses the existing TreeMap + getInterpretedOrdering pattern already used for non-atomic types, extending it to cover non-binary-stable atomic types.
The typeWithProperEquals check correctly captures all cases where HashMap would fail (non-binary collated strings, binary type, non-atomic types).

Implementation sketch: Two changed methods in PivotFirst:

usesTreeMap / pivotIndex: selects TreeMap (with collation-aware ordering) vs HashMap (with identity-based equality).
findPivotIndex: guards null lookups on the TreeMap path to avoid NPE.

Three new tests cover UTF8_LCASE, UNICODE_CI, and null handling with collated string columns.

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PivotFirst.scala

aknayar marked this pull request as draft March 31, 2026 04:26

aknayar marked this pull request as ready for review March 31, 2026 16:29

aknayar changed the title ~~Collation-aware PIVOT~~ [SPARK-56305] Fix collation-aware PIVOT Mar 31, 2026

aknayar force-pushed the collation-aware-pivot branch from ab18720 to b0bd784 Compare March 31, 2026 16:47

aknayar changed the title ~~[SPARK-56305] Fix collation-aware PIVOT~~ [SPARK-56306] Fix collation-aware PIVOT Mar 31, 2026

aknayar changed the title ~~[SPARK-56306] Fix collation-aware PIVOT~~ [SPARK-56306][SQL] Fix collation-aware PIVOT Mar 31, 2026

stefankandic approved these changes Mar 31, 2026

View reviewed changes

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PivotFirst.scala Outdated Show resolved Hide resolved

aknayar added 4 commits March 31, 2026 19:02

Respect pivot column collation

f35463f

Clean up

2feedef

Use typeWithProperEquals

ff8585b

Remove unused import

5800197

aknayar force-pushed the collation-aware-pivot branch from f3fe268 to 5800197 Compare March 31, 2026 19:02

aknayar added 2 commits March 31, 2026 19:42

Format

fd18311

Turn off formatter for non-ascii comments

28c5711

cloud-fan approved these changes Apr 1, 2026

View reviewed changes

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PivotFirst.scala Outdated Show resolved Hide resolved

Simplify

2abb16b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-56306][SQL] Fix collation-aware PIVOT#55109

[SPARK-56306][SQL] Fix collation-aware PIVOT#55109
aknayar wants to merge 7 commits intoapache:masterfrom
aknayar:collation-aware-pivot

aknayar commented Mar 31, 2026 •

edited

Loading

Uh oh!

stefankandic left a comment

Uh oh!

Uh oh!

stefankandic commented Mar 31, 2026

Uh oh!

cloud-fan left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

aknayar commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

stefankandic left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

stefankandic commented Mar 31, 2026

Uh oh!

cloud-fan left a comment

Choose a reason for hiding this comment

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

aknayar commented Mar 31, 2026 •

edited

Loading