Skip to content

[SPARK-56306][SQL] Fix collation-aware PIVOT#55109

Open
aknayar wants to merge 7 commits intoapache:masterfrom
aknayar:collation-aware-pivot
Open

[SPARK-56306][SQL] Fix collation-aware PIVOT#55109
aknayar wants to merge 7 commits intoapache:masterfrom
aknayar:collation-aware-pivot

Conversation

@aknayar
Copy link
Copy Markdown

@aknayar aknayar commented Mar 31, 2026

What changes were proposed in this pull request?

This PR fixes PivotFirst to respect pivot value collation. Below is a minimalistic repro for the bug:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PivotCollationTest").getOrCreate()

data = [(1, "SALES", 100), (1, "sales", 50)]
df = spark.createDataFrame(data, ["emp_id", "dept", "amount"])
df.createOrReplaceTempView("df")

COLLATED = "(SELECT emp_id, COLLATE(dept, 'UTF8_LCASE') AS dept, amount FROM df)"

print("Pivot with 'SALES' - expects 150, gets 150")
spark.sql(f"""
    SELECT * FROM {COLLATED}
    PIVOT (SUM(amount) FOR dept IN ('SALES'))
""").show()

print("Pivot with 'sales' - expects 150, gets NULL")
spark.sql(f"""
    SELECT * FROM {COLLATED}
    PIVOT (SUM(amount) FOR dept IN ('sales'))
""").show()

spark.stop()

Given the UTF8_LCASE collation, both PIVOT queries should return a SUM(amount) of 150. However, since Scala's ImmutableHashMap does a byte-level comparison of the keys in pivotIndex with incoming pivotColumn values, SUM(amount) for 'SALES' returns 150 and NULL for sales. This is because the HashMap does a byte-by-byte comparison between 'sales' and the resolved group-key ('SALES').

The fix is to fall back to TreeMap (already used on nested types) for non-binary-stable string types (e.g., UTF8_LCASE, UNICODE_CI).

Why are the changes needed?

There is currently a correctness bug. PIVOT compares atomic strings at the byte level, completely ignoring specified collation and silently returning NULL.

Does this PR introduce any user-facing change?

Yes, collation-aware pivot values will now adhere to the specified collation. This will result in fewer false negative pivot column value matches and more meaningful results.

How was this patch tested?

Unit tests testing UTF8_LCASE and UNICODE_CI collation with nulls.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: claude-4.6-opus-high

@aknayar aknayar marked this pull request as draft March 31, 2026 04:26
@aknayar aknayar marked this pull request as ready for review March 31, 2026 16:29
@aknayar aknayar changed the title Collation-aware PIVOT [SPARK-56305] Fix collation-aware PIVOT Mar 31, 2026
@aknayar aknayar force-pushed the collation-aware-pivot branch from ab18720 to b0bd784 Compare March 31, 2026 16:47
@aknayar aknayar changed the title [SPARK-56305] Fix collation-aware PIVOT [SPARK-56306] Fix collation-aware PIVOT Mar 31, 2026
@aknayar aknayar changed the title [SPARK-56306] Fix collation-aware PIVOT [SPARK-56306][SQL] Fix collation-aware PIVOT Mar 31, 2026
Copy link
Copy Markdown
Contributor

@stefankandic stefankandic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one comment, otherwise LGTM!

@aknayar aknayar force-pushed the collation-aware-pivot branch from f3fe268 to 5800197 Compare March 31, 2026 19:02
@stefankandic
Copy link
Copy Markdown
Contributor

@cloud-fan Please take a look when you get a chance. Thanks!

Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Prior state and problem: PivotFirst maps pivot column values to array indices using either a HashMap (for AtomicType) or a TreeMap (for non-atomic types like structs). The HashMap path relies on UTF8String.equals(), which does byte-level comparison — ignoring collation semantics. For non-binary collations (e.g., UTF8_LCASE, UNICODE_CI), strings that should match under the collation fail to find their pivot index, producing NULL instead of the correct aggregate.

Design approach: Switch from HashMap to TreeMap (with a collation-aware Ordering) when the pivot column's data type does not support binary equality. The decision is centralized in a new usesTreeMap boolean derived from TypeUtils.typeWithProperEquals.

Key design decisions:

  • Reuses the existing TreeMap + getInterpretedOrdering pattern already used for non-atomic types, extending it to cover non-binary-stable atomic types.
  • The typeWithProperEquals check correctly captures all cases where HashMap would fail (non-binary collated strings, binary type, non-atomic types).

Implementation sketch: Two changed methods in PivotFirst:

  • usesTreeMap / pivotIndex: selects TreeMap (with collation-aware ordering) vs HashMap (with identity-based equality).
  • findPivotIndex: guards null lookups on the TreeMap path to avoid NPE.

Three new tests cover UTF8_LCASE, UNICODE_CI, and null handling with collated string columns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants