-
Notifications
You must be signed in to change notification settings - Fork 4k
Open
Description
Describe the bug, including details regarding any error messages, version, and platform.
I'm not sure exactly what's happening here.
Discovered while porting this polars.DataFrame.pivot test in (narwhals-dev/narwhals#3373)
Repro
from __future__ import annotations
def pivot_wider(on: str, index: str, values: str) -> pa.Table:
import pyarrow as pa
import pyarrow.compute as pc # noqa: F811
from pyarrow.acero import AggregateNodeOptions, Declaration, TableSourceNodeOptions
data = {
"foo": ["A", "A", "B", "B", "C"],
"bar": ["k", "l", "m", "n", "o"],
"N1": [1, 2, 2, 4, 2],
"N2": [1, 2, 2, 4, 2],
}
table = pa.Table.from_pydict(data)
source = Declaration("table_source", options=TableSourceNodeOptions(table))
key_names = table.column(on).unique().to_pylist()
options = pc.PivotWiderOptions(key_names, unexpected_key_behavior="raise")
aggregates = [([on, values], "hash_pivot_wider", options, values)]
pivot = Declaration("aggregate", AggregateNodeOptions(aggregates, [index]))
result = Declaration.from_sequence([source, pivot]).to_table()
return result.flatten()The original test was for roughly this:
print(pivot_wider(on="bar", index="foo", values="N1").to_pydict())And by swapping on and index, it looks like there's some sorting happening behind the scenes:
print(pivot_wider(on="foo", index="bar", values="N1").to_pydict())Output
"n" and "o" appear in the wrong field:
{'foo': ['A', 'B', 'C'],
'N1.k': [1, None, None],
'N1.l': [2, None, None],
'N1.m': [None, 2, None],
'N1.n': [None, None, 2],
'N1.o': [None, 4, None]}Here, they're correct (refer to the right column) - but you can see the order has changed in the same way:
{'bar': ['k', 'l', 'm', 'o', 'n'],
'N1.A': [1, 2, None, None, None],
'N1.B': [None, None, 2, None, 4],
'N1.C': [None, None, None, 2, None]}Expected results
Ignore the difference in column names, but this:
import polars as pl
data = {
"foo": ["A", "A", "B", "B", "C"],
"bar": ["k", "l", "m", "n", "o"],
"N1": [1, 2, 2, 4, 2],
"N2": [1, 2, 2, 4, 2],
}
df = pl.DataFrame(data)
print(df.pivot(on="bar", index="foo", values="N1").to_dict(as_series=False)){'foo': ['A', 'B', 'C'],
'k': [1, None, None],
'l': [2, None, None],
'm': [None, 2, None],
'n': [None, 4, None],
'o': [None, None, 2]}print(df.pivot(on="foo", index="bar", values="N1").to_dict(as_series=False)){'bar': ['k', 'l', 'm', 'n', 'o'],
'A': [1, 2, None, None, None],
'B': [None, None, 2, 4, None],
'C': [None, None, None, None, 2]}Component(s)
Python, C++