Preserve ORDER BY in Unparser for projection -> order by pattern #19483

adriangb · 2025-12-24T23:42:49Z

Because of #15886, a parse -> unparse -> parse round trip changed the query so that it returned incorrect results.

adriangb · 2025-12-25T16:06:42Z

@alamb @goldmedal @y-f-u could you folks take a look at this since you originally added this bit of code in #11527? As far as I can tell this has kept all of those tests passing and only produced some formatting changes in one test's SQL, but I'm not familiar with the Unparser code in general so this needs some critical thought.

adriangb · 2025-12-25T16:10:14Z

datafusion/core/tests/sql/orderby.rs

+        SELECT
+          col * 2 as x_bucket,
+          count(*)
+        FROM t1
+        GROUP BY x_bucket
+        ORDER BY x_bucket, count(*)


We can probably move this to a test in plan_to_sql.rs, but I struggled a bit translating it since there are limited functions available there (e.g. count(*)). I also think e2e tests with data are useful because they don't require a specific SQL representation as long as query semantics are maintained. I will try to port it again once we get some initial feedback here.

adriangb · 2025-12-26T05:25:44Z

datafusion/sql/tests/cases/plan_to_sql.rs

        assert_snapshot!(
            sql,
-            @"SELECT j1.j1_id, j1.j1_string, lochierarchy FROM (SELECT j1.j1_id, j1.j1_string, (grouping(j1.j1_id) + grouping(j1.j1_string)) AS lochierarchy, grouping(j1.j1_string), grouping(j1.j1_id) FROM j1 GROUP BY ROLLUP (j1.j1_id, j1.j1_string) ORDER BY lochierarchy DESC NULLS FIRST, CASE WHEN ((grouping(j1.j1_id) + grouping(j1.j1_string)) = 0) THEN j1.j1_id END ASC NULLS LAST) LIMIT 100"
+            @r#"SELECT j1.j1_id, j1.j1_string, lochierarchy FROM (SELECT j1.j1_id, j1.j1_string, (grouping(j1.j1_id) + grouping(j1.j1_string)) AS lochierarchy, grouping(j1.j1_string), grouping(j1.j1_id) FROM j1 GROUP BY ROLLUP (j1.j1_id, j1.j1_string)) ORDER BY lochierarchy DESC NULLS FIRST, CASE WHEN (("grouping(j1.j1_id)" + "grouping(j1.j1_string)") = 0) THEN j1.j1_id END ASC NULLS LAST LIMIT 100"#


Formatted difference:

- @"SELECT
-   j1.j1_id,
-   j1.j1_string,
-   lochierarchy
- FROM (
-   SELECT
-     j1.j1_id,
-     j1.j1_string,
-     (grouping(j1.j1_id) + grouping(j1.j1_string)) AS lochierarchy,
-     grouping(j1.j1_string),
-     grouping(j1.j1_id)
-   FROM j1
-   GROUP BY ROLLUP (j1.j1_id, j1.j1_string)
-   ORDER BY
-     lochierarchy DESC NULLS FIRST,
-     CASE
-       WHEN ((grouping(j1.j1_id) + grouping(j1.j1_string)) = 0) THEN j1.j1_id
-     END ASC NULLS LAST
- )
- LIMIT 100"
+ @r#"SELECT
+   j1.j1_id,
+   j1.j1_string,
+   lochierarchy
+ FROM (
+   SELECT
+     j1.j1_id,
+     j1.j1_string,
+     (grouping(j1.j1_id) + grouping(j1.j1_string)) AS lochierarchy,
+     grouping(j1.j1_string),
+     grouping(j1.j1_id)
+   FROM j1
+   GROUP BY ROLLUP (j1.j1_id, j1.j1_string)
+ )
+ ORDER BY
+   lochierarchy DESC NULLS FIRST,
+   CASE
+     WHEN (("grouping(j1.j1_id)" + "grouping(j1.j1_string)") = 0) THEN j1.j1_id
+   END ASC NULLS LAST
+ LIMIT 100"#

As you can see the ORDER BY got moved outside of the subquery, which is what we want.
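
To illustrate why the position of the ORDER BY matters, here is a minimal std-only sketch. It is not DataFusion code: the `Row` type and both function names are hypothetical, and it assumes (per the discussion in #15886) that a sort inside a subquery may be dropped, so an outer LIMIT sees rows in arbitrary order. The tie-break on the first column is my addition to keep the sketch deterministic.

```rust
/// One aggregated row: (user_id, count). Hypothetical stand-in for a real batch.
type Row = (u64, u64);

/// Models `SELECT ... ORDER BY count DESC LIMIT n`:
/// the sort is applied first, then the limit.
fn order_then_limit(mut rows: Vec<Row>, n: usize) -> Vec<Row> {
    // Sort by count descending; break ties on user_id so the result is stable.
    rows.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    rows.truncate(n);
    rows
}

/// Models `SELECT ... FROM (<subquery whose ORDER BY was dropped>) LIMIT n`:
/// the limit applies to rows in whatever order they happen to arrive.
fn limit_without_order(mut rows: Vec<Row>, n: usize) -> Vec<Row> {
    rows.truncate(n);
    rows
}
```

With input rows `(1, 5), (2, 50), (3, 7)` and `n = 1`, the first function keeps the row with the highest count while the second keeps whichever row came first, which is the kind of mismatch the round trip could produce.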

adriangb · 2025-12-26T05:31:31Z

I've added a property-based test asserting that results are the same after unparsing and re-parsing a query, given the same input data*. I think this is a good test because:

- It uses real-world queries and data
- It is a property-based test on what users generally care about (correct results) rather than, e.g., asserting that the unparsed SQL matches some shape

*: Not all queries have a deterministic sort order. I check whether the original query has a known output ordering and, if it doesn't, I sort both outputs.
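
The comparison rule in the footnote can be sketched roughly as follows. This is a simplified, std-only illustration rather than the test code in this PR: `Rows`, `results_match`, and the `has_known_order` flag are hypothetical names, and rows are modeled as vectors of stringified cells instead of Arrow record batches.

```rust
/// A result set as rows of stringified cells (a stand-in for real record batches).
type Rows = Vec<Vec<String>>;

/// Returns true if the original and round-tripped results agree.
///
/// If the original query has a known output ordering, row order must match
/// exactly. Otherwise both sides are sorted first, so only the multiset of
/// rows is compared.
fn results_match(mut original: Rows, mut roundtripped: Rows, has_known_order: bool) -> bool {
    if !has_known_order {
        original.sort();
        roundtripped.sort();
    }
    original == roundtripped
}
```

Sorting only when the order is unknown keeps the check strict for queries with an ORDER BY (where a lost sort should fail the test) while avoiding false failures for queries whose row order is unspecified.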

These tests show that, without these fixes, ClickBench queries hit two issues:

- Column name quoting is missing for columns with uppercase letters
- The ORDER BY bug
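
As a rough illustration of the first issue, the sketch below quotes an identifier whenever it would not survive re-parsing unquoted. It is not the Unparser's implementation: `quote_if_needed` is a hypothetical helper, it assumes unquoted identifiers are normalized to lowercase when parsed, and it deliberately ignores reserved keywords.

```rust
/// Returns the identifier as it could appear in generated SQL.
///
/// Assumption: an unquoted identifier is lowercased on re-parse, so anything
/// other than `[a-z_][a-z0-9_]*` is wrapped in double quotes.
fn quote_if_needed(ident: &str) -> String {
    let mut chars = ident.chars();
    let plain = match chars.next() {
        // First character must be a lowercase letter or underscore...
        Some(c) if c.is_ascii_lowercase() || c == '_' => {
            // ...and the rest lowercase letters, digits, or underscores.
            chars.all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_')
        }
        _ => false,
    };
    if plain {
        ident.to_string()
    } else {
        // Escape embedded double quotes by doubling them.
        format!("\"{}\"", ident.replace('"', "\"\""))
    }
}
```

Under these assumptions `hits` stays unquoted while `UserID` becomes `"UserID"`, matching the quoting seen in the unparsed ClickBench queries.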

Here is the failure output (also useful for judging the tests being added):

3 Clickbench test(s) failed:

Results mismatch for q15.
Original SQL:
-- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true

SELECT "UserID", COUNT(*) FROM hits GROUP BY "UserID" ORDER BY COUNT(*) DESC LIMIT 10;

Unparsed SQL:
SELECT
  hits."UserID",
  "count(*)"
FROM
  (
    SELECT
      hits."UserID",
      count(1) AS "count(*)",
      count(1)
    FROM
      hits
    GROUP BY
      hits."UserID" ORDER BY count(1) DESC NULLS FIRST
  ) LIMIT 10

---

Results mismatch for q16.
Original SQL:
-- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true

SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;

Unparsed SQL:
SELECT
  hits."UserID",
  hits."SearchPhrase",
  "count(*)"
FROM
  (
    SELECT
      hits."UserID",
      hits."SearchPhrase",
      count(1) AS "count(*)",
      count(1)
    FROM
      hits
    GROUP BY
      hits."UserID", hits."SearchPhrase" ORDER BY count(1) DESC NULLS FIRST
  ) LIMIT 10

---

Results mismatch for q18.
Original SQL:
-- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true

SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;

Unparsed SQL:
SELECT
  hits."UserID",
  m,
  hits."SearchPhrase",
  "count(*)"
FROM
  (
    SELECT
      hits."UserID",
      date_part('MINUTE', to_timestamp_seconds(hits."EventTime")) AS m,
      hits."SearchPhrase",
      count(1) AS "count(*)",
      count(1)
    FROM
      hits
    GROUP BY
      hits."UserID", date_part('MINUTE', to_timestamp_seconds(hits."EventTime")), hits."SearchPhrase" ORDER BY count(1) DESC NULLS FIRST
  ) LIMIT 10

github-actions bot added the core Core DataFusion crate label Dec 24, 2025

adriangb mentioned this pull request Dec 24, 2025

[DISCUSSION] Sorts being removed from subqueries #15886

Open

github-actions bot added the sql SQL Planner label Dec 25, 2025

adriangb changed the title ~~Demonstarte that Unparser inserts subquery which looses order~~ Preserve ORDER BY in Unparser for projection -> order by pattern Dec 25, 2025

adriangb mentioned this pull request Dec 25, 2025

Preserve ordering from subqueries #19484

Draft

adriangb marked this pull request as ready for review December 25, 2025 16:05

adriangb requested review from alamb and goldmedal December 25, 2025 16:05


adriangb added 2 commits December 25, 2025 20:44

add roundtrip tests for Unparser using clickbench / tpch

062e170

fixed

3019e75

adriangb force-pushed the orderby-bug branch from 7c40d24 to 3019e75 Compare December 26, 2025 05:21
