Skip to content

Conversation

@adriangb
Copy link
Contributor

@adriangb adriangb commented Dec 24, 2025

Because of #15886 a parse -> unparse -> parse loop changed the query so that it would give incorrect results.

@github-actions github-actions bot added the core Core DataFusion crate label Dec 24, 2025
@github-actions github-actions bot added the sql SQL Planner label Dec 25, 2025
@adriangb adriangb changed the title Demonstarte that Unparser inserts subquery which looses order Preserve ORDER BY in Unparser for projection -> order by pattern Dec 25, 2025
@adriangb adriangb marked this pull request as ready for review December 25, 2025 16:05
@adriangb adriangb requested review from alamb and goldmedal December 25, 2025 16:05
@adriangb
Copy link
Contributor Author

@alamb @goldmedal @y-f-u could you folks take a look at this since you originally added this bit of code in #11527? As far as I can tell this has kept all of those tests passing and only produced some formatting changes in one test's SQL, but I'm not familiar with the Unparser code in general so this needs some critical thought.

Comment on lines 46 to 51
SELECT
col * 2 as x_bucket,
count(*)
FROM t1
GROUP BY x_bucket
ORDER BY x_bucket, count(*)
Copy link
Contributor Author

@adriangb adriangb Dec 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably move this to a test in plan_to_sql.rs but I struggled a bit translating it since there's limited functions available (e.g. count(*)). I do also think e2e tests with data are useful in that they don't require a specific SQL representation as long as query semantics are maintained. But I will try to port again once we get some initial feedback here.

assert_snapshot!(
sql,
@"SELECT j1.j1_id, j1.j1_string, lochierarchy FROM (SELECT j1.j1_id, j1.j1_string, (grouping(j1.j1_id) + grouping(j1.j1_string)) AS lochierarchy, grouping(j1.j1_string), grouping(j1.j1_id) FROM j1 GROUP BY ROLLUP (j1.j1_id, j1.j1_string) ORDER BY lochierarchy DESC NULLS FIRST, CASE WHEN ((grouping(j1.j1_id) + grouping(j1.j1_string)) = 0) THEN j1.j1_id END ASC NULLS LAST) LIMIT 100"
@r#"SELECT j1.j1_id, j1.j1_string, lochierarchy FROM (SELECT j1.j1_id, j1.j1_string, (grouping(j1.j1_id) + grouping(j1.j1_string)) AS lochierarchy, grouping(j1.j1_string), grouping(j1.j1_id) FROM j1 GROUP BY ROLLUP (j1.j1_id, j1.j1_string)) ORDER BY lochierarchy DESC NULLS FIRST, CASE WHEN (("grouping(j1.j1_id)" + "grouping(j1.j1_string)") = 0) THEN j1.j1_id END ASC NULLS LAST LIMIT 100"#
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatted difference:

- @"SELECT
-   j1.j1_id,
-   j1.j1_string,
-   lochierarchy
- FROM (
-   SELECT
-     j1.j1_id,
-     j1.j1_string,
-     (grouping(j1.j1_id) + grouping(j1.j1_string)) AS lochierarchy,
-     grouping(j1.j1_string),
-     grouping(j1.j1_id)
-   FROM j1
-   GROUP BY ROLLUP (j1.j1_id, j1.j1_string)
-   ORDER BY
-     lochierarchy DESC NULLS FIRST,
-     CASE
-       WHEN ((grouping(j1.j1_id) + grouping(j1.j1_string)) = 0) THEN j1.j1_id
-     END ASC NULLS LAST
- )
- LIMIT 100"
+ @r#"SELECT
+   j1.j1_id,
+   j1.j1_string,
+   lochierarchy
+ FROM (
+   SELECT
+     j1.j1_id,
+     j1.j1_string,
+     (grouping(j1.j1_id) + grouping(j1.j1_string)) AS lochierarchy,
+     grouping(j1.j1_string),
+     grouping(j1.j1_id)
+   FROM j1
+   GROUP BY ROLLUP (j1.j1_id, j1.j1_string)
+ )
+ ORDER BY
+   lochierarchy DESC NULLS FIRST,
+   CASE
+     WHEN (("grouping(j1.j1_id)" + "grouping(j1.j1_string)") = 0) THEN j1.j1_id
+   END ASC NULLS LAST
+ LIMIT 100"#

As you can see the ORDER BY got moved outside of the subquery, which is what we want.

@adriangb
Copy link
Contributor Author

I've added a property based test that asserts the property that results should be the same after unparsing and re-parsing a query given the same input data*. I think this is a good test because:

  • It uses real world queries and data
  • It's a property based test on the thing users care about in general (correct results) instead of e.g. asserting the unparsed SQL matches some shape

*: Not all queries have a deterministic sort order. I check if the original query has a known output ordering and if it doesn't I sort both outputs.

These tests show that without these fixes there are two issues for ClickBench queries:

  • Column name quoting is missing for columns with uppercase letters
  • The ORDER BY bug

Here is the failure output (also relevant to judge since the tests are being added):

3 Clickbench test(s) failed:

Results mismatch for q15.
Original SQL:
-- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true

SELECT "UserID", COUNT(*) FROM hits GROUP BY "UserID" ORDER BY COUNT(*) DESC LIMIT 10;

Unparsed SQL:
SELECT
  hits."UserID",
  "count(*)"
FROM
  (
    SELECT
      hits."UserID",
      count(1) AS "count(*)",
      count(1)
    FROM
      hits
    GROUP BY
      hits."UserID" ORDER BY count(1) DESC NULLS FIRST
  ) LIMIT 10

---

Results mismatch for q16.
Original SQL:
-- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true

SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;

Unparsed SQL:
SELECT
  hits."UserID",
  hits."SearchPhrase",
  "count(*)"
FROM
  (
    SELECT
      hits."UserID",
      hits."SearchPhrase",
      count(1) AS "count(*)",
      count(1)
    FROM
      hits
    GROUP BY
      hits."UserID", hits."SearchPhrase" ORDER BY count(1) DESC NULLS FIRST
  ) LIMIT 10

---

Results mismatch for q18.
Original SQL:
-- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true

SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;

Unparsed SQL:
SELECT
  hits."UserID",
  m,
  hits."SearchPhrase",
  "count(*)"
FROM
  (
    SELECT
      hits."UserID",
      date_part('MINUTE', to_timestamp_seconds(hits."EventTime")) AS m,
      hits."SearchPhrase",
      count(1) AS "count(*)",
      count(1)
    FROM
      hits
    GROUP BY
      hits."UserID", date_part('MINUTE', to_timestamp_seconds(hits."EventTime")), hits."SearchPhrase" ORDER BY count(1) DESC NULLS FIRST
  ) LIMIT 10

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate sql SQL Planner

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant