forked from apache/datafusion
Coerce union #49
Open: friendlymatthew wants to merge 107 commits into main from friendlymatthew/union-coercion (base: main)
Conversation
…e#18954) ## Which issue does this PR close? Closes apache#18947 ## Rationale for this change Currently, DataFusion uses default compression levels when writing compressed JSON and CSV files. For ZSTD, this means level 3, which prioritizes speed over compression ratio. Users working with large datasets who want to optimize for storage costs or network transfer have no way to increase the compression level. This is particularly important for cloud data lake scenarios where storage and egress costs can be significant. ## What changes are included in this PR? - Add `compression_level: Option<u32>` field to `JsonOptions` and `CsvOptions` in `config.rs` - Add `convert_async_writer_with_level()` method to `FileCompressionType` (non-breaking API extension) - Keep original `convert_async_writer()` as a convenience wrapper for backward compatibility - Update `JsonWriterOptions` and `CsvWriterOptions` with `compression_level` field - Update `ObjectWriterBuilder` to support compression level - Update JSON and CSV sinks to pass compression level through the write pipeline - Update proto definitions and conversions for serialization support - Fix unrelated unused import warning in `udf.rs` (conditional compilation for debug-only imports) ## Are these changes tested? The changes follow the existing patterns used throughout the codebase. The implementation was verified by: - Building successfully with `cargo build` - Running existing tests with `cargo test --package datafusion-proto` - All 131 proto integration tests pass ## Are there any user-facing changes? Yes, users can now specify compression level when writing JSON/CSV files: ```rust use datafusion::common::config::JsonOptions; use datafusion::common::parsers::CompressionTypeVariant; let json_opts = JsonOptions { compression: CompressionTypeVariant::ZSTD, compression_level: Some(9), // Higher compression ..Default::default() }; ``` **Supported compression levels:** - ZSTD: 1-22 (default: 3) - GZIP: 0-9 (default: 6) - BZIP2: 1-9 (default: 9) - XZ: 0-9 (default: 6) **This is a non-breaking change** - the original `convert_async_writer()` method signature is preserved for backward compatibility. Co-authored-by: Andrew Lamb <[email protected]>
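For symmetry with the JSON snippet above, a CSV configuration would presumably look like the following. This is only a sketch based on the description: the `compression_level` field on `CsvOptions` is the addition this PR describes, not a long-standing API.

```rust
use datafusion::common::config::CsvOptions;
use datafusion::common::parsers::CompressionTypeVariant;

fn csv_opts_with_level() -> CsvOptions {
    CsvOptions {
        compression: CompressionTypeVariant::GZIP,
        // `compression_level` is the new field added by this PR; treat it as an
        // assumption about the final API rather than existing DataFusion code.
        compression_level: Some(9), // GZIP accepts 0-9, default 6 (see table above)
        ..Default::default()
    }
}
```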
## Which issue does this PR close? - Related to apache#19241 ## Rationale for this change This PR enhances the `in_list` benchmark suite to provide more comprehensive performance measurements across a wider range of data types and list sizes. These improvements are necessary groundwork for evaluating optimizations proposed in apache#19241. The current benchmarks were limited in scope, making it difficult to assess the performance impact of potential `in_list` optimizations across different data types and scenarios. ## What changes are included in this PR? - Added benchmarks for `UInt8Array`, `Int16Array`, and `TimestampNanosecondArray` - Added `28` to `IN_LIST_LENGTHS` (now `[3, 8, 28, 100]`) to better cover the range between small and large lists - Increased `ARRAY_LENGTH` from `1024` to `8192` to align with the default DataFusion batch size - Configured criterion with shorter warm-up (100ms) and measurement times (500ms) for faster iteration ## Are these changes tested? Yes, this PR adds benchmark coverage. The benchmarks can be run with: ```bash cargo bench --bench in_list ``` The benchmarks verify that the `in_list` expression evaluates correctly for all the new data types. ## Are there any user-facing changes? No user-facing changes. This PR only affects the benchmark suite used for performance testing and development.
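As a rough sketch of the criterion settings described above (the 100 ms warm-up and 500 ms measurement window), a configuration along these lines would apply them; the benchmark body here is a placeholder, not the actual `in_list` benchmark.

```rust
use std::time::Duration;

use criterion::{criterion_group, criterion_main, Criterion};

// Placeholder benchmark body; the real suite benches `in_list` expressions.
fn bench_in_list(c: &mut Criterion) {
    c.bench_function("in_list_placeholder", |b| b.iter(|| 1 + 1));
}

// Shorter warm-up and measurement times for faster iteration, as described above.
fn fast_config() -> Criterion {
    Criterion::default()
        .warm_up_time(Duration::from_millis(100))
        .measurement_time(Duration::from_millis(500))
}

criterion_group! {
    name = benches;
    config = fast_config();
    targets = bench_in_list
}
criterion_main!(benches);
```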
## Which issue does this PR close? - Part of apache#18411 ## Rationale for this change I want to optimize hashing for StringViewArray. In order to do so, I would like a benchmark to show it works. ## What changes are included in this PR? Add a benchmark for `with_hashes`. Run it like ```shell cargo bench --bench with_hashes ``` Note I did not add all the possible types of arrays as I don't plan to optimize the others. ## Are these changes tested? I ran it manually ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#19156 - Part of apache#19144 ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? - Spark `next_day` uses `return_field_from_args` <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? - Added new unit tests, previous all tests pass <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
…e#19313) ## Summary This PR moves the CSV-specific `newlines_in_values` configuration option from `FileScanConfig` (a shared format-agnostic configuration) to `CsvSource` where it belongs. - Add `newlines_in_values` field and methods to `CsvSource` - Add `has_newlines_in_values()` method to `FileSource` trait (returns `false` by default) - Update `FileSource::repartitioned()` to use the new trait method - Remove `new_lines_in_values` from `FileScanConfig` and its builder - Update proto serialization to read from/write to `CsvSource` - Update tests and documentation - Add migration guide to `upgrading.md` Closes apache#18453 ## Test plan - [x] All existing tests pass - [x] Doc tests pass - [x] Proto roundtrip tests pass - [x] Clippy clean 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.5 <[email protected]>
…filing (apache#19127) ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#18138 ## Rationale for this change The `list` operation returns a stream, so it previously recorded `duration: None`, missing performance insights. Time-to-first-item is a useful metric for list operations, indicating how quickly results start. This adds duration tracking by measuring time until the first item is yielded (or the stream ends). ## What changes are included in this PR? 1. Added `TimeToFirstItemStream`: A stream wrapper that measures elapsed time from creation until the first item is yielded (or the stream ends if empty). 2. Updated `instrumented_list`: Wraps the inner stream with `TimeToFirstItemStream` to record duration. 3. Changed `requests` field: Switched from `Mutex<Vec<RequestDetails>>` to `Arc<Mutex<Vec<RequestDetails>>>` to allow sharing across async boundaries (needed for the stream wrapper). 4. Updated tests: Modified `instrumented_store_list` to consume at least one stream item and verify that `duration` is now `Some(Duration)` instead of `None`. ## Are these changes tested? Yes. The existing test `instrumented_store_list` was updated to: - Consume at least one item from the stream using `stream.next().await` - Assert that `request.duration.is_some()` (previously `is_none()`) All tests pass, including the updated list test and other instrumented operation tests. ## Are there any user-facing changes? Users with profiling enabled will see duration values for `list` operations instead of nothing.
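A minimal sketch of the wrapper idea described above, with hypothetical names rather than the actual DataFusion types: poll the inner stream and record the elapsed time the first time it yields an item (or ends).

```rust
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::{Duration, Instant};

use futures::Stream;

// Hypothetical sketch, not the actual DataFusion type: wrap a stream and
// record the elapsed time until it first yields an item (or ends if empty).
struct TimeToFirstItem<S> {
    inner: S,
    start: Instant,
    time_to_first_item: Option<Duration>,
}

impl<S> TimeToFirstItem<S> {
    fn new(inner: S) -> Self {
        Self { inner, start: Instant::now(), time_to_first_item: None }
    }
}

impl<S: Stream + Unpin> Stream for TimeToFirstItem<S> {
    type Item = S::Item;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        let this = self.get_mut();
        let poll = Pin::new(&mut this.inner).poll_next(cx);
        if this.time_to_first_item.is_none() {
            if let Poll::Ready(_) = &poll {
                // First item (or end of an empty stream): record the duration once.
                this.time_to_first_item = Some(this.start.elapsed());
            }
        }
        poll
    }
}
```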
…e#19382) Bumps [taiki-e/install-action](https://github.com/taiki-e/install-action) from 2.63.3 to 2.64.0.

Release highlights for 2.64.0 (from the project's [release notes](https://github.com/taiki-e/install-action/releases) and [changelog](https://github.com/taiki-e/install-action/blob/main/CHANGELOG.md)):
- The `tool` input option now supports whitespace (space, tab, and line) or comma separated lists; previously only comma-separated lists were supported ([#1366](https://redirect.github.com/taiki-e/install-action/pull/1366)).
- Support `prek` ([#1357](https://redirect.github.com/taiki-e/install-action/pull/1357), thanks @j178), `mdbook-mermaid` ([#1359](https://redirect.github.com/taiki-e/install-action/pull/1359), thanks @CommanderStorm), and `martin` ([#1364](https://redirect.github.com/taiki-e/install-action/pull/1364), thanks @CommanderStorm).
- Update `trivy@latest` to 0.68.2, `xh@latest` to 0.25.3, `mise@latest` to 2025.12.10, `uv@latest` to 0.9.18, and `cargo-shear@latest` to 1.9.1.

Individual commits are viewable in the [compare view](https://github.com/taiki-e/install-action/compare/d850aa816998e5cf15f67a78c7b933f2a5033f8a...69e777b377e4ec209ddad9426ae3e0c1008b0ef3).

[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. The standard Dependabot commands (`@dependabot rebase`, `recreate`, `merge`, `squash and merge`, `cancel merge`, `reopen`, `close`, `show <dependency name> ignore conditions`, `ignore this major version`, `ignore this minor version`, `ignore this dependency`) are available by commenting on this PR.

Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Force-pushed from c71e795 to a73c903
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#19380. ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> Snapshot test passes but the existing value is in a legacy format. Updated insta snapshots to new format by running `cargo insta test --force-update-snapshots` ## What changes are included in this PR? Snapshots in various directories. <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> Yes ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> No <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> No
Force-pushed from 1fe1793 to 00ada31
## Which issue does this PR close? part of apache#15914 ## Rationale for this change Migrate spark functions from https://github.com/lakehq/sail/ to datafusion engine to unify codebase ## What changes are included in this PR? implement spark udaf try_sum https://spark.apache.org/docs/latest/api/sql/index.html#try_sum ## Are these changes tested? unit-tests and sqllogictests added ## Are there any user-facing changes? now can be called in queries
## Which issue does this PR close? - Part of apache#17555. ## Rationale for this change ### Analysis Other engines: 1. Clickhouse seems to only consider `"(U)Int*", "Float*", "Decimal*"` as arguments for log https://github.com/ClickHouse/ClickHouse/blob/master/src/Functions/log.cpp#L47-L63 Libraries: 1. There is a C++ library, libdecimal, which internally uses the [Intel Decimal Floating Point Library](https://www.intel.com/content/www/us/en/developer/articles/tool/intel-decimal-floating-point-math-library.html) for its [decimal32](https://github.com/GaryHughes/stddecimal/blob/main/libdecimal/decimal_cmath.cpp#L150-L159) operations. Intel's library itself converts the decimal32 to double and calls `log`. https://github.com/karlorz/IntelRDFPMathLib20U2/blob/main/LIBRARY/src/bid32_log.c 2. There is another C++ library based on IBM's decNumber decimal library, https://github.com/semihc/CppDecimal. This one's implementation of [`log`](https://github.com/semihc/CppDecimal/blob/main/src/decNumber.c#L1384-L1518) works entirely in decimal, but I don't think that would be a very performant way to do this. I'm going to go with an approach similar to the one inside Intel's decimal library. To begin with, the `decimal32 -> double` conversion is done by simple scaling. ## What changes are included in this PR? 1. Support Decimal32 for log ## Are these changes tested? Yes, unit tests have been added, and I've tested this from the datafusion cli for Decimal32 ``` > select log(2.0, arrow_cast(12345.67, 'Decimal32(9, 2)')); +-----------------------------------------------------------------------+ | log(Float64(2),arrow_cast(Float64(12345.67),Utf8("Decimal32(9, 2)"))) | +-----------------------------------------------------------------------+ | 13.591717513271785 | +-----------------------------------------------------------------------+ 1 row(s) fetched. Elapsed 0.021 seconds. ``` ## Are there any user-facing changes? 1. The precision of the result for Decimal32 will change; the precision loss in apache#18524 does not occur in this PR --------- Co-authored-by: Andrew Lamb <[email protected]>
## Which issue does this PR close? - Part of apache#19250 ## Rationale for this change Previously, the `log` function would fail when operating on decimal values with negative scales. Negative scales in decimals represent values where the scale indicates padding zeros to the right (e.g., `Decimal128(38, -2)` with value `100` represents `10000`). This PR restores support for negative-scale decimals in the `log` function by implementing the logarithmic property: `log_base(value * 10^(-scale)) = log_base(value) + (-scale) * log_base(10)`. ## What changes are included in this PR? 1. **Enhanced `log_decimal128` function**: - Added support for negative scales using the logarithmic property - For negative scales, computes `log_base(value) + (-scale) * log_base(10)` instead of trying to convert to unscaled value - Added detection for negative-scale decimals in both the number and base arguments - Skips simplification when negative scales are detected to avoid errors with `ScalarValue` (which doesn't support negative scales yet) 2. **Added comprehensive tests**: - Unit tests in `log.rs` for negative-scale decimals with various bases (2, 3, 10) - SQL logic tests in `decimal.slt` using scientific notation (e.g., `1e4`, `8e1`) to create decimals with negative scales ## Are these changes tested? Yes, this PR includes comprehensive tests: 1. Unit tests: - `test_log_decimal128_negative_scale`: Tests array inputs with negative scales - `test_log_decimal128_negative_scale_base2`: Tests with base 2 and negative scales - `test_log_decimal128_negative_scale_scalar`: Tests scalar inputs with negative scales 2. SQL logic tests: - Tests for unary log with negative scales (`log(1e4)`) - Tests for binary log with explicit base 10 (`log(10, 1e4)`) - Tests for binary log with base 2 (`log(2.0, 8e1)`, `log(2.0, 16e1)`) - Tests for different negative scale values (`log(5e3)`) - Tests for array operations with negative scales - Tests for different bases (2, 3, 10) with negative-scale decimals All tests pass successfully. ## Are there any user-facing changes? Yes, this is a user-facing change: **Before**: The `log` function would fail with an error when operating on decimal values with negative scales: ```sql -- This would fail SELECT log(1e4); -- Error: Negative scale is not supported ``` **After**: The `log` function now correctly handles decimal values with negative scales: ```sql -- This now works SELECT log(1e4); -- Returns 4.0 (log10(10000)) SELECT log(2.0, 8e1); -- Returns ~6.32 (log2(80)) ``` --------- Co-authored-by: Jeffrey Vo <[email protected]> Co-authored-by: Martin Grigorov <[email protected]>
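A small numeric sketch of the identity used above, assuming a decimal with negative scale `s` stores a raw value `v` that represents `v * 10^(-s)`; the function name is illustrative, not DataFusion's.

```rust
// Sketch of: log_b(v * 10^(-s)) = log_b(v) + (-s) * log_b(10), for negative s.
fn log_negative_scale(base: f64, raw_value: f64, scale: i8) -> f64 {
    assert!(scale < 0, "this sketch only covers negative scales");
    raw_value.log(base) + f64::from(-scale) * 10f64.log(base)
}

fn main() {
    // 1e4 stored as raw value 1 with scale -4: log10(10000) = 4
    assert!((log_negative_scale(10.0, 1.0, -4) - 4.0).abs() < 1e-9);
    // 8e1 stored as raw value 8 with scale -1: log2(80) ≈ 6.32
    assert!((log_negative_scale(2.0, 8.0, -1) - 80f64.log2()).abs() < 1e-9);
}
```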
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#7689. ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? - Added dedicated ceil/floor UDF implementations that keep existing float/int behavior but operate directly on Decimal128 arrays, including overflow checks and metadata preservation. - Updated the math module wiring plus sqllogictest coverage so decimal cases are executed and validated end to end. <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? - All existing tests pass - Added new tests for the changes <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> --------- Co-authored-by: Jeffrey Vo <[email protected]>
…#18754) ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> Part of apache#12725 ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> Started refactoring encoding functions (`encode`, `decode`) to remove user defined signature (per linked issue). However, discovered in the process it was bugged in handling of certain inputs. For example on main we get these errors: ```sql DataFusion CLI v51.0.0 > select encode(arrow_cast(column1, 'LargeUtf8'), 'hex') from values ('a'), ('b'); Internal error: Function 'encode' returned value of type 'Utf8' while the following type was promised at planning time and expected: 'LargeUtf8'. This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues > select encode(arrow_cast(column1, 'LargeBinary'), 'hex') from values ('a'), ('b'); Internal error: Function 'encode' returned value of type 'Utf8' while the following type was promised at planning time and expected: 'LargeUtf8'. This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues > select encode(arrow_cast(column1, 'BinaryView'), 'hex') from values ('a'), ('b'); Error during planning: Execution error: Function 'encode' user-defined coercion failed with "Error during planning: 1st argument should be Utf8 or Binary or Null, got BinaryView" No function matches the given name and argument types 'encode(BinaryView, Utf8)'. You might need to add explicit type casts. Candidate functions: encode(UserDefined) ``` - LargeUtf8/LargeBinary array inputs are broken - BinaryView input not supported (but Utf8View input is supported) So went about fixing this input handling as well as doing various refactors to try simplify the code. (I also discovered apache#18746 in the process of this refactor). ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> Refactor signatures away from user defined to the signature coercion API; importantly, we now accept only binary inputs, letting string inputs be coerced by type coercion. This simplifies the internal code of encode/decode to only need to consider binary inputs, instead of duplicating essentially exact code for string inputs (given for string inputs we just grabbed the underlying bytes anyway) Consolidating the inner functions used by encode/decode to try simplify/inline where possible. ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> Added new SLTs. ## Are there any user-facing changes? 
<!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> No. <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
Closes apache#16800 We could leave some of these methods around as deprecated and make them no-ops but I'd be afraid that would create a false sense of security (compiles but behaves wrong at runtime).
## Which issue does this PR close? Closes apache#19396. ## Rationale for this change Previously, when a `DefaultListFilesCache` was created with a TTL (e.g., `DefaultListFilesCache::new(1024, Some(Duration::from_secs(1)))`) and passed to `CacheManagerConfig` without explicitly setting `list_files_cache_ttl`, the cache's TTL would be unexpectedly unset (overwritten to `None`). This happened because `CacheManager::try_new()` always called `update_cache_ttl(config.list_files_cache_ttl)`, even when the config value was `None`. ## What changes are included in this PR? - Modified `CacheManager::try_new()` to only update the cache's TTL if `config.list_files_cache_ttl` is explicitly set (`Some(value)`). If the config TTL is `None`, the cache's existing TTL is preserved. - Added two test cases: - `test_ttl_preserved_when_not_set_in_config`: Verifies that TTL is preserved when not set in config - `test_ttl_overridden_when_set_in_config`: Verifies that TTL can still be overridden when explicitly set in config ## Are these changes tested? Yes ## Are there any user-facing changes? Yes
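A hypothetical sketch of the guard described above; the cache type here is a stand-in, not the real `DefaultListFilesCache` or `CacheManager` API.

```rust
use std::time::Duration;

// Hypothetical stand-in for the list-files cache; only the TTL matters here.
struct MockCache {
    ttl: Option<Duration>,
}

// Sketch of the fix described above: the config only overrides the cache's TTL
// when it is explicitly set, instead of unconditionally overwriting it.
fn apply_ttl_config(cache: &mut MockCache, config_ttl: Option<Duration>) {
    if let Some(ttl) = config_ttl {
        cache.ttl = Some(ttl);
    }
    // A `None` in the config leaves the cache's pre-configured TTL untouched.
}

fn main() {
    let mut cache = MockCache { ttl: Some(Duration::from_secs(1)) };
    apply_ttl_config(&mut cache, None);
    assert_eq!(cache.ttl, Some(Duration::from_secs(1))); // preserved
    apply_ttl_config(&mut cache, Some(Duration::from_secs(5)));
    assert_eq!(cache.ttl, Some(Duration::from_secs(5))); // overridden
}
```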
## Summary This PR adds protobuf serialization/deserialization support for `HashExpr`, enabling distributed query execution to serialize hash expressions used in hash joins and repartitioning. This is a followup to apache#18393 which introduced `HashExpr` but did not add serialization support. This causes errors when serialization is triggered on a query that pushes down dynamic filters from a `HashJoinExec`. As of apache#18393 `HashJoinExec` produces filters of the form: ```sql CASE (hash_repartition % 2) WHEN 0 THEN a >= ab AND a <= ab AND b >= bb AND b <= bb AND hash_lookup(a,b) WHEN 1 THEN a >= aa AND a <= aa AND b >= ba AND b <= ba AND hash_lookup(a,b) ELSE FALSE END ``` Where `hash_lookup` is an expression that holds a reference to a given partition's hash join hash table and will check for membership. Since we created these new expressions but didn't make any of them serializable, any attempt to do a distributed query or similar would run into errors. In apache#19300 we fixed `hash_lookup` by replacing it with `true` since it can't be serialized across the wire (we'd have to send the entire hash table). The logic was that this preserves the bounds checks, which are still valuable. This PR handles `hash_repartition` which determines which partition (and hence which branch of the `CASE` expression) the row belongs to. For this expression we *can* serialize it, so that's what I'm doing in this PR. ### Key Changes - **SeededRandomState wrapper**: Added a `SeededRandomState` struct that wraps `ahash::RandomState` while preserving the seeds used to create it. This is necessary because `RandomState` doesn't expose seeds after creation, but we need them for serialization. - **Updated seed constants**: Changed `HASH_JOIN_SEED` and `REPARTITION_RANDOM_STATE` constants to use `SeededRandomState` instead of raw `RandomState`. - **HashExpr enhancements**: - Changed `HashExpr` to use `SeededRandomState` - Added getter methods: `on_columns()`, `seeds()`, `description()` - Exported `HashExpr` and `SeededRandomState` from the joins module - **Protobuf support**: - Added `PhysicalHashExprNode` message to `datafusion.proto` with fields for `on_columns`, seeds (4 `u64` values), and `description` - Implemented serialization in `to_proto.rs` - Implemented deserialization in `from_proto.rs` ## Test plan - [x] Added roundtrip test in `roundtrip_physical_plan.rs` that creates a `HashExpr`, serializes it, deserializes it, and verifies the result - [x] All existing hash join tests pass (583 tests) - [x] All proto roundtrip tests pass 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.5 <[email protected]>
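A minimal sketch of the wrapper idea (illustrative, not the actual DataFusion type): keep the four seeds next to the `ahash::RandomState` so they can be written to the proto later, since `RandomState` does not expose them after construction.

```rust
use ahash::RandomState;

// Illustrative stand-in for the wrapper described above.
#[derive(Clone)]
struct SeededRandomState {
    seeds: [u64; 4],
    state: RandomState,
}

impl SeededRandomState {
    fn with_seeds(k0: u64, k1: u64, k2: u64, k3: u64) -> Self {
        Self {
            seeds: [k0, k1, k2, k3],
            state: RandomState::with_seeds(k0, k1, k2, k3),
        }
    }

    /// Seeds used to build the state, e.g. for proto serialization.
    fn seeds(&self) -> [u64; 4] {
        self.seeds
    }
}

fn main() {
    let state = SeededRandomState::with_seeds(1, 2, 3, 4);
    assert_eq!(state.seeds(), [1, 2, 3, 4]);
    // The wrapped RandomState is still usable for hashing.
    let _hash = state.state.hash_one(42u64);
}
```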
…he#19400) ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> close apache#19398 ## Rationale for this change see issue apache#19398 ## What changes are included in this PR? impl `try_swapping_with_projection` for `CooperativeExec` and `CoalesceBatchesExec` ## Are these changes tested? add test case ## Are there any user-facing changes?
…pache#19412) ## Which issue does this PR close? Closes apache#19410 ## Rationale for this change This PR fixes a regression introduced in apache#18831 where queries using GROUP BY with ORDER BY positional reference to an aliased aggregate fail with: ``` Error during planning: Column in ORDER BY must be in GROUP BY or an aggregate function ``` **Failing query (now fixed):** ```sql with t as (select 'foo' as x) select x, count(*) as "Count" from t group by x order by 2 desc; ``` ## What changes are included in this PR? **Root cause:** When building the list of valid columns for ORDER BY validation in `select.rs`, alias names were converted to `Column` using `.into()`, which calls `from_qualified_name()` and normalizes identifiers to lowercase. However, ORDER BY positional references resolve to columns using schema field names, which preserve case. This caused a mismatch (e.g., `Column("Count")` vs `Column("count")`). **Fix:** Use `Column::new_unqualified()` instead of `.into()` to preserve the exact case of alias names, matching how the schema stores field names. ## Are these changes tested? Yes, added a regression test to `order.slt`. ## Are there any user-facing changes? No, this is a bug fix that restores expected behavior. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.5 <[email protected]>
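A small sketch of the mismatch described above, assuming the lowercasing behavior of `from_qualified_name()` stated in the root-cause analysis; it is meant to illustrate the fix, not to document the `Column` API beyond that.

```rust
use datafusion::common::Column;

fn main() {
    // Per the root cause above: `.into()` goes through `from_qualified_name()`,
    // which normalizes the unquoted identifier to lowercase, while
    // `new_unqualified` keeps the alias exactly as written in the schema.
    let normalized: Column = "Count".into();
    let preserved = Column::new_unqualified("Count");

    assert_eq!(normalized.name, "count");
    assert_eq!(preserved.name, "Count");
}
```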
## Which issue does this PR close? - Closes apache#19269. ## Rationale for this change See issue apache#19269 for deeper rationale. DF did not have the notion that being partitioned on a subset of the required partitioning columns satisfies the condition. Having this logic will eliminate unnecessary repartitions and, in turn, other operators like partial aggregations. I introduced this behavior behind the `repartition_subset_satisfactions` flag (default false) as there are some cases where repartitioning may still be wanted even when we satisfy partitioning via this subset property. In particular, if partitioning via Hash(a) leads to data skew while Hash(a, b) gives better distribution, a user may want to turn this optimization off. I also made it the case that if we satisfy repartitioning via a subset but the current number of partitions < target_partitions, then we will still repartition to maintain and increase parallelism in the system when possible. ## What changes are included in this PR? - Modified `satisfy()` logic to check for subsets and return an enum describing the type of match: exact, subset, or none - In `EnforceDistribution`, where `satisfy()` is called, do not allow subset logic for partitioned join operators, as the partitioning on each side must match exactly; in that case we still need to repartition even when the subset condition holds - Created unit and sqllogictests ## Are these changes tested? - Unit test - sqllogictest - tpch correctness ### Benchmarks I did not see any drastic changes in benches, but the shuffle eliminations will be great improvements for distributed DF. ![Screenshot 2025-12-12 at 8 28 15 PM](https://github.com/user-attachments/assets/4b42945f-34e0-46c9-a4ce-e7ccdd0c0603) ![Screenshot 2025-12-12 at 8 30 15 PM](https://github.com/user-attachments/assets/846aef1b-8c5d-462d-83e7-7fa1e2a9372e) ## Are there any user-facing changes? Yes, users will now have the `repartition_subset_satisfactions` option as described in this PR --------- Co-authored-by: Andrew Lamb <[email protected]>
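A simplified sketch of the subset idea (the function and enum here are illustrative, not the real `satisfy()` signature): an existing Hash(a) partitioning satisfies a required Hash(a, b), because rows that agree on (a, b) necessarily agree on a and therefore land in the same partition.

```rust
// Kind of match returned by the (simplified) satisfaction check described above.
#[derive(Debug, PartialEq)]
enum Satisfaction {
    Exact,
    Subset,
    None,
}

// `existing` is the set of columns the data is already hash-partitioned on,
// `required` is the hash partitioning an operator asks for.
fn satisfy(existing: &[&str], required: &[&str]) -> Satisfaction {
    if existing == required {
        Satisfaction::Exact
    } else if !existing.is_empty() && existing.iter().all(|c| required.contains(c)) {
        Satisfaction::Subset
    } else {
        Satisfaction::None
    }
}

fn main() {
    assert_eq!(satisfy(&["a"], &["a", "b"]), Satisfaction::Subset);
    assert_eq!(satisfy(&["a", "b"], &["a", "b"]), Satisfaction::Exact);
    assert_eq!(satisfy(&["c"], &["a", "b"]), Satisfaction::None);
}
```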
…5% faster (apache#19413) ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Part of apache#18411 - Closes apache#19344 - Closes apache#19364 Note this is an alternate to apache#19364 ## Rationale for this change @camuel found a query where DuckDB's raw grouping is faster. I looked into it and much of the difference can be explained by better vectorization in the comparisons and short string optimizations ## What changes are included in this PR? Optimize (will comment inline) ## Are these changes tested? By CI. See also benchmark results below. I tested manually as well. Create data: ```shell nice tpchgen-cli --tables=lineitem --format=parquet --scale-factor 100 ``` Run query: ```shell hyperfine --warmup 3 " datafusion-cli -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" " ``` Before (main): 1.368s ```shell Benchmark 1: datafusion-cli -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;" Time (mean ± σ): 1.393 s ± 0.020 s [User: 16.778 s, System: 0.688 s] Range (min … max): 1.368 s … 1.438 s 10 runs ``` After (this PR): 1.022s ```shell Benchmark 1: ./datafusion-cli-multi-gby-try2 -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;" Time (mean ± σ): 1.022 s ± 0.015 s [User: 11.685 s, System: 0.644 s] Range (min … max): 1.005 s … 1.052 s 10 runs ``` I have a PR that improves string view hashing performance too, see - apache#19374 ## Are there any user-facing changes? Faster performance
…19374) ## Which issue does this PR close? - builds on apache#19373 - part of apache#18411 - Broken out of apache#19344 - Closes apache#19344 ## Rationale for this change While looking at performance as part of apache#18411, I noticed we could speed up string view hashing by optimizing for small strings ## What changes are included in this PR? Optimize StringView hashing, specifically by using the inlined view for short strings ## Are these changes tested? Functionally by existing coverage Performance by benchmarks (added in apache#19373) which show * 15%-20% faster for mixed short/long strings * 50%-70% faster for "short" arrays where we know there are no strings longer than 12 bytes ``` utf8_view (small): multiple, no nulls 1.00 47.9±1.71µs ? ?/sec 4.00 191.6±1.15µs ? ?/sec utf8_view (small): multiple, nulls 1.00 78.4±0.48µs ? ?/sec 3.08 241.6±1.11µs ? ?/sec utf8_view (small): single, no nulls 1.00 13.9±0.19µs ? ?/sec 4.29 59.7±0.30µs ? ?/sec utf8_view (small): single, nulls 1.00 23.8±0.20µs ? ?/sec 3.10 73.7±1.03µs ? ?/sec utf8_view: multiple, no nulls 1.00 235.4±2.14µs ? ?/sec 1.11 262.2±1.34µs ? ?/sec utf8_view: multiple, nulls 1.00 227.2±2.11µs ? ?/sec 1.34 303.9±2.23µs ? ?/sec utf8_view: single, no nulls 1.00 71.6±0.74µs ? ?/sec 1.05 75.2±1.27µs ? ?/sec utf8_view: single, nulls 1.00 71.5±1.92µs ? ?/sec 1.28 91.6±4.65µs ``` <details><summary>Details</summary> <p> ``` Gnuplot not found, using plotters backend utf8_view: single, no nulls time: [20.872 µs 20.906 µs 20.944 µs] change: [−15.863% −15.614% −15.331%] (p = 0.00 < 0.05) Performance has improved. Found 13 outliers among 100 measurements (13.00%) 8 (8.00%) high mild 5 (5.00%) high severe utf8_view: single, nulls time: [22.968 µs 23.050 µs 23.130 µs] change: [−17.796% −17.384% −16.918%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 3 (3.00%) high mild 4 (4.00%) high severe utf8_view: multiple, no nulls time: [66.005 µs 66.155 µs 66.325 µs] change: [−19.077% −18.785% −18.512%] (p = 0.00 < 0.05) Performance has improved. utf8_view: multiple, nulls time: [72.155 µs 72.375 µs 72.649 µs] change: [−17.944% −17.612% −17.266%] (p = 0.00 < 0.05) Performance has improved. Found 11 outliers among 100 measurements (11.00%) 6 (6.00%) high mild 5 (5.00%) high severe utf8_view (small): single, no nulls time: [6.1401 µs 6.1563 µs 6.1747 µs] change: [−69.623% −69.484% −69.333%] (p = 0.00 < 0.05) Performance has improved. Found 6 outliers among 100 measurements (6.00%) 3 (3.00%) high mild 3 (3.00%) high severe utf8_view (small): single, nulls time: [10.234 µs 10.250 µs 10.270 µs] change: [−53.969% −53.815% −53.666%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 5 (5.00%) high severe utf8_view (small): multiple, no nulls time: [20.853 µs 20.905 µs 20.961 µs] change: [−66.006% −65.883% −65.759%] (p = 0.00 < 0.05) Performance has improved. Found 9 outliers among 100 measurements (9.00%) 7 (7.00%) high mild 2 (2.00%) high severe utf8_view (small): multiple, nulls time: [32.519 µs 32.600 µs 32.675 µs] change: [−53.937% −53.581% −53.232%] (p = 0.00 < 0.05) Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild ``` </p> </details> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
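A conceptual sketch of the short-string fast path (illustrative names, not the DataFusion code), based on the Arrow string-view layout where the low 4 bytes of each 16-byte view hold the length and strings of at most 12 bytes are stored inline:

```rust
// For inlined views (len <= 12) the string bytes live in the view itself, so
// they can be hashed without touching the variadic data buffers at all.
fn hash_view(view: u128, data_buffers: &[&[u8]], hash_one: impl Fn(&[u8]) -> u64) -> u64 {
    let len = (view as u32) as usize;
    if len <= 12 {
        // Inline fast path: bytes 4..4+len of the little-endian view hold the string.
        let bytes = view.to_le_bytes();
        hash_one(&bytes[4..4 + len])
    } else {
        // Long strings: resolve buffer index and offset, then hash from the buffer.
        let buffer_index = ((view >> 64) as u32) as usize;
        let offset = ((view >> 96) as u32) as usize;
        hash_one(&data_buffers[buffer_index][offset..offset + len])
    }
}

fn main() {
    // Build an inline view for "hi" (length 2, bytes stored at offsets 4 and 5).
    let view: u128 = 2 | ((b'h' as u128) << 32) | ((b'i' as u128) << 40);
    let h = hash_view(view, &[], |bytes| bytes.iter().map(|b| *b as u64).sum());
    assert_eq!(h, (b'h' as u64) + (b'i' as u64));
}
```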
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#13433 ## Rationale for this change Keep up to date with hashbrown and possible (performance) improvements, and remove the `rawtable` functionality. ## What changes are included in this PR? Upgrading, and fixing a case of nondeterminism in `HashSet` usage. ## Are these changes tested? Yes, existing tests. ## Are there any user-facing changes? No.
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes #. ## Rationale for this change 1. Add crypto function benchmark code 2. Simplify the inner digest function code. I checked for any performance changes using the new benchmark (1), but there was no change. <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? Yes. <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? No <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> Co-authored-by: Jeffrey Vo <[email protected]>
…e#19559) Bumps [taiki-e/install-action](https://github.com/taiki-e/install-action) from 2.65.6 to 2.65.8.

Release highlights (from the project's [release notes](https://github.com/taiki-e/install-action/releases) and [changelog](https://github.com/taiki-e/install-action/blob/main/CHANGELOG.md)):
- 2.65.8 (2025-12-30): update `tombi@latest` to 0.7.14, `uv@latest` to 0.9.20, and `typos@latest` to 1.40.1.
- 2.65.7 (2025-12-29): update `cargo-no-dev-deps@latest` to 0.2.19, `cargo-minimal-versions@latest` to 0.1.34, `cargo-insta@latest` to 1.45.1, `cargo-hack@latest` to 0.6.40, and `dprint@latest` to 0.51.1.

Individual commits are viewable in the [compare view](https://github.com/taiki-e/install-action/compare/28a9d316db64b78a951f3f8587a5d08cc97ad8eb...ff581034fb69296c525e51afd68cb9823bfbe4ed).

[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. The standard Dependabot commands (`@dependabot rebase`, `recreate`, `merge`, `squash and merge`, `cancel merge`, `reopen`, `close`, `show <dependency name> ignore conditions`, `ignore this major version`, `ignore this minor version`, `ignore this dependency`) are available by commenting on this PR.

Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
## Which issue does this PR close? - Closes apache#19292. ## Rationale for this change Metadata on a schema for a record batch is lost during FFI conversion. This is not always obvious because our other integrations like `ExecutionPlan` and `TableProvider` have their own ways to provide the schema. This is currently an issue because we have FFI table providers who say they have a specific schema but the schema of the actual record batches does not match. The metadata is dropped. ## What changes are included in this PR? We already have the schema in the FFI object, so use it both when converting to and from FFI. ## Are these changes tested? Unit test added. ## Are there any user-facing changes? No
…implementation (apache#19142) Add infrastructure for row-level DML operations (DELETE/UPDATE) to the TableProvider trait, enabling storage engines to implement SQL-based mutations. Changes: - Add TableProvider::delete_from() method for DELETE operations - Add TableProvider::update() method for UPDATE operations - Wire physical planner to route DML operations to TableProvider - Implement DELETE/UPDATE for MemTable as reference implementation - Add comprehensive sqllogictest coverage This provides the API surface for downstream projects (iceberg-rust, delta-rs) to implement DML without custom query planners. ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#16959 - Related to apache#12406 ## Rationale for this change Datafusion parses DELETE/UPDATE but returns NotImplemented("Unsupported logical plan: Dml(Delete)") at physical planning. Downstream projects (iceberg-rust, delta-rs) must implement custom planners to work around this. <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? Adds TableProvider hooks for row-level DML: - delete_from(state, filters) - deletes rows matching filter predicates - update(state, assignments, filters) - updates matching rows with new values Physical planner routes WriteOp::Delete and WriteOp::Update to these methods. Tables that don't support DML return NotImplemented (the default behavior). MemTable reference implementation demonstrates: - Filter evaluation with SQL three-valued logic (NULL predicates don't match) - Multi-column updates with expression evaluation - Proper row count reporting <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? Yes: - Unit tests for physical planning: cargo test -p datafusion --test custom_sources_cases - SQL logic tests: dml_delete.slt and dml_update.slt with comprehensive coverage <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? New trait methods on TableProvider: ``` async fn delete_from(&self, state: &dyn Session, filters: Vec<Expr>) -> Result<Arc<dyn ExecutionPlan>>; async fn update(&self, state: &dyn Session, assignments: Vec<(String, Expr)>, filters: Vec<Expr>) -> Result<Arc<dyn ExecutionPlan>>; ``` Fully backward compatible. Default implementations return NotImplemented. <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> --------- Co-authored-by: Andrew Lamb <[email protected]>
…pache#19523) ## Which issue does this PR close? None ## Rationale for this change When I run the benchmark for topk_aggregate, I find that most of the time is spent generating the data, so the result does not really reflect the performance of topk_aggregate. ``` cargo bench --bench topk_aggregate ``` ### main branch  ### this pr  ## What changes are included in this PR? Extract the data generation out of the topk_aggregate benchmark. ## Are these changes tested? ## Are there any user-facing changes?
Optimized lpad and rpad functions to eliminate per-row allocations by reusing buffers for graphemes and fill characters. The previous implementation allocated new Vec<&str> for graphemes and Vec<char> for fill characters on every row, which was inefficient. This optimization introduces reusable buffers that are allocated once and cleared/refilled for each row. Changes: - lpad: Added graphemes_buf and fill_chars_buf outside loops, clear and refill per row instead of allocating new Vec each time - rpad: Added graphemes_buf outside loops to reuse across iterations - Both functions now allocate buffers once and reuse them for all rows - Buffers are cleared and reused for each row via .clear() and .extend() Optimization impact: - For lpad with fill parameter: Eliminates 2 Vec allocations per row (graphemes + fill_chars) - For lpad without fill: Eliminates 1 Vec allocation per row (graphemes) - For rpad: Eliminates 1 Vec allocation per row (graphemes) This optimization is particularly effective for: - Large arrays with many rows - Strings with multiple graphemes (unicode characters) - Workloads with custom fill patterns Benchmark results comparing main vs optimized branch: lpad benchmarks: - size=1024, str_len=5, target=20: 116.53 µs -> 63.226 µs (45.7% faster) - size=1024, str_len=20, target=50: 314.07 µs -> 190.30 µs (39.4% faster) - size=4096, str_len=5, target=20: 467.35 µs -> 261.29 µs (44.1% faster) - size=4096, str_len=20, target=50: 1.2286 ms -> 754.24 µs (38.6% faster) rpad benchmarks: - size=1024, str_len=5, target=20: 113.89 µs -> 72.645 µs (36.2% faster) - size=1024, str_len=20, target=50: 313.68 µs -> 202.98 µs (35.3% faster) - size=4096, str_len=5, target=20: 456.08 µs -> 295.57 µs (35.2% faster) - size=4096, str_len=20, target=50: 1.2523 ms -> 818.47 µs (34.6% faster) Overall improvements: 35-46% faster across all workloads ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes #. ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
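A stripped-down sketch of the buffer-reuse pattern described above (illustrative, not the actual `lpad` kernel; it assumes the `unicode-segmentation` crate): the grapheme buffer is allocated once and cleared/refilled per row instead of collecting a fresh `Vec` on every row.

```rust
use unicode_segmentation::UnicodeSegmentation;

// For each row, compute how many fill characters a left-pad to `target`
// graphemes would need, reusing a single grapheme buffer across rows.
fn pad_counts(rows: &[&str], target: usize) -> Vec<usize> {
    let mut graphemes_buf: Vec<&str> = Vec::new();
    rows.iter()
        .map(|value| {
            graphemes_buf.clear();
            graphemes_buf.extend(value.graphemes(true));
            target.saturating_sub(graphemes_buf.len())
        })
        .collect()
}

fn main() {
    assert_eq!(pad_counts(&["hi", "héllo"], 5), vec![3, 0]);
}
```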
…ition is provided (apache#19553)

## Which issue does this PR close?

- Closes #.

## Rationale for this change

## What changes are included in this PR?

Replace `.chars().skip().collect::<String>()` with zero-copy string slicing: use `char_indices()` to find the byte offset, then slice with `&value[byte_offset..]`. This eliminates an unnecessary String allocation per row when a start position is specified.

Changes:
- Use char_indices().nth() to find the byte offset for the start position (1-based)
- Use the string slice &value[byte_offset..] instead of collecting chars
- Added a benchmark to measure the performance improvement

Optimization:
- Before: allocated a new String via .collect() for each row with a start position
- After: uses a zero-copy string slice

Benchmark results:
- size=1024, str_len=32: 96.361 µs -> 41.458 µs (57.0% faster, 2.3x speedup)
- size=1024, str_len=128: 210.16 µs -> 56.064 µs (73.3% faster, 3.7x speedup)
- size=4096, str_len=32: 376.90 µs -> 162.98 µs (56.8% faster, 2.3x speedup)
- size=4096, str_len=128: 855.68 µs -> 263.61 µs (69.2% faster, 3.2x speedup)

The optimization shows greater improvements for longer strings (up to 73% faster), since taking the slice itself is O(1) regardless of length, while the previous approach had allocation costs that grew with string length.

## Are these changes tested?

## Are there any user-facing changes?
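A minimal sketch of the zero-copy slicing described above (an illustration of the technique, not the exact DataFusion code):

```rust
// Instead of value.chars().skip(n).collect::<String>(), find the byte
// offset of the 1-based start position with char_indices() and return a
// borrowed slice of the original string.
fn substr_from(value: &str, start: i64) -> &str {
    // positions of 1 or less mean "start at the beginning"
    let chars_to_skip = start.saturating_sub(1).max(0) as usize;
    match value.char_indices().nth(chars_to_skip) {
        Some((byte_offset, _)) => &value[byte_offset..],
        // the start position is past the end of the string
        None => "",
    }
}

fn main() {
    assert_eq!(substr_from("héllo world", 2), "éllo world");
    assert_eq!(substr_from("héllo", 10), "");
}
```

Since the result borrows from the input, the output can be built directly from the slices without an intermediate String per row.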
## Which issue does this PR close?

- Closes apache#18827
- Closes apache#9654

## Rationale for this change

Now that the `DefaultListFilesCache` can be configured by users, it's safe to enable it by default and fix the tests that caching broke!

## What changes are included in this PR?

- Sets the DefaultListFilesCache to be enabled by default
- Adds additional object store access tests to show list caching behavior
- Adds variable setting/reading sqllogic test cases
- Updates tests to disable caching when they relied on COPY commands, so changes can be detected for each query
- Updates docs to help users upgrade

## Are these changes tested?

Yes, additional test cases have been added to help show the behavior of the caching.

## Are there any user-facing changes?

Yes, this changes the default behavior of DataFusion; however, this information is already captured in the upgrade guide.

## cc @alamb

---------

Co-authored-by: Andrew Lamb <[email protected]>
)

## Which issue does this PR close?

Closes apache#17527

## Rationale for this change

Currently, DataFusion computes bounds for all queries that contain a HashJoinExec node whenever the option enable_dynamic_filter_pushdown is set to true (the default). It might make sense to compute these bounds only when we explicitly know there is a consumer that will use them.

## What changes are included in this PR?

As suggested in apache#17527 (comment), this PR adds an is_used() method to DynamicFilterPhysicalExpr that checks whether any consumers are holding a reference to the filter, using Arc::strong_count(). During filter pushdown, consumers that accept the filter and use it later in execution retain a reference to the Arc; for example, scan nodes like ParquetSource.

## Are these changes tested?

I added a unit test in dynamic_filters.rs (test_is_used) that verifies the Arc reference-counting behavior. Existing integration tests in datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs validate the end-to-end behavior. These tests verify that dynamic filters are computed and filled when consumers are present.

## Are there any user-facing changes?

A new is_used() function.
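A minimal sketch of the reference-counting idea described above; the names here are illustrative stand-ins, not DataFusion's actual types:

```rust
use std::sync::Arc;

// The filter is "used" only if some consumer (e.g. a scan node) holds
// another strong reference to its shared state.
struct DynamicFilter {
    // stands in for the shared filter state that consumers hold on to
    inner: Arc<()>,
}

impl DynamicFilter {
    // a consumer that accepts the filter during pushdown keeps this Arc alive
    fn handle(&self) -> Arc<()> {
        Arc::clone(&self.inner)
    }

    // more than one strong reference means a consumer retained the filter
    fn is_used(&self) -> bool {
        Arc::strong_count(&self.inner) > 1
    }
}

fn main() {
    let filter = DynamicFilter { inner: Arc::new(()) };
    assert!(!filter.is_used());

    let consumer = filter.handle();
    assert!(filter.is_used());

    drop(consumer);
    assert!(!filter.is_used());
}
```

Dropping the last consumer handle brings the strong count back to 1, so the filter naturally reports itself as unused again.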
## Which issue does this PR close?

N/A

## Rationale for this change

Clean up some signatures and unnecessary code in string functions.

## What changes are included in this PR?

Various refactors, see comments.

## Are these changes tested?

Existing tests.

## Are there any user-facing changes?

No.
Labels
catalog
common
core
datasource
development-process
documentation
execution
ffi
functions
logical-expr
optimizer
physical-expr
physical-plan
proto
spark
sql
sqllogictest
substrait
NOTE:
For simplicity, this PR implements everything within datafusion. In a more complete design, some pieces, like casting support, would likely belong in arrow's cast kernel. If this PR is directionally sound and aligned with other folks, I'm happy to follow up with a properly layered implementation.
Which issue does this PR close?
Union data type coercion apache/datafusion#18825
Rationale for this change
This PR adds coercion logic for union data types. This way, you are able to compare union data types with scalar types.
Specifically, making queries like the following work. (Note: attributes->'id' is a DataType::Union.)
When comparing a union type with a scalar type, the logic will check if the scalar type matches any union variant.
The scalar type matching is greedy: if a union type has multiple variants with the same data type, it will match the one with the smaller type id (since union variants are sorted by type id).
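As a rough illustration of that greedy matching, here is a sketch using arrow-schema's UnionFields; this is an illustration of the described behavior, not the PR's actual coercion code:

```rust
use arrow_schema::{DataType, Field, UnionFields};

// Scan the union's variants in ascending type-id order and return the
// first one whose data type equals the scalar's type, so ties resolve to
// the smaller type id.
fn matching_union_variant(fields: &UnionFields, scalar_type: &DataType) -> Option<i8> {
    let mut variants: Vec<(i8, _)> = fields.iter().collect();
    variants.sort_by_key(|(type_id, _)| *type_id);
    variants
        .into_iter()
        .find(|(_, field)| field.data_type() == scalar_type)
        .map(|(type_id, _)| type_id)
}

fn main() {
    // a union with two Int64 variants and one Utf8 variant
    let fields = UnionFields::new(
        vec![0_i8, 1, 2],
        vec![
            Field::new("int", DataType::Int64, true),
            Field::new("str", DataType::Utf8, true),
            Field::new("int_again", DataType::Int64, true),
        ],
    );
    // an Int64 scalar matches the variant with the smaller type id (0, not 2)
    assert_eq!(matching_union_variant(&fields, &DataType::Int64), Some(0));
    assert_eq!(matching_union_variant(&fields, &DataType::Utf8), Some(1));
    assert_eq!(matching_union_variant(&fields, &DataType::Float64), None);
}
```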