forked from apache/datafusion
Coerce union #49
Open: friendlymatthew wants to merge 107 commits into main from friendlymatthew/union-coercion (base: main)
Conversation
…e#18954) ## Which issue does this PR close? Closes apache#18947 ## Rationale for this change Currently, DataFusion uses default compression levels when writing compressed JSON and CSV files. For ZSTD, this means level 3, which prioritizes speed over compression ratio. Users working with large datasets who want to optimize for storage costs or network transfer have no way to increase the compression level. This is particularly important for cloud data lake scenarios where storage and egress costs can be significant. ## What changes are included in this PR? - Add `compression_level: Option<u32>` field to `JsonOptions` and `CsvOptions` in `config.rs` - Add `convert_async_writer_with_level()` method to `FileCompressionType` (non-breaking API extension) - Keep original `convert_async_writer()` as a convenience wrapper for backward compatibility - Update `JsonWriterOptions` and `CsvWriterOptions` with `compression_level` field - Update `ObjectWriterBuilder` to support compression level - Update JSON and CSV sinks to pass compression level through the write pipeline - Update proto definitions and conversions for serialization support - Fix unrelated unused import warning in `udf.rs` (conditional compilation for debug-only imports) ## Are these changes tested? The changes follow the existing patterns used throughout the codebase. The implementation was verified by: - Building successfully with `cargo build` - Running existing tests with `cargo test --package datafusion-proto` - All 131 proto integration tests pass ## Are there any user-facing changes? Yes, users can now specify compression level when writing JSON/CSV files: ```rust use datafusion::common::config::JsonOptions; use datafusion::common::parsers::CompressionTypeVariant; let json_opts = JsonOptions { compression: CompressionTypeVariant::ZSTD, compression_level: Some(9), // Higher compression ..Default::default() }; ``` **Supported compression levels:** - ZSTD: 1-22 (default: 3) - GZIP: 0-9 (default: 6) - BZIP2: 1-9 (default: 9) - XZ: 0-9 (default: 6) **This is a non-breaking change** - the original `convert_async_writer()` method signature is preserved for backward compatibility. Co-authored-by: Andrew Lamb <[email protected]>
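For symmetry with the JSON snippet above, a CSV configuration would presumably look like the following. This is only a sketch based on the description: the `compression_level` field on `CsvOptions` is the addition this PR describes, not a long-standing API.

```rust
use datafusion::common::config::CsvOptions;
use datafusion::common::parsers::CompressionTypeVariant;

fn csv_opts_with_level() -> CsvOptions {
    CsvOptions {
        compression: CompressionTypeVariant::GZIP,
        // `compression_level` is the new field added by this PR; treat it as an
        // assumption about the final API rather than existing DataFusion code.
        compression_level: Some(9), // GZIP accepts 0-9, default 6 (see table above)
        ..Default::default()
    }
}
```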
## Which issue does this PR close? - Related to apache#19241 ## Rationale for this change This PR enhances the `in_list` benchmark suite to provide more comprehensive performance measurements across a wider range of data types and list sizes. These improvements are necessary groundwork for evaluating optimizations proposed in apache#19241. The current benchmarks were limited in scope, making it difficult to assess the performance impact of potential `in_list` optimizations across different data types and scenarios. ## What changes are included in this PR? - Added benchmarks for `UInt8Array`, `Int16Array`, and `TimestampNanosecondArray` - Added `28` to `IN_LIST_LENGTHS` (now `[3, 8, 28, 100]`) to better cover the range between small and large lists - Increased `ARRAY_LENGTH` from `1024` to `8192` to align with the default DataFusion batch size - Configured criterion with shorter warm-up (100ms) and measurement times (500ms) for faster iteration ## Are these changes tested? Yes, this PR adds benchmark coverage. The benchmarks can be run with: ```bash cargo bench --bench in_list ``` The benchmarks verify that the `in_list` expression evaluates correctly for all the new data types. ## Are there any user-facing changes? No user-facing changes. This PR only affects the benchmark suite used for performance testing and development.
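As a rough sketch of the criterion settings described above (the 100 ms warm-up and 500 ms measurement window), a configuration along these lines would apply them; the benchmark body here is a placeholder, not the actual `in_list` benchmark.

```rust
use std::time::Duration;

use criterion::{criterion_group, criterion_main, Criterion};

// Placeholder benchmark body; the real suite benches `in_list` expressions.
fn bench_in_list(c: &mut Criterion) {
    c.bench_function("in_list_placeholder", |b| b.iter(|| 1 + 1));
}

// Shorter warm-up and measurement times for faster iteration, as described above.
fn fast_config() -> Criterion {
    Criterion::default()
        .warm_up_time(Duration::from_millis(100))
        .measurement_time(Duration::from_millis(500))
}

criterion_group! {
    name = benches;
    config = fast_config();
    targets = bench_in_list
}
criterion_main!(benches);
```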
## Which issue does this PR close? - Part of apache#18411 ## Rationale for this change I want to optimize hashing for StringViewArray. In order to do so, I would like a benchmark to show it works. ## What changes are included in this PR? Add a benchmark for `with_hashes`. Run it like ```shell cargo bench --bench with_hashes ``` Note I did not add all the possible types of arrays as I don't plan to optimize the others. ## Are these changes tested? I ran it manually ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#19156 - Part of apache#19144 ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? - Spark `next_day` uses `return_field_from_args` <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? - Added new unit tests, previous all tests pass <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
…e#19313) ## Summary This PR moves the CSV-specific `newlines_in_values` configuration option from `FileScanConfig` (a shared format-agnostic configuration) to `CsvSource` where it belongs. - Add `newlines_in_values` field and methods to `CsvSource` - Add `has_newlines_in_values()` method to `FileSource` trait (returns `false` by default) - Update `FileSource::repartitioned()` to use the new trait method - Remove `new_lines_in_values` from `FileScanConfig` and its builder - Update proto serialization to read from/write to `CsvSource` - Update tests and documentation - Add migration guide to `upgrading.md` Closes apache#18453 ## Test plan - [x] All existing tests pass - [x] Doc tests pass - [x] Proto roundtrip tests pass - [x] Clippy clean 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.5 <[email protected]>
…filing (apache#19127) ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#18138 ## Rationale for this change The `list` operation returns a stream, so it previously recorded `duration: None`, missing performance insights. Time-to-first-item is a useful metric for list operations, indicating how quickly results start. This adds duration tracking by measuring time until the first item is yielded (or the stream ends). ## What changes are included in this PR? 1. Added `TimeToFirstItemStream`: A stream wrapper that measures elapsed time from creation until the first item is yielded (or the stream ends if empty). 2. Updated `instrumented_list`: Wraps the inner stream with `TimeToFirstItemStream` to record duration. 3. Changed `requests` field: Switched from `Mutex<Vec<RequestDetails>>` to `Arc<Mutex<Vec<RequestDetails>>>` to allow sharing across async boundaries (needed for the stream wrapper). 4. Updated tests: Modified `instrumented_store_list` to consume at least one stream item and verify that `duration` is now `Some(Duration)` instead of `None`. ## Are these changes tested? Yes. The existing test `instrumented_store_list` was updated to: - Consume at least one item from the stream using `stream.next().await` - Assert that `request.duration.is_some()` (previously `is_none()`) All tests pass, including the updated list test and other instrumented operation tests. ## Are there any user-facing changes? Users with profiling enabled will see duration values for `list` operations instead of nothing.
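A minimal sketch of the wrapper idea described above, with hypothetical names rather than the actual DataFusion types: poll the inner stream and record the elapsed time the first time it yields an item (or ends).

```rust
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::{Duration, Instant};

use futures::Stream;

// Hypothetical sketch, not the actual DataFusion type: wrap a stream and
// record the elapsed time until it first yields an item (or ends if empty).
struct TimeToFirstItem<S> {
    inner: S,
    start: Instant,
    time_to_first_item: Option<Duration>,
}

impl<S> TimeToFirstItem<S> {
    fn new(inner: S) -> Self {
        Self { inner, start: Instant::now(), time_to_first_item: None }
    }
}

impl<S: Stream + Unpin> Stream for TimeToFirstItem<S> {
    type Item = S::Item;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        let this = self.get_mut();
        let poll = Pin::new(&mut this.inner).poll_next(cx);
        if this.time_to_first_item.is_none() {
            if let Poll::Ready(_) = &poll {
                // First item (or end of an empty stream): record the duration once.
                this.time_to_first_item = Some(this.start.elapsed());
            }
        }
        poll
    }
}
```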
…e#19382) Bumps [taiki-e/install-action](https://github.com/taiki-e/install-action) from 2.63.3 to 2.64.0.

Release highlights for 2.64.0 (from the project's [release notes](https://github.com/taiki-e/install-action/releases) and [changelog](https://github.com/taiki-e/install-action/blob/main/CHANGELOG.md)):
- The `tool` input option now supports whitespace (space, tab, and line) or comma separated lists; previously only comma-separated lists were supported ([#1366](https://redirect.github.com/taiki-e/install-action/pull/1366)).
- Support `prek` ([#1357](https://redirect.github.com/taiki-e/install-action/pull/1357), thanks @j178), `mdbook-mermaid` ([#1359](https://redirect.github.com/taiki-e/install-action/pull/1359), thanks @CommanderStorm), and `martin` ([#1364](https://redirect.github.com/taiki-e/install-action/pull/1364), thanks @CommanderStorm).
- Update `trivy@latest` to 0.68.2, `xh@latest` to 0.25.3, `mise@latest` to 2025.12.10, `uv@latest` to 0.9.18, and `cargo-shear@latest` to 1.9.1.

Individual commits are viewable in the [compare view](https://github.com/taiki-e/install-action/compare/d850aa816998e5cf15f67a78c7b933f2a5033f8a...69e777b377e4ec209ddad9426ae3e0c1008b0ef3).

[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. The standard Dependabot commands (`@dependabot rebase`, `recreate`, `merge`, `squash and merge`, `cancel merge`, `reopen`, `close`, `show <dependency name> ignore conditions`, `ignore this major version`, `ignore this minor version`, `ignore this dependency`) are available by commenting on this PR.

Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Force-pushed from c71e795 to a73c903
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#19380. ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> Snapshot test passes but the existing value is in a legacy format. Updated insta snapshots to new format by running `cargo insta test --force-update-snapshots` ## What changes are included in this PR? Snapshots in various directories. <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> Yes ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> No <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> No
Force-pushed from 1fe1793 to 00ada31
## Which issue does this PR close? part of apache#15914 ## Rationale for this change Migrate spark functions from https://github.com/lakehq/sail/ to datafusion engine to unify codebase ## What changes are included in this PR? implement spark udaf try_sum https://spark.apache.org/docs/latest/api/sql/index.html#try_sum ## Are these changes tested? unit-tests and sqllogictests added ## Are there any user-facing changes? now can be called in queries
## Which issue does this PR close? - Part of apache#17555. ## Rationale for this change ### Analysis Other engines: 1. Clickhouse seems to only consider `"(U)Int*", "Float*", "Decimal*"` as arguments for log https://github.com/ClickHouse/ClickHouse/blob/master/src/Functions/log.cpp#L47-L63 Libraries: 1. There is a C++ library, libdecimal, which internally uses the [Intel Decimal Floating Point Library](https://www.intel.com/content/www/us/en/developer/articles/tool/intel-decimal-floating-point-math-library.html) for its [decimal32](https://github.com/GaryHughes/stddecimal/blob/main/libdecimal/decimal_cmath.cpp#L150-L159) operations. Intel's library itself converts the decimal32 to double and calls `log`. https://github.com/karlorz/IntelRDFPMathLib20U2/blob/main/LIBRARY/src/bid32_log.c 2. There is another C++ library based on IBM's decNumber decimal library, https://github.com/semihc/CppDecimal. This one's implementation of [`log`](https://github.com/semihc/CppDecimal/blob/main/src/decNumber.c#L1384-L1518) works entirely in decimal, but I don't think that would be a very performant way to do this. I'm going to go with an approach similar to the one inside Intel's decimal library. To begin with, the `decimal32 -> double` conversion is done by simple scaling. ## What changes are included in this PR? 1. Support Decimal32 for log ## Are these changes tested? Yes, unit tests have been added, and I've tested this from the datafusion cli for Decimal32 ``` > select log(2.0, arrow_cast(12345.67, 'Decimal32(9, 2)')); +-----------------------------------------------------------------------+ | log(Float64(2),arrow_cast(Float64(12345.67),Utf8("Decimal32(9, 2)"))) | +-----------------------------------------------------------------------+ | 13.591717513271785 | +-----------------------------------------------------------------------+ 1 row(s) fetched. Elapsed 0.021 seconds. ``` ## Are there any user-facing changes? 1. The precision of the result for Decimal32 will change; the precision loss in apache#18524 does not occur in this PR --------- Co-authored-by: Andrew Lamb <[email protected]>
## Which issue does this PR close? - Part of apache#19250 ## Rationale for this change Previously, the `log` function would fail when operating on decimal values with negative scales. Negative scales in decimals represent values where the scale indicates padding zeros to the right (e.g., `Decimal128(38, -2)` with value `100` represents `10000`). This PR restores support for negative-scale decimals in the `log` function by implementing the logarithmic property: `log_base(value * 10^(-scale)) = log_base(value) + (-scale) * log_base(10)`. ## What changes are included in this PR? 1. **Enhanced `log_decimal128` function**: - Added support for negative scales using the logarithmic property - For negative scales, computes `log_base(value) + (-scale) * log_base(10)` instead of trying to convert to unscaled value - Added detection for negative-scale decimals in both the number and base arguments - Skips simplification when negative scales are detected to avoid errors with `ScalarValue` (which doesn't support negative scales yet) 2. **Added comprehensive tests**: - Unit tests in `log.rs` for negative-scale decimals with various bases (2, 3, 10) - SQL logic tests in `decimal.slt` using scientific notation (e.g., `1e4`, `8e1`) to create decimals with negative scales ## Are these changes tested? Yes, this PR includes comprehensive tests: 1. Unit tests: - `test_log_decimal128_negative_scale`: Tests array inputs with negative scales - `test_log_decimal128_negative_scale_base2`: Tests with base 2 and negative scales - `test_log_decimal128_negative_scale_scalar`: Tests scalar inputs with negative scales 2. SQL logic tests: - Tests for unary log with negative scales (`log(1e4)`) - Tests for binary log with explicit base 10 (`log(10, 1e4)`) - Tests for binary log with base 2 (`log(2.0, 8e1)`, `log(2.0, 16e1)`) - Tests for different negative scale values (`log(5e3)`) - Tests for array operations with negative scales - Tests for different bases (2, 3, 10) with negative-scale decimals All tests pass successfully. ## Are there any user-facing changes? Yes, this is a user-facing change: **Before**: The `log` function would fail with an error when operating on decimal values with negative scales: ```sql -- This would fail SELECT log(1e4); -- Error: Negative scale is not supported ``` **After**: The `log` function now correctly handles decimal values with negative scales: ```sql -- This now works SELECT log(1e4); -- Returns 4.0 (log10(10000)) SELECT log(2.0, 8e1); -- Returns ~6.32 (log2(80)) ``` --------- Co-authored-by: Jeffrey Vo <[email protected]> Co-authored-by: Martin Grigorov <[email protected]>
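A small numeric sketch of the identity used above, assuming a decimal with negative scale `s` stores a raw value `v` that represents `v * 10^(-s)`; the function name is illustrative, not DataFusion's.

```rust
// Sketch of: log_b(v * 10^(-s)) = log_b(v) + (-s) * log_b(10), for negative s.
fn log_negative_scale(base: f64, raw_value: f64, scale: i8) -> f64 {
    assert!(scale < 0, "this sketch only covers negative scales");
    raw_value.log(base) + f64::from(-scale) * 10f64.log(base)
}

fn main() {
    // 1e4 stored as raw value 1 with scale -4: log10(10000) = 4
    assert!((log_negative_scale(10.0, 1.0, -4) - 4.0).abs() < 1e-9);
    // 8e1 stored as raw value 8 with scale -1: log2(80) ≈ 6.32
    assert!((log_negative_scale(2.0, 8.0, -1) - 80f64.log2()).abs() < 1e-9);
}
```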
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#7689. ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? - Added dedicated ceil/floor UDF implementations that keep existing float/int behavior but operate directly on Decimal128 arrays, including overflow checks and metadata preservation. - Updated the math module wiring plus sqllogictest coverage so decimal cases are executed and validated end to end. <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? - All existing tests pass - Added new tests for the changes <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> --------- Co-authored-by: Jeffrey Vo <[email protected]>
…#18754) ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> Part of apache#12725 ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> Started refactoring encoding functions (`encode`, `decode`) to remove user defined signature (per linked issue). However, discovered in the process it was bugged in handling of certain inputs. For example on main we get these errors: ```sql DataFusion CLI v51.0.0 > select encode(arrow_cast(column1, 'LargeUtf8'), 'hex') from values ('a'), ('b'); Internal error: Function 'encode' returned value of type 'Utf8' while the following type was promised at planning time and expected: 'LargeUtf8'. This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues > select encode(arrow_cast(column1, 'LargeBinary'), 'hex') from values ('a'), ('b'); Internal error: Function 'encode' returned value of type 'Utf8' while the following type was promised at planning time and expected: 'LargeUtf8'. This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues > select encode(arrow_cast(column1, 'BinaryView'), 'hex') from values ('a'), ('b'); Error during planning: Execution error: Function 'encode' user-defined coercion failed with "Error during planning: 1st argument should be Utf8 or Binary or Null, got BinaryView" No function matches the given name and argument types 'encode(BinaryView, Utf8)'. You might need to add explicit type casts. Candidate functions: encode(UserDefined) ``` - LargeUtf8/LargeBinary array inputs are broken - BinaryView input not supported (but Utf8View input is supported) So went about fixing this input handling as well as doing various refactors to try simplify the code. (I also discovered apache#18746 in the process of this refactor). ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> Refactor signatures away from user defined to the signature coercion API; importantly, we now accept only binary inputs, letting string inputs be coerced by type coercion. This simplifies the internal code of encode/decode to only need to consider binary inputs, instead of duplicating essentially exact code for string inputs (given for string inputs we just grabbed the underlying bytes anyway) Consolidating the inner functions used by encode/decode to try simplify/inline where possible. ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> Added new SLTs. ## Are there any user-facing changes? 
<!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> No. <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
Closes apache#16800 We could leave some of these methods around as deprecated and make them no-ops but I'd be afraid that would create a false sense of security (compiles but behaves wrong at runtime).
## Which issue does this PR close? Closes apache#19396. ## Rationale for this change Previously, when a `DefaultListFilesCache` was created with a TTL (e.g., `DefaultListFilesCache::new(1024, Some(Duration::from_secs(1)))`) and passed to `CacheManagerConfig` without explicitly setting `list_files_cache_ttl`, the cache's TTL would be unexpectedly unset (overwritten to `None`). This happened because `CacheManager::try_new()` always called `update_cache_ttl(config.list_files_cache_ttl)`, even when the config value was `None`. ## What changes are included in this PR? - Modified `CacheManager::try_new()` to only update the cache's TTL if `config.list_files_cache_ttl` is explicitly set (`Some(value)`). If the config TTL is `None`, the cache's existing TTL is preserved. - Added two test cases: - `test_ttl_preserved_when_not_set_in_config`: Verifies that TTL is preserved when not set in config - `test_ttl_overridden_when_set_in_config`: Verifies that TTL can still be overridden when explicitly set in config ## Are these changes tested? Yes ## Are there any user-facing changes? Yes
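A hypothetical sketch of the guard described above; the cache type here is a stand-in, not the real `DefaultListFilesCache` or `CacheManager` API.

```rust
use std::time::Duration;

// Hypothetical stand-in for the list-files cache; only the TTL matters here.
struct MockCache {
    ttl: Option<Duration>,
}

// Sketch of the fix described above: the config only overrides the cache's TTL
// when it is explicitly set, instead of unconditionally overwriting it.
fn apply_ttl_config(cache: &mut MockCache, config_ttl: Option<Duration>) {
    if let Some(ttl) = config_ttl {
        cache.ttl = Some(ttl);
    }
    // A `None` in the config leaves the cache's pre-configured TTL untouched.
}

fn main() {
    let mut cache = MockCache { ttl: Some(Duration::from_secs(1)) };
    apply_ttl_config(&mut cache, None);
    assert_eq!(cache.ttl, Some(Duration::from_secs(1))); // preserved
    apply_ttl_config(&mut cache, Some(Duration::from_secs(5)));
    assert_eq!(cache.ttl, Some(Duration::from_secs(5))); // overridden
}
```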
## Summary This PR adds protobuf serialization/deserialization support for `HashExpr`, enabling distributed query execution to serialize hash expressions used in hash joins and repartitioning. This is a followup to apache#18393 which introduced `HashExpr` but did not add serialization support. This causes errors when serialization is triggered on a query that pushes down dynamic filters from a `HashJoinExec`. As of apache#18393 `HashJoinExec` produces filters of the form: ```sql CASE (hash_repartition % 2) WHEN 0 THEN a >= ab AND a <= ab AND b >= bb AND b <= bb AND hash_lookup(a,b) WHEN 1 THEN a >= aa AND a <= aa AND b >= ba AND b <= ba AND hash_lookup(a,b) ELSE FALSE END ``` Where `hash_lookup` is an expression that holds a reference to a given partition's hash join hash table and will check for membership. Since we created these new expressions but didn't make any of them serializable, any attempt to do a distributed query or similar would run into errors. In apache#19300 we fixed `hash_lookup` by replacing it with `true` since it can't be serialized across the wire (we'd have to send the entire hash table). The logic was that this preserves the bounds checks, which are still valuable. This PR handles `hash_repartition` which determines which partition (and hence which branch of the `CASE` expression) the row belongs to. For this expression we *can* serialize it, so that's what I'm doing in this PR. ### Key Changes - **SeededRandomState wrapper**: Added a `SeededRandomState` struct that wraps `ahash::RandomState` while preserving the seeds used to create it. This is necessary because `RandomState` doesn't expose seeds after creation, but we need them for serialization. - **Updated seed constants**: Changed `HASH_JOIN_SEED` and `REPARTITION_RANDOM_STATE` constants to use `SeededRandomState` instead of raw `RandomState`. - **HashExpr enhancements**: - Changed `HashExpr` to use `SeededRandomState` - Added getter methods: `on_columns()`, `seeds()`, `description()` - Exported `HashExpr` and `SeededRandomState` from the joins module - **Protobuf support**: - Added `PhysicalHashExprNode` message to `datafusion.proto` with fields for `on_columns`, seeds (4 `u64` values), and `description` - Implemented serialization in `to_proto.rs` - Implemented deserialization in `from_proto.rs` ## Test plan - [x] Added roundtrip test in `roundtrip_physical_plan.rs` that creates a `HashExpr`, serializes it, deserializes it, and verifies the result - [x] All existing hash join tests pass (583 tests) - [x] All proto roundtrip tests pass 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.5 <[email protected]>
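A minimal sketch of the wrapper idea (illustrative, not the actual DataFusion type): keep the four seeds next to the `ahash::RandomState` so they can be written to the proto later, since `RandomState` does not expose them after construction.

```rust
use ahash::RandomState;

// Illustrative stand-in for the wrapper described above.
#[derive(Clone)]
struct SeededRandomState {
    seeds: [u64; 4],
    state: RandomState,
}

impl SeededRandomState {
    fn with_seeds(k0: u64, k1: u64, k2: u64, k3: u64) -> Self {
        Self {
            seeds: [k0, k1, k2, k3],
            state: RandomState::with_seeds(k0, k1, k2, k3),
        }
    }

    /// Seeds used to build the state, e.g. for proto serialization.
    fn seeds(&self) -> [u64; 4] {
        self.seeds
    }
}

fn main() {
    let state = SeededRandomState::with_seeds(1, 2, 3, 4);
    assert_eq!(state.seeds(), [1, 2, 3, 4]);
    // The wrapped RandomState is still usable for hashing.
    let _hash = state.state.hash_one(42u64);
}
```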
…he#19400) ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> close apache#19398 ## Rationale for this change see issue apache#19398 ## What changes are included in this PR? impl `try_swapping_with_projection` for `CooperativeExec` and `CoalesceBatchesExec` ## Are these changes tested? add test case ## Are there any user-facing changes?
…pache#19412) ## Which issue does this PR close? Closes apache#19410 ## Rationale for this change This PR fixes a regression introduced in apache#18831 where queries using GROUP BY with ORDER BY positional reference to an aliased aggregate fail with: ``` Error during planning: Column in ORDER BY must be in GROUP BY or an aggregate function ``` **Failing query (now fixed):** ```sql with t as (select 'foo' as x) select x, count(*) as "Count" from t group by x order by 2 desc; ``` ## What changes are included in this PR? **Root cause:** When building the list of valid columns for ORDER BY validation in `select.rs`, alias names were converted to `Column` using `.into()`, which calls `from_qualified_name()` and normalizes identifiers to lowercase. However, ORDER BY positional references resolve to columns using schema field names, which preserve case. This caused a mismatch (e.g., `Column("Count")` vs `Column("count")`). **Fix:** Use `Column::new_unqualified()` instead of `.into()` to preserve the exact case of alias names, matching how the schema stores field names. ## Are these changes tested? Yes, added a regression test to `order.slt`. ## Are there any user-facing changes? No, this is a bug fix that restores expected behavior. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.5 <[email protected]>
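A small sketch of the mismatch described above, assuming the lowercasing behavior of `from_qualified_name()` stated in the root-cause analysis; it is meant to illustrate the fix, not to document the `Column` API beyond that.

```rust
use datafusion::common::Column;

fn main() {
    // Per the root cause above: `.into()` goes through `from_qualified_name()`,
    // which normalizes the unquoted identifier to lowercase, while
    // `new_unqualified` keeps the alias exactly as written in the schema.
    let normalized: Column = "Count".into();
    let preserved = Column::new_unqualified("Count");

    assert_eq!(normalized.name, "count");
    assert_eq!(preserved.name, "Count");
}
```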
## Which issue does this PR close? - Closes apache#19269. ## Rationale for this change See issue apache#19269 for deeper rationale. DF did not have the notion that being partitioned on a subset of the required partitioning columns satisfies the condition. Having this logic will eliminate unnecessary repartitions and, in turn, other operators like partial aggregations. I introduced this behavior behind the `repartition_subset_satisfactions` flag (default false) as there are some cases where repartitioning may still be wanted even when we satisfy partitioning via this subset property. In particular, if partitioning via Hash(a) leads to data skew while Hash(a, b) gives better distribution, a user may want to turn this optimization off. I also made it the case that if we satisfy repartitioning via a subset but the current number of partitions < target_partitions, then we will still repartition to maintain and increase parallelism in the system when possible. ## What changes are included in this PR? - Modified `satisfy()` logic to check for subsets and return an enum describing the type of match: exact, subset, or none - In `EnforceDistribution`, where `satisfy()` is called, do not allow subset logic for partitioned join operators, as the partitioning on each side must match exactly; in that case we still need to repartition even when the subset condition holds - Created unit and sqllogictests ## Are these changes tested? - Unit test - sqllogictest - tpch correctness ### Benchmarks I did not see any drastic changes in benches, but the shuffle eliminations will be great improvements for distributed DF. ![Screenshot 2025-12-12 at 8 28 15 PM](https://github.com/user-attachments/assets/4b42945f-34e0-46c9-a4ce-e7ccdd0c0603) ![Screenshot 2025-12-12 at 8 30 15 PM](https://github.com/user-attachments/assets/846aef1b-8c5d-462d-83e7-7fa1e2a9372e) ## Are there any user-facing changes? Yes, users will now have the `repartition_subset_satisfactions` option as described in this PR --------- Co-authored-by: Andrew Lamb <[email protected]>
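A simplified sketch of the subset idea (the function and enum here are illustrative, not the real `satisfy()` signature): an existing Hash(a) partitioning satisfies a required Hash(a, b), because rows that agree on (a, b) necessarily agree on a and therefore land in the same partition.

```rust
// Kind of match returned by the (simplified) satisfaction check described above.
#[derive(Debug, PartialEq)]
enum Satisfaction {
    Exact,
    Subset,
    None,
}

// `existing` is the set of columns the data is already hash-partitioned on,
// `required` is the hash partitioning an operator asks for.
fn satisfy(existing: &[&str], required: &[&str]) -> Satisfaction {
    if existing == required {
        Satisfaction::Exact
    } else if !existing.is_empty() && existing.iter().all(|c| required.contains(c)) {
        Satisfaction::Subset
    } else {
        Satisfaction::None
    }
}

fn main() {
    assert_eq!(satisfy(&["a"], &["a", "b"]), Satisfaction::Subset);
    assert_eq!(satisfy(&["a", "b"], &["a", "b"]), Satisfaction::Exact);
    assert_eq!(satisfy(&["c"], &["a", "b"]), Satisfaction::None);
}
```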
…5% faster (apache#19413) ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Part of apache#18411 - Closes apache#19344 - Closes apache#19364 Note this is an alternate to apache#19364 ## Rationale for this change @camuel found a query where DuckDB's raw grouping is faster. I looked into it and much of the difference can be explained by better vectorization in the comparisons and short string optimizations ## What changes are included in this PR? Optimize (will comment inline) ## Are these changes tested? By CI. See also benchmark results below. I tested manually as well. Create data: ```shell nice tpchgen-cli --tables=lineitem --format=parquet --scale-factor 100 ``` Run query: ```shell hyperfine --warmup 3 " datafusion-cli -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" " ``` Before (main): 1.368s ```shell Benchmark 1: datafusion-cli -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;" Time (mean ± σ): 1.393 s ± 0.020 s [User: 16.778 s, System: 0.688 s] Range (min … max): 1.368 s … 1.438 s 10 runs ``` After (this PR): 1.022s ```shell Benchmark 1: ./datafusion-cli-multi-gby-try2 -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;" Time (mean ± σ): 1.022 s ± 0.015 s [User: 11.685 s, System: 0.644 s] Range (min … max): 1.005 s … 1.052 s 10 runs ``` I have a PR that improves string view hashing performance too, see - apache#19374 ## Are there any user-facing changes? Faster performance
…19374) ## Which issue does this PR close? - builds on apache#19373 - part of apache#18411 - Broken out of apache#19344 - Closes apache#19344 ## Rationale for this change While looking at performance as part of apache#18411, I noticed we could speed up string view hashing by optimizing for small strings ## What changes are included in this PR? Optimize StringView hashing, specifically by using the inlined view for short strings ## Are these changes tested? Functionally by existing coverage Performance by benchmarks (added in apache#19373) which show * 15%-20% faster for mixed short/long strings * 50%-70% faster for "short" arrays where we know there are no strings longer than 12 bytes ``` utf8_view (small): multiple, no nulls 1.00 47.9±1.71µs ? ?/sec 4.00 191.6±1.15µs ? ?/sec utf8_view (small): multiple, nulls 1.00 78.4±0.48µs ? ?/sec 3.08 241.6±1.11µs ? ?/sec utf8_view (small): single, no nulls 1.00 13.9±0.19µs ? ?/sec 4.29 59.7±0.30µs ? ?/sec utf8_view (small): single, nulls 1.00 23.8±0.20µs ? ?/sec 3.10 73.7±1.03µs ? ?/sec utf8_view: multiple, no nulls 1.00 235.4±2.14µs ? ?/sec 1.11 262.2±1.34µs ? ?/sec utf8_view: multiple, nulls 1.00 227.2±2.11µs ? ?/sec 1.34 303.9±2.23µs ? ?/sec utf8_view: single, no nulls 1.00 71.6±0.74µs ? ?/sec 1.05 75.2±1.27µs ? ?/sec utf8_view: single, nulls 1.00 71.5±1.92µs ? ?/sec 1.28 91.6±4.65µs ``` <details><summary>Details</summary> <p> ``` Gnuplot not found, using plotters backend utf8_view: single, no nulls time: [20.872 µs 20.906 µs 20.944 µs] change: [−15.863% −15.614% −15.331%] (p = 0.00 < 0.05) Performance has improved. Found 13 outliers among 100 measurements (13.00%) 8 (8.00%) high mild 5 (5.00%) high severe utf8_view: single, nulls time: [22.968 µs 23.050 µs 23.130 µs] change: [−17.796% −17.384% −16.918%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 3 (3.00%) high mild 4 (4.00%) high severe utf8_view: multiple, no nulls time: [66.005 µs 66.155 µs 66.325 µs] change: [−19.077% −18.785% −18.512%] (p = 0.00 < 0.05) Performance has improved. utf8_view: multiple, nulls time: [72.155 µs 72.375 µs 72.649 µs] change: [−17.944% −17.612% −17.266%] (p = 0.00 < 0.05) Performance has improved. Found 11 outliers among 100 measurements (11.00%) 6 (6.00%) high mild 5 (5.00%) high severe utf8_view (small): single, no nulls time: [6.1401 µs 6.1563 µs 6.1747 µs] change: [−69.623% −69.484% −69.333%] (p = 0.00 < 0.05) Performance has improved. Found 6 outliers among 100 measurements (6.00%) 3 (3.00%) high mild 3 (3.00%) high severe utf8_view (small): single, nulls time: [10.234 µs 10.250 µs 10.270 µs] change: [−53.969% −53.815% −53.666%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 5 (5.00%) high severe utf8_view (small): multiple, no nulls time: [20.853 µs 20.905 µs 20.961 µs] change: [−66.006% −65.883% −65.759%] (p = 0.00 < 0.05) Performance has improved. Found 9 outliers among 100 measurements (9.00%) 7 (7.00%) high mild 2 (2.00%) high severe utf8_view (small): multiple, nulls time: [32.519 µs 32.600 µs 32.675 µs] change: [−53.937% −53.581% −53.232%] (p = 0.00 < 0.05) Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild ``` </p> </details> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
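A conceptual sketch of the short-string fast path (illustrative names, not the DataFusion code), based on the Arrow string-view layout where the low 4 bytes of each 16-byte view hold the length and strings of at most 12 bytes are stored inline:

```rust
// For inlined views (len <= 12) the string bytes live in the view itself, so
// they can be hashed without touching the variadic data buffers at all.
fn hash_view(view: u128, data_buffers: &[&[u8]], hash_one: impl Fn(&[u8]) -> u64) -> u64 {
    let len = (view as u32) as usize;
    if len <= 12 {
        // Inline fast path: bytes 4..4+len of the little-endian view hold the string.
        let bytes = view.to_le_bytes();
        hash_one(&bytes[4..4 + len])
    } else {
        // Long strings: resolve buffer index and offset, then hash from the buffer.
        let buffer_index = ((view >> 64) as u32) as usize;
        let offset = ((view >> 96) as u32) as usize;
        hash_one(&data_buffers[buffer_index][offset..offset + len])
    }
}

fn main() {
    // Build an inline view for "hi" (length 2, bytes stored at offsets 4 and 5).
    let view: u128 = 2 | ((b'h' as u128) << 32) | ((b'i' as u128) << 40);
    let h = hash_view(view, &[], |bytes| bytes.iter().map(|b| *b as u64).sum());
    assert_eq!(h, (b'h' as u64) + (b'i' as u64));
}
```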
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#13433 ## Rationale for this change Keep up to date with hashbrown and possible (performance) improvements, and remove the `rawtable` functionality. ## What changes are included in this PR? Upgrading, and fixing a case of nondeterminism in `HashSet` usage. ## Are these changes tested? Yes, existing tests. ## Are there any user-facing changes? No.
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes #. ## Rationale for this change 1. Add crypto function benchmark code 2. Simplify the inner digest function code. I checked for any performance changes using the new benchmark (1), but there was no change. <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? Yes. <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? No <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> Co-authored-by: Jeffrey Vo <[email protected]>
…e#19559) Bumps [taiki-e/install-action](https://github.com/taiki-e/install-action) from 2.65.6 to 2.65.8.

Release highlights (from the project's [release notes](https://github.com/taiki-e/install-action/releases) and [changelog](https://github.com/taiki-e/install-action/blob/main/CHANGELOG.md)):
- 2.65.8 (2025-12-30): update `tombi@latest` to 0.7.14, `uv@latest` to 0.9.20, and `typos@latest` to 1.40.1.
- 2.65.7 (2025-12-29): update `cargo-no-dev-deps@latest` to 0.2.19, `cargo-minimal-versions@latest` to 0.1.34, `cargo-insta@latest` to 1.45.1, `cargo-hack@latest` to 0.6.40, and `dprint@latest` to 0.51.1.

Individual commits are viewable in the [compare view](https://github.com/taiki-e/install-action/compare/28a9d316db64b78a951f3f8587a5d08cc97ad8eb...ff581034fb69296c525e51afd68cb9823bfbe4ed).

[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. The standard Dependabot commands (`@dependabot rebase`, `recreate`, `merge`, `squash and merge`, `cancel merge`, `reopen`, `close`, `show <dependency name> ignore conditions`, `ignore this major version`, `ignore this minor version`, `ignore this dependency`) are available by commenting on this PR.

Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
## Which issue does this PR close? - Closes apache#19292. ## Rationale for this change Metadata on a schema for a record batch is lost during FFI conversion. This is not always obvious because our other integrations like `ExecutionPlan` and `TableProvider` have their own ways to provide the schema. This is currently an issue because we have FFI table providers who say they have a specific schema but the schema of the actual record batches does not match. The metadata is dropped. ## What changes are included in this PR? We already have the schema in the FFI object, so use it both when converting to and from FFI. ## Are these changes tested? Unit test added. ## Are there any user-facing changes? No
…implementation (apache#19142) Add infrastructure for row-level DML operations (DELETE/UPDATE) to the TableProvider trait, enabling storage engines to implement SQL-based mutations. Changes: - Add TableProvider::delete_from() method for DELETE operations - Add TableProvider::update() method for UPDATE operations - Wire physical planner to route DML operations to TableProvider - Implement DELETE/UPDATE for MemTable as reference implementation - Add comprehensive sqllogictest coverage This provides the API surface for downstream projects (iceberg-rust, delta-rs) to implement DML without custom query planners. ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#16959 - Related to apache#12406 ## Rationale for this change Datafusion parses DELETE/UPDATE but returns NotImplemented("Unsupported logical plan: Dml(Delete)") at physical planning. Downstream projects (iceberg-rust, delta-rs) must implement custom planners to work around this. <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? Adds TableProvider hooks for row-level DML: - delete_from(state, filters) - deletes rows matching filter predicates - update(state, assignments, filters) - updates matching rows with new values Physical planner routes WriteOp::Delete and WriteOp::Update to these methods. Tables that don't support DML return NotImplemented (the default behavior). MemTable reference implementation demonstrates: - Filter evaluation with SQL three-valued logic (NULL predicates don't match) - Multi-column updates with expression evaluation - Proper row count reporting <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? Yes: - Unit tests for physical planning: cargo test -p datafusion --test custom_sources_cases - SQL logic tests: dml_delete.slt and dml_update.slt with comprehensive coverage <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? New trait methods on TableProvider: ``` async fn delete_from(&self, state: &dyn Session, filters: Vec<Expr>) -> Result<Arc<dyn ExecutionPlan>>; async fn update(&self, state: &dyn Session, assignments: Vec<(String, Expr)>, filters: Vec<Expr>) -> Result<Arc<dyn ExecutionPlan>>; ``` Fully backward compatible. Default implementations return NotImplemented. <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> --------- Co-authored-by: Andrew Lamb <[email protected]>
…pache#19523) ## Which issue does this PR close? None ## Rationale for this change When I run the benchmark for topk_aggregate, I find that most of the time is spent generating the data, so the result does not really reflect the performance of topk_aggregate. ``` cargo bench --bench topk_aggregate ``` ### main branch  ### this pr  ## What changes are included in this PR? Extract the data generation out of the topk_aggregate benchmark. ## Are these changes tested? ## Are there any user-facing changes?
Optimized lpad and rpad functions to eliminate per-row allocations by reusing buffers for graphemes and fill characters. The previous implementation allocated new Vec<&str> for graphemes and Vec<char> for fill characters on every row, which was inefficient. This optimization introduces reusable buffers that are allocated once and cleared/refilled for each row. Changes: - lpad: Added graphemes_buf and fill_chars_buf outside loops, clear and refill per row instead of allocating new Vec each time - rpad: Added graphemes_buf outside loops to reuse across iterations - Both functions now allocate buffers once and reuse them for all rows - Buffers are cleared and reused for each row via .clear() and .extend() Optimization impact: - For lpad with fill parameter: Eliminates 2 Vec allocations per row (graphemes + fill_chars) - For lpad without fill: Eliminates 1 Vec allocation per row (graphemes) - For rpad: Eliminates 1 Vec allocation per row (graphemes) This optimization is particularly effective for: - Large arrays with many rows - Strings with multiple graphemes (unicode characters) - Workloads with custom fill patterns Benchmark results comparing main vs optimized branch: lpad benchmarks: - size=1024, str_len=5, target=20: 116.53 µs -> 63.226 µs (45.7% faster) - size=1024, str_len=20, target=50: 314.07 µs -> 190.30 µs (39.4% faster) - size=4096, str_len=5, target=20: 467.35 µs -> 261.29 µs (44.1% faster) - size=4096, str_len=20, target=50: 1.2286 ms -> 754.24 µs (38.6% faster) rpad benchmarks: - size=1024, str_len=5, target=20: 113.89 µs -> 72.645 µs (36.2% faster) - size=1024, str_len=20, target=50: 313.68 µs -> 202.98 µs (35.3% faster) - size=4096, str_len=5, target=20: 456.08 µs -> 295.57 µs (35.2% faster) - size=4096, str_len=20, target=50: 1.2523 ms -> 818.47 µs (34.6% faster) Overall improvements: 35-46% faster across all workloads ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes #. ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
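A stripped-down sketch of the buffer-reuse pattern described above (illustrative, not the actual `lpad` kernel; it assumes the `unicode-segmentation` crate): the grapheme buffer is allocated once and cleared/refilled per row instead of collecting a fresh `Vec` on every row.

```rust
use unicode_segmentation::UnicodeSegmentation;

// For each row, compute how many fill characters a left-pad to `target`
// graphemes would need, reusing a single grapheme buffer across rows.
fn pad_counts(rows: &[&str], target: usize) -> Vec<usize> {
    let mut graphemes_buf: Vec<&str> = Vec::new();
    rows.iter()
        .map(|value| {
            graphemes_buf.clear();
            graphemes_buf.extend(value.graphemes(true));
            target.saturating_sub(graphemes_buf.len())
        })
        .collect()
}

fn main() {
    assert_eq!(pad_counts(&["hi", "héllo"], 5), vec![3, 0]);
}
```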
…ition is provided (apache#19553)

## Which issue does this PR close?

- Closes #.

## Rationale for this change

## What changes are included in this PR?

Replace `.chars().skip().collect::<String>()` with zero-copy string slicing: use `char_indices()` to find the byte offset, then slice with `&value[byte_offset..]`. This eliminates an unnecessary String allocation per row when a start position is specified.

Changes:
- Use char_indices().nth() to find the byte offset for the start position (1-based)
- Use the string slice &value[byte_offset..] instead of collecting chars
- Added a benchmark to measure the performance improvement

Optimization:
- Before: allocated a new String via .collect() for each row with a start position
- After: uses a zero-copy string slice

Benchmark results:
- size=1024, str_len=32: 96.361 µs -> 41.458 µs (57.0% faster, 2.3x speedup)
- size=1024, str_len=128: 210.16 µs -> 56.064 µs (73.3% faster, 3.7x speedup)
- size=4096, str_len=32: 376.90 µs -> 162.98 µs (56.8% faster, 2.3x speedup)
- size=4096, str_len=128: 855.68 µs -> 263.61 µs (69.2% faster, 3.2x speedup)

The optimization shows greater improvements for longer strings (up to 73% faster), since taking the slice itself is O(1) regardless of length, while the previous approach had allocation costs that grew with string length.

## Are these changes tested?

## Are there any user-facing changes?
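A minimal sketch of the zero-copy slicing described above (an illustration of the technique, not the exact DataFusion code):

```rust
// Instead of value.chars().skip(n).collect::<String>(), find the byte
// offset of the 1-based start position with char_indices() and return a
// borrowed slice of the original string.
fn substr_from(value: &str, start: i64) -> &str {
    // positions of 1 or less mean "start at the beginning"
    let chars_to_skip = start.saturating_sub(1).max(0) as usize;
    match value.char_indices().nth(chars_to_skip) {
        Some((byte_offset, _)) => &value[byte_offset..],
        // the start position is past the end of the string
        None => "",
    }
}

fn main() {
    assert_eq!(substr_from("héllo world", 2), "éllo world");
    assert_eq!(substr_from("héllo", 10), "");
}
```

Since the result borrows from the input, the output can be built directly from the slices without an intermediate String per row.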
## Which issue does this PR close?

- Closes apache#18827
- Closes apache#9654

## Rationale for this change

Now that the `DefaultListFilesCache` can be configured by users, it's safe to enable it by default and fix the tests that caching broke!

## What changes are included in this PR?

- Sets the DefaultListFilesCache to be enabled by default
- Adds additional object store access tests to show list caching behavior
- Adds variable setting/reading sqllogic test cases
- Updates tests to disable caching when they relied on COPY commands, so changes can be detected for each query
- Updates docs to help users upgrade

## Are these changes tested?

Yes, additional test cases have been added to help show the behavior of the caching.

## Are there any user-facing changes?

Yes, this changes the default behavior of DataFusion; however, this information is already captured in the upgrade guide.

## cc @alamb

---------

Co-authored-by: Andrew Lamb <[email protected]>
)

## Which issue does this PR close?

Closes apache#17527

## Rationale for this change

Currently, DataFusion computes bounds for all queries that contain a HashJoinExec node whenever the option enable_dynamic_filter_pushdown is set to true (the default). It might make sense to compute these bounds only when we explicitly know there is a consumer that will use them.

## What changes are included in this PR?

As suggested in apache#17527 (comment), this PR adds an is_used() method to DynamicFilterPhysicalExpr that checks whether any consumers are holding a reference to the filter, using Arc::strong_count(). During filter pushdown, consumers that accept the filter and use it later in execution retain a reference to the Arc; for example, scan nodes like ParquetSource.

## Are these changes tested?

I added a unit test in dynamic_filters.rs (test_is_used) that verifies the Arc reference-counting behavior. Existing integration tests in datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs validate the end-to-end behavior. These tests verify that dynamic filters are computed and filled when consumers are present.

## Are there any user-facing changes?

A new is_used() function.
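A minimal sketch of the reference-counting idea described above; the names here are illustrative stand-ins, not DataFusion's actual types:

```rust
use std::sync::Arc;

// The filter is "used" only if some consumer (e.g. a scan node) holds
// another strong reference to its shared state.
struct DynamicFilter {
    // stands in for the shared filter state that consumers hold on to
    inner: Arc<()>,
}

impl DynamicFilter {
    // a consumer that accepts the filter during pushdown keeps this Arc alive
    fn handle(&self) -> Arc<()> {
        Arc::clone(&self.inner)
    }

    // more than one strong reference means a consumer retained the filter
    fn is_used(&self) -> bool {
        Arc::strong_count(&self.inner) > 1
    }
}

fn main() {
    let filter = DynamicFilter { inner: Arc::new(()) };
    assert!(!filter.is_used());

    let consumer = filter.handle();
    assert!(filter.is_used());

    drop(consumer);
    assert!(!filter.is_used());
}
```

Dropping the last consumer handle brings the strong count back to 1, so the filter naturally reports itself as unused again.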
## Which issue does this PR close?

N/A

## Rationale for this change

Clean up some signatures and unnecessary code in string functions.

## What changes are included in this PR?

Various refactors, see comments.

## Are these changes tested?

Existing tests.

## Are there any user-facing changes?

No.
Labels
catalog
common
core
datasource
development-process
documentation
execution
ffi
functions
logical-expr
optimizer
physical-expr
physical-plan
proto
spark
sql
sqllogictest
substrait
NOTE:
For simplicity, this PR implements everything within datafusion. In a more complete design, some pieces, like casting support, would likely belong in arrow's cast kernel. If this PR is directionally sound and aligned with other folks, I'm happy to follow up with a properly layered implementation.
Which issue does this PR close?
Union data type coercion apache/datafusion#18825
Rationale for this change
This PR adds coercion logic for union data types. This way, you are able to compare union data types with scalar types.
Specifically, making queries like the following work. (Note: attributes->'id' is a DataType::Union.)
When comparing a union type with a scalar type, the logic will check if the scalar type matches any union variant.
The scalar type matching is greedy: if a union type has multiple variants with the same data type, it will match the one with the smaller type id (since union variants are sorted by type id).
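As a rough illustration of that greedy matching, here is a sketch using arrow-schema's UnionFields; this is an illustration of the described behavior, not the PR's actual coercion code:

```rust
use arrow_schema::{DataType, Field, UnionFields};

// Scan the union's variants in ascending type-id order and return the
// first one whose data type equals the scalar's type, so ties resolve to
// the smaller type id.
fn matching_union_variant(fields: &UnionFields, scalar_type: &DataType) -> Option<i8> {
    let mut variants: Vec<(i8, _)> = fields.iter().collect();
    variants.sort_by_key(|(type_id, _)| *type_id);
    variants
        .into_iter()
        .find(|(_, field)| field.data_type() == scalar_type)
        .map(|(type_id, _)| type_id)
}

fn main() {
    // a union with two Int64 variants and one Utf8 variant
    let fields = UnionFields::new(
        vec![0_i8, 1, 2],
        vec![
            Field::new("int", DataType::Int64, true),
            Field::new("str", DataType::Utf8, true),
            Field::new("int_again", DataType::Int64, true),
        ],
    );
    // an Int64 scalar matches the variant with the smaller type id (0, not 2)
    assert_eq!(matching_union_variant(&fields, &DataType::Int64), Some(0));
    assert_eq!(matching_union_variant(&fields, &DataType::Utf8), Some(1));
    assert_eq!(matching_union_variant(&fields, &DataType::Float64), None);
}
```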