
Conversation


@friendlymatthew commented Dec 18, 2025

NOTE:
For simplicity, this PR implements everything within DataFusion. In a more complete design, some pieces like casting support would likely belong in arrow's cast kernel. If this PR is directionally sound and others are aligned, I'm happy to follow up with a properly layered implementation.

Which issue does this PR close?

Rationale for this change

This PR adds coercion logic for union data types so that union values can be compared with scalar types.

Specifically, it makes queries like the following work (note that `attributes->'id'` is a `DataType::Union`):

```sql
select * from records where attributes->'id' = 67
```

When comparing a union type with a scalar type, the logic checks whether the scalar type matches any union variant.

The scalar type matching is greedy: if a union has multiple variants of the same type, the match goes to the smaller type id (since union variants are sorted by type id).
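
A rough sketch of that matching rule (illustrative only, not the PR's exact code), assuming the union's fields are iterated in type-id order:

```rust
use arrow::datatypes::{DataType, UnionFields};

/// Return the type id of the first union variant whose data type equals the
/// scalar's type, if any. Because the fields are visited in type-id order,
/// ties between variants of the same type resolve to the smaller type id.
fn matching_variant(fields: &UnionFields, scalar_type: &DataType) -> Option<i8> {
    fields
        .iter()
        .find(|(_, field)| field.data_type() == scalar_type)
        .map(|(type_id, _)| type_id)
}
```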

Smotrov and others added 7 commits December 17, 2025 22:51
…e#18954)

## Which issue does this PR close?

Closes apache#18947

## Rationale for this change

Currently, DataFusion uses default compression levels when writing
compressed JSON and CSV files. For ZSTD, this means level 3, which
prioritizes speed over compression ratio. Users working with large
datasets who want to optimize for storage costs or network transfer have
no way to increase the compression level.

This is particularly important for cloud data lake scenarios where
storage and egress costs can be significant.

## What changes are included in this PR?

- Add `compression_level: Option<u32>` field to `JsonOptions` and
`CsvOptions` in `config.rs`
- Add `convert_async_writer_with_level()` method to
`FileCompressionType` (non-breaking API extension)
- Keep original `convert_async_writer()` as a convenience wrapper for
backward compatibility
- Update `JsonWriterOptions` and `CsvWriterOptions` with
`compression_level` field
- Update `ObjectWriterBuilder` to support compression level
- Update JSON and CSV sinks to pass compression level through the write
pipeline
- Update proto definitions and conversions for serialization support
- Fix unrelated unused import warning in `udf.rs` (conditional
compilation for debug-only imports)

## Are these changes tested?

The changes follow the existing patterns used throughout the codebase.
The implementation was verified by:
- Building successfully with `cargo build`
- Running existing tests with `cargo test --package datafusion-proto`
- All 131 proto integration tests pass

## Are there any user-facing changes?

Yes, users can now specify compression level when writing JSON/CSV
files:

```rust
use datafusion::common::config::JsonOptions;
use datafusion::common::parsers::CompressionTypeVariant;

let json_opts = JsonOptions {
    compression: CompressionTypeVariant::ZSTD,
    compression_level: Some(9),  // Higher compression
    ..Default::default()
};
```

**Supported compression levels:**
- ZSTD: 1-22 (default: 3)
- GZIP: 0-9 (default: 6)  
- BZIP2: 1-9 (default: 9)
- XZ: 0-9 (default: 6)

**This is a non-breaking change** - the original
`convert_async_writer()` method signature is preserved for backward
compatibility.

Co-authored-by: Andrew Lamb <[email protected]>
## Which issue does this PR close?

- Related to apache#19241

## Rationale for this change

This PR enhances the `in_list` benchmark suite to provide more
comprehensive performance measurements across a wider range of data
types and list sizes. These improvements are necessary groundwork for
evaluating optimizations proposed in apache#19241.

The current benchmarks were limited in scope, making it difficult to
assess the performance impact of potential `in_list` optimizations
across different data types and scenarios.

## What changes are included in this PR?

- Added benchmarks for `UInt8Array`, `Int16Array`, and
`TimestampNanosecondArray`
- Added `28` to `IN_LIST_LENGTHS` (now `[3, 8, 28, 100]`) to better
cover the range between small and large lists
- Increased `ARRAY_LENGTH` from `1024` to `8192` to align with the default
DataFusion batch size
- Configured criterion with shorter warm-up (100ms) and measurement
times (500ms) for faster iteration

## Are these changes tested?

Yes, this PR adds benchmark coverage. The benchmarks can be run with:
```bash
cargo bench --bench in_list
```

The benchmarks verify that the `in_list` expression evaluates correctly
for all the new data types.

## Are there any user-facing changes?

No user-facing changes. This PR only affects the benchmark suite used
for performance testing and development.
## Which issue does this PR close?


- Part of apache#18411

## Rationale for this change

I want to optimize hashing for StringViewArray. In order to do so, I would
like a benchmark to show it works.

## What changes are included in this PR?

Add benchmark for `with_hashes`

Run like
```shell
cargo bench  --bench with_hashes
```

Note I did not add all the possible types of arrays as I don't plan to
optimize others.

## Are these changes tested?

I ran it manually
## Are there any user-facing changes?

## Which issue does this PR close?


- Closes apache#19156 
- Part of apache#19144 

## Rationale for this change


## What changes are included in this PR?

- Spark `next_day` uses `return_field_from_args`


## Are these changes tested?

- Added new unit tests, previous all tests pass


## Are there any user-facing changes?

…e#19313)

## Summary

This PR moves the CSV-specific `newlines_in_values` configuration option
from `FileScanConfig` (a shared format-agnostic configuration) to
`CsvSource` where it belongs.

- Add `newlines_in_values` field and methods to `CsvSource`
- Add `has_newlines_in_values()` method to `FileSource` trait (returns
`false` by default; sketched after this list)
- Update `FileSource::repartitioned()` to use the new trait method
- Remove `new_lines_in_values` from `FileScanConfig` and its builder
- Update proto serialization to read from/write to `CsvSource`
- Update tests and documentation
- Add migration guide to `upgrading.md`
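
A minimal sketch of the trait shape described in the list above (names simplified and illustrative; the real API lives in `FileSource` / `CsvSource`): format-agnostic code asks the source, and only the CSV source overrides the default.

```rust
// Illustrative stand-ins for the real FileSource trait and CsvSource struct.
trait FileSourceLike {
    /// Most formats never contain newlines inside values, so default to false.
    fn has_newlines_in_values(&self) -> bool {
        false
    }
}

struct CsvSourceLike {
    newlines_in_values: bool,
}

impl FileSourceLike for CsvSourceLike {
    fn has_newlines_in_values(&self) -> bool {
        self.newlines_in_values
    }
}
```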

Closes apache#18453

## Test plan

- [x] All existing tests pass
- [x] Doc tests pass
- [x] Proto roundtrip tests pass
- [x] Clippy clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.5 <[email protected]>
…filing (apache#19127)

## Which issue does this PR close?


- Closes apache#18138

## Rationale for this change

The `list` operation returns a stream, so it previously recorded
`duration: None`, missing performance insights. Time-to-first-item is a
useful metric for list operations, indicating how quickly results start.
This adds duration tracking by measuring time until the first item is
yielded (or the stream ends).
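
A minimal sketch of the idea (assuming the `futures` crate; this is not the PR's actual `TimeToFirstItemStream`): wrap the inner stream and record the elapsed time the first time `poll_next` returns `Ready`.

```rust
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::{Duration, Instant};

use futures::Stream;

struct TimeToFirstItem<S> {
    inner: S,
    start: Instant,
    elapsed: Option<Duration>,
}

impl<S: Stream + Unpin> Stream for TimeToFirstItem<S> {
    type Item = S::Item;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        let this = self.as_mut().get_mut();
        let poll = Pin::new(&mut this.inner).poll_next(cx);
        // Record time-to-first-item on the first Ready result, whether it is
        // an item or the end of an empty stream.
        if this.elapsed.is_none() && poll.is_ready() {
            this.elapsed = Some(this.start.elapsed());
        }
        poll
    }
}
```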

## What changes are included in this PR?

1. Added `TimeToFirstItemStream`: A stream wrapper that measures elapsed
time from creation until the first item is yielded (or the stream ends
if empty).
2. Updated `instrumented_list`: Wraps the inner stream with
`TimeToFirstItemStream` to record duration.
3. Changed `requests` field: Switched from `Mutex<Vec<RequestDetails>>`
to `Arc<Mutex<Vec<RequestDetails>>>` to allow sharing across async
boundaries (needed for the stream wrapper).
4. Updated tests: Modified `instrumented_store_list` to consume at least
one stream item and verify that `duration` is now `Some(Duration)`
instead of `None`.

## Are these changes tested?

Yes. The existing test `instrumented_store_list` was updated to:
- Consume at least one item from the stream using `stream.next().await`
- Assert that `request.duration.is_some()` (previously `is_none()`)

All tests pass, including the updated list test and other instrumented
operation tests.

## Are there any user-facing changes?

Users with profiling enabled will see duration values for `list`
operations instead of nothing.
…e#19382)

Bumps
[taiki-e/install-action](https://github.com/taiki-e/install-action) from
2.63.3 to 2.64.0.
<details>
<summary>Release notes (2.64.0)</summary>

- `tool` input option now supports a whitespace (space, tab, and line) or comma separated list. Previously, only a comma-separated list was supported. ([#1366](https://redirect.github.com/taiki-e/install-action/pull/1366))
- Support `prek`. ([#1357](https://redirect.github.com/taiki-e/install-action/pull/1357), thanks [@j178](https://github.com/j178))
- Support `mdbook-mermaid`. ([#1359](https://redirect.github.com/taiki-e/install-action/pull/1359), thanks [@CommanderStorm](https://github.com/CommanderStorm))
- Support `martin`. ([#1364](https://redirect.github.com/taiki-e/install-action/pull/1364), thanks [@CommanderStorm](https://github.com/CommanderStorm))
- Update `trivy@latest` to 0.68.2.
- Update `xh@latest` to 0.25.3.
- Update `mise@latest` to 2025.12.10.
- Update `uv@latest` to 0.9.18.
- Update `cargo-shear@latest` to 1.9.1.

</details>
<details>
<summary>Changelog</summary>

- [Unreleased]: Update `mise@latest` to 2025.12.11.
- [2.64.0] - 2025-12-17: same entries as the release notes above.
- [2.63.3] - 2025-12-15: Update `cargo-nextest@latest` to 0.9.115.
- [2.63.2] - 2025-12-15: Update `mise@latest` to 2025.12.7, `git-cliff@latest` to 2.11.0, `coreutils@latest` to 0.5.0, `cargo-binstall@latest` to 1.16.4.
- [2.63.1] - 2025-12-14 (truncated)

</details>
<details>
<summary>Commits</summary>

- [`69e777b`](https://github.com/taiki-e/install-action/commit/69e777b377e4ec209ddad9426ae3e0c1008b0ef3) Release 2.64.0
- [`c8a7c77`](https://github.com/taiki-e/install-action/commit/c8a7c7764cebc151687ee69895469c63dfa5abff) Update changelog
- [`a62e621`](https://github.com/taiki-e/install-action/commit/a62e6211cb81c68c57a180c5702ffe1dee406d82) codegen: Use ring instead of sha2
- [`936dbd8`](https://github.com/taiki-e/install-action/commit/936dbd8ac63902a52ab34d8c7d13d427f6a414e6) codegen: Update base manifests
- [`5d018ee`](https://github.com/taiki-e/install-action/commit/5d018ee3d220f595ecf68892affd8fe5bac42f58) Support whitespace separated list
- [`72b24c7`](https://github.com/taiki-e/install-action/commit/72b24c709c6db51074c8c41b3ce38bddfae9bff3) Support martin ([#1364](https://redirect.github.com/taiki-e/install-action/issues/1364))
- [`a9e9081`](https://github.com/taiki-e/install-action/commit/a9e9081aa4eb13ec32770e0bd24eb97bc2400e12) Support prek ([#1357](https://redirect.github.com/taiki-e/install-action/issues/1357))
- [`ddb68c9`](https://github.com/taiki-e/install-action/commit/ddb68c9d25d9ea6e17be49edea16b674bc79b77c) Support mdbook-mermaid ([#1359](https://redirect.github.com/taiki-e/install-action/issues/1359))
- [`19d2d1d`](https://github.com/taiki-e/install-action/commit/19d2d1dff9b6624e9696e43caeb2fe0f4480bbe3) Update `trivy@latest` to 0.68.2
- [`63c4445`](https://github.com/taiki-e/install-action/commit/63c44454be578a6e1d4e1c8756847c31801f86fe) Update `xh@latest` to 0.25.3
- Additional commits viewable in [compare view](https://github.com/taiki-e/install-action/compare/d850aa816998e5cf15f67a78c7b933f2a5033f8a...69e777b377e4ec209ddad9426ae3e0c1008b0ef3)

</details>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@friendlymatthew force-pushed the friendlymatthew/union-coercion branch 2 times, most recently from c71e795 to a73c903 on December 18, 2025 17:22
## Which issue does this PR close?


- Closes apache#19380.

## Rationale for this change


Snapshot tests pass but the existing values are in a legacy format.

Updated insta snapshots to new format by running 
`cargo insta test --force-update-snapshots`

## What changes are included in this PR?

Snapshots in various directories.


## Are these changes tested?

Yes
## Are there any user-facing changes?


No

No
@friendlymatthew force-pushed the friendlymatthew/union-coercion branch 3 times, most recently from 1fe1793 to 00ada31 on December 18, 2025 18:46
## Which issue does this PR close?

part of apache#15914

## Rationale for this change

Migrate Spark functions from https://github.com/lakehq/sail/ to the
DataFusion engine to unify the codebase.

## What changes are included in this PR?

Implement the Spark UDAF `try_sum`:
https://spark.apache.org/docs/latest/api/sql/index.html#try_sum

## Are these changes tested?

Unit tests and sqllogictests added.

## Are there any user-facing changes?

`try_sum` can now be called in queries.
Mark1626 and others added 12 commits December 19, 2025 02:48
## Which issue does this PR close?

- Part of apache#17555 .

## Rationale for this change

### Analysis 

Other engines:
1. Clickhouse seems to only consider `"(U)Int*", "Float*", "Decimal*"`
as arguments for log
https://github.com/ClickHouse/ClickHouse/blob/master/src/Functions/log.cpp#L47-L63

Libraries

1. There is a C++ library, libdecimal, which internally uses the [Intel Decimal
Floating Point
Library](https://www.intel.com/content/www/us/en/developer/articles/tool/intel-decimal-floating-point-math-library.html)
for its
[decimal32](https://github.com/GaryHughes/stddecimal/blob/main/libdecimal/decimal_cmath.cpp#L150-L159)
operations. Intel's library itself converts the decimal32 to double and
calls `log`.
https://github.com/karlorz/IntelRDFPMathLib20U2/blob/main/LIBRARY/src/bid32_log.c
2. There is another C++ library based on IBM's decNumber library,
https://github.com/semihc/CppDecimal. This one's implementation of
[`log`](https://github.com/semihc/CppDecimal/blob/main/src/decNumber.c#L1384-L1518)
works fully in decimal, but I don't think that would be a very performant
way to do this.

I'm going to go with an approach similar to the one inside Intel's
decimal library. To begin with, the `decimal32 -> double` conversion is done by
simple scaling.
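
As a hedged illustration of that scaling (not the PR's exact code): a `Decimal32(p, s)` value is an `i32` of unscaled digits plus a scale, so converting to `f64` before taking the log is a division by `10^s`.

```rust
/// Illustrative only: turn a Decimal32 (unscaled digits + scale) into an f64.
fn decimal32_to_f64(unscaled: i32, scale: i8) -> f64 {
    (unscaled as f64) / 10f64.powi(scale as i32)
}

// e.g. Decimal32(9, 2) storing 1234567 represents 12345.67, so
// log(2.0, 12345.67) can be computed as decimal32_to_f64(1234567, 2).log2().
```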

## What changes are included in this PR?

1. Support Decimal32 for log

## Are these changes tested?

Yes, unit tests have been added, and I've tested this from the
datafusion cli for Decimal32

```
> select log(2.0, arrow_cast(12345.67, 'Decimal32(9, 2)'));
+-----------------------------------------------------------------------+
| log(Float64(2),arrow_cast(Float64(12345.67),Utf8("Decimal32(9, 2)"))) |
+-----------------------------------------------------------------------+
| 13.591717513271785                                                    |
+-----------------------------------------------------------------------+
1 row(s) fetched. 
Elapsed 0.021 seconds.
```

## Are there any user-facing changes?

1. The precision of the result for Decimal32 will change; the precision
loss in apache#18524 does not occur in this PR

---------

Co-authored-by: Andrew Lamb <[email protected]>
## Which issue does this PR close?

- Part of apache#19250 

## Rationale for this change

Previously, the `log` function would fail when operating on decimal
values with negative scales.

Negative scales in decimals represent values where the scale indicates
padding zeros to the right (e.g., `Decimal128(38, -2)` with value `100`
represents `10000`). This PR restores support for negative-scale
decimals in the `log` function by implementing the logarithmic property:
`log_base(value * 10^(-scale)) = log_base(value) + (-scale) *
log_base(10)`.
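
A hedged sketch of that identity (illustrative, not the PR's `log_decimal128`): with a negative scale, the stored unscaled value `v` represents `v * 10^(-scale)`, so the log can be computed on `v` and corrected afterwards.

```rust
/// Illustrative only: log of a negative-scale Decimal128 value.
/// `unscaled` holds the stored digits and `scale` is negative, so the value
/// represented is `unscaled * 10^(-scale)`.
fn log_decimal_negative_scale(unscaled: i128, scale: i8, base: f64) -> f64 {
    assert!(scale < 0, "this sketch only covers negative scales");
    let k = f64::from(-scale); // number of implicit trailing zeros
    (unscaled as f64).log(base) + k * 10f64.log(base)
}

// e.g. Decimal128(38, -2) storing 100 represents 10_000, and
// log_decimal_negative_scale(100, -2, 10.0) ≈ 4.0.
```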

## What changes are included in this PR?

1. **Enhanced `log_decimal128` function**:
   - Added support for negative scales using the logarithmic property
- For negative scales, computes `log_base(value) + (-scale) *
log_base(10)` instead of trying to convert to unscaled value
- Added detection for negative-scale decimals in both the number and
base arguments
- Skips simplification when negative scales are detected to avoid errors
with `ScalarValue` (which doesn't support negative scales yet)
  
2. **Added comprehensive tests**:
- Unit tests in `log.rs` for negative-scale decimals with various bases
(2, 3, 10)
- SQL logic tests in `decimal.slt` using scientific notation (e.g.,
`1e4`, `8e1`) to create decimals with negative scales

## Are these changes tested?

Yes, this PR includes comprehensive tests:

1. Unit tests:
- `test_log_decimal128_negative_scale`: Tests array inputs with negative
scales
- `test_log_decimal128_negative_scale_base2`: Tests with base 2 and
negative scales
- `test_log_decimal128_negative_scale_scalar`: Tests scalar inputs with
negative scales

2. SQL logic tests:
   - Tests for unary log with negative scales (`log(1e4)`)
   - Tests for binary log with explicit base 10 (`log(10, 1e4)`)
- Tests for binary log with base 2 (`log(2.0, 8e1)`, `log(2.0, 16e1)`)
   - Tests for different negative scale values (`log(5e3)`)
   - Tests for array operations with negative scales
   - Tests for different bases (2, 3, 10) with negative-scale decimals

All tests pass successfully.

## Are there any user-facing changes?

Yes, this is a user-facing change:

**Before**: The `log` function would fail with an error when operating
on decimal values with negative scales:
```sql
-- This would fail
SELECT log(1e4);  -- Error: Negative scale is not supported
```

**After**: The `log` function now correctly handles decimal values with
negative scales:
```sql
-- This now works
SELECT log(1e4);  -- Returns 4.0 (log10(10000))
SELECT log(2.0, 8e1);  -- Returns ~6.32 (log2(80))
```

---------

Co-authored-by: Jeffrey Vo <[email protected]>
Co-authored-by: Martin Grigorov <[email protected]>
## Which issue does this PR close?


- Closes apache#7689.

## Rationale for this change


## What changes are included in this PR?

- Added dedicated ceil/floor UDF implementations that keep existing
float/int behavior but operate directly on Decimal128 arrays, including
overflow checks and metadata preservation.
- Updated the math module wiring plus sqllogictest coverage so decimal
cases are executed and validated end to end.


## Are these changes tested?

- All existing tests pass
- Added new tests for the changes


## Are there any user-facing changes?


---------

Co-authored-by: Jeffrey Vo <[email protected]>
…#18754)

## Which issue does this PR close?


Part of apache#12725

## Rationale for this change


Started refactoring the encoding functions (`encode`, `decode`) to remove the
user-defined signature (per the linked issue). However, I discovered in the
process that they were bugged in handling certain inputs. For example, on main
we get these errors:

```sql
DataFusion CLI v51.0.0
> select encode(arrow_cast(column1, 'LargeUtf8'), 'hex') from values ('a'), ('b');
Internal error: Function 'encode' returned value of type 'Utf8' while the following type was promised at planning time and expected: 'LargeUtf8'.
This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues
> select encode(arrow_cast(column1, 'LargeBinary'), 'hex') from values ('a'), ('b');
Internal error: Function 'encode' returned value of type 'Utf8' while the following type was promised at planning time and expected: 'LargeUtf8'.
This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues
> select encode(arrow_cast(column1, 'BinaryView'), 'hex') from values ('a'), ('b');
Error during planning: Execution error: Function 'encode' user-defined coercion failed with "Error during planning: 1st argument should be Utf8 or Binary or Null, got BinaryView" No function matches the given name and argument types 'encode(BinaryView, Utf8)'. You might need to add explicit type casts.
        Candidate functions:
        encode(UserDefined)
```

- LargeUtf8/LargeBinary array inputs are broken
- BinaryView input not supported (but Utf8View input is supported)

So I went about fixing this input handling as well as doing various
refactors to try to simplify the code.

(I also discovered apache#18746 in the process of this refactor).

## What changes are included in this PR?


Refactor signatures away from user-defined to the signature coercion
API; importantly, we now accept only binary inputs, letting string
inputs be coerced by type coercion. This simplifies the internal code of
encode/decode to only need to consider binary inputs, instead of
duplicating essentially identical code for string inputs (for string
inputs we just grabbed the underlying bytes anyway).

Consolidate the inner functions used by encode/decode to simplify/inline
where possible.

## Are these changes tested?


Added new SLTs.

## Are there any user-facing changes?


No.

Closes apache#16800

We could leave some of these methods around as deprecated and make them
no-ops but I'd be afraid that would create a false sense of security
(compiles but behaves wrong at runtime).
## Which issue does this PR close?

Closes apache#19396.

## Rationale for this change

Previously, when a `DefaultListFilesCache` was created with a TTL (e.g.,
`DefaultListFilesCache::new(1024, Some(Duration::from_secs(1)))`) and
passed to `CacheManagerConfig` without explicitly setting
`list_files_cache_ttl`, the cache's TTL would be unexpectedly unset
(overwritten to `None`).

This happened because `CacheManager::try_new()` always called
`update_cache_ttl(config.list_files_cache_ttl)`, even when the config
value was `None`.
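
The semantics of the fix can be summarized with a small hedged sketch (illustrative names, not the actual `CacheManager` code): a `None` in the config now means "leave the cache's TTL as-is" rather than "unset it".

```rust
use std::time::Duration;

/// Illustrative only: pick the effective TTL given the cache's existing
/// setting and the (optional) value from CacheManagerConfig.
fn effective_ttl(existing: Option<Duration>, from_config: Option<Duration>) -> Option<Duration> {
    // Before the fix the config value was applied unconditionally, clobbering
    // `existing` with None; now it only overrides when explicitly set.
    from_config.or(existing)
}
```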

## What changes are included in this PR?

- Modified `CacheManager::try_new()` to only update the cache's TTL if
`config.list_files_cache_ttl` is explicitly set (`Some(value)`). If the
config TTL is `None`, the cache's existing TTL is preserved.
- Added two test cases:
- `test_ttl_preserved_when_not_set_in_config`: Verifies that TTL is
preserved when not set in config
- `test_ttl_overridden_when_set_in_config`: Verifies that TTL can still
be overridden when explicitly set in config

## Are these changes tested?

Yes

## Are there any user-facing changes?

Yes
## Summary

This PR adds protobuf serialization/deserialization support for
`HashExpr`, enabling distributed query execution to serialize hash
expressions used in hash joins and repartitioning.

This is a followup to apache#18393 which introduced `HashExpr` but did not add
serialization support.
This causes errors when serialization is triggered on a query that
pushes down dynamic filters from a `HashJoinExec`.

As of  apache#18393 `HashJoinExec` produces filters of the form:

```sql
CASE (hash_repartition % 2)
    WHEN 0 THEN 
        a >= ab AND a <= ab AND 
        b >= bb AND b <= bb AND 
        hash_lookup(a,b)
    WHEN 1 THEN 
        a >= aa AND a <= aa AND 
        b >= ba AND b <= ba AND 
        hash_lookup(a,b)
    ELSE 
        FALSE
END
```

Here `hash_lookup` is an expression that holds a reference to a given
partition's hash join hash table and checks for membership.

Since we created these new expressions but didn't make any of them
serializable, any attempt to do a distributed query or similar would run
into errors.

In apache#19300 we fixed
`hash_lookup` by replacing it with `true` since it can't be serialized
across the wire (we'd have to send the entire hash table). The logic was
that this preserves the bounds checks, which are still valuable.

This PR handles `hash_repartition` which determines which partition (and
hence which branch of the `CASE` expression) the row belongs to. For
this expression we *can* serialize it, so that's what I'm doing in this
PR.

### Key Changes

- **SeededRandomState wrapper**: Added a `SeededRandomState` struct that
wraps `ahash::RandomState` while preserving the seeds used to create it.
This is necessary because `RandomState` doesn't expose seeds after
creation, but we need them for serialization (see the sketch after this list).

- **Updated seed constants**: Changed `HASH_JOIN_SEED` and
`REPARTITION_RANDOM_STATE` constants to use `SeededRandomState` instead
of raw `RandomState`.

- **HashExpr enhancements**: 
  - Changed `HashExpr` to use `SeededRandomState`
  - Added getter methods: `on_columns()`, `seeds()`, `description()`
  - Exported `HashExpr` and `SeededRandomState` from the joins module

- **Protobuf support**:
- Added `PhysicalHashExprNode` message to `datafusion.proto` with fields
for `on_columns`, seeds (4 `u64` values), and `description`
  - Implemented serialization in `to_proto.rs`
  - Implemented deserialization in `from_proto.rs`
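
A hedged sketch of the `SeededRandomState` idea from the first bullet above (field and method names are illustrative, not the PR's exact API): keep the four seeds next to the `RandomState` they were used to build, so they can later be written to and read back from protobuf.

```rust
use ahash::RandomState;

#[derive(Clone)]
pub struct SeededRandomState {
    seeds: [u64; 4],
    state: RandomState,
}

impl SeededRandomState {
    /// Build a RandomState from explicit seeds and remember them.
    pub fn with_seeds(seeds: [u64; 4]) -> Self {
        let state = RandomState::with_seeds(seeds[0], seeds[1], seeds[2], seeds[3]);
        Self { seeds, state }
    }

    /// The seeds, kept around for serialization (RandomState itself does not
    /// expose them after construction).
    pub fn seeds(&self) -> [u64; 4] {
        self.seeds
    }

    pub fn random_state(&self) -> &RandomState {
        &self.state
    }
}
```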

## Test plan

- [x] Added roundtrip test in `roundtrip_physical_plan.rs` that creates
a `HashExpr`, serializes it, deserializes it, and verifies the result
- [x] All existing hash join tests pass (583 tests)
- [x] All proto roundtrip tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.5 <[email protected]>
…he#19400)

## Which issue does this PR close?


Closes apache#19398

## Rationale for this change

See issue apache#19398.

## What changes are included in this PR?

Implement `try_swapping_with_projection` for `CooperativeExec` and
`CoalesceBatchesExec`.

## Are these changes tested?

Added a test case.

## Are there any user-facing changes?
…pache#19412)

## Which issue does this PR close?

Closes apache#19410

## Rationale for this change

This PR fixes a regression introduced in apache#18831 where queries using
GROUP BY with ORDER BY positional reference to an aliased aggregate fail
with:

```
Error during planning: Column in ORDER BY must be in GROUP BY or an aggregate function
```

**Failing query (now fixed):**
```sql
with t as (select 'foo' as x)
select x, count(*) as "Count"
from t
group by x
order by 2 desc;
```

## What changes are included in this PR?

**Root cause:** When building the list of valid columns for ORDER BY
validation in `select.rs`, alias names were converted to `Column` using
`.into()`, which calls `from_qualified_name()` and normalizes
identifiers to lowercase. However, ORDER BY positional references
resolve to columns using schema field names, which preserve case. This
caused a mismatch (e.g., `Column("Count")` vs `Column("count")`).

**Fix:** Use `Column::new_unqualified()` instead of `.into()` to
preserve the exact case of alias names, matching how the schema stores
field names.
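
A self-contained illustration of the mismatch (this is not DataFusion's code; `normalize` stands in for what `Column::from_qualified_name` effectively did to the alias):

```rust
fn normalize(ident: &str) -> String {
    ident.to_lowercase()
}

fn main() {
    let alias = "Count"; // user-written alias, preserved as the schema field name
    // Before the fix, the valid-column set was built via `.into()`:
    let valid = vec![normalize(alias)]; // "count"
    // ORDER BY 2 resolves to the schema field name "Count", which is not found:
    assert!(!valid.contains(&alias.to_string()));
}
```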

## Are these changes tested?

Yes, added a regression test to `order.slt`.

## Are there any user-facing changes?

No, this is a bug fix that restores expected behavior.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.5 <[email protected]>
## Which issue does this PR close?

- Closes apache#19269.

## Rationale for this change
See issue apache#19269 for deeper rationale.

DataFusion did not have the notion that being partitioned on a superset of the
required partitioning satisfies the condition. Having this logic
eliminates unnecessary repartitions and, in turn, other operators like
partial aggregations.

I introduced this behavior behind the `repartition_subset_satisfactions`
flag (default false), as there are some cases where repartitioning may
still be wanted even when partitioning is satisfied via this subset property.
In particular, if data is skewed when partitioned on Hash(a) but better
distributed when partitioned on Hash(a, b), a user may want to turn this
optimization off.

I also made it so that if repartitioning is satisfied via a subset but the
current number of partitions is less than `target_partitions`, we still
repartition to maintain and increase parallelism in the system when possible.

## What changes are included in this PR?

- Modified `satisfy()` logic to check for subsets and return an enum describing
the type of match: exact, subset, or none
- In `EnforceDistribution`, where `satisfy()` is called, do not allow subset
matching for partitioned join operators, as partitioning on each side must
match exactly; a repartition is still needed when only the subset logic matches
- Created unit and sqllogictests

## Are these changes tested?
- Unit test
- sqllogictest
- tpch correctness

### Benchmarks
I did not see any drastic changes in benches, but the shuffle
eliminations will be great improvements for distributed DF.

<img width="628" height="762" alt="Screenshot 2025-12-12 at 8 28 15 PM"
src="https://github.com/user-attachments/assets/4b42945f-34e0-46c9-a4ce-e7ccdd0c0603"
/>

<img width="490" height="746" alt="Screenshot 2025-12-12 at 8 30 15 PM"
src="https://github.com/user-attachments/assets/846aef1b-8c5d-462d-83e7-7fa1e2a9372e"
/>

## Are there any user-facing changes?

Yes, users will now have the `repartition_subset_satisfactions` option
as described in this PR

---------

Co-authored-by: Andrew Lamb <[email protected]>
…5% faster (apache#19413)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Part of apache#18411
- Closes apache#19344
- Closes apache#19364

Note this is an alternate to
apache#19364

## Rationale for this change

@camuel found a query where DuckDB's raw grouping is faster.

I looked into it, and much of the difference can be explained by better
vectorization in the comparisons and short string optimizations.

## What changes are included in this PR?

Optimize (will comment inline)

## Are these changes tested?

By CI. See also benchmark results below. I tested manually as well

Create Data:
```shell
nice tpchgen-cli --tables=lineitem --format=parquet --scale-factor 100
```

Run query:
```shell
hyperfine --warmup 3 " datafusion-cli   -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" "
```

Before (main): 1.368s
```shell
Benchmark 1:  datafusion-cli   -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):      1.393 s ±  0.020 s    [User: 16.778 s, System: 0.688 s]
  Range (min … max):    1.368 s …  1.438 s    10 runs
```

After (this PR) 1.022s
```shell
Benchmark 1:  ./datafusion-cli-multi-gby-try2   -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):      1.022 s ±  0.015 s    [User: 11.685 s, System: 0.644 s]
  Range (min … max):    1.005 s …  1.052 s    10 runs
```

I have a PR that improves string view hashing performance too, see
- apache#19374

## Are there any user-facing changes?
Faster performance
…19374)

## Which issue does this PR close?

- builds on apache#19373
- part of apache#18411
- Broken out of apache#19344
- Closes apache#19344

## Rationale for this change

While looking at performance as part of
apache#18411, I noticed we could
speed up string view hashing by optimizing for small strings

## What changes are included in this PR?

Optimize StringView hashing, specifically by using the inlined view for
short strings
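
A hedged sketch of why this helps (illustrative only, not the PR's implementation): in Arrow's view layout, the low 32 bits of each `u128` view hold the length, and strings of at most 12 bytes are stored entirely inline, so their bytes can be hashed straight out of the view word without touching the data buffers.

```rust
use std::hash::{BuildHasher, Hasher};

/// Hash the inlined payload of a short StringView, or return None if the
/// string is too long to be inlined and the data buffer must be consulted.
fn hash_short_view<S: BuildHasher>(view: u128, state: &S) -> Option<u64> {
    let len = (view as u32) as usize; // low 32 bits: length
    if len <= 12 {
        let bytes = view.to_le_bytes();
        let mut hasher = state.build_hasher();
        hasher.write(&bytes[4..4 + len]); // inlined payload follows the length
        Some(hasher.finish())
    } else {
        None
    }
}
```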

## Are these changes tested?

Functionally by existing coverage

Performance by benchmarks (added in
apache#19373) which show
* 15%-20% faster for mixed short/long strings
* 50%-70% faster for "short" arrays where we know there are no strings
longer than 12 bytes

```
utf8_view (small): multiple, no nulls        1.00     47.9±1.71µs        ? ?/sec    4.00    191.6±1.15µs        ? ?/sec
utf8_view (small): multiple, nulls           1.00     78.4±0.48µs        ? ?/sec    3.08    241.6±1.11µs        ? ?/sec
utf8_view (small): single, no nulls          1.00     13.9±0.19µs        ? ?/sec    4.29     59.7±0.30µs        ? ?/sec
utf8_view (small): single, nulls             1.00     23.8±0.20µs        ? ?/sec    3.10     73.7±1.03µs        ? ?/sec
utf8_view: multiple, no nulls                1.00    235.4±2.14µs        ? ?/sec    1.11    262.2±1.34µs        ? ?/sec
utf8_view: multiple, nulls                   1.00    227.2±2.11µs        ? ?/sec    1.34    303.9±2.23µs        ? ?/sec
utf8_view: single, no nulls                  1.00     71.6±0.74µs        ? ?/sec    1.05     75.2±1.27µs        ? ?/sec
utf8_view: single, nulls                     1.00     71.5±1.92µs        ? ?/sec    1.28     91.6±4.65µs  
```


<details><summary>Details</summary>
<p>

```
Gnuplot not found, using plotters backend
utf8_view: single, no nulls
                        time:   [20.872 µs 20.906 µs 20.944 µs]
                        change: [−15.863% −15.614% −15.331%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) high mild
  5 (5.00%) high severe

utf8_view: single, nulls
                        time:   [22.968 µs 23.050 µs 23.130 µs]
                        change: [−17.796% −17.384% −16.918%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

utf8_view: multiple, no nulls
                        time:   [66.005 µs 66.155 µs 66.325 µs]
                        change: [−19.077% −18.785% −18.512%] (p = 0.00 < 0.05)
                        Performance has improved.

utf8_view: multiple, nulls
                        time:   [72.155 µs 72.375 µs 72.649 µs]
                        change: [−17.944% −17.612% −17.266%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe

utf8_view (small): single, no nulls
                        time:   [6.1401 µs 6.1563 µs 6.1747 µs]
                        change: [−69.623% −69.484% −69.333%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

utf8_view (small): single, nulls
                        time:   [10.234 µs 10.250 µs 10.270 µs]
                        change: [−53.969% −53.815% −53.666%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe

utf8_view (small): multiple, no nulls
                        time:   [20.853 µs 20.905 µs 20.961 µs]
                        change: [−66.006% −65.883% −65.759%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

utf8_view (small): multiple, nulls
                        time:   [32.519 µs 32.600 µs 32.675 µs]
                        change: [−53.937% −53.581% −53.232%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
```

</p>
</details> 

## Are there any user-facing changes?

Dandandan and others added 14 commits December 30, 2025 05:07
## Which issue does this PR close?


- Closes apache#13433

## Rationale for this change

Keep up to date with hashbrown and possible (performance) improvements,
and remove the `rawtable` functionality.


## What changes are included in this PR?

Upgrading, and fixing a case of nondeterminism in `HashSet` usage.

## Are these changes tested?

Yes, existing tests.

## Are there any user-facing changes?

No.
## Which issue does this PR close?


- Closes #.

## Rationale for this change
1. Add crypto function benchmark code
2. Simplify the inner digest function code
3. I checked for any performance changes using the benchmark code (1),
but there was no change.


## What changes are included in this PR?


## Are these changes tested?
yes.


## Are there any user-facing changes?
no

Co-authored-by: Jeffrey Vo <[email protected]>
…e#19559)

Bumps
[taiki-e/install-action](https://github.com/taiki-e/install-action) from
2.65.6 to 2.65.8.
<details>
<summary>Release notes (2.65.7, 2.65.8)</summary>

2.65.8

- Update `tombi@latest` to 0.7.14.
- Update `uv@latest` to 0.9.20.
- Update `typos@latest` to 1.40.1.

2.65.7

- Update `cargo-no-dev-deps@latest` to 0.2.19.
- Update `cargo-minimal-versions@latest` to 0.1.34.
- Update `cargo-insta@latest` to 1.45.1.
- Update `cargo-hack@latest` to 0.6.40.
- Update `dprint@latest` to 0.51.1.

</details>
<details>
<summary>Changelog</summary>

- [2.65.8] - 2025-12-30 and [2.65.7] - 2025-12-29: same entries as the release notes above.
- [2.65.6] - 2025-12-28: Update `dprint@latest` to 0.51.0, `vacuum@latest` to 0.22.0.
- [2.65.5] - 2025-12-27: Update `tombi@latest` to 0.7.12, `cargo-binstall@latest` to 1.16.6.
- [2.65.4] - 2025-12-27: Update `cargo-nextest@latest` to 0.9.116, `prek@latest` to 0.2.25.

</details>
<details>
<summary>Commits</summary>

- [`ff58103`](https://github.com/taiki-e/install-action/commit/ff581034fb69296c525e51afd68cb9823bfbe4ed) Release 2.65.8
- [`766eefa`](https://github.com/taiki-e/install-action/commit/766eefa747e7fe9a871104d27d7363a9afad3a06) Update changelog
- [`db03016`](https://github.com/taiki-e/install-action/commit/db0301613d90942e9412b9e0152d98b6a4f40942) Update `tombi@latest` to 0.7.14
- [`78f6380`](https://github.com/taiki-e/install-action/commit/78f63804f5bd1b38ca7dadb1af0afb328a2c701e) Update `uv@latest` to 0.9.20
- [`614b862`](https://github.com/taiki-e/install-action/commit/614b8622046c9f8bed8c8fec2093946b3b15dcf2) Update `typos@latest` to 1.40.1
- [`447ff35`](https://github.com/taiki-e/install-action/commit/447ff350f809b67060e03706fdf9a5f8b63be9fc) Update `tombi@latest` to 0.7.13
- [`4c6723e`](https://github.com/taiki-e/install-action/commit/4c6723ec9c638cccae824b8957c5085b695c8085) Release 2.65.7
- [`9ff1587`](https://github.com/taiki-e/install-action/commit/9ff15877d9cf002da5c45f2297f2bc736978a8d7) Update `cargo-no-dev-deps@latest` to 0.2.19
- [`4f0419f`](https://github.com/taiki-e/install-action/commit/4f0419fae32f77fcc791b79eb61b83694d208e4a) Update `cargo-minimal-versions@latest` to 0.1.34
- [`1eecdc5`](https://github.com/taiki-e/install-action/commit/1eecdc5eb1c3f829c89d7e6d794725f1bce75111) Update `cargo-insta@latest` to 1.45.1
- Additional commits viewable in [compare view](https://github.com/taiki-e/install-action/compare/28a9d316db64b78a951f3f8587a5d08cc97ad8eb...ff581034fb69296c525e51afd68cb9823bfbe4ed)

</details>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
## Which issue does this PR close?

- Closes apache#19292.

## Rationale for this change

Metadata on a schema for a record batch is lost during FFI conversion.
This is not always obvious because our other integrations like
`ExecutionPlan` and `TableProvider` have their own ways to provide the
schema.

This is currently an issue because we have FFI table providers that say
they have a specific schema, but the schema of the actual record batches
does not match; the metadata is dropped.

## What changes are included in this PR?

We already have the schema in the FFI object, so use it both when
converting to and from FFI.

## Are these changes tested?

Unit test added.

## Are there any user-facing changes?

No
…implementation (apache#19142)

Add infrastructure for row-level DML operations (DELETE/UPDATE) to the
TableProvider trait, enabling storage engines to implement SQL-based
mutations. Changes:

- Add TableProvider::delete_from() method for DELETE operations
- Add TableProvider::update() method for UPDATE operations
- Wire physical planner to route DML operations to TableProvider
- Implement DELETE/UPDATE for MemTable as reference implementation
- Add comprehensive sqllogictest coverage

This provides the API surface for downstream projects (iceberg-rust,
delta-rs) to implement DML without custom query planners.

## Which issue does this PR close?


- Closes apache#16959
- Related to apache#12406

## Rationale for this change

DataFusion parses DELETE/UPDATE but returns NotImplemented("Unsupported
logical plan: Dml(Delete)") at physical planning. Downstream projects
(iceberg-rust, delta-rs) must implement custom planners to work around
this.


## What changes are included in this PR?

Adds TableProvider hooks for row-level DML:

- delete_from(state, filters) - deletes rows matching filter predicates
- update(state, assignments, filters) - updates matching rows with new
values

Physical planner routes WriteOp::Delete and WriteOp::Update to these
methods. Tables that don't support DML return NotImplemented (the
default behavior). MemTable reference implementation demonstrates:

- Filter evaluation with SQL three-valued logic (NULL predicates don't
match)
- Multi-column updates with expression evaluation
- Proper row count reporting

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

Yes:

- Unit tests for physical planning: cargo test -p datafusion --test
custom_sources_cases
- SQL logic tests: dml_delete.slt and dml_update.slt with comprehensive
coverage

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

New trait methods on TableProvider:

```rust
async fn delete_from(&self, state: &dyn Session, filters: Vec<Expr>) -> Result<Arc<dyn ExecutionPlan>>;
async fn update(&self, state: &dyn Session, assignments: Vec<(String, Expr)>, filters: Vec<Expr>) -> Result<Arc<dyn ExecutionPlan>>;
```

Fully backward compatible. Default implementations return
NotImplemented.
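
As a hedged usage sketch (illustrative table and column names; assumes a
`MemTable` with the reference implementation is already registered as `t`),
DELETE/UPDATE statements planned through a `SessionContext` would exercise
the new hooks:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

// Illustrative only: assumes a MemTable named "t" with an integer column
// "id" has been registered with the context.
async fn run_dml(ctx: &SessionContext) -> Result<()> {
    // Routed to TableProvider::delete_from by the physical planner.
    ctx.sql("DELETE FROM t WHERE id > 10").await?.collect().await?;
    // Routed to TableProvider::update.
    ctx.sql("UPDATE t SET id = 0 WHERE id < 5").await?.collect().await?;
    Ok(())
}
```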

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Andrew Lamb <[email protected]>
…pache#19523)

## Which issue does this PR close?

None

## Rationale for this change

When I run the topk_aggregate benchmark, I find that most of the time is
spent generating the data, so the results do not really reflect the
performance of topk_aggregate.

```
cargo bench --bench topk_aggregate
```


### main branch


![main](https://github.com/user-attachments/assets/15bc0368-f870-4ad8-84f8-0b6e23e442cb)

### this pr

![this
pr](https://github.com/user-attachments/assets/bf92ec25-217a-40aa-8b0f-2cf89bea8737)


## What changes are included in this PR?

Extract the data generation out of the topk_aggregate benchmark so that
the measured time reflects only the aggregation itself.
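
For illustration, here is a minimal Criterion sketch of this pattern
(generic data and names, not the actual topk_aggregate benchmark): the
input is built once, outside the measured closure, so only the query work
is timed.

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

// Hypothetical benchmark illustrating the pattern only.
fn bench_topk(c: &mut Criterion) {
    // Data generation happens once, outside the measured closure.
    let data: Vec<u64> = (0..1_000_000).collect();
    c.bench_function("topk_max", |b| {
        // Only this closure is timed by Criterion.
        b.iter(|| black_box(data.iter().max()))
    });
}

criterion_group!(benches, bench_topk);
criterion_main!(benches);
```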

## Are these changes tested?


## Are there any user-facing changes?
Optimized lpad and rpad functions to eliminate per-row allocations by
reusing buffers for graphemes and fill characters.

The previous implementation allocated new Vec<&str> for graphemes and
Vec<char> for fill characters on every row, which was inefficient. This
optimization introduces reusable buffers that are allocated once and
cleared/refilled for each row.

Changes:
- lpad: Added graphemes_buf and fill_chars_buf outside loops, clear and
refill per row instead of allocating new Vec each time
- rpad: Added graphemes_buf outside loops to reuse across iterations
- Both functions now allocate buffers once and reuse them for all rows
- Buffers are cleared and reused for each row via .clear() and .extend()

Optimization impact:
- For lpad with fill parameter: Eliminates 2 Vec allocations per row
(graphemes + fill_chars)
- For lpad without fill: Eliminates 1 Vec allocation per row (graphemes)
- For rpad: Eliminates 1 Vec allocation per row (graphemes)

This optimization is particularly effective for:
- Large arrays with many rows
- Strings with multiple graphemes (unicode characters)
- Workloads with custom fill patterns

Benchmark results comparing main vs optimized branch:

lpad benchmarks:
- size=1024, str_len=5, target=20: 116.53 µs -> 63.226 µs (45.7% faster)
- size=1024, str_len=20, target=50: 314.07 µs -> 190.30 µs (39.4%
faster)
- size=4096, str_len=5, target=20: 467.35 µs -> 261.29 µs (44.1% faster)
- size=4096, str_len=20, target=50: 1.2286 ms -> 754.24 µs (38.6%
faster)

rpad benchmarks:
- size=1024, str_len=5, target=20: 113.89 µs -> 72.645 µs (36.2% faster)
- size=1024, str_len=20, target=50: 313.68 µs -> 202.98 µs (35.3%
faster)
- size=4096, str_len=5, target=20: 456.08 µs -> 295.57 µs (35.2% faster)
- size=4096, str_len=20, target=50: 1.2523 ms -> 818.47 µs (34.6%
faster)

Overall improvements: 35-46% faster across all workloads
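
A minimal sketch of the buffer-reuse pattern described above (assuming the
`unicode-segmentation` crate for grapheme iteration; illustrative, not the
actual lpad/rpad kernels):

```rust
use unicode_segmentation::UnicodeSegmentation;

// Counts graphemes per row while reusing a single buffer: the Vec is
// allocated once and cleared/refilled for each row instead of being
// reallocated on every iteration.
fn grapheme_counts(values: &[&str]) -> Vec<usize> {
    let mut graphemes_buf: Vec<&str> = Vec::new(); // allocated once
    let mut counts = Vec::with_capacity(values.len());
    for value in values {
        graphemes_buf.clear();                       // reuse the allocation
        graphemes_buf.extend(value.graphemes(true)); // refill for this row
        counts.push(graphemes_buf.len());
    }
    counts
}
```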

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Closes #.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
…ition is provided (apache#19553)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Closes #.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->


Replace `.chars().skip().collect::<String>()` with zero-copy string
slicing using `char_indices()` to find the byte offset, then slice with
`&value[byte_offset..]`.

This eliminates unnecessary String allocation per row when a start
position is specified.

Changes:
- Use char_indices().nth() to find the byte offset for the start position
  (1-based)
- Use string slicing &value[byte_offset..] instead of collecting chars
- Added benchmark to measure performance improvements

Optimization:
- Before: allocated a new String via .collect() for each row with a start
  position
- After: uses a zero-copy string slice

Benchmark results:
- size=1024, str_len=32: 96.361 µs -> 41.458 µs (57.0% faster, 2.3x
speedup)
- size=1024, str_len=128: 210.16 µs -> 56.064 µs (73.3% faster, 3.7x
speedup)
- size=4096, str_len=32: 376.90 µs -> 162.98 µs (56.8% faster, 2.3x
speedup)
- size=4096, str_len=128: 855.68 µs -> 263.61 µs (69.2% faster, 3.2x
speedup)

The optimization shows greater improvements for longer strings (up to
73% faster) since string slicing is O(1) regardless of length, while the
previous approach had allocation costs that grew with string length.
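
As a hedged illustration (not the exact DataFusion kernel), the zero-copy
approach looks roughly like this: find the byte offset of the 1-based start
position with `char_indices()` and return a borrowed slice instead of
collecting a new `String`.

```rust
// Hypothetical helper for illustration: returns the suffix of `value`
// starting at the 1-based character position `start`, without allocating.
fn suffix_from(value: &str, start: usize) -> &str {
    if start <= 1 {
        return value; // position 1 (or less) means the whole string
    }
    match value.char_indices().nth(start - 1) {
        // char_indices() yields byte offsets at char boundaries, so
        // slicing here is always valid UTF-8.
        Some((byte_offset, _)) => &value[byte_offset..],
        None => "", // start position is past the end of the string
    }
}
```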

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
## Which issue does this PR close?

- Closes apache#18827
- Closes apache#9654

## Rationale for this change

Now that the `DefaultListFilesCache` can be configured by users, it is
safe to enable it by default and fix the tests that caching broke!

## What changes are included in this PR?

- Sets the DefaultListFilesCache to be enabled by default
- Adds additional object store access tests to show list caching behavior
- Adds variable setting/reading sqllogic test cases
- Updates tests to disable caching when they relied on COPY commands, so
  changes can be detected for each query
- Updates docs to help users upgrade


## Are these changes tested?

Yes, additional test cases have been added to help show the behavior of
the caching

## Are there any user-facing changes?

Yes, this changes the default behavior of DataFusion, however this
information is already captured in the upgrade guide.

cc @alamb

---------

Co-authored-by: Andrew Lamb <[email protected]>
)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

Closes apache#17527

## Rationale for this change

Currently, DataFusion computes bounds for all queries that contain a
HashJoinExec node whenever the option enable_dynamic_filter_pushdown is
set to true (default). It might make sense to compute these bounds only
when we explicitly know there is a consumer that will use them.

## What changes are included in this PR?

As suggested in
apache#17527 (comment),
this PR adds an is_used() method to DynamicFilterPhysicalExpr that
checks if any consumers are holding a reference to the filter using
Arc::strong_count().

During filter pushdown, consumers that accept the filter and use it later
in execution have to retain a reference to the `Arc`, for example scan
nodes such as `ParquetSource`.
## Are these changes tested?

I added a unit test in dynamic_filters.rs (test_is_used) that verifies
the Arc reference counting behavior.
Existing integration tests in
datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs validate
the end-to-end behavior. These tests verify that dynamic filters are
computed and filled when consumers are present.



## Are there any user-facing changes?

A new `is_used()` method on `DynamicFilterPhysicalExpr`.
## Which issue does this PR close?

N/A

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

Clean up some signatures & unnecessary code in string functions

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Various refactors, see comments.

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Existing tests.

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

No.

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->