feat(python/sedonadb): add aggregate expression methods on Expr by jiayuasu · Pull Request #882 · apache/sedona-db

jiayuasu · 2026-05-27T05:52:47Z

First step toward grouping/aggregation, per the reprioritization recorded on #791 (aggregation/join/UDFs ahead of the remaining small schema ops).

What's new

Aggregate builder methods on Expr:

col("x").sum()      # Expr(sum(x))
col("x").count()    # Expr(count(x))
col("x").mean()     # Expr(avg(x))
col("x").min()      # Expr(min(x))
col("x").max()      # Expr(max(x))

Methods on Expr (matching Polars / PySpark) rather than free functions — consistent with the operator-method pattern already on Expr.
mean maps to DataFusion's avg aggregate but is named for the pandas/Polars vocabulary.
Each wraps the corresponding helper from datafusion::functions_aggregate::expr_fn.
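
For illustration only, the method-on-Expr shape described above can be sketched in pure Python. This is a toy stand-in, not the sedonadb implementation: the real methods delegate to Rust bindings over DataFusion, while the `Expr` class, its `_agg` helper, and `col` below are hypothetical and only build strings so the repr behaviour is visible.

```python
class Expr:
    """Toy expression node (hypothetical); stores only its rendered text."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __repr__(self) -> str:
        return f"Expr({self.text})"

    def __add__(self, other: "Expr") -> "Expr":
        # Operator method: builds a compound expression.
        return Expr(f"{self.text} + {other.text}")

    def _agg(self, func: str) -> "Expr":
        # Wrap this expression in a named aggregate call.
        return Expr(f"{func}({self.text})")

    def sum(self) -> "Expr":
        return self._agg("sum")

    def count(self) -> "Expr":
        return self._agg("count")

    def mean(self) -> "Expr":
        # Python-facing name is `mean`; the underlying aggregate is `avg`.
        return self._agg("avg")

    def min(self) -> "Expr":
        return self._agg("min")

    def max(self) -> "Expr":
        return self._agg("max")


def col(name: str) -> Expr:
    """Toy column reference (hypothetical helper)."""
    return Expr(name)


print(repr(col("x").mean()))
print(repr((col("x") + col("y")).sum()))
```

In this sketch an aggregate is just another `Expr`, so it composes with the operator methods the same way the compound-expression case in the test plan does.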

Scope

This PR ships only the expression builders. The resulting Expr is an aggregate AST node — valid only inside an aggregation context (DataFrame.agg(...) / group_by().agg()), which come in follow-up PRs. Shipping the builders in isolation mirrors how the original Expr foundation (#807) landed before any DataFrame integration existed; tests are repr-level.

Planned follow-ups:

df.agg(*exprs) — global (ungrouped) aggregation.
df.group_by(*keys).agg(*exprs) — the GroupedDataFrame layer.

Test plan

6 tests in tests/expr/test_expression.py:

Parametrized exact-repr() check for all five aggregates.
Aggregate over a compound expression ((col("x") + col("y")).sum() → Expr(sum(x + y))).

Local: 54 expression tests + 15 doctests + ruff format + ruff check all clean.


Copilot

Pull request overview

Adds aggregate expression builder methods (sum, count, mean, min, max) to the Python Expr type, wrapping DataFusion's aggregate function helpers. This is a small, isolated step toward the upcoming DataFrame.agg / group_by().agg() integration tracked in #791.

Changes:

Rust PyExpr gains five aggregate methods that wrap datafusion::functions_aggregate::expr_fn helpers.
Python Expr exposes corresponding methods with docstring examples; mean maps to DataFusion's avg.
Tests verify repr() output for each aggregate and for an aggregate over a compound expression.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
python/sedonadb/src/expr.rs	Adds `sum`/`count`/`mean`/`min`/`max` PyO3 methods on `PyExpr` delegating to DataFusion aggregate helpers.
python/sedonadb/python/sedonadb/expr/expression.py	Adds the Python-side `Expr` aggregate methods with docstring examples.
python/sedonadb/tests/expr/test_expression.py	Parametrized repr tests for the five aggregates plus a compound-expression case.


jiayuasu · 2026-05-28T05:06:20Z

Closing in favor of the approach in #885 — exposing scalar and aggregate UDFs through a registry-walking sd.funcs.<name>(args) dispatch rather than hand-rolling per-function methods on Expr. That mechanism:

avoids growing the Rust bindings by one method per aggregate,
automatically surfaces plugin and Python-registered UDFs that a hard-coded list would miss,
uses one path for scalar (spatial st_*) and aggregate functions instead of two parallel mechanisms.

Once #885 lands I'll build DataFrame.agg(...) and DataFrame.group_by(...).agg(...) on top, with call sites like sd.funcs.sum(col("x")). The repr-level tests from this PR will move to that follow-up.
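
The general idea of a registry-backed `funcs.<name>(args)` namespace can be sketched as follows. This is a hedged illustration of attribute-based dispatch, not the #885 implementation: `FunctionNamespace`, `make_call`, and the plain `dict` registry are all hypothetical, and the functions return strings rather than real expression nodes.

```python
from typing import Callable, Dict


class FunctionNamespace:
    """Resolve `funcs.<name>(*args)` by looking up `name` in a registry (hypothetical)."""

    def __init__(self, registry: Dict[str, Callable[..., str]]) -> None:
        self._registry = registry

    def __getattr__(self, name: str) -> Callable[..., str]:
        # Called only when normal attribute lookup fails.
        # Private names are never dispatched, which also avoids recursion
        # if `_registry` itself is missing.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._registry[name]
        except KeyError:
            raise AttributeError(f"no registered function named {name!r}") from None

    def __dir__(self):
        # Expose registered names to tab completion.
        return sorted(self._registry)


def make_call(name: str) -> Callable[..., str]:
    """Build a toy function that renders a call as text (hypothetical helper)."""

    def build(*args: str) -> str:
        return f"{name}({', '.join(args)})"

    return build


# One registry holds both aggregate and scalar names.
registry = {name: make_call(name) for name in ("sum", "count", "st_area")}
funcs = FunctionNamespace(registry)

# A function registered later is visible without adding a new method.
registry["my_udf"] = make_call("my_udf")

print(funcs.sum("x"))
print(funcs.my_udf("x", "y"))
```

Because lookup happens at attribute-access time, the namespace needs no per-function code, which is the property the closing comment contrasts with one hand-written method per aggregate.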

github-actions Bot requested a review from zhangfengcdt May 27, 2026 05:53

jiayuasu requested review from Copilot and paleolimbot May 27, 2026 06:39

Copilot started reviewing on behalf of jiayuasu May 27, 2026 06:39

Copilot AI reviewed May 27, 2026


jiayuasu closed this May 28, 2026

jiayuasu wants to merge 1 commit into apache:main from jiayuasu:feature/expr-aggregates


2 participants
