feat(python/sedonadb): add aggregate expression methods on Expr #882

Closed
jiayuasu wants to merge 1 commit into apache:main from jiayuasu:feature/expr-aggregates

@jiayuasu
Member

First step toward grouping/aggregation, per the reprioritization recorded on #791 (aggregation/join/UDFs ahead of the remaining small schema ops).

What's new

Aggregate builder methods on Expr:

col("x").sum()      # Expr(sum(x))
col("x").count()    # Expr(count(x))
col("x").mean()     # Expr(avg(x))
col("x").min()      # Expr(min(x))
col("x").max()      # Expr(max(x))
  • Methods on Expr (matching Polars / PySpark) rather than free functions — consistent with the operator-method pattern already on Expr.
  • mean maps to DataFusion's avg aggregate but is named for the pandas/Polars vocabulary.
  • Each wraps the corresponding helper from datafusion::functions_aggregate::expr_fn.
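
The builder pattern described above can be sketched in plain Python. This is a toy stand-in, not the sedonadb implementation: `ToyExpr` and its `_agg` helper are hypothetical names, and the real methods delegate to DataFusion's Rust helpers rather than formatting strings. The sketch only shows how each method wraps its receiver in an aggregate node without evaluating anything, and how `mean` renders as `avg`.

```python
class ToyExpr:
    """Hypothetical stand-in for Expr that stores only the rendered text."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __add__(self, other: "ToyExpr") -> "ToyExpr":
        # Binary operator method, mirroring the operator-method pattern.
        return ToyExpr(f"{self.text} + {other.text}")

    def _agg(self, name: str) -> "ToyExpr":
        # Wrap the receiver in an aggregate call; nothing is evaluated here.
        return ToyExpr(f"{name}({self.text})")

    def sum(self) -> "ToyExpr":
        return self._agg("sum")

    def count(self) -> "ToyExpr":
        return self._agg("count")

    def mean(self) -> "ToyExpr":
        # The Python-facing name is `mean`; the rendered aggregate is `avg`.
        return self._agg("avg")

    def min(self) -> "ToyExpr":
        return self._agg("min")

    def max(self) -> "ToyExpr":
        return self._agg("max")

    def __repr__(self) -> str:
        return f"Expr({self.text})"


def col(name: str) -> ToyExpr:
    """Build a column reference (hypothetical helper for this sketch)."""
    return ToyExpr(name)


print(repr(col("x").mean()))               # Expr(avg(x))
print(repr((col("x") + col("y")).sum()))   # Expr(sum(x + y))
```

Because the methods return a new node each time, they compose with the existing operator methods, which is what the compound-expression test in this PR relies on.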

Scope

This PR ships only the expression builders. The resulting Expr is an aggregate AST node — valid only inside an aggregation context (DataFrame.agg(...) / group_by().agg()), which come in follow-up PRs. Shipping the builders in isolation mirrors how the original Expr foundation (#807) landed before any DataFrame integration existed; tests are repr-level.

Planned follow-ups:

  1. df.agg(*exprs) — global (ungrouped) aggregation.
  2. df.group_by(*keys).agg(*exprs) — the GroupedDataFrame layer.
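
To make the difference between the two follow-ups concrete, here is a small pure-Python illustration of global versus grouped aggregation over a list of rows. The function names `agg_global` and `agg_grouped` are hypothetical and exist only for this sketch; they say nothing about how `DataFrame.agg` or `group_by().agg()` will be implemented.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable, List

Row = Dict[str, Any]


def agg_global(rows: List[Row], column: str, reducer: Callable[[List[Any]], Any]) -> Any:
    """Ungrouped aggregation: reduce one column over every row to a single value."""
    return reducer([row[column] for row in rows])


def agg_grouped(
    rows: List[Row], key: str, column: str, reducer: Callable[[List[Any]], Any]
) -> Dict[Hashable, Any]:
    """Grouped aggregation: reduce one column separately for each distinct key."""
    buckets: Dict[Hashable, List[Any]] = defaultdict(list)
    for row in rows:
        buckets[row[key]].append(row[column])
    return {group: reducer(values) for group, values in buckets.items()}


rows = [
    {"city": "a", "x": 1},
    {"city": "a", "x": 2},
    {"city": "b", "x": 10},
]

print(agg_global(rows, "x", sum))           # 13
print(agg_grouped(rows, "city", "x", sum))  # {'a': 3, 'b': 10}
```

The same aggregate expression yields one value in the global case and one value per key in the grouped case, which is why the builders are only meaningful once one of these contexts exists.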

Test plan

6 tests in tests/expr/test_expression.py:

  • Parametrized exact-repr() check for all five aggregates.
  • Aggregate over a compound expression ((col("x") + col("y")).sum() -> Expr(sum(x + y))).

Local: 54 expression tests + 15 doctests + ruff format + ruff check all clean.

First step toward grouping/aggregation (per the reprioritization on
apache#791). Adds aggregate builder methods to Expr:

    col("x").sum()      -> Expr(sum(x))
    col("x").count()    -> Expr(count(x))
    col("x").mean()     -> Expr(avg(x))
    col("x").min()      -> Expr(min(x))
    col("x").max()      -> Expr(max(x))

These follow the operator-method pattern already on Expr (matching
Polars / PySpark) rather than free functions. `mean` maps to
DataFusion's `avg` aggregate but is named to match the pandas/Polars
vocabulary.

Each wraps the corresponding helper from
`datafusion::functions_aggregate::expr_fn`. The resulting Expr is an
aggregate AST node — valid only inside an aggregation context, which
arrives in the follow-up `DataFrame.agg` / `group_by().agg()` PRs.
This PR ships the builders + repr-level tests in isolation, the same
way the original Expr foundation (apache#807) landed before DataFrame
integration existed.

Tests: parametrized exact-repr check for all five aggregates plus an
aggregate-over-compound-expression case.
Contributor

Copilot AI left a comment

Pull request overview

Adds aggregate expression builder methods (sum, count, mean, min, max) to the Python Expr type, wrapping DataFusion's aggregate function helpers. This is a small, isolated step toward the upcoming DataFrame.agg / group_by().agg() integration tracked in #791.

Changes:

  • Rust PyExpr gains five aggregate methods that wrap datafusion::functions_aggregate::expr_fn helpers.
  • Python Expr exposes corresponding methods with docstring examples; mean maps to DataFusion's avg.
  • Tests verify repr() output for each aggregate and for an aggregate over a compound expression.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| python/sedonadb/src/expr.rs | Adds sum/count/mean/min/max PyO3 methods on PyExpr delegating to DataFusion aggregate helpers. |
| python/sedonadb/python/sedonadb/expr/expression.py | Adds the Python-side Expr aggregate methods with docstring examples. |
| python/sedonadb/tests/expr/test_expression.py | Parametrized repr tests for the five aggregates plus a compound-expression case. |

@jiayuasu
Member Author

Closing in favor of the approach in #885 — exposing scalar and aggregate UDFs through a registry-walking sd.funcs.<name>(args) dispatch rather than hand-rolling per-function methods on Expr. That mechanism:

  • avoids growing the Rust bindings by one method per aggregate,
  • automatically surfaces plugin and Python-registered UDFs that a hard-coded list would miss,
  • uses one path for scalar (spatial st_*) and aggregate functions instead of two parallel mechanisms.
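
A rough sketch of what registry-driven dispatch can look like in Python follows. It is an assumption-laden illustration, not the design in #885: `FunctionRegistry`, `Funcs`, and `call_builder` are hypothetical names, the builders return strings instead of real expressions, and `st_area` and `my_udf` are used only as example function names.

```python
from typing import Callable, Dict, List


def call_builder(name: str) -> Callable[..., str]:
    """Return a builder that renders a call to `name` (string stand-in for an Expr)."""

    def build(*args: str) -> str:
        return f"{name}({', '.join(args)})"

    return build


class FunctionRegistry:
    """Hypothetical registry mapping function names to expression builders."""

    def __init__(self) -> None:
        self._builders: Dict[str, Callable[..., str]] = {}

    def register(self, name: str, builder: Callable[..., str]) -> None:
        self._builders[name] = builder

    def lookup(self, name: str) -> Callable[..., str]:
        return self._builders[name]

    def names(self) -> List[str]:
        return sorted(self._builders)


class Funcs:
    """Hypothetical namespace that resolves attributes through the registry on access."""

    def __init__(self, registry: FunctionRegistry) -> None:
        self._registry = registry

    def __getattr__(self, name: str) -> Callable[..., str]:
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._registry.lookup(name)
        except KeyError:
            raise AttributeError(f"no function named {name!r} is registered") from None

    def __dir__(self) -> List[str]:
        # Lets tab completion list whatever is currently registered.
        return self._registry.names()


registry = FunctionRegistry()
for builtin in ("sum", "count", "avg", "min", "max", "st_area"):
    registry.register(builtin, call_builder(builtin))

funcs = Funcs(registry)
print(funcs.sum("x"))        # sum(x)
print(funcs.st_area("geom")) # st_area(geom)

# A function registered later is visible without adding any new method.
registry.register("my_udf", call_builder("my_udf"))
print(funcs.my_udf("geom", "10"))  # my_udf(geom, 10)
```

The point of the pattern is that the namespace holds no per-function code: scalar and aggregate names resolve through the same lookup, and anything added to the registry later shows up automatically.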

Once #885 lands I'll build DataFrame.agg(...) and DataFrame.group_by(...).agg(...) on top, with call sites like sd.funcs.sum(col("x")). The repr-level tests from this PR will move to that follow-up.

@jiayuasu jiayuasu closed this May 28, 2026

2 participants