[SPARK-56286][PYTHON] Add DataFrame.dataQuality API for column profiling #55095

Open

sougata99 wants to merge 3 commits into apache:master from sougata99:master

Conversation

@sougata99

What changes were proposed in this pull request?

This PR adds a new PySpark DataFrame.dataQuality() API for exploratory dataset profiling.

The new method returns a DataFrame with one row per input column and one synthetic __dataset__ row for overall dataset-level completeness metrics. The output includes row_count, column_count, total_cells, non_null_count, null_count, null_ratio, distinct_count, min, max, and mode. For numeric columns, it also includes mean, stddev, and median.

This PR also adds PySpark unit test coverage for the new API, including null handling, NaN handling for floating-point columns, numeric profiling, categorical mode, and overall dataset metrics.

Why are the changes needed?

PySpark currently provides describe() and summary(), but there is no built-in API focused on practical data quality profiling.

A common early step in exploratory analysis is understanding completeness and basic quality characteristics of a dataset, such as null distribution, distinct values, central tendency, and per-column summary metrics. Today, users typically have to compose several custom aggregations to gather this information. This change makes that workflow easier and more discoverable through a single DataFrame API.

Does this PR introduce any user-facing change?

Yes.

This PR introduces a new PySpark API:

df.dataQuality()

Example:

df = spark.createDataFrame(
    [(1, 10.0, "ok"), (2, None, "ok"), (None, 30.0, None)],
    ["id", "score", "status"],
)

df.dataQuality().show()

This allows users to retrieve column-level and dataset-level quality metrics directly from a DataFrame without composing multiple manual aggregations.

How was this patch tested?

Added a new PySpark test in python/pyspark/sql/tests/test_dataframe.py.

The test covers:

  • per-column null and non-null counts
  • dataset-level completeness metrics
  • distinct counts
  • mean and median for numeric columns
  • mode for categorical columns
  • NaN handling for floating-point columns

I also verified that the modified Python files compile successfully with:

python -m py_compile python/pyspark/sql/dataframe.py python/pyspark/sql/tests/test_dataframe.py

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex (GPT-5)

Contributor

@allisonwang-db left a comment


Thanks for the contribution! I have concerns about whether dataQuality() belongs as a built-in DataFrame API in PySpark. It overlaps significantly with the existing describe() and summary() APIs, which already provide count, mean, stddev, min, and max. I'd suggest discussing the design on the dev mailing list before proceeding. cc @HyukjinKwon

@sougata99
Author

Thanks for the feedback @allisonwang-db. That makes sense.

My intention was to provide a more data-quality-focused profiling API that includes metrics such as null counts, null ratios, distinct counts, mode, median, and a dataset-level summary row, which are not directly available from describe() or summary() today.

That said, I agree this is a public API design question and should be discussed more broadly first. I’m happy to start a thread on the dev mailing list to gather feedback on whether this should be a new DataFrame API, an extension of an existing API, or something else.
