[SPARK-56286][PYTHON] Add DataFrame.dataQuality API for column profiling #55095

Open

sougata99 wants to merge 3 commits into apache:master from sougata99:master

Conversation

@sougata99

What changes were proposed in this pull request?

This PR adds a new PySpark DataFrame.dataQuality() API for exploratory dataset profiling.

The new method returns a DataFrame with one row per input column and one synthetic __dataset__ row for overall dataset-level completeness metrics. The output includes row_count, column_count, total_cells, non_null_count, null_count, null_ratio, distinct_count, min, max, and mode. For numeric columns, it also includes mean, stddev, and median.

This PR also adds PySpark unit test coverage for the new API, including null handling, NaN handling for floating-point columns, numeric profiling, categorical mode, and overall dataset metrics.

Why are the changes needed?

PySpark currently provides describe() and summary(), but there is no built-in API focused on practical data quality profiling.

A common early step in exploratory analysis is understanding completeness and basic quality characteristics of a dataset, such as null distribution, distinct values, central tendency, and per-column summary metrics. Today, users typically have to compose several custom aggregations to gather this information. This change makes that workflow easier and more discoverable through a single DataFrame API.

Does this PR introduce any user-facing change?

Yes.

This PR introduces a new PySpark API:

df.dataQuality()

Example:

df = spark.createDataFrame(
    [(1, 10.0, "ok"), (2, None, "ok"), (None, 30.0, None)],
    ["id", "score", "status"],
)

df.dataQuality().show()

This allows users to retrieve column-level and dataset-level quality metrics directly from a DataFrame without composing multiple manual aggregations.

How was this patch tested?

Added a new PySpark test in python/pyspark/sql/tests/test_dataframe.py.

The test covers:

  • per-column null and non-null counts
  • dataset-level completeness metrics
  • distinct counts
  • mean and median for numeric columns
  • mode for categorical columns
  • NaN handling for floating-point columns

I also verified that the modified Python files compile successfully with:

python -m py_compile python/pyspark/sql/dataframe.py python/pyspark/sql/tests/test_dataframe.py

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex (GPT-5)

Contributor

@allisonwang-db left a comment


Thanks for the contribution! I have concerns about whether dataQuality() belongs as a built-in DataFrame API in PySpark. It overlaps significantly with the existing describe() and summary() APIs, which already provide count, mean, stddev, min, and max. I'd suggest discussing the design on the dev mailing list before proceeding. cc @HyukjinKwon

@sougata99
Author

Thanks for the feedback @allisonwang-db. That makes sense.

My intention was to provide a more data-quality-focused profiling API that includes metrics such as null counts, null ratios, distinct counts, mode, median, and a dataset-level summary row, which are not directly available from describe() or summary() today.

That said, I agree this is a public API design question and should be discussed more broadly first. I’m happy to start a thread on the dev mailing list to gather feedback on whether this should be a new DataFrame API, an extension of an existing API, or something else.
