[SPARK-56253][PYTHON][CONNECT] Make spark.read.json accept DataFrame input #55097

Open: Yicong-Huang wants to merge 3 commits into apache:master from Yicong-Huang:SPARK-56253

Conversation

@Yicong-Huang (Contributor):

What changes were proposed in this pull request?

Allow spark.read.json() to accept a DataFrame with a single string column as input, in addition to file paths and RDDs.

Why are the changes needed?

Parsing in-memory JSON text into a structured DataFrame currently requires sc.parallelize(), which is unavailable on Spark Connect. Accepting a DataFrame as input provides a Connect-compatible alternative. This is the inverse of DataFrame.toJSON().

Part of SPARK-55227.

Does this PR introduce any user-facing change?

Yes. spark.read.json() now accepts a single-string-column DataFrame as input.
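
As an illustration of the proposed API (not runnable against released PySpark; `json_df` and the literal rows here are made-up example data, not from the PR):

```python
# Assumes an active SparkSession `spark`.
# A DataFrame with a single string column, each row holding one JSON document.
json_df = spark.createDataFrame(
    [('{"name": "Alice", "age": 25}',), ('{"name": "Bob", "age": 30}',)],
    schema="value STRING",
)

# Previously this required sc.parallelize(...), which Spark Connect lacks;
# with this PR the DataFrame can be passed to spark.read.json directly.
parsed = spark.read.json(json_df, schema="name STRING, age INT")
```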

How was this patch tested?

New tests in test_datasources.py (classic) and test_connect_readwriter.py (Connect).

Was this patch authored or co-authored using generative AI tooling?

No

if isinstance(path, DataFrame):
    # py4j bridge: view the JVM DataFrame as a Dataset[String] so it can be
    # passed to the JVM DataFrameReader.json(Dataset[String]) overload.
    assert self._spark._jvm is not None
    string_encoder = self._spark._jvm.Encoders.STRING()
    # "as" is a Python keyword, so the JVM method is called via getattr.
    jdataset = getattr(path._jdf, "as")(string_encoder)
Contributor:

Can we add a private overload on the JVM side, def json(jsonDataset: DataFrame): DataFrame, and call it directly here?

Contributor:

What would be the benefit?

Contributor:

I think it can simplify the Python-side call site a bit.
Another trivial benefit is reducing the number of py4j calls.

result = self.spark.read.json(json_df, schema="name STRING, age INT")
expected = [Row(name="Alice", age=25), Row(name="Bob", age=30)]
self.assertEqual(sorted(result.collect(), key=lambda r: r.name), expected)

@zhengruifeng (Contributor), Apr 1, 2026:

The JVM-side implementations accept RDD[String] and Dataset[String], but now we are passing a DataFrame on the Python side.

So I think we need to add negative test cases like:
1. a single column of non-string type;
2. multiple columns;
3. zero columns.


from pyspark.sql.connect.dataframe import DataFrame

if isinstance(path, DataFrame):
Contributor:

Do we need to check the schema on the Python side, so that it only contains a single string-type column? Or do we depend on the check on the JVM side (or the Connect server)?
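
If the check were done on the Python side, it could be a small helper over the schema fields. A minimal sketch, written against plain (name, type) pairs so the logic is easy to see in isolation; the helper name and error messages are assumptions, not the PR's actual code:

```python
def check_single_string_column(fields):
    """Validate that a schema, given as (name, dataType) pairs, has
    exactly one column and that that column is of string type.

    Illustrative only: mirrors the Python-side check discussed above,
    covering the three negative cases from the review (zero columns,
    multiple columns, single non-string column).
    """
    if len(fields) != 1:
        raise ValueError(
            f"Expected a DataFrame with exactly one string column, "
            f"got {len(fields)} columns"
        )
    name, data_type = fields[0]
    if data_type != "string":
        raise ValueError(
            f"Expected column '{name}' to be of string type, got {data_type}"
        )


# The three negative cases, plus the accepted shape:
for bad in ([], [("a", "string"), ("b", "string")], [("value", "bigint")]):
    try:
        check_single_string_column(bad)
        raise AssertionError("should have raised ValueError")
    except ValueError:
        pass

check_single_string_column([("value", "string")])  # single string column: OK
```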

