[SPARK-56253][PYTHON][CONNECT] Make spark.read.json accept DataFrame input #55097
Yicong-Huang wants to merge 3 commits into apache:master
Conversation
```python
if isinstance(path, DataFrame):
    assert self._spark._jvm is not None
    string_encoder = self._spark._jvm.Encoders.STRING()
    jdataset = getattr(path._jdf, "as")(string_encoder)
```
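As an aside, `getattr(path._jdf, "as")(...)` is needed because `as` is a reserved keyword in Python and cannot appear after a dot. A minimal stand-alone illustration with a stub object (names hypothetical, not the actual Py4J classes):

```python
class JObjectStub:
    """Stand-in for a Py4J JavaObject whose Java method is named `as`."""
    pass

stub = JObjectStub()
# `stub.as(...)` would be a SyntaxError, so the attribute is attached and
# invoked via its string name instead:
setattr(stub, "as", lambda encoder: f"Dataset[{encoder}]")
result = getattr(stub, "as")("String")
print(result)  # Dataset[String]
```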
Can we add a private overload on the JVM side, `def json(jsonDataset: DataFrame): DataFrame`, and call it directly here?
What would be the benefit?
I think it can simplify the Python-side call site a bit.
Another trivial benefit is that it reduces the number of Py4J calls.
```python
result = self.spark.read.json(json_df, schema="name STRING, age INT")
expected = [Row(name="Alice", age=25), Row(name="Bob", age=30)]
self.assertEqual(sorted(result.collect(), key=lambda r: r.name), expected)
```
The JVM-side implementations accept `RDD[String]` and `Dataset[String]`, but now we are passing a `DataFrame` from the Python side.
So I think we need to add negative test cases like:
1. a single column of non-string type;
2. multiple columns;
3. zero columns.
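A hypothetical pure-Python sketch of the validation those negative cases would exercise (the helper name, error messages, and `(name, typeName)` schema representation are assumptions for illustration, not the actual PR code):

```python
def validate_json_input_schema(fields):
    """Require a schema (given as (name, typeName) pairs) to be exactly
    one string column -- a stand-in for whatever check the reader or the
    JVM side ultimately performs."""
    if len(fields) == 0:
        raise ValueError("input DataFrame has no columns")
    if len(fields) > 1:
        raise ValueError("input DataFrame must have exactly one column")
    name, type_name = fields[0]
    if type_name != "string":
        raise ValueError(f"column '{name}' must be string type, got {type_name}")


# The three negative shapes the review lists, each expected to fail:
negatives = [
    [("value", "int")],                   # 1. single column, non-string type
    [("a", "string"), ("b", "string")],   # 2. multiple columns
    [],                                   # 3. zero columns
]
for bad in negatives:
    try:
        validate_json_input_schema(bad)
        raise AssertionError("expected ValueError for %r" % bad)
    except ValueError:
        pass

validate_json_input_schema([("value", "string")])  # positive case passes
```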
```python
from pyspark.sql.connect.dataframe import DataFrame
```
```python
if isinstance(path, DataFrame):
```
Do we need to check on the Python side that the schema contains only a single string-typed column, or do we rely on the check on the JVM side (or the Connect server)?
What changes were proposed in this pull request?
Allow `spark.read.json()` to accept a DataFrame with a single string column as input, in addition to file paths and RDDs.

Why are the changes needed?
Parsing in-memory JSON text into a structured DataFrame currently requires `sc.parallelize()`, which is unavailable on Spark Connect. Accepting a DataFrame as input provides a Connect-compatible alternative. This is the inverse of `DataFrame.toJSON()`. Part of SPARK-55227.
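Schematically, the new dispatch can be pictured with plain Python stand-ins (the `FakeDataFrame` class and `read_json` helper below are illustrative only, not the actual PySpark implementation):

```python
import json


class FakeDataFrame:
    """Stand-in for a single-string-column DataFrame of JSON text."""
    def __init__(self, rows):
        self.rows = rows


def read_json(source):
    """Dispatch on input type the way the extended reader does: the new
    branch consumes a DataFrame of JSON strings with no sc.parallelize()."""
    if isinstance(source, FakeDataFrame):
        return [json.loads(r) for r in source.rows]
    if isinstance(source, str):
        raise NotImplementedError("file paths are not modeled in this sketch")
    raise TypeError(f"unsupported input type: {type(source).__name__}")


df = FakeDataFrame(['{"name": "Alice", "age": 25}', '{"name": "Bob", "age": 30}'])
print(read_json(df))  # [{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 30}]
```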
Does this PR introduce any user-facing change?
Yes. `spark.read.json()` now accepts a single-string-column DataFrame as input.

How was this patch tested?
New tests in `test_datasources.py` (classic) and `test_connect_readwriter.py` (Connect).

Was this patch authored or co-authored using generative AI tooling?
No.