[SPARK-56176][SQL] V2-native ANALYZE TABLE/COLUMN with stats propagation to FileScan by LuciferYang · Pull Request #55108 · apache/spark

LuciferYang · 2026-03-31T04:03:02Z

What changes were proposed in this pull request?

This PR is part of SPARK-56170. It implements V2-native ANALYZE TABLE and ANALYZE COLUMN for file tables, removing the last V1 fallbacks in ResolveSessionCatalog, and propagates stored statistics through FileScan for query optimization.

1. V2-native ANALYZE TABLE (`AnalyzeTableExec`)

New physical plan node that computes table statistics and persists them as table properties via TableCatalog.alterTable(TableChange.setProperty(...)):

- spark.sql.statistics.totalSize: computed from fileIndex.sizeInBytes
- spark.sql.statistics.numRows: computed from df.count() (skipped when NOSCAN)
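
As a rough illustration of the property layout described for ANALYZE TABLE, the following plain-Python sketch builds the key/value pairs that would be persisted. It is not Spark code: `table_stats_properties` and its parameters are hypothetical names, and leaving the row-count key out under NOSCAN is an assumption based only on the wording "skipped when NOSCAN".

```python
from typing import Dict, Optional

# Property keys quoted from the PR description.
TOTAL_SIZE_KEY = "spark.sql.statistics.totalSize"
NUM_ROWS_KEY = "spark.sql.statistics.numRows"


def table_stats_properties(size_in_bytes: int,
                           row_count: Optional[int],
                           noscan: bool) -> Dict[str, str]:
    """Model the table properties an ANALYZE TABLE run would persist.

    Hypothetical helper: it mirrors the description above, not the
    actual AnalyzeTableExec implementation.
    """
    # Table properties are string-valued, so numbers are stringified.
    props = {TOTAL_SIZE_KEY: str(size_in_bytes)}
    # Assumption: with NOSCAN no row count is computed, so the key is omitted.
    if not noscan and row_count is not None:
        props[NUM_ROWS_KEY] = str(row_count)
    return props


full = table_stats_properties(size_in_bytes=4096, row_count=100, noscan=False)
quick = table_stats_properties(size_in_bytes=4096, row_count=None, noscan=True)
```

In the real node each entry would be turned into a TableChange.setProperty(...) and passed to TableCatalog.alterTable, as stated above.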

2. V2-native ANALYZE COLUMN (`AnalyzeColumnExec`)

New physical plan node that computes column-level statistics using CommandUtils.computeColumnStats and persists them as table properties:

- spark.sql.statistics.colStats.<col>.<stat>: min, max, nullCount, distinctCount, avgLen, maxLen
- Supports both FOR COLUMNS and FOR ALL COLUMNS syntax
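
To make the `spark.sql.statistics.colStats.<col>.<stat>` naming concrete, here is a small plain-Python sketch that expands one column's statistics into property keys. `column_stats_properties` is a hypothetical helper, and skipping statistics that were not computed is an assumption; the PR description does not say how missing values are handled.

```python
from typing import Dict, Mapping

# Prefix and stat names quoted from the PR description.
COL_STATS_PREFIX = "spark.sql.statistics.colStats"
STAT_NAMES = ("min", "max", "nullCount", "distinctCount", "avgLen", "maxLen")


def column_stats_properties(column: str,
                            stats: Mapping[str, object]) -> Dict[str, str]:
    """Expand one column's stats into <prefix>.<col>.<stat> properties.

    Hypothetical helper modelling the key layout only.
    """
    props: Dict[str, str] = {}
    for name in STAT_NAMES:
        value = stats.get(name)
        # Assumption: a stat that was not computed is simply not written.
        if value is not None:
            props[f"{COL_STATS_PREFIX}.{column}.{name}"] = str(value)
    return props


age_props = column_stats_properties("age", {"min": 1, "max": 90, "nullCount": 0})
```

With FOR ALL COLUMNS, the same expansion would presumably run once per column of the table.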

3. Stats propagation to FileScan

- FileScan.numRows(): reads stored row count from __numRows option (set by ANALYZE TABLE)
- FileTable.mergedOptions: injects __numRows from catalogTable.stats.rowCount into scan options so FileScan.estimateStatistics() can use it
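
The hand-off between the table and the scan can be pictured with the plain-Python model below. Only the `__numRows` option name comes from the description; `merged_options` and `scan_num_rows` are hypothetical stand-ins for FileTable.mergedOptions and FileScan.numRows(), and returning `None` when no count is stored is an assumption.

```python
from typing import Dict, Mapping, Optional

# Option name quoted from the PR description.
NUM_ROWS_OPTION = "__numRows"


def merged_options(options: Mapping[str, str],
                   catalog_row_count: Optional[int]) -> Dict[str, str]:
    """Model of the table side: copy the scan options and inject the
    stored row count, if the catalog has one."""
    merged = dict(options)
    if catalog_row_count is not None:
        merged[NUM_ROWS_OPTION] = str(catalog_row_count)
    return merged


def scan_num_rows(options: Mapping[str, str]) -> Optional[int]:
    """Model of the scan side: read the row count back from the options.

    Assumption: a missing option means no stored row count is available.
    """
    raw = options.get(NUM_ROWS_OPTION)
    return int(raw) if raw is not None else None


analyzed = merged_options({"header": "true"}, catalog_row_count=100)
not_analyzed = merged_options({"header": "true"}, catalog_row_count=None)
```

Passing the count through the options map keeps the scan independent of the catalog: the scan only sees a string option and does not need a reference to the catalog table.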

4. V1 fallback removal

Removes the AnalyzeTable and AnalyzeColumn FileTable cases from ResolveSessionCatalog. DataSourceV2Strategy now handles these commands natively by routing them to AnalyzeTableExec/AnalyzeColumnExec.

Why are the changes needed?

Previously, ANALYZE TABLE/COLUMN for V2 file tables was routed through V1 commands (AnalyzeTableCommand/AnalyzeColumnCommand) via ResolveSessionCatalog fallbacks. This broke the V2 design principle where table operations should go through the V2 catalog API. The V1 path also stored stats only in the Hive metastore, making them invisible to V2 scan planning.

With this change:

- Stats are stored as table properties via the standard TableCatalog.alterTable() API
- FileScan can read stored row counts for better query optimization
- No V1 command routing needed for ANALYZE operations

Does this PR introduce any user-facing change?

No. ANALYZE TABLE and ANALYZE TABLE ... FOR COLUMNS produce the same results as they did through the V1 path. Statistics are still stored in the catalog and used for query optimization.

How was this patch tested?

Pass GitHub Actions

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code 4.6

…Frame API writes and delete FallBackFileSourceV2

Key changes:
- FileWrite: added partitionSchema, customPartitionLocations, dynamicPartitionOverwrite, isTruncate; path creation and truncate logic; dynamic partition overwrite via FileCommitProtocol
- FileTable: createFileWriteBuilder with SupportsDynamicOverwrite and SupportsTruncate; capabilities now include TRUNCATE and OVERWRITE_DYNAMIC; fileIndex skips file existence checks when userSpecifiedSchema is provided (write path)
- All file format writes (Parquet, ORC, CSV, JSON, Text, Avro) use createFileWriteBuilder with partition/truncate/overwrite support
- DataFrameWriter.lookupV2Provider: enabled FileDataSourceV2 for non-partitioned Append and Overwrite via df.write.save(path)
- DataFrameWriter.insertInto: V1 fallback for file sources (TODO: SPARK-56175)
- DataFrameWriter.saveAsTable: V1 fallback for file sources (TODO: SPARK-56230, needs StagingTableCatalog)
- DataSourceV2Utils.getTableProvider: V1 fallback for file sources (TODO: SPARK-56175)
- Removed FallBackFileSourceV2 rule
- V2SessionCatalog.createTable: V1 FileFormat data type validation

LuciferYang · 2026-03-31T04:04:29Z

This is the 4th PR in this series; it is based on #55091.

…catalog table loading, and gate removal

Key changes:
- FileTable extends SupportsPartitionManagement with createPartition, dropPartition, listPartitionIdentifiers, partitionSchema
- Partition operations sync to catalog metastore (best-effort)
- V2SessionCatalog.loadTable returns FileTable instead of V1Table, sets catalogTable and useCatalogFileIndex on FileTable
- V2SessionCatalog.getDataSourceOptions includes storage.properties for proper option propagation (header, ORC bloom filter, etc.)
- V2SessionCatalog.createTable validates data types via FileTable
- FileTable.columns() restores NOT NULL constraints from catalogTable
- FileTable.partitioning() falls back to userSpecifiedPartitioning or catalog partition columns
- FileTable.fileIndex uses CatalogFileIndex when catalog has registered partitions (custom partition locations)
- FileTable.schema checks column name duplication for non-catalog tables only
- DataSourceV2Utils.getTableProvider: removed FileDataSourceV2 gate
- DataFrameWriter.insertInto: enabled V2 for file sources
- DataFrameWriter.saveAsTable: V1 fallback (TODO: SPARK-56230)
- ResolveSessionCatalog: V1 fallback for FileTable-backed commands (AnalyzeTable, AnalyzeColumn, TruncateTable, TruncatePartition, ShowPartitions, RecoverPartitions, AddPartitions, RenamePartitions, DropPartitions, SetTableLocation, CREATE TABLE validation, REPLACE TABLE blocking)
- FindDataSourceTable: streaming V1 fallback for FileTable (TODO: SPARK-56233)
- DataSource.planForWritingFileFormat: graceful V2 handling

LuciferYang marked this pull request as draft March 31, 2026 04:03

LuciferYang force-pushed the SPARK-56176 branch from 39e8009 to a2f06d7 Compare March 31, 2026 04:30

LuciferYang added 2 commits March 31, 2026 16:34

[SPARK-56174][SQL] Complete V2 file write path for DataFrame API

ef5bc4b

LuciferYang force-pushed the SPARK-56176 branch from a2f06d7 to 222840c Compare March 31, 2026 09:50

[SPARK-56176][SQL] V2-native ANALYZE TABLE/COLUMN with stats propagation to FileScan

ccd7272

LuciferYang force-pushed the SPARK-56176 branch from 222840c to ccd7272 Compare March 31, 2026 09:51

LuciferYang changed the title ~~[SPARK-56176][SPARK-56232][SQL] V2-native ANALYZE TABLE/COLUMN with stats propagation to FileScan~~ [SPARK-56176][SQL] V2-native ANALYZE TABLE/COLUMN with stats propagation to FileScan Apr 1, 2026

LuciferYang wants to merge 4 commits into apache:master from LuciferYang:SPARK-56176
