[SPARK-56176][SQL] V2-native ANALYZE TABLE/COLUMN with stats propagation to FileScan #55108

Draft
LuciferYang wants to merge 4 commits into apache:master from LuciferYang:SPARK-56176

Conversation

@LuciferYang (Contributor)

What changes were proposed in this pull request?

This PR is part of SPARK-56170. It implements V2-native ANALYZE TABLE and ANALYZE COLUMN for file tables, removing the last V1 fallbacks in ResolveSessionCatalog, and propagates stored statistics through FileScan for query optimization.

1. V2-native ANALYZE TABLE (AnalyzeTableExec)

New physical plan node that computes table statistics and persists them as table properties via TableCatalog.alterTable(TableChange.setProperty(...)):

  • spark.sql.statistics.totalSize: computed from fileIndex.sizeInBytes
  • spark.sql.statistics.numRows: computed from df.count() (skipped when NOSCAN)
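The two properties above can be sketched as a small pure function. This is a hypothetical illustration of how AnalyzeTableExec might assemble the properties it persists via `TableCatalog.alterTable`; the helper name `buildStatsProperties` and its signature are illustrative, not the actual implementation.

```scala
// Sketch: build the statistics properties that ANALYZE TABLE persists.
// Key names come from the PR description; the helper itself is hypothetical.
object StatsProperties {
  def buildStatsProperties(sizeInBytes: Long, rowCount: Option[Long]): Map[String, String] = {
    val base = Map("spark.sql.statistics.totalSize" -> sizeInBytes.toString)
    // With NOSCAN, df.count() is skipped, so numRows is simply not written.
    rowCount.fold(base)(n => base + ("spark.sql.statistics.numRows" -> n.toString))
  }
}
```

With `NOSCAN`, only `totalSize` is produced; a full scan adds `numRows` as well.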

2. V2-native ANALYZE COLUMN (AnalyzeColumnExec)

New physical plan node that computes column-level statistics using CommandUtils.computeColumnStats and persists them as table properties:

  • spark.sql.statistics.colStats.<col>.<stat>: min, max, nullCount, distinctCount, avgLen, maxLen
  • Supports both FOR COLUMNS and FOR ALL COLUMNS syntax
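The per-column key layout can be sketched as follows. This is a minimal illustration of the `spark.sql.statistics.colStats.<col>.<stat>` naming scheme described above; `ColStat` and `toProperties` are hypothetical stand-ins, not the actual `CommandUtils` API.

```scala
// Sketch: flatten one column's statistics into the property keys
// that ANALYZE COLUMN persists. Names beyond the key prefix are hypothetical.
case class ColStat(min: String, max: String, nullCount: Long,
                   distinctCount: Long, avgLen: Long, maxLen: Long)

object ColStatsProperties {
  private val prefix = "spark.sql.statistics.colStats"

  def toProperties(col: String, s: ColStat): Map[String, String] = Map(
    s"$prefix.$col.min"           -> s.min,
    s"$prefix.$col.max"           -> s.max,
    s"$prefix.$col.nullCount"     -> s.nullCount.toString,
    s"$prefix.$col.distinctCount" -> s.distinctCount.toString,
    s"$prefix.$col.avgLen"        -> s.avgLen.toString,
    s"$prefix.$col.maxLen"        -> s.maxLen.toString
  )
}
```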

3. Stats propagation to FileScan

  • FileScan.numRows(): reads stored row count from __numRows option (set by ANALYZE TABLE)
  • FileTable.mergedOptions: injects __numRows from catalogTable.stats.rowCount into scan options so FileScan.estimateStatistics() can use it
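The `__numRows` hand-off between the two sides can be sketched like this. The names `mergedOptions` and `numRows` mirror the PR description, but the simplified map-based signatures here are illustrative only, assuming the row count travels as a plain string option.

```scala
// Sketch: FileTable injects the catalog row count into scan options,
// and FileScan reads it back when estimating statistics.
object NumRowsPropagation {
  val NumRowsKey = "__numRows"

  // FileTable.mergedOptions side: add __numRows when the catalog has a row count.
  def mergedOptions(options: Map[String, String], catalogRowCount: Option[Long]): Map[String, String] =
    catalogRowCount.fold(options)(n => options + (NumRowsKey -> n.toString))

  // FileScan.numRows() side: recover the stored row count, if present.
  def numRows(options: Map[String, String]): Option[Long] =
    options.get(NumRowsKey).map(_.toLong)
}
```

When no row count has been stored (e.g. the table was never analyzed), `numRows` returns `None` and `estimateStatistics()` falls back to size-based estimates.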

4. V1 fallback removal

Removes the AnalyzeTable and AnalyzeColumn FileTable cases from ResolveSessionCatalog — these are now handled natively by DataSourceV2Strategy routing to AnalyzeTableExec/AnalyzeColumnExec.

Why are the changes needed?

Previously, ANALYZE TABLE/COLUMN for V2 file tables was routed to the V1 commands (AnalyzeTableCommand/AnalyzeColumnCommand) via ResolveSessionCatalog fallbacks. This violated the V2 design principle that table operations should go through the V2 catalog API. The V1 path also stored stats only in the Hive metastore, making them invisible to V2 scan planning.

With this change:

  • Stats are stored as table properties via the standard TableCatalog.alterTable() API
  • FileScan can read stored row counts for better query optimization
  • No V1 command routing needed for ANALYZE operations

Does this PR introduce any user-facing change?

No. ANALYZE TABLE and ANALYZE TABLE FOR COLUMNS produce the same results as before: statistics are stored in the catalog and used for query optimization.

How was this patch tested?

  • Pass GitHub Actions

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code 4.6

…Frame API writes and delete FallBackFileSourceV2

Key changes:
- FileWrite: added partitionSchema, customPartitionLocations,
  dynamicPartitionOverwrite, isTruncate; path creation and truncate
  logic; dynamic partition overwrite via FileCommitProtocol
- FileTable: createFileWriteBuilder with SupportsDynamicOverwrite
  and SupportsTruncate; capabilities now include TRUNCATE and
  OVERWRITE_DYNAMIC; fileIndex skips file existence checks when
  userSpecifiedSchema is provided (write path)
- All file format writes (Parquet, ORC, CSV, JSON, Text, Avro) use
  createFileWriteBuilder with partition/truncate/overwrite support
- DataFrameWriter.lookupV2Provider: enabled FileDataSourceV2 for
  non-partitioned Append and Overwrite via df.write.save(path)
- DataFrameWriter.insertInto: V1 fallback for file sources
  (TODO: SPARK-56175)
- DataFrameWriter.saveAsTable: V1 fallback for file sources
  (TODO: SPARK-56230, needs StagingTableCatalog)
- DataSourceV2Utils.getTableProvider: V1 fallback for file sources
  (TODO: SPARK-56175)
- Removed FallBackFileSourceV2 rule
- V2SessionCatalog.createTable: V1 FileFormat data type validation
@LuciferYang LuciferYang marked this pull request as draft March 31, 2026 04:03
@LuciferYang (Contributor, Author)

This is the 4th PR in this series; it is based on #55091.

…catalog table loading, and gate removal

Key changes:
- FileTable extends SupportsPartitionManagement with createPartition,
  dropPartition, listPartitionIdentifiers, partitionSchema
- Partition operations sync to catalog metastore (best-effort)
- V2SessionCatalog.loadTable returns FileTable instead of V1Table,
  sets catalogTable and useCatalogFileIndex on FileTable
- V2SessionCatalog.getDataSourceOptions includes storage.properties
  for proper option propagation (header, ORC bloom filter, etc.)
- V2SessionCatalog.createTable validates data types via FileTable
- FileTable.columns() restores NOT NULL constraints from catalogTable
- FileTable.partitioning() falls back to userSpecifiedPartitioning
  or catalog partition columns
- FileTable.fileIndex uses CatalogFileIndex when catalog has
  registered partitions (custom partition locations)
- FileTable.schema checks column name duplication for non-catalog
  tables only
- DataSourceV2Utils.getTableProvider: removed FileDataSourceV2 gate
- DataFrameWriter.insertInto: enabled V2 for file sources
- DataFrameWriter.saveAsTable: V1 fallback (TODO: SPARK-56230)
- ResolveSessionCatalog: V1 fallback for FileTable-backed commands
  (AnalyzeTable, AnalyzeColumn, TruncateTable, TruncatePartition,
  ShowPartitions, RecoverPartitions, AddPartitions, RenamePartitions,
  DropPartitions, SetTableLocation, CREATE TABLE validation,
  REPLACE TABLE blocking)
- FindDataSourceTable: streaming V1 fallback for FileTable
  (TODO: SPARK-56233)
- DataSource.planForWritingFileFormat: graceful V2 handling
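The "best-effort" catalog sync mentioned in the commit above can be sketched as a try/log/continue wrapper: a metastore failure is reported but does not fail the V2 partition operation. `syncToCatalog` and its signature are hypothetical, for illustration only.

```scala
// Sketch: apply a partition change to the catalog metastore best-effort.
// Returns true on success; on failure, logs and lets the V2 operation proceed.
object BestEffortSync {
  def syncToCatalog(action: () => Unit, log: String => Unit): Boolean =
    try { action(); true }
    catch {
      case e: Exception =>
        log(s"Partition sync to metastore failed: ${e.getMessage}")
        false
    }
}
```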
@LuciferYang LuciferYang changed the title [SPARK-56176][SPARK-56232][SQL] V2-native ANALYZE TABLE/COLUMN with stats propagation to FileScan [SPARK-56176][SQL] V2-native ANALYZE TABLE/COLUMN with stats propagation to FileScan Apr 1, 2026