[SPARK-56176][SQL] V2-native ANALYZE TABLE/COLUMN with stats propagation to FileScan #55108
Draft
LuciferYang wants to merge 4 commits into apache:master
…Frame API writes and delete FallBackFileSourceV2

Key changes:
- FileWrite: added partitionSchema, customPartitionLocations, dynamicPartitionOverwrite, isTruncate; path creation and truncate logic; dynamic partition overwrite via FileCommitProtocol
- FileTable: createFileWriteBuilder with SupportsDynamicOverwrite and SupportsTruncate; capabilities now include TRUNCATE and OVERWRITE_DYNAMIC; fileIndex skips file existence checks when userSpecifiedSchema is provided (write path)
- All file format writes (Parquet, ORC, CSV, JSON, Text, Avro) use createFileWriteBuilder with partition/truncate/overwrite support
- DataFrameWriter.lookupV2Provider: enabled FileDataSourceV2 for non-partitioned Append and Overwrite via df.write.save(path)
- DataFrameWriter.insertInto: V1 fallback for file sources (TODO: SPARK-56175)
- DataFrameWriter.saveAsTable: V1 fallback for file sources (TODO: SPARK-56230, needs StagingTableCatalog)
- DataSourceV2Utils.getTableProvider: V1 fallback for file sources (TODO: SPARK-56175)
- Removed FallBackFileSourceV2 rule
- V2SessionCatalog.createTable: V1 FileFormat data type validation
This is the 4th PR in this series, and it is based on #55091.
39e8009 to a2f06d7
…catalog table loading, and gate removal

Key changes:
- FileTable extends SupportsPartitionManagement with createPartition, dropPartition, listPartitionIdentifiers, partitionSchema
- Partition operations sync to catalog metastore (best-effort)
- V2SessionCatalog.loadTable returns FileTable instead of V1Table, sets catalogTable and useCatalogFileIndex on FileTable
- V2SessionCatalog.getDataSourceOptions includes storage.properties for proper option propagation (header, ORC bloom filter, etc.)
- V2SessionCatalog.createTable validates data types via FileTable
- FileTable.columns() restores NOT NULL constraints from catalogTable
- FileTable.partitioning() falls back to userSpecifiedPartitioning or catalog partition columns
- FileTable.fileIndex uses CatalogFileIndex when catalog has registered partitions (custom partition locations)
- FileTable.schema checks column name duplication for non-catalog tables only
- DataSourceV2Utils.getTableProvider: removed FileDataSourceV2 gate
- DataFrameWriter.insertInto: enabled V2 for file sources
- DataFrameWriter.saveAsTable: V1 fallback (TODO: SPARK-56230)
- ResolveSessionCatalog: V1 fallback for FileTable-backed commands (AnalyzeTable, AnalyzeColumn, TruncateTable, TruncatePartition, ShowPartitions, RecoverPartitions, AddPartitions, RenamePartitions, DropPartitions, SetTableLocation, CREATE TABLE validation, REPLACE TABLE blocking)
- FindDataSourceTable: streaming V1 fallback for FileTable (TODO: SPARK-56233)
- DataSource.planForWritingFileFormat: graceful V2 handling
a2f06d7 to 222840c
222840c to ccd7272
What changes were proposed in this pull request?

This PR is part of SPARK-56170. It implements V2-native ANALYZE TABLE and ANALYZE COLUMN for file tables, removing the last V1 fallbacks in ResolveSessionCatalog, and propagates stored statistics through FileScan for query optimization.

1. V2-native ANALYZE TABLE (AnalyzeTableExec)

New physical plan node that computes table statistics and persists them as table properties via TableCatalog.alterTable(TableChange.setProperty(...)):

- spark.sql.statistics.totalSize: computed from fileIndex.sizeInBytes
- spark.sql.statistics.numRows: computed from df.count() (skipped when NOSCAN)

2. V2-native ANALYZE COLUMN (AnalyzeColumnExec)

New physical plan node that computes column-level statistics using CommandUtils.computeColumnStats and persists them as table properties:

- spark.sql.statistics.colStats.<col>.<stat>: min, max, nullCount, distinctCount, avgLen, maxLen
- FOR COLUMNS and FOR ALL COLUMNS syntax

3. Stats propagation to FileScan

- FileScan.numRows(): reads the stored row count from the __numRows option (set by ANALYZE TABLE)
- FileTable.mergedOptions: injects __numRows from catalogTable.stats.rowCount into scan options so FileScan.estimateStatistics() can use it

4. V1 fallback removal

Removes the AnalyzeTable and AnalyzeColumn FileTable cases from ResolveSessionCatalog; these are now handled natively by DataSourceV2Strategy routing to AnalyzeTableExec/AnalyzeColumnExec.

Why are the changes needed?

Previously, ANALYZE TABLE/COLUMN for V2 file tables was routed through V1 commands (AnalyzeTableCommand/AnalyzeColumnCommand) via ResolveSessionCatalog fallbacks. This broke the V2 design principle that table operations should go through the V2 catalog API. The V1 path also stored stats only in the Hive metastore, making them invisible to V2 scan planning.

With this change:

- Statistics are persisted through the TableCatalog.alterTable() API
- FileScan can read stored row counts for better query optimization

Does this PR introduce any user-facing change?

No. ANALYZE TABLE and ANALYZE TABLE FOR COLUMNS produce the same results. Statistics are stored in the catalog and used for query optimization as before.

How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code 4.6
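
Illustrative sketch (not part of the patch): the property layout and row-count propagation described under "What changes were proposed" can be modeled in a few lines of plain Python. The property keys and the __numRows option name are taken from the description; everything else (the function names table_stats_properties, column_stats_properties, merged_options, num_rows, and the dict standing in for the catalog) is hypothetical and only approximates what the Scala code might do. In particular, the description says mergedOptions reads catalogTable.stats.rowCount, while this sketch reads the stored property directly for simplicity.

```python
from typing import Dict, Optional

STATS_PREFIX = "spark.sql.statistics"  # key prefix quoted in the PR description


def table_stats_properties(total_size: int,
                           num_rows: Optional[int] = None) -> Dict[str, str]:
    """Table-level stats as string properties; num_rows=None models NOSCAN."""
    props = {f"{STATS_PREFIX}.totalSize": str(total_size)}
    if num_rows is not None:
        props[f"{STATS_PREFIX}.numRows"] = str(num_rows)
    return props


def column_stats_properties(col: str, stats: Dict[str, object]) -> Dict[str, str]:
    """Flatten one column's stats into colStats.<col>.<stat> keys."""
    return {f"{STATS_PREFIX}.colStats.{col}.{name}": str(value)
            for name, value in stats.items()}


def merged_options(scan_options: Dict[str, str],
                   table_props: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical analogue of FileTable.mergedOptions.

    Simplification: reads the numRows property instead of
    catalogTable.stats.rowCount, and injects it as __numRows when present.
    """
    merged = dict(scan_options)
    row_count = table_props.get(f"{STATS_PREFIX}.numRows")
    if row_count is not None:
        merged["__numRows"] = row_count
    return merged


def num_rows(options: Dict[str, str]) -> Optional[int]:
    """Hypothetical analogue of FileScan.numRows(): read the stored row count."""
    value = options.get("__numRows")
    return int(value) if value is not None else None


# Toy catalog entry: table-level stats, then column stats for column "id".
props = table_stats_properties(total_size=4096, num_rows=100)
props.update(column_stats_properties("id", {"min": 1, "max": 100, "nullCount": 0}))

options = merged_options({"header": "true"}, props)
print(num_rows(options))                                           # 100
print(num_rows(merged_options({}, table_stats_properties(4096))))  # None (NOSCAN)
```

All values are stored as strings because table properties are string-to-string maps, so a reader has to parse them back into numbers, as num_rows does here.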