
[SPARK-56177][SQL] V2 file bucketing write support#55128

Draft
LuciferYang wants to merge 5 commits into apache:master from LuciferYang:SPARK-56177

Conversation

@LuciferYang
Contributor

What changes were proposed in this pull request?

Enable bucketed writes for V2 file tables via catalog BucketSpec.

Changes:

  • FileWrite: add bucketSpec field, use V1WritesUtils.getWriterBucketSpec() instead of hardcoded None
  • FileTable.createFileWriteBuilder: extract catalogTable.bucketSpec and pass to the write pipeline
  • FileDataSourceV2.getTable: use collect to skip BucketTransform (handled via catalogTable.bucketSpec)
  • FileWriterFactory: use DynamicPartitionDataConcurrentWriter for bucketed writes since V2's RequiresDistributionAndOrdering cannot express hash-based ordering
  • All 6 format Write/Table classes (Parquet, ORC, CSV, JSON, Text, Avro) updated with BucketSpec parameter
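The row-to-bucket assignment that the writer bucket spec ultimately drives boils down to hashing the bucket columns and taking a positive modulus, so every row lands in a file for a bucket id in `[0, numBuckets)`. A standalone sketch of that pmod rule (the object and method names here are illustrative, not Spark's actual API):

```scala
// Illustrative sketch: a row is assigned to a bucket by hashing its bucket
// columns and taking a *positive* modulus (pmod), so negative hash values
// still map into [0, numBuckets). BucketSketch is a hypothetical name.
object BucketSketch {
  def bucketId(hash: Int, numBuckets: Int): Int = {
    require(numBuckets > 0, "numBuckets must be positive")
    val mod = hash % numBuckets          // Scala's % keeps the dividend's sign
    if (mod < 0) mod + numBuckets else mod
  }
}
```

For example, `BucketSketch.bucketId(-3, 4)` yields `1` rather than `-3`, which is why a plain `%` is not enough.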

Why are the changes needed?

After SPARK-56171 removed the V2 file write gate, INSERT INTO a bucketed file table goes through the V2 write path. Without this change, WriteJobDescription.bucketSpec is always None, so bucketed tables produce non-bucketed files.
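For context, the kind of statement affected looks like the following (illustrative only; after SPARK-56171, the `INSERT` here takes the V2 write path):

```sql
-- A bucketed file table; without this patch the written files
-- would not be bucketed even though the table metadata says so.
CREATE TABLE t (id INT, name STRING)
USING parquet
CLUSTERED BY (id) INTO 4 BUCKETS;

INSERT INTO t VALUES (1, 'a'), (2, 'b');
```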

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new tests in FileDataSourceV2WriteSuite:

  • Bucketed write with bucket ID verification via BucketingUtils.getBucketId
  • Partitioned + bucketed write with partition directory and bucket ID verification
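The bucket-ID verification relies on the convention that the bucket id is encoded in the output file name (e.g. the `_00003` suffix before the extensions). A hedged, standalone re-implementation of that parsing idea, not the real `BucketingUtils`:

```scala
// Hypothetical sketch of the bucket-id-in-filename convention: the bucket id
// is the digit run after the last '_', before any dot-separated extensions
// (e.g. "part-00000-<uuid>_00003.c000.snappy.parquet" -> bucket 3).
object BucketIdParser {
  private val bucketedFileName = """.*_(\d+)(?:\..*)?$""".r

  def getBucketId(fileName: String): Option[Int] = fileName match {
    case bucketedFileName(id) => Some(id.toInt)
    case _                    => None
  }
}
```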

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

…Frame API writes and delete FallBackFileSourceV2

Key changes:
- FileWrite: added partitionSchema, customPartitionLocations,
  dynamicPartitionOverwrite, isTruncate; path creation and truncate
  logic; dynamic partition overwrite via FileCommitProtocol
- FileTable: createFileWriteBuilder with SupportsDynamicOverwrite
  and SupportsTruncate; capabilities now include TRUNCATE and
  OVERWRITE_DYNAMIC; fileIndex skips file existence checks when
  userSpecifiedSchema is provided (write path)
- All file format writes (Parquet, ORC, CSV, JSON, Text, Avro) use
  createFileWriteBuilder with partition/truncate/overwrite support
- DataFrameWriter.lookupV2Provider: enabled FileDataSourceV2 for
  non-partitioned Append and Overwrite via df.write.save(path)
- DataFrameWriter.insertInto: V1 fallback for file sources
  (TODO: SPARK-56175)
- DataFrameWriter.saveAsTable: V1 fallback for file sources
  (TODO: SPARK-56230, needs StagingTableCatalog)
- DataSourceV2Utils.getTableProvider: V1 fallback for file sources
  (TODO: SPARK-56175)
- Removed FallBackFileSourceV2 rule
- V2SessionCatalog.createTable: V1 FileFormat data type validation
…catalog table loading, and gate removal

Key changes:
- FileTable extends SupportsPartitionManagement with createPartition,
  dropPartition, listPartitionIdentifiers, partitionSchema
- Partition operations sync to catalog metastore (best-effort)
- V2SessionCatalog.loadTable returns FileTable instead of V1Table,
  sets catalogTable and useCatalogFileIndex on FileTable
- V2SessionCatalog.getDataSourceOptions includes storage.properties
  for proper option propagation (header, ORC bloom filter, etc.)
- V2SessionCatalog.createTable validates data types via FileTable
- FileTable.columns() restores NOT NULL constraints from catalogTable
- FileTable.partitioning() falls back to userSpecifiedPartitioning
  or catalog partition columns
- FileTable.fileIndex uses CatalogFileIndex when catalog has
  registered partitions (custom partition locations)
- FileTable.schema checks column name duplication for non-catalog
  tables only
- DataSourceV2Utils.getTableProvider: removed FileDataSourceV2 gate
- DataFrameWriter.insertInto: enabled V2 for file sources
- DataFrameWriter.saveAsTable: V1 fallback (TODO: SPARK-56230)
- ResolveSessionCatalog: V1 fallback for FileTable-backed commands
  (AnalyzeTable, AnalyzeColumn, TruncateTable, TruncatePartition,
  ShowPartitions, RecoverPartitions, AddPartitions, RenamePartitions,
  DropPartitions, SetTableLocation, CREATE TABLE validation,
  REPLACE TABLE blocking)
- FindDataSourceTable: streaming V1 fallback for FileTable
  (TODO: SPARK-56233)
- DataSource.planForWritingFileFormat: graceful V2 handling

Enable bucketed writes for V2 file tables via catalog BucketSpec.

Key changes:
- FileWrite: add bucketSpec field, use V1WritesUtils.getWriterBucketSpec()
  instead of hardcoded None
- FileTable: createFileWriteBuilder passes catalogTable.bucketSpec
  to the write pipeline
- FileDataSourceV2: getTable uses collect to skip BucketTransform
  (handled via catalogTable.bucketSpec instead)
- FileWriterFactory: use DynamicPartitionDataConcurrentWriter for
  bucketed writes since V2's RequiresDistributionAndOrdering cannot
  express hash-based ordering
- All 6 format Write/Table classes updated with BucketSpec parameter

Note: bucket pruning and bucket join (read-path optimization) are
not included in this patch (tracked under SPARK-56231).
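The writer-selection rule described in this commit can be sketched as a small decision function (a toy model of the stated behavior, not Spark's real FileWriterFactory):

```scala
// Toy model of the writer choice described above: bucketed (or partitioned)
// output goes through the dynamic concurrent writer, because the V2
// RequiresDistributionAndOrdering interface cannot request hash-based
// ordering; plain unpartitioned, unbucketed output can use a simple
// single-directory writer. All names here are illustrative.
sealed trait WriterKind
case object SingleDirectoryWriter extends WriterKind
case object DynamicPartitionConcurrentWriter extends WriterKind

object WriterChooser {
  def choose(hasPartitions: Boolean, hasBuckets: Boolean): WriterKind =
    if (hasPartitions || hasBuckets) DynamicPartitionConcurrentWriter
    else SingleDirectoryWriter
}
```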
@LuciferYang LuciferYang marked this pull request as draft April 1, 2026 03:54
@LuciferYang
Contributor Author

LuciferYang commented Apr 1, 2026

677a482 represents the actual change in the current patch, which is the 5th patch in SPARK-56170.
