Implement replace data files action #2106
Conversation
Hi @Shekharrajak, thanks for the contribution! However, even excluding the datafusion integration, I don't think the transaction action alone is going to work as we want. We will have to add validation and delete processing. Here is a stale draft that I haven't had a chance to work on; I'm planning to pick it up soon.
Yes, the plan is to use it in datafusion-comet. This would be the flow: Spark CALL system.rewrite_data_files('table') -> Spark Iceberg (plan compaction) -> Comet Native (execute in Rust) -> iceberg-rust (read old files using ArrowReader, write new files using ParquetWriter, commit the replace using ReplaceDataFilesAction)
Adds ReplaceDataFilesAction to Transaction for atomic replace operations (compaction/rewrite). Implements Operation::Replace end-to-end: snapshot summary accounting, manifest writing, and the transaction action. Ports and extends the approach from apache/iceberg-rust PR apache#2106, incorporating the delete-manifest bug fix from PR apache#2149.

Changes:
- `crates/iceberg/src/transaction/replace_data_files.rs` (new)
  - ReplaceDataFilesAction with `delete_files()`, `add_files()`, `set_commit_uuid()`, `set_key_metadata()`, `set_snapshot_properties()` builders
  - Validates no duplicate paths in `files_to_delete`
  - Calls `validate_duplicate_files()` to reject files already alive in the snapshot
  - Validates all `files_to_delete` paths exist in the current snapshot (phantom-delete protection)
  - ReplaceOperation implements SnapshotProduceOperation: scans existing manifests to build Deleted entries; per-manifest filtering in `existing_manifest()` preserves unaffected manifests
  - `delete_files()` doc comment describes the manifest-granularity constraint
  - Comment on `existing_manifest()` documents the O(2N) double-scan limitation
- `crates/iceberg/src/transaction/snapshot.rs`
  - Added `removed_data_files` field and `with_removed_data_files()` builder
  - Implemented `write_delete_manifest()` using `add_delete_entry()`
  - Fixed latent bug: `summary()` looked up the new snapshot ID (not yet in metadata) to find the previous snapshot, causing subtract-overflow panics
  - Fixed stale error message ("fast append" -> neutral wording)
  - Added comment explaining `ManifestContentType::Data` in `write_delete_manifest()`
- `crates/iceberg/src/spec/snapshot_summary.rs`
  - Added Operation::Replace to the `update_snapshot_summaries()` allowlist
  - Fixed `unwrap()` -> `unwrap_or(0)` on property string parsing
  - Fixed u64 subtraction -> `saturating_sub()` to prevent underflow
  - Removed stale `#[allow(dead_code)]` from all four helpers
- `crates/iceberg/src/spec/manifest/writer.rs`
  - Removed `#[allow(dead_code)]` from `add_delete_entry()`
- `crates/iceberg/src/transaction/mod.rs`
  - Wired `replace_data_files()` method on Transaction
  - Added `apply_updates_to_table()` test helper
- `.gitignore`: added `.private/`

Tests: 1052 pass (`cargo test -p iceberg --lib`)
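For orientation, here is a minimal sketch of how the builder-style API described above could be driven. The types below are simplified stand-ins, not the real iceberg-rust types; only the method names `delete_files` and `add_files` and the duplicate-path validation mirror the PR description.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for the real ReplaceDataFilesAction in
// crates/iceberg/src/transaction/replace_data_files.rs.
#[derive(Default)]
pub struct ReplaceDataFilesAction {
    files_to_delete: Vec<String>,
    files_to_add: Vec<String>,
}

impl ReplaceDataFilesAction {
    pub fn delete_files(mut self, paths: impl IntoIterator<Item = String>) -> Self {
        self.files_to_delete.extend(paths);
        self
    }

    pub fn add_files(mut self, paths: impl IntoIterator<Item = String>) -> Self {
        self.files_to_add.extend(paths);
        self
    }

    // Mirrors the "no duplicate paths in files_to_delete" validation.
    pub fn validate(&self) -> Result<(), String> {
        let mut seen = HashSet::new();
        for p in &self.files_to_delete {
            if !seen.insert(p.as_str()) {
                return Err(format!("duplicate path in files_to_delete: {p}"));
            }
        }
        Ok(())
    }
}

fn main() {
    let action = ReplaceDataFilesAction::default()
        .delete_files(["s3://b/small-1.parquet".into(), "s3://b/small-2.parquet".into()])
        .add_files(["s3://b/compacted-1.parquet".into()]);
    assert!(action.validate().is_ok());

    let dup = ReplaceDataFilesAction::default()
        .delete_files(["a.parquet".into(), "a.parquet".into()]);
    assert!(dup.validate().is_err());
    println!("ok");
}
```

The real action additionally checks that added files are not already live in the snapshot and that every deleted path exists in the current snapshot, which requires table metadata and is omitted here.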
- fix(append): carry forward delete-only manifests after Replace (port of apache/iceberg-rust PR apache#2149): add `|| entry.has_deleted_files()` to `FastAppendOperation::existing_manifest()` so delete manifests created by Replace are not silently dropped by the next FastAppend
- feat(replace): add `validate_from_snapshot(snapshot_id)` builder to ReplaceDataFilesAction for concurrent compaction workflows; validates `files_to_delete` against a historical snapshot instead of the current one, avoiding false rejections when another writer commits between planning and committing (port of apache/iceberg-rust PR apache#2106)
- feat(replace): add `data_sequence_number(seq_num)` builder to ReplaceDataFilesAction and SnapshotProducer; threads an explicit sequence number into added manifest entries so compacted files can retain the original sequence number when equality deletes are present (port of apache/iceberg-rust PR apache#2106)
- test: four new tests covering all three changes (1056 total, all pass)
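The arithmetic hardening called out in the changelog (`unwrap()` -> `unwrap_or(0)` on summary property parsing, and `saturating_sub()` on file counters) can be illustrated in isolation. The helper names and property keys below are illustrative, not the exact ones from the PR.

```rust
use std::collections::HashMap;

// Parse a numeric summary property, defaulting to 0 on a missing or
// malformed value instead of panicking (the unwrap() -> unwrap_or(0) fix).
pub fn prop_u64(summary: &HashMap<String, String>, key: &str) -> u64 {
    summary
        .get(key)
        .and_then(|v| v.parse::<u64>().ok())
        .unwrap_or(0)
}

// Subtract removed-file counts without underflow
// (the `u64 -` -> `saturating_sub()` fix).
pub fn new_total(prev_total: u64, removed: u64) -> u64 {
    prev_total.saturating_sub(removed)
}

fn main() {
    let mut summary = HashMap::new();
    summary.insert("total-data-files".to_string(), "not-a-number".to_string());
    assert_eq!(prop_u64(&summary, "total-data-files"), 0);
    assert_eq!(prop_u64(&summary, "missing-key"), 0);
    // A stale or inconsistent summary could claim fewer files than a
    // Replace deletes; saturating_sub clamps at zero instead of panicking.
    assert_eq!(new_total(3, 5), 0);
    assert_eq!(new_total(10, 4), 6);
    println!("ok");
}
```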
Which issue does this PR close?
What changes are included in this PR?
This PR adds ReplaceDataFilesAction, a new transaction action for replacing data files in Iceberg tables. This is essential for compaction operations, where multiple small files are merged into larger ones. In Java/Spark Iceberg the chain is: RewriteDataFilesSparkAction -> calls the RewriteDataFiles API -> uses Operation.REPLACE.
This Rust implementation provides the same primitive that Spark uses.
This allows maintenance tools to safely compact files without affecting ongoing reads or
concurrent writes - a key property for production table maintenance.
The plan is to have these features available so that datafusion-comet can use them for Comet Native execution.
Spark CALL system.rewrite_data_files('table') -> Spark Iceberg (Plan compaction) -> Comet Native (Execute in Rust) -> iceberg-rust (Read old files using ArrowReader, Write new files using ParquetWriter, Commit replace using ReplaceDataFilesAction)
Note: ArrowReader and ParquetWriter are already available.
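The planned flow above can be sketched end to end with placeholder I/O. ArrowReader, ParquetWriter, and the commit are mocked as plain functions here, since the point is only the shape of the read -> rewrite -> replace loop, not the real iceberg-rust API.

```rust
// Hypothetical sketch of the compaction loop; real code would use
// iceberg-rust's ArrowReader, ParquetWriter, and ReplaceDataFilesAction.
pub struct DataFile {
    pub path: String,
    pub records: Vec<u64>,
}

// Stand-in for reading record batches out of the small input files.
pub fn read_all(files: &[DataFile]) -> Vec<u64> {
    files.iter().flat_map(|f| f.records.iter().copied()).collect()
}

// Stand-in for writing one compacted output file.
pub fn write_compacted(path: &str, records: Vec<u64>) -> DataFile {
    DataFile { path: path.to_string(), records }
}

// Stand-in for committing a Replace snapshot: old paths out, new files in.
pub fn commit_replace(table: &mut Vec<DataFile>, delete: &[String], add: Vec<DataFile>) {
    table.retain(|f| !delete.contains(&f.path));
    table.extend(add);
}

fn main() {
    let mut table = vec![
        DataFile { path: "small-1".into(), records: vec![1, 2] },
        DataFile { path: "small-2".into(), records: vec![3] },
    ];
    // Plan: rewrite every small file into one compacted file.
    let to_delete: Vec<String> = table.iter().map(|f| f.path.clone()).collect();
    let merged = read_all(&table);
    let compacted = write_compacted("compacted-1", merged);
    commit_replace(&mut table, &to_delete, vec![compacted]);
    assert_eq!(table.len(), 1);
    assert_eq!(table[0].records, vec![1, 2, 3]);
    println!("ok");
}
```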
Ref https://iceberg.apache.org/docs/latest/maintenance/
Are these changes tested?
Unit tests