Skip to content

feat(transaction): add ReplaceDataFilesAction for table compaction#1

Merged
big-mac-slice merged 1 commit intoperpetual-mainfrom
feat/replace-data-files-action
Mar 17, 2026
Merged

feat(transaction): add ReplaceDataFilesAction for table compaction#1
big-mac-slice merged 1 commit intoperpetual-mainfrom
feat/replace-data-files-action

Conversation

@big-mac-slice
Copy link
Copy Markdown

@big-mac-slice big-mac-slice commented Mar 6, 2026

Summary

Adds ReplaceDataFilesAction -- an atomic transaction action that replaces a set of input data files with compacted output files in a single Iceberg snapshot. This is the core primitive needed for compaction workflows.

Also includes several supporting improvements: timestamp predicate pushdown, scan plan deadlock fix, IcebergTableProvider::try_new visibility, and s3a path normalization.

Changes

crates/iceberg/src/transaction/replace_data_files.rs (new)

  • ReplaceDataFilesAction with delete_files(), add_files(), set_commit_uuid(), set_key_metadata(), set_snapshot_properties(), validate_from_snapshot(), and data_sequence_number() builders
  • Validates no duplicate paths in files_to_delete; validates all deleted files exist as alive entries (phantom delete protection); validates no added files already exist alive
  • validate_from_snapshot(snapshot_id) enables correct concurrent compaction workflows by validating against a historical planning snapshot
  • data_sequence_number(seq_num) sets an explicit data sequence number on added entries for v2 equality-delete correctness
  • existing_manifest() uses dual .any() checks to classify each manifest: no deletions (carry forward), all alive entries deleted (exclude), mixed (return DataInvalid). This replaces the previous behavior of silently dropping surviving files from mixed manifests. Phase 2 will rewrite mixed manifests transparently using ManifestWriter::add_existing_entry.

crates/iceberg/src/transaction/snapshot.rs

  • Added removed_data_files field and with_removed_data_files() builder on SnapshotProducer
  • Implemented write_delete_manifest() using add_delete_entry() to write Deleted-status manifest entries
  • Fixed summary() snapshot lookup (was using new snapshot ID which does not exist yet, causing subtract-overflow panics)
  • Fixed stale error message on content type check

crates/iceberg/src/spec/snapshot_summary.rs

  • Added Operation::Replace to update_snapshot_summaries()
  • Fixed unwrap() to unwrap_or(0) on parse calls; u64 subtraction to saturating_sub()

crates/iceberg/src/transaction/append.rs

crates/iceberg/src/spec/manifest/writer.rs

  • add_delete_entry() is now called (removed dead-code guard)

crates/iceberg/src/writer/file_writer/parquet_writer.rs

  • Use HEAD stat for authoritative file_size_in_bytes instead of a byte-count estimate

crates/iceberg/src/scan/mod.rs + context.rs

  • Scan plan is now pure async streams rather than channels, eliminating a class of deadlock
  • Added incremental scan validation

crates/iceberg/src/scan/task.rs

  • FileScanTask exposes column_sizes and split_offsets

crates/integrations/datafusion/src/physical_plan/expr_to_predicate.rs

  • Timestamp filter pushdown to Iceberg predicates (previously caused full table scans on time-based queries)

crates/integrations/datafusion/src/table/mod.rs

  • IcebergTableProvider::try_new is now pub

crates/storage/opendal/src/lib.rs

  • Normalize s3a:// paths to s3://

crates/examples/src/compact.rs (new)

  • End-to-end compaction example: reads all data files from current snapshot, streams through a RollingFileWriter, commits via ReplaceDataFilesAction
  • 1 GB target file size, explicit 128 MB row group size
  • Configurable via env vars: CATALOG_URI, CATALOG_WAREHOUSE, TABLE_NAMESPACE, TABLE_NAME, S3_REGION

Testing

Unit tests in replace_data_files.rs:

  • test_replace_requires_files_to_add
  • test_replace_add_file_already_in_snapshot_errors
  • test_replace_phantom_delete_errors
  • test_replace_duplicate_paths_in_delete_errors
  • test_replace_action_full_table -- 2 files to 1; verifies Deleted entries and Replace snapshot summary totals
  • test_replace_action_partial -- replaces 2 of 3 files across separate manifests; verifies unaffected file survives
  • test_replace_then_fast_append_preserves_delete_manifest
  • test_replace_validate_from_snapshot_succeeds
  • test_replace_validate_from_snapshot_unknown_id_errors
  • test_replace_data_sequence_number_is_set_on_entries
  • test_replace_mixed_manifest_errors -- A+B in one manifest, delete only A -> DataInvalid
  • test_replace_full_manifest_delete_no_regression -- A+B in one manifest, delete both -> success
  • test_replace_single_file_manifests_no_false_positive -- A and B in separate manifests, delete A -> success, B survives

@big-mac-slice big-mac-slice force-pushed the feat/replace-data-files-action branch 2 times, most recently from 4c8e1b3 to 198b4da Compare March 10, 2026 18:02
@big-mac-slice big-mac-slice changed the title feat(transaction): ReplaceDataFilesAction for table compaction (Phase 1 PoC) feat(transaction): add ReplaceDataFilesAction for table compaction Mar 10, 2026
@big-mac-slice big-mac-slice force-pushed the feat/replace-data-files-action branch 3 times, most recently from 70994c7 to 2e3ccad Compare March 11, 2026 16:36
@derrley
Copy link
Copy Markdown

derrley commented Mar 11, 2026

This needs to be a PR into perpetual-main not into main. We keep a separate main so we can periodically sync main to upstream main.

@derrley
Copy link
Copy Markdown

derrley commented Mar 11, 2026

This is the claude summary of where else we might find PR's that add replace files into the library.

 ---
  Upstream & Fork Landscape for Replace/Rewrite Data Files

  apache/iceberg-rust (upstream)

  ┌───────┬──────────────┬─────────────┬─────────────────────────────────────────────────────────────────────┐
  │  PR   │    Author    │   Status    │                             Description                             │
  ├───────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────┤
  │       │              │ Open        │ RewriteFilesAction + validation + delete manifest processing. The   │
  │ #1606 │ CTTY         │ (draft,     │ most comprehensive upstream attempt. CTTY is a maintainer and       │
  │       │              │ since Aug   │ commented on #2106 that they plan to pick this up. Includes         │
  │       │              │ 2025)       │ SnapshotValidator and rewrite_files.rs.                             │
  ├───────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────┤
  │       │              │ Open (Feb   │ ReplaceDataFilesAction — motivated by DataFusion Comet integration. │
  │ #2106 │ Shekharrajak │ 2026)       │  CTTY indicated this alone won't work without validation and delete │
  │       │              │             │  processing, and pointed to #1606.                                  │
  ├───────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────┤
  │       │              │ Open (Feb   │ OverwriteAction with CoW delete support — related but different     │
  │ #2185 │ glitchy      │ 2026)       │ operation (Operation::Overwrite not Replace). First in a planned    │
  │       │              │             │ CoW/MoR series. Touches the same snapshot.rs and append.rs files.   │
  ├───────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────┤
  │ #2149 │ drbothen     │ Open (Feb   │ Fix: preserve delete-only manifests in fast_append — same bug our   │
  │       │              │ 2026)       │ branch fixes in append.rs.                                          │
  ├───────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────┤
  │ #2144 │ drbothen     │ Open (Feb   │ Fix: saturating_sub in snapshot summary — same fix our branch       │
  │       │              │ 2026)       │ includes.                                                           │
  └───────┴──────────────┴─────────────┴─────────────────────────────────────────────────────────────────────┘

  risingwavelabs/iceberg-rust

  Has a RewriteFilesAction on branches li0k/rewrite_files_action and li0k/rewrite_files_action_0314 (the _0314
  one is more recent). Key differences from our implementation:

  - Manifest rewriting: When a manifest contains both deleted and surviving files, RisingWave rewrites the
  manifest to keep the surviving entries. Our branch drops the entire manifest (documented as a
  "manifest-granularity constraint").
  - No validation: Doesn't validate that delete paths exist in the snapshot. No validate_from_snapshot or
  data_sequence_number features.
  - Manifest merge support: Integrates with MergeManifestProcess based on table properties (target_size_bytes,
  min_count_to_merge).
  - All in a single transaction.rs file (not a separate module).

  bauplanlabs/iceberg-rust

  No replace/rewrite/compact support on bauplan-main. Nothing relevant in their tree.

  ---
  Summary

  Our feat/replace-data-files-action is most similar to upstream's #2106 (same name, same approach) but
  significantly more complete — it includes validation (validate_from_snapshot, duplicate path checks, phantom
  delete detection), data_sequence_number for equality-delete correctness, the append.rs fix (#2149), and the
  saturating_sub fix (#2144).

  The key risk is that CTTY's #1606 is the one most likely to land upstream since they're a maintainer, and it
  takes a different architectural approach with a SnapshotValidator abstraction. RisingWave's approach of
  rewriting manifests to preserve non-deleted entries is arguably more correct than our manifest-granularity
  exclusion.

CTTY is the main shot caller in that community. So, Claude's suggestion there is probably a good one. But you already did the work. So. Instead, I asked how hard it would be to port the compactor to use CTTY's interface once that ships, and got this answer:

  Easy — the compact.rs example would need ~4 lines changed:

  // Before (ours):
  tx.replace_data_files()
      .delete_files(files_to_delete)
      .add_files(compacted_files)

  // After (CTTY's):
  tx.rewrite_files()
      .delete_data_files(files_to_delete)?
      .add_data_files(compacted_files)?

So. I'm good there.

In that case, I think we should just take the advice to steal hints from the two ways that risingwave's patches are better:

This one:

  Manifest rewriting — yes, genuinely better. This is the one real advantage. Our existing_manifest() drops an
  entire manifest if it contains any file being deleted, silently orphaning surviving entries in that manifest.
  RisingWave rewrites the manifest to keep survivors. In practice this matters when you compact a subset of files
   from a manifest that also contains unrelated files — our approach would lose those unrelated files from the
  snapshot. That said, CTTY's #1606 also does this, so this will come for free from upstream.

And this one:

  Manifest merge support — irrelevant for compaction. MergeManifestProcess merges manifests (the metadata files
  listing data files), not data files themselves. It's an optimization for manifest proliferation. It's
  orthogonal to the replace/rewrite action and is only wired up as an option — RisingWave's RewriteFilesAction
  falls back to DefaultManifestProcess when merge is disabled. Our branch using DefaultManifestProcess is fine;
  manifest merging can be added independently.

Manifest merging will come next (as will expire snapshots). All three things are important. Compaction is just the one that is failing in trino and urgently in need of getting right.

Comment thread crates/examples/src/compact.rs Outdated
@@ -0,0 +1,449 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch isn't mergeable with compact here. Needs to move to simba.

Comment thread crates/iceberg/src/arrow/schema.rs Outdated
Ok(())
}

fn before_list_element(&mut self, field: &Field) -> Result<()> {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be hesitant to change anything in this file. Arrow schema seems pretty central.

///
/// # TODO
/// Remove this allow later
#[allow(dead_code)]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to keep the patch non-intrusive. The goal isn't just to make things work. The goal is to write a narrow patch that adds just what we need such that we can keep rebasing our patch onto their main until something they write actually ships and we can use it instead.


#[allow(dead_code)]
fn get_prop(previous_summary: &Summary, prop: &str) -> Result<i32> {
fn get_prop(previous_summary: &Summary, prop: &str) -> Result<i64> {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point being that snapshots can be pretty big? And i32 will overflows? Seems like that needs to be a patch in the stack that we maintain independently of rewrite.

)?])
return Ok(vec![]);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tad bit scared to touch this file. Do you know why this edit is needed?

@derrley
Copy link
Copy Markdown

derrley commented Mar 11, 2026

API inconsistencies with the other actions in this repo:

  The real inconsistencies are:

  1. add_files() should be add_data_files() — matches FastAppendAction and is explicit about the content type
  (these are data files, not delete files).
  2. delete_files() should be delete_data_files() — same reasoning, and aligns with both CTTY's
  delete_data_files() and RisingWave's delete_files() which explicitly routes by content type. When/if delete
  file rewriting is added later, you'd want delete_delete_files() or a separate method.
  3. validate_from_snapshot() and data_sequence_number() break the pattern — every other setter in the repo uses
  set_ or with_ prefix. These should be set_validate_from_snapshot() / set_data_sequence_number() to match, or
  alternatively with_validate_from_snapshot() / with_data_sequence_number().

  Otherwise the structure (builder pattern consuming self, pub(crate) fn new(), impl TransactionAction) follows
  the repo conventions correctly.

Comment thread crates/examples/src/compact.rs Outdated
Comment on lines +259 to +262
// Partition-aware fanout: split each batch by partition key and route rows to a
// dedicated writer per partition. This is required for correctness — a single
// RollingFileWriter fed interleaved multi-partition rows would violate the Iceberg
// invariant that every data file belongs to exactly one partition.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My biggest worry is this part of the approach. But I think we have a plan for that. Schedule per-partition more frequently, run the cron over everything less frequently?

Comment thread crates/examples/src/compact.rs Outdated
Comment on lines +162 to +165
let snapshot = current_table
.metadata()
.current_snapshot()
.expect("table has no snapshots — nothing to compact");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably only want to compact when we notice small files. 🤔 Otherwise this will read and write every file when it fires up. Even the old partitions that were already compacted and fine!

@big-mac-slice big-mac-slice force-pushed the feat/replace-data-files-action branch 5 times, most recently from 30f08d8 to 31b32b3 Compare March 12, 2026 23:07
@big-mac-slice big-mac-slice marked this pull request as ready for review March 13, 2026 15:29
@big-mac-slice big-mac-slice requested a review from derrley March 13, 2026 15:29
@big-mac-slice big-mac-slice changed the base branch from main to perpetual-main March 13, 2026 15:36
@big-mac-slice big-mac-slice force-pushed the feat/replace-data-files-action branch 2 times, most recently from 6843d8c to 5c31ce0 Compare March 13, 2026 17:05
Copy link
Copy Markdown

@derrley derrley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall solid implementation — good validation, thorough tests, and the v2 equality-delete sequence number handling is correct. One blocking issue around data safety in existing_manifest().

}
}

#[cfg(test)]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: manifest-granularity exclusion can silently drop data files.

This drops the entire manifest if any file in it is being deleted. Files that are NOT being deleted but share a manifest with deleted files get silently excluded from the new snapshot — that's data loss.

Example: a manifest contains files A, B, C. Caller deletes only A. This code drops the entire manifest, so B and C vanish from the snapshot with no error.

This should be an error — if a manifest contains a mix of files-to-delete and files-to-keep, reject the operation rather than silently losing data. Something like:

let dominated = manifest.entries().iter().all(|e| {
    !e.is_alive() || self.files_to_delete.contains(e.file_path())
});
if has_file_to_delete && !dominated {
    return Err(Error::new(
        ErrorKind::DataInvalid,
        format!(
            "Manifest contains files to delete AND files to keep. \
             All alive files in a manifest must be included in files_to_delete \
             or the manifest must not contain any files to delete. \
             Manifest: {}",
            entry.manifest_path()
        ),
    ));
}

Alternatively (and probably better for Phase 2): rewrite the manifest to keep the surviving entries. But until then, an error is strictly better than silent data loss.

// NOTE: `commit` takes `Arc<Self>` and cannot move out of fields, so
// `files_to_add` and `files_to_delete` must be cloned here. For large
// compaction jobs with many DataFiles this is non-trivial overhead.
// Tracked as a known Phase 1 limitation; revisit if memory becomes an issue.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of the trait, unfort. Kind of stuck with it.

Adds ReplaceDataFilesAction -- an atomic transaction action that replaces
a set of input data files with compacted outputs in a single Iceberg
snapshot (Operation::Replace). This is the core primitive needed for
compaction workflows.

Builder API:
- delete_files() / add_files() -- input and output file sets
- validate_from_snapshot(id) -- validate against a planning snapshot to
  avoid false rejections from concurrent writers between planning and
  committing
- data_sequence_number(seq_num) -- pin added entries to an explicit
  sequence number for v2 equality-delete correctness (compacted files
  that predate an equality delete must retain the original seq number)
- set_commit_uuid() / set_key_metadata() / set_snapshot_properties()

Validation: no duplicate paths in files_to_delete; all deleted files must
exist as alive entries (phantom-delete protection); no added files may
already exist alive (duplicate protection).

Mixed-manifest safety: existing_manifest() previously silently dropped
entire manifests when any file in them was being deleted, discarding
non-deleted files in the same manifest. Fixed with dual .any() calls that
classify each manifest into three states: no deletions (carry forward),
all alive entries deleted (exclude), mixed (Err DataInvalid). Phase 2 will
replace the error with transparent rewriting via add_existing_entry.

New tests: test_replace_mixed_manifest_errors,
test_replace_full_manifest_delete_no_regression,
test_replace_single_file_manifests_no_false_positive.

Supporting changes:

snapshot.rs: removed_data_files field and write_delete_manifest() on
SnapshotProducer; fixed summary() snapshot lookup (new snapshot ID does
not exist yet, causing subtract-overflow panics on file counts)

snapshot_summary.rs: Operation::Replace added to update_snapshot_summaries()

append.rs: preserve delete-only manifests from prior Replace operations
(regression guard for apache/iceberg-rust PR 2149)

scan/mod.rs: scan plan is now pure async streams rather than channels,
eliminating a class of deadlock; added incremental scan validation

scan/task.rs: FileScanTask exposes column_sizes and split_offsets

parquet_writer.rs: use HEAD stat for authoritative file_size_in_bytes

datafusion/expr_to_predicate.rs: timestamp filter pushdown to Iceberg
predicates (previously caused full table scans on time-based queries)

datafusion/table/mod.rs: IcebergTableProvider::try_new is now pub

storage/opendal/src/lib.rs: normalize s3a:// paths to s3://

compact.rs (new): end-to-end compaction example using ReplaceDataFilesAction
@big-mac-slice big-mac-slice force-pushed the feat/replace-data-files-action branch from 23fdc81 to 60a93bf Compare March 17, 2026 15:06
@big-mac-slice big-mac-slice merged commit a47ec67 into perpetual-main Mar 17, 2026
15 of 18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants