Skip to content

feat(engine): SpillError::PriorSpillsLost typed variant + consumer wiring (SPL-35..37)#4749

Merged
oferchen merged 2 commits into
masterfrom
feat/spl-35-37-priorspilllost-error
May 23, 2026
Merged

feat(engine): SpillError::PriorSpillsLost typed variant + consumer wiring (SPL-35..37)#4749
oferchen merged 2 commits into
masterfrom
feat/spl-35-37-priorspilllost-error

Conversation

@oferchen
Copy link
Copy Markdown
Owner

Summary

Implements SPL-32's top hardening recommendation: distinguish "prior spills lost" from a generic NotFound at the reorder spill recovery-refusal sites.

  • SPL-35 - Add SpillError::PriorSpillsLost { dir, count } variant in crates/engine/src/concurrent_delta/spill/error.rs. The variant carries the spill directory that vanished and the count of records known to be unrecoverable. Display renders the recommended message "prior spill directory <dir> vanished; <count> chunk(s) cannot be recovered". io_error() and is_out_of_space() correctly classify it as a non-I/O failure.
  • SPL-36 - Wire the typed variant through the spill writer.
    • spill_candidates_per_item (per-item path): on NotFound returned from the inner refusal branch, calls the new prior_spills_lost_error() helper and surfaces PriorSpillsLost rather than Io(NotFound).
    • spill_candidates_whole_batch (whole-batch path): same upgrade in its refusal arm before restore_taken.
    • Consumer run_spillable_loop (consumer/loops.rs) adds the missing match arm so the variant funnels into DeltaResult::failed with the actionable diagnostic for the receiver (still mapped to upstream exit code 11).
  • SPL-37 - Regression coverage.
    • The pre-existing temp_dir_vanish_after_prior_spills_returns_error assertion was tightened to require SpillError::PriorSpillsLost { dir, count } with dir == spill_dir and count >= 1.
    • New prior_spills_lost_surfaces_typed_variant_on_dir_wipe explicitly drives the SPL-37 contract: seeds enough inserts to commit real records to disk, force-wipes the spill directory, then drains the buffer and asserts the surfaced error is exactly PriorSpillsLost with the configured dir and a non-zero count. dir_recreate_events must remain 0.

The variant is added in a way that is non-breaking only at the variant-set level; every existing exhaustive match SpillError in the workspace (consumer loop, hardening tests, top-level Display/source test) was updated to handle the new arm.

Test plan

  • CI fmt+clippy passes
  • CI nextest stable (Linux, macOS, Windows, Linux musl) passes
  • Existing temp-dir-vanish hardening tests still pass with the tightened typed-variant assertion
  • New prior_spills_lost_surfaces_typed_variant_on_dir_wipe test passes

@github-actions github-actions Bot added the enhancement New feature or request label May 22, 2026
@oferchen oferchen force-pushed the feat/spl-35-37-priorspilllost-error branch 11 times, most recently from 9246e05 to 597acf2 Compare May 22, 2026 22:40
oferchen added 2 commits May 23, 2026 02:09
…ring (SPL-35..37)

The reorder spill module previously bubbled a generic io::Error{NotFound}
when the caller-supplied spill directory vanished mid-transfer after prior
records had already committed. The receiver could not distinguish that
unrecoverable data-loss signal from a transient kernel-level ENOENT. This
change adds a typed SpillError::PriorSpillsLost { dir, count } variant,
upgrades both spill writers (per-item + whole-batch) to surface it at the
recovery-refusal sites, and wires the new variant through the spillable
reorder consumer loop so the receiver maps it to exit code 11 with an
actionable diagnostic.

- SPL-35: SpillError::PriorSpillsLost { dir, count } variant + Display.
- SPL-36: Per-item and whole-batch writers upgrade NotFound to the typed
  variant when spill_index is non-empty; consumer loop adds the missing
  match arm. The pre-existing temp-dir-vanish hardening test now asserts
  the typed variant carries the configured dir and count >= 1.
- SPL-37: New regression test forces real spills via the byte threshold,
  wipes the directory mid-transfer, and asserts the surfaced error is
  exactly SpillError::PriorSpillsLost with the expected dir and count.
@oferchen oferchen force-pushed the feat/spl-35-37-priorspilllost-error branch from 597acf2 to 990f265 Compare May 22, 2026 23:10
@oferchen oferchen merged commit 1c470e0 into master May 23, 2026
55 of 57 checks passed
oferchen added a commit that referenced this pull request May 23, 2026
Evaluate five candidate ENOSPC injection mechanisms (tmpfs size-cap,
fallocate filler, mock writer, failpoints crate, fuse-mt) against CI
compatibility on the Linux musl / macOS / Windows tiles and against
the SPL-32 audit's site inventory. Recommend a userspace mock writer
for the unit-test layer plus a Linux-gated fallocate filler for one
real-kernel integration test, and pin the chosen mechanism's API
surface in pseudo-code so SPL-33.b has a concrete target.

Cross-references docs/design/spill-fs-error-audit.md (SPL-32) for the
syscall site inventory and the SpillError::PriorSpillsLost wiring
landed via SPL-35/36/37 (#4749).
oferchen added a commit that referenced this pull request May 23, 2026
….a) (#4770)

Evaluate five candidate ENOSPC injection mechanisms (tmpfs size-cap,
fallocate filler, mock writer, failpoints crate, fuse-mt) against CI
compatibility on the Linux musl / macOS / Windows tiles and against
the SPL-32 audit's site inventory. Recommend a userspace mock writer
for the unit-test layer plus a Linux-gated fallocate filler for one
real-kernel integration test, and pin the chosen mechanism's API
surface in pseudo-code so SPL-33.b has a concrete target.

Cross-references docs/design/spill-fs-error-audit.md (SPL-32) for the
syscall site inventory and the SpillError::PriorSpillsLost wiring
landed via SPL-35/36/37 (#4749).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant