Skip to content

Fetch snapshots during join, rather than at startup#7630

Open
eddyashton wants to merge 32 commits intomicrosoft:mainfrom
eddyashton:snapshot_fetch_during_join
Open

Fetch snapshots during join, rather than at startup#7630
eddyashton wants to merge 32 commits intomicrosoft:mainfrom
eddyashton:snapshot_fetch_during_join

Conversation

@eddyashton
Copy link
Member

This was a 'mare to land. Will add some comments inline.

The goal was to push snapshot fetching later in the startup process, and late enough that we can do it when receiving a StartupSeqnoIsOld error response to a /join.

While doing that, I've decoupled the verification from the deserialisation of the snapshots, in a hopefully readable way, alongside a bunch of the helper functions for accessing snapshots.

@eddyashton eddyashton requested a review from a team as a code owner January 29, 2026 15:37
Copilot AI review requested due to automatic review settings January 29, 2026 15:37
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Shift snapshot fetching later in the node startup flow so joiners only fetch a newer snapshot after receiving a StartupSeqnoIsOld response, while refactoring snapshot handling to separate receipt verification from snapshot deserialisation.

Changes:

  • Move join-time snapshot discovery/fetching into the enclave join path (triggered on StartupSeqnoIsOld) rather than doing it at host startup.
  • Refactor snapshot serdes into segment separation + explicit receipt verification, and add helpers to enumerate committed snapshots across directories.
  • Update snapshot lookup/redirect behaviour, tests, and operator documentation to reflect the new flow.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
tests/reconfiguration.py Adjusts join test to exercise redirected snapshot discovery and snapshot absence scenarios.
src/snapshots/snapshot_manager.h Simplifies latest-snapshot lookup API to return a single path from multiple directories.
src/snapshots/filenames.h Adds helpers to list/sort committed snapshots across directories and select the newest.
src/snapshots/fetch.h Updates peer snapshot fetch to use in-memory CA blob rather than a CA path.
src/node/snapshot_serdes.h Splits receipt verification from snapshot application (segment separation + verify + deserialise).
src/node/rpc/file_serving_handlers.h Redirects snapshot requests using the snapshot filename rather than a full path.
src/node/node_state.h Adds join-triggered background snapshot fetching and local snapshot discovery/verification during join/recover.
src/host/test/ledger.cpp Updates tests for the new snapshot-manager return type (path semantics).
src/host/run.cpp Removes host-side snapshot preload/fetch, and forwards join snapshot-fetch settings into StartupConfig.
src/enclave/main.cpp Updates enclave create entrypoint signature (no startup snapshot payload).
src/enclave/entry_points.h Updates enclave create entrypoint declaration accordingly.
src/enclave/enclave.h Updates enclave node creation call signature accordingly.
include/ccf/node/startup_config.h Adds join snapshot-fetch config fields to StartupConfig::Join.
doc/operations/ledger_snapshot.rst Documents the new join flow (including flowchart) and revised snapshot-fetch timing.

Comment on lines 819 to 826
// If we've followed a redirect, it will have been updated in
// config.join. Note that this is fire-and-forget, it is assumed
// that it proceeds in the background, updating state when it
// completes, and the join timer separately re-attempts join after
// this succeeds
ccf::tasks::add_task(std::make_shared<FetchSnapshot>(
config.join, config.snapshots, this));
return;
Copy link

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On each StartupSeqnoIsOld response this schedules a new FetchSnapshot task, but the join retry timer can fire again before the fetch completes (the fetch helper is blocking and may retry internally). This can lead to multiple concurrent snapshot downloads and repeated overwrites of startup_snapshot_info. Consider tracking an in-flight fetch and skipping scheduling if one is already running (or otherwise gating join retries while fetching).

Copilot uses AI. Check for mistakes.
Comment on lines 283 to 284
Copy link

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On snapshot verification failure this code unconditionally deletes the snapshot file. If the snapshot came from snapshots.read_only_directory (which is commonly mounted read-only), std::filesystem::remove will fail and currently escalates to a logic_error, preventing startup. Consider only deleting files from the writable snapshot directory (or, on failure to delete, log and continue to the next candidate snapshot without throwing).

Suggested change
throw std::logic_error(
fmt::format("Could not remove file {}", snapshot_path.string()));
LOG_FAIL_FMT(
"Failed to remove snapshot file {} (continuing with next candidate)",
snapshot_path.string());

Copilot uses AI. Check for mistakes.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should confirm this isn't altering read-only, and rename-to-ignored rather than deleting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments