Fetch snapshots during join, rather than at startup#7630
Fetch snapshots during join, rather than at startup#7630eddyashton wants to merge 32 commits intomicrosoft:mainfrom
Conversation
…_fetch_during_join
…_fetch_during_join
…_fetch_during_join
…_fetch_during_join
There was a problem hiding this comment.
Pull request overview
Shift snapshot fetching later in the node startup flow so joiners only fetch a newer snapshot after receiving a StartupSeqnoIsOld response, while refactoring snapshot handling to separate receipt verification from snapshot deserialisation.
Changes:
- Move join-time snapshot discovery/fetching into the enclave join path (triggered on
StartupSeqnoIsOld) rather than doing it at host startup. - Refactor snapshot serdes into segment separation + explicit receipt verification, and add helpers to enumerate committed snapshots across directories.
- Update snapshot lookup/redirect behaviour, tests, and operator documentation to reflect the new flow.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/reconfiguration.py | Adjusts join test to exercise redirected snapshot discovery and snapshot absence scenarios. |
| src/snapshots/snapshot_manager.h | Simplifies latest-snapshot lookup API to return a single path from multiple directories. |
| src/snapshots/filenames.h | Adds helpers to list/sort committed snapshots across directories and select the newest. |
| src/snapshots/fetch.h | Updates peer snapshot fetch to use in-memory CA blob rather than a CA path. |
| src/node/snapshot_serdes.h | Splits receipt verification from snapshot application (segment separation + verify + deserialise). |
| src/node/rpc/file_serving_handlers.h | Redirects snapshot requests using the snapshot filename rather than a full path. |
| src/node/node_state.h | Adds join-triggered background snapshot fetching and local snapshot discovery/verification during join/recover. |
| src/host/test/ledger.cpp | Updates tests for the new snapshot-manager return type (path semantics). |
| src/host/run.cpp | Removes host-side snapshot preload/fetch, and forwards join snapshot-fetch settings into StartupConfig. |
| src/enclave/main.cpp | Updates enclave create entrypoint signature (no startup snapshot payload). |
| src/enclave/entry_points.h | Updates enclave create entrypoint declaration accordingly. |
| src/enclave/enclave.h | Updates enclave node creation call signature accordingly. |
| include/ccf/node/startup_config.h | Adds join snapshot-fetch config fields to StartupConfig::Join. |
| doc/operations/ledger_snapshot.rst | Documents the new join flow (including flowchart) and revised snapshot-fetch timing. |
| // If we've followed a redirect, it will have been updated in | ||
| // config.join. Note that this is fire-and-forget, it is assumed | ||
| // that it proceeds in the background, updating state when it | ||
| // completes, and the join timer separately re-attempts join after | ||
| // this succeeds | ||
| ccf::tasks::add_task(std::make_shared<FetchSnapshot>( | ||
| config.join, config.snapshots, this)); | ||
| return; |
There was a problem hiding this comment.
On each StartupSeqnoIsOld response this schedules a new FetchSnapshot task, but the join retry timer can fire again before the fetch completes (the fetch helper is blocking and may retry internally). This can lead to multiple concurrent snapshot downloads and repeated overwrites of startup_snapshot_info. Consider tracking an in-flight fetch and skipping scheduling if one is already running (or otherwise gating join retries while fetching).
src/node/node_state.h
Outdated
There was a problem hiding this comment.
On snapshot verification failure this code unconditionally deletes the snapshot file. If the snapshot came from snapshots.read_only_directory (which is commonly mounted read-only), std::filesystem::remove will fail and currently escalates to a logic_error, preventing startup. Consider only deleting files from the writable snapshot directory (or, on failure to delete, log and continue to the next candidate snapshot without throwing).
| throw std::logic_error( | |
| fmt::format("Could not remove file {}", snapshot_path.string())); | |
| LOG_FAIL_FMT( | |
| "Failed to remove snapshot file {} (continuing with next candidate)", | |
| snapshot_path.string()); |
There was a problem hiding this comment.
We should confirm this isn't altering read-only, and rename-to-ignored rather than deleting.
…_fetch_during_join
…_fetch_during_join
…shton/CCF into snapshot_fetch_during_join
…_fetch_during_join
…_fetch_during_join
…_fetch_during_join
This was a 'mare to land. Will add some comments inline.
The goal was to push snapshot fetching later in the startup process, and late enough that we can do it when receiving a
StartupSeqnoIsOlderror response to a/join.While doing that, I've decoupled the verification from the deserialisation of the snapshots, in a hopefully readable way, alongside a bunch of the helper functions for accessing snapshots.