Fetch snapshots during join, rather than at startup by eddyashton · Pull Request #7630 · microsoft/CCF

eddyashton · 2026-01-29T15:37:26Z

This was a 'mare to land. Will add some comments inline.

The goal was to push snapshot fetching later in the startup process, late enough that we can do it on receiving a StartupSeqnoIsOld error response to a /join.

While doing that, I've decoupled the verification from the deserialisation of the snapshots, in a hopefully readable way, alongside a bunch of the helper functions for accessing snapshots.

Copilot

Pull request overview

Shift snapshot fetching later in the node startup flow so joiners only fetch a newer snapshot after receiving a StartupSeqnoIsOld response, while refactoring snapshot handling to separate receipt verification from snapshot deserialisation.

Changes:

Move join-time snapshot discovery/fetching into the enclave join path (triggered on StartupSeqnoIsOld) rather than doing it at host startup.
Refactor snapshot serdes into segment separation + explicit receipt verification, and add helpers to enumerate committed snapshots across directories.
Update snapshot lookup/redirect behaviour, tests, and operator documentation to reflect the new flow.
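
As an illustration of why separating verification from deserialisation helps, here is a minimal sketch of choosing the newest local snapshot that passes verification. It is not the code from this PR: `Candidate`, `Verifier` and `pick_newest_valid` are hypothetical names, and the verifier is a stand-in for the receipt check.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a committed snapshot found on disk.
struct Candidate
{
  uint64_t seqno;
  std::string path;
};

// Verification only answers "is this snapshot's receipt valid?".
// It does not deserialise anything, so it can be run on every candidate.
using Verifier = std::function<bool(const Candidate&)>;

// Returns the candidate with the highest seqno that passes verification,
// or std::nullopt if none does.
std::optional<Candidate> pick_newest_valid(
  std::vector<Candidate> candidates, const Verifier& verify)
{
  assert(static_cast<bool>(verify));

  // Sort once, newest first, so the first valid candidate is the best one.
  std::sort(
    candidates.begin(),
    candidates.end(),
    [](const Candidate& a, const Candidate& b) { return a.seqno > b.seqno; });

  for (const auto& candidate : candidates)
  {
    if (verify(candidate))
    {
      return candidate;
    }
  }
  return std::nullopt;
}
```

Only the returned candidate then needs to be deserialised; a corrupt newest snapshot just means falling through to the next one.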

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 10 comments.

File	Description
tests/reconfiguration.py	Adjusts join test to exercise redirected snapshot discovery and snapshot absence scenarios.
src/snapshots/snapshot_manager.h	Simplifies latest-snapshot lookup API to return a single path from multiple directories.
src/snapshots/filenames.h	Adds helpers to list/sort committed snapshots across directories and select the newest.
src/snapshots/fetch.h	Updates peer snapshot fetch to use in-memory CA blob rather than a CA path.
src/node/snapshot_serdes.h	Splits receipt verification from snapshot application (segment separation + verify + deserialise).
src/node/rpc/file_serving_handlers.h	Redirects snapshot requests using the snapshot filename rather than a full path.
src/node/node_state.h	Adds join-triggered background snapshot fetching and local snapshot discovery/verification during join/recover.
src/host/test/ledger.cpp	Updates tests for the new snapshot-manager return type (path semantics).
src/host/run.cpp	Removes host-side snapshot preload/fetch, and forwards join snapshot-fetch settings into StartupConfig.
src/enclave/main.cpp	Updates enclave create entrypoint signature (no startup snapshot payload).
src/enclave/entry_points.h	Updates enclave create entrypoint declaration accordingly.
src/enclave/enclave.h	Updates enclave node creation call signature accordingly.
include/ccf/node/startup_config.h	Adds join snapshot-fetch config fields to StartupConfig::Join.
doc/operations/ledger_snapshot.rst	Documents the new join flow (including flowchart) and revised snapshot-fetch timing.

Copilot · 2026-01-29T15:49:17Z

src/node/node_state.h

+              // If we've followed a redirect, it will have been updated in
+              // config.join. Note that this is fire-and-forget, it is assumed
+              // that it proceeds in the background, updating state when it
+              // completes, and the join timer separately re-attempts join after
+              // this succeeds
+              ccf::tasks::add_task(std::make_shared<FetchSnapshot>(
+                config.join, config.snapshots, this));
+              return;


On each StartupSeqnoIsOld response this schedules a new FetchSnapshot task, but the join retry timer can fire again before the fetch completes (the fetch helper is blocking and may retry internally). This can lead to multiple concurrent snapshot downloads and repeated overwrites of startup_snapshot_info. Consider tracking an in-flight fetch and skipping scheduling if one is already running (or otherwise gating join retries while fetching).
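
A minimal sketch of the in-flight tracking suggested above, assuming a hypothetical `FetchGate` helper and a `schedule` callback standing in for `ccf::tasks::add_task(...)`. Neither name comes from the PR, and this is not necessarily how the later "Avoid scheduling multiple fetches" commit does it.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Hypothetical helper: allows at most one snapshot fetch to be in flight.
class FetchGate
{
  std::atomic<bool> in_flight{false};

public:
  // Returns true only for the caller that moves the gate from idle to busy.
  bool try_begin()
  {
    bool expected = false;
    return in_flight.compare_exchange_strong(expected, true);
  }

  // Called by the fetch task when it finishes, whether or not it succeeded.
  void end()
  {
    assert(in_flight.load() && "end() called without a matching try_begin()");
    in_flight.store(false);
  }
};

// Stand-in for the StartupSeqnoIsOld branch of the join response handler.
// Returns true if a fetch was scheduled, false if one was already running.
bool on_startup_seqno_is_old(
  FetchGate& gate, const std::function<void()>& schedule)
{
  if (!gate.try_begin())
  {
    // A fetch is already running; the join timer will simply retry later.
    return false;
  }
  schedule();
  return true;
}
```

The fetch task would call `end()` on completion, so a failed fetch does not block later attempts.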

Copilot · 2026-01-29T15:49:18Z

src/node/node_state.h

On snapshot verification failure this code unconditionally deletes the snapshot file. If the snapshot came from snapshots.read_only_directory (which is commonly mounted read-only), std::filesystem::remove will fail and currently escalates to a logic_error, preventing startup. Consider only deleting files from the writable snapshot directory (or, on failure to delete, log and continue to the next candidate snapshot without throwing).

Suggested change

-            throw std::logic_error(
-              fmt::format("Could not remove file {}", snapshot_path.string()));
+            LOG_FAIL_FMT(
+              "Failed to remove snapshot file {} (continuing with next candidate)",
+              snapshot_path.string());

eddyashton · 2026-01-30

We should confirm this isn't altering the read-only directory, and rename the file so that it is ignored rather than deleting it.
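
A minimal sketch of the rename-to-ignored idea, assuming a hypothetical `.ignored` suffix and helper name (neither is taken from the PR). It uses the non-throwing `std::filesystem::rename` overload, so a read-only directory produces a `false` return rather than an exception.

```cpp
#include <cassert>
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical suffix; the naming actually used by CCF is not shown here.
static constexpr auto ignored_suffix = ".ignored";

// Tries to move a snapshot that failed verification out of the candidate set.
// Returns true if it was renamed, false if it had to be left in place
// (for example because its directory is mounted read-only). Never throws
// on filesystem errors.
bool try_mark_ignored(const fs::path& snapshot_path)
{
  assert(!snapshot_path.empty());

  fs::path ignored_path = snapshot_path;
  ignored_path += ignored_suffix;

  std::error_code ec;
  fs::rename(snapshot_path, ignored_path, ec);
  return !ec;
}
```

A caller would log on `false` and move on to the next candidate, which matches the "continuing with next candidate" intent of the suggested change.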

eddyashton added 17 commits January 12, 2026 13:55

Tweak API, so we have a static variant

27390b2

Remove SnapshotManager arg

1d0813c

Push snapshot fetching in-enclave, late

639eff8

Separate verifying and deserialising snapshot

da179a8

Running in place, and finally restoring recovery parsing

f36949d

Merge branch 'main' of https://github.com/microsoft/CCF into snapshot_fetch_during_join

b85549d

Merge branch 'main' of https://github.com/microsoft/CCF into snapshot_fetch_during_join

93ebc07

ci-checks

98178a5

Sketch later snapshot fetch, with a mess of TODOs

1764a12

Merge branch 'main' of https://github.com/microsoft/CCF into snapshot_fetch_during_join

50f2281

Concede blocking fetch for now

4f1b470

Remove eager fetch

d8e97d5

Remove backup's snapshot files in e2e test

b067309

Update docs, including a flowchart

07e41ad

Iterate through local snapshots to find best non-corrupt

c4ba3f6

Update flowchart

e6db07c

Merge branch 'main' of https://github.com/microsoft/CCF into snapshot_fetch_during_join

8646b65

eddyashton requested a review from a team as a code owner January 29, 2026 15:37

Copilot AI review requested due to automatic review settings January 29, 2026 15:37

Copilot started reviewing on behalf of eddyashton January 29, 2026 15:37

eddyashton commented Jan 29, 2026

doc/operations/ledger_snapshot.rst

eddyashton commented Jan 29, 2026

include/ccf/node/startup_config.h

eddyashton commented Jan 29, 2026

src/node/node_state.h

Format

52d7ccd

Copilot AI reviewed Jan 29, 2026

achamayou reviewed Jan 30, 2026

src/node/node_state.h

eddyashton added 2 commits February 3, 2026 11:08

Sort once

475730f

Merge branch 'main' of https://github.com/microsoft/CCF into snapshot_fetch_during_join

d804062

Copilot AI mentioned this pull request Feb 10, 2026

Add exception handling in task_worker_loop to prevent worker thread termination #7658

Open

11 tasks

Verify after fetching, lock before updating, try around parsing

49e8a07

eddyashton mentioned this pull request Feb 10, 2026

Forbid reading an empty range in file serving handles #7657

Merged

achamayou and others added 11 commits February 10, 2026 16:47

Merge branch 'main' into snapshot_fetch_during_join

4ba90f6

Merge branch 'main' of https://github.com/microsoft/CCF into snapshot_fetch_during_join

6d0fa80

Ignore rather than deleting

4983647

Less verbose, and fixups

67601d1

e2e test of corrupt/ignore behaviour

2a39cfb

Merge branch 'snapshot_fetch_during_join' of https://github.com/eddyashton/CCF into snapshot_fetch_during_join

b21ea9b

Merge branch 'main' of https://github.com/microsoft/CCF into snapshot_fetch_during_join

d5d9a0c

Merge branch 'main' of https://github.com/microsoft/CCF into snapshot_fetch_during_join

dc4591d

Merge branch 'main' of https://github.com/microsoft/CCF into snapshot_fetch_during_join

5c81af0

Avoid scheduling multiple fetches

ff5afa2

Format and lint

b3887d5

eddyashton wants to merge 32 commits into microsoft:main from eddyashton:snapshot_fetch_during_join
