chore(ci): try Swatinem/rust-cache for target/ caching #25478
Draft
pront wants to merge 7 commits into
Replace the hand-rolled actions/cache step (which only cached ~/.cargo/registry) with Swatinem/rust-cache@v2.9.1, which also caches target/ build artifacts and prunes them intelligently before save: removes incremental artifacts, deps no longer in Cargo.lock, artifacts older than ~1 week, and ~/.cargo/registry/src (recreated from archives on restore).
Configuration:
- shared-key: "vector-<os>" gives a single OS-shared cache instead of one per Cargo.lock hash (today's pattern produces ~26 active entries at ~400 MB each, exhausting the 10 GB GHA cache budget).
- save-if: refs/heads/master means only master pushes save the cache. PR jobs restore from master's cache but don't write, so PR churn no longer creates per-PR cache entries.
Expected behavior on a PR with src/ changes only: Cargo fingerprints match for all unchanged third-party crates (rdkafka-sys, openssl-sys, zstd-sys, …), so only the touched workspace members recompile. Cold build time on cached PRs should drop from ~20m to ~3-5m.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
So this draft branch can produce a measurable cache hit on its own second push. Revert before merging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`key:` is additive — Swatinem appends rustc-hash + lockfile-hash to it regardless. `shared-key:` is the documented way to partition only by OS (still gets the rustc/lockfile suffixes appended, but that's the intended default behavior). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The source-change filter in .github/workflows/changes.yml doesn't include .github/actions/setup/**, so this PR's CI run skipped all the Rust-building workflows that exercise the cache. Touching src/main.rs with a comment flips the source filter on. Revert before merging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
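One possible follow-up is to add the setup action's path to the source filter, so that edits to it trigger the Rust workflows without a dummy source touch. The fragment below is only a sketch: it assumes `changes.yml` uses a `dorny/paths-filter`-style `filters` map, and the filter name `source` and the `src/**` entry are hypothetical placeholders rather than the file's actual contents.

```yaml
# Hypothetical excerpt of .github/workflows/changes.yml (structure assumed, not copied).
- uses: dorny/paths-filter@v3
  id: filter
  with:
    filters: |
      source:                          # assumed filter name
        - "src/**"                     # placeholder for the existing entries
        - ".github/actions/setup/**"   # new: edits to the setup action count as source changes
```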
Swatinem appends `${runnerOS}-${runnerArch}` automatically. Including
runner.os in shared-key produced keys like `vector-Linux-Linux-x64-...`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
shared-key forced multiple Rust CI jobs (check-fmt, check-clippy, Build, …) to compete for the same cache entry on save. The lightest job (check-fmt) consistently won the reservation lock and persisted its near-empty target/ (~486 MB, mostly registry). Heavier jobs that compiled the full workspace got rejected:
Failed to save: Unable to reserve cache with key v0-rust-vector-Linux-x64-..., another job may be creating this cache.
Letting Swatinem use its default job-id partitioning produces one entry per job, with no contention. Per-entry size will vary by job (clippy and Build will produce multi-GB entries; deny/fmt stay small). The trade-off is more cache entries in exchange for correct per-job artifact reuse.
Also bumps prefix-key to v1-vector so old shared-key entries don't shadow the new job-partitioned ones during transition.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
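The job-partitioned configuration described in this commit could look roughly like the step below. This is a sketch assembled from the inputs named in this PR (`prefix-key`, `save-if`, and the absence of `shared-key`); the step name is illustrative, and the `save-if` expression shown is the intended pre-merge value rather than the temporary `true` used on this branch.

```yaml
# Sketch of the cache step in .github/actions/setup/action.yml (step name is illustrative).
- name: Cache Rust build artifacts
  uses: Swatinem/rust-cache@v2.9.1
  with:
    # Explicit cache version; bumping it invalidates older entries.
    prefix-key: "v1-vector"
    # No shared-key: the action's default per-job key yields one entry per job,
    # so jobs do not race to reserve the same cache key on save.
    # Only pushes to master write the cache; PR jobs restore without saving.
    save-if: ${{ github.ref == 'refs/heads/master' }}
```

Leaving `shared-key` unset trades a larger number of entries for entries that actually match what each job compiles.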
Confirms whether check-clippy / tests cache restores actually hit (i.e. saved entries from prior run survive eviction). Will revert along with the source touch before merging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Status: paused
Exploration / prototype. Do not merge as-is — see Pre-merge cleanup below. Findings recorded here for whoever picks this up next; primary follow-up is to evaluate sccache + a Datadog-internal S3 cache instead of GHA cache (see Next steps).
Summary
Replaces the hand-rolled `actions/cache` step in `.github/actions/setup/action.yml` with `Swatinem/rust-cache@v2.9.1`, which caches `~/.cargo` AND `target/` build artifacts. The current cache step only caches `~/.cargo/registry`, so every Vector CI job does a cold cargo build today.
Configuration:
- `prefix-key: "v1-vector"`: explicit cache version for easy invalidation
- Default per-job partitioning (no `shared-key`); attempting a single shared key races between jobs and only the lightest job wins
- `save-if: true` (TEMP): needs revert to `github.ref == 'refs/heads/master'` before merging
Measurements
Empirical per-job cache entry sizes from this PR's CI runs:
Cache HIT was confirmed. On a second push to this branch (no Cargo.lock change), check-clippy reported:
Cache hit for: v1-vector-check-clippy-Linux-x64-f996a9aa-c3e9f435
Restored from cache key ... full match: true.
Cargo's first activity after restore was workspace-local code (`fakedata`, `vector-vrl/web-playground`) instead of cold-start third-party crates (`portable-atomic`, `critical-section`, `itoa`). Compile time for check-clippy dropped from ~6 min cold → ~2 min warm.
Limitations (the reason this is paused)
The 10 GB-per-repo GHA cache cap is the binding constraint.
The sum of Vector's per-job entries (~10 GB) saturates the entire repo budget in one CI cycle. Between the two pushes on this branch (~40 min apart), LRU eviction killed 2 of 7 entries, including the largest one (`check-generated-docs` at 2.55 GB). The cache works at the macro level, but individual entries are at constant eviction risk.
Pre-PR repo state was already at 91% saturation: 11.56 GB across 62 caches, with 26 redundant `Linux-cargo-*` entries (~9.8 GB) because the current cache key includes `hashFiles('**/Cargo.lock')` and every Cargo.lock variant created a new ~400 MB save.
After this PR's runs, repo cache settled to 9.88 GB across 5 caches; Swatinem's pruning and unified per-job entries replaced the noise.
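For context, a cache step keyed the way the `Linux-cargo-*` entries suggest would look roughly like the sketch below. The key pattern and cached path are taken from this description; the step name and the exact layout of the real step are assumptions.

```yaml
# Sketch of the status-quo cache step (step name is hypothetical).
- name: Cache Cargo registry
  uses: actions/cache@v5
  with:
    path: ~/.cargo/registry
    # Every distinct Cargo.lock hashes to a different key, so each lockfile
    # variant is saved as its own entry alongside the earlier ones.
    key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
```

With a key like this, any PR that touches `Cargo.lock` adds a fresh entry, which is consistent with the accumulation of near-duplicate entries reported above.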
What works in steady state with `save-if: refs/heads/master`
What doesn't work
- `cross.yml`: still cold (Docker-based cross-compile, host cache doesn't reach the container). Confirmed out of scope per discussion; release builds are always cold.
Comparison vs status quo
`actions/cache@v5` vs. `Swatinem/rust-cache@v2.9.1`
`~/.cargo/registry/{index,cache}`, `~/.cargo/git/db`, `~/.cargo/bin` + `target/`
`${runner.os}-cargo-${hash(Cargo.lock)}`: creates one ~400 MB entry per Cargo.lock variant
`Linux-cargo-*` near-duplicates totalling ~9.8 GB)
`cargo build`
Pre-merge cleanup (when resuming)
- Revert `save-if: true` to `save-if: ${{ github.ref == 'refs/heads/master' }}` in `.github/actions/setup/action.yml`.
- Revert the `src/main.rs` source-touch added to trigger the changes filter. (The filter at `.github/workflows/changes.yml` doesn't include `.github/actions/setup/**`; possibly worth adding as a follow-up so this hack isn't needed next time.)
- Clear the old `Linux-cargo-*` entries before merging so the new Swatinem cache has room to seed cleanly:
Next steps (not in scope for this PR)
The right long-term answer for Vector's cache budget is to move sccache to a Datadog-internal S3 backend, removing the 10 GB GHA cap entirely. The bucket `dd-sccache-storage-us1-ddbuild-io` already exists (used by observability-pipelines-worker on Datadog's GitLab CI per their `.gitlab-ci.yml`), and is documented in the Confluence page 2025-07-02 - S3 Lifecycle Policy Exploration for sccache Storage. The bucket uses a 30-day TTL.
Blocker for Vector to use it from GHA-hosted runners: needs an IAM role + OIDC trust policy in the `build-stable` AWS account that allows `repo:vectordotdev/vector:*`. That requires coordination with the build-stable team (Bruce Guenter's group per the Confluence page).
Once that's in place, follow-up PR replaces Swatinem entirely with:
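A rough sketch of what such a setup could look like is given below; it is not the snippet from the original description. It assumes the job authenticates with `aws-actions/configure-aws-credentials` over OIDC and that `sccache` is already on the runner's `PATH`. The role ARN and region are placeholders (the role does not exist yet), and the step names are illustrative. The mold-wrapper conflict is not handled here.

```yaml
# Job-level fragment; assumed layout, not taken from the repository.
permissions:
  id-token: write   # lets the job request a GitHub OIDC token
  contents: read

steps:
  - name: Assume the cache role via OIDC
    uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::<account-id>:role/<role-name>   # placeholder
      aws-region: <bucket-region>                                   # placeholder
  - name: Build with sccache
    env:
      RUSTC_WRAPPER: sccache            # route rustc invocations through sccache
      SCCACHE_BUCKET: dd-sccache-storage-us1-ddbuild-io
      SCCACHE_REGION: <bucket-region>   # placeholder
      CARGO_INCREMENTAL: "0"            # sccache cannot cache incremental builds
    run: cargo build
```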
One conflict to resolve: `RUSTC_WRAPPER=sccache` collides with the mold wrapper at `.github/actions/setup/action.yml:153-198`. Options: drop the mold wrapper in favor of `RUSTFLAGS="-C link-arg=-fuse-ld=mold"`, or use `update-alternatives` for a system-level linker swap. Either is cleaner than the current wrapper-script approach and removes mold from Cargo's fingerprint inputs.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
`no-changelog` label to this PR.
🤖 Generated with Claude Code