Conversation
Update the supported Kubernetes version range for testing from v1.30.x-v1.33.x to v1.32.x-v1.35.x. Per-PR tests use the minimum supported version (K8s 1.32.x): - Default K3S image: rancher/k3s:v1.32.13-k3s1 - Kind node images: kindest/node:v1.32.11 - Kube test components: v1.32.13 Nightly tests use the maximum supported version (K8s 1.35.x): - Split ci-entry-point into two steps: one for per-PR tests (minimum K8s version) and one for nightly tests (maximum K8s version via K3S_IMAGE=rancher/k3s:v1.35.2-k3s1) - The K3S_IMAGE env var propagates through to the testsuite pipeline and overrides the default in pkg/k3d/k3d.go Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Kind v0.31.0 requires the @sha256 digest to guarantee the correct image for the release. Without it, the containerd snapshotter detection fails. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Kuttl v0.19.0 embeds Kind v0.24.0 which cannot handle kindest/node images from Kind v0.31.0 (fails with "failed to detect containerd snapshotter"). Since kuttl tests use Kind internally via startKIND, the Kind node images must stay compatible with kuttl's embedded Kind library. The K8s version bump to 1.32.x-1.35.x is achieved through k3d (integration and acceptance tests) which is not affected by this limitation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Kuttl v0.19.0 embedded Kind v0.24.0 which couldn't handle kindest/node
images from Kind v0.31.0 ("failed to detect containerd snapshotter").
Kuttl v0.25.0 embeds Kind v0.31.0, enabling support for K8s 1.32.x
node images.
- ci/kuttl.nix: bump from v0.19.0 to v0.25.0
- operator/kind*.yaml: kindest/node v1.29.8 -> v1.32.11 with sha256 digest
- Taskfile.yml: kube test component images v1.29.6 -> v1.32.13
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The integration tests import kube-controller-manager and kube-apiserver images into the k3d cluster. These were hardcoded to v1.29.6 and need to match the bumped test version (v1.32.13). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents repository structure, build system, CI lint flow, golden test patterns, Kubernetes version testing architecture, and a step-by-step checklist for bumping Kubernetes versions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Revert pipeline.yml split: restore single ci-entry-point with original nightly condition instead of two separate steps - Move K3S_IMAGE env var to flake.nix devshell: nightly and local dev default to max K8s version (v1.35.2), per-PR tests use the hardcoded default in pkg/k3d/k3d.go (v1.32.13) - CLAUDE.md: note -update flag is legacy, prefer -update-golden - CLAUDE.md: wrap all commands in nix develop -c for correct tool versions and environment Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Nix's string interpolation syntax conflicts with shell parameter expansion containing colons (e.g. v1.35.2-k3s1). Use a plain value instead of eval to avoid the parser error. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove references to the legacy -update flag per review feedback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…steps Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The integration test failures with spurious DNS resolution errors (non-StatefulSet DNS names in SRV records) were transient, not a systemic issue. StatefulSet pods have stable name-[ordinal] DNS identities, so the PodDialer correctly handles them. Wrap KafkaClient and AdminClient assertions in require.Eventually retry loops to guard against transient DNS propagation delays during test cluster startup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…1.1-rc5 The old operator v25.1.3 can't install on K8s 1.35, causing all upgrade acceptance tests to fail with helm install timeouts. Bump the upgrade-from version to v25.2.2 (latest 25.2.x) which has better K8s compatibility. Also update the default Redpanda image to the v26.1.1-rc5 unstable build for testing with the upcoming 26.1 release (not yet GA). Changes: - operator-upgrades.feature: upgrade from v25.2.2 (was v25.1.3) - upgrade-regressions.feature: start from v25.2.2 (was v25.1.3) - console-upgrades.feature: start from v25.2.2 (was v25.1.3) - defaults.go: default Redpanda image to redpanda-unstable:v26.1.1-rc5 - Taskfile.yml: update test image defaults and add v26.1.1-rc5 to pull list Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cert-manager v1.8.0 only supports K8s 1.19-1.24 and fails to start on K8s 1.32+. This caused the operator upgrade acceptance tests to timeout because cert-manager couldn't issue webhook TLS certificates, so the old operator's helm install never completed. cert-manager v1.17.2 supports K8s 1.29-1.33+, covering our test range of K8s 1.32-1.35. Updated in: - pkg/vcluster/vcluster.go (certManagerChartversion) - Taskfile.yml (DEFAULT_SECOND_TEST_CERTMANAGER_VERSION) - All integration test files that import cert-manager images Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two acceptance test fixes: 1. operatorIsRunning now uses require.Eventually (2min timeout) to wait for the operator deployment to have available replicas instead of immediately asserting after checkStableResource. The previous check only waited for the resource version to stabilize, but the pod may not be ready yet — especially after namespace switches between scenarios. 2. Bump upgrade-from operator version from v25.2.2 to v25.3.1. The v25.2.2 operator still times out on helm install in K8s 1.32 vclusters. v25.3.1 is the latest release and most likely to be compatible with the K8s 1.32 API surface. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The acceptance upgrade tests install operator v25.3.1 from the public helm repo, which requires the operator container image to be available in the k3d cluster. Without pre-pulling it, the image pull inside the vcluster times out causing INSTALLATION FAILED: context deadline exceeded. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: vcluster v0.28.0 fails to initialize on K8s 1.32 — the vcluster pod never starts, so the kubeconfig secret "vc-<name>" is never created, causing all vcluster-dependent tests to fail with "secrets not found". Changes: - Bump vcluster from v0.28.0 to v0.31.2 (supports K8s 1.32+) - Revert upgrade-from operator version back to v25.2.2 (from v25.3.1) since the vcluster fix is the actual blocker - Remove unused v25.3.1 operator image from pull list - Update vcluster-pro image refs in all integration test files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The field managers regression test was upgrading from v25.2.2 to v25.2.2
(same version) due to an earlier replace_all error. Fix the intermediate
upgrade step to use the local dev chart ("../operator/chart") so it
actually tests upgrading from v25.2.2 to the current build (v26.1.x).
Also add sections 9-11 to CLAUDE.md documenting the vcluster, cert-manager,
and acceptance upgrade test version dependencies that must be updated when
bumping Kubernetes versions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The field managers regression test needs a 3-step upgrade path: v25.2.2 → v25.3.1 → dev chart v25.3.1 introduced the *kube.Ctl field manager regression, and the dev chart fixes it. The previous commit incorrectly skipped v25.3.1 by upgrading directly to dev, so the regression never appeared and the test timed out waiting for *kube.Ctl. Also add the v25.3.1 operator image to the pull list so it's available inside the k3d cluster. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
flake.nix was unconditionally setting K3S_IMAGE=rancher/k3s:v1.35.2-k3s1 in the devshell, which meant ALL CI runs (including per-PR) used K8s 1.35 instead of the intended K8s 1.32 minimum. Remove the K3S_IMAGE env var from flake.nix so: - Per-PR tests use the Go default from pkg/k3d/k3d.go (v1.32.13-k3s1) - Nightly tests override via the Buildkite schedule env setting: K3S_IMAGE=rancher/k3s:v1.35.2-k3s1 The Buildkite nightly schedule must be configured to set both: K8S_NIGHTLY=1 (gate condition in pipeline.yml) K3S_IMAGE=rancher/k3s:v1.35.2-k3s1 (runtime override for k3d) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Action needed in Buildkite UI: The nightly schedule for this pipeline must be configured with these env vars: K8S_NIGHTLY=1 is the gate condition (already in pipeline.yml line 39). K3S_IMAGE is the runtime override that pkg/k3d/k3d.go reads via os.LookupEnv. Both must be set in the Buildkite schedule's environment settings — this can't be done in code, only in the Buildkite UI |
Resolve conflict in redpanda_controller_test.go: take main's refactor that uses the importImages variable instead of a hardcoded list. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…2 images The redpanda_controller_test.go was missed when bumping test infrastructure versions. It still imported: - vcluster-pro:0.28.0 (should be 0.31.2) - kube-controller-manager:v1.29.6 (should be v1.32.13) - kube-apiserver:v1.29.6 (should be v1.32.13) - cert-manager:v1.8.0 (should be v1.17.2) This caused TestIntegrationRedpandaController to fail immediately with "Image 'ghcr.io/loft-sh/vcluster-pro:0.28.0' couldn't be found in the container runtime" since only v0.31.2 is pre-pulled. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vcluster v0.31.2 may not create the kubeconfig secret (vc-<name>) immediately after helm install --wait completes. The secret creation is asynchronous — the vcluster pod is ready but the secret hasn't been written yet. Replace the single Get with wait.PollUntilContextTimeout (2 min timeout, 2 sec interval) to handle this race condition. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
andrewstucki
left a comment
There was a problem hiding this comment.
Some things we should definitely strip out, some things that we'll want to double check, and a question about what we want to document as our supported range (are we going to tell users they must be on 1.32-1.35? if so, let's also constrain the helm charts to that)
|
|
||
| const ( | ||
| DefaultK3sImage = `rancher/k3s:v1.29.6-k3s2` | ||
| DefaultK3sImage = `rancher/k3s:v1.32.13-k3s1` |
There was a problem hiding this comment.
I believe we talked about keeping the default k3s image the earliest documented version we'll support. Are we planning on telling everyone we only support 1.32+? If so we should also change the helm manifests to match. If not, I'd keep this as is and solely overwrite the K3S_IMAGE variable in nightly tests.
There was a problem hiding this comment.
I think we should keep the helm chart installation not blocked by ks version but the testing should be done on what we say we support. We can always do spot check if need to manually for certain versions if asked but I don't think we should keep older versions around as I feel like there will be a tendency to support a large surface area and not test against non EOL k8s versions.
There was a problem hiding this comment.
Industry Comparison
| Project | K8s Versions Tested | Min | Max | Tool |
|---|---|---|---|---|
| Istio | 12 (!!) | 1.23 | 1.35 | Kind (custom images) |
| cert-manager | 5 | 1.31 | 1.35 | Kind |
| ArgoCD | 4 | 1.32 | 1.35 | K3s |
| Prometheus Operator | 1 | 1.35 | 1.35 | Kind |
Key Takeaways
- Istio is an extreme outlier — they test 12 minor versions (1.23–1.35), including many long-EOL versions. Most projects don't do this.
- cert-manager and ArgoCD represent the mainstream — 4–5 versions, roughly tracking what cloud providers still support. Their minimums (1.31, 1.32) include at most 1 recently-EOL version.
- Prometheus Operator only tests on the latest version in CI.
- Nobody uses the minimum supported version as the default CI test target — the default is always a recent version, with older versions in matrix/nightly runs.
There was a problem hiding this comment.
Cert manager also just moved to 1.32 - 1.35 as of Friday: https://cert-manager.io/docs/releases/#currently-supported-releases
- Remove "Post-Merge: Tagging and Publishing" section to prevent accidental release cutting via Claude (git tag push, workflow triggers) - Use task-based generators (task generate, task k8s:generate, task lint, task test:unit) instead of raw tool invocations for consistency with CI - Note that chart template tests use -update instead of -update-golden Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CLAUDE.md review feedback addressedCommit
Unit test failure (Build 12372)The only failure is |
Address review comment: consolidate Build System section to reference task generate instead of individual gotohelm and gen commands. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The nix devshell already sets GOLANG_PROTOBUF_REGISTRATION_CONFLICT=ignore, so recommend nix develop -c instead of manual env var prefix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace all raw gotohelm and gen references with task-based equivalents. Remove the standalone gotohelm and k8s:generate entries from Common Commands since task generate covers both. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Explain what the tools do but explicitly state to use task-based commands instead of invoking gotohelm, gen schema, or gen partial directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ce tests - upgrade-regressions: use v25.2.1 (pre-dates the field manager fix in v25.2.2) - console-upgrades: revert to v25.1.3 (v25.2.2 already has Console v3 migration, making the v2→v3 test invalid) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
v25.2.2 supports the Stable status condition, so use it for both pre-upgrade and post-upgrade checks instead of the weaker Ready check. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
v25.2.1 doesn't use cluster.redpanda.com/operator as the Service field manager, causing the test to poll forever and timeout. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@andrewstucki Ready for review. Here is the docs PR associated with this bump: redpanda-data/docs#1627 |
Summary
K3S_IMAGEin a separate Buildkite pipeline stepCLAUDE.mddocumenting repo structure, CI patterns, and a step-by-step checklist for future K8s version bumpsIntegration test retry for transient DNS failures
Integration tests (
TestIntegrationClientFactory,TestIntegrationClientFactoryTLSListeners) occasionally see transient DNS resolution failures during SRV lookups after cluster startup. These are spurious — StatefulSet pods have stablename-[ordinal]DNS identities and thePodDialercorrectly handles them via the SRV record code path incharts/redpanda/client/client.go.The fix wraps client connection assertions in
require.Eventuallyretry loops (2min timeout, 5s interval) to tolerate transient DNS propagation delays rather than failing on the first attempt.Root cause: vcluster v0.28.0 incompatible with K8s 1.32
The acceptance upgrade tests were failing with
secrets "vc-vcluster-xxx" not found— the vcluster pod itself failed to initialize on K8s 1.32, so the kubeconfig secret was never created. This caused all vcluster-dependent tests (operator upgrades, field manager regressions) to fail. The subsequentINSTALLATION FAILED: context deadline exceedederrors from helm were a downstream symptom.Fix: Bump vcluster from v0.28.0 → v0.31.2 which supports K8s 1.32+.
Updated in:
pkg/vcluster/vcluster.go— vcluster chart version constantTaskfile.yml—DEFAULT_TEST_VCLUSTER_VERSIONvcluster-proimageRoot cause: cert-manager v1.8.0 incompatible with K8s 1.32
The vcluster test infrastructure also deploys cert-manager v1.8.0 inside the vcluster for webhook TLS certificate management. cert-manager v1.8.0 only supports K8s 1.19-1.24 and fails to start on K8s 1.32, preventing the operator's webhook certificates from being issued.
Fix: Bump cert-manager from v1.8.0 → v1.17.2 which supports K8s 1.29-1.33+.
Updated in:
pkg/vcluster/vcluster.go— cert-manager chart version constantTaskfile.yml—DEFAULT_SECOND_TEST_CERTMANAGER_VERSIONAcceptance test improvements
operatorIsRunningreadiness check: Replaced immediaterequire.Equalassertions withrequire.Eventually(2min timeout, 5s interval) that polls until the operator deployment has available replicas. The previous check usedcheckStableResourcewhich only waited for the resource version to stabilize — not for the pod to become ready.Upgrade test versions: Bumped starting operator from v25.1.3 to v25.2.2 and default Redpanda image to
redpanda-unstable:v26.1.1-rc5(26.1 is not yet GA).Before this PR:
Operator upgrade from 25.1.3redpandadata/redpanda:v25.3.1Regression - field managersredpandadata/redpanda:v25.3.1Console v2 to v3(2 scenarios)redpandadata/redpanda:v25.3.1After this PR:
Operator upgrade from 25.2.2redpandadata/redpanda-unstable:v26.1.1-rc5Regression - field managersredpandadata/redpanda-unstable:v26.1.1-rc5Console v2 to v3(2 scenarios)redpandadata/redpanda-unstable:v26.1.1-rc5Changes
pkg/k3d/k3d.gov1.29.6-k3s2→v1.32.13-k3s1pkg/vcluster/vcluster.gov0.28.0→v0.31.2; cert-manager:v1.8.0→v1.17.2operator/kind*.yamlv1.29.8→v1.32.11(with sha256 digest from Kind v0.31.0)Taskfile.ymlv1.29.6→v1.32.13; vcluster:0.28.0→0.31.2; cert-manager:v1.8.0→v1.17.2; default test Redpanda image:redpanda:v25.3.1→redpanda-unstable:v26.1.1-rc5; add v25.3.1 operator image to pull listci/kuttl.nixv0.19.0→v0.25.0(embeds Kind v0.31.0, required for kindest/node v1.32.x)pkg/lint/testdata/tool-versions.txtaroperator/*_test.go(3 files)kube-controller-manager/kube-apiserver:v1.29.6→v1.32.13; cert-manager:v1.8.0→v1.17.2; vcluster-pro:0.28.0→0.31.2operator/pkg/client/factory_test.gorequire.Eventuallyretry loops; update test imagesacceptance/steps/operator.gooperatorIsRunningusesrequire.Eventuallyinstead of immediate assertionsacceptance/features/operator-upgrades.featureacceptance/features/upgrade-regressions.featureacceptance/features/console-upgrades.featureacceptance/steps/defaults.goredpanda:v25.3.1→redpanda-unstable:v26.1.1-rc5.buildkite/pipeline.ymlci-entry-pointinto per-PR (min K8s) and nightly (max K8s viaK3S_IMAGE=rancher/k3s:v1.35.2-k3s1)CLAUDE.mdWhy bump kuttl?
Kuttl v0.19.0 embedded Kind v0.24.0 which maxes out at
kindest/node:v1.31.0. Attempting to usekindest/node:v1.32.11with the old kuttl causedfailed to detect containerd snapshottererrors because Kind v0.24.0 doesn't understand the containerd configuration in newer node images. Kuttl v0.25.0 embeds Kind v0.31.0, which natively supportskindest/node:v1.32.11.How nightly K8s version override works
The
K3S_IMAGEenv var set on the nightly Buildkite step propagates throughbuildkite-agent pipeline uploadinto the testsuite pipeline. The k3d package (pkg/k3d/k3d.go:93-95) checksK3S_IMAGEand uses it to override the default image when creating test clusters.CLAUDE.md
Documents learnings from this bump, including:
generate→lint→git diff --exit-code)Test plan
K8S_NIGHTLY=1🤖 Generated with Claude Code