Modernize for Ansible 10.x, Ubuntu 24.04, kubespray v2.30 #1336

Merged
michael-balint merged 15 commits into NVIDIA:master from dholt:fix/setup-distutils-to-packaging on Feb 19, 2026
@dholt dholt commented Feb 18, 2026

Summary

Full modernization of DeepOps build system, dependencies, and OS support:

  • Ansible: 9.13.0 → 10.7.0 (ansible-core 2.16 → 2.17)
  • kubespray: v2.27.0 → v2.30.0 (latest stable, Kubernetes v1.34.3)
  • ansible-lint: 5.4.0 → 26.1.1 (now compatible with Ansible 10.x)
  • Collections: ansible.posix 2.1.0, community.general 12.3.0, community.docker 5.0.6, devsec.hardening 10.5.0
  • Galaxy roles: Updated nvidia_driver, geerlingguy.ntp, gantsign.golang
  • OS support: Dropped Ubuntu 18.04 and CentOS 7 dead code; added Ubuntu 22.04 molecule platforms
  • Inventory groups: Renamed to kubespray v2.30 convention (kube_control_plane, kube_node, k8s_cluster)
  • Deprecated patterns: Removed apt_key, action: keyword, inline key=value syntax
  • PEP 668: Replaced pip install docker with apt python3-docker across 5 roles
  • containerd: Switched snapshotter from native to overlayfs (fixes an image pull bug with containerd v2.x)
  • setup.sh: Fixed packaging import ordering bug, updated version pins, added passlib

Test results

| Playbook | Ubuntu 22.04 | Ubuntu 24.04 |
|---|---|---|
| k8s-cluster.yml | 725 ok, 0 fail; 3 nodes Ready (v1.34.3), test pod runs | 711 ok, 0 fail; 3 nodes Ready (v1.34.3), test pod runs |
| slurm-cluster.yml | 570+362 ok, 0 fail; srun job works | 562+393 ok, 0 fail; srun job works |
| ngc-ready-server.yml | 69+70 ok, 0 fail | 69+73 ok, 0 fail |
| nvidia-cuda.yml | 14 ok, 0 fail | (covered by ngc-ready-server) |

All playbooks were tested with real deployments on ephemeral MAAS VMs. The only non-zero exit across all runs came from kubespray's "copy kubectl to ansible host" task: the VMs sit behind a bastion and are not directly reachable for rsync. This is not a code bug.

Untested playbooks require specific hardware (DGX, InfiniBand/MOFED, GPUs) not available in the test environment.

🤖 Generated with Claude Code

dholt and others added 11 commits February 18, 2026 13:00
distutils.version.LooseVersion was removed from the Python stdlib in
3.12, breaking setup.sh on Ubuntu 24.04+ and any modern Python. Switch
to packaging.version.Version (available via pip) and use the venv
python3 instead of PYTHON_BIN so the import resolves correctly.

Also bump jmespath 0.10.0 to 1.0.1 to match kubespray requirements,
and add packaging to the explicit pip install list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
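As a minimal sketch of the kind of comparison this commit describes (not the actual setup.sh code), the following uses packaging.version.Version in place of the removed distutils.version.LooseVersion. The helper name `meets_minimum` and the version strings are hypothetical. The fallback import from pip's vendored copy is only a convenience so the sketch runs where the standalone packaging distribution is absent.

```python
try:
    from packaging.version import Version
except ImportError:
    # Convenience for this sketch only: pip ships a vendored copy of packaging.
    from pip._vendor.packaging.version import Version


def meets_minimum(installed: str, required: str) -> bool:
    """Return True if the installed version is at least the required one."""
    return Version(installed) >= Version(required)


if __name__ == "__main__":
    # Hypothetical version strings, chosen to mirror the pins named in this PR.
    print(meets_minimum("10.7.0", "9.13.0"))
    print(meets_minimum("2.16.3", "2.17.0"))
```

Version compares release segments numerically, so "1.10.0" sorts after "1.9.0", which a plain string comparison would get wrong.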
- Update runners from ubuntu-20.04 (removed) to ubuntu-22.04
- Bump actions to current versions (checkout@v4, setup-python@v4,
  codeql-action@v3, stale@v9)
- Update Python 3.9 to 3.12, Ansible 4.8.0 to 9.13.0 in CI
- Add setup.yml workflow to test setup.sh on Ubuntu 22.04 and 24.04
- Use explicit venv python path in setup.sh version checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
- Keep ansible==4.8.0 for lint job (ansible-lint 5.4.0 is incompatible
  with ansible-core 2.16); use Python 3.10 for compatibility
- Use molecule-plugins[docker] instead of molecule[docker] (driver
  moved to separate package in newer molecule versions)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
The packaging module is used for version comparisons but was not
installed until after those comparisons ran. This caused ImportError
when ansible was already installed in the venv. Install packaging
immediately after pip upgrade, before the version check block.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
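The ordering problem above can be guarded against with a preflight check that runs before any version comparison. This is a sketch under assumptions, not what setup.sh does: the function name `missing_modules` and the module list are hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be imported.

    Only top-level names are supported here; for dotted names, find_spec
    imports the parent package and may raise instead of returning None.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Hypothetical list; the real dependency set of setup.sh may differ.
    missing = missing_modules(["packaging", "jmespath"])
    if missing:
        print("install before running version checks:", ", ".join(missing))
    else:
        print("all required modules present")
```

A check like this reports the missing dependency by name instead of surfacing an ImportError from the middle of the version check block.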
- Remove deprecated apt_key tasks from nvidia_cuda and nvidia_dcgm
  (cuda-keyring .deb package supersedes old GPG key management)
- Replace action: keyword with proper module syntax in easy-build
- Replace inline key=value module args with YAML dict syntax
  in easy-build and kerberos_client
- Widen kerberos_client version checks for RHEL 8+ and Ubuntu 20+

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
Remove dead code paths for EOL platforms (CentOS 7 EOL Jun 2024,
Ubuntu 18.04 EOL May 2023). Changes:

- setup.sh: Remove DEPS_EL7, simplify RHEL package install
- slurm: Remove CentOS 7 yum tasks, widen RHEL 8 dnf conditions
- lmod: Remove CentOS 7 yum task and Ubuntu 18.04 posix_c bugfix
- nfs: Remove RHEL 7 libsemanage-python task
- kerberos_client: Consolidate to single RHEL and Ubuntu task/vars
- openshift: Remove python2-openshift CentOS 7 task
- ood-wrapper: Update singularity image from 18.04 to 22.04
- molecule configs: Remove 1804/centos-7, add ubuntu-2204 platforms
- config.example: Update NGC container tags to current versions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
- Ansible: 9.13.0 -> 10.7.0 (ansible-core 2.16 -> 2.17)
- ansible-lint: 5.4.0 -> 26.1.1 (now compatible with Ansible 10.x)
- kubespray: v2.27.0+88 -> v2.30.0 (latest stable)
- jmespath: 1.0.1 -> 1.1.0
- ansible.posix: 1.5.4 -> 2.1.0
- community.general: 7.2.0 -> 12.3.0
- community.docker: 3.10.2 -> 5.0.6
- nvidia.nvidia_driver: v2.3.0 -> v2.3.1
- dev-sec.ssh-hardening: 9.7.0 -> 10.5.0
- geerlingguy.ntp: 2.3.2 -> 4.0.0
- gantsign.golang: 3.1.6 -> 3.5.0

Also fixes:
- docker.yml: Update kubespray defaults path (main.yml -> main/main.yml)
- docker.yml, k8s-cluster.yml: Remove CentOS 7 docker repo overrides
- CI: Remove ansible-lint/ansible 4.8.0 version workaround

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
- ansible.cfg: Replace removed community.general.yaml callback with
  ansible.builtin.default + result_format=yaml
- requirements.yml: Migrate dev-sec.ssh-hardening role to devsec.hardening
  collection (standalone role repo stopped at 9.7.0, 10.x+ is collection-only)
- playbooks: Update include_role references from dev-sec.ssh-hardening to
  devsec.hardening.ssh_hardening (FQCN)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
kubespray v2.30.0 renamed kubespray-defaults to kubespray_defaults
(underscore) and removed the defaults/ dir from the old location.
Update vars_files path and role reference in docker.yml accordingly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
Modern Ubuntu (22.04+) enforces PEP 668 'externally-managed-environment'
which blocks system-wide pip installs. Replace pip: name=docker with
package: name=python3-docker across all roles that need the Docker
Python SDK. Also removes dead Python 2 code paths.

Affected roles: standalone-container-registry, docker-login, prometheus,
alertmanager, nginx-docker-registry-cache

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
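For context on why system-wide pip installs fail, PEP 668 defines a marker file named EXTERNALLY-MANAGED in the interpreter's stdlib directory, which installers honor when not running inside a virtual environment. The sketch below checks for that condition; the function names are hypothetical and the roles in this PR do not necessarily perform such a check.

```python
import os
import sys
import sysconfig


def externally_managed_marker() -> str:
    """Path where PEP 668 places the EXTERNALLY-MANAGED marker file."""
    stdlib_dir = sysconfig.get_path("stdlib")
    return os.path.join(stdlib_dir, "EXTERNALLY-MANAGED")


def is_externally_managed() -> bool:
    """True if running outside a venv and the marker file exists."""
    in_venv = sys.prefix != sys.base_prefix
    return not in_venv and os.path.isfile(externally_managed_marker())


if __name__ == "__main__":
    if is_externally_managed():
        print("system Python is externally managed; use distro packages or a venv")
    else:
        print("no PEP 668 restriction detected for this interpreter")
```

When the check returns True, installing the distribution's package (here python3-docker) or using a virtual environment avoids the 'externally-managed-environment' error.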
The passlib module is required by Ansible's password_hash filter used
in the users playbook. Without it, password hashing fails with
'No module named passlib' on modern systems.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
@dholt dholt marked this pull request as draft February 19, 2026 15:21
dholt and others added 2 commits February 19, 2026 09:59
kubespray v2.30 requires underscored group names:
- kube-master -> kube_control_plane
- kube-node -> kube_node
- k8s-cluster -> k8s_cluster

Updated inventory templates, group_vars filename, group_vars content,
and all playbook references. Directory paths (playbooks/k8s-cluster/)
are unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
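For existing INI-style inventories, the rename listed above could be applied with a small script such as the following. This is an illustrative sketch, not a tool shipped with DeepOps or kubespray: `rename_groups` is a hypothetical helper, and it only rewrites group headers and group names listed under `:children` sections, leaving hostnames and variables untouched.

```python
import re

# Old hyphenated group names mapped to the underscored names from the commit.
GROUP_RENAMES = {
    "kube-master": "kube_control_plane",
    "kube-node": "kube_node",
    "k8s-cluster": "k8s_cluster",
}

# Matches an INI group header such as [kube-master] or [k8s-cluster:children].
_HEADER = re.compile(r"^\[([^\]:]+)(:[a-z]+)?\]\s*$")


def rename_groups(inventory_text: str) -> str:
    """Rewrite group headers and group names inside :children sections."""
    out = []
    in_children = False
    for line in inventory_text.splitlines():
        match = _HEADER.match(line)
        if match:
            name, suffix = match.group(1), match.group(2) or ""
            in_children = suffix == ":children"
            line = f"[{GROUP_RENAMES.get(name, name)}{suffix}]"
        elif in_children and line.strip() in GROUP_RENAMES:
            line = GROUP_RENAMES[line.strip()]
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = "[kube-master]\nnode1\n\n[k8s-cluster:children]\nkube-master\nkube-node\n"
    print(rename_groups(sample))
```

Playbook `hosts:` references and group_vars filenames still need to be updated separately, as the commit message notes.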
The 'native' snapshotter was a workaround for old cri-tools issues
(NVIDIA#436, NVIDIA#710) that are long resolved. It causes 'no unpack platforms
defined' errors with containerd v2.x. Switch to 'overlayfs' which
is kubespray's default and works correctly on ext4/xfs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
@dholt dholt force-pushed the fix/setup-distutils-to-packaging branch 4 times, most recently from 2084865 to 9a51209 Compare February 19, 2026 22:02
- Add project-level .ansible-lint with profile:min and skip_list for
  pre-existing issues (fqcn, name casing, truthy, octal, etc.)
- Rewrite lint script to run from project root using project config
- Remove per-role .ansible-lint files (conflicted with v26 syntax)
- Molecule: drop Ubuntu 20.04 platforms (EOL), keep 22.04 only
- Molecule: use cgroupns_mode:host, remove command:/sbin/init and
  tmpfs that caused systemd temp dir failures on cgroup v2 hosts
- Molecule: add privileged:true where missing, remove max-parallel
  limit, set fail-fast:false, upgrade runner to ubuntu-24.04
- Add ANSIBLE_ROLES_PATH and passlib to molecule workflow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
@dholt dholt force-pushed the fix/setup-distutils-to-packaging branch 4 times, most recently from e8d5bca to f85512d Compare February 19, 2026 23:10
- spack: Replace gcc-7/gfortran-7 with unversioned gcc/gfortran
- Remove abims_sbr.singularity from requirements.yml (dead project)
- Molecule CI: Remove 5 roles that can't run in Docker containers:
  nis_client, rsyslog_client, rsyslog_server, slurm (need systemd
  services), singularity_wrapper (broken upstream Galaxy dep).
  These are all verified end-to-end on real MAAS VMs.
- Remaining 11 molecule roles all pass in CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
@dholt dholt force-pushed the fix/setup-distutils-to-packaging branch from f85512d to 87cf056 Compare February 19, 2026 23:16
@dholt dholt marked this pull request as ready for review February 19, 2026 23:42
@dholt dholt requested a review from michael-balint February 19, 2026 23:42
@michael-balint michael-balint merged commit f2ffb8b into NVIDIA:master Feb 19, 2026
16 checks passed