fix(cluster): resolve DNS failures on systemd-resolved hosts #478

Open
brianwtaylor wants to merge 1 commit into NVIDIA:main from brianwtaylor:fix/cluster-dns-systemd-resolved

Conversation

@brianwtaylor brianwtaylor commented Mar 19, 2026

@drew thank you for approving my vouch. I had this brewing as a small fix for the repo, and it looks like this change could help some folks. I'm doing a QA pass on this change and will add a visual diagram or two as well.

Summary

  • Mount host systemd-resolved config into the gateway container so the entrypoint can extract real upstream DNS resolvers
  • Bypass the broken cross-namespace DNAT proxy that silently fails on Ubuntu/systemd-resolved hosts (including DGX Spark)
  • Preserve existing DNAT and public DNS fallbacks for non-systemd environments

Closes #437
Related: #415

Changes

crates/openshell-bootstrap/src/docker.rs — Bind-mount /run/systemd/resolve/resolv.conf (read-only) into the container when the path exists on the host. Skipped on macOS/Windows where the path doesn't exist.

deploy/docker/cluster-entrypoint.sh — Add get_upstream_resolvers() that extracts non-loopback nameservers from the mounted systemd-resolved config (or falls back to /etc/resolv.conf). When upstream resolvers are found, write them directly to the k3s resolv.conf instead of attempting the DNAT proxy. Also improves DNS verification logging on failure.

Root Cause

Docker's embedded DNS at 127.0.0.11 is only reachable from the container's own network namespace. The existing DNAT rules forward to this loopback address, but k3s pods run in child network namespaces where the forwarded packets are dropped as martian packets. On systemd-resolved hosts, /etc/resolv.conf contains 127.0.0.53 (another loopback), so the fallback also fails silently.

Testing

  • Tested on DGX Spark (Ubuntu 24.04, systemd-resolved, Docker with cgroupns=host)
  • Verified DNS resolution works from k3s pods after the fix
  • Verified the mount is skipped on hosts without systemd-resolved

Automated Tests

cargo test -p openshell-bootstrap

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@brianwtaylor brianwtaylor requested a review from a team as a code owner March 19, 2026 17:39

github-actions bot commented Mar 19, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

Author

brianwtaylor commented Mar 19, 2026

I have read the DCO document and I hereby sign the DCO.

Comment on lines +538 to +544
if std::path::Path::new("/run/systemd/resolve/resolv.conf").exists() {
    b.push(
        "/run/systemd/resolve/resolv.conf:/run/systemd/resolve/resolv.conf:ro"
            .to_string(),
    );
}
b
Collaborator

@drew drew Mar 19, 2026

I'm just a little worried about mounting existing system files into the cluster container (this would be the first time we do it).

As an alternative approach, how about we sniff the resolvers from the bootstrap crate and then pass them into the cluster container as parameters?

Author

@brianwtaylor brianwtaylor Mar 19, 2026

Let me explore this. I have some QA going right now as well. Happy to chat to ensure this change aligns with your standards.

Update: great callout on extending the existing architecture.

Author

brianwtaylor commented Mar 19, 2026

@drew after your feedback, here is a breakdown of the DNS flow before and after, along with the validation topology.

DNS FLOW — BEFORE vs AFTER
──────────────────────────

BEFORE (broken on systemd-resolved hosts):

  Pod ──→ CoreDNS ──→ resolv.conf ──→ iptables DNAT ──→ Docker DNS
          (cache)      127.0.0.11       PREROUTING       127.0.0.11
                            │                                │
                            └──── FAILS ─────────────────────┘
                            loopback DNAT from pod namespace
                            dropped as martian packet

AFTER (PR #478):

  Pod ──→ CoreDNS ──→ resolv.conf ──→ upstream resolver ──→ response
          (cache)     e.g. 192.168.1.1    (direct UDP)
                       ▲
                       │
                  set by Rust bootstrap:
                  resolve_upstream_dns()
                  reads /run/systemd/resolve/resolv.conf
                  passes via UPSTREAM_DNS env var
                  entrypoint writes to k3s resolv.conf

NON-SYSTEMD HOSTS (macOS, WSL2, Alpine) — unchanged:

  Pod ──→ CoreDNS ──→ resolv.conf ──→ iptables DNAT ──→ Docker DNS ──→ host
          (cache)     container IP      PREROUTING       127.0.0.11

  /run/systemd/resolve/resolv.conf absent → UPSTREAM_DNS not set →
  entrypoint falls back to existing DNAT proxy path. Zero behavior change.
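
For reviewers who want the host-side decision in code form, here is a minimal Rust sketch of the AFTER and NON-SYSTEMD flows above. It is an illustration under assumptions, not the code in this PR: the function name `upstream_dns_env`, passing the file path as a parameter, and the comma-separated value format are all hypothetical, and the real `resolve_upstream_dns()` may differ.

```rust
use std::fs;
use std::net::IpAddr;
use std::path::Path;

/// Builds an `UPSTREAM_DNS=...` container env entry from a systemd-resolved
/// style resolv.conf, or returns `None` so the caller leaves the env var
/// unset and the entrypoint keeps its existing DNAT proxy path.
/// Hypothetical helper; the comma-separated value format is an assumption.
fn upstream_dns_env(resolv_conf: &Path) -> Option<String> {
    // A missing or unreadable file (macOS, WSL2, Alpine) means no env var.
    let contents = fs::read_to_string(resolv_conf).ok()?;

    let resolvers: Vec<String> = contents
        .lines()
        // Keep only `nameserver <addr>` lines.
        .filter_map(|line| line.trim().strip_prefix("nameserver"))
        // Entries that do not parse as a plain IP address are skipped.
        .filter_map(|rest| rest.trim().parse::<IpAddr>().ok())
        // Drop 127.x.x.x and ::1, which pods cannot reach.
        .filter(|addr| !addr.is_loopback())
        .map(|addr| addr.to_string())
        .collect();

    if resolvers.is_empty() {
        None
    } else {
        Some(format!("UPSTREAM_DNS={}", resolvers.join(",")))
    }
}
```

Calling `upstream_dns_env(Path::new("/run/systemd/resolve/resolv.conf"))` on a host without systemd-resolved returns `None`, which corresponds to the "zero behavior change" case in the diagram.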

I also did some hardware validation with these changes. Below is a breakdown of the validation topology I used.

OPENSHELL DNS FIX (PR #478) — VALIDATION TOPOLOGY
══════════════════════════════════════════════════

                  ┌─────────────────────────┐
                  │     Node A (Linux)      │
                  │    aarch64 · GPU        │
                  │                         │
                  │ BASELINE + ORCHESTRATOR │
                  │  (read-only, runs all   │
                  │   tests from here)      │
                  └─────┬──────────┬────────┘
                        │          │
          high-speed    │          │ LAN
          interconnect  │          │
                        │          ├──────────────┐
              ┌─────────▼──┐   ┌───▼──────────┐  ┌▼──────────────┐
              │  Node B    │   │  Node C      │  │  Node D       │
              │  Linux     │   │  macOS       │  │  Windows/WSL2 │
              │  aarch64   │   │  Apple Si    │  │  x86_64       │
              │            │   │  no systemd  │  │  no systemd   │
              │ TEST TARGET│   │              │  │               │
              │ (DNS-fixed │   │  CONTROL     │  │  CONTROL      │
              │  gateway   │   │  ✓ verified  │  │  ✓ verified   │
              │  deployed) │   │              │  │               │
              └────────────┘   └──────────────┘  └───────────────┘

WHAT EACH NODE PROVED
═════════════════════

Node A ─── "Does the fix break anything that already works?"
            Captured baseline iptables, TLS certs, and DNS state.
            All comparisons showed zero drift.

Node B ─── "Does the new code handle edge-case input safely?"

Node C ─── "Does the fix affect macOS hosts?"
            No systemd-resolved → no UPSTREAM_DNS set → no change.
            Existing DNAT proxy path untouched.

Node D ─── "Does the fix affect Windows/WSL2 hosts?"
            No systemd-resolved → no UPSTREAM_DNS set → no change.
            Existing DNAT proxy path untouched.

@brianwtaylor brianwtaylor force-pushed the fix/cluster-dns-systemd-resolved branch 2 times, most recently from 0079604 to 0218226 Compare March 19, 2026 20:16
Author

Sorry, I'm trying not to rush this fix.

Docker's embedded DNS at 127.0.0.11 is only reachable from the
container's own network namespace. k3s pods in child namespaces
cannot reach it, causing silent DNS failures on Ubuntu and other
systemd-resolved hosts where /etc/resolv.conf contains 127.0.0.53.

Sniff upstream DNS resolvers from the host in the Rust bootstrap
crate by reading /run/systemd/resolve/resolv.conf (systemd-resolved
only — intentionally does NOT read /etc/resolv.conf to avoid
bypassing Docker Desktop's DNAT proxy on macOS/Windows). Filter
loopback addresses (127.x.x.x and ::1) and pass the result to
the container as the UPSTREAM_DNS env var. Skip DNS sniffing for
remote deploys where the local host's resolvers would be wrong.

The entrypoint checks UPSTREAM_DNS first, falling back to
/etc/resolv.conf inside the container for manual launches. This
follows the existing pattern used by registry config, SSH gateway,
GPU support, and image tags.

Closes NVIDIA#437

Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>
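
The resolution order described in the commit message above can be modeled roughly as follows. The entrypoint itself is a shell script, so this Rust sketch only illustrates the precedence (UPSTREAM_DNS first, then the container's /etc/resolv.conf, then the existing DNAT proxy). The `DnsSource` enum, the `choose_dns_source` function, and the accepted separators for the env value are hypothetical names and assumptions, not part of the PR.

```rust
use std::net::IpAddr;

/// Where the k3s resolv.conf entries come from, in priority order.
#[derive(Debug, PartialEq)]
enum DnsSource {
    /// Resolvers passed in by the bootstrap crate via `UPSTREAM_DNS`.
    UpstreamEnv(Vec<IpAddr>),
    /// Non-loopback resolvers from the container's own /etc/resolv.conf.
    ContainerResolvConf(Vec<IpAddr>),
    /// Nothing usable was found: keep the existing DNAT proxy setup.
    DnatProxy,
}

/// Hypothetical model of the entrypoint's choice between DNS sources.
fn choose_dns_source(upstream_dns: Option<&str>, container_nameservers: &[IpAddr]) -> DnsSource {
    // Assumption: the env value is comma- or whitespace-separated.
    let from_env: Vec<IpAddr> = upstream_dns
        .unwrap_or("")
        .split(|c: char| c == ',' || c.is_whitespace())
        .filter_map(|s| s.parse::<IpAddr>().ok())
        .filter(|addr| !addr.is_loopback())
        .collect();
    if !from_env.is_empty() {
        return DnsSource::UpstreamEnv(from_env);
    }

    // Manual launches: fall back to what the container itself sees,
    // still ignoring loopback entries such as 127.0.0.11.
    let from_file: Vec<IpAddr> = container_nameservers
        .iter()
        .copied()
        .filter(|addr| !addr.is_loopback())
        .collect();
    if !from_file.is_empty() {
        return DnsSource::ContainerResolvConf(from_file);
    }

    DnsSource::DnatProxy
}
```

In this model an unset `UPSTREAM_DNS` combined with a container resolv.conf that only lists 127.0.0.11 yields `DnsSource::DnatProxy`, which matches the unchanged path for non-systemd hosts.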
@brianwtaylor brianwtaylor force-pushed the fix/cluster-dns-systemd-resolved branch from 0218226 to 106be9d Compare March 20, 2026 07:21
Development

Successfully merging this pull request may close these issues.

DNS proxy in cluster-entrypoint.sh fails silently on Linux with systemd-resolved

2 participants