Detect host CPU features for native code generation
Summary
Replace the hardcoded "generic" CPU target and empty feature string in the embedded Clang/LLVM compiler with runtime detection of the host CPU name and feature set using llvm::sys::getHostCPUName() and llvm::sys::getHostCPUFeatures().
Motivation
The CPU backend currently compiles all kernels for a "generic" x86-64 target. While SSE2 is part of the x86-64 baseline, everything beyond that is missing: no AVX, no AVX2, no FMA, etc. For AArch64, NEON is available (including FMA) but SVE extensions are left unused. This leaves performance on the table for many modern CPUs.
Benchmarking Newton examples on an AMD Ryzen 9800X3D showed that switching from "generic" to the native CPU target (while keeping -O2) produces 20-70% speedups across examples.
Current behavior
In warp/native/clang/clang.cpp, the CPU target and features are hardcoded:
const char* CPU = "generic";
const char* features = "";
The only exception is a conditional +f16c feature flag on x86-64, added via -target-feature arguments to the Clang frontend.
Proposed changes
- Host CPU detection: Use llvm::sys::getHostCPUName() to get the CPU model string (e.g. "znver5", "apple-m1") and llvm::sys::getHostCPUFeatures() to get the full feature map. Build a feature string from the map ("+avx2,+fma,-avx512f,..."). Both APIs are available in LLVM 18+ via #include <llvm/TargetParser/Host.h>.
- Apply to both frontend and backend: Pass the detected CPU name via -target-cpu to the Clang frontend (extending the existing pattern used for the +f16c flag). Pass the CPU name and feature string to createTargetMachine() for backend code generation, replacing the current "generic" and "".
- Kernel cache invalidation: Extend the module hash to include the CPU target name and feature string (or a hash thereof). This ensures that changing native_cpu_codegen, or moving the kernel cache to a different machine, invalidates stale cached kernels compiled for a different target.
- Remove the +f16c special case: The existing conditional +f16c flag becomes redundant when the full feature set is detected at runtime. It should be removed to avoid conflicts with the detected feature map.
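The feature-string and cache-key steps could look roughly like the sketch below. It is an illustration under assumptions, not the actual change to clang.cpp: FeatureMap is a std::map stand-in for the llvm::StringMap<bool> that getHostCPUFeatures() produces, so the snippet compiles without LLVM, and build_feature_string and target_cache_key are hypothetical helper names.

```cpp
#include <map>
#include <string>

// Stand-in (assumption) for llvm::StringMap<bool>. std::map iterates in
// sorted key order, so the resulting string is deterministic.
using FeatureMap = std::map<std::string, bool>;

// Join the map into the "+feat,-feat" form used for backend feature strings.
// Hypothetical helper, not an existing Warp or LLVM function.
std::string build_feature_string(const FeatureMap& features) {
    std::string out;
    for (const auto& entry : features) {
        if (!out.empty()) {
            out += ',';
        }
        out += entry.second ? '+' : '-';  // enabled features get '+', disabled '-'
        out += entry.first;
    }
    return out;
}

// Combine CPU name and feature string into one value that could be mixed
// into the module hash. Hypothetical helper; the separator is arbitrary.
std::string target_cache_key(const std::string& cpu, const FeatureMap& features) {
    return cpu + "|" + build_feature_string(features);
}
```

With a real llvm::StringMap, iteration order is not guaranteed, so sorting the feature names before joining them is probably worthwhile; otherwise the same host could yield different strings and needlessly invalidate the kernel cache.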
Testing
Functional correctness: run the existing test suite.
uv run --extra dev -m warp.tests -s autodetect
Performance: compare Newton benchmarks across generic vs native targets.
uv run -m newton.examples basic_shapes --device cpu --benchmark 10