[REQ] Detect host CPU features for native code generation #1308

@c0d1f1ed

Detect host CPU features for native code generation

Summary

Replace the hardcoded "generic" CPU target and empty feature string in the embedded Clang/LLVM compiler with runtime detection of the host CPU name and feature set using llvm::sys::getHostCPUName() and llvm::sys::getHostCPUFeatures().

Motivation

The CPU backend currently compiles all kernels for a "generic" x86-64 target. While SSE2 is part of the x86-64 baseline, everything beyond that is missing: no AVX, no AVX2, no FMA, etc. For AArch64, NEON is available (including FMA) but SVE extensions are left unused. This leaves performance on the table for many modern CPUs.

Benchmarking Newton examples on an AMD Ryzen 7 9800X3D showed that switching from "generic" to the native CPU target (while keeping -O2) produces speedups of 20-70%, depending on the example.

Current behavior

In warp/native/clang/clang.cpp, the CPU target and features are hardcoded:

const char* CPU = "generic";
const char* features = "";

The only exception is a conditional +f16c feature flag on x86-64, added via -target-feature arguments to the Clang frontend.

Proposed changes

  1. Host CPU detection: Use llvm::sys::getHostCPUName() to get the CPU model string (e.g. "znver5", "apple-m1") and llvm::sys::getHostCPUFeatures() to get the full feature map. Build a feature string from the map ("+avx2,+fma,-avx512f,..."). Both APIs are available in LLVM 18+ via #include <llvm/TargetParser/Host.h>.

  2. Apply to both frontend and backend: Pass the detected CPU name via -target-cpu to the Clang frontend (extending the existing pattern used for the +f16c flag). Pass the CPU name and feature string to createTargetMachine() for backend code generation, replacing the current "generic" and "".

  3. Kernel cache invalidation: Extend the module hash to include the CPU target name and feature string (or a hash thereof). This ensures that changing native_cpu_codegen, or moving the kernel cache to a different machine, invalidates stale cached kernels compiled for a different target.

  4. Remove the +f16c special case: The existing conditional +f16c flag becomes redundant when the full feature set is detected at runtime. It should be removed to avoid conflicts with the detected feature map.
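A rough sketch of the string handling behind items 1-3, kept free of LLVM headers so it compiles on its own. Everything named here is hypothetical and not taken from clang.cpp: `FeatureMap` is a `std::map` standing in for the `llvm::StringMap<bool>` that llvm::sys::getHostCPUFeatures() fills, and `build_feature_string`, `build_frontend_args` and `target_cache_key` are illustrative helpers. The FNV-1a hash is only a placeholder for whatever the module hash actually uses. The exact signature of getHostCPUFeatures() has changed between LLVM releases, so check Host.h for the version in use.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Stand-in for the llvm::StringMap<bool> produced by getHostCPUFeatures().
// std::map iterates in sorted key order, so the output is deterministic,
// which matters if the string is fed into a cache hash.
using FeatureMap = std::map<std::string, bool>;

// Item 1: join the map into the "+avx2,-avx512f,..." form that
// createTargetMachine() takes as its feature string.
std::string build_feature_string(const FeatureMap& features)
{
    std::string out;
    for (const auto& kv : features) {
        if (!out.empty())
            out += ',';
        out += (kv.second ? '+' : '-');
        out += kv.first;
    }
    return out;
}

// Item 2: frontend arguments. One -target-cpu pair, followed by one
// -target-feature pair per detected feature.
std::vector<std::string> build_frontend_args(const std::string& cpu,
                                             const FeatureMap& features)
{
    std::vector<std::string> args = {"-target-cpu", cpu};
    for (const auto& kv : features) {
        args.push_back("-target-feature");
        args.push_back((kv.second ? "+" : "-") + kv.first);
    }
    return args;
}

// Item 3: a value that could be mixed into the module hash. 64-bit FNV-1a
// over "<cpu>\n<features>"; the separator keeps ("ab", "c") and ("a", "bc")
// from hashing the same input bytes.
std::uint64_t target_cache_key(const std::string& cpu,
                               const std::string& feature_string)
{
    std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
    const std::string input = cpu + '\n' + feature_string;
    for (unsigned char c : input) {
        h ^= c;
        h *= 0x100000001b3ULL;  // FNV prime
    }
    return h;
}
```

With this shape, the same feature map drives both the frontend arguments and the backend feature string, so the two cannot drift apart, and the cache key changes whenever either the CPU name or any single feature flag changes.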

Testing

Functional correctness: run the existing test suite.

uv run --extra dev -m warp.tests -s autodetect

Performance: compare Newton benchmarks across generic vs native targets.

uv run -m newton.examples basic_shapes --device cpu --benchmark 10

Labels

feature request: Request for something to be added
