[REQ] Detect host CPU features for native code generation #1308

# Detect host CPU features for native code generation

## Summary

Replace the hardcoded `"generic"` CPU target and empty feature string in the embedded Clang/LLVM compiler with runtime detection of the host CPU name and feature set using `llvm::sys::getHostCPUName()` and `llvm::sys::getHostCPUFeatures()`.

## Motivation

The CPU backend currently compiles all kernels for the `"generic"` CPU target. On x86-64, this limits code generation to the SSE2 baseline: no AVX, no AVX2, no FMA, etc. On AArch64, NEON is available (including FMA), but SVE extensions are left unused. This leaves performance on the table for many modern CPUs.

Benchmarking Newton examples on an AMD Ryzen 7 9800X3D showed that switching from `"generic"` to the native CPU target (while keeping `-O2`) produces 20-70% speedups across examples.

## Current behavior

In `warp/native/clang/clang.cpp`, the CPU target and features are hardcoded:

```cpp
const char* CPU = "generic";
const char* features = "";
```

The only exception is a conditional `+f16c` feature flag on x86-64, added via `-target-feature` arguments to the Clang frontend.

## Proposed changes

1. **Host CPU detection**: Use `llvm::sys::getHostCPUName()` to get the CPU model string (e.g. `"znver5"`, `"apple-m1"`) and `llvm::sys::getHostCPUFeatures()` to get the full feature map. Build a feature string from the map (`"+avx2,+fma,-avx512f,..."`). Both APIs are available in LLVM 18+ via `#include <llvm/TargetParser/Host.h>`.

2. **Apply to both frontend and backend**: Pass the detected CPU name via `-target-cpu` to the Clang frontend (extending the existing pattern used for the `+f16c` flag). Pass the CPU name and feature string to `createTargetMachine()` for backend code generation, replacing the current `"generic"` and `""`.

3. **Kernel cache invalidation**: Extend the module hash to include the CPU target name and feature string (or a hash thereof). This ensures that changing `native_cpu_codegen`, or moving the kernel cache to a different machine, invalidates stale cached kernels compiled for a different target.

4. **Remove the `+f16c` special case**: The existing conditional `+f16c` flag becomes redundant when the full feature set is detected at runtime. It should be removed to avoid conflicts with the detected feature map.
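
Steps 1 and 3 above can be sketched without LLVM as follows. This is a minimal illustration under stated assumptions, not the actual Warp implementation: `FeatureMap`, `build_feature_string`, and `target_hash` are hypothetical names, `std::map` stands in for the `llvm::StringMap<bool>` produced by `llvm::sys::getHostCPUFeatures()`, and FNV-1a is only one possible choice for folding the target description into the module hash.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for the llvm::StringMap<bool> feature map.
// std::map iterates in sorted key order, so the output is deterministic.
using FeatureMap = std::map<std::string, bool>;

// Build an LLVM-style feature string such as "+avx2,-avx512f,+fma":
// enabled features get a '+' prefix, disabled ones a '-' prefix.
std::string build_feature_string(const FeatureMap& features)
{
    std::string out;
    for (const auto& kv : features)
    {
        if (!out.empty())
            out += ',';
        out += kv.second ? '+' : '-';
        out += kv.first;
    }
    return out;
}

// 64-bit FNV-1a over the CPU name and the feature string. The result could
// be mixed into the module hash so that kernels cached for a different
// target are not reused.
std::uint64_t target_hash(const std::string& cpu, const std::string& features)
{
    assert(!cpu.empty());  // assumption: a CPU name is always available

    std::uint64_t h = 14695981039346656037ull;   // FNV offset basis
    const std::uint64_t prime = 1099511628211ull; // FNV prime

    auto mix = [&h, prime](const std::string& s) {
        for (unsigned char c : s)
        {
            h ^= c;
            h *= prime;
        }
        // Separator byte so that ("ab", "c") and ("a", "bc") hash differently.
        h ^= 0xff;
        h *= prime;
    };

    mix(cpu);
    mix(features);
    return h;
}
```

One design point worth noting: `llvm::StringMap` is a hash table and does not iterate in sorted order, so a real implementation would likely need to sort the feature names before building the string. Otherwise the same host could yield differently ordered strings and spuriously invalidate the kernel cache.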

## Testing

Functional correctness: run the existing test suite.

```bash
uv run --extra dev -m warp.tests -s autodetect
```

Performance: compare Newton benchmarks across generic vs native targets.

```bash
uv run -m newton.examples basic_shapes --device cpu --benchmark 10
```
