feat: Unified HPC Toolchain — merge Nexa_Vortex runtime + Nexa_Inference apps + CUTLASS kernels into PyC#12

Merged
DarkStarStrix merged 2 commits into main from feature/unified-hpc-toolchain
Mar 2, 2026

DarkStarStrix commented Mar 2, 2026

Summary

This PR evolves PyC into a vertically integrated HPC toolchain by merging the Nexa_Vortex runtime and the Nexa_Inference application layer directly into this repository, and by adding a new CUTLASS-backed kernel library.


Architecture

[apps/inference_api/]   ← Nexa_Inference: FastAPI SciML inference server
        ↓
[compiler/]             ← PyC: IR, pass manager, optimizer policy
[compiler/cutlass_kernels/] ← NEW: CUTLASS GEMM / Conv2d / Attention kernels
        ↓
[runtime/vortex_core/]  ← Nexa_Vortex: async Rust execution engine
  - Async conveyor belt pipeline (eliminates GPU starvation)
  - Lock-free crossbeam dispatcher (up to 15x lower tail latency)
  - NUMA-aware pinned allocator (~2x H2D bandwidth)
  - Hardware topology profiler
  - C-ABI FFI bridge to PyC compiler (auto-generated via bindgen)
        ↓
[Hardware]              ← GPU (CUDA/CUTLASS Tensor Cores), CPU, NVLink

Files Added / Changed

Runtime (from Nexa_Vortex)

| File | Description |
| --- | --- |
| runtime/vortex_core/src/pipeline.rs | Async conveyor belt: overlaps CPU prep, H2D transfer, and GPU compute |
| runtime/vortex_core/src/cpu_dispatch.rs | Lock-free crossbeam worker pool |
| runtime/vortex_core/src/allocator.rs | NUMA-local pinned memory allocator |
| runtime/vortex_core/src/hw_profile.rs | Hardware topology detection |
| runtime/vortex_core/src/ffi/mod.rs | Safe Rust wrappers for PyC C-ABI |
| runtime/vortex_core/build.rs | bindgen: auto-generates FFI from include/pyc/ headers |
| runtime/vortex_core/src/integrations/ | Mesocarp lock-free primitive integrations |
| python/pyc/runtime/control_plane.py | Python control plane (from Nexa_Vortex) |
| python/pyc/runtime/telemetry_manager.py | Python telemetry manager (from Nexa_Vortex) |
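
The conveyor belt idea in pipeline.rs (CPU prep, H2D transfer, and GPU compute running as overlapping stages) can be illustrated with a small Python sketch. This is not the Rust implementation from this PR: the function names, the `depth` parameter, and the use of threads with bounded queues are all assumptions made for illustration, and error handling inside the stages is omitted.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a stage to shut down


def _stage(fn, inbox, outbox):
    """Apply fn to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # propagate shutdown to the next stage
            return
        outbox.put(fn(item))


def run_conveyor(batches, prep, transfer, compute, depth=2):
    """Push batches through prep -> transfer -> compute (hypothetical API).

    Each stage runs in its own thread, so batch i can be in `compute` while
    batch i+1 is in `transfer`. `depth` bounds the items queued between stages.
    """
    q_in, q_prep, q_xfer, q_out = (queue.Queue(maxsize=depth) for _ in range(4))
    stages = [(prep, q_in, q_prep), (transfer, q_prep, q_xfer), (compute, q_xfer, q_out)]
    threads = [threading.Thread(target=_stage, args=s, daemon=True) for s in stages]

    def feed():
        # Feeding from a separate thread keeps the caller free to drain q_out.
        for batch in batches:
            q_in.put(batch)
        q_in.put(_STOP)

    threads.append(threading.Thread(target=feed, daemon=True))
    for t in threads:
        t.start()

    results = []
    while True:
        out = q_out.get()
        if out is _STOP:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

Because every stage is a single FIFO worker, results come back in submission order. In CPython the stages only overlap in practice when the stage functions release the GIL (I/O, native calls), so the sketch shows the structure rather than the performance.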

CUTLASS Kernels (new)

| File | Description |
| --- | --- |
| compiler/cutlass_kernels/cutlass_gemm.cu | FP16/BF16 Tensor Core + FP32 SIMT GEMM |
| compiler/cutlass_kernels/cutlass_conv2d.cu | FP16/BF16 Tensor Core Conv2d |
| compiler/cutlass_kernels/cutlass_attention.cu | FP16/BF16 fused QKV attention |
| compiler/cutlass_kernels/cutlass_registry_init.cu | Auto-registers all kernels at .so load time |
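
The "auto-register at load time" pattern used by cutlass_registry_init.cu can be sketched in Python, where import time plays the role of .so load time. Everything below is hypothetical: the registry layout, the `(op, dtype)` key, and the placeholder kernels are assumptions and do not reflect the actual C/CUDA registry API.

```python
_KERNELS = {}  # maps (op, dtype) -> callable; hypothetical layout


def register_kernel(op, dtype):
    """Decorator that records a kernel under (op, dtype) when the module loads."""
    def wrap(fn):
        _KERNELS[(op, dtype)] = fn
        return fn
    return wrap


def lookup_kernel(op, dtype):
    """Return the registered kernel for (op, dtype), or None if there is none."""
    return _KERNELS.get((op, dtype))


def registered():
    """Return the registered (op, dtype) keys in sorted order."""
    return sorted(_KERNELS)


# Registration happens as a side effect of defining the functions, which is
# the Python analogue of static initializers running when a library is loaded.
@register_kernel("gemm", "fp16")
def gemm_fp16(a, b):
    return "tensor-core gemm placeholder"


@register_kernel("gemm", "fp32")
def gemm_fp32(a, b):
    return "simt gemm placeholder"
```

A caller never registers kernels explicitly; it only calls `lookup_kernel("gemm", "fp16")` and handles a `None` result however its own policy requires.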

Application Layer (from Nexa_Inference)

| File | Description |
| --- | --- |
| apps/inference_api/src/main.py | FastAPI server entry point |
| apps/inference_api/src/inference.py | Core inference logic |
| apps/inference_api/src/engines.py | Engine dispatch |
| apps/inference_api/src/pipelines.py | SciML pipeline definitions |
| apps/inference_api/models/schemas.py | Pydantic request/response schemas |
| apps/inference_api/src/auth.py | Authentication |
| apps/inference_api/src/config.py | Configuration |

Python SDK

| File | Description |
| --- | --- |
| python/pyc/__init__.py | Top-level API: pyc.init(), pyc.detect_hardware() |
| python/pyc/compiler/__init__.py | ctypes wrapper for compiler C-ABI + kernel policy |
| python/pyc/runtime/hw_profile.py | Hardware detection (native + pure-Python fallback) |
| python/pyc/runtime/pipeline.py | Pipeline wrapper (PyO3 + pure-Python stub) |
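
The "native + pure-Python fallback" approach listed for hw_profile.py can be sketched as follows. This is an illustration only: the module name `_pyc_native_hypothetical`, the function `detect_hardware_sketch`, and the shape of the returned dictionary are assumptions, not the SDK's actual API.

```python
import importlib
import os
import platform


def _detect_pure_python():
    """Fallback path that relies only on the standard library."""
    return {
        "backend": "pure-python",
        "cpu_count": os.cpu_count() or 1,
        "machine": platform.machine(),
        "system": platform.system(),
    }


def detect_hardware_sketch(native_module="_pyc_native_hypothetical"):
    """Prefer a compiled extension; fall back when it cannot be imported.

    `native_module` is a made-up name; a real extension would be expected to
    expose its own detection function (assumed here to be `detect_hardware`).
    """
    try:
        native = importlib.import_module(native_module)
    except ImportError:
        return _detect_pure_python()
    info = dict(native.detect_hardware())
    info["backend"] = "native"
    return info
```

The fallback keeps the package importable on machines where the Rust extension was not built, at the cost of coarser information.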

Build System

| File | Description |
| --- | --- |
| pyproject.toml | Maturin-based Python package definition |
| runtime/CMakeLists.txt | ExternalProject Cargo build integration |
| compiler/CMakeLists.txt | Updated with CUTLASS kernel target |
| include/pyc/cuda_backend.h | New public header for CUDA dispatch C-ABI |
| scripts/migrate_sources.sh | Helper to pull remaining C sources from original repos |

How to Build

# 1. Configure
cmake -B build -DPYC_BUILD_CUDA=ON

# 2. Build C/CUDA/Rust
cmake --build build -j$(nproc)

# 3. Install Python package
export PYC_COMPILER_LIB_DIR=$(pwd)/build/compiler
maturin develop --features python_ext

# 4. Test
pytest tests/
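
Step 3 exports PYC_COMPILER_LIB_DIR before running maturin, so a quick pre-flight check that the directory exists and contains a built library can save a failed install. The PR does not say how the variable is consumed, so the helper below is a generic sanity check with hypothetical names and an assumed list of library suffixes.

```python
import os
from pathlib import Path

# Assumed suffixes for shared and static libraries across platforms.
LIB_SUFFIXES = {".so", ".dylib", ".dll", ".a", ".lib"}


def list_libraries(lib_dir):
    """Return sorted library file names found directly inside lib_dir."""
    root = Path(lib_dir)
    if not root.is_dir():
        return []
    return sorted(
        p.name for p in root.iterdir() if p.is_file() and p.suffix in LIB_SUFFIXES
    )


def check_compiler_lib_dir(env=None):
    """Return None when the directory looks usable, else a short problem message."""
    env = os.environ if env is None else env
    lib_dir = env.get("PYC_COMPILER_LIB_DIR")
    if not lib_dir:
        return "PYC_COMPILER_LIB_DIR is not set"
    if not list_libraries(lib_dir):
        return f"no libraries found in {lib_dir}"
    return None
```

Running `check_compiler_lib_dir()` between steps 2 and 3 reports a missing or empty build directory before maturin is invoked.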

Checklist

  • Runtime sources stripped from Nexa_Vortex and integrated
  • Inference app sources stripped from Nexa_Inference and integrated
  • CUTLASS GEMM / Conv2d / Attention kernels added
  • C-ABI FFI bridge implemented (Rust ↔ C compiler)
  • Python SDK updated with unified pyc package
  • Build system updated (CMake + Cargo + Maturin)
  • Full CI/CD pipeline update (follow-up PR)
  • Integration tests for CUTLASS kernels (follow-up PR)

Manus AI added 2 commits March 2, 2026 09:52
…C toolchain

This PR integrates three previously separate projects into a single,
vertically integrated HPC toolchain under the PyC banner.

## What's included

### Runtime layer (from Nexa_Vortex)
- runtime/vortex_core/: Rust async execution engine
  - Asynchronous CPU→GPU conveyor belt pipeline (pipeline.rs)
  - Lock-free crossbeam-channel dispatcher (cpu_dispatch.rs)
  - NUMA-aware pinned memory allocator (allocator.rs)
  - Hardware topology profiler (hw_profile.rs)
  - Telemetry broadcaster (telemetry.rs)
  - Safe C-ABI FFI wrappers for the PyC compiler (ffi/mod.rs)
  - build.rs: auto-generates Rust bindings from PyC headers via bindgen
  - Mesocarp lock-free primitive integrations (integrations/)
- python/pyc/runtime/: Python wrappers (control_plane, telemetry_manager)

### CUTLASS kernel library (new)
- compiler/cutlass_kernels/: High-performance GPU kernels
  - cutlass_gemm.cu: FP16/BF16 Tensor Core + FP32 SIMT GEMM
  - cutlass_conv2d.cu: FP16/BF16 Tensor Core Conv2d
  - cutlass_attention.cu: FP16/BF16 fused attention
  - cutlass_registry_init.cu: auto-registers all kernels at library load

### Application layer (from Nexa_Inference)
- apps/inference_api/: FastAPI SciML inference server
  - main.py, inference.py, engines.py, pipelines.py
  - models/schemas.py, auth.py, config.py

### Python SDK
- python/pyc/: Unified Python package
  - pyc.compiler: ctypes wrapper for the C compiler ABI + kernel policy
  - pyc.runtime: PyO3/pure-Python pipeline and hardware detection
  - pyc.__init__: top-level convenience API (pyc.init(), pyc.detect_hardware())

### Build system
- pyproject.toml: Maturin-based Python package (replaces old pyproject.toml)
- runtime/CMakeLists.txt: ExternalProject Cargo build integration
- compiler/CMakeLists.txt: updated with CUTLASS kernel target
- include/pyc/cuda_backend.h: new public header for CUDA dispatch ABI
- scripts/migrate_sources.sh: helper to pull sources from original repos

## Architecture

  [Nexa_Inference API]  ← apps/inference_api/
          ↓
  [PyC Compiler]        ← compiler/ (IR, passes, CUTLASS kernels)
          ↓
  [Vortex Runtime]      ← runtime/vortex_core/ (async, NUMA, telemetry)
          ↓
  [Hardware]            ← GPU (CUDA/CUTLASS), CPU, NVLink

## Next steps
- Run scripts/migrate_sources.sh to pull remaining C sources from PyC
- cmake -B build -DPYC_BUILD_CUDA=ON && cmake --build build
- maturin develop --features python_ext
- pytest tests/

The new cuda_backend.h introduced in the merger PR defined a
pyc_cuda_dispatch_trace struct with different field names than what
the existing cuda_backend.c and compiler_api.c actually use,
causing 20+ compile errors across all three CI platforms.

Changes:
- Replace incorrect struct fields (kernel_symbol, used_tensor_cores,
  fallback_reason, etc.) with the real fields from the original PyC
  codebase (cuda_requested, cuda_available, fallback_to_cpu, reason)
- Add #define PYC_CUDA_REASON_MAX 128 (was missing, caused undeclared
  identifier error in cuda_backend.c:544)
- Change include from 'pyc/ir.h' + 'pyc/kernel_registry.h' to
  'pyc/compiler_api.h' which transitively provides pyc_tensor,
  pyc_ir_module, and all other required types (fixes 'unknown type
  name pyc_tensor' errors in cuda_backend.h:53,55,70,72)
- Add void pyc_cuda_dispatch_trace_init() declaration (was called in
  compiler_api.c:1036 and :1279 but not declared in the header)
- Fix pyc_cuda_dispatch() return type to pyc_cuda_dispatch_status
  (was int) and align parameter names with the real implementation
- Fix PYC_CUDA_DISPATCH_ERROR value from -1 to 2 to match the enum

Fixes CI on Ubuntu, macOS, and Windows.
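
The corrected header layout described above can be mirrored from Python with ctypes, which is a convenient way to see why field names and constants have to agree on both sides of the ABI. The field names, PYC_CUDA_REASON_MAX = 128, and the error value 2 come from the commit message; the C field types, the field order, and the other two enum members are assumptions, and the real struct may contain more fields.

```python
import ctypes
import enum

PYC_CUDA_REASON_MAX = 128  # value stated in the fix commit


class PycCudaDispatchTrace(ctypes.Structure):
    """Hypothetical mirror of pyc_cuda_dispatch_trace.

    Field names are taken from the commit message; the int types and the
    ordering are ASSUMED and must be checked against cuda_backend.h.
    """
    _fields_ = [
        ("cuda_requested", ctypes.c_int),
        ("cuda_available", ctypes.c_int),
        ("fallback_to_cpu", ctypes.c_int),
        ("reason", ctypes.c_char * PYC_CUDA_REASON_MAX),
    ]


class PycCudaDispatchStatus(enum.IntEnum):
    """Hypothetical mirror of pyc_cuda_dispatch_status."""
    OK = 0        # assumed value
    FALLBACK = 1  # assumed value
    ERROR = 2     # stated: PYC_CUDA_DISPATCH_ERROR changed from -1 to 2


# Example trace for a request that fell back to the CPU.
trace = PycCudaDispatchTrace(
    cuda_requested=1,
    cuda_available=0,
    fallback_to_cpu=1,
    reason=b"no CUDA device",
)
```

A mirror like this is only safe to pass across the FFI boundary once it matches the header exactly, which is the same class of mismatch this commit fixes on the C side.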

DarkStarStrix added the enhancement label Mar 2, 2026
DarkStarStrix merged commit 87d3596 into main Mar 2, 2026
6 checks passed
DarkStarStrix deleted the feature/unified-hpc-toolchain branch March 4, 2026 15:22