feat: Unified HPC Toolchain — merge Nexa_Vortex runtime + Nexa_Inference apps + CUTLASS kernels into PyC by DarkStarStrix · Pull Request #12 · DarkStarStrix/PyC

DarkStarStrix · 2026-03-02T14:52:55Z

Summary

This PR evolves PyC into a vertically integrated HPC toolchain by merging the best components from Nexa_Vortex and Nexa_Inference directly into this repository, and adding a new CUTLASS-backed kernel library.

Architecture

[apps/inference_api/]   ← Nexa_Inference: FastAPI SciML inference server
        ↓
[compiler/]             ← PyC: IR, pass manager, optimizer policy
[compiler/cutlass_kernels/] ← NEW: CUTLASS GEMM / Conv2d / Attention kernels
        ↓
[runtime/vortex_core/]  ← Nexa_Vortex: async Rust execution engine
  - Async conveyor belt pipeline (eliminates GPU starvation)
  - Lock-free crossbeam dispatcher (up to 15x lower tail latency)
  - NUMA-aware pinned allocator (~2x H2D bandwidth)
  - Hardware topology profiler
  - C-ABI FFI bridge to PyC compiler (auto-generated via bindgen)
        ↓
[Hardware]              ← GPU (CUDA/CUTLASS Tensor Cores), CPU, NVLink
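
The conveyor belt idea in the runtime layer (overlapping CPU prep, H2D transfer, and GPU compute) can be illustrated with a small sketch. This is not the Rust implementation in `pipeline.rs`; it is a pure-Python analogy using threads and bounded queues. The names `run_conveyor`, `cpu_prep`, `h2d_transfer`, and `gpu_compute` are hypothetical stand-ins, and the "GPU" stage here is just a sum.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def _stage(fn, inbox, outbox):
    """Apply fn to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(fn(item))


def run_conveyor(batches, stages, depth=2):
    """Push batches through the stages, one thread per stage.

    Bounded queues (maxsize=depth) apply back-pressure, so a fast stage
    cannot run arbitrarily far ahead of a slow one.
    """
    queues = [queue.Queue(maxsize=depth) for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=_stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for worker in workers:
        worker.start()

    def feed():
        for batch in batches:
            queues[0].put(batch)
        queues[0].put(_DONE)

    feeder = threading.Thread(target=feed)
    feeder.start()

    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)

    feeder.join()
    for worker in workers:
        worker.join()
    return results


# Hypothetical stand-ins for the three overlapped phases.
def cpu_prep(batch):
    return [x * 2 for x in batch]


def h2d_transfer(batch):
    return tuple(batch)  # pretend the data now lives on the device


def gpu_compute(batch):
    return sum(batch)
```

Because each stage runs in its own thread, batch N+1 can be in `cpu_prep` while batch N is in `h2d_transfer`. Each stage is a single FIFO worker, so results come out in submission order.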

Files Added / Changed

Runtime (from Nexa_Vortex)

File	Description
`runtime/vortex_core/src/pipeline.rs`	Async conveyor belt — overlaps CPU prep, H2D transfer, GPU compute
`runtime/vortex_core/src/cpu_dispatch.rs`	Lock-free crossbeam worker pool
`runtime/vortex_core/src/allocator.rs`	NUMA-local pinned memory allocator
`runtime/vortex_core/src/hw_profile.rs`	Hardware topology detection
`runtime/vortex_core/src/ffi/mod.rs`	Safe Rust wrappers for PyC C-ABI
`runtime/vortex_core/build.rs`	bindgen: auto-generates FFI from `include/pyc/` headers
`runtime/vortex_core/src/integrations/`	Mesocarp lock-free primitive integrations
`python/pyc/runtime/control_plane.py`	Python control plane (from Nexa_Vortex)
`python/pyc/runtime/telemetry_manager.py`	Python telemetry manager (from Nexa_Vortex)

CUTLASS Kernels (new)

File	Description
`compiler/cutlass_kernels/cutlass_gemm.cu`	FP16/BF16 Tensor Core + FP32 SIMT GEMM
`compiler/cutlass_kernels/cutlass_conv2d.cu`	FP16/BF16 Tensor Core Conv2d
`compiler/cutlass_kernels/cutlass_attention.cu`	FP16/BF16 fused QKV attention
`compiler/cutlass_kernels/cutlass_registry_init.cu`	Auto-registers all kernels at `.so` load time

Application Layer (from Nexa_Inference)

File	Description
`apps/inference_api/src/main.py`	FastAPI server entry point
`apps/inference_api/src/inference.py`	Core inference logic
`apps/inference_api/src/engines.py`	Engine dispatch
`apps/inference_api/src/pipelines.py`	SciML pipeline definitions
`apps/inference_api/models/schemas.py`	Pydantic request/response schemas
`apps/inference_api/src/auth.py`	Authentication
`apps/inference_api/src/config.py`	Configuration

Python SDK

File	Description
`python/pyc/__init__.py`	Top-level API: `pyc.init()`, `pyc.detect_hardware()`
`python/pyc/compiler/__init__.py`	ctypes wrapper for compiler C-ABI + kernel policy
`python/pyc/runtime/hw_profile.py`	Hardware detection (native + pure-Python fallback)
`python/pyc/runtime/pipeline.py`	Pipeline wrapper (PyO3 + pure-Python stub)
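
The "native + pure-Python fallback" approach mentioned for `hw_profile.py` might look roughly like the sketch below. The module name `_pyc_native_hypothetical` and the function `detect_hardware_sketch` are invented for illustration and are not the SDK's actual names or return format; the fallback uses only the standard library.

```python
import importlib
import os
import platform


def _load_native(module_name="_pyc_native_hypothetical"):
    """Try to import a compiled extension; return None if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def detect_hardware_sketch():
    """Describe the host, preferring a native probe when one is installed."""
    native = _load_native()
    if native is not None and hasattr(native, "detect_hardware"):
        return {"source": "native", **native.detect_hardware()}
    # Pure-Python fallback: standard library only, no GPU information.
    return {
        "source": "python",
        "cpu_count": os.cpu_count() or 1,
        "machine": platform.machine(),
        "system": platform.system(),
    }
```

The `source` key makes it visible to callers which path produced the result, which helps when debugging a build where the extension failed to install.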

Build System

File	Description
`pyproject.toml`	Maturin-based Python package definition
`runtime/CMakeLists.txt`	ExternalProject Cargo build integration
`compiler/CMakeLists.txt`	Updated with CUTLASS kernel target
`include/pyc/cuda_backend.h`	New public header for CUDA dispatch C-ABI
`scripts/migrate_sources.sh`	Helper to pull remaining C sources from original repos

How to Build

# 1. Configure
cmake -B build -DPYC_BUILD_CUDA=ON

# 2. Build C/CUDA/Rust
cmake --build build -j$(nproc)

# 3. Install Python package
export PYC_COMPILER_LIB_DIR=$(pwd)/build/compiler
maturin develop --features python_ext

# 4. Test
pytest tests/

Checklist

Runtime sources stripped from Nexa_Vortex and integrated
Inference app sources stripped from Nexa_Inference and integrated
CUTLASS GEMM / Conv2d / Attention kernels added
C-ABI FFI bridge implemented (Rust ↔ C compiler)
Python SDK updated with unified pyc package
Build system updated (CMake + Cargo + Maturin)
Full CI/CD pipeline update (follow-up PR)
Integration tests for CUTLASS kernels (follow-up PR)

…C toolchain

This PR integrates three previously separate projects into a single, vertically integrated HPC toolchain under the PyC banner.

## What's included

### Runtime layer (from Nexa_Vortex)

- runtime/vortex_core/: Rust async execution engine
  - Asynchronous CPU→GPU conveyor belt pipeline (pipeline.rs)
  - Lock-free crossbeam-channel dispatcher (cpu_dispatch.rs)
  - NUMA-aware pinned memory allocator (allocator.rs)
  - Hardware topology profiler (hw_profile.rs)
  - Telemetry broadcaster (telemetry.rs)
  - Safe C-ABI FFI wrappers for the PyC compiler (ffi/mod.rs)
  - build.rs: auto-generates Rust bindings from PyC headers via bindgen
  - Mesocarp lock-free primitive integrations (integrations/)
- python/pyc/runtime/: Python wrappers (control_plane, telemetry_manager)

### CUTLASS kernel library (new)

- compiler/cutlass_kernels/: High-performance GPU kernels
  - cutlass_gemm.cu: FP16/BF16 Tensor Core + FP32 SIMT GEMM
  - cutlass_conv2d.cu: FP16/BF16 Tensor Core Conv2d
  - cutlass_attention.cu: FP16/BF16 fused attention
  - cutlass_registry_init.cu: auto-registers all kernels at library load

### Application layer (from Nexa_Inference)

- apps/inference_api/: FastAPI SciML inference server
  - main.py, inference.py, engines.py, pipelines.py
  - models/schemas.py, auth.py, config.py

### Python SDK

- python/pyc/: Unified Python package
  - pyc.compiler: ctypes wrapper for the C compiler ABI + kernel policy
  - pyc.runtime: PyO3/pure-Python pipeline and hardware detection
  - pyc.__init__: top-level convenience API (pyc.init(), pyc.detect_hardware())

### Build system

- pyproject.toml: Maturin-based Python package (replaces old pyproject.toml)
- runtime/CMakeLists.txt: ExternalProject Cargo build integration
- compiler/CMakeLists.txt: updated with CUTLASS kernel target
- include/pyc/cuda_backend.h: new public header for CUDA dispatch ABI
- scripts/migrate_sources.sh: helper to pull sources from original repos

## Architecture

[Nexa_Inference API] ← apps/inference_api/
        ↓
[PyC Compiler] ← compiler/ (IR, passes, CUTLASS kernels)
        ↓
[Vortex Runtime] ← runtime/vortex_core/ (async, NUMA, telemetry)
        ↓
[Hardware] ← GPU (CUDA/CUTLASS), CPU, NVLink

## Next steps

- Run scripts/migrate_sources.sh to pull remaining C sources from PyC
- cmake -B build -DPYC_BUILD_CUDA=ON && cmake --build build
- maturin develop --features python_ext
- pytest tests/

The new cuda_backend.h introduced in the merger PR defined a pyc_cuda_dispatch_trace struct with different field names than what the existing cuda_backend.c and compiler_api.c actually use, causing 20+ compile errors across all three CI platforms.

Changes:

- Replace incorrect struct fields (kernel_symbol, used_tensor_cores, fallback_reason, etc.) with the real fields from the original PyC codebase (cuda_requested, cuda_available, fallback_to_cpu, reason)
- Add #define PYC_CUDA_REASON_MAX 128 (was missing, caused undeclared identifier error in cuda_backend.c:544)
- Change include from 'pyc/ir.h' + 'pyc/kernel_registry.h' to 'pyc/compiler_api.h', which transitively provides pyc_tensor, pyc_ir_module, and all other required types (fixes 'unknown type name pyc_tensor' errors in cuda_backend.h:53,55,70,72)
- Add void pyc_cuda_dispatch_trace_init() declaration (was called in compiler_api.c:1036 and :1279 but not declared in the header)
- Fix pyc_cuda_dispatch() return type to pyc_cuda_dispatch_status (was int) and align parameter names with the real implementation
- Fix PYC_CUDA_DISPATCH_ERROR value from -1 to 2 to match the enum

Fixes CI on Ubuntu, macOS, and Windows.
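
For readers consuming the corrected header from Python, a ctypes mirror of the trace struct could look like the sketch below. The field names and `PYC_CUDA_REASON_MAX = 128` come from the commit message above; the field order, the use of `int` for the three flags, and `reason` being a fixed-size `char` array are assumptions that would need to be checked against `include/pyc/cuda_backend.h`. The class name `PycCudaDispatchTrace` and the helper `describe` are hypothetical.

```python
import ctypes

PYC_CUDA_REASON_MAX = 128  # value stated in the commit message


class PycCudaDispatchTrace(ctypes.Structure):
    # Field names follow the commit message; C types and order are assumed.
    _fields_ = [
        ("cuda_requested", ctypes.c_int),
        ("cuda_available", ctypes.c_int),
        ("fallback_to_cpu", ctypes.c_int),
        ("reason", ctypes.c_char * PYC_CUDA_REASON_MAX),
    ]


def describe(trace):
    """Summarize a dispatch trace as a short human-readable string."""
    reason = trace.reason.decode("utf-8", errors="replace")
    if trace.fallback_to_cpu:
        return f"CPU fallback: {reason}"
    return "dispatched on CUDA"
```

A mismatch between a mirror like this and the C definition is the same class of bug the commit fixes on the C side, so generating the binding from the header (as `build.rs` does with bindgen for Rust) is generally safer than maintaining it by hand.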

Manus AI added 2 commits March 2, 2026 09:52

DarkStarStrix added the enhancement New feature or request label Mar 2, 2026

DarkStarStrix merged commit 87d3596 into main Mar 2, 2026
6 checks passed

DarkStarStrix deleted the feature/unified-hpc-toolchain branch March 4, 2026 15:22

DarkStarStrix merged 2 commits into main from feature/unified-hpc-toolchain
