
Conversation

@zasdfgbnm
Collaborator

No description provided.

@github-actions

github-actions bot commented Dec 17, 2025

Review updated until commit 588ec95

Description

  • Add meta-device fast path to CutlassNvfp4GroupedMmaOp::evaluate for handling meta tensors

  • Implement shape and dtype inference for meta tensors without actual computation

  • Add comprehensive test covering both CUDA and meta device evaluation paths

  • Change NVFUSER_CUTLASS_KERNEL_ENABLED definition from PRIVATE to PUBLIC in CMake

Changes walkthrough

Relevant files

Enhancement
csrc/ir/composite_nodes.cpp — Add meta-device fast path to CutlassNvfp4GroupedMmaOp::evaluate
  • Move the NVFUSER_CUTLASS_KERNEL_ENABLED guard down in CutlassNvfp4GroupedMmaOp::evaluate
  • Add a meta-device fast path before the CUTLASS computation that handles meta tensors
  • Create an empty result tensor with the correct shape and properties for meta tensors
  • Return early from the meta path to avoid actual computation
  +25/-1

Tests
tests/cpp/test_meta.cpp — Add comprehensive test for CutlassNvfp4GroupedMma meta-device support
  • Add a CutlassNvfp4GroupedMma test for meta-device evaluation
  • Create both real CUDA and meta tensor inputs with proper data types
  • Verify meta evaluation produces the same shape and properties as real evaluation
  • Include detailed comments explaining tensor shapes and block-scaling factors
  +181/-0

Configuration changes
CMakeLists.txt — Change NVFUSER_CUTLASS_KERNEL_ENABLED to a PUBLIC definition
  • Change the NVFUSER_CUTLASS_KERNEL_ENABLED definition from PRIVATE to PUBLIC
  • Make the macro available to targets that depend on codegen_internal
  +1/-1

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    CMake Definition Scope Change

    The PR changes NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC in CMakeLists.txt. This exposes the macro to other targets that link against codegen_internal. Please verify this is intentional and doesn't cause unintended side effects in other parts of the codebase that might not expect this macro to be available.

    Meta Device Path Validation

    The meta-device fast path correctly handles tensor shape calculation and device placement. However, consider adding validation to ensure all input tensors are on the same device type (all meta or all non-meta) before proceeding, as mixing meta and real devices could lead to unexpected behavior.

    // Meta-device fast path outside of torch version guard
    if (mat1.is_meta() || mat2.is_meta() || scale1.is_meta() ||
        scale2.is_meta() || alpha.is_meta() || problem_sizes.is_meta() ||
        expert_offsets.is_meta() || sf_offsets.is_meta()) {
      // For nvfp4_scaled_grouped_mm, the output shape is [M, N]
      // where M = mat1.size(0) and N = mat2.size(2).
      // Note: CutlassNvfp4GroupedMmaOp expects mat2 to be [G, K/2, N] (packed) at
      // runtime and transposes it before calling into CUTLASS.
      std::vector<int64_t> result_sizes = {mat1.size(0), mat2.size(2)};
    
      at::ScalarType out_dtype = data_type_to_aten(out()->dtype());
      auto options =
          mat1.options().device(c10::Device(c10::kMeta)).dtype(out_dtype);
      at::Tensor result = at::empty(result_sizes, options);
    
      if (const auto rfactor_did_idx = getRFactorDeviceDimensionIndex(out());
          rfactor_did_idx != -1) {
        result = result.unsqueeze(rfactor_did_idx);
      }
    
      return {result};
    }
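
The shape logic of the fast path boils down to two index reads. A shape-only sketch in Python (illustrative function name; dimension layout taken from the comment in the snippet above — mat1 is [M, K/2] packed, mat2 is [G, K/2, N] packed):

```python
def nvfp4_grouped_mm_out_shape(mat1_shape, mat2_shape):
    # mat1: [M, K/2]   (two fp4 values packed per byte)
    # mat2: [G, K/2, N] (grouped, packed), so N lives at index 2
    m = mat1_shape[0]
    n = mat2_shape[2]
    return [m, n]

# 128 rows, K/2 = 128 packed columns, 4 groups, N = 64
print(nvfp4_grouped_mm_out_shape([128, 128], [4, 128, 64]))  # → [128, 64]
```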

    @zasdfgbnm
    Collaborator Author

    !test

    @zasdfgbnm zasdfgbnm marked this pull request as ready for review January 7, 2026 01:48
    @zasdfgbnm zasdfgbnm requested a review from jjsjann123 January 7, 2026 01:48
    @greptile-apps
    Contributor

    greptile-apps bot commented Jan 7, 2026

    Greptile Summary

    This PR adds meta-device support for CutlassNvfp4GroupedMmaOp::evaluate, enabling shape inference without executing CUTLASS kernels.

    Changes:

    • Added meta-device fast path in CutlassNvfp4GroupedMmaOp::evaluate that computes output shape [M, N] from input dimensions, where M = mat1.size(0) and N = mat2.size(1)
    • Moved meta-device check outside the NVFUSER_CUTLASS_KERNEL_ENABLED preprocessor guard so shape inference works even when CUTLASS is not available
    • Properly handles rfactor device dimension indexing with unsqueeze when needed
    • Added comprehensive test MetaTest.CutlassNvfp4GroupedMma that validates meta-device outputs match CUDA execution for shapes, strides, and dtype

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • The changes follow established patterns for meta-device support in other operations (e.g., GroupedMmaOp, SdpaFwdOp). The implementation correctly computes output shape from input dimensions and includes comprehensive testing. The meta-device path is isolated from CUDA execution and only affects shape inference.
    • No files require special attention

    Important Files Changed

    Filename Overview
    csrc/ir/composite_nodes.cpp Added meta-device support for CutlassNvfp4GroupedMmaOp::evaluate by computing output shape without executing CUTLASS kernel
    tests/cpp/test_meta.cpp Added comprehensive test validating meta-device output shapes and strides match CUDA execution for CutlassNvfp4GroupedMma operations

    Sequence Diagram

    sequenceDiagram
        participant Evaluator as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant MetaPath as Meta Device Path
        participant CudaPath as CUDA Path (CUTLASS)
        
        Evaluator->>Op: evaluate(inputs)
        Op->>Op: Extract 8 input tensors
        
        alt Any input is meta device
            Op->>MetaPath: Check is_meta() for all inputs
            MetaPath->>MetaPath: Compute output shape [M, N]<br/>M = mat1.size(0)<br/>N = mat2.size(1)
            MetaPath->>MetaPath: Create empty tensor on meta device<br/>with correct dtype
            MetaPath->>MetaPath: Apply rfactor dimension if needed
            MetaPath-->>Evaluator: Return meta tensor
        else All inputs on CUDA
            Op->>CudaPath: Validate input types and shapes
            CudaPath->>CudaPath: Calculate strides (ab_strides, c_strides)
            CudaPath->>CudaPath: Call cutlass_kernels::nvfp4_scaled_grouped_mm
            CudaPath->>CudaPath: Apply rfactor dimension if needed
            CudaPath-->>Evaluator: Return computed result tensor
        end
    

    @zasdfgbnm
    Collaborator Author

    !test

    scale2.is_meta() || alpha.is_meta() || problem_sizes.is_meta() ||
    expert_offsets.is_meta() || sf_offsets.is_meta()) {
    // For nvfp4_scaled_grouped_mm, the output shape is [M, N]
    // where M = mat1.size(0) and N = mat2.size(1)
    Collaborator

    I thought the output size n is mat2.size(2).
    e.g. if you look at line 1767 below.

    at::Tensor scale2_input = at::randn({4, 128, 8}, options_fp8);
    at::Tensor alpha_input = at::ones({4}, options_fp32);
    at::Tensor problem_sizes_input = at::tensor(
    {{32, 128, 128}, {32, 128, 128}, {32, 128, 128}, {32, 128, 128}},
    Collaborator

    btw, here we get away with k == n. Maybe we want to change that just for slightly better test coverage. 😉
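
The coverage gap is easy to see with a shape-only sketch in Python (illustrative values): with the square test shapes, a wrong dimension index returns the same number as the right one, so the test cannot catch it; making n differ from k/2 separates them.

```python
# Current test: mat2 is [G, K/2, N] with K/2 == N == 128.
mat2_square = (4, 128, 128)
# size(1) and size(2) coincide, so reading the wrong index
# still "passes" any shape comparison:
assert mat2_square[1] == mat2_square[2]

# With N != K/2 the two indices diverge and a wrong read is caught:
mat2_rect = (4, 128, 64)
assert mat2_rect[1] != mat2_rect[2]
print("square:", mat2_square[1:], "rect:", mat2_rect[1:])
```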

    Contributor

    @greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Adds meta device support for CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without executing actual CUTLASS kernels.

    Key Changes:

    • Moved input tensor extraction outside the NVFUSER_CUTLASS_KERNEL_ENABLED guard to allow meta device path to access tensor shapes
    • Implemented fast path for meta device tensors that constructs output tensor with shape [M, N] where M = mat1.size(0) and N = mat2.size(1)
    • Correctly handles rfactor device dimension indexing for both meta and real execution paths
    • Added comprehensive test CutlassNvfp4GroupedMma that verifies meta evaluation produces identical shape, dtype, and strides as CUDA evaluation

    Implementation Quality:

    • Follows existing meta device patterns used in other composite ops (e.g., MatmulOp, EmbeddingOp)
    • Correctly computes output shape based on grouped matmul semantics where mat1 is [M, K/2] and mat2 is [G, N, K/2]
    • Test uses realistic dimensions where M, N, K, and K/2 are all different to catch dimension indexing errors

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • The implementation follows established patterns for meta device support in the codebase, includes comprehensive testing that validates correctness, and only adds functionality without modifying existing behavior. The changes are well-isolated and the meta path returns early before any real computation occurs.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    csrc/ir/composite_nodes.cpp 5/5 Added meta device support for CutlassNvfp4GroupedMmaOp by moving input extraction outside CUTLASS guard and implementing fast path that returns meta tensor with correct shape [M, N]
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test for CutlassNvfp4GroupedMmaOp meta device support, verifying output shape, dtype, and strides match between CUDA and meta evaluation paths

    Sequence Diagram

    sequenceDiagram
        participant Client as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant ATen as ATen/PyTorch
        participant Cutlass as CUTLASS Kernel
        
        Client->>Op: evaluate(inputs)
        Op->>Op: Extract 8 input tensors
        Note over Op: mat1, mat2, scale1, scale2,<br/>alpha, problem_sizes,<br/>expert_offsets, sf_offsets
        
        alt Any input is meta device
            Op->>Op: Calculate output shape<br/>[mat1.size(0), mat2.size(1)]
            Op->>ATen: at::empty(shape, meta_device)
            ATen-->>Op: meta tensor
            Op->>Op: Apply rfactor dimension if needed
            Op-->>Client: Return meta tensor
        else All inputs are real device
            Note over Op: CUTLASS_KERNEL_ENABLED required
            Op->>Op: Validate tensor dtypes
            Op->>Op: Calculate strides (ab_strides, c_strides)
            Op->>Cutlass: nvfp4_scaled_grouped_mm()
            Note over Cutlass: Performs FP4 grouped matmul<br/>with block scaling
            Cutlass-->>Op: Result tensor [M, N]
            Op->>Op: Apply rfactor dimension if needed
            Op-->>Client: Return result tensor
        end
    

    Contributor

    @greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Added meta device support for CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without executing CUDA kernels.

    Key changes:

    • CMakeLists.txt: Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC so test files can properly use the #if guard
    • csrc/ir/composite_nodes.cpp: Added meta-device fast path in CutlassNvfp4GroupedMmaOp::evaluate that checks if any input tensor is on meta device and creates output tensor with correct shape [M, N] where M = mat1.size(0) and N = mat2.size(1)
    • tests/cpp/test_meta.cpp: Added comprehensive test case CutlassNvfp4GroupedMma that verifies meta evaluation produces tensors with same shape, dtype, and strides as CUDA execution

    The implementation follows the established pattern used in other composite ops (e.g., EmbeddingOp, GroupedMatmulOp) and correctly handles the rfactor device dimension.

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • The implementation follows established patterns from other meta device implementations in the codebase, has comprehensive test coverage that validates shape/dtype/stride correctness, and the CMakeLists.txt change is necessary and minimal
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC visibility to make it available to test files
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device support for CutlassNvfp4GroupedMmaOp::evaluate by adding fast-path that creates output tensor with correct shape/dtype
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test case for meta-device evaluation of CUTLASS NVFP4 grouped matrix multiplication

    Sequence Diagram

    sequenceDiagram
        participant Test as MetaTest
        participant EE as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant Kernel as cutlass_kernels
    
        Note over Test: Create fusion with cutlass_nvfp4_grouped_mm
        Test->>EE: bind meta device inputs
        Test->>EE: evaluate(fusion output)
        EE->>Op: evaluate(inputs)
        
        alt Any input is meta device
            Op->>Op: Check if any input.is_meta()
            Op->>Op: Calculate output shape [M, N]<br/>M = mat1.size(0), N = mat2.size(1)
            Op->>Op: Create meta tensor with correct dtype
            Op->>Op: Apply rfactor dimension if needed
            Op-->>EE: Return meta tensor
        else All inputs are CUDA
            Op->>Op: Extract input tensors
            Op->>Op: Validate shapes and dtypes
            Op->>Op: Calculate strides
            Op->>Kernel: nvfp4_scaled_grouped_mm(...)
            Kernel-->>Op: CUDA result tensor
            Op->>Op: Apply rfactor dimension if needed
            Op-->>EE: Return CUDA tensor
        end
        
        EE-->>Test: meta_out tensor
        Test->>Test: Verify meta_out.is_meta()
        Test->>Test: Verify sizes match real_out
        Test->>Test: Verify strides match real_out
    

    Contributor

    @greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Added meta-device support for CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without executing CUTLASS kernels.

    Key Changes

    • Meta-device fast path: Checks if any input tensors are on meta device and returns a correctly-shaped output tensor without invoking CUTLASS kernels
    • Output shape calculation: Computes output as [M, N] where M = mat1.size(0) and N = mat2.size(1), consistent with grouped matrix multiplication semantics
    • rFactor dimension handling: Properly handles rFactor device dimensions by unsqueezing when necessary
    • Build system fix: Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC to make the preprocessor define visible to test targets
    • Comprehensive test: Added CutlassNvfp4GroupedMma test that validates meta-device evaluation produces identical shape, dtype, and strides as CUDA execution

    The implementation follows existing patterns in the codebase for meta-device support (e.g., MatmulOp, GroupedMatmulOp, SdpaFwdOp) and maintains consistency with the actual CUTLASS kernel behavior.

    Confidence Score: 5/5

    • This PR is safe to merge with no issues found
    • The implementation follows established patterns in the codebase for meta-device support, the output shape calculation is correct and well-documented, the build system change is necessary and minimal, and comprehensive testing validates the functionality
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC to make it visible to test targets linking against codegen_internal
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device fast path to CutlassNvfp4GroupedMmaOp::evaluate that returns correctly-shaped tensors without executing CUTLASS kernels
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test CutlassNvfp4GroupedMma that verifies meta-device evaluation produces correct shape, dtype, and strides matching CUDA execution

    Sequence Diagram

    sequenceDiagram
        participant Test as test_meta.cpp
        participant EE as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant Meta as Meta Device Path
        participant CUTLASS as CUTLASS Kernel Path
    
        Test->>EE: evaluate(meta tensors)
        EE->>Op: evaluate(inputs)
        
        Op->>Op: Extract input tensors (mat1, mat2, scales, etc.)
        
        alt Any input is_meta()
            Op->>Meta: Check if any tensor is on meta device
            Meta->>Meta: Calculate output shape [M, N]<br/>M = mat1.size(0), N = mat2.size(1)
            Meta->>Meta: Create empty tensor with<br/>correct dtype and shape on meta device
            Meta->>Meta: Handle rFactor dimension if needed
            Meta-->>Op: Return meta tensor
        else All inputs on CUDA
            Op->>Op: Validate input dtypes and shapes
            Op->>Op: Calculate stride tensors
            Op->>CUTLASS: nvfp4_scaled_grouped_mm()
            CUTLASS-->>Op: Return result tensor
            Op->>Op: Handle rFactor dimension if needed
        end
        
        Op-->>EE: Return result
        EE-->>Test: Return output tensor
        Test->>Test: Verify shape, dtype, strides match
    

    Contributor

    @greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Added meta device support for CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without requiring actual CUTLASS kernels.

    Key Changes:

    • Modified CMakeLists.txt to expose NVFUSER_CUTLASS_KERNEL_ENABLED as a PUBLIC definition so test files can check it
    • Added meta-device fast path in CutlassNvfp4GroupedMmaOp::evaluate that calculates output shape [M, N] from input dimensions before entering CUTLASS-specific code
    • Added comprehensive test CutlassNvfp4GroupedMma with 8 input tensors to verify meta and CUDA paths produce identical shapes/strides
    • Follows established pattern from other operations like MatmulOp, EmbeddingOp, and CumsumOp for meta device handling

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • Changes follow well-established patterns in the codebase, include comprehensive testing, and make minimal modifications with clear purpose. The meta device handling is implemented consistently with other operations, and the PUBLIC visibility change is necessary and appropriate.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC to allow test files to access this definition
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device fast path for CutlassNvfp4GroupedMmaOp::evaluate to handle meta tensors before CUTLASS-specific code
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test CutlassNvfp4GroupedMma to verify meta device support with 8 tensor inputs and shape/stride validation

    Sequence Diagram

    sequenceDiagram
        participant Test as Test (test_meta.cpp)
        participant EE as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant Meta as Meta Device Path
        participant CUTLASS as CUTLASS Kernel Path
        
        Test->>EE: bind meta tensors (mat1, mat2, scales, etc.)
        Test->>EE: evaluate(fusion output)
        EE->>Op: evaluate(inputs)
        
        Op->>Op: Extract 8 input tensors
        
        alt Any input is meta device
            Op->>Meta: Check is_meta() on all inputs
            Meta->>Meta: Calculate output shape [M, N]<br/>M = mat1.size(0), N = mat2.size(1)
            Meta->>Meta: Create empty tensor on meta device
            Meta->>Meta: Apply rfactor device dim if needed
            Meta-->>Op: Return meta tensor
        else All inputs are CUDA tensors
            Op->>CUTLASS: Validate inputs and shapes
            CUTLASS->>CUTLASS: Prepare stride tensors
            CUTLASS->>CUTLASS: Call nvfp4_scaled_grouped_mm kernel
            CUTLASS->>CUTLASS: Apply rfactor device dim if needed
            CUTLASS-->>Op: Return computed result
        end
        
        Op-->>EE: Return result tensor
        EE-->>Test: Return evaluated output
        Test->>Test: Validate meta output matches<br/>CUDA output shape/stride
    

    Contributor

    @greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    This PR enables meta-device support for CutlassNvfp4GroupedMmaOp::evaluate, allowing shape inference without requiring CUDA execution. The implementation moves the meta-device fast path outside the NVFUSER_CUTLASS_KERNEL_ENABLED guard and makes the macro PUBLIC in CMakeLists.txt so test files can use it.

    Major Changes:

    • Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC scope in CMakeLists.txt to enable conditional compilation in test files
    • Added meta-device support in CutlassNvfp4GroupedMmaOp::evaluate() by checking is_meta() on all inputs and creating appropriately shaped meta tensors
    • Added comprehensive test CutlassNvfp4GroupedMma that validates meta tensors match real CUDA output shapes and strides

    Issues Found:

    • Critical compilation errors in test file: undefined variables options_fp4 and options_fp8 are referenced but never declared (lines 620, 622, 626, 630)

    Confidence Score: 1/5

    • PR contains critical compilation errors that will prevent the test from building
    • The test file references undefined variables (options_fp4 and options_fp8) which will cause compilation failures. While the core implementation in composite_nodes.cpp appears sound and the CMakeLists.txt change is necessary and correct, the test cannot run without fixing these critical syntax errors
    • tests/cpp/test_meta.cpp requires immediate attention to fix undefined variable references before merge

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC scope to make the macro available in test files
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device support by checking is_meta() and creating meta tensors with correct shapes before the CUTLASS-guarded code path
    tests/cpp/test_meta.cpp 1/5 Added comprehensive test for meta-device support, but contains critical compilation errors with undefined variables options_fp4 and options_fp8

    Sequence Diagram

    sequenceDiagram
        participant Test as test_meta.cpp
        participant Eval as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant Meta as Meta Device Path
        participant CUTLASS as CUTLASS Kernel
    
        Test->>Eval: bind meta tensors (mat1, mat2, scales, etc.)
        Test->>Eval: evaluate(fusion output)
        Eval->>Op: CutlassNvfp4GroupedMmaOp::evaluate()
        Op->>Op: Extract input tensors
        Op->>Meta: Check if any input is_meta()
        
        alt Meta Device Path
            Meta->>Meta: Calculate output shape [M, N]
            Meta->>Meta: Create empty meta tensor with correct dtype
            Meta->>Meta: Apply rfactor dimension if needed
            Meta-->>Test: Return meta tensor
        else CUDA Device Path (NVFUSER_CUTLASS_KERNEL_ENABLED)
            Op->>CUTLASS: Validate input types
            Op->>CUTLASS: Create stride tensors
            Op->>CUTLASS: Call nvfp4_scaled_grouped_mm kernel
            CUTLASS-->>Op: Return result tensor
            Op->>Op: Apply rfactor dimension if needed
            Op-->>Test: Return CUDA tensor
        end
    

    ee_cuda.bind(fusion->inputs().at(6), expert_offsets_input);
    ee_cuda.bind(fusion->inputs().at(7), sf_offsets_input);
    auto real_out = ee_cuda.evaluate(fusion->outputs().at(0)).as<at::Tensor>();

    Contributor

    options_fp4 is undefined - should be options_uint8.dtype(at::kFloat4_e2m1fn_x2)

    Suggested change
    - mat1_input.sizes(), mat1_input.strides(), options_fp4.device(at::kMeta));
    + mat1_input.sizes(), mat1_input.strides(), options_uint8.dtype(at::kFloat4_e2m1fn_x2).device(at::kMeta));

    auto real_out = ee_cuda.evaluate(fusion->outputs().at(0)).as<at::Tensor>();

    // Meta evaluation
    ExpressionEvaluator ee_meta;
    Contributor

    options_fp4 is undefined - should be options_uint8.dtype(at::kFloat4_e2m1fn_x2)

    Suggested change
    - mat2_input.sizes(), mat2_input.strides(), options_fp4.device(at::kMeta));
    + mat2_input.sizes(), mat2_input.strides(), options_uint8.dtype(at::kFloat4_e2m1fn_x2).device(at::kMeta));

    auto meta_mat1 = at::empty_strided(
    mat1_input.sizes(), mat1_input.strides(), options_fp4.device(at::kMeta));
    auto meta_mat2 = at::empty_strided(
    mat2_input.sizes(), mat2_input.strides(), options_fp4.device(at::kMeta));
    Contributor

    options_fp8 is undefined - should be options_fp32.dtype(at::kFloat8_e4m3fn)

    Suggested change
    - options_fp8.device(at::kMeta));
    + options_fp32.dtype(at::kFloat8_e4m3fn).device(at::kMeta));

    auto meta_scale1 = at::empty_strided(
    scale1_input.sizes(),
    scale1_input.strides(),
    options_fp8.device(at::kMeta));
    Contributor

    options_fp8 is undefined - should be options_fp32.dtype(at::kFloat8_e4m3fn)

    Suggested change
    - options_fp8.device(at::kMeta));
    + options_fp32.dtype(at::kFloat8_e4m3fn).device(at::kMeta));

    Contributor

    @greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Adds meta-device support to CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without CUDA execution. The meta path computes output shape as [M, N] where M = mat1.size(0) and N = mat2.size(1), handling the rfactor device dimension index if present. The NVFUSER_CUTLASS_KERNEL_ENABLED compile definition is changed from PRIVATE to PUBLIC to make it visible to test files.

    Confidence Score: 5/5

    • Safe to merge - follows established patterns for meta-device support with comprehensive test coverage
    • The implementation follows the exact same pattern used by other meta-device handlers in the codebase (GroupedMmaOp, EmbeddingOp). The meta path correctly computes output shapes, preserves dtype, and handles the rfactor device dimension. The CMakeLists.txt change is necessary to expose the compile definition to test files. Comprehensive test coverage validates shapes, strides, and dtype against real CUDA execution.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    csrc/ir/composite_nodes.cpp 4/5 Adds meta-device support for CutlassNvfp4GroupedMmaOp::evaluate by handling meta tensors before CUTLASS-enabled code path
    CMakeLists.txt 4/5 Changes NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC compile definition for broader visibility

    Sequence Diagram

    sequenceDiagram
        participant Caller
        participant Evaluate as CutlassNvfp4GroupedMmaOp::evaluate
        participant MetaPath as Meta Device Path
        participant CutlassPath as CUTLASS Kernel Path
        
        Caller->>Evaluate: "evaluate(inputs)"
        Evaluate->>Evaluate: "Extract 8 input tensors"
        alt Any input is meta device
            Evaluate->>MetaPath: "Compute output shape"
            MetaPath->>MetaPath: "result_sizes = [mat1.size(0), mat2.size(1)]"
            MetaPath->>MetaPath: "Create empty meta tensor with out_dtype"
            MetaPath->>MetaPath: "Apply rfactor unsqueeze if needed"
            MetaPath-->>Caller: "Return meta tensor"
        else All inputs are CUDA
            Evaluate->>CutlassPath: "Call nvfp4_scaled_grouped_mm"
            CutlassPath->>CutlassPath: "Execute CUTLASS kernel"
            CutlassPath->>CutlassPath: "Apply rfactor unsqueeze if needed"
            CutlassPath-->>Caller: "Return computed result"
        end
    

    Contributor

    @greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    This PR adds meta-device support for CutlassNvfp4GroupedMmaOp::evaluate, enabling shape inference without requiring CUTLASS kernel execution. The implementation is consistent with existing meta-device patterns in the codebase.

    Key Changes:

    • Added meta-device fast path that checks if any input tensor is on meta device and returns an appropriately shaped output tensor
    • Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC in CMakeLists.txt to ensure test files can access the preprocessor definition
    • Added comprehensive test CutlassNvfp4GroupedMma that validates meta and CUDA paths produce matching shapes, dtypes, and strides
    • Meta path calculates output shape as [M, N] where M = mat1.size(0) and N = mat2.size(1), matching the actual CUTLASS operation semantics

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • The implementation follows established patterns in the codebase for meta-device support, includes comprehensive test coverage, and makes a necessary build configuration change. The meta-device logic correctly calculates output shapes consistent with the actual CUTLASS operation, and the CMakeLists.txt change properly exposes the preprocessor definition to dependent targets.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC to make the preprocessor definition available to targets linking against codegen_internal
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device fast path for CutlassNvfp4GroupedMmaOp::evaluate that returns correctly shaped output tensor without requiring CUTLASS kernel execution
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test for meta-device support in CutlassNvfp4GroupedMmaOp, verifying output shape, dtype, and strides match CUDA execution

    Sequence Diagram

    sequenceDiagram
        participant Test as MetaTest
        participant EE as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant Cutlass as CUTLASS Kernel
        
        Note over Test: Create fusion with meta tensors
        Test->>EE: bind(meta_mat1, meta_mat2, ...)
        Test->>EE: evaluate(fusion output)
        EE->>Op: evaluate(ee, inputs)
        
        alt Any input is_meta()
            Op->>Op: Calculate output shape [M, N]<br/>M = mat1.size(0), N = mat2.size(1)
            Op->>Op: Create meta tensor with correct dtype
            Op->>Op: Apply rfactor_did_idx if needed
            Op-->>EE: Return meta tensor
        else All inputs are CUDA tensors
            Note over Op: Check NVFUSER_CUTLASS_KERNEL_ENABLED
            Op->>Op: Validate tensor types and shapes
            Op->>Op: Calculate strides for kernel
            Op->>Cutlass: nvfp4_scaled_grouped_mm(...)
            Cutlass-->>Op: Result tensor
            Op->>Op: Apply rfactor_did_idx if needed
            Op-->>EE: Return result tensor
        end
        
        EE-->>Test: meta_out tensor
        Test->>Test: Verify: is_meta(), dtype, sizes, strides
    
@greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    This PR adds meta device support for CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without running actual CUDA kernels. The implementation follows existing patterns in the codebase for handling meta devices.

    Key Changes:

    • Added meta device fast path in CutlassNvfp4GroupedMmaOp::evaluate that checks if any input is on meta device and returns appropriately shaped meta tensor
    • Changed NVFUSER_CUTLASS_KERNEL_ENABLED macro visibility from PRIVATE to PUBLIC in CMakeLists.txt to allow test files to use the macro
    • Added comprehensive test CutlassNvfp4GroupedMma that validates output shape, dtype, and strides match between CUDA and meta execution paths

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • The changes are minimal, well-tested, and follow established patterns in the codebase. The meta device support is implemented consistently with other evaluate methods, the CMakeLists.txt change is necessary and correct, and comprehensive tests validate the implementation.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC visibility. This allows test files to check the macro definition. Simple and correct change.
    csrc/ir/composite_nodes.cpp 5/5 Added meta device support to CutlassNvfp4GroupedMmaOp::evaluate by checking all inputs with is_meta() and returning appropriate meta tensors. Follows existing patterns in the codebase.
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test for CutlassNvfp4GroupedMmaOp with meta device. Test properly guards with NVFUSER_CUTLASS_KERNEL_ENABLED and verifies output shape, dtype, and strides.

    Sequence Diagram

    sequenceDiagram
        participant Test as MetaTest
        participant EE as ExpressionEvaluator
        participant Eval as CutlassNvfp4GroupedMmaOp::evaluate
        participant ATen as at::Tensor
        
        Test->>EE: bind meta inputs (8 tensors)
        Test->>EE: evaluate(fusion output)
        EE->>Eval: evaluate(inputs)
        
        Eval->>Eval: Check if any input.is_meta()
        alt Any input is meta
            Eval->>Eval: Calculate output shape [M, N]
            Eval->>ATen: at::empty(sizes, meta device)
            ATen-->>Eval: meta tensor
            Eval->>Eval: Apply rfactor_did_idx if needed
            Eval-->>EE: meta tensor result
        else All inputs are CUDA
            Eval->>Eval: NVF_CHECK scalar types
            Eval->>Eval: Calculate strides
            Eval->>Eval: Call cutlass kernel
            Eval-->>EE: CUDA tensor result
        end
        
        EE-->>Test: output tensor
        Test->>Test: Verify shape, dtype, strides
    
@greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    This PR adds meta-device support for CutlassNvfp4GroupedMmaOp::evaluate, enabling shape inference without requiring CUTLASS kernel compilation.

    Key changes:

    • CMakeLists.txt: Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC compile definition, allowing the macro to be visible in header files
    • composite_nodes.cpp: Moved meta-device fast path before the #if NVFUSER_CUTLASS_KERNEL_ENABLED preprocessor guard, allowing shape inference to work even when CUTLASS is not compiled
    • test_meta.cpp: Added comprehensive test that validates meta-device execution by comparing output shapes, strides, and dtypes between CUDA and meta paths

    Implementation approach:
    The meta-device fast path computes output shape as [mat1.size(0), mat2.size(1)] and creates an empty meta tensor with the correct dtype, matching the pattern used in other operators like GroupedMmaOp.

    Confidence Score: 5/5

    • This PR is safe to merge with no significant risks
    • The changes are well-structured and follow established patterns in the codebase. The meta-device support is properly isolated in a fast path that doesn't affect the existing CUTLASS kernel execution. The test coverage is comprehensive and validates the implementation thoroughly.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changes NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC compile definition, enabling meta-device support check outside preprocessor guards
    csrc/ir/composite_nodes.cpp 5/5 Adds meta-device fast path before #if NVFUSER_CUTLASS_KERNEL_ENABLED guard in CutlassNvfp4GroupedMmaOp::evaluate, enabling shape inference without CUTLASS compilation
    tests/cpp/test_meta.cpp 5/5 Adds comprehensive test CutlassNvfp4GroupedMma that validates meta-device support by comparing output shapes/strides between CUDA and meta execution paths

    Sequence Diagram

    sequenceDiagram
        participant Test as test_meta.cpp
        participant EE as ExpressionEvaluator
        participant Eval as CutlassNvfp4GroupedMmaOp::evaluate
        participant MetaPath as Meta Fast Path
        participant CutlassPath as CUTLASS Kernel Path
        
        Test->>EE: bind meta tensors
        Test->>EE: evaluate(fusion->outputs)
        EE->>Eval: evaluate(inputs)
        
        Eval->>Eval: Check if any input is_meta()
        
        alt Any input is meta
            Eval->>MetaPath: Execute meta fast path
            MetaPath->>MetaPath: Calculate output shape [M, N]
            MetaPath->>MetaPath: Create empty meta tensor
            MetaPath->>MetaPath: Apply rFactor dim if needed
            MetaPath-->>Eval: Return meta tensor
        else All inputs on CUDA
            Eval->>CutlassPath: Check NVFUSER_CUTLASS_KERNEL_ENABLED
            alt CUTLASS enabled
                CutlassPath->>CutlassPath: Validate input types (Float4)
                CutlassPath->>CutlassPath: Call cutlass_kernels::nvfp4_scaled_grouped_mm
                CutlassPath-->>Eval: Return CUDA result
            else CUTLASS not enabled
                CutlassPath-->>Eval: Throw error
            end
        end
        
        Eval-->>EE: Return result
        EE-->>Test: Return evaluated tensor
    
@greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Added meta-device support for CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference and testing without CUDA execution.

    Key Changes:

    • Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC in CMakeLists.txt to expose the macro to test files
    • Added early-return path in CutlassNvfp4GroupedMmaOp::evaluate that detects meta tensors and returns appropriately shaped meta output tensors
    • Added comprehensive test case CutlassNvfp4GroupedMma that validates meta evaluation produces matching shapes/dtypes/strides compared to CUDA execution
    • Meta-device path correctly handles RFactor device dimension indexing by applying unsqueeze when needed

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • All changes are well-structured and follow established patterns in the codebase. The meta-device implementation mirrors other similar operations, the CMake change correctly exposes the macro for test compilation, and comprehensive tests validate correctness. No breaking changes or risky modifications.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED definition from PRIVATE to PUBLIC to expose the macro to downstream targets
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device support to CutlassNvfp4GroupedMmaOp::evaluate by checking for meta tensors and returning appropriately shaped meta output
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test CutlassNvfp4GroupedMma to verify meta-device evaluation produces correct shapes/dtypes matching CUDA path

    Sequence Diagram

    sequenceDiagram
        participant Client as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant Check as Meta Check
        participant Meta as Meta Path
        participant CUDA as CUDA/CUTLASS Path
        
        Client->>Op: evaluate(inputs)
        Op->>Op: Extract 8 input tensors
        Op->>Check: Check if any input is_meta()
        
        alt Any input is meta tensor
            Check->>Meta: Enter meta-device path
            Meta->>Meta: Calculate output shape [M, N]
            Meta->>Meta: Get output dtype from IR
            Meta->>Meta: Create meta tensor with options
            Meta->>Meta: Apply RFactor unsqueeze if needed
            Meta-->>Client: Return meta tensor
        else All inputs are CUDA tensors
            Check->>CUDA: Enter CUTLASS kernel path
            CUDA->>CUDA: Validate input dtypes (Float4_e2m1fn_x2)
            CUDA->>CUDA: Validate problem_sizes dimensions
            CUDA->>CUDA: Calculate strides (ab_strides, c_strides)
            CUDA->>CUDA: Call cutlass_kernels::nvfp4_scaled_grouped_mm
            CUDA->>CUDA: Apply RFactor unsqueeze if needed
            CUDA-->>Client: Return result tensor
        end
    
@greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    This PR adds meta device support for CutlassNvfp4GroupedMmaOp::evaluate to enable shape and stride inference without requiring actual CUDA execution.

    Key changes:

    • Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC in CMakeLists.txt to make the macro accessible in test files
    • Added meta-device fast path in CutlassNvfp4GroupedMmaOp::evaluate that computes output shape [M, N] from input shapes before the CUTLASS kernel guard
    • Added comprehensive test MetaTest.CutlassNvfp4GroupedMma that validates meta device behavior matches CUDA execution for shapes, strides, and dtypes

    Implementation details:

    • The meta path checks if any input tensor is on meta device and creates an output tensor on meta device with the correct shape, dtype, and rFactor dimension handling
    • Follows the established pattern used by other ops in the codebase (GroupedMmaOp, ScanOp, EmbeddingFwdOp)
    • Test properly guards CUTLASS-specific functionality with #if NVFUSER_CUTLASS_KERNEL_ENABLED

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk - it adds meta device support following established patterns
    • The changes are clean, well-tested, and follow the existing codebase patterns. The CMakeLists.txt change is necessary and minimal, the implementation correctly places the meta path before the CUTLASS guard, and comprehensive testing validates the behavior. No files require special attention as all changes are straightforward and consistent.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC to make it visible in test files
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device support to CutlassNvfp4GroupedMmaOp::evaluate by moving meta path before NVFUSER_CUTLASS_KERNEL_ENABLED guard
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test for CutlassNvfp4GroupedMmaOp meta device support with proper guards

    Sequence Diagram

    sequenceDiagram
        participant Test as MetaTest
        participant EE as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant ATen as ATen/PyTorch
        
        Note over Test: Create fusion with FP4 grouped matmul
        Test->>Test: Create meta tensors (mat1, mat2, scales, etc.)
        Test->>EE: bind meta tensors to fusion inputs
        Test->>EE: evaluate(fusion->outputs().at(0))
        
        EE->>Op: evaluate(ee, meta_inputs)
        
        Note over Op: Check if any input is_meta()
        alt Any input is meta
            Op->>Op: Compute output shape [M, N]<br/>M=mat1.size(0), N=mat2.size(1)
            Op->>ATen: at::empty(result_sizes, meta_device)
            ATen-->>Op: meta output tensor
            Op->>Op: Handle rFactor dimension if needed
            Op-->>EE: return meta output
        else All inputs are CUDA
            Op->>ATen: cutlass_kernels::nvfp4_scaled_grouped_mm(...)
            ATen-->>Op: CUDA result
            Op-->>EE: return CUDA result
        end
        
        EE-->>Test: meta_out tensor
        Test->>Test: Validate shape, strides, dtype match CUDA execution
    
@greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Added meta-device support to CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without GPU execution. The key changes include:

    • CMakeLists.txt: Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC visibility, making it accessible to test files
    • csrc/ir/composite_nodes.cpp: Added meta-device fast path that checks if any input tensor is on meta device and returns a properly-shaped meta tensor, positioned before the #if NVFUSER_CUTLASS_KERNEL_ENABLED guard
    • tests/cpp/test_meta.cpp: Added comprehensive test validating that meta-device evaluation produces correct shape, dtype, and strides matching CUDA execution

    The implementation follows the established pattern used by other operators in the codebase (e.g., MatmulOp, EmbeddingOp) where meta tensors are detected and handled separately to enable shape/dtype inference.

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • The changes are straightforward and follow established patterns in the codebase. The meta-device path is clearly separated from the CUDA execution path, the CMakeLists visibility change is necessary and correct, and the comprehensive test validates the implementation.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC visibility, enabling header files to check the macro
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device support in CutlassNvfp4GroupedMmaOp::evaluate by checking for meta tensors and returning appropriately-shaped meta tensor
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test CutlassNvfp4GroupedMma that validates meta-device behavior matches CUDA execution for shape/dtype/strides

    Sequence Diagram

    sequenceDiagram
        participant Test as Test/User Code
        participant EE as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp::evaluate
        participant MetaPath as Meta Device Path
        participant CUDAPath as CUDA/CUTLASS Path
        
        Test->>EE: bind meta tensors
        Test->>EE: evaluate(fusion output)
        EE->>Op: evaluate(ee, inputs)
        Op->>Op: Extract 8 input tensors
        
        alt Any input is meta tensor
            Op->>MetaPath: Check is_meta() for all inputs
            MetaPath->>MetaPath: Calculate output shape [M, N]
            MetaPath->>MetaPath: Create meta tensor with out_dtype
            MetaPath->>MetaPath: Apply rfactor dimension if needed
            MetaPath-->>Op: Return meta tensor
        else All inputs are CUDA tensors
            Op->>CUDAPath: Check NVFUSER_CUTLASS_KERNEL_ENABLED
            CUDAPath->>CUDAPath: Validate tensor types and shapes
            CUDAPath->>CUDAPath: Calculate strides
            CUDAPath->>CUDAPath: Call cutlass_kernels::nvfp4_scaled_grouped_mm
            CUDAPath->>CUDAPath: Apply rfactor dimension if needed
            CUDAPath-->>Op: Return result tensor
        end
        
        Op-->>EE: Return result
        EE-->>Test: Return evaluated output
    
@greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Added meta-device support for CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without CUDA execution.

    Key changes:

    • Added early return path in CutlassNvfp4GroupedMmaOp::evaluate that checks if any input tensor is on the meta device and returns an appropriately shaped meta output tensor
    • Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC in CMakeLists.txt so test files can access it
    • Added comprehensive test validating that meta-device evaluation produces outputs with correct shapes, strides, and dtypes matching CUDA execution

    The implementation follows the established pattern used by other ops in the codebase (matmul, SDPA, embedding, etc.) for meta-device handling.

    Confidence Score: 5/5

    • This PR is safe to merge with no issues found
    • The changes are minimal, well-tested, and follow established patterns in the codebase. The meta-device support implementation mirrors other similar operations, the CMakeLists.txt change is necessary and correct, and the test thoroughly validates the functionality.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC visibility to make it available to test files that link against codegen_internal
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device support in CutlassNvfp4GroupedMmaOp::evaluate by checking for meta tensors and returning appropriately shaped meta outputs before CUTLASS kernel invocation
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test CutlassNvfp4GroupedMma that verifies meta-device evaluation produces correct shapes/strides/dtypes compared to CUDA execution

    Sequence Diagram

    sequenceDiagram
        participant Test as Test Code
        participant EE as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant Meta as Meta Device Path
        participant CUTLASS as CUTLASS Kernel
    
        Test->>EE: bind meta tensors to fusion inputs
        Test->>EE: evaluate(fusion output)
        EE->>Op: evaluate(inputs)
        Op->>Op: Extract 8 input tensors
        
        alt Any input is_meta()
            Op->>Meta: Check if any tensor is meta
            Meta->>Meta: Calculate output shape [M, N]
            Meta->>Meta: Create empty meta tensor with correct dtype
            Meta->>Meta: Apply rfactor dimension if needed
            Meta-->>Op: Return meta tensor
            Op-->>EE: Return {meta_result}
        else All inputs are CUDA tensors
            Op->>Op: Validate tensor dtypes and shapes
            Op->>Op: Calculate strides for CUTLASS
            Op->>CUTLASS: nvfp4_scaled_grouped_mm()
            CUTLASS-->>Op: Return CUDA result
            Op->>Op: Apply rfactor dimension if needed
            Op-->>EE: Return {cuda_result}
        end
        
        EE-->>Test: Return result tensor
        Test->>Test: Verify meta output shape/dtype/strides
    
@greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Adds meta device support to CutlassNvfp4GroupedMmaOp::evaluate, allowing shape inference without executing CUTLASS kernels.

    Major Changes:

    • Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC in CMakeLists.txt (required for test files to see the preprocessor guard)
    • Added meta device fast path in CutlassNvfp4GroupedMmaOp::evaluate that checks if any input is meta and returns appropriately shaped output tensor
    • Added comprehensive test CutlassNvfp4GroupedMma that validates meta device behavior against CUDA execution

    Implementation:
    The meta device path computes output shape as [mat1.size(0), mat2.size(1)] and creates an empty meta tensor with the correct dtype, matching the pattern used in other composite ops like GroupedMmaOp. The rfactor device dimension handling is consistent with the CUDA path.

    Confidence Score: 4/5

    • This PR is safe to merge with minor considerations regarding dimension validation
    • The implementation correctly adds meta device support following established patterns in the codebase. The CMakeLists.txt change is necessary and correct. The test is comprehensive. However, the meta path lacks explicit dimension validation that exists in similar operations like GroupedMmaOp, though this may be intentional as validation could occur during fusion construction.
    • csrc/ir/composite_nodes.cpp - verify that dimension validation in meta path matches design intent

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC to make it visible to test files that link against codegen_internal
    csrc/ir/composite_nodes.cpp 4/5 Added meta device fast path to CutlassNvfp4GroupedMmaOp::evaluate that returns properly shaped tensor without CUTLASS execution
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test CutlassNvfp4GroupedMma that verifies meta device support with proper tensor shapes and strides

    Sequence Diagram

    sequenceDiagram
        participant Test as test_meta.cpp
        participant EE as ExpressionEvaluator
        participant Eval as CutlassNvfp4GroupedMmaOp::evaluate
        participant Meta as Meta Device Path
        participant CUTLASS as CUTLASS Kernel
    
        Test->>EE: bind meta tensors
        Test->>EE: evaluate(fusion output)
        EE->>Eval: evaluate(inputs)
        
        alt Any input is_meta()
            Eval->>Meta: Check if any tensor is meta
            Meta->>Meta: Calculate output shape [M, N]
            Meta->>Meta: Create empty meta tensor
            Meta->>Meta: Apply rfactor dimension if needed
            Meta-->>Eval: Return meta tensor
        else All inputs are CUDA tensors
            alt NVFUSER_CUTLASS_KERNEL_ENABLED defined
                Eval->>CUTLASS: Validate input types (FP4)
                Eval->>CUTLASS: Validate problem_sizes shape
                Eval->>CUTLASS: Calculate stride tensors
                Eval->>CUTLASS: nvfp4_scaled_grouped_mm(...)
                CUTLASS-->>Eval: Result tensor
                Eval->>Eval: Apply rfactor dimension if needed
                Eval-->>EE: Return result
            else CUTLASS not enabled
                Eval-->>EE: Throw error
            end
        end
        
        EE-->>Test: meta_out tensor
        Test->>Test: Verify sizes, strides, dtype
    

Comment on lines 1726 to 1731

    if (mat1.is_meta() || mat2.is_meta() || scale1.is_meta() ||
        scale2.is_meta() || alpha.is_meta() || problem_sizes.is_meta() ||
        expert_offsets.is_meta() || sf_offsets.is_meta()) {
      // For nvfp4_scaled_grouped_mm, the output shape is [M, N]
      // where M = mat1.size(0) and N = mat2.size(1)
      std::vector<int64_t> result_sizes = {mat1.size(0), mat2.size(1)};

    missing dimension validation in meta path - mat1 expected to be 2D and mat2 expected to be 3D, but shape calculation proceeds without checks (unlike GroupedMmaOp::evaluate which validates dimensions)

@greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Added meta-device support to CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without executing CUTLASS kernels. This allows the operation to compute output tensor shapes and properties when inputs are meta-device tensors, which is useful for ahead-of-time analysis and optimization.

    Key changes:

    • Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC in CMakeLists.txt so test files can access the macro
    • Added meta-device fast path before the CUTLASS kernel guard that checks if any input tensor is on the meta device
    • Computes output shape as [M, N] where M = mat1.size(0) and N = mat2.size(1), handles rfactor dimensions
    • Added comprehensive test comparing meta and CUDA device outputs for correctness

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • The changes follow established patterns in the codebase for meta-device support, include comprehensive testing, and make minimal, well-understood modifications. The CMakeLists.txt change correctly propagates the macro definition, and the meta-device logic mirrors similar implementations in the same file.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED macro from PRIVATE to PUBLIC scope, enabling test files to access the macro definition
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device fast path to CutlassNvfp4GroupedMmaOp::evaluate that computes output shape without executing CUTLASS kernels
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test CutlassNvfp4GroupedMma comparing meta-device and CUDA device outputs for shape, dtype, and stride consistency

    Sequence Diagram

    sequenceDiagram
        participant Test as Test Code
        participant EE as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant Tensor as at::Tensor
        participant CUTLASS as CUTLASS Kernel
    
        Note over Test: Create fusion with CutlassNvfp4GroupedMmaOp
        Test->>EE: bind meta tensors (mat1, mat2, scales, etc)
        Test->>EE: evaluate(output)
        EE->>Op: evaluate(inputs)
        
        Op->>Tensor: Extract input tensors from PolymorphicValue
        Op->>Tensor: is_meta() check on all inputs
        
        alt Any input is meta device
            Note over Op: Fast path - no CUTLASS execution
            Op->>Op: Calculate output shape [M, N]
        Op->>Op: M = mat1.size(0), N = mat2.size(1)
            Op->>Tensor: Create meta tensor with computed shape
            Op->>Op: Handle rfactor dimension if needed
            Op-->>EE: Return meta tensor result
        else All inputs are CUDA device
            Note over Op: Runtime path - CUTLASS execution
            Op->>CUTLASS: nvfp4_scaled_grouped_mm(...)
            CUTLASS-->>Op: Return computed result
            Op->>Op: Handle rfactor dimension if needed
            Op-->>EE: Return CUDA tensor result
        end
        
        EE-->>Test: Return result tensor
        Test->>Test: Verify meta tensor properties match CUDA output
    
@greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    This PR adds meta device support for CutlassNvfp4GroupedMmaOp::evaluate, enabling shape inference without requiring actual CUDA tensors or CUTLASS runtime execution.

    Key changes:

    • Moved tensor extraction code before the #if NVFUSER_CUTLASS_KERNEL_ENABLED preprocessor guard to allow meta device handling regardless of CUTLASS availability
    • Added meta-device fast path that synthesizes output tensors with correct shape [M, N] derived from input dimensions
    • Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC in CMakeLists.txt so test files can use the macro
    • Added comprehensive test case verifying meta device evaluation produces tensors with correct shapes/strides matching CUDA execution

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk - it's a well-contained feature addition with proper testing
    • The changes follow established patterns in the codebase for meta device support (similar to other ops), include comprehensive testing, and the CMakeLists.txt change correctly exposes the compile definition to dependent targets
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC to make it visible to test files
    csrc/ir/composite_nodes.cpp 5/5 Added meta device support to CutlassNvfp4GroupedMmaOp::evaluate by moving tensor extraction before preprocessor guard and adding meta-device fast path
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test CutlassNvfp4GroupedMma to verify meta device support for the CUTLASS grouped matmul operation

    Sequence Diagram

    sequenceDiagram
        participant Caller
        participant ExpressionEvaluator
        participant CutlassNvfp4GroupedMmaOp
        participant MetaDevice
        participant CutlassKernel
    
        Caller->>ExpressionEvaluator: evaluate(fusion output)
        ExpressionEvaluator->>CutlassNvfp4GroupedMmaOp: evaluate(inputs)
        CutlassNvfp4GroupedMmaOp->>CutlassNvfp4GroupedMmaOp: Extract 8 input tensors
        
        alt Any input is meta device
            CutlassNvfp4GroupedMmaOp->>CutlassNvfp4GroupedMmaOp: Calculate output shape [M, N]
            CutlassNvfp4GroupedMmaOp->>MetaDevice: at::empty(result_sizes, meta options)
            MetaDevice-->>CutlassNvfp4GroupedMmaOp: meta tensor
            CutlassNvfp4GroupedMmaOp->>CutlassNvfp4GroupedMmaOp: Check rfactor_did_idx
            opt rfactor_did_idx != -1
                CutlassNvfp4GroupedMmaOp->>CutlassNvfp4GroupedMmaOp: result.unsqueeze(rfactor_did_idx)
            end
            CutlassNvfp4GroupedMmaOp-->>ExpressionEvaluator: return meta result
        else All inputs are CUDA tensors
            Note over CutlassNvfp4GroupedMmaOp: NVFUSER_CUTLASS_KERNEL_ENABLED guard
            CutlassNvfp4GroupedMmaOp->>CutlassNvfp4GroupedMmaOp: Validate input types & shapes
            CutlassNvfp4GroupedMmaOp->>CutlassNvfp4GroupedMmaOp: Setup strides (ab_strides, c_strides)
            CutlassNvfp4GroupedMmaOp->>CutlassKernel: nvfp4_scaled_grouped_mm(...)
            CutlassKernel-->>CutlassNvfp4GroupedMmaOp: result tensor
            CutlassNvfp4GroupedMmaOp->>CutlassNvfp4GroupedMmaOp: Check rfactor_did_idx
            opt rfactor_did_idx != -1
                CutlassNvfp4GroupedMmaOp->>CutlassNvfp4GroupedMmaOp: result.unsqueeze(rfactor_did_idx)
            end
            CutlassNvfp4GroupedMmaOp-->>ExpressionEvaluator: return CUDA result
        end
        
        ExpressionEvaluator-->>Caller: result tensor
    
3 participants