
Conversation

@zasdfgbnm
Collaborator

No description provided.

@github-actions

github-actions bot commented Dec 17, 2025

Review updated until commit 588ec95

Description

  • Add meta-device fast path to CutlassNvfp4GroupedMmaOp::evaluate for handling meta tensors

  • Implement shape and dtype inference for meta tensors without actual computation

  • Add comprehensive test covering both CUDA and meta device evaluation paths

  • Change NVFUSER_CUTLASS_KERNEL_ENABLED definition from PRIVATE to PUBLIC in CMake

Changes walkthrough

Relevant files

Enhancement
csrc/ir/composite_nodes.cpp — Add meta-device fast path to CutlassNvfp4GroupedMmaOp::evaluate
  • Move the NVFUSER_CUTLASS_KERNEL_ENABLED guard down in CutlassNvfp4GroupedMmaOp::evaluate
  • Add a meta-device fast path before the CUTLASS computation that handles meta tensors
  • Create an empty result tensor with the correct shape and properties for meta tensors
  • Return early from the meta path to avoid actual computation
  +25/-1

Tests
tests/cpp/test_meta.cpp — Add comprehensive test for CutlassNvfp4GroupedMma meta-device support
  • Add a CutlassNvfp4GroupedMma test for meta-device evaluation
  • Create both real CUDA and meta tensor inputs with proper data types
  • Verify meta evaluation produces the same shape and properties as real evaluation
  • Include detailed comments explaining tensor shapes and block-scaling factors
  +181/-0

Configuration changes
CMakeLists.txt — Change NVFUSER_CUTLASS_KERNEL_ENABLED to a PUBLIC definition
  • Change the NVFUSER_CUTLASS_KERNEL_ENABLED definition from PRIVATE to PUBLIC
  • Make the macro available to targets that depend on codegen_internal
  +1/-1

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    CMake Definition Scope Change

    The PR changes NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC in CMakeLists.txt. This exposes the macro to other targets that link against codegen_internal. Please verify this is intentional and doesn't cause unintended side effects in other parts of the codebase that might not expect this macro to be available.

    Meta Device Path Validation

    The meta-device fast path correctly handles tensor shape calculation and device placement. However, consider adding validation to ensure all input tensors are on the same device type (all meta or all non-meta) before proceeding, as mixing meta and real devices could lead to unexpected behavior.

    // Meta-device fast path outside of torch version guard
    if (mat1.is_meta() || mat2.is_meta() || scale1.is_meta() ||
        scale2.is_meta() || alpha.is_meta() || problem_sizes.is_meta() ||
        expert_offsets.is_meta() || sf_offsets.is_meta()) {
      // For nvfp4_scaled_grouped_mm, the output shape is [M, N]
      // where M = mat1.size(0) and N = mat2.size(2).
      // Note: CutlassNvfp4GroupedMmaOp expects mat2 to be [G, K/2, N] (packed) at
      // runtime and transposes it before calling into CUTLASS.
      std::vector<int64_t> result_sizes = {mat1.size(0), mat2.size(2)};
    
      at::ScalarType out_dtype = data_type_to_aten(out()->dtype());
      auto options =
          mat1.options().device(c10::Device(c10::kMeta)).dtype(out_dtype);
      at::Tensor result = at::empty(result_sizes, options);
    
      if (const auto rfactor_did_idx = getRFactorDeviceDimensionIndex(out());
          rfactor_did_idx != -1) {
        result = result.unsqueeze(rfactor_did_idx);
      }
    
      return {result};
    }
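
The shape logic of the fast path boils down to two index reads. A shape-only sketch in Python (illustrative function name; dimension layout taken from the comment in the snippet above — mat1 is [M, K/2] packed, mat2 is [G, K/2, N] packed):

```python
def nvfp4_grouped_mm_out_shape(mat1_shape, mat2_shape):
    # mat1: [M, K/2]   (two fp4 values packed per byte)
    # mat2: [G, K/2, N] (grouped, packed), so N lives at index 2
    m = mat1_shape[0]
    n = mat2_shape[2]
    return [m, n]

# 128 rows, K/2 = 128 packed columns, 4 groups, N = 64
print(nvfp4_grouped_mm_out_shape([128, 128], [4, 128, 64]))  # → [128, 64]
```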

    @zasdfgbnm
    Collaborator Author

    !test

    @zasdfgbnm zasdfgbnm marked this pull request as ready for review January 7, 2026 01:48
    @zasdfgbnm zasdfgbnm requested a review from jjsjann123 January 7, 2026 01:48
    @greptile-apps
    Contributor

    greptile-apps bot commented Jan 7, 2026

    Greptile Summary

    This PR adds meta-device support for CutlassNvfp4GroupedMmaOp::evaluate, enabling shape inference without executing CUTLASS kernels.

    Changes:

    • Added meta-device fast path in CutlassNvfp4GroupedMmaOp::evaluate that computes output shape [M, N] from input dimensions, where M = mat1.size(0) and N = mat2.size(1)
    • Moved meta-device check outside the NVFUSER_CUTLASS_KERNEL_ENABLED preprocessor guard so shape inference works even when CUTLASS is not available
    • Properly handles rfactor device dimension indexing with unsqueeze when needed
    • Added comprehensive test MetaTest.CutlassNvfp4GroupedMma that validates meta-device outputs match CUDA execution for shapes, strides, and dtype

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • The changes follow established patterns for meta-device support in other operations (e.g., GroupedMmaOp, SdpaFwdOp). The implementation correctly computes output shape from input dimensions and includes comprehensive testing. The meta-device path is isolated from CUDA execution and only affects shape inference.
    • No files require special attention

    Important Files Changed

    Filename Overview
    csrc/ir/composite_nodes.cpp Added meta-device support for CutlassNvfp4GroupedMmaOp::evaluate by computing output shape without executing CUTLASS kernel
    tests/cpp/test_meta.cpp Added comprehensive test validating meta-device output shapes and strides match CUDA execution for CutlassNvfp4GroupedMma operations

    Sequence Diagram

    sequenceDiagram
        participant Evaluator as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant MetaPath as Meta Device Path
        participant CudaPath as CUDA Path (CUTLASS)
        
        Evaluator->>Op: evaluate(inputs)
        Op->>Op: Extract 8 input tensors
        
        alt Any input is meta device
            Op->>MetaPath: Check is_meta() for all inputs
            MetaPath->>MetaPath: Compute output shape [M, N]<br/>M = mat1.size(0)<br/>N = mat2.size(1)
            MetaPath->>MetaPath: Create empty tensor on meta device<br/>with correct dtype
            MetaPath->>MetaPath: Apply rfactor dimension if needed
            MetaPath-->>Evaluator: Return meta tensor
        else All inputs on CUDA
            Op->>CudaPath: Validate input types and shapes
            CudaPath->>CudaPath: Calculate strides (ab_strides, c_strides)
            CudaPath->>CudaPath: Call cutlass_kernels::nvfp4_scaled_grouped_mm
            CudaPath->>CudaPath: Apply rfactor dimension if needed
            CudaPath-->>Evaluator: Return computed result tensor
        end
    

    @zasdfgbnm
    Collaborator Author

    !test

    scale2.is_meta() || alpha.is_meta() || problem_sizes.is_meta() ||
    expert_offsets.is_meta() || sf_offsets.is_meta()) {
    // For nvfp4_scaled_grouped_mm, the output shape is [M, N]
    // where M = mat1.size(0) and N = mat2.size(1)
    Collaborator

    I thought the output size n is mat2.size(2).
    e.g. if you look at line 1767 below.

    at::Tensor scale2_input = at::randn({4, 128, 8}, options_fp8);
    at::Tensor alpha_input = at::ones({4}, options_fp32);
    at::Tensor problem_sizes_input = at::tensor(
    {{32, 128, 128}, {32, 128, 128}, {32, 128, 128}, {32, 128, 128}},
    Collaborator

    btw, here we get away with k == n. Maybe we want to change that just for slightly better test coverage. 😉
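
The coverage gap is easy to see with a shape-only sketch in Python (illustrative values): with the square test shapes, a wrong dimension index returns the same number as the right one, so the test cannot catch it; making n differ from k/2 separates them.

```python
# Current test: mat2 is [G, K/2, N] with K/2 == N == 128.
mat2_square = (4, 128, 128)
# size(1) and size(2) coincide, so reading the wrong index
# still "passes" any shape comparison:
assert mat2_square[1] == mat2_square[2]

# With N != K/2 the two indices diverge and a wrong read is caught:
mat2_rect = (4, 128, 64)
assert mat2_rect[1] != mat2_rect[2]
print("square:", mat2_square[1:], "rect:", mat2_rect[1:])
```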

    Contributor

    @greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Adds meta device support for CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without executing actual CUTLASS kernels.

    Key Changes:

    • Moved input tensor extraction outside the NVFUSER_CUTLASS_KERNEL_ENABLED guard to allow meta device path to access tensor shapes
    • Implemented fast path for meta device tensors that constructs output tensor with shape [M, N] where M = mat1.size(0) and N = mat2.size(1)
    • Correctly handles rfactor device dimension indexing for both meta and real execution paths
    • Added comprehensive test CutlassNvfp4GroupedMma that verifies meta evaluation produces identical shape, dtype, and strides as CUDA evaluation

    Implementation Quality:

    • Follows existing meta device patterns used in other composite ops (e.g., MatmulOp, EmbeddingOp)
    • Correctly computes output shape based on grouped matmul semantics where mat1 is [M, K/2] and mat2 is [G, N, K/2]
    • Test uses realistic dimensions where M, N, K, and K/2 are all different to catch dimension indexing errors

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • The implementation follows established patterns for meta device support in the codebase, includes comprehensive testing that validates correctness, and only adds functionality without modifying existing behavior. The changes are well-isolated and the meta path returns early before any real computation occurs.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    csrc/ir/composite_nodes.cpp 5/5 Added meta device support for CutlassNvfp4GroupedMmaOp by moving input extraction outside CUTLASS guard and implementing fast path that returns meta tensor with correct shape [M, N]
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test for CutlassNvfp4GroupedMmaOp meta device support, verifying output shape, dtype, and strides match between CUDA and meta evaluation paths

    Sequence Diagram

    sequenceDiagram
        participant Client as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant ATen as ATen/PyTorch
        participant Cutlass as CUTLASS Kernel
        
        Client->>Op: evaluate(inputs)
        Op->>Op: Extract 8 input tensors
        Note over Op: mat1, mat2, scale1, scale2,<br/>alpha, problem_sizes,<br/>expert_offsets, sf_offsets
        
        alt Any input is meta device
            Op->>Op: Calculate output shape<br/>[mat1.size(0), mat2.size(1)]
            Op->>ATen: at::empty(shape, meta_device)
            ATen-->>Op: meta tensor
            Op->>Op: Apply rfactor dimension if needed
            Op-->>Client: Return meta tensor
        else All inputs are real device
            Note over Op: CUTLASS_KERNEL_ENABLED required
            Op->>Op: Validate tensor dtypes
            Op->>Op: Calculate strides (ab_strides, c_strides)
            Op->>Cutlass: nvfp4_scaled_grouped_mm()
            Note over Cutlass: Performs FP4 grouped matmul<br/>with block scaling
            Cutlass-->>Op: Result tensor [M, N]
            Op->>Op: Apply rfactor dimension if needed
            Op-->>Client: Return result tensor
        end
    

    Contributor

    @greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Added meta device support for CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without executing CUDA kernels.

    Key changes:

    • CMakeLists.txt: Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC so test files can properly use the #if guard
    • csrc/ir/composite_nodes.cpp: Added meta-device fast path in CutlassNvfp4GroupedMmaOp::evaluate that checks if any input tensor is on meta device and creates output tensor with correct shape [M, N] where M = mat1.size(0) and N = mat2.size(1)
    • tests/cpp/test_meta.cpp: Added comprehensive test case CutlassNvfp4GroupedMma that verifies meta evaluation produces tensors with same shape, dtype, and strides as CUDA execution

    The implementation follows the established pattern used in other composite ops (e.g., EmbeddingOp, GroupedMatmulOp) and correctly handles the rfactor device dimension.

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • The implementation follows established patterns from other meta device implementations in the codebase, has comprehensive test coverage that validates shape/dtype/stride correctness, and the CMakeLists.txt change is necessary and minimal
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC visibility to make it available to test files
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device support for CutlassNvfp4GroupedMmaOp::evaluate by adding fast-path that creates output tensor with correct shape/dtype
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test case for meta-device evaluation of CUTLASS NVFP4 grouped matrix multiplication

    Sequence Diagram

    sequenceDiagram
        participant Test as MetaTest
        participant EE as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant Kernel as cutlass_kernels
    
        Note over Test: Create fusion with cutlass_nvfp4_grouped_mm
        Test->>EE: bind meta device inputs
        Test->>EE: evaluate(fusion output)
        EE->>Op: evaluate(inputs)
        
        alt Any input is meta device
            Op->>Op: Check if any input.is_meta()
            Op->>Op: Calculate output shape [M, N]<br/>M = mat1.size(0), N = mat2.size(1)
            Op->>Op: Create meta tensor with correct dtype
            Op->>Op: Apply rfactor dimension if needed
            Op-->>EE: Return meta tensor
        else All inputs are CUDA
            Op->>Op: Extract input tensors
            Op->>Op: Validate shapes and dtypes
            Op->>Op: Calculate strides
            Op->>Kernel: nvfp4_scaled_grouped_mm(...)
            Kernel-->>Op: CUDA result tensor
            Op->>Op: Apply rfactor dimension if needed
            Op-->>EE: Return CUDA tensor
        end
        
        EE-->>Test: meta_out tensor
        Test->>Test: Verify meta_out.is_meta()
        Test->>Test: Verify sizes match real_out
        Test->>Test: Verify strides match real_out
    

    Contributor

    @greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Added meta-device support for CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without executing CUTLASS kernels.

    Key Changes

    • Meta-device fast path: Checks if any input tensors are on meta device and returns a correctly-shaped output tensor without invoking CUTLASS kernels
    • Output shape calculation: Computes output as [M, N] where M = mat1.size(0) and N = mat2.size(1), consistent with grouped matrix multiplication semantics
    • rFactor dimension handling: Properly handles rFactor device dimensions by unsqueezing when necessary
    • Build system fix: Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC to make the preprocessor define visible to test targets
    • Comprehensive test: Added CutlassNvfp4GroupedMma test that validates meta-device evaluation produces identical shape, dtype, and strides as CUDA execution

    The implementation follows existing patterns in the codebase for meta-device support (e.g., MatmulOp, GroupedMatmulOp, SdpaFwdOp) and maintains consistency with the actual CUTLASS kernel behavior.

    Confidence Score: 5/5

    • This PR is safe to merge with no issues found
    • The implementation follows established patterns in the codebase for meta-device support, the output shape calculation is correct and well-documented, the build system change is necessary and minimal, and comprehensive testing validates the functionality
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC to make it visible to test targets linking against codegen_internal
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device fast path to CutlassNvfp4GroupedMmaOp::evaluate that returns correctly-shaped tensors without executing CUTLASS kernels
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test CutlassNvfp4GroupedMma that verifies meta-device evaluation produces correct shape, dtype, and strides matching CUDA execution

    Sequence Diagram

    sequenceDiagram
        participant Test as test_meta.cpp
        participant EE as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant Meta as Meta Device Path
        participant CUTLASS as CUTLASS Kernel Path
    
        Test->>EE: evaluate(meta tensors)
        EE->>Op: evaluate(inputs)
        
        Op->>Op: Extract input tensors (mat1, mat2, scales, etc.)
        
        alt Any input is_meta()
            Op->>Meta: Check if any tensor is on meta device
            Meta->>Meta: Calculate output shape [M, N]<br/>M = mat1.size(0), N = mat2.size(1)
            Meta->>Meta: Create empty tensor with<br/>correct dtype and shape on meta device
            Meta->>Meta: Handle rFactor dimension if needed
            Meta-->>Op: Return meta tensor
        else All inputs on CUDA
            Op->>Op: Validate input dtypes and shapes
            Op->>Op: Calculate stride tensors
            Op->>CUTLASS: nvfp4_scaled_grouped_mm()
            CUTLASS-->>Op: Return result tensor
            Op->>Op: Handle rFactor dimension if needed
        end
        
        Op-->>EE: Return result
        EE-->>Test: Return output tensor
        Test->>Test: Verify shape, dtype, strides match
    

    Contributor

    @greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Added meta device support for CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without requiring actual CUTLASS kernels.

    Key Changes:

    • Modified CMakeLists.txt to expose NVFUSER_CUTLASS_KERNEL_ENABLED as a PUBLIC definition so test files can check it
    • Added meta-device fast path in CutlassNvfp4GroupedMmaOp::evaluate that calculates output shape [M, N] from input dimensions before entering CUTLASS-specific code
    • Added comprehensive test CutlassNvfp4GroupedMma with 8 input tensors to verify meta and CUDA paths produce identical shapes/strides
    • Follows established pattern from other operations like MatmulOp, EmbeddingOp, and CumsumOp for meta device handling

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • Changes follow well-established patterns in the codebase, include comprehensive testing, and make minimal modifications with clear purpose. The meta device handling is implemented consistently with other operations, and the PUBLIC visibility change is necessary and appropriate.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC to allow test files to access this definition
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device fast path for CutlassNvfp4GroupedMmaOp::evaluate to handle meta tensors before CUTLASS-specific code
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test CutlassNvfp4GroupedMma to verify meta device support with 8 tensor inputs and shape/stride validation

    Sequence Diagram

    sequenceDiagram
        participant Test as Test (test_meta.cpp)
        participant EE as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant Meta as Meta Device Path
        participant CUTLASS as CUTLASS Kernel Path
        
        Test->>EE: bind meta tensors (mat1, mat2, scales, etc.)
        Test->>EE: evaluate(fusion output)
        EE->>Op: evaluate(inputs)
        
        Op->>Op: Extract 8 input tensors
        
        alt Any input is meta device
            Op->>Meta: Check is_meta() on all inputs
            Meta->>Meta: Calculate output shape [M, N]<br/>M = mat1.size(0), N = mat2.size(1)
            Meta->>Meta: Create empty tensor on meta device
            Meta->>Meta: Apply rfactor device dim if needed
            Meta-->>Op: Return meta tensor
        else All inputs are CUDA tensors
            Op->>CUTLASS: Validate inputs and shapes
            CUTLASS->>CUTLASS: Prepare stride tensors
            CUTLASS->>CUTLASS: Call nvfp4_scaled_grouped_mm kernel
            CUTLASS->>CUTLASS: Apply rfactor device dim if needed
            CUTLASS-->>Op: Return computed result
        end
        
        Op-->>EE: Return result tensor
        EE-->>Test: Return evaluated output
        Test->>Test: Validate meta output matches<br/>CUDA output shape/stride
    

    Contributor

    @greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    This PR enables meta-device support for CutlassNvfp4GroupedMmaOp::evaluate, allowing shape inference without requiring CUDA execution. The implementation moves the meta-device fast path outside the NVFUSER_CUTLASS_KERNEL_ENABLED guard and makes the macro PUBLIC in CMakeLists.txt so test files can use it.

    Major Changes:

    • Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC scope in CMakeLists.txt to enable conditional compilation in test files
    • Added meta-device support in CutlassNvfp4GroupedMmaOp::evaluate() by checking is_meta() on all inputs and creating appropriately shaped meta tensors
    • Added comprehensive test CutlassNvfp4GroupedMma that validates meta tensors match real CUDA output shapes and strides

    Issues Found:

    • Critical compilation errors in test file: undefined variables options_fp4 and options_fp8 are referenced but never declared (lines 620, 622, 626, 630)

    Confidence Score: 1/5

    • PR contains critical compilation errors that will prevent the test from building
    • The test file references undefined variables (options_fp4 and options_fp8) which will cause compilation failures. While the core implementation in composite_nodes.cpp appears sound and the CMakeLists.txt change is necessary and correct, the test cannot run without fixing these critical syntax errors
    • tests/cpp/test_meta.cpp requires immediate attention to fix undefined variable references before merge

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC scope to make the macro available in test files
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device support by checking is_meta() and creating meta tensors with correct shapes before the CUTLASS-guarded code path
    tests/cpp/test_meta.cpp 1/5 Added comprehensive test for meta-device support, but contains critical compilation errors with undefined variables options_fp4 and options_fp8

    Sequence Diagram

    sequenceDiagram
        participant Test as test_meta.cpp
        participant Eval as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant Meta as Meta Device Path
        participant CUTLASS as CUTLASS Kernel
    
        Test->>Eval: bind meta tensors (mat1, mat2, scales, etc.)
        Test->>Eval: evaluate(fusion output)
        Eval->>Op: CutlassNvfp4GroupedMmaOp::evaluate()
        Op->>Op: Extract input tensors
        Op->>Meta: Check if any input is_meta()
        
        alt Meta Device Path
            Meta->>Meta: Calculate output shape [M, N]
            Meta->>Meta: Create empty meta tensor with correct dtype
            Meta->>Meta: Apply rfactor dimension if needed
            Meta-->>Test: Return meta tensor
        else CUDA Device Path (NVFUSER_CUTLASS_KERNEL_ENABLED)
            Op->>CUTLASS: Validate input types
            Op->>CUTLASS: Create stride tensors
            Op->>CUTLASS: Call nvfp4_scaled_grouped_mm kernel
            CUTLASS-->>Op: Return result tensor
            Op->>Op: Apply rfactor dimension if needed
            Op-->>Test: Return CUDA tensor
        end
    

    ee_cuda.bind(fusion->inputs().at(6), expert_offsets_input);
    ee_cuda.bind(fusion->inputs().at(7), sf_offsets_input);
    auto real_out = ee_cuda.evaluate(fusion->outputs().at(0)).as<at::Tensor>();

    Contributor

    options_fp4 is undefined - should be options_uint8.dtype(at::kFloat4_e2m1fn_x2)

    Suggested change
    - mat1_input.sizes(), mat1_input.strides(), options_fp4.device(at::kMeta));
    + mat1_input.sizes(), mat1_input.strides(), options_uint8.dtype(at::kFloat4_e2m1fn_x2).device(at::kMeta));

    auto real_out = ee_cuda.evaluate(fusion->outputs().at(0)).as<at::Tensor>();

    // Meta evaluation
    ExpressionEvaluator ee_meta;
    Contributor

    options_fp4 is undefined - should be options_uint8.dtype(at::kFloat4_e2m1fn_x2)

    Suggested change
    - mat2_input.sizes(), mat2_input.strides(), options_fp4.device(at::kMeta));
    + mat2_input.sizes(), mat2_input.strides(), options_uint8.dtype(at::kFloat4_e2m1fn_x2).device(at::kMeta));

    auto meta_mat1 = at::empty_strided(
    mat1_input.sizes(), mat1_input.strides(), options_fp4.device(at::kMeta));
    auto meta_mat2 = at::empty_strided(
    mat2_input.sizes(), mat2_input.strides(), options_fp4.device(at::kMeta));
    Contributor

    options_fp8 is undefined - should be options_fp32.dtype(at::kFloat8_e4m3fn)

    Suggested change
    - options_fp8.device(at::kMeta));
    + options_fp32.dtype(at::kFloat8_e4m3fn).device(at::kMeta));

    auto meta_scale1 = at::empty_strided(
    scale1_input.sizes(),
    scale1_input.strides(),
    options_fp8.device(at::kMeta));
    Contributor

    options_fp8 is undefined - should be options_fp32.dtype(at::kFloat8_e4m3fn)

    Suggested change
    - options_fp8.device(at::kMeta));
    + options_fp32.dtype(at::kFloat8_e4m3fn).device(at::kMeta));

    Contributor

    @greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Adds meta-device support to CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without CUDA execution. The meta path computes output shape as [M, N] where M = mat1.size(0) and N = mat2.size(1), handling the rfactor device dimension index if present. The NVFUSER_CUTLASS_KERNEL_ENABLED compile definition is changed from PRIVATE to PUBLIC to make it visible to test files.

    Confidence Score: 5/5

    • Safe to merge - follows established patterns for meta-device support with comprehensive test coverage
    • The implementation follows the exact same pattern used by other meta-device handlers in the codebase (GroupedMmaOp, EmbeddingOp). The meta path correctly computes output shapes, preserves dtype, and handles the rfactor device dimension. The CMakeLists.txt change is necessary to expose the compile definition to test files. Comprehensive test coverage validates shapes, strides, and dtype against real CUDA execution.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    csrc/ir/composite_nodes.cpp 4/5 Adds meta-device support for CutlassNvfp4GroupedMmaOp::evaluate by handling meta tensors before CUTLASS-enabled code path
    CMakeLists.txt 4/5 Changes NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC compile definition for broader visibility

    Sequence Diagram

    sequenceDiagram
        participant Caller
        participant Evaluate as CutlassNvfp4GroupedMmaOp::evaluate
        participant MetaPath as Meta Device Path
        participant CutlassPath as CUTLASS Kernel Path
        
        Caller->>Evaluate: "evaluate(inputs)"
        Evaluate->>Evaluate: "Extract 8 input tensors"
        alt Any input is meta device
            Evaluate->>MetaPath: "Compute output shape"
            MetaPath->>MetaPath: "result_sizes = [mat1.size(0), mat2.size(1)]"
            MetaPath->>MetaPath: "Create empty meta tensor with out_dtype"
            MetaPath->>MetaPath: "Apply rfactor unsqueeze if needed"
            MetaPath-->>Caller: "Return meta tensor"
        else All inputs are CUDA
            Evaluate->>CutlassPath: "Call nvfp4_scaled_grouped_mm"
            CutlassPath->>CutlassPath: "Execute CUTLASS kernel"
            CutlassPath->>CutlassPath: "Apply rfactor unsqueeze if needed"
            CutlassPath-->>Caller: "Return computed result"
        end
    

    Contributor

    @greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    This PR adds meta-device support for CutlassNvfp4GroupedMmaOp::evaluate, enabling shape inference without requiring CUTLASS kernel execution. The implementation is consistent with existing meta-device patterns in the codebase.

    Key Changes:

    • Added meta-device fast path that checks if any input tensor is on meta device and returns an appropriately shaped output tensor
    • Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC in CMakeLists.txt to ensure test files can access the preprocessor definition
    • Added comprehensive test CutlassNvfp4GroupedMma that validates meta and CUDA paths produce matching shapes, dtypes, and strides
    • Meta path calculates output shape as [M, N] where M = mat1.size(0) and N = mat2.size(1), matching the actual CUTLASS operation semantics

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • The implementation follows established patterns in the codebase for meta-device support, includes comprehensive test coverage, and makes a necessary build configuration change. The meta-device logic correctly calculates output shapes consistent with the actual CUTLASS operation, and the CMakeLists.txt change properly exposes the preprocessor definition to dependent targets.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC to make the preprocessor definition available to targets linking against codegen_internal
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device fast path for CutlassNvfp4GroupedMmaOp::evaluate that returns correctly shaped output tensor without requiring CUTLASS kernel execution
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test for meta-device support in CutlassNvfp4GroupedMmaOp, verifying output shape, dtype, and strides match CUDA execution

    Sequence Diagram

    sequenceDiagram
        participant Test as MetaTest
        participant EE as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant Cutlass as CUTLASS Kernel
        
        Note over Test: Create fusion with meta tensors
        Test->>EE: bind(meta_mat1, meta_mat2, ...)
        Test->>EE: evaluate(fusion output)
        EE->>Op: evaluate(ee, inputs)
        
        alt Any input is_meta()
            Op->>Op: Calculate output shape [M, N]<br/>M = mat1.size(0), N = mat2.size(1)
            Op->>Op: Create meta tensor with correct dtype
            Op->>Op: Apply rfactor_did_idx if needed
            Op-->>EE: Return meta tensor
        else All inputs are CUDA tensors
            Note over Op: Check NVFUSER_CUTLASS_KERNEL_ENABLED
            Op->>Op: Validate tensor types and shapes
            Op->>Op: Calculate strides for kernel
            Op->>Cutlass: nvfp4_scaled_grouped_mm(...)
            Cutlass-->>Op: Result tensor
            Op->>Op: Apply rfactor_did_idx if needed
            Op-->>EE: Return result tensor
        end
        
        EE-->>Test: meta_out tensor
        Test->>Test: Verify: is_meta(), dtype, sizes, strides
    
@greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    This PR adds meta device support for CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without running actual CUDA kernels. The implementation follows existing patterns in the codebase for handling meta devices.

    Key Changes:

    • Added meta device fast path in CutlassNvfp4GroupedMmaOp::evaluate that checks if any input is on meta device and returns appropriately shaped meta tensor
    • Changed NVFUSER_CUTLASS_KERNEL_ENABLED macro visibility from PRIVATE to PUBLIC in CMakeLists.txt to allow test files to use the macro
    • Added comprehensive test CutlassNvfp4GroupedMma that validates output shape, dtype, and strides match between CUDA and meta execution paths

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • The changes are minimal, well-tested, and follow established patterns in the codebase. The meta device support is implemented consistently with other evaluate methods, the CMakeLists.txt change is necessary and correct, and comprehensive tests validate the implementation.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC visibility. This allows test files to check the macro definition. Simple and correct change.
    csrc/ir/composite_nodes.cpp 5/5 Added meta device support to CutlassNvfp4GroupedMmaOp::evaluate by checking all inputs with is_meta() and returning appropriate meta tensors. Follows existing patterns in the codebase.
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test for CutlassNvfp4GroupedMmaOp with meta device. Test properly guards with NVFUSER_CUTLASS_KERNEL_ENABLED and verifies output shape, dtype, and strides.

    Sequence Diagram

    sequenceDiagram
        participant Test as MetaTest
        participant EE as ExpressionEvaluator
        participant Eval as CutlassNvfp4GroupedMmaOp::evaluate
        participant ATen as at::Tensor
        
        Test->>EE: bind meta inputs (8 tensors)
        Test->>EE: evaluate(fusion output)
        EE->>Eval: evaluate(inputs)
        
        Eval->>Eval: Check if any input.is_meta()
        alt Any input is meta
            Eval->>Eval: Calculate output shape [M, N]
            Eval->>ATen: at::empty(sizes, meta device)
            ATen-->>Eval: meta tensor
            Eval->>Eval: Apply rfactor_did_idx if needed
            Eval-->>EE: meta tensor result
        else All inputs are CUDA
            Eval->>Eval: NVF_CHECK scalar types
            Eval->>Eval: Calculate strides
            Eval->>Eval: Call cutlass kernel
            Eval-->>EE: CUDA tensor result
        end
        
        EE-->>Test: output tensor
        Test->>Test: Verify shape, dtype, strides
    
@greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    This PR adds meta-device support for CutlassNvfp4GroupedMmaOp::evaluate, enabling shape inference without requiring CUTLASS kernel compilation.

    Key changes:

    • CMakeLists.txt: Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC compile definition, allowing the macro to be visible in header files
    • composite_nodes.cpp: Moved meta-device fast path before the #if NVFUSER_CUTLASS_KERNEL_ENABLED preprocessor guard, allowing shape inference to work even when CUTLASS is not compiled
    • test_meta.cpp: Added comprehensive test that validates meta-device execution by comparing output shapes, strides, and dtypes between CUDA and meta paths

    Implementation approach:
    The meta-device fast path computes output shape as [mat1.size(0), mat2.size(1)] and creates an empty meta tensor with the correct dtype, matching the pattern used in other operators like GroupedMmaOp.

    Confidence Score: 5/5

    • This PR is safe to merge with no significant risks
    • The changes are well-structured and follow established patterns in the codebase. The meta-device support is properly isolated in a fast path that doesn't affect the existing CUTLASS kernel execution. The test coverage is comprehensive and validates the implementation thoroughly.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changes NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC compile definition, enabling meta-device support check outside preprocessor guards
    csrc/ir/composite_nodes.cpp 5/5 Adds meta-device fast path before #if NVFUSER_CUTLASS_KERNEL_ENABLED guard in CutlassNvfp4GroupedMmaOp::evaluate, enabling shape inference without CUTLASS compilation
    tests/cpp/test_meta.cpp 5/5 Adds comprehensive test CutlassNvfp4GroupedMma that validates meta-device support by comparing output shapes/strides between CUDA and meta execution paths

    Sequence Diagram

    sequenceDiagram
        participant Test as test_meta.cpp
        participant EE as ExpressionEvaluator
        participant Eval as CutlassNvfp4GroupedMmaOp::evaluate
        participant MetaPath as Meta Fast Path
        participant CutlassPath as CUTLASS Kernel Path
        
        Test->>EE: bind meta tensors
        Test->>EE: evaluate(fusion->outputs)
        EE->>Eval: evaluate(inputs)
        
        Eval->>Eval: Check if any input is_meta()
        
        alt Any input is meta
            Eval->>MetaPath: Execute meta fast path
            MetaPath->>MetaPath: Calculate output shape [M, N]
            MetaPath->>MetaPath: Create empty meta tensor
            MetaPath->>MetaPath: Apply rFactor dim if needed
            MetaPath-->>Eval: Return meta tensor
        else All inputs on CUDA
            Eval->>CutlassPath: Check NVFUSER_CUTLASS_KERNEL_ENABLED
            alt CUTLASS enabled
                CutlassPath->>CutlassPath: Validate input types (Float4)
                CutlassPath->>CutlassPath: Call cutlass_kernels::nvfp4_scaled_grouped_mm
                CutlassPath-->>Eval: Return CUDA result
            else CUTLASS not enabled
                CutlassPath-->>Eval: Throw error
            end
        end
        
        Eval-->>EE: Return result
        EE-->>Test: Return evaluated tensor
    
@greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Added meta-device support for CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference and testing without CUDA execution.

    Key Changes:

    • Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC in CMakeLists.txt to expose the macro to test files
    • Added early-return path in CutlassNvfp4GroupedMmaOp::evaluate that detects meta tensors and returns appropriately shaped meta output tensors
    • Added comprehensive test case CutlassNvfp4GroupedMma that validates meta evaluation produces matching shapes/dtypes/strides compared to CUDA execution
    • Meta-device path correctly handles RFactor device dimension indexing by applying unsqueeze when needed

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • All changes are well-structured and follow established patterns in the codebase. The meta-device implementation mirrors other similar operations, the CMake change correctly exposes the macro for test compilation, and comprehensive tests validate correctness. No breaking changes or risky modifications.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED definition from PRIVATE to PUBLIC to expose the macro to downstream targets
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device support to CutlassNvfp4GroupedMmaOp::evaluate by checking for meta tensors and returning appropriately shaped meta output
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test CutlassNvfp4GroupedMma to verify meta-device evaluation produces correct shapes/dtypes matching CUDA path

    Sequence Diagram

    sequenceDiagram
        participant Client as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant Check as Meta Check
        participant Meta as Meta Path
        participant CUDA as CUDA/CUTLASS Path
        
        Client->>Op: evaluate(inputs)
        Op->>Op: Extract 8 input tensors
        Op->>Check: Check if any input is_meta()
        
        alt Any input is meta tensor
            Check->>Meta: Enter meta-device path
            Meta->>Meta: Calculate output shape [M, N]
            Meta->>Meta: Get output dtype from IR
            Meta->>Meta: Create meta tensor with options
            Meta->>Meta: Apply RFactor unsqueeze if needed
            Meta-->>Client: Return meta tensor
        else All inputs are CUDA tensors
            Check->>CUDA: Enter CUTLASS kernel path
            CUDA->>CUDA: Validate input dtypes (Float4_e2m1fn_x2)
            CUDA->>CUDA: Validate problem_sizes dimensions
            CUDA->>CUDA: Calculate strides (ab_strides, c_strides)
            CUDA->>CUDA: Call cutlass_kernels::nvfp4_scaled_grouped_mm
            CUDA->>CUDA: Apply RFactor unsqueeze if needed
            CUDA-->>Client: Return result tensor
        end
    
@greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    This PR adds meta device support for CutlassNvfp4GroupedMmaOp::evaluate to enable shape and stride inference without requiring actual CUDA execution.

    Key changes:

    • Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC in CMakeLists.txt to make the macro accessible in test files
    • Added meta-device fast path in CutlassNvfp4GroupedMmaOp::evaluate that computes output shape [M, N] from input shapes before the CUTLASS kernel guard
    • Added comprehensive test MetaTest.CutlassNvfp4GroupedMma that validates meta device behavior matches CUDA execution for shapes, strides, and dtypes

    Implementation details:

    • The meta path checks if any input tensor is on meta device and creates an output tensor on meta device with the correct shape, dtype, and rFactor dimension handling
    • Follows the established pattern used by other ops in the codebase (GroupedMmaOp, ScanOp, EmbeddingFwdOp)
    • Test properly guards CUTLASS-specific functionality with #if NVFUSER_CUTLASS_KERNEL_ENABLED

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk - it adds meta device support following established patterns
    • The changes are clean, well-tested, and follow the existing codebase patterns. The CMakeLists.txt change is necessary and minimal, the implementation correctly places the meta path before the CUTLASS guard, and comprehensive testing validates the behavior. No files require special attention as all changes are straightforward and consistent.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC to make it visible in test files
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device support to CutlassNvfp4GroupedMmaOp::evaluate by moving meta path before NVFUSER_CUTLASS_KERNEL_ENABLED guard
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test for CutlassNvfp4GroupedMmaOp meta device support with proper guards

    Sequence Diagram

    sequenceDiagram
        participant Test as MetaTest
        participant EE as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant ATen as ATen/PyTorch
        
        Note over Test: Create fusion with FP4 grouped matmul
        Test->>Test: Create meta tensors (mat1, mat2, scales, etc.)
        Test->>EE: bind meta tensors to fusion inputs
        Test->>EE: evaluate(fusion->outputs().at(0))
        
        EE->>Op: evaluate(ee, meta_inputs)
        
        Note over Op: Check if any input is_meta()
        alt Any input is meta
            Op->>Op: Compute output shape [M, N]<br/>M=mat1.size(0), N=mat2.size(1)
            Op->>ATen: at::empty(result_sizes, meta_device)
            ATen-->>Op: meta output tensor
            Op->>Op: Handle rFactor dimension if needed
            Op-->>EE: return meta output
        else All inputs are CUDA
            Op->>ATen: cutlass_kernels::nvfp4_scaled_grouped_mm(...)
            ATen-->>Op: CUDA result
            Op-->>EE: return CUDA result
        end
        
        EE-->>Test: meta_out tensor
        Test->>Test: Validate shape, strides, dtype match CUDA execution
    
@greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Added meta-device support to CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without GPU execution. The key changes include:

    • CMakeLists.txt: Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC visibility, making it accessible to test files
    • csrc/ir/composite_nodes.cpp: Added meta-device fast path that checks if any input tensor is on meta device and returns a properly-shaped meta tensor, positioned before the #if NVFUSER_CUTLASS_KERNEL_ENABLED guard
    • tests/cpp/test_meta.cpp: Added comprehensive test validating that meta-device evaluation produces correct shape, dtype, and strides matching CUDA execution

    The implementation follows the established pattern used by other operators in the codebase (e.g., MatmulOp, EmbeddingOp) where meta tensors are detected and handled separately to enable shape/dtype inference.

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • The changes are straightforward and follow established patterns in the codebase. The meta-device path is clearly separated from the CUDA execution path, the CMakeLists visibility change is necessary and correct, and the comprehensive test validates the implementation.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC visibility, enabling header files to check the macro
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device support in CutlassNvfp4GroupedMmaOp::evaluate by checking for meta tensors and returning appropriately-shaped meta tensor
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test CutlassNvfp4GroupedMma that validates meta-device behavior matches CUDA execution for shape/dtype/strides

    Sequence Diagram

    sequenceDiagram
        participant Test as Test/User Code
        participant EE as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp::evaluate
        participant MetaPath as Meta Device Path
        participant CUDAPath as CUDA/CUTLASS Path
        
        Test->>EE: bind meta tensors
        Test->>EE: evaluate(fusion output)
        EE->>Op: evaluate(ee, inputs)
        Op->>Op: Extract 8 input tensors
        
        alt Any input is meta tensor
            Op->>MetaPath: Check is_meta() for all inputs
            MetaPath->>MetaPath: Calculate output shape [M, N]
            MetaPath->>MetaPath: Create meta tensor with out_dtype
            MetaPath->>MetaPath: Apply rfactor dimension if needed
            MetaPath-->>Op: Return meta tensor
        else All inputs are CUDA tensors
            Op->>CUDAPath: Check NVFUSER_CUTLASS_KERNEL_ENABLED
            CUDAPath->>CUDAPath: Validate tensor types and shapes
            CUDAPath->>CUDAPath: Calculate strides
            CUDAPath->>CUDAPath: Call cutlass_kernels::nvfp4_scaled_grouped_mm
            CUDAPath->>CUDAPath: Apply rfactor dimension if needed
            CUDAPath-->>Op: Return result tensor
        end
        
        Op-->>EE: Return result
        EE-->>Test: Return evaluated output
    
@greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Added meta-device support for CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without CUDA execution.

    Key changes:

    • Added early return path in CutlassNvfp4GroupedMmaOp::evaluate that checks if any input tensor is on the meta device and returns an appropriately shaped meta output tensor
    • Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC in CMakeLists.txt so test files can access it
    • Added comprehensive test validating that meta-device evaluation produces outputs with correct shapes, strides, and dtypes matching CUDA execution

    The implementation follows the established pattern used by other ops in the codebase (matmul, SDPA, embedding, etc.) for meta-device handling.

    Confidence Score: 5/5

    • This PR is safe to merge with no issues found
    • The changes are minimal, well-tested, and follow established patterns in the codebase. The meta-device support implementation mirrors other similar operations, the CMakeLists.txt change is necessary and correct, and the test thoroughly validates the functionality.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC visibility to make it available to test files that link against codegen_internal
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device support in CutlassNvfp4GroupedMmaOp::evaluate by checking for meta tensors and returning appropriately shaped meta outputs before CUTLASS kernel invocation
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test CutlassNvfp4GroupedMma that verifies meta-device evaluation produces correct shapes/strides/dtypes compared to CUDA execution

    Sequence Diagram

    sequenceDiagram
        participant Test as Test Code
        participant EE as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant Meta as Meta Device Path
        participant CUTLASS as CUTLASS Kernel
    
        Test->>EE: bind meta tensors to fusion inputs
        Test->>EE: evaluate(fusion output)
        EE->>Op: evaluate(inputs)
        Op->>Op: Extract 8 input tensors
        
        alt Any input is_meta()
            Op->>Meta: Check if any tensor is meta
            Meta->>Meta: Calculate output shape [M, N]
            Meta->>Meta: Create empty meta tensor with correct dtype
            Meta->>Meta: Apply rfactor dimension if needed
            Meta-->>Op: Return meta tensor
            Op-->>EE: Return {meta_result}
        else All inputs are CUDA tensors
            Op->>Op: Validate tensor dtypes and shapes
            Op->>Op: Calculate strides for CUTLASS
            Op->>CUTLASS: nvfp4_scaled_grouped_mm()
            CUTLASS-->>Op: Return CUDA result
            Op->>Op: Apply rfactor dimension if needed
            Op-->>EE: Return {cuda_result}
        end
        
        EE-->>Test: Return result tensor
        Test->>Test: Verify meta output shape/dtype/strides
    
@greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Adds meta device support to CutlassNvfp4GroupedMmaOp::evaluate, allowing shape inference without executing CUTLASS kernels.

    Major Changes:

    • Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC in CMakeLists.txt (required for test files to see the preprocessor guard)
    • Added meta device fast path in CutlassNvfp4GroupedMmaOp::evaluate that checks if any input is meta and returns appropriately shaped output tensor
    • Added comprehensive test CutlassNvfp4GroupedMma that validates meta device behavior against CUDA execution

    Implementation:
    The meta device path computes output shape as [mat1.size(0), mat2.size(1)] and creates an empty meta tensor with the correct dtype, matching the pattern used in other composite ops like GroupedMmaOp. The rfactor device dimension handling is consistent with the CUDA path.

    Confidence Score: 4/5

    • This PR is safe to merge with minor considerations regarding dimension validation
    • The implementation correctly adds meta device support following established patterns in the codebase. The CMakeLists.txt change is necessary and correct. The test is comprehensive. However, the meta path lacks explicit dimension validation that exists in similar operations like GroupedMmaOp, though this may be intentional as validation could occur during fusion construction.
    • csrc/ir/composite_nodes.cpp - verify that dimension validation in meta path matches design intent

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC to make it visible to test files that link against codegen_internal
    csrc/ir/composite_nodes.cpp 4/5 Added meta device fast path to CutlassNvfp4GroupedMmaOp::evaluate that returns properly shaped tensor without CUTLASS execution
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test CutlassNvfp4GroupedMma that verifies meta device support with proper tensor shapes and strides

    Sequence Diagram

    sequenceDiagram
        participant Test as test_meta.cpp
        participant EE as ExpressionEvaluator
        participant Eval as CutlassNvfp4GroupedMmaOp::evaluate
        participant Meta as Meta Device Path
        participant CUTLASS as CUTLASS Kernel
    
        Test->>EE: bind meta tensors
        Test->>EE: evaluate(fusion output)
        EE->>Eval: evaluate(inputs)
        
        alt Any input is_meta()
            Eval->>Meta: Check if any tensor is meta
            Meta->>Meta: Calculate output shape [M, N]
            Meta->>Meta: Create empty meta tensor
            Meta->>Meta: Apply rfactor dimension if needed
            Meta-->>Eval: Return meta tensor
        else All inputs are CUDA tensors
            alt NVFUSER_CUTLASS_KERNEL_ENABLED defined
                Eval->>CUTLASS: Validate input types (FP4)
                Eval->>CUTLASS: Validate problem_sizes shape
                Eval->>CUTLASS: Calculate stride tensors
                Eval->>CUTLASS: nvfp4_scaled_grouped_mm(...)
                CUTLASS-->>Eval: Result tensor
                Eval->>Eval: Apply rfactor dimension if needed
                Eval-->>EE: Return result
            else CUTLASS not enabled
                Eval-->>EE: Throw error
            end
        end
        
        EE-->>Test: meta_out tensor
        Test->>Test: Verify sizes, strides, dtype
    

Comment on lines 1726 to 1731

    if (mat1.is_meta() || mat2.is_meta() || scale1.is_meta() ||
        scale2.is_meta() || alpha.is_meta() || problem_sizes.is_meta() ||
        expert_offsets.is_meta() || sf_offsets.is_meta()) {
      // For nvfp4_scaled_grouped_mm, the output shape is [M, N]
      // where M = mat1.size(0) and N = mat2.size(1)
      std::vector<int64_t> result_sizes = {mat1.size(0), mat2.size(1)};

    missing dimension validation in meta path - mat1 expected to be 2D and mat2 expected to be 3D, but shape calculation proceeds without checks (unlike GroupedMmaOp::evaluate which validates dimensions)

@greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    Added meta-device support to CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without executing CUTLASS kernels. This allows the operation to compute output tensor shapes and properties when inputs are meta-device tensors, which is useful for ahead-of-time analysis and optimization.

    Key changes:

    • Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC in CMakeLists.txt so test files can access the macro
    • Added meta-device fast path before the CUTLASS kernel guard that checks if any input tensor is on the meta device
    • Computes output shape as [M, N] where M = mat1.size(0) and N = mat2.size(1), handles rfactor dimensions
    • Added comprehensive test comparing meta and CUDA device outputs for correctness

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • The changes follow established patterns in the codebase for meta-device support, include comprehensive testing, and make minimal, well-understood modifications. The CMakeLists.txt change correctly propagates the macro definition, and the meta-device logic mirrors similar implementations in the same file.
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED macro from PRIVATE to PUBLIC scope, enabling test files to access the macro definition
    csrc/ir/composite_nodes.cpp 5/5 Added meta-device fast path to CutlassNvfp4GroupedMmaOp::evaluate that computes output shape without executing CUTLASS kernels
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test CutlassNvfp4GroupedMma comparing meta-device and CUDA device outputs for shape, dtype, and stride consistency

    Sequence Diagram

    sequenceDiagram
        participant Test as Test Code
        participant EE as ExpressionEvaluator
        participant Op as CutlassNvfp4GroupedMmaOp
        participant Tensor as at::Tensor
        participant CUTLASS as CUTLASS Kernel
    
        Note over Test: Create fusion with CutlassNvfp4GroupedMmaOp
        Test->>EE: bind meta tensors (mat1, mat2, scales, etc)
        Test->>EE: evaluate(output)
        EE->>Op: evaluate(inputs)
        
        Op->>Tensor: Extract input tensors from PolymorphicValue
        Op->>Tensor: is_meta() check on all inputs
        
        alt Any input is meta device
            Note over Op: Fast path - no CUTLASS execution
            Op->>Op: Calculate output shape [M, N]
        Op->>Op: M = mat1.size(0), N = mat2.size(1)
            Op->>Tensor: Create meta tensor with computed shape
            Op->>Op: Handle rfactor dimension if needed
            Op-->>EE: Return meta tensor result
        else All inputs are CUDA device
            Note over Op: Runtime path - CUTLASS execution
            Op->>CUTLASS: nvfp4_scaled_grouped_mm(...)
            CUTLASS-->>Op: Return computed result
            Op->>Op: Handle rfactor dimension if needed
            Op-->>EE: Return CUDA tensor result
        end
        
        EE-->>Test: Return result tensor
        Test->>Test: Verify meta tensor properties match CUDA output
    
@greptile-apps greptile-apps bot left a comment

    Greptile Overview

    Greptile Summary

    This PR adds meta device support for CutlassNvfp4GroupedMmaOp::evaluate, enabling shape inference without requiring actual CUDA tensors or CUTLASS runtime execution.

    Key changes:

    • Moved tensor extraction code before the #if NVFUSER_CUTLASS_KERNEL_ENABLED preprocessor guard to allow meta device handling regardless of CUTLASS availability
    • Added meta-device fast path that synthesizes output tensors with correct shape [M, N] derived from input dimensions
    • Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC in CMakeLists.txt so test files can use the macro
    • Added comprehensive test case verifying meta device evaluation produces tensors with correct shapes/strides matching CUDA execution

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk - it's a well-contained feature addition with proper testing
    • The changes follow established patterns in the codebase for meta device support (similar to other ops), include comprehensive testing, and the CMakeLists.txt change correctly exposes the compile definition to dependent targets
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    CMakeLists.txt 5/5 Changed NVFUSER_CUTLASS_KERNEL_ENABLED compile definition from PRIVATE to PUBLIC to make it visible to test files
    csrc/ir/composite_nodes.cpp 5/5 Added meta device support to CutlassNvfp4GroupedMmaOp::evaluate by moving tensor extraction before preprocessor guard and adding meta-device fast path
    tests/cpp/test_meta.cpp 5/5 Added comprehensive test CutlassNvfp4GroupedMma to verify meta device support for the CUTLASS grouped matmul operation

    Sequence Diagram

    sequenceDiagram
        participant Caller
        participant ExpressionEvaluator
        participant CutlassNvfp4GroupedMmaOp
        participant MetaDevice
        participant CutlassKernel
    
        Caller->>ExpressionEvaluator: evaluate(fusion output)
        ExpressionEvaluator->>CutlassNvfp4GroupedMmaOp: evaluate(inputs)
        CutlassNvfp4GroupedMmaOp->>CutlassNvfp4GroupedMmaOp: Extract 8 input tensors
        
        alt Any input is meta device
            CutlassNvfp4GroupedMmaOp->>CutlassNvfp4GroupedMmaOp: Calculate output shape [M, N]
            CutlassNvfp4GroupedMmaOp->>MetaDevice: at::empty(result_sizes, meta options)
            MetaDevice-->>CutlassNvfp4GroupedMmaOp: meta tensor
            CutlassNvfp4GroupedMmaOp->>CutlassNvfp4GroupedMmaOp: Check rfactor_did_idx
            opt rfactor_did_idx != -1
                CutlassNvfp4GroupedMmaOp->>CutlassNvfp4GroupedMmaOp: result.unsqueeze(rfactor_did_idx)
            end
            CutlassNvfp4GroupedMmaOp-->>ExpressionEvaluator: return meta result
        else All inputs are CUDA tensors
            Note over CutlassNvfp4GroupedMmaOp: NVFUSER_CUTLASS_KERNEL_ENABLED guard
            CutlassNvfp4GroupedMmaOp->>CutlassNvfp4GroupedMmaOp: Validate input types & shapes
            CutlassNvfp4GroupedMmaOp->>CutlassNvfp4GroupedMmaOp: Setup strides (ab_strides, c_strides)
            CutlassNvfp4GroupedMmaOp->>CutlassKernel: nvfp4_scaled_grouped_mm(...)
            CutlassKernel-->>CutlassNvfp4GroupedMmaOp: result tensor
            CutlassNvfp4GroupedMmaOp->>CutlassNvfp4GroupedMmaOp: Check rfactor_did_idx
            opt rfactor_did_idx != -1
                CutlassNvfp4GroupedMmaOp->>CutlassNvfp4GroupedMmaOp: result.unsqueeze(rfactor_did_idx)
            end
            CutlassNvfp4GroupedMmaOp-->>ExpressionEvaluator: return CUDA result
        end
        
        ExpressionEvaluator-->>Caller: result tensor
    
3 participants