Add device type meta support for CutlassNvfp4GroupedMmaOp::evaluate
#5695
base: main
Conversation
Review updated until commit 588ec95

Description

Relevant files:
- Enhancement
- Tests
- Configuration changes
PR Reviewer Guide
Here are some key observations to aid the review process:
- 🧪 PR contains tests
- ⚡ Recommended focus areas for review: CMake Definition Scope Change

!test
!test
csrc/ir/composite_nodes.cpp (Outdated)

```cpp
    scale2.is_meta() || alpha.is_meta() || problem_sizes.is_meta() ||
    expert_offsets.is_meta() || sf_offsets.is_meta()) {
  // For nvfp4_scaled_grouped_mm, the output shape is [M, N]
  // where M = mat1.size(0) and N = mat2.size(1)
```
I thought the output size n is mat2.size(2).
e.g. if you look at line 1767 below.
tests/cpp/test_meta.cpp (Outdated)

```cpp
at::Tensor scale2_input = at::randn({4, 128, 8}, options_fp8);
at::Tensor alpha_input = at::ones({4}, options_fp32);
at::Tensor problem_sizes_input = at::tensor(
    {{32, 128, 128}, {32, 128, 128}, {32, 128, 128}, {32, 128, 128}},
```
btw, here we get away with k == n. Maybe we want to change that just for slightly better test coverage. 😉
Greptile Overview
Greptile Summary
Adds meta device support for CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without executing actual CUTLASS kernels.
Key Changes:
- Moved input tensor extraction outside the `NVFUSER_CUTLASS_KERNEL_ENABLED` guard to allow the meta device path to access tensor shapes
- Implemented a fast path for meta device tensors that constructs the output tensor with shape `[M, N]`, where M = `mat1.size(0)` and N = `mat2.size(1)`
- Correctly handles rfactor device dimension indexing for both the meta and real execution paths
- Added a comprehensive test, `CutlassNvfp4GroupedMma`, that verifies meta evaluation produces identical shape, dtype, and strides as CUDA evaluation

Implementation Quality:
- Follows existing meta device patterns used in other composite ops (e.g., `MatmulOp`, `EmbeddingOp`)
- Correctly computes the output shape based on grouped matmul semantics, where mat1 is `[M, K/2]` and mat2 is `[G, N, K/2]`
- The test uses realistic dimensions where M, N, K, and K/2 are all different, to catch dimension-indexing errors
Confidence Score: 5/5
- This PR is safe to merge with minimal risk
- The implementation follows established patterns for meta device support in the codebase, includes comprehensive testing that validates correctness, and only adds functionality without modifying existing behavior. The changes are well-isolated and the meta path returns early before any real computation occurs.
- No files require special attention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| csrc/ir/composite_nodes.cpp | 5/5 | Added meta device support for CutlassNvfp4GroupedMmaOp by moving input extraction outside CUTLASS guard and implementing fast path that returns meta tensor with correct shape [M, N] |
| tests/cpp/test_meta.cpp | 5/5 | Added comprehensive test for CutlassNvfp4GroupedMmaOp meta device support, verifying output shape, dtype, and strides match between CUDA and meta evaluation paths |
Sequence Diagram
```mermaid
sequenceDiagram
    participant Client as ExpressionEvaluator
    participant Op as CutlassNvfp4GroupedMmaOp
    participant ATen as ATen/PyTorch
    participant Cutlass as CUTLASS Kernel
    Client->>Op: evaluate(inputs)
    Op->>Op: Extract 8 input tensors
    Note over Op: mat1, mat2, scale1, scale2,<br/>alpha, problem_sizes,<br/>expert_offsets, sf_offsets
    alt Any input is meta device
        Op->>Op: Calculate output shape<br/>[mat1.size(0), mat2.size(1)]
        Op->>ATen: at::empty(shape, meta_device)
        ATen-->>Op: meta tensor
        Op->>Op: Apply rfactor dimension if needed
        Op-->>Client: Return meta tensor
    else All inputs are real device
        Note over Op: CUTLASS_KERNEL_ENABLED required
        Op->>Op: Validate tensor dtypes
        Op->>Op: Calculate strides (ab_strides, c_strides)
        Op->>Cutlass: nvfp4_scaled_grouped_mm()
        Note over Cutlass: Performs FP4 grouped matmul<br/>with block scaling
        Cutlass-->>Op: Result tensor [M, N]
        Op->>Op: Apply rfactor dimension if needed
        Op-->>Client: Return result tensor
    end
```
Greptile Overview
Greptile Summary
This PR enables meta-device support for CutlassNvfp4GroupedMmaOp::evaluate, allowing shape inference without requiring CUDA execution. The implementation moves the meta-device fast path outside the NVFUSER_CUTLASS_KERNEL_ENABLED guard and makes the macro PUBLIC in CMakeLists.txt so test files can use it.
Major Changes:
- Changed `NVFUSER_CUTLASS_KERNEL_ENABLED` from PRIVATE to PUBLIC scope in CMakeLists.txt to enable conditional compilation in test files
- Added meta-device support in `CutlassNvfp4GroupedMmaOp::evaluate()` by checking `is_meta()` on all inputs and creating appropriately shaped meta tensors
- Added a comprehensive test, `CutlassNvfp4GroupedMma`, that validates meta tensors match real CUDA output shapes and strides

Issues Found:
- Critical compilation errors in the test file: the undefined variables `options_fp4` and `options_fp8` are referenced but never declared (lines 620, 622, 626, 630)
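The PRIVATE-to-PUBLIC change described above can be sketched as follows (the target name is an assumption for illustration and may differ in the actual CMakeLists.txt):

```cmake
# Sketch of the described change. A PRIVATE compile definition applies only
# while compiling the library target itself; a PUBLIC definition also
# propagates to consumers that link against it, which is what lets the test
# executables see the #if NVFUSER_CUTLASS_KERNEL_ENABLED guard.
target_compile_definitions(codegen_internal PUBLIC NVFUSER_CUTLASS_KERNEL_ENABLED)
```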
Confidence Score: 1/5
- PR contains critical compilation errors that will prevent the test from building
- The test file references undefined variables (`options_fp4` and `options_fp8`), which will cause compilation failures. While the core implementation in composite_nodes.cpp appears sound and the CMakeLists.txt change is necessary and correct, the test cannot run until these errors are fixed
- tests/cpp/test_meta.cpp requires immediate attention to fix the undefined variable references before merge
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| CMakeLists.txt | 5/5 | Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC scope to make the macro available in test files |
| csrc/ir/composite_nodes.cpp | 5/5 | Added meta-device support by checking is_meta() and creating meta tensors with correct shapes before the CUTLASS-guarded code path |
| tests/cpp/test_meta.cpp | 1/5 | Added comprehensive test for meta-device support, but contains critical compilation errors with undefined variables options_fp4 and options_fp8 |
Sequence Diagram
```mermaid
sequenceDiagram
    participant Test as test_meta.cpp
    participant Eval as ExpressionEvaluator
    participant Op as CutlassNvfp4GroupedMmaOp
    participant Meta as Meta Device Path
    participant CUTLASS as CUTLASS Kernel
    Test->>Eval: bind meta tensors (mat1, mat2, scales, etc.)
    Test->>Eval: evaluate(fusion output)
    Eval->>Op: CutlassNvfp4GroupedMmaOp::evaluate()
    Op->>Op: Extract input tensors
    Op->>Meta: Check if any input is_meta()
    alt Meta Device Path
        Meta->>Meta: Calculate output shape [M, N]
        Meta->>Meta: Create empty meta tensor with correct dtype
        Meta->>Meta: Apply rfactor dimension if needed
        Meta-->>Test: Return meta tensor
    else CUDA Device Path (NVFUSER_CUTLASS_KERNEL_ENABLED)
        Op->>CUTLASS: Validate input types
        Op->>CUTLASS: Create stride tensors
        Op->>CUTLASS: Call nvfp4_scaled_grouped_mm kernel
        CUTLASS-->>Op: Return result tensor
        Op->>Op: Apply rfactor dimension if needed
        Op-->>Test: Return CUDA tensor
    end
```
```cpp
ee_cuda.bind(fusion->inputs().at(6), expert_offsets_input);
ee_cuda.bind(fusion->inputs().at(7), sf_offsets_input);
auto real_out = ee_cuda.evaluate(fusion->outputs().at(0)).as<at::Tensor>();
```

`options_fp4` is undefined; it should be `options_uint8.dtype(at::kFloat4_e2m1fn_x2)`:

```cpp
    mat1_input.sizes(), mat1_input.strides(), options_uint8.dtype(at::kFloat4_e2m1fn_x2).device(at::kMeta));
```
```cpp
auto real_out = ee_cuda.evaluate(fusion->outputs().at(0)).as<at::Tensor>();

// Meta evaluation
ExpressionEvaluator ee_meta;
```

`options_fp4` is undefined; it should be `options_uint8.dtype(at::kFloat4_e2m1fn_x2)`:

```cpp
    mat2_input.sizes(), mat2_input.strides(), options_uint8.dtype(at::kFloat4_e2m1fn_x2).device(at::kMeta));
```
```cpp
auto meta_mat1 = at::empty_strided(
    mat1_input.sizes(), mat1_input.strides(), options_fp4.device(at::kMeta));
auto meta_mat2 = at::empty_strided(
    mat2_input.sizes(), mat2_input.strides(), options_fp4.device(at::kMeta));
```

`options_fp8` is undefined; it should be `options_fp32.dtype(at::kFloat8_e4m3fn)`:

```cpp
    options_fp32.dtype(at::kFloat8_e4m3fn).device(at::kMeta));
```
```cpp
auto meta_scale1 = at::empty_strided(
    scale1_input.sizes(),
    scale1_input.strides(),
    options_fp8.device(at::kMeta));
```

`options_fp8` is undefined; it should be `options_fp32.dtype(at::kFloat8_e4m3fn)`:

```cpp
    options_fp32.dtype(at::kFloat8_e4m3fn).device(at::kMeta));
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
Adds meta-device support to CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without CUDA execution. The meta path computes output shape as [M, N] where M = mat1.size(0) and N = mat2.size(1), handling the rfactor device dimension index if present. The NVFUSER_CUTLASS_KERNEL_ENABLED compile definition is changed from PRIVATE to PUBLIC to make it visible to test files.
Confidence Score: 5/5
- Safe to merge - follows established patterns for meta-device support with comprehensive test coverage
- The implementation follows the exact same pattern used by other meta-device handlers in the codebase (GroupedMmaOp, EmbeddingOp). The meta path correctly computes output shapes, preserves dtype, and handles the rfactor device dimension. The CMakeLists.txt change is necessary to expose the compile definition to test files. Comprehensive test coverage validates shapes, strides, and dtype against real CUDA execution.
- No files require special attention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| csrc/ir/composite_nodes.cpp | 4/5 | Adds meta-device support for CutlassNvfp4GroupedMmaOp::evaluate by handling meta tensors before CUTLASS-enabled code path |
| CMakeLists.txt | 4/5 | Changes NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC compile definition for broader visibility |
Sequence Diagram
sequenceDiagram
participant Caller
participant Evaluate as CutlassNvfp4GroupedMmaOp::evaluate
participant MetaPath as Meta Device Path
participant CutlassPath as CUTLASS Kernel Path
Caller->>Evaluate: "evaluate(inputs)"
Evaluate->>Evaluate: "Extract 8 input tensors"
alt Any input is meta device
Evaluate->>MetaPath: "Compute output shape"
MetaPath->>MetaPath: "result_sizes = [mat1.size(0), mat2.size(1)]"
MetaPath->>MetaPath: "Create empty meta tensor with out_dtype"
MetaPath->>MetaPath: "Apply rfactor unsqueeze if needed"
MetaPath-->>Caller: "Return meta tensor"
else All inputs are CUDA
Evaluate->>CutlassPath: "Call nvfp4_scaled_grouped_mm"
CutlassPath->>CutlassPath: "Execute CUTLASS kernel"
CutlassPath->>CutlassPath: "Apply rfactor unsqueeze if needed"
CutlassPath-->>Caller: "Return computed result"
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR adds meta-device support for CutlassNvfp4GroupedMmaOp::evaluate, enabling shape inference without requiring CUTLASS kernel execution. The implementation is consistent with existing meta-device patterns in the codebase.
Key Changes:
- Added meta-device fast path that checks if any input tensor is on meta device and returns an appropriately shaped output tensor
- Changed
NVFUSER_CUTLASS_KERNEL_ENABLEDfrom PRIVATE to PUBLIC in CMakeLists.txt to ensure test files can access the preprocessor definition - Added comprehensive test
CutlassNvfp4GroupedMmathat validates meta and CUDA paths produce matching shapes, dtypes, and strides - Meta path calculates output shape as [M, N] where M = mat1.size(0) and N = mat2.size(1), matching the actual CUTLASS operation semantics
Confidence Score: 5/5
- This PR is safe to merge with minimal risk
- The implementation follows established patterns in the codebase for meta-device support, includes comprehensive test coverage, and makes a necessary build configuration change. The meta-device logic correctly calculates output shapes consistent with the actual CUTLASS operation, and the CMakeLists.txt change properly exposes the preprocessor definition to dependent targets.
- No files require special attention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| CMakeLists.txt | 5/5 | Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC to make the preprocessor definition available to targets linking against codegen_internal |
| csrc/ir/composite_nodes.cpp | 5/5 | Added meta-device fast path for CutlassNvfp4GroupedMmaOp::evaluate that returns correctly shaped output tensor without requiring CUTLASS kernel execution |
| tests/cpp/test_meta.cpp | 5/5 | Added comprehensive test for meta-device support in CutlassNvfp4GroupedMmaOp, verifying output shape, dtype, and strides match CUDA execution |
Sequence Diagram
sequenceDiagram
participant Test as MetaTest
participant EE as ExpressionEvaluator
participant Op as CutlassNvfp4GroupedMmaOp
participant Cutlass as CUTLASS Kernel
Note over Test: Create fusion with meta tensors
Test->>EE: bind(meta_mat1, meta_mat2, ...)
Test->>EE: evaluate(fusion output)
EE->>Op: evaluate(ee, inputs)
alt Any input is_meta()
Op->>Op: Calculate output shape [M, N]<br/>M = mat1.size(0), N = mat2.size(1)
Op->>Op: Create meta tensor with correct dtype
Op->>Op: Apply rfactor_did_idx if needed
Op-->>EE: Return meta tensor
else All inputs are CUDA tensors
Note over Op: Check NVFUSER_CUTLASS_KERNEL_ENABLED
Op->>Op: Validate tensor types and shapes
Op->>Op: Calculate strides for kernel
Op->>Cutlass: nvfp4_scaled_grouped_mm(...)
Cutlass-->>Op: Result tensor
Op->>Op: Apply rfactor_did_idx if needed
Op-->>EE: Return result tensor
end
EE-->>Test: meta_out tensor
Test->>Test: Verify: is_meta(), dtype, sizes, strides
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR adds meta device support for CutlassNvfp4GroupedMmaOp::evaluate to enable shape inference without running actual CUDA kernels. The implementation follows existing patterns in the codebase for handling meta devices.
Key Changes:
- Added meta device fast path in
CutlassNvfp4GroupedMmaOp::evaluatethat checks if any input is on meta device and returns appropriately shaped meta tensor - Changed
NVFUSER_CUTLASS_KERNEL_ENABLEDmacro visibility fromPRIVATEtoPUBLICin CMakeLists.txt to allow test files to use the macro - Added comprehensive test
CutlassNvfp4GroupedMmathat validates output shape, dtype, and strides match between CUDA and meta execution paths
Confidence Score: 5/5
- This PR is safe to merge with minimal risk
- The changes are minimal, well-tested, and follow established patterns in the codebase. The meta device support is implemented consistently with other evaluate methods, the CMakeLists.txt change is necessary and correct, and comprehensive tests validate the implementation.
- No files require special attention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| CMakeLists.txt | 5/5 | Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC visibility. This allows test files to check the macro definition. Simple and correct change. |
| csrc/ir/composite_nodes.cpp | 5/5 | Added meta device support to CutlassNvfp4GroupedMmaOp::evaluate by checking all inputs with is_meta() and returning appropriate meta tensors. Follows existing patterns in the codebase. |
| tests/cpp/test_meta.cpp | 5/5 | Added comprehensive test for CutlassNvfp4GroupedMmaOp with meta device. Test properly guards with NVFUSER_CUTLASS_KERNEL_ENABLED and verifies output shape, dtype, and strides. |
Sequence Diagram
sequenceDiagram
participant Test as MetaTest
participant EE as ExpressionEvaluator
participant Eval as CutlassNvfp4GroupedMmaOp::evaluate
participant ATen as at::Tensor
Test->>EE: bind meta inputs (8 tensors)
Test->>EE: evaluate(fusion output)
EE->>Eval: evaluate(inputs)
Eval->>Eval: Check if any input.is_meta()
alt Any input is meta
Eval->>Eval: Calculate output shape [M, N]
Eval->>ATen: at::empty(sizes, meta device)
ATen-->>Eval: meta tensor
Eval->>Eval: Apply rfactor_did_idx if needed
Eval-->>EE: meta tensor result
else All inputs are CUDA
Eval->>Eval: NVF_CHECK scalar types
Eval->>Eval: Calculate strides
Eval->>Eval: Call cutlass kernel
Eval-->>EE: CUDA tensor result
end
EE-->>Test: output tensor
Test->>Test: Verify shape, dtype, strides
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR adds meta-device support for CutlassNvfp4GroupedMmaOp::evaluate, enabling shape inference without requiring CUTLASS kernel compilation.
Key changes:
- CMakeLists.txt: Changed
NVFUSER_CUTLASS_KERNEL_ENABLEDfromPRIVATEtoPUBLICcompile definition, allowing the macro to be visible in header files - composite_nodes.cpp: Moved meta-device fast path before the
#if NVFUSER_CUTLASS_KERNEL_ENABLEDpreprocessor guard, allowing shape inference to work even when CUTLASS is not compiled - test_meta.cpp: Added comprehensive test that validates meta-device execution by comparing output shapes, strides, and dtypes between CUDA and meta paths
Implementation approach:
The meta-device fast path computes output shape as [mat1.size(0), mat2.size(1)] and creates an empty meta tensor with the correct dtype, matching the pattern used in other operators like GroupedMmaOp.
Confidence Score: 5/5
- This PR is safe to merge with no significant risks
- The changes are well-structured and follow established patterns in the codebase. The meta-device support is properly isolated in a fast path that doesn't affect the existing CUTLASS kernel execution. The test coverage is comprehensive and validates the implementation thoroughly.
- No files require special attention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| CMakeLists.txt | 5/5 | Changes NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC compile definition, enabling meta-device support check outside preprocessor guards |
| csrc/ir/composite_nodes.cpp | 5/5 | Adds meta-device fast path before #if NVFUSER_CUTLASS_KERNEL_ENABLED guard in CutlassNvfp4GroupedMmaOp::evaluate, enabling shape inference without CUTLASS compilation |
| tests/cpp/test_meta.cpp | 5/5 | Adds comprehensive test CutlassNvfp4GroupedMma that validates meta-device support by comparing output shapes/strides between CUDA and meta execution paths |
Sequence Diagram
sequenceDiagram
participant Test as test_meta.cpp
participant EE as ExpressionEvaluator
participant Eval as CutlassNvfp4GroupedMmaOp::evaluate
participant MetaPath as Meta Fast Path
participant CutlassPath as CUTLASS Kernel Path
Test->>EE: bind meta tensors
Test->>EE: evaluate(fusion->outputs)
EE->>Eval: evaluate(inputs)
Eval->>Eval: Check if any input is_meta()
alt Any input is meta
Eval->>MetaPath: Execute meta fast path
MetaPath->>MetaPath: Calculate output shape [M, N]
MetaPath->>MetaPath: Create empty meta tensor
MetaPath->>MetaPath: Apply rFactor dim if needed
MetaPath-->>Eval: Return meta tensor
else All inputs on CUDA
Eval->>CutlassPath: Check NVFUSER_CUTLASS_KERNEL_ENABLED
alt CUTLASS enabled
CutlassPath->>CutlassPath: Validate input types (Float4)
CutlassPath->>CutlassPath: Call cutlass_kernels::nvfp4_scaled_grouped_mm
CutlassPath-->>Eval: Return CUDA result
else CUTLASS not enabled
CutlassPath-->>Eval: Throw error
end
end
Eval-->>EE: Return result
EE-->>Test: Return evaluated tensor
Greptile Overview
Greptile Summary
Adds meta device support to CutlassNvfp4GroupedMmaOp::evaluate, allowing shape inference without executing CUTLASS kernels.
Major Changes:
- Changed `NVFUSER_CUTLASS_KERNEL_ENABLED` compile definition from PRIVATE to PUBLIC in CMakeLists.txt (required for test files to see the preprocessor guard)
- Added meta device fast path in `CutlassNvfp4GroupedMmaOp::evaluate` that checks if any input is meta and returns an appropriately shaped output tensor
- Added comprehensive test `CutlassNvfp4GroupedMma` that validates meta device behavior against CUDA execution
Implementation:
The meta device path computes output shape as [mat1.size(0), mat2.size(1)] and creates an empty meta tensor with the correct dtype, matching the pattern used in other composite ops like GroupedMmaOp. The rfactor device dimension handling is consistent with the CUDA path.
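The rFactor handling mentioned above amounts to inserting a size-1 device dimension at `rfactor_did_idx`, mirroring `at::Tensor::unsqueeze`. A minimal sketch on a bare shape vector — the helper name is invented for illustration:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper mirroring unsqueeze on a shape: insert a size-1
// axis at `dim`. In evaluate(), the equivalent is
// result = result.unsqueeze(rfactor_did_idx) when rfactor_did_idx != -1,
// so the output keeps the same rank as the sharded IR output.
std::vector<int64_t> unsqueeze_shape(std::vector<int64_t> sizes, int64_t dim) {
  sizes.insert(sizes.begin() + dim, 1);
  return sizes;
}
```

Applying the same unsqueeze on both the meta and CUDA paths is what keeps the two paths' output ranks consistent.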
Confidence Score: 4/5
- This PR is safe to merge with minor considerations regarding dimension validation
- The implementation correctly adds meta device support following established patterns in the codebase. The CMakeLists.txt change is necessary and correct. The test is comprehensive. However, the meta path lacks the explicit dimension validation that exists in similar operations like `GroupedMmaOp`, though this may be intentional as validation could occur during fusion construction.
- csrc/ir/composite_nodes.cpp - verify that dimension validation in the meta path matches design intent
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| CMakeLists.txt | 5/5 | Changed NVFUSER_CUTLASS_KERNEL_ENABLED from PRIVATE to PUBLIC to make it visible to test files that link against codegen_internal |
| csrc/ir/composite_nodes.cpp | 4/5 | Added meta device fast path to CutlassNvfp4GroupedMmaOp::evaluate that returns properly shaped tensor without CUTLASS execution |
| tests/cpp/test_meta.cpp | 5/5 | Added comprehensive test CutlassNvfp4GroupedMma that verifies meta device support with proper tensor shapes and strides |
Sequence Diagram
```mermaid
sequenceDiagram
    participant Test as test_meta.cpp
    participant EE as ExpressionEvaluator
    participant Eval as CutlassNvfp4GroupedMmaOp::evaluate
    participant Meta as Meta Device Path
    participant CUTLASS as CUTLASS Kernel
    Test->>EE: bind meta tensors
    Test->>EE: evaluate(fusion output)
    EE->>Eval: evaluate(inputs)
    alt Any input is_meta()
        Eval->>Meta: Check if any tensor is meta
        Meta->>Meta: Calculate output shape [M, N]
        Meta->>Meta: Create empty meta tensor
        Meta->>Meta: Apply rfactor dimension if needed
        Meta-->>Eval: Return meta tensor
    else All inputs are CUDA tensors
        alt NVFUSER_CUTLASS_KERNEL_ENABLED defined
            Eval->>CUTLASS: Validate input types (FP4)
            Eval->>CUTLASS: Validate problem_sizes shape
            Eval->>CUTLASS: Calculate stride tensors
            Eval->>CUTLASS: nvfp4_scaled_grouped_mm(...)
            CUTLASS-->>Eval: Result tensor
            Eval->>Eval: Apply rfactor dimension if needed
            Eval-->>EE: Return result
        else CUTLASS not enabled
            Eval-->>EE: Throw error
        end
    end
    EE-->>Test: meta_out tensor
    Test->>Test: Verify sizes, strides, dtype
```
csrc/ir/composite_nodes.cpp (Outdated)

```cpp
if (mat1.is_meta() || mat2.is_meta() || scale1.is_meta() ||
    scale2.is_meta() || alpha.is_meta() || problem_sizes.is_meta() ||
    expert_offsets.is_meta() || sf_offsets.is_meta()) {
  // For nvfp4_scaled_grouped_mm, the output shape is [M, N]
  // where M = mat1.size(0) and N = mat2.size(1)
  std::vector<int64_t> result_sizes = {mat1.size(0), mat2.size(1)};
```
missing dimension validation in meta path - mat1 expected to be 2D and mat2 expected to be 3D, but shape calculation proceeds without checks (unlike GroupedMmaOp::evaluate which validates dimensions)
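One way to address this reviewer concern, sketched with a plain predicate rather than nvFuser's error macros — the function name is invented, and the expected ranks (2D mat1 as [M, K], 3D mat2 as [G, N, K]) are taken from the comment above as assumptions about the op's contract:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical pre-check mirroring GroupedMmaOp-style validation:
// mat1 is expected to be 2D and mat2 3D before the meta path
// computes the output shape as {mat1.size(0), mat2.size(1)}.
bool valid_grouped_mm_ranks(
    const std::vector<int64_t>& mat1_sizes,
    const std::vector<int64_t>& mat2_sizes) {
  return mat1_sizes.size() == 2 && mat2_sizes.size() == 3;
}
```

In the real code this would likely be a hard error (e.g. via the project's assertion macro) rather than a boolean, so malformed fusions fail loudly on the meta path just as they do on the CUDA path.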
No description provided.