Precomputed swizzle_idx into group Hadamard ComputeKernel #2808
cael-ling wants to merge 3 commits into NVIDIA:main
Conversation
Greptile Summary: This PR hoists the computation of swizzle_idx out of ComputeKernel so it is computed once per thread and passed in as a parameter.
Confidence Score: 5/5 — Safe to merge. Pure loop-invariant hoisting with no behavioral change across all three kernel files. The optimization is mathematically correct (swizzle_idx depends only on threadIdx.x and compile-time constants), is applied uniformly across all three kernel variants, and introduces no new variables or logic paths. No P0 or P1 issues found. No files require special attention.
Reviews (3). Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."
```cpp
const int warp_id = threadIdx.x / kThreadsPerWarp;
const int local_rank = threadIdx.x % kThreadsPerWarp;
const int ld_row_idx = local_rank % kHadamardDimension;
const int ld_col_idx = local_rank / kHadamardDimension + warp_id * 2;
const int swizzle_idx = swizzle_128B_atom_32B(ld_row_idx, ld_col_idx);
```
Same optimization not applied to sibling files
hadamard_transform.cu and graph_safe_group_hadamard_transform.cu contain near-identical ComputeKernel definitions that still recompute warp_id, local_rank, ld_row_idx, ld_col_idx, and swizzle_idx inside the function body on every invocation. If the goal is to eliminate redundant per-iteration work, those two files have the same hot-loop structure and would benefit from the same refactor.
This is not a bug — since ComputeKernel is __forceinline__, the compiler can already hoist these invariants under optimization. But for consistency and to complete the stated intent of the PR, consider applying the same pattern to:
- hadamard_transform.cu:35-40 / call site at ~line 288
- graph_safe_group_hadamard_transform.cu:74-79 / call site at ~line 362
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Signed-off-by: Cael Ling <caell@nvidia.com>
Force-pushed from c5b2087 to f101b02
LGTM, can you introduce this change to other similar variants of the kernel? https://github.com/NVIDIA/TransformerEngine/tree/main/transformer_engine/common/hadamard_transform

OK, will do
Signed-off-by: Cael Ling <caell@nvidia.com>
for more information, see https://pre-commit.ci
The change has been applied to the variants: group_hadamard_transform.cu, hadamard_transform.cu, and graph_safe_group_hadamard_transform.cu.
Description
ComputeKernel used to derive warp_id, local_rank, ld_row_idx, ld_col_idx, and swizzle_idx from threadIdx.x on every call. Those quantities depend only on the thread’s position in the block and template constants; they do not change with pipeline stage, compute_stage_y / compute_stage_x, or the per-tile in_sh_ptr offset.
GroupHadamardAmaxTmaKernel now computes them once per thread before the for (stage_y) loop and passes swizzle_idx into ComputeKernel, avoiding redundant work in the hot nested loops. Behavior is unchanged; this is a small micro-optimization and clearer separation of loop-invariant mapping vs. per-tile pointer arithmetic.
Type of change

Performance refactor / micro-optimization; no functional change.

Changes

Please list the changes introduced in this PR:

- Hoisted the derivation of warp_id, local_rank, ld_row_idx, ld_col_idx, and swizzle_idx out of ComputeKernel: the indices are now computed once per thread before the stage loops, and swizzle_idx is passed into ComputeKernel as a parameter.
- Applied the same refactor to group_hadamard_transform.cu, hadamard_transform.cu, and graph_safe_group_hadamard_transform.cu.