cudax::copy(mdspan) Optimize shared memory cases #9137

Merged
fbusato merged 13 commits into NVIDIA:main from fbusato:optimize-cudax-copy on Jun 4, 2026
@fbusato fbusato commented May 27, 2026

Contributor

Description

This PR optimizes cudax::copy(mdspan) in the following ways:

  • Extend the permutation of the destination tensor so that coalesced memory accesses can be exploited in a wider set of cases. Previously, the shared-memory path was enabled only when the destination tensor was contiguous.
  • Add an XOR swizzle for 2D cases.
  • Simplify the fast_mod_div code to use fewer registers.
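
As a rough illustration of why an XOR swizzle helps 2D shared-memory tiles, the host-side C++ sketch below models a 32x32 tile of 4-byte words on a shared memory with 32 banks. The tile size, the bank model, and the names `smem_offset_sketch` and `banks_touched_by_column` are assumptions made for this example; they are not taken from the PR's kernel.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Assumed model for illustration only: 32x32 tile of 4-byte words, 32 banks.
constexpr std::uint32_t kTile  = 32;
constexpr std::uint32_t kBanks = 32;

// Word offset of logical element (row, col) inside the tile.
// With swizzling, the column is XOR-ed with the row before it is stored.
constexpr std::uint32_t smem_offset_sketch(std::uint32_t row, std::uint32_t col, bool swizzle)
{
  const std::uint32_t physical_col = swizzle ? (col ^ row) : col;
  return row * kTile + physical_col;
}

// Number of distinct banks touched when 32 threads read one logical column
// (thread t reads row t). 32 means conflict-free, 1 means a 32-way conflict.
inline std::size_t banks_touched_by_column(std::uint32_t col, bool swizzle)
{
  std::set<std::uint32_t> banks;
  for (std::uint32_t row = 0; row < kTile; ++row)
  {
    banks.insert(smem_offset_sketch(row, col, swizzle) % kBanks);
  }
  return banks.size();
}
```

In this model, a column read without swizzling hits a single bank for all 32 threads, while the swizzled layout spreads the same read over all 32 banks. Row reads stay conflict-free in both layouts because XOR with a fixed row is a bijection on the column index.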

@fbusato fbusato self-assigned this May 27, 2026
@fbusato fbusato requested review from a team as code owners May 27, 2026 00:14
@fbusato fbusato added this to CCCL May 27, 2026
@fbusato fbusato added the cudax Feature intended for the cudax experimental library label May 27, 2026
@fbusato fbusato requested review from caugonnet and shwina May 27, 2026 00:14
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 27, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 27, 2026
coderabbitai Bot commented May 27, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6ed99c7f-1def-4c0f-a44c-7436e01b2c29

📥 Commits

Reviewing files that changed from the base of the PR and between 23f4260 and 9dddcd0.

📒 Files selected for processing (2)
  • cudax/benchmarks/bench/copy/copy_bench.cu
  • cudax/include/cuda/experimental/__copy/copy_shared_memory.cuh
🚧 Files skipped from review as they are similar to previous changes (2)
  • cudax/include/cuda/experimental/__copy/copy_shared_memory.cuh
  • cudax/benchmarks/bench/copy/copy_bench.cu

📝 Walkthrough

Summary by CodeRabbit

  • Performance Optimizations

    • Faster device-to-device copy/transpose paths via improved shared-memory tiling, dispatch and kernel paths; reduced per-thread work for coordinate iteration.
  • Benchmarks

    • Generalized benchmark suite with configurable index/offsets.
    • Benchmark builds now run with assertions disabled for more realistic measurements.
  • Tests

    • Expanded shared-memory transpose tests: larger 2D cases plus new 3D full/partial/padded scenarios.

Walkthrough

Refactors tensor coordinate iteration, adds source/destination-aware shared-memory tiling with optional XOR swizzle and permuted load/store paths, adjusts mdspan dispatch and an optimized-kernel early-return, generalizes benchmarks to templated index/data types with size_t offsets, and expands shared-memory transpose tests.
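
One plausible reading of "iterative div-based coordinate reduction" is sketched below in host-side C++: a linear index is turned into per-dimension coordinates by repeatedly dividing by each unsigned extent, without precomputing products of extents. The helper names `linear_to_coords` and `coords_to_offset`, and the choice of the last dimension as fastest-varying, are assumptions for illustration rather than the PR's implementation.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: decompose a linear index into coordinates, assuming
// the last extent varies fastest. One division per dimension; the remainder
// is derived from the quotient instead of a second division.
template <std::size_t Rank>
std::array<std::uint32_t, Rank>
linear_to_coords(std::uint32_t index, const std::array<std::uint32_t, Rank>& extents)
{
  std::array<std::uint32_t, Rank> coords{};
  for (std::size_t i = Rank; i-- > 0;)
  {
    const std::uint32_t quotient = index / extents[i];
    coords[i] = index - quotient * extents[i]; // index % extents[i]
    index     = quotient;
  }
  return coords;
}

// Hypothetical helper: map coordinates to a memory offset for arbitrary strides.
template <std::size_t Rank>
std::int64_t coords_to_offset(const std::array<std::uint32_t, Rank>& coords,
                              const std::array<std::int64_t, Rank>& strides)
{
  std::int64_t offset = 0;
  for (std::size_t i = 0; i < Rank; ++i)
  {
    offset += static_cast<std::int64_t>(coords[i]) * strides[i];
  }
  return offset;
}
```

For extents {2, 3, 4}, linear index 23 decomposes to coordinates (1, 2, 3); with row-major strides {12, 4, 1} the offset is again 23, while column-major strides {1, 2, 6} give a different offset for the same logical element.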

Changes

Shared-memory tiling and tensor copy optimization

| Layer | File(s) | Summary |
| --- | --- | --- |
| Stride ordering foundation | cudax/include/cuda/experimental/__copy_bytes/tensor_query.cuh | New __stride_order computes stable stride-order permutations; __sort_by_stride now delegates to it. |
| Tensor iterator coordinate mapping | cudax/include/cuda/experimental/__copy/tensor_iterator.cuh | Removes extent-product precomputation; uses unsigned extents and iterative div-based coordinate reduction; extracts __partial_tensor.__offset. |
| Shared-memory tiling decision logic | cudax/include/cuda/experimental/__copy/copy_shared_memory_utils.cuh | Adds __narrow_raw_tensor_rank, __shared_mem_tiling_result, __add_coalesced_tile_run; __find_shared_mem_tiling now computes source/destination-aware tiling and XOR-swizzle eligibility; __use_shared_mem_kernel uses tiling validity. |
| Shared-memory kernel with XOR swizzle and permuted paths | cudax/include/cuda/experimental/__copy/copy_shared_memory.cuh | Adds __smem_offset helper and _UseXorSwizzle template param; kernel accepts separate src/dst smem stride descriptors and permutation iterators; load/store loops and host launcher rewritten to use tiling result and conditional swizzle. |
| Copy dispatch, optimization, and cleanup | cudax/include/cuda/experimental/__copy/copy_optimized.cuh, cudax/include/cuda/experimental/__copy/mdspan_d2d.cuh, cudax/include/cuda/experimental/__copy/tensor_copy_utils.cuh | Adds __max_rank >= 2 guard and rank-2 fast path for shared-memory dispatch; adds early-return for small extent types in optimized kernel; removes unused includes. |
| Benchmark infrastructure and template generalization | cudax/benchmarks/CMakeLists.txt, cudax/benchmarks/bench/copy/copy_bench.cu | Disables assertions in benchmark builds; compute_alloc and bench_copy templated on data_t/idx_t with size_t offsets; benchmark instantiations updated to new signatures. |
| Shared-memory tests | cudax/test/copy/copy_shared_memory.cu | Replaces earlier 2D tests with large 2D cases and adds three 3D tests (full, partial, padded) with adjusted alloc/stride calculations. |
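
To make the idea of a stable stride-order permutation concrete, here is a minimal host-side C++ sketch. It returns the dimension indices sorted by ascending stride, keeping the original order for equal strides. The function name `stride_order_sketch`, the ascending direction, and the assumption of non-negative strides are illustrative choices; the actual `__stride_order` in the PR may differ.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <numeric>

// Hypothetical sketch: permutation of dimension indices ordered by ascending
// stride. std::stable_sort keeps dimensions with equal strides in their
// original relative order. Strides are assumed to be non-negative.
template <std::size_t Rank>
std::array<std::size_t, Rank> stride_order_sketch(const std::array<std::int64_t, Rank>& strides)
{
  std::array<std::size_t, Rank> order{};
  std::iota(order.begin(), order.end(), std::size_t{0}); // 0, 1, ..., Rank - 1
  std::stable_sort(order.begin(), order.end(), [&strides](std::size_t a, std::size_t b) {
    return strides[a] < strides[b];
  });
  return order;
}
```

For strides {12, 1, 4} the sketch yields the order {1, 2, 0}, so dimension 1 is the contiguous one. Knowing this order for both source and destination is what lets a copy routine decide which dimensions can be read or written in a coalesced way.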
  • Suggested labels: libcu++
  • Suggested reviewers:
    • gonidelis

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c248b761-c5a1-4a50-9627-641679ac2294

📥 Commits

Reviewing files that changed from the base of the PR and between 19e3eb3 and c0f7f49.

📒 Files selected for processing (9)
  • cudax/benchmarks/CMakeLists.txt
  • cudax/benchmarks/bench/copy/copy_bench.cu
  • cudax/include/cuda/experimental/__copy/copy_optimized.cuh
  • cudax/include/cuda/experimental/__copy/copy_shared_memory.cuh
  • cudax/include/cuda/experimental/__copy/copy_shared_memory_utils.cuh
  • cudax/include/cuda/experimental/__copy/mdspan_d2d.cuh
  • cudax/include/cuda/experimental/__copy/tensor_copy_utils.cuh
  • cudax/include/cuda/experimental/__copy/tensor_iterator.cuh
  • cudax/include/cuda/experimental/__copy_bytes/tensor_query.cuh
💤 Files with no reviewable changes (1)
  • cudax/include/cuda/experimental/__copy/tensor_copy_utils.cuh

Comment thread cudax/include/cuda/experimental/__copy/copy_optimized.cuh
Comment thread cudax/include/cuda/experimental/__copy/copy_shared_memory_utils.cuh Outdated
Comment thread cudax/benchmarks/bench/copy/copy_bench.cu Outdated
@fbusato fbusato requested a review from NaderAlAwar May 27, 2026 16:51
Comment thread cudax/benchmarks/bench/copy/copy_bench.cu Outdated
Comment thread cudax/include/cuda/experimental/__copy/copy_shared_memory.cuh Outdated
@fbusato fbusato requested a review from oleksandr-pavlyk June 2, 2026 19:09
github-actions Bot commented Jun 2, 2026

Contributor

🥳 CI Workflow Results

🟩 Finished in 30m 59s: Pass: 100%/55 | Total: 9h 28m | Max: 30m 59s | Hits: 83%/38622

See results here.

@fbusato fbusato merged commit 316f9cc into NVIDIA:main Jun 4, 2026
77 checks passed

Labels

cudax Feature intended for the cudax experimental library

Projects

Archived in project

4 participants