`cudax::copy(mdspan)` Optimize shared memory cases by fbusato · Pull Request #9137 · NVIDIA/cccl

fbusato · 2026-05-27T00:14:50Z

Description

The PR optimizes `cudax::copy(mdspan)` for the following cases:

- Extends permutation to the destination tensor to exploit coalesced memory accesses in a wider set of cases. Previously, the shared memory path was enabled only when the destination tensor was contiguous.
- Adds XOR swizzle for 2D cases.
- Simplifies the `fast_mod_div` code to use fewer registers.
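
To illustrate why an XOR swizzle helps a 2D shared-memory tile, here is a minimal host-side sketch. It is not the PR's kernel code: `kTile`, `swizzled_offset`, and `banks_touched_by_column` are hypothetical names, and the model assumes 4-byte elements and 32 shared-memory banks. It counts how many distinct banks are touched when 32 threads read one tile column, as happens on the transposed side of a copy.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical tile width; 32 matches the assumed number of shared-memory banks.
constexpr unsigned kTile = 32;

// Linear offset of element (row, col) in a kTile x kTile tile.
// With swizzling, the column is XOR-ed with the row, which permutes
// the columns of each row without changing the tile's footprint.
constexpr unsigned swizzled_offset(unsigned row, unsigned col, bool use_xor_swizzle)
{
  const unsigned stored_col = use_xor_swizzle ? (col ^ row) : col;
  return row * kTile + stored_col;
}

// Number of distinct banks touched when thread r reads element (r, col),
// assuming 4-byte elements so that bank = offset % kTile.
inline std::size_t banks_touched_by_column(unsigned col, bool use_xor_swizzle)
{
  assert(col < kTile);
  std::set<unsigned> banks;
  for (unsigned row = 0; row < kTile; ++row)
  {
    banks.insert(swizzled_offset(row, col, use_xor_swizzle) % kTile);
  }
  return banks.size();
}
```

Under these assumptions, an unswizzled column read lands in a single bank (a 32-way conflict), while the swizzled layout spreads the same read over all 32 banks without padding the tile.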

coderabbitai · 2026-05-27T00:23:13Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6ed99c7f-1def-4c0f-a44c-7436e01b2c29

📥 Commits

Reviewing files that changed from the base of the PR and between 23f4260 and 9dddcd0.

📒 Files selected for processing (2)

cudax/benchmarks/bench/copy/copy_bench.cu
cudax/include/cuda/experimental/__copy/copy_shared_memory.cuh

🚧 Files skipped from review as they are similar to previous changes (2)

cudax/include/cuda/experimental/__copy/copy_shared_memory.cuh
cudax/benchmarks/bench/copy/copy_bench.cu

📝 Walkthrough

Summary by CodeRabbit

Performance Optimizations
- Faster device-to-device copy/transpose paths via improved shared-memory tiling, dispatch and kernel paths; reduced per-thread work for coordinate iteration.
Benchmarks
- Generalized benchmark suite with configurable index/offsets.
- Benchmark builds now run with assertions disabled for more realistic measurements.
Tests
- Expanded shared-memory transpose tests: larger 2D cases plus new 3D full/partial/padded scenarios.

Walkthrough

Refactors tensor coordinate iteration, adds source/destination-aware shared-memory tiling with optional XOR swizzle and permuted load/store paths, adjusts mdspan dispatch and an optimized-kernel early-return, generalizes benchmarks to templated index/data types with size_t offsets, and expands shared-memory transpose tests.
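
The coordinate-iteration refactor mentioned above can be pictured with a small host-side sketch. This is an illustration under assumptions, not the code in `tensor_iterator.cuh`: `linear_to_offset` is a hypothetical name, and the last dimension is assumed to vary fastest. It maps a linear element index to a strided memory offset by peeling off one coordinate per division, so no table of extent products is needed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Maps a linear element index to a memory offset for a strided tensor.
// Each iteration divides by one extent: the remainder is that dimension's
// coordinate and the quotient carries over to the next (slower) dimension.
template <std::size_t Rank>
std::size_t linear_to_offset(std::size_t index,
                             const std::array<unsigned, Rank>& extents,
                             const std::array<std::size_t, Rank>& strides)
{
  std::size_t offset = 0;
  for (std::size_t i = Rank; i-- > 0;)
  {
    assert(extents[i] > 0);
    const std::size_t quotient = index / extents[i];
    // Remainder derived from the quotient, avoiding a second division.
    const std::size_t coord = index - quotient * extents[i];
    offset += coord * strides[i];
    index = quotient;
  }
  return offset;
}
```

For a contiguous layout the result equals the input index; for a padded layout the strides stretch each coordinate's contribution accordingly.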

Changes

Shared-memory tiling and tensor copy optimization

| Layer / File(s) | Summary |
|---|---|
| Stride ordering foundation `cudax/include/cuda/experimental/__copy_bytes/tensor_query.cuh` | New `__stride_order` computes stable stride-order permutations; `__sort_by_stride` now delegates to it. |
| Tensor iterator coordinate mapping `cudax/include/cuda/experimental/__copy/tensor_iterator.cuh` | Removes extent-product precomputation; uses unsigned extents and iterative `div`-based coordinate reduction; extracts `__partial_tensor.__offset`. |
| Shared-memory tiling decision logic `cudax/include/cuda/experimental/__copy/copy_shared_memory_utils.cuh` | Adds `__narrow_raw_tensor_rank`, `__shared_mem_tiling_result`, `__add_coalesced_tile_run`; `__find_shared_mem_tiling` now computes source/destination-aware tiling and XOR-swizzle eligibility; `__use_shared_mem_kernel` uses tiling validity. |
| Shared-memory kernel with XOR swizzle and permuted paths `cudax/include/cuda/experimental/__copy/copy_shared_memory.cuh` | Adds `__smem_offset` helper and `_UseXorSwizzle` template param; kernel accepts separate src/dst smem stride descriptors and permutation iterators; load/store loops and host launcher rewritten to use tiling result and conditional swizzle. |
| Copy dispatch, optimization, and cleanup `cudax/include/cuda/experimental/__copy/copy_optimized.cuh`, `cudax/include/cuda/experimental/__copy/mdspan_d2d.cuh`, `cudax/include/cuda/experimental/__copy/tensor_copy_utils.cuh` | Adds `__max_rank >= 2` guard and rank-2 fast path for shared-memory dispatch; adds early-return for small extent types in optimized kernel; removes unused includes. |
| Benchmark infrastructure and template generalization `cudax/benchmarks/CMakeLists.txt`, `cudax/benchmarks/bench/copy/copy_bench.cu` | Disables assertions in benchmark builds; `compute_alloc` and `bench_copy` templated on `data_t`/`idx_t` with `size_t` offsets; benchmark instantiations updated to new signatures. |
| Shared-memory tests `cudax/test/copy/copy_shared_memory.cu` | Replaces earlier 2D tests with large 2D cases and adds three 3D tests (full, partial, padded) with adjusted alloc/stride calculations. |
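
As a rough picture of what a stable stride-order permutation means, the sketch below sorts dimension indices by stride while keeping tied dimensions in their original order. It is an assumption-laden illustration, not the `__stride_order` implementation: the name `stride_order` and the ascending direction are choices made here for the example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>

// Returns the permutation that lists dimensions from smallest to largest stride.
// std::stable_sort keeps dimensions with equal strides in their original order.
template <std::size_t Rank>
std::array<std::size_t, Rank> stride_order(const std::array<std::size_t, Rank>& strides)
{
  std::array<std::size_t, Rank> order{};
  std::iota(order.begin(), order.end(), std::size_t{0});
  const auto by_stride = [&strides](std::size_t a, std::size_t b) {
    return strides[a] < strides[b];
  };
  std::stable_sort(order.begin(), order.end(), by_stride);
  // Debug-only postcondition: the permutation is ordered by stride.
  assert(std::is_sorted(order.begin(), order.end(), by_stride));
  return order;
}
```

A permutation like this lets source and destination tensors be compared dimension by dimension in order of memory locality, which is the kind of information a tiling decision needs.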

Suggested labels: libcu++
Suggested reviewers:
- gonidelis

coderabbitai

Actionable comments posted: 1

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c248b761-c5a1-4a50-9627-641679ac2294

📥 Commits

Reviewing files that changed from the base of the PR and between 19e3eb3 and c0f7f49.

📒 Files selected for processing (9)

cudax/benchmarks/CMakeLists.txt
cudax/benchmarks/bench/copy/copy_bench.cu
cudax/include/cuda/experimental/__copy/copy_optimized.cuh
cudax/include/cuda/experimental/__copy/copy_shared_memory.cuh
cudax/include/cuda/experimental/__copy/copy_shared_memory_utils.cuh
cudax/include/cuda/experimental/__copy/mdspan_d2d.cuh
cudax/include/cuda/experimental/__copy/tensor_copy_utils.cuh
cudax/include/cuda/experimental/__copy/tensor_iterator.cuh
cudax/include/cuda/experimental/__copy_bytes/tensor_query.cuh

💤 Files with no reviewable changes (1)

cudax/include/cuda/experimental/__copy/tensor_copy_utils.cuh

github-actions · 2026-06-02T19:41:53Z

🥳 CI Workflow Results

🟩 Finished in 30m 59s: Pass: 100%/55 | Total: 9h 28m | Max: 30m 59s | Hits: 83%/38622

fbusato added 9 commits May 26, 2026 14:59

disable assertions in benchmarks

1c09da5

optimize/simplify tensor iterator

0c8f2c4

add optimized path for rank-2

aec29a2

early exit for 32-bit loop

ce7dff2

unify __stride_order

fa5cc02

add shared memory permutation in both src and dst

fdb1299

add Xor swizzle and destination permutation

bc48344

refactor copy benchmarks to get larger sizes and stress edge cases

c9961e6

nits

c0f7f49

fbusato self-assigned this May 27, 2026

fbusato requested review from a team as code owners May 27, 2026 00:14

fbusato added this to CCCL May 27, 2026

fbusato added the `cudax` label (Feature intended for the cudax experimental library) May 27, 2026

fbusato requested review from caugonnet and shwina May 27, 2026 00:14

github-project-automation Bot moved this to Todo in CCCL May 27, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 27, 2026

coderabbitai Bot reviewed May 27, 2026

Comment thread cudax/include/cuda/experimental/__copy/copy_optimized.cuh

fbusato added 2 commits May 26, 2026 17:49

fixes benchmark comments

1d97206

add a few extra test cases

68d370c

NaderAlAwar reviewed May 27, 2026

Comment thread cudax/include/cuda/experimental/__copy/copy_shared_memory_utils.cuh Outdated

Comment thread cudax/benchmarks/bench/copy/copy_bench.cu Outdated

address comments

23f4260

fbusato requested a review from NaderAlAwar May 27, 2026 16:51

NaderAlAwar approved these changes Jun 2, 2026

Comment thread cudax/benchmarks/bench/copy/copy_bench.cu Outdated

oleksandr-pavlyk reviewed Jun 2, 2026

Comment thread cudax/include/cuda/experimental/__copy/copy_shared_memory.cuh Outdated

address comments

9dddcd0

fbusato requested a review from oleksandr-pavlyk June 2, 2026 19:09

srinivasyadav18 approved these changes Jun 4, 2026

fbusato merged commit 316f9cc into NVIDIA:main Jun 4, 2026
77 checks passed

fbusato merged 13 commits into NVIDIA:main from fbusato:optimize-cudax-copy
