Skip to content

[STF] Fix host-thread data races in concurrent task submission#9186

Open
caugonnet wants to merge 6 commits into
NVIDIA:mainfrom
caugonnet:stf-thread-safety-fixes
Open

[STF] Fix host-thread data races in concurrent task submission#9186
caugonnet wants to merge 6 commits into
NVIDIA:mainfrom
caugonnet:stf-thread-safety-fixes

Conversation

@caugonnet
Copy link
Copy Markdown
Contributor

@caugonnet caugonnet commented May 29, 2026

Summary

Fixes two host-thread data races that could corrupt memory when several CPU threads submit tasks to the same CUDASTF context concurrently. Both were found with ThreadSanitizer plus a multithreaded stress test, and the second one is a plausible root cause for intermittent segfaults seen in the axpy-threads-2 test.

  • async_resources_handle cross-stream sync cachelast_event_per_stream::validate_sync_and_update read/wrote its interactions map without holding the mutex that is declared right next to it. A concurrent rehash (triggered when a new stream pair is inserted) could corrupt buckets being traversed by another thread. Now guarded by the existing mutex.

  • exec_place device singletonsexec_place::device() and exec_place_device::impl::get_place() re-wrapped a raw pointer to an enable_shared_from_this object in a fresh shared_ptr on every call. Each such construction writes the object's internal weak-this member, so concurrent calls (this runs on every task submission, from every user thread) raced on it. We now build one process-global shared_ptr per device exactly once in the thread-safe function-local static initializer and hand out copies, and use shared_from_this() in get_place(). Copying a shared_ptr only touches the atomic refcount, which is thread-safe.

Also documents the handle's thread-safety / lifetime / CUDA-teardown contract.

Test plan

  • CI green (notably the multithreaded cudax.cpp20.test.stf.threads.* tests, e.g. axpy-threads-2).
  • ThreadSanitizer reports no races on the previously-failing paths after the fix.

Two data races could corrupt memory when several host threads submit
tasks to the same context concurrently:

- async_resources_handle: last_event_per_stream::validate_sync_and_update
  read/wrote the cross-stream synchronization map without holding its
  declared mutex, so a concurrent rehash (triggered by a new stream pair)
  could corrupt buckets being traversed by another thread. Guard the map
  with the mutex.

- exec_place::device() / exec_place_device::impl::get_place(): each call
  re-wrapped a raw pointer to an enable_shared_from_this object in a fresh
  shared_ptr, racing on the object's internal weak-this member. Build one
  process-global shared_ptr per device in the thread-safe function-local
  static initializer and hand out copies; use shared_from_this() in
  get_place(). Copying only touches the atomic refcount.

Both were found via ThreadSanitizer plus a multithreaded stress test.
Also documents the handle's thread-safety / lifetime / CUDA-teardown
contract.
@caugonnet caugonnet requested a review from a team as a code owner May 29, 2026 11:44
@caugonnet caugonnet requested a review from andralex May 29, 2026 11:44
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 29, 2026
@copy-pr-bot
Copy link
Copy Markdown
Contributor

copy-pr-bot Bot commented May 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 29, 2026
@caugonnet
Copy link
Copy Markdown
Contributor Author

/ok to test a4cdf58

@caugonnet caugonnet self-assigned this May 29, 2026
@caugonnet caugonnet added the stf Sequential Task Flow programming model label May 29, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 29, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3a6921a2-841e-4033-8ba3-659e1b2ceeb2

📥 Commits

Reviewing files that changed from the base of the PR and between 25e2971 and 7f5dfdd.

📒 Files selected for processing (1)
  • cudax/include/cuda/experimental/__places/places.cuh
🚧 Files skipped from review as they are similar to previous changes (1)
  • cudax/include/cuda/experimental/__places/places.cuh

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes

    • Improved thread-safety for concurrent host-thread operations by adding mutex-protected access to internal caches to prevent races during task submission.
  • Performance

    • Added per-device cached instances to reduce overhead on repeated device access and lower redundant work on hot paths.
  • Documentation

    • Expanded guidance on concurrent usage patterns, resource lifetime, and synchronization requirements.

suggestion:

Walkthrough

Per-device shared_ptr instances are cached in exec_place::device() and exec_place_device::impl::get_place() now returns shared_from_this(). async_resources_handle adds <mutex>, clarifies concurrency/lifetime in docs, and guards interactions map access with std::lock_guard in validate_sync_and_update.

Changes

Per-device shared_ptr caching in exec_place

Layer / File(s) Summary
Per-device shared_ptr caching and shared_from_this adoption
cudax/include/cuda/experimental/__places/places.cuh
exec_place::device(int devid) allocates and caches a function-local static array of per-device shared_ptr<exec_place::impl> instances and returns impls[devid]. exec_place_device::impl::get_place(size_t idx) returns shared_from_this() rather than creating a new shared_ptr with a no-op deleter.

Thread-safety in async_resources_handle

Layer / File(s) Summary
Include, docs, and mutex-guarded interactions map
cudax/include/cuda/experimental/__stf/internal/async_resources_handle.cuh
Added <mutex>; expanded async_resources_handle docs to describe concurrent host-thread submissions and CUDA-context teardown lifetime/synchronization; last_event_per_stream::validate_sync_and_update acquires a std::lock_guard on the per-instance mutex before accessing the interactions unordered_map.

Suggested labels

stf

Suggested reviewers

  • andralex
  • alliepiper
  • davebayer

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6e56a269-c9fc-43ef-8e97-8e210bad7289

📥 Commits

Reviewing files that changed from the base of the PR and between ca4c5a6 and a4cdf58.

📒 Files selected for processing (2)
  • cudax/include/cuda/experimental/__places/places.cuh
  • cudax/include/cuda/experimental/__stf/internal/async_resources_handle.cuh

The file uses ::std::lock_guard<::std::mutex> and declares a ::std::mutex
member but relied on a transitive include for <mutex>. Include it directly.
@caugonnet caugonnet enabled auto-merge (squash) May 29, 2026 12:02
@caugonnet
Copy link
Copy Markdown
Contributor Author

/ok to test 164ee89

Comment thread cudax/include/cuda/experimental/__places/places.cuh Outdated
Comment thread cudax/include/cuda/experimental/__places/places.cuh Outdated
@github-actions

This comment has been minimized.

caugonnet and others added 2 commits May 29, 2026 15:57
Co-authored-by: David Bayer <48736217+davebayer@users.noreply.github.com>
Comment thread cudax/include/cuda/experimental/__places/places.cuh Outdated
Comment thread cudax/include/cuda/experimental/__places/places.cuh
Co-authored-by: Jacob Faibussowitsch <jacob.fai@gmail.com>
Copy link
Copy Markdown
Contributor

@andralex andralex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow so subtle. Thanks!

@andralex
Copy link
Copy Markdown
Contributor

/ok to test 3251d70

@github-actions
Copy link
Copy Markdown
Contributor

😬 CI Workflow Results

🟥 Finished in 33m 40s: Pass: 12%/55 | Total: 3h 23m | Max: 33m 40s | Hits: 100%/1682

See results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stf Sequential Task Flow programming model

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

4 participants