[STF] Fix host-thread data races in concurrent task submission#9186
[STF] Fix host-thread data races in concurrent task submission#9186caugonnet wants to merge 6 commits into
Conversation
Two data races could corrupt memory when several host threads submit tasks to the same context concurrently: - async_resources_handle: last_event_per_stream::validate_sync_and_update read/wrote the cross-stream synchronization map without holding its declared mutex, so a concurrent rehash (triggered by a new stream pair) could corrupt buckets being traversed by another thread. Guard the map with the mutex. - exec_place::device() / exec_place_device::impl::get_place(): each call re-wrapped a raw pointer to an enable_shared_from_this object in a fresh shared_ptr, racing on the object's internal weak-this member. Build one process-global shared_ptr per device in the thread-safe function-local static initializer and hand out copies; use shared_from_this() in get_place(). Copying only touches the atomic refcount. Both were found via ThreadSanitizer plus a multithreaded stress test. Also documents the handle's thread-safety / lifetime / CUDA-teardown contract.
|
/ok to test a4cdf58 |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughSummary by CodeRabbit
suggestion: WalkthroughPer-device ChangesPer-device shared_ptr caching in exec_place
Thread-safety in async_resources_handle
Suggested labels
Suggested reviewers
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6e56a269-c9fc-43ef-8e97-8e210bad7289
📒 Files selected for processing (2)
cudax/include/cuda/experimental/__places/places.cuhcudax/include/cuda/experimental/__stf/internal/async_resources_handle.cuh
The file uses ::std::lock_guard<::std::mutex> and declares a ::std::mutex member but relied on a transitive include for <mutex>. Include it directly.
|
/ok to test 164ee89 |
This comment has been minimized.
This comment has been minimized.
Co-authored-by: David Bayer <48736217+davebayer@users.noreply.github.com>
Co-authored-by: Jacob Faibussowitsch <jacob.fai@gmail.com>
|
/ok to test 3251d70 |
😬 CI Workflow Results🟥 Finished in 33m 40s: Pass: 12%/55 | Total: 3h 23m | Max: 33m 40s | Hits: 100%/1682See results here. |
Summary
Fixes two host-thread data races that could corrupt memory when several CPU threads submit tasks to the same CUDASTF context concurrently. Both were found with ThreadSanitizer plus a multithreaded stress test, and the second one is a plausible root cause for intermittent segfaults seen in the
axpy-threads-2test.async_resources_handlecross-stream sync cache —last_event_per_stream::validate_sync_and_updateread/wrote itsinteractionsmap without holding themutexthat is declared right next to it. A concurrent rehash (triggered when a new stream pair is inserted) could corrupt buckets being traversed by another thread. Now guarded by the existing mutex.exec_placedevice singletons —exec_place::device()andexec_place_device::impl::get_place()re-wrapped a raw pointer to anenable_shared_from_thisobject in a freshshared_ptron every call. Each such construction writes the object's internal weak-this member, so concurrent calls (this runs on every task submission, from every user thread) raced on it. We now build one process-globalshared_ptrper device exactly once in the thread-safe function-local static initializer and hand out copies, and useshared_from_this()inget_place(). Copying ashared_ptronly touches the atomic refcount, which is thread-safe.Also documents the handle's thread-safety / lifetime / CUDA-teardown contract.
Test plan
cudax.cpp20.test.stf.threads.*tests, e.g.axpy-threads-2).