Skip to content

fix(profiler): lock-free class/endpoint/context maps via TripleBufferedDictionary#524

Draft
jbachorik wants to merge 44 commits into
mainfrom
muse/crash-sigsegv-in-std-rb-tree-increment-clean
Draft

fix(profiler): lock-free class/endpoint/context maps via TripleBufferedDictionary#524
jbachorik wants to merge 44 commits into
mainfrom
muse/crash-sigsegv-in-std-rb-tree-increment-clean

Conversation

@jbachorik
Copy link
Copy Markdown
Collaborator

@jbachorik jbachorik commented May 12, 2026

What does this PR do?:

Replaces the SpinLock-guarded Dictionary instances for _class_map, _string_label_map, and _context_value_map with a new TripleBufferedDictionary that eliminates all locking from the read/write fast paths.

TripleBufferedDictionary holds three Dictionary buffers cycling through three roles via a generic TripleBufferRotator<T> template:

  • active — receives new writes (signal handlers + JNI threads), lock-free via CAS
  • dump — snapshot being read by the dump thread; promoted from old active on rotate()
  • scratch — two rotations behind active; ready to be cleared lock-free

The "scratch" role exists for safe lock-free reclamation: when a buffer enters that role, at least one full dump cycle has elapsed since it was last in the active or dump role. That grace period is much longer than any signal-handler (per-thread-locked, drained by lockAll() around the dump) or JNI-thread (microsecond lookup) can plausibly outlive a stale active pointer, so the buffer can be freed without any explicit drain.

bounded_lookup(size_limit=0) is signal-safe (no malloc) and checks the active buffer only — there is no fallback to older snapshots.

As part of this change the following dead code is removed:

  • _class_map_lock (SpinLock)
  • classMapSharedGuard() / classMapTrySharedGuard() on Profiler
  • tryLockSharedBounded() and BoundedOptionalSharedLockGuard on SpinLock

Motivation:

Three production crashes (fingerprint v10.DAECC680F0728EAB44F26DB0B91B703F, 2026-05-06 to 2026-05-08) showed SIGSEGV in std::_Rb_tree_increment via Recording::writeCpoolRecording::writeClassesDictionary::collect, caused by a race between writeClasses and concurrent Dictionary::clear().

PR #516 patched this with a shared-lock, but that introduced tryLockSharedBounded(5) in the signal-handler path (walkVM). Under heavy 100 µs wall-clock load on aarch64 the bounded CAS retries were consistently exhausted, causing class lookups to return -1 and corrupting JFR recordings.

This PR also fixes a related counter-tracking gap: dictionary_classes_keys was always 0 during wall-clock profiling because fill-path inserts went to a buffer with counter id=0. All three buffers now carry the real id.

Note: walkVM's vtable-stub class resolution remains best-effort (it can only find classes that some other path has already inserted into the active buffer); a proper fix would require pre-populating the dictionary via JVMTI ClassPrepare, which is left to a follow-up.

Supersedes PR #522 (fix(profiler): fix SIGSEGV in Dictionary::clear under concurrent lookup).

Additional Notes:

  • The grace period for safe lock-free clearing is the time between two consecutive clearStandby() calls (one full dump interval, typically ≥60s). This is many orders of magnitude longer than any signal-handler or JNI lookup, so explicit drains (RefCountGuard, lockAll) are unnecessary for the dictionary clear path.
  • At most two non-empty buffers exist at any time (active + dump).

How to test the change?:

  • ddprof-lib:gtestDebug_dictionary_ut — covers TripleBufferedDictionary rotation, counter semantics, and concurrent writer safety.
  • DictionaryRotationTest (Java) — counter reset after clearStandby; correct counts after fill-path inserts.

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.
  • JIRA: [JIRA-XXXX]

@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts Bot commented May 12, 2026

CI Test Results

Run: #26080026842 | Commit: 32c135e | Duration: 32m 42s (longest job)

All 32 test jobs passed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0


Updated: 2026-05-19 06:55:51 UTC

@jbachorik jbachorik force-pushed the muse/crash-sigsegv-in-std-rb-tree-increment-clean branch 2 times, most recently from 2f23bab to 132b472 Compare May 13, 2026 07:42
@jbachorik jbachorik changed the title fix(profiler): lock-free class/endpoint/context maps via DoubleBufferedDictionary fix(profiler): lock-free class/endpoint/context maps via TripleBufferedDictionary May 13, 2026
jbachorik and others added 2 commits May 13, 2026 17:38
Production crash (SIGSEGV) in Recording::cleanupUnreferencedMethods,
first seen in dd-trace-java 1.56.1 after PR #327 introduced method-map
cleanup. cleanupUnreferencedMethods() was called after finishChunk()
released the GetLoadedClasses pins, so jvmti->Deallocate(_ptr) inside
~SharedLineNumberTable could access freed line number table memory on
JVMs that reclaim JVMTI allocations on class unload.

Fix: detach SharedLineNumberTable from JVMTI lifetime by copying the
table into a malloc'd buffer in Lookup::fillJavaMethodInfo() and freeing
the JVMTI allocation immediately. SharedLineNumberTable destructor now
calls free(). As defence-in-depth, finishChunk() gains a do_cleanup
parameter so cleanup runs inside the GetLoadedClasses pin window.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On musl/aarch64/JDK11, HotSpot's deoptimisation blob
(generate_deopt_blob in sharedRuntime_aarch64.cpp) rebuilds interpreter
frames near the compiled frame's stack boundary, corrupting the top
~224 bytes of the thread stack where start_routine_wrapper_spec's frame
lives. Two crashes follow:

(a) -fstack-protector-strong inserts a canary into any frame with a
    non-trivially-destructed local (e.g. struct Cleanup); the canary
    lands in the corruption zone and fires __stack_chk_fail.
(b) Even without a canary, 'return' loads the corrupted saved LR and
    jumps to a garbage address.

Fix: no_stack_protector removes the canary; pthread_exit() replaces
'return' so LR is never used; cleanup is performed explicitly with
the tid read from TLS (ProfiledThread::currentTid()), which survives
frame corruption.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jbachorik jbachorik force-pushed the muse/crash-sigsegv-in-std-rb-tree-increment-clean branch from 7844134 to b90761e Compare May 13, 2026 16:15
@jbachorik jbachorik changed the base branch from main to muse/sigsegv-in-recording May 13, 2026 16:15
JavaThread::~JavaThread / OSThread::~OSThread crashed on JDK 25 when the
ddprof pthread_create hook delivered SIGVTALRM between
Profiler::unregisterThread() returning and ProfiledThread::release()
acquiring its internal guard. The signal handler called
currentSignalSafe() and dereferenced the now-freed ProfiledThread.

Fix: extract unregister_and_release(tid) — a noinline helper that holds
a SignalBlocker for the entire unregister+release sequence. Both
start_routine_wrapper and start_routine_wrapper_spec invoke it; the
race window is eliminated without duplicating signal-masking logic.

Same SignalBlocker pattern is applied to perfEvents_linux.cpp's
pthread_setspecific_hook teardown path.

thread.h guards clearCurrentThreadTLS() with #ifdef UNIT_TEST so it
is absent from production builds; GtestTaskBuilder.kt adds -DUNIT_TEST
to the gtest compiler flags so the guarded method compiles in tests.

thread_teardown_safety_ut.cpp adds an acceptance-test suite
(ThreadTeardownSafetyTest T-01..T-10) covering the full teardown
lifecycle under signal load.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jbachorik jbachorik force-pushed the muse/crash-sigsegv-in-std-rb-tree-increment-clean branch from b90761e to 76d919d Compare May 13, 2026 16:52
jbachorik and others added 2 commits May 14, 2026 13:13
ASSERT_NE expands to a bare 'return;' on failure, which is a compile
error in a function whose return type is void*. Use ADD_FAILURE +
explicit 'return nullptr;' instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ufferedDictionary

Replaces the SpinLock-guarded Dictionary instances for _class_map,
_string_label_map, and _context_value_map with a new TripleBufferedDictionary
that eliminates all locking from the read/write fast paths.

TripleBufferedDictionary holds three Dictionary buffers cycling through three
roles via a generic TripleBufferRotator<T> template:
- active   — receives new writes (signal handlers + JNI threads), lock-free via CAS
- dump     — snapshot being read by the dump thread; promoted from old active on rotate()
- scratch  — two rotations behind active; ready to be cleared lock-free

The scratch role exists for safe lock-free reclamation: when a buffer enters
that role, at least one full dump cycle has elapsed since it was last in the
active or dump role.  That grace period is much longer than any signal-handler
or JNI-thread can plausibly outlive a stale active pointer, so the buffer can
be freed without any explicit drain.

bounded_lookup(size_limit=0) is signal-safe (no malloc) and checks the active
buffer only — no fallback to older snapshots.

Dead code removed:
- _class_map_lock (SpinLock)
- classMapSharedGuard() / classMapTrySharedGuard() on Profiler
- tryLockSharedBounded() / BoundedOptionalSharedLockGuard on SpinLock
- spinlock_bounded_ut.cpp / dictionary_concurrent_ut.cpp (subsumed by dictionary_ut.cpp)

Motivation: three production crashes (fingerprint v10.DAECC680F0728EAB44F26DB0B91B703F)
showed SIGSEGV in std::_Rb_tree_increment via writeCpool → writeClasses →
Dictionary::collect, caused by a race between writeClasses and concurrent
Dictionary::clear().  PR #516 patched it with a shared-lock that exhausted
bounded CAS retries under heavy 100 µs wall-clock load on aarch64, causing
class lookups to return -1 and corrupting JFR recordings.  This change
eliminates the lock entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jbachorik jbachorik force-pushed the muse/crash-sigsegv-in-std-rb-tree-increment-clean branch from 76d919d to 2105b61 Compare May 14, 2026 17:16
jbachorik and others added 3 commits May 15, 2026 12:01
- libraryPatcher_linux.cpp:151 — add __builtin_unreachable() after pthread_exit()
- flightRecorder.cpp:607 — update pin-window comment to reflect malloc'd ownership
- CleanupAfterClassUnloadTest.java:51 — fix Javadoc to describe both fix mechanisms
- thread_teardown_safety_ut.cpp:43 — add SigGuard RAII to restore signal dispositions
- thread_teardown_safety_ut.cpp:236 — gate T-07 on __GLIBC__; add musl cleanup path

Co-Authored-By: muse <muse@noreply>
- Add pthread_cleanup_push/pop + noinline cleanup_unregister() to
  start_routine_wrapper_spec so ProfiledThread is released when
  the wrapped routine calls pthread_exit() or the thread is canceled
- Extend CleanupAfterClassUnloadTest from AbstractDynamicClassTest to
  reuse generateClassBytecode/IsolatedClassLoader/tempFile helpers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jbachorik jbachorik force-pushed the muse/crash-sigsegv-in-std-rb-tree-increment-clean branch from 30a5959 to 6645120 Compare May 15, 2026 13:26
jbachorik and others added 2 commits May 15, 2026 15:35
- thread_teardown_safety_ut.cpp: guard T-01 sentinel with ASSERT_NE(kNotYetRun) to short-circuit if handler never ran; replace T-06 absolute mask assertions with relative before/after comparisons
- AbstractDynamicClassTest.java: split compound Label statements to separate lines for Spotless/google-java-format compliance

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jbachorik jbachorik requested a review from Copilot May 15, 2026 14:37
@jbachorik
Copy link
Copy Markdown
Collaborator Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 66451207c3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ddprof-lib/src/main/cpp/profiler.cpp Outdated
Comment thread ddprof-lib/src/main/cpp/dictionary.h Outdated
Comment thread ddprof-lib/src/main/cpp/dictionary.h Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Replaces the SpinLock-guarded Dictionary instances (_class_map, _string_label_map, _context_value_map) with a new TripleBufferedDictionary that rotates three Dictionary buffers (active / dump / scratch) under a generic TripleBufferRotator<T> template. Writes go to the active buffer lock-free; the dump thread reads standby() after rotate()/rotatePersistent() drains in-flight readers via RefCountGuard. The previous bounded shared-lock primitives in spinLock.h and the related classMap*Guard() factory methods are removed, and RefCountGuard is generalised (extracted to refCountGuard.h) to operate on void* so it can protect dictionaries as well as call-trace tables.

Changes:

  • New TripleBufferRotator<T> + TripleBufferedDictionary (with rotate, rotatePersistent, clearStandby, clearAll) and a new Dictionary::mergeFrom used by rotatePersistent.
  • Generalised RefCountGuard to void* and extracted it from callTraceStorage.h into refCountGuard.h; removed tryLockSharedBounded / BoundedOptionalSharedLockGuard and the _class_map_lock member.
  • Wired Profiler::dump/stop/start to rotate / clearStandby / clearAll, switched flightRecorder.cpp and hotspotSupport.cpp to read standby() / call bounded_lookup() directly without locking, replaced old gtest suites with a new dictionary_ut.cpp, added a Java DictionaryRotationTest, and disabled ContendedCallTraceStorageTest on musl/aarch64.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
ddprof-lib/src/main/cpp/tripleBuffer.h New generic 3-buffer rotator with atomic CAS rotate.
ddprof-lib/src/main/cpp/refCountGuard.h New header extracting RefCountGuard/RefCountSlot and generalising active_ptr to void*.
ddprof-lib/src/main/cpp/dictionary.h Adds TripleBufferedDictionary, Dictionary::mergeFrom, counterId(), size() accessors.
ddprof-lib/src/main/cpp/dictionary.cpp Implements mergeFrom (recursive re-insert via lookup).
ddprof-lib/src/main/cpp/spinLock.h Removes tryLockSharedBounded and BoundedOptionalSharedLockGuard.
ddprof-lib/src/main/cpp/profiler.h Replaces Dictionary members with TripleBufferedDictionary; drops _class_map_lock + guard factories.
ddprof-lib/src/main/cpp/profiler.cpp Switches start/stop/dump/lookupClass to triple-buffer ops; removes shared-lock dance.
ddprof-lib/src/main/cpp/flightRecorder.cpp writeCpool/writeClasses now read standby() snapshot without lock.
ddprof-lib/src/main/cpp/hotspot/hotspotSupport.cpp Removes classMapTrySharedGuard; uses bounded_lookup directly in vtable-stub path.
ddprof-lib/src/main/cpp/callTraceStorage.h/.cpp Removes the inline RefCountGuard definitions and updates the impl to void* resource.
ddprof-lib/src/test/cpp/dictionary_ut.cpp New gtest covering rotate / clearStandby / merge / counter id / concurrent writes.
ddprof-lib/src/test/cpp/dictionary_concurrent_ut.cpp Removed (covered the old shared-lock contract).
ddprof-lib/src/test/cpp/spinlock_bounded_ut.cpp Removed alongside the deleted bounded shared-lock primitive.
ddprof-test/.../DictionaryRotationTest.java New Java test asserting pre/post-dump separation and counter recalibration.
ddprof-test/.../BoundMethodHandleMetadataSizeTest.java Drops the (now unused) counter-size assertion.
ddprof-test/.../ContendedCallTraceStorageTest.java Skips on musl/aarch64 due to a separately tracked native bug.
AGENTS.md Adds musl/aarch64/JDK11 frame-corruption rule for start_routine_wrapper_spec.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ddprof-lib/src/main/cpp/dictionary.h Outdated
Comment thread ddprof-lib/src/main/cpp/profiler.cpp Outdated
Comment thread ddprof-lib/src/main/cpp/dictionary.h Outdated
Comment thread ddprof-lib/src/main/cpp/dictionary.h Outdated
Comment thread ddprof-lib/src/main/cpp/dictionary.cpp Outdated
Comment thread ddprof-lib/src/main/cpp/profiler.cpp Outdated
Comment thread ddprof-lib/src/main/cpp/dictionary.h Outdated
Comment thread ddprof-lib/src/main/cpp/profiler.cpp Outdated
jbachorik and others added 3 commits May 15, 2026 20:56
- Ctor: for reentrant path, increment count BEFORE overwriting active_ptr
  so the scanner always sees the outer resource while count is elevated
- Dtor: for reentrant path, restore saved outer active_ptr BEFORE
  decrementing count to eliminate the false-clear window
- Move ctor/assign: propagate _is_reentrant and _saved_ptr fields
- getThreadRefCountSlot(): return slot+MAX_THREADS on reentrant entry
  so callers can detect reentrancy and save/restore active_ptr
- profiler.cpp start(): add lockAll()+waitForAllRefCountsToClear() before
  clearAll() calls — lockAll() gates new signal entries, waitForAll()
  drains handlers already inside bounded_lookup() without any _locks[] slot
- waitForRefCountToClear(): emit Log::warn on timeout instead of silent no-op

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jbachorik and others added 3 commits May 18, 2026 12:37
…nary)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…MH test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ctionary

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jbachorik jbachorik requested a review from Copilot May 18, 2026 16:01
@jbachorik
Copy link
Copy Markdown
Collaborator Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9f8f6fc8c8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ddprof-lib/src/main/cpp/profiler.cpp Outdated
Comment thread ddprof-lib/src/main/cpp/libraryPatcher_linux.cpp Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 34 out of 34 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

ddprof-test/src/test/java/com/datadoghq/profiler/memleak/WriteStackTracesAfterClassUnloadTest.java:126

  • This test bypasses stopProfiler() from AbstractProfilerTest (which sets the stopped flag and prevents the @AfterEach hook from calling profiler.stop() again). After this finally block runs profiler.stop(), the base-class teardown will call stop() a second time on an already-stopped profiler. Other tests that drive the profiler manually (e.g. CleanupAfterClassUnloadTest) wrap the redundant stop in try { ... } catch (Exception ignored) {}; this test does not, so any exception from the double-stop will surface as a test failure even though the actual scenario succeeded. Consider routing through stopProfiler() or setting the equivalent guard.

Comment thread ddprof-lib/src/main/cpp/stringDictionary.h
Comment thread ddprof-lib/src/main/cpp/javaApi.cpp
jbachorik and others added 7 commits May 18, 2026 18:26
- profiler.cpp:1410 — rotate label maps under lockAll/unlockAll so the
  active→standby swap is atomic w.r.t. JNI writers and signal handlers,
  closing the writeCpool dangling-reference race; locks are released
  immediately after rotation so _jfr.dump() reads the stable standby
  concurrently with writers (preserves the triple-buffered design)
- libraryPatcher_linux.cpp — add noinline+no_stack_protector helper
  run_with_musl_cleanup so struct __ptcb lives outside the DEOPT
  corruption zone of start_routine_wrapper_spec
- stringDictionary.h — replace recursive freeTable/collectTable with
  iterative DFS bounded by overflow-chain depth (stk[34]) to avoid
  stack overflow on pathological hash-collision chains
- BoundMethodHandleMetadataSizeTest.java — drop blank line between
  java.lang.invoke.* and java.util.Map imports for google-java-format

Co-Authored-By: muse <muse@noreply>
rotate() spans three maps and is not signal-reentrant. lockAll() blocks
JNI writers and other threads' signal-handler tryLock paths, but does
not mask SIGPROF/SIGVTALRM on the dumping thread itself — an in-thread
async-signal handler could fire between rotates and observe partially-
swapped state across the class/string/context maps.

Wrap the rotate sequence in SignalBlocker so the swap is atomic from
every observer. Applies to both Profiler::stop() and Profiler::dump().

Co-Authored-By: muse <muse@noreply>
bounded_lookup is refcount-only and does not take _locks[], but
recordTraceRoot takes _locks[lock_index] via tryLock and writes the
TraceRootEvent into the per-thread JFR buffer. writeCpool runs at the
END of the chunk and only reads standby. So releasing lockAll between
rotate and _jfr.dump() leaves the door open for:

  1) JNI writer inserts Y into the new active (no lock taken)
  2) recordTraceRoot tryLock succeeds, TraceRootEvent(Y) lands in the
     per-thread JFR buffer
  3) _jfr.dump() flushes that buffer into the chunk
  4) writeCpool at chunk end reads standby, does not see Y
  => chunk contains an event referencing an ID missing from its cpool

Hold lockAll across rotate + _jfr.dump() (and rotate + _jfr.stop() in
Profiler::stop()). Dictionary inserts via bounded_lookup remain
lock-free — only the JFR event-write path is serialised, which was
already serialised by the existing lockAll/_jfr.dump() pairing before
this PR. Late TraceRootEvents now land in the NEXT chunk where their
Y is rotated into standby.

Co-Authored-By: muse <muse@noreply>
Error has no default constructor (only explicit Error(const char*)),
so 'Error err;' fails to compile. Initialize with Error::OK (NULL
message) to preserve the original semantics.
- fuzz_stringDictionary.cpp: libFuzzer target exercising the full op surface
  (lookup, bounded_lookup, lookupDuringDump, rotate, clearStandby, clearAll)
  with sequential invariant checks for id stability and dump consistency.
- StressStringDictionary.MultiDictionaryAtomicRotation: 3 dicts rotated
  atomically under a critical section, mirroring Profiler::dump().
- StressStringDictionary.ClearAllUnderConcurrentReaders: clearAll cycles
  under a shared_mutex barrier; ASan/TSan catch UAF on freed keys.
@jbachorik jbachorik added the test:asan Run CI tests with AddressSanitizer configuration label May 19, 2026
@datadog-prod-us1-6

This comment has been minimized.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jbachorik jbachorik force-pushed the muse/crash-sigsegv-in-std-rb-tree-increment-clean branch from 9f48944 to 9cf40a7 Compare May 19, 2026 11:51
@jbachorik jbachorik requested a review from Copilot May 19, 2026 11:52
@jbachorik
Copy link
Copy Markdown
Collaborator Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9cf40a7377

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +251 to +252
StringDictionaryBuffer* active = _rot.active();
RefCountGuard guard(active);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Retry dictionary access after a stale active pointer

If a thread loads active here and is preempted before the RefCountGuard increments its slot, a dump can rotate the dictionaries, wait without seeing that thread, and finish the standby snapshot; when the thread resumes it inserts into the old buffer and returns an ID that is no longer in the active dictionary. In recordTrace0/registerConstant0 that ID can then be written to the next JFR chunk after lockAll() is released, but the next chunk's constant pool is built from the current active buffer and will not contain the stale-buffer insert. This needs the same kind of post-guard revalidation/retry that CallTraceStorage::put() has before using the loaded pointer (and the same pattern applies to the bounded_lookup overloads below).

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: added a post-guard revalidation loop (if (_rot.active() != active) continue;) to all three lookup/bounded_lookup overloads, matching the retry pattern in CallTraceStorage::put().

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 28 out of 29 changed files in this pull request and generated 6 comments.

Comment on lines 15 to 37
assertFalse(counters.isEmpty());
// assert about the size of metadata here
assertFalse(counters.isEmpty(), "profiler debug counters must not be empty after BoundMethodHandle workload");
}
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: renamed class to BoundMethodHandleProfilerTest (file renamed accordingly).

protected boolean isPlatformSupported() {
return !Platform.isJ9(); // Avoid J9-specific issues
// CTimer::unregisterThread races with concurrent thread teardown on musl-aarch64 debug;
// tracked separately as a pre-existing native bug.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: added the tracking issue link https://github.com/DataDog/java-profiler/issues/534 to the skip comment.

Comment thread ddprof-lib/src/main/cpp/refCountGuard.h Outdated
Comment on lines +26 to +28
* full dump cycles (typically seconds). If the window were hit, the operation
* would return INT_MAX (miss) — an observable incorrect-state outcome that
* callers handle as a dropped trace or generic vtable frame, not a crash.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: removed the INT_MAX claim and replaced with a generic description — callers observe a miss (0 or equivalent sentinel) and handle it gracefully. Also updated "typically seconds" to "typically 60+ seconds" per your inline note.

EXPECT_EQ(1u, new_id);
}

TEST_F(StringDictionaryTest, LookupDuringDumpInsertsNewKeyIntoActivAndStandby) {
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: renamed to LookupDuringDumpInsertsNewKeyIntoActiveAndStandby.

Comment on lines +57 to +66
// dump() triggers: rotate() → lockAll() → jfr.dump(snapshot) → unlockAll()
// → clearStandby() (recalibrates counter to active size, frees scratch buffer)
Path snapshot = Files.createTempFile("DictionaryRotation_snapshot_", ".jfr");
try {
dump(snapshot);

// Counter recalibrated to active buffer size — 0 because no post-dump inserts yet
Map<String, Long> afterDump = profiler.getDebugCounters();
assertEquals(0L, afterDump.getOrDefault("dictionary_endpoints_keys", -1L),
"dictionary_endpoints_keys must be 0 after clearStandby");
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: updated both comment lines to accurately reflect clearStandby() unconditionally zeroing the counters.

Comment on lines +330 to +336
// Clear the scratch buffer (two rotations behind active; safe to clear).
// Also resets the live-insert counter so it reflects only the active buffer.
void clearStandby() {
_rot.clearTarget()->clear();
Counters::set(DICTIONARY_KEYS, 0, _counter_offset);
Counters::set(DICTIONARY_KEYS_BYTES, 0, _counter_offset);
}
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: updated the clearStandby() doc comment to match actual semantics — counters reset to 0, not recalibrated to active buffer size.

- Add post-guard revalidation loop to all three lookup/bounded_lookup
  overloads to fix TOCTOU race between active-pointer load and
  RefCountGuard construction
- Fix refCountGuard.h comment: remove INT_MAX claim, generalize to
  caller-defined miss sentinel, update "typically seconds" to 60+
- Rename BoundMethodHandleMetadataSizeTest -> BoundMethodHandleProfilerTest
- Fix test name typo: LookupDuringDumpInsertsNewKeyIntoActiv -> Active
- Add issue #534 link to ContendedCallTraceStorageTest skip comment
- Fix clearStandby() and DictionaryRotationTest comments to reflect
  actual zero-reset semantics

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI test:asan Run CI tests with AddressSanitizer configuration test:fuzz

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants