feat(profiling): internal metrics for overhead #3616
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##           master    #3616      +/-   ##
==========================================
- Coverage   62.21%   62.11%   -0.11%
==========================================
  Files         141      141
  Lines       13387    13387
  Branches     1753     1753
==========================================
- Hits         8329     8315      -14
- Misses       4260     4273      +13
- Partials      798      799       +1
```

3 files have indirect coverage changes.
Benchmarks [profiler]

Benchmark execution time: 2026-02-03 19:57:47. Comparing candidate commit aed7a13 in PR branch. Found 2 performance improvements and 3 performance regressions! Performance is the same for 24 metrics, 7 unstable metrics.

- scenario: php-profiler-timeline-memory-with-profiler
- scenario: php-profiler-timeline-memory-with-profiler-and-timeline
…ddprof_upload` for current profile exported Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
I was reviewing this and made a few additions that build on your work in #3618; let me know what you think.
* feat(profiling): internal metrics for overhead
* feat(profiling): move CPU time capture to include serialization for `ddprof_upload` for current profile exported
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* feat(profiling): add CPU time tracking for `ddprof_time` thread
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* feat(profiling): separate CPU time tracking per background thread
  Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

---------
Co-authored-by: Florian Engelhardt <florian.engelhardt@datadoghq.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* Adds process tags to profiler uploader
* remove useless utils function
* remove empty lines and fix spelling
* add function to ddtrace.sym
* feat(CI: installer tests): fix installer tests by changing enabling check on appsec extension (#3604)
  Signed-off-by: Alexandre Rulleau <alexandre.rulleau@datadoghq.com>
* refactor(profiling): use module globals for ZMM state (#3608)
  * refactor(profiling): use module globals for ZMM state
  * style: fix clippy warnings
  * Apply suggestions from code review
    Co-authored-by: Florian Engelhardt <florian.engelhardt@datadoghq.com>
  * docs: note ZTS vs NTS differences
  ---------
  Co-authored-by: Florian Engelhardt <florian.engelhardt@datadoghq.com>
* refactor(profiling): extract Backtrace type (#3612)
  * refactor(profiling): extract Backtrace type
    In a future change, this may hold a refcount for another object, so we need to encapsulate it.
  * fix `test_collect_stack_sample` not running
  ---------
  Co-authored-by: Florian Engelhardt <florian.engelhardt@datadoghq.com>
* Propagate RELIABILITY_ENV_BRANCH to downstream pipeline (#3605)
* Add simple_onboarding_appsec to SSI system tests (#3617)
* Stores remote config requests in request-replayer (#3585)
* feat(profiling): internal metrics for overhead (#3616)
  * feat(profiling): internal metrics for overhead
  * feat(profiling): move CPU time capture to include serialization for `ddprof_upload` for current profile exported
    Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
  * feat(profiling): add CPU time tracking for `ddprof_time` thread
    Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
  * feat(profiling): separate CPU time tracking per background thread
    Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
  ---------
  Co-authored-by: Florian Engelhardt <florian.engelhardt@datadoghq.com>
  Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* fix(tracing): hook is_internal was backwards (#3625)
* Fix phpt asm standalone tests (#3628)
* fix readme file links (#3610)
* test(language-tests): properly skip online tests and disabled soap_qname_crash.phpt on all version (#3632)
  Signed-off-by: Alexandre Rulleau <alexandre.rulleau@datadoghq.com>
* Collect framework endpoints (#3548)
* Only run publishing jobs when all dependent pipelines succeed (#3635)
  Signed-off-by: Bob Weinand <bob.weinand@datadoghq.com>
* chore(profiling): update libdatadog to 26 (#3633)
* test(CI): manually handle git operation for windows jobs (#3634)
  * test(CI): add aggressive git cleanup on windows runner
    Signed-off-by: Alexandre Rulleau <alexandre.rulleau@datadoghq.com>
  * test(CI): add manual cleanup in before_script step
    Signed-off-by: Alexandre Rulleau <alexandre.rulleau@datadoghq.com>
  ---------
  Signed-off-by: Alexandre Rulleau <alexandre.rulleau@datadoghq.com>
* feat(CI): add healthcheck to SQLSRV server setup (#3619)
  * feat(CI): add healthcheck to SQLSRV server setup
    Signed-off-by: Alexandre Rulleau <alexandre.rulleau@datadoghq.com>
  * chore: add troubleshooting script for SQLSRV
    Signed-off-by: Alexandre Rulleau <alexandre.rulleau@datadoghq.com>
  * feat: add explicit memory limit and paths
    Signed-off-by: Alexandre Rulleau <alexandre.rulleau@datadoghq.com>
  * chore: replace sqlsrv docker image
    Signed-off-by: Alexandre Rulleau <alexandre.rulleau@datadoghq.com>
  ---------
  Signed-off-by: Alexandre Rulleau <alexandre.rulleau@datadoghq.com>
* fix(CI: test_metrics): add explicit flush in logging (#3637)
  * fix(logging): fsync crash logs before _Exit() to prevent data loss
    When a SIGSEGV occurs, the signal handler logs "Segmentation fault encountered" and then calls _Exit() which terminates the process immediately. Without fsync(), kernel write buffers may not be flushed to disk before termination, causing a race condition where the error log file is sometimes not created. This fix adds fsync() on Unix/Linux and _commit() on Windows after write() in ddtrace_log_with_time() to ensure crash logs persist to disk before process termination. The issue affects production (rare but possible during power loss, kernel panic, or I/O errors) and causes consistent test failures where tests check for log files immediately after crashes (before kernel writeback completes). Fixes flaky test_metrics SigSegVTest::testGet failures on Kubernetes where dd_php_error.log was not being created consistently.
  * fix(signals): move flush in sigsegv handler
    Signed-off-by: Alexandre Rulleau <alexandre.rulleau@datadoghq.com>
  ---------
  Signed-off-by: Alexandre Rulleau <alexandre.rulleau@datadoghq.com>
* Adds process_tags to live debugger payloads (#3580)
  * init process tags for APM
    Co-Authored-By: PROFeNoM <alexandre.choura@datadoghq.com>
  * feat(process_tags): add process_tags to tracing payloads
  * small auto review and fix test
  * bwoebi review
  * fix test
  * Adds process_tags to live debugger payloads
  * temporary libdatadog bump
  * auto review
  * bump libdatadog
  * fix build
  * update makefile && make cbindgen
  * fixing test
  * fixing test
  * fix appsec tests
  ---------
  Co-authored-by: PROFeNoM <alexandre.choura@datadoghq.com>
* chore(profiling): update libdatadog 26 to 27 (#3640)
  * chore(profiling): update libdatadog 26 to 27
  * process tags were removed while rebasing to sign commit

---------
Signed-off-by: Alexandre Rulleau <alexandre.rulleau@datadoghq.com>
Signed-off-by: Bob Weinand <bob.weinand@datadoghq.com>
Co-authored-by: Florian Engelhardt <florian.engelhardt@datadoghq.com>
Co-authored-by: Alexandre Rulleau <55387832+Leiyks@users.noreply.github.com>
Co-authored-by: Levi Morrison <levi.morrison@datadoghq.com>
Co-authored-by: Laplie Anderson <randomanderson@users.noreply.github.com>
Co-authored-by: Alejandro Estringana Ruiz <alejandro.estringanaruiz@datadoghq.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Bob Weinand <bob.weinand@datadoghq.com>
Co-authored-by: PROFeNoM <alexandre.choura@datadoghq.com>
Description
This is an alternative to #3591. Instead of putting the time we spend stack walking into the timeline and flamegraph, we aggregate it into counters, which we emit as internal metrics. This addresses some of the concerns I received about that PR while still providing something we believe is useful: the ability to understand our overhead for a specific profile. This internal metric could be re-exported in the future if we want to do per-organization or other types of aggregation, for example to determine which customers are incurring the highest amounts of overhead.
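As a rough illustration of the counter approach (not the profiler's actual implementation: `STACK_WALK_NANOS`, `timed_stack_walk`, and `emit_overhead_metric` are made-up names, and it uses wall-clock `Instant` where the PR aggregates CPU time), a minimal Rust sketch could look like this:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

// Nanoseconds spent walking stacks since the last profile export.
// Hypothetical counter; the real profiler's names and storage may differ.
static STACK_WALK_NANOS: AtomicU64 = AtomicU64::new(0);

// Run a stack walk and add its duration to the overhead counter.
fn timed_stack_walk<T>(walk: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = walk();
    STACK_WALK_NANOS.fetch_add(start.elapsed().as_nanos() as u64, Ordering::Relaxed);
    result
}

// At export time, drain the counter and report it as an internal metric
// (printed here; the real profiler would attach it to the uploaded profile).
fn emit_overhead_metric() {
    let nanos = STACK_WALK_NANOS.swap(0, Ordering::Relaxed);
    println!("internal metric: stack_walk_time_ns = {nanos}");
}

fn main() {
    let _frames = timed_stack_walk(|| vec!["main", "handler"]);
    emit_overhead_metric();
}
```

Aggregating into a counter like this keeps the overhead data out of the flamegraph and timeline while still tying it to the specific profile that gets exported.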
In addition to collecting the CPU time spent walking the stack, this also aggregates the CPU time spent in our two background threads, `ddprof_time` and `ddprof_upload`. The former is a bit of a misnomer: it also aggregates CPU samples; it doesn't just track time.

Manually verified it's working (practically idle service); this was captured on a slightly older version of the PR, so it looks a little different now.
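For context on how per-thread CPU time can be measured, the sketch below reads the calling thread's CPU clock via `CLOCK_THREAD_CPUTIME_ID` on Unix. It is illustrative only: `thread_cpu_time` is a hypothetical helper, it assumes the `libc` crate, and it is not the profiler's actual code.

```rust
// Assumes a dependency on the `libc` crate; Unix-only.
use std::time::Duration;

// CPU time consumed so far by the calling thread.
fn thread_cpu_time() -> Duration {
    let mut ts: libc::timespec = unsafe { std::mem::zeroed() };
    // SAFETY: `ts` is a valid, writable timespec.
    let rc = unsafe { libc::clock_gettime(libc::CLOCK_THREAD_CPUTIME_ID, &mut ts) };
    assert_eq!(rc, 0, "clock_gettime(CLOCK_THREAD_CPUTIME_ID) failed");
    Duration::new(ts.tv_sec as u64, ts.tv_nsec as u32)
}

fn main() {
    // Snapshot the thread CPU clock around a unit of background work and
    // accumulate the delta, similar in spirit to what a background thread
    // such as ddprof_time or ddprof_upload could do per iteration.
    let before = thread_cpu_time();
    let _busy: u64 = (0..1_000_000u64).sum(); // placeholder workload
    let after = thread_cpu_time();
    println!("thread CPU time used: {:?}", after - before);
}
```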
Reviewer checklist