Add wait_event_timing: Oracle-style wait event instrumentation #1
Open

DmitryNFomin wants to merge 8 commits into master from
Conversation
Add per-backend wait event timing with nanosecond precision, controlled by the wait_event_timing GUC (default: off). When enabled, every pgstat_report_wait_start()/pgstat_report_wait_end() call records the wait duration using clock_gettime(CLOCK_MONOTONIC) and accumulates:

- Per-event call count
- Per-event total nanoseconds
- Per-event max duration
- Per-event log2 histogram (16 buckets: <1us to >=16ms)

Statistics are stored in per-backend shared memory arrays, requiring no locking (each backend writes only to its own slot). External tools can read accumulated stats via the pg_stat_get_wait_event_timing() function.

Overhead when enabled: two VDSO clock_gettime() calls per wait event transition (~40-100 ns), plus a few memory writes, comparable to Oracle's TIMED_STATISTICS. When disabled: one predictable branch per wait event (~1 ns), zero other cost.

This provides PostgreSQL with Oracle V$SYSTEM_EVENT / V$EVENT_HISTOGRAM equivalent functionality, enabling:

- Wait event time model (which events consume the most time)
- Latency histograms (bimodal distributions, tail latency)
- Per-backend wait profiles

Motivation: external BPF-based wait event tracers (pg_wait_tracer) use hardware watchpoints on PGPROC->wait_event_info, which incur 6-30% overhead due to CPU debug exceptions. Internal instrumentation via clock_gettime VDSO achieves equivalent functionality at ~1-5% overhead.

New files:
src/include/utils/wait_event_timing.h -- data structures, inline helpers
src/backend/utils/activity/wait_event_timing.c -- shmem init, SQL function

Modified files:
src/include/utils/wait_event.h -- timing calls in inline functions
src/backend/storage/lmgr/proc.c -- init/cleanup timing storage
src/backend/storage/ipc/ipci.c -- shared memory allocation
src/backend/utils/misc/guc_parameters.dat -- wait_event_timing GUC
src/include/catalog/pg_proc.dat -- pg_stat_get_wait_event_timing()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
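As a rough illustration of the accumulation scheme described above, a per-event slot with a 16-bucket log2 histogram can be maintained as in the sketch below. The identifiers (WaitEventTimingEntry, wait_hist_bucket, wait_timing_record) are hypothetical, not the patch's actual names; the bucket boundaries are powers of two in microseconds, so the ">=16ms" tail bucket actually starts at 16384 us.

```c
#include <stdint.h>

#define WAIT_HIST_BUCKETS 16

typedef struct WaitEventTimingEntry
{
	uint64_t	calls;		/* number of completed waits */
	uint64_t	total_ns;	/* accumulated wait time */
	uint64_t	max_ns;		/* longest single wait */
	uint64_t	hist[WAIT_HIST_BUCKETS];	/* log2 latency histogram */
} WaitEventTimingEntry;

/*
 * Map a duration to a log2 bucket: 0 = <1us, 1 = 1-2us, 2 = 2-4us, ...,
 * 14 = ~8-16ms, 15 = everything beyond (the ">=16ms" tail).
 */
int
wait_hist_bucket(uint64_t duration_ns)
{
	uint64_t	us = duration_ns / 1000;
	int			bucket = 0;

	while (us > 0 && bucket < WAIT_HIST_BUCKETS - 1)
	{
		us >>= 1;
		bucket++;
	}
	return bucket;
}

/*
 * Fold one finished wait into a per-backend, per-event slot.  No lock is
 * needed when each backend writes only to its own slot.
 */
void
wait_timing_record(WaitEventTimingEntry *e, uint64_t duration_ns)
{
	e->calls++;
	e->total_ns += duration_ns;
	if (duration_ns > e->max_ns)
		e->max_ns = duration_ns;
	e->hist[wait_hist_bucket(duration_ns)]++;
}
```

Because the accumulator is write-only from the owning backend, readers can scan the arrays without synchronization and tolerate slightly torn counters.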
When wait_event_timing is enabled, also accumulates wait stats per (query_id, wait_event) pair in a per-backend hash table (1024 slots, open addressing with linear probing). query_id is read from PgBackendStatus->st_query_id via a cached pointer set during pgstat_beinit(). Zero overhead when query_id is not set (most background processes).

New SQL function: pg_stat_get_wait_event_timing_by_query()
Returns (backend_id, query_id, wait_event_type, wait_event, calls, total_time_ms) for all backends with non-zero counts.

Shared memory: 32 KB per backend (1024 * 32-byte entries).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
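The per-(query_id, wait_event) table described above could look roughly like the sketch below, assuming query_id == 0 marks an empty slot (plausible given the "zero overhead when query_id is not set" note, but an assumption). All identifiers are illustrative, and the 32-byte entry size matches the 1024 * 32 B = 32 KB figure in the commit message.

```c
#include <stdint.h>
#include <stddef.h>

#define WAIT_QHASH_SLOTS 1024	/* power of two: 1024 * 32 B = 32 KB */

typedef struct WaitQueryEntry
{
	uint64_t	query_id;		/* 0 marks an empty slot */
	uint32_t	wait_event_info;
	uint32_t	unused;			/* pad to 32 bytes */
	uint64_t	calls;
	uint64_t	total_ns;
} WaitQueryEntry;

/*
 * Find or create the slot for (query_id, event) by open addressing with
 * linear probing; returns NULL when the table is full (sample dropped).
 */
WaitQueryEntry *
wait_qhash_lookup(WaitQueryEntry *table, uint64_t query_id, uint32_t event)
{
	uint32_t	h = (uint32_t) (query_id ^ (query_id >> 32)) ^ event;

	for (uint32_t i = 0; i < WAIT_QHASH_SLOTS; i++)
	{
		WaitQueryEntry *e = &table[(h + i) & (WAIT_QHASH_SLOTS - 1)];

		if (e->query_id == 0)
		{
			e->query_id = query_id;		/* claim the empty slot */
			e->wait_event_info = event;
			return e;
		}
		if (e->query_id == query_id && e->wait_event_info == event)
			return e;
	}
	return NULL;
}
```

Dropping samples on a full table keeps the hot path bounded; a fixed 1024-slot table with no deletion is a reasonable fit for the small number of distinct (query, event) pairs a backend sees per session.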
dc89622 to ffe0a58
When the wait_event_trace GUC is enabled for a session (PGC_USERSET),
every pgstat_report_wait_end() writes a record to a per-backend ring
buffer in shared memory: {timestamp_ns, event, duration_ns, query_id}.
Ring buffer holds 4096 records (128 KB) per backend. At 220K events/sec
this covers ~18ms of history. External tools read via
pg_stat_get_wait_event_trace(backend_id).
Requires wait_event_timing = on (trace code runs inside the timing
block to reuse the already-computed timestamp and duration).
New GUC: wait_event_trace (bool, PGC_USERSET, default off)
New SQL function: pg_stat_get_wait_event_trace(backend_id int4)
Returns (seq, timestamp_ns, wait_event_type, wait_event, duration_us,
query_id) in chronological order.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
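A minimal sketch of the per-backend trace ring described above, assuming a single monotonically increasing sequence counter is enough for readers to detect lost records (identifiers are illustrative, not the patch's). The 32-byte record gives exactly 4096 * 32 B = 128 KB per backend:

```c
#include <stdint.h>

#define WAIT_TRACE_SLOTS 4096	/* power of two: 4096 * 32 B = 128 KB */

typedef struct WaitTraceRecord
{
	uint64_t	timestamp_ns;
	uint64_t	duration_ns;
	uint64_t	query_id;
	uint32_t	wait_event_info;
	uint32_t	unused;
} WaitTraceRecord;				/* 32 bytes */

typedef struct WaitTraceRing
{
	uint64_t	seq;			/* total records ever written */
	WaitTraceRecord slots[WAIT_TRACE_SLOTS];
} WaitTraceRing;

/*
 * Overwrite the oldest slot.  A reader remembers the last seq it consumed
 * and compares it with ring->seq to detect how many records it lost.
 */
void
wait_trace_write(WaitTraceRing *ring, uint64_t ts_ns, uint64_t dur_ns,
				 uint64_t query_id, uint32_t event)
{
	WaitTraceRecord *r = &ring->slots[ring->seq & (WAIT_TRACE_SLOTS - 1)];

	r->timestamp_ns = ts_ns;
	r->duration_ns = dur_ns;
	r->query_id = query_id;
	r->wait_event_info = event;
	ring->seq++;
}
```

Since the trace write runs inside the timing block, the timestamp and duration are already in registers and the write costs only a few stores plus the counter increment.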
ffe0a58 to 1b95736
…ing)

When compiled without the flag (default), zero instructions are added to pgstat_report_wait_start/end: the binary is identical to stock PostgreSQL. When compiled with --enable-wait-event-timing:

./configure --enable-wait-event-timing
meson setup -Dwait_event_timing=true

The wait_event_timing and wait_event_trace GUCs still exist in both builds (setting them is harmless without the compile flag). This eliminates the ~2% overhead from the compiled-in branch check that was present even with GUC off.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
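The compile-flag-plus-GUC layering might look like the sketch below, where USE_WAIT_EVENT_TIMING stands in for whatever symbol --enable-wait-event-timing defines, and the counter stands in for the clock_gettime() capture (all names are illustrative assumptions):

```c
#include <stdbool.h>
#include <stdint.h>

#define USE_WAIT_EVENT_TIMING 1		/* set by the configure/meson option */

bool		wait_event_timing = false;	/* the runtime GUC */
uint64_t	wait_timing_captures = 0;	/* stands in for clock_gettime() */

static inline void
report_wait_start_sketch(uint32_t wait_event_info)
{
	(void) wait_event_info;			/* stock path: advertise the event */
#ifdef USE_WAIT_EVENT_TIMING
	if (wait_event_timing)			/* one predictable branch when GUC off */
		wait_timing_captures++;
#endif
}
```

Without the compile flag, the #ifdef block disappears entirely and the inline function compiles to the same object code as stock PostgreSQL; with the flag, the only cost when the GUC is off is the single well-predicted branch.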
7069864 to 7cadcee
At 32 bytes per record, this is 4 MB per backend. At 220K events/sec (worst case), holds ~0.6 seconds of history, enough for a background worker polling at 100ms to drain without losing events.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
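The sizing arithmetic above can be checked directly: 4 MB / 32 B = 131072 records, and 131072 / 220000 events/sec ≈ 0.6 s. A small sanity calculation with hypothetical helper names:

```c
#include <stdint.h>

/* How many fixed-size records fit in the ring. */
uint64_t
trace_ring_records(uint64_t ring_bytes, uint64_t record_bytes)
{
	return ring_bytes / record_bytes;
}

/* Seconds of history the ring covers at a given event rate. */
double
trace_ring_window_seconds(uint64_t records, uint64_t events_per_sec)
{
	return (double) records / (double) events_per_sec;
}
```

At the claimed worst-case rate, a reader polling every 100 ms has roughly a 6x safety margin before the ring wraps past it.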
Summary
Adds Oracle-style wait event instrumentation to PostgreSQL: per-event timing, histograms, query attribution, and per-session 10046-style tracing. Controlled by a compile-time flag (--enable-wait-event-timing) with zero overhead when not compiled in.
The problem
External BPF-based wait event profilers (like pg_wait_tracer) use CPU hardware watchpoints on PGPROC->wait_event_info, costing ~200-300 ns per debug exception, i.e. 29% TPS overhead on high-transition workloads. Oracle solved this by instrumenting internally with clock_gettime() VDSO calls (~70-100 ns, no kernel trap).

What this patch provides
Compile-time:
./configure --enable-wait-event-timing (or meson setup -Dwait_event_timing=true). Without the flag, the binary is identical to stock PostgreSQL.
Two runtime GUCs:
- wait_event_timing (bool, default off): per-event call counts, total/max wait times, and log2 latency histograms
- wait_event_trace (bool, PGC_USERSET, default off): per-backend ring buffer of individual wait records; requires wait_event_timing = on

Oracle equivalents:
- pg_stat_get_wait_event_timing() ~ V$SYSTEM_EVENT / V$EVENT_HISTOGRAM
- wait_event_timing ~ TIMED_STATISTICS
- wait_event_trace ~ event 10046 wait tracing
Benchmark results
Environment: Hetzner cx43 (8 vCPU, 16 GB RAM), Rocky 9.7, PG 19devel,
pgbench scale 100, shared_buffers=128MB, 8 clients, SELECT-only — worst case
at ~220K wait event transitions/sec, 60-second runs.
Test 1: Compile flag overhead
Stock PG vs patched (without `--enable-wait-event-timing`) on same data directory,
alternating A/B runs, 5 rounds:
No measurable difference. Hot-path object code is byte-identical
(verified via `objcopy -O binary -j .text`).
Test 2: GUC overhead (same binary + same data)
Same binary (WITH flag), same data directory, toggling GUCs between restarts, 5 rounds:
< 0.5% difference. All configs within run-to-run variance.
vs hardware watchpoints
All 243 PostgreSQL regression tests pass (`make check`).
Files changed
New files:
Modified files:
Related work
🤖 Generated with Claude Code