feat(rivetkit): add current actor metrics by NathanFlurry · Pull Request #5020 · rivet-dev/rivet

NathanFlurry · 2026-05-11T13:45:35Z

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

NathanFlurry · 2026-05-11T13:46:01Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2026-05-11T13:54:11Z

Review: feat(rivetkit): add current actor metrics

This is a draft PR -- feedback is intended to catch issues before finalization.

Overview

This PR graduates several no-op actor metric methods into real implementations, adding gauges and counters for queue depth, active connections, inbox depths, user task counts, HTTP requests, keep-awake counts, shutdown drain counts, and SQLite worker state. The core mechanism is an aggregated gauge pattern: each ActorMetrics instance tracks its current contribution in a per-instance Mutex<ActorMetricState>, computes a signed delta, and applies it to the process-global IntGaugeVec. On drop, clear_aggregated_gauges zeroes every contribution, preventing stale label series from accumulating.

The test in metrics.rs covers the multi-instance aggregation and the drop-cleanup path. The kitchen-sink example cleanly removes the now-deleted registry.diagnostics call.

Issues

1. `user_tasks_active` gauge can go negative (bug)

In end_user_task, the decrement uses i64::saturating_sub(1), which produces -1 when the current stored value is 0:

let next = (*current).saturating_sub(1);  // 0i64.saturating_sub(1) == -1

If end_user_task is ever called without a prior begin_user_task for that kind (e.g. from a panicking task), the gauge goes negative. Fix:

let next = (*current).saturating_sub(1).max(0);

The same defensiveness applies to any gauge that decrements.

2. New strong-reference cycles in change callbacks

AsyncCounter::change_callbacks stores Arc<dyn Fn()> (strong references). The three new callbacks registered in configure_sleep_hooks each capture a cloned ActorContext (Arc<ActorContextInner>), and the counters live inside ActorContextInner.sleep.work. This creates a cycle:

Arc<ActorContextInner>
  -> sleep.work.keep_awake (AsyncCounter)
    -> change_callbacks: Vec<Arc<closure>>
      -> keep_awake_ctx: Arc<ActorContextInner>  <- back to start

The http_request_counter callback already does this (pre-existing), but the PR adds three more cycles. Worth confirming there is a teardown path that clears the callbacks before the ActorContextInner refcount reaches zero, otherwise actors will leak.

3. `set_aggregated_gauge` delta arithmetic

let delta = next.saturating_sub(*current);

For i64 in realistic metric ranges this equals plain subtraction, but saturating_sub on a signed type saturates at i64::MIN (not zero), which is surprising for a delta computation. Plain next - *current is more readable and correct for all in-range values.

Minor observations

WORK_LABELS is defined as a module-level constant but used only for keep_awake_active. Either inline it or leave it as-is.
set_worker_active maps bool to 1/0: aggregates correctly across actors (sum = active worker count per actor name), but a short comment near the metric definition would clarify the intent.
usize_to_i64 saturation at i64::MAX: a debug_assert that the value fits would catch regressions in tests without production overhead.
The examples/kitchen-sink removal is clean and correct.

What is good

clear_aggregated_gauges on drop correctly prevents the Prometheus registry from accumulating stale label-value series for stopped actors.
The test exercises concurrent instances, intermediate updates, and full drop cleanup -- good coverage for the aggregation invariant.
Parking lot mutex usage is justified: guards are never held across .await points, and Drop (sync context) requires a sync lock.
set_inbox_depth string dispatch with _ => unreachable!() is acceptable per project conventions (open string space, not an enum).

Summary

The aggregated-gauge design is sound and the test coverage is solid. The two actionable issues before merge are: (1) the potential negative gauge from end_user_task decrement, and (2) confirming the new change-callback cycles have a teardown path. The delta arithmetic nit is low-priority but worth a one-line fix for clarity.

claude · 2026-05-11T13:54:57Z

PR 5020 Review: feat(rivetkit): add current actor metrics. The core delta-based aggregation pattern in metrics.rs is well-designed. Key items: (1) sqlite_workers_active naming vs set_worker_active(bool) bool maps to 1 or 0 so the plural name is misleading; (2) keep_awake_active uses WORK_LABELS with kind dimension but shutdown_tasks_active uses ACTOR_LABELS without it creating PromQL inconsistency; (3) test coverage for keep_awake and shutdown_tasks gauges is missing from the new test; (4) begin_user_task / end_user_task can go negative if called out of order since entry().or_default() starts at 0; (5) registryDiagnostics removal in kitchen-sink appears unrelated to the metrics work. Overall the aggregation approach and Drop cleanup are solid.

NathanFlurry mentioned this pull request May 11, 2026

feat(rivetkit): expose low-cardinality metrics routes #5019

Draft

11 tasks

NathanFlurry force-pushed the counter-latency/current-actor-metrics branch from c6e4baf to 93e7a3b Compare May 11, 2026 13:46

NathanFlurry force-pushed the counter-latency/current-actor-metrics branch from 93e7a3b to 6d50b18 Compare May 11, 2026 13:54

NathanFlurry mentioned this pull request May 11, 2026

fix(rivetkit): require engine ping for health #5022

Draft

11 tasks

abcxff mentioned this pull request May 11, 2026

temp: patch wasm-pack to point to latest binary #5037

Draft

11 tasks

MasterPtato mentioned this pull request May 12, 2026

fix(envoy-client): dont abort drain after 20s #5043

Draft

11 tasks

feat(rivetkit): add current actor metrics

5b2112e

MasterPtato force-pushed the counter-latency/rivetkit-metrics-routes branch from 3a9c1be to 72886b4 Compare May 14, 2026 17:26

MasterPtato force-pushed the counter-latency/current-actor-metrics branch from 6d50b18 to 5b2112e Compare May 14, 2026 17:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(rivetkit): add current actor metrics#5020

feat(rivetkit): add current actor metrics#5020
NathanFlurry wants to merge 1 commit into
counter-latency/rivetkit-metrics-routesfrom
counter-latency/current-actor-metrics

NathanFlurry commented May 11, 2026

Uh oh!

NathanFlurry commented May 11, 2026 •

edited

Loading

Uh oh!

claude Bot commented May 11, 2026

Uh oh!

claude Bot commented May 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

NathanFlurry commented May 11, 2026

Description

Type of change

How Has This Been Tested?

Checklist:

Uh oh!

NathanFlurry commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude Bot commented May 11, 2026

Overview

Issues

1. user_tasks_active gauge can go negative (bug)

2. New strong-reference cycles in change callbacks

3. set_aggregated_gauge delta arithmetic

Minor observations

What is good

Summary

Uh oh!

claude Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

NathanFlurry commented May 11, 2026 •

edited

Loading

1. `user_tasks_active` gauge can go negative (bug)

3. `set_aggregated_gauge` delta arithmetic

claude Bot commented May 11, 2026 •

edited

Loading