Skip to content

feat(rivetkit): add current actor metrics#5020

Draft
NathanFlurry wants to merge 1 commit into
counter-latency/rivetkit-metrics-routesfrom
counter-latency/current-actor-metrics
Draft

feat(rivetkit): add current actor metrics#5020
NathanFlurry wants to merge 1 commit into
counter-latency/rivetkit-metrics-routesfrom
counter-latency/current-actor-metrics

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@NathanFlurry NathanFlurry force-pushed the counter-latency/current-actor-metrics branch from c6e4baf to 93e7a3b Compare May 11, 2026 13:46
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 11, 2026

Review: feat(rivetkit): add current actor metrics

This is a draft PR -- feedback is intended to catch issues before finalization.

Overview

This PR graduates several no-op actor metric methods into real implementations, adding gauges and counters for queue depth, active connections, inbox depths, user task counts, HTTP requests, keep-awake counts, shutdown drain counts, and SQLite worker state. The core mechanism is an aggregated gauge pattern: each ActorMetrics instance tracks its current contribution in a per-instance Mutex<ActorMetricState>, computes a signed delta, and applies it to the process-global IntGaugeVec. On drop, clear_aggregated_gauges zeroes every contribution, preventing stale label series from accumulating.

The test in metrics.rs covers the multi-instance aggregation and the drop-cleanup path. The kitchen-sink example cleanly removes the now-deleted registry.diagnostics call.


Issues

1. user_tasks_active gauge can go negative (bug)

In end_user_task, the decrement uses i64::saturating_sub(1), which produces -1 when the current stored value is 0:

let next = (*current).saturating_sub(1);  // 0i64.saturating_sub(1) == -1

If end_user_task is ever called without a prior begin_user_task for that kind (e.g. from a panicking task), the gauge goes negative. Fix:

let next = (*current).saturating_sub(1).max(0);

The same defensiveness applies to any gauge that decrements.

2. New strong-reference cycles in change callbacks

AsyncCounter::change_callbacks stores Arc<dyn Fn()> (strong references). The three new callbacks registered in configure_sleep_hooks each capture a cloned ActorContext (Arc<ActorContextInner>), and the counters live inside ActorContextInner.sleep.work. This creates a cycle:

Arc<ActorContextInner>
  -> sleep.work.keep_awake (AsyncCounter)
    -> change_callbacks: Vec<Arc<closure>>
      -> keep_awake_ctx: Arc<ActorContextInner>  <- back to start

The http_request_counter callback already does this (pre-existing), but the PR adds three more cycles. Worth confirming there is a teardown path that clears the callbacks before the ActorContextInner refcount reaches zero, otherwise actors will leak.

3. set_aggregated_gauge delta arithmetic

let delta = next.saturating_sub(*current);

For i64 in realistic metric ranges this equals plain subtraction, but saturating_sub on a signed type saturates at i64::MIN (not zero), which is surprising for a delta computation. Plain next - *current is more readable and correct for all in-range values.


Minor observations

  • WORK_LABELS is defined as a module-level constant but used only for keep_awake_active. Either inline it or leave it as-is.
  • set_worker_active maps bool to 1/0: aggregates correctly across actors (sum = active worker count per actor name), but a short comment near the metric definition would clarify the intent.
  • usize_to_i64 saturation at i64::MAX: a debug_assert that the value fits would catch regressions in tests without production overhead.
  • The examples/kitchen-sink removal is clean and correct.

What is good

  • clear_aggregated_gauges on drop correctly prevents the Prometheus registry from accumulating stale label-value series for stopped actors.
  • The test exercises concurrent instances, intermediate updates, and full drop cleanup -- good coverage for the aggregation invariant.
  • Parking lot mutex usage is justified: guards are never held across .await points, and Drop (sync context) requires a sync lock.
  • set_inbox_depth string dispatch with _ => unreachable!() is acceptable per project conventions (open string space, not an enum).

Summary

The aggregated-gauge design is sound and the test coverage is solid. The two actionable issues before merge are: (1) the potential negative gauge from end_user_task decrement, and (2) confirming the new change-callback cycles have a teardown path. The delta arithmetic nit is low-priority but worth a one-line fix for clarity.

@NathanFlurry NathanFlurry force-pushed the counter-latency/current-actor-metrics branch from 93e7a3b to 6d50b18 Compare May 11, 2026 13:54
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 11, 2026

PR 5020 Review: feat(rivetkit): add current actor metrics. The core delta-based aggregation pattern in metrics.rs is well-designed. Key items: (1) sqlite_workers_active naming vs set_worker_active(bool) bool maps to 1 or 0 so the plural name is misleading; (2) keep_awake_active uses WORK_LABELS with kind dimension but shutdown_tasks_active uses ACTOR_LABELS without it creating PromQL inconsistency; (3) test coverage for keep_awake and shutdown_tasks gauges is missing from the new test; (4) begin_user_task / end_user_task can go negative if called out of order since entry().or_default() starts at 0; (5) registryDiagnostics removal in kitchen-sink appears unrelated to the metrics work. Overall the aggregation approach and Drop cleanup are solid.

@MasterPtato MasterPtato force-pushed the counter-latency/rivetkit-metrics-routes branch from 3a9c1be to 72886b4 Compare May 14, 2026 17:26
@MasterPtato MasterPtato force-pushed the counter-latency/current-actor-metrics branch from 6d50b18 to 5b2112e Compare May 14, 2026 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant