Add Rank-weighted Average Treatment Effect (RATE) metric #887

Merged
jeongyoonlee merged 2 commits into uber:master from aman-coder03:feature/rate-metric on Mar 21, 2026

Conversation

@aman-coder03
Contributor

Proposed changes

Implements the RATE metric proposed by Yadlowsky et al. (2021), as requested in #540.
RATE evaluates how well a treatment prioritization rule (e.g., a CATE estimator) identifies units with above-average treatment benefit. It does this by computing the weighted area under the Targeting Operator Characteristic (TOC) curve, which compares the ATE among the top-q fraction of prioritized units to the overall ATE.

Three functions are added to causalml/metrics/rate.py, following the same API conventions as the existing qini_score / get_qini / plot_qini:

  • get_toc() computes the TOC curve
  • rate_score() computes the RATE scalar with either AUTOC (1/q) or Qini (q) weighting
  • plot_toc() visualizes the TOC curve

Both oracle mode (simulated tau) and observed RCT mode (y + w) are supported. Sixteen tests are included in tests/test_rate.py.
Closes #540
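For reviewers unfamiliar with the metric, here is a minimal from-scratch sketch of the oracle-mode computation described above. This is illustrative only, not the code in this PR; the function name and signature are hypothetical, and the weight normalization follows the weighted-mean convention discussed later in the review rather than a true integral.

```python
import numpy as np

def toc_rate_sketch(score, tau, weighting="autoc"):
    """Sketch: TOC curve and RATE scalar in oracle mode (known tau)."""
    n = len(score)
    order = np.argsort(-score)                   # rank units by predicted benefit, descending
    sorted_tau = tau[order]
    subset_ate = np.cumsum(sorted_tau) / np.arange(1, n + 1)  # ATE among the top-k units
    q = np.arange(1, n + 1) / n                  # treated fraction q = k / n
    toc = subset_ate - subset_ate[-1]            # TOC(q) = ATE(top-q) - overall ATE
    weights = 1.0 / q if weighting == "autoc" else q
    weights = weights / weights.sum()            # weighted mean over the TOC curve
    return toc, float((weights * toc).sum())
```

A good prioritization rule (score correlated with tau) yields a positive RATE; a reversed rule yields a negative one, and TOC(1) = 0 by construction.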

Types of changes

What types of changes does your code introduce to CausalML?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING doc
  • I have signed the CLA
  • Lint and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in downstream modules

Further comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution you did and what alternatives you considered, etc. This PR template is adopted from appium.

@aman-coder03
Contributor Author

aman-coder03 commented Mar 13, 2026

Hey @jeongyoonlee, can you please have a look at this PR?

@jeongyoonlee (Collaborator) left a comment

Thanks for adding the RATE metric! The implementation follows the existing get_qini/qini_score/plot_qini API pattern well. A few items to address:

Blocking

  1. `normalize=True` division by zero — At q=1, TOC = 0 by definition (the subset ATE equals the overall ATE when the subset is the entire population), so toc.div(np.abs(toc.iloc[-1, :]), axis=1) will divide by zero and produce inf/NaN. This needs a guard or a different normalization reference point (e.g., the maximum absolute value).

  2. Unused random_seed parameter — All three functions accept random_seed=42 but never use it. The docstring says "deprecated" but this is brand-new code with no backward-compatibility obligation. Please remove it, or if kept for API consistency with get_qini, document why.

  3. Missing test for normalize=True — Given the division-by-zero issue above, this path needs coverage.

  4. Hardcoded seeds — Per project conventions, please use RANDOM_SEED from tests/const.py instead of hardcoded 42/0. Same for CONTROL_NAME and TREATMENT_NAMES if applicable.

  5. Test that TOC ends at zero — At q=1, TOC should be 0 by definition. There's a test for TOC starting at zero but not ending at zero.
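To make item 1 concrete, a minimal sketch of the max-absolute-value normalization it suggests (a hypothetical helper, not the PR's code):

```python
import numpy as np
import pandas as pd

def normalize_toc(toc: pd.DataFrame) -> pd.DataFrame:
    """Normalize each TOC column by its maximum absolute value.

    Avoids dividing by TOC(q=1), which is 0 by definition and would
    produce inf/NaN under the toc.iloc[-1, :] reference point.
    """
    denom = toc.abs().max(axis=0)    # per-column max |TOC|
    denom = denom.replace(0, 1.0)    # all-zero (flat) curve: leave unchanged
    return toc.div(denom, axis=1)
```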

Non-blocking suggestions

  • O(n²) complexity in get_toc — The loop over every data point computes sorted_df.iloc[:top_k].mean() for each k. For large datasets this will be slow. Consider using cumulative sums (like get_qini does) for O(n) performance:

    cumsum_tau = np.cumsum(sorted_tau)                    # running sum of sorted effects
    subset_ate = cumsum_tau / np.arange(1, n_total + 1)   # running mean of the top-k units
  • Integration formula — The weight normalization (weights / weights.sum()) computes a weighted mean rather than a true integral. This preserves model rankings (which is the primary use case, similar to Qini/AUUC), but the absolute values won't exactly match the paper's definition. Worth a brief note in the docstring.

  • Module-level plt.style.use("fivethirtyeight") — This is a side effect at import time that affects global matplotlib state. Consistent with visualize.py but worth noting.

  • pytest.raises(Exception) in test_get_toc_errors_on_nan — Use a more specific exception type (the code raises AssertionError, so use pytest.raises(AssertionError)).

  • Observed-outcome fallback — When t_mask.sum() == 0 or c_mask.sum() == 0 at a quantile, the code silently falls back to overall_ate making TOC(q) = 0. This is reasonable but worth documenting.

@jeongyoonlee jeongyoonlee added the enhancement New feature or request label Mar 13, 2026
@aman-coder03
Contributor Author

Hi @jeongyoonlee, thanks for the thorough review! I've already addressed all of these in the latest commit...

@jeongyoonlee
Collaborator

Thanks @aman-coder03 — excellent work addressing all the feedback from the previous review. The updated implementation is clean and well-tested.

I cross-checked the implementation against the Yadlowsky et al. (2021) paper and the grf package reference. Here's a summary:

Correctness vs. the paper

| Aspect | Correct? | Notes |
| --- | --- | --- |
| TOC definition (oracle mode) | ✓ | Faithful: subset_ATE(top-q) - global_ATE, computed via O(n) cumsum |
| TOC boundary conditions | ✓ | TOC(0) = 0 and TOC(1) = 0 both enforced and tested |
| Sort order (descending by score) | ✓ | Matches the paper's "top-q by score" convention |
| RATE integral (numerical) | ✓ | Midpoint rule; preserves model rankings |
| AUTOC weight α(q) = 1/q | ✓ | Follows the grf software convention |
| Qini weight α(q) = q | ✓ | Matches the paper exactly |
| Normalization | ✓ | Divides by max(abs(TOC)), avoiding the q=1 division by zero |

The core math is correct. Two things worth documenting (not blocking):

1. Observed-outcome mode uses naive difference-in-means, not AIPW

The paper's key contribution is constructing the TOC from doubly-robust AIPW pseudo-outcomes (Section 4), which are valid even under observational confounding. The PR's observed-outcome path computes:

subset_ATE = mean(Y | W=1, top-q) - mean(Y | W=0, top-q)

This is an unbiased estimator for RCT data (where treatment is randomized), but not for observational data. Since causalml users commonly work with both, it would be helpful to add a note in the get_toc docstring, e.g.:

"When using observed outcomes (y and w), the TOC is estimated via naive difference-in-means within each quantile band. This is valid for randomized experimental data (RCTs) but may be biased for observational data. For observational settings, consider using doubly-robust (AIPW) pseudo-outcomes as the treatment_effect_col input."
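For reference, the AIPW pseudo-outcomes mentioned above could be constructed roughly as follows. This is a sketch under standard AIPW assumptions; mu1, mu0, and e stand for separately fitted outcome models and propensity scores, and none of this is part of the PR:

```python
import numpy as np

def aipw_pseudo_outcomes(y, w, mu1, mu0, e):
    """AIPW (doubly-robust) pseudo-outcomes Gamma_i.

    y: observed outcome; w: binary treatment indicator;
    mu1/mu0: fitted estimates of E[Y | X, W=1] and E[Y | X, W=0];
    e: fitted propensity score P(W=1 | X).
    E[Gamma_i] recovers the ATE if either the outcome models
    or the propensity model is correctly specified.
    """
    return (mu1 - mu0
            + w * (y - mu1) / e
            - (1 - w) * (y - mu0) / (1 - e))
```

Feeding these pseudo-outcomes in as treatment_effect_col, per the suggested docstring note, would extend the metric to observational settings.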

2. No standard errors / p-values (fine for v1)

The paper's "most profound theoretical contribution" is the functional CLT enabling exact Gaussian inference (confidence intervals, p-values) via half-sample bootstrap. This PR implements the point estimate only, which is the right scope for an initial contribution. A follow-up issue to add rate_score(..., return_ci=True) with bootstrap inference would be a natural extension.
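Such a follow-up could sketch the half-sample bootstrap along these lines (hypothetical signature; rate_fn stands for any RATE point estimator, and the details would need to follow the paper's inference procedure):

```python
import numpy as np

def rate_half_sample_se(score, tau, rate_fn, n_boot=200, seed=0):
    """Sketch: half-sample bootstrap standard error for a RATE estimate.

    rate_fn(score, tau) -> float is any RATE point estimator.
    Each replicate recomputes RATE on a random half of the data
    drawn without replacement.
    """
    rng = np.random.default_rng(seed)
    n = len(score)
    reps = []
    for _ in range(n_boot):
        idx = rng.choice(n, size=n // 2, replace=False)
        reps.append(rate_fn(score[idx], tau[idx]))
    return float(np.std(reps, ddof=1))
```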


Bottom line: The implementation is mathematically sound for its intended scope. I'm happy to approve once CI passes, with just the docstring note above as a minor request.

@jeongyoonlee (Collaborator) left a comment

LGTM. Left a comment on the future work.

@jeongyoonlee jeongyoonlee merged commit 1c555cf into uber:master Mar 21, 2026
7 checks passed

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

Add the Rank-weighted Average Treatment Effect (RATE) metric

2 participants