Add Rank-weighted Average Treatment Effect (RATE) metric #887

Merged
jeongyoonlee merged 2 commits into uber:master from aman-coder03:feature/rate-metric on Mar 21, 2026

Conversation

@aman-coder03
Contributor

Proposed changes

Implements the RATE metric proposed by Yadlowsky et al. (2021), as requested in #540.
RATE evaluates how well a treatment prioritization rule (e.g., a CATE estimator) identifies units with above-average treatment benefit. It does this by computing the weighted area under the Targeting Operator Characteristic (TOC) curve, which compares the ATE among the top-q fraction of prioritized units to the overall ATE.

Three functions are added to causalml/metrics/rate.py, following the same API conventions as the existing qini_score / get_qini / plot_qini:

  • get_toc() computes the TOC curve
  • rate_score() computes the RATE scalar with either AUTOC (1/q) or Qini (q) weighting
  • plot_toc() visualizes the TOC curve

Both oracle mode (simulated tau) and observed RCT mode (y + w) are supported. Sixteen tests are included in tests/test_rate.py.
Closes #540
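For reviewers unfamiliar with the metric, here is a minimal from-scratch sketch of the oracle-mode computation described above. This is illustrative only, not the code in this PR; the function name and signature are hypothetical, and the weight normalization follows the weighted-mean convention discussed later in the review rather than a true integral.

```python
import numpy as np

def toc_rate_sketch(score, tau, weighting="autoc"):
    """Sketch: TOC curve and RATE scalar in oracle mode (known tau)."""
    n = len(score)
    order = np.argsort(-score)                   # rank units by predicted benefit, descending
    sorted_tau = tau[order]
    subset_ate = np.cumsum(sorted_tau) / np.arange(1, n + 1)  # ATE among the top-k units
    q = np.arange(1, n + 1) / n                  # treated fraction q = k / n
    toc = subset_ate - subset_ate[-1]            # TOC(q) = ATE(top-q) - overall ATE
    weights = 1.0 / q if weighting == "autoc" else q
    weights = weights / weights.sum()            # weighted mean over the TOC curve
    return toc, float((weights * toc).sum())
```

A good prioritization rule (score correlated with tau) yields a positive RATE; a reversed rule yields a negative one, and TOC(1) = 0 by construction.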

Types of changes

What types of changes does your code introduce to CausalML?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING doc
  • I have signed the CLA
  • Lint and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in downstream modules

Further comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution you did and what alternatives you considered, etc. This PR template is adopted from appium.

@aman-coder03
Contributor Author

aman-coder03 commented Mar 13, 2026

Hey @jeongyoonlee, can you please have a look at this PR?

@jeongyoonlee (Collaborator) left a comment

Thanks for adding the RATE metric! The implementation follows the existing get_qini/qini_score/plot_qini API pattern well. A few items to address:

Blocking

  1. `normalize=True` division by zero — At q=1, TOC = 0 by definition (the subset ATE equals the overall ATE when the subset is the entire population), so toc.div(np.abs(toc.iloc[-1, :]), axis=1) will divide by zero and produce inf/NaN. This needs a guard or a different normalization reference point (e.g., the maximum absolute value).

  2. Unused random_seed parameter — All three functions accept random_seed=42 but never use it. The docstring says "deprecated" but this is brand-new code with no backward-compatibility obligation. Please remove it, or if kept for API consistency with get_qini, document why.

  3. Missing test for normalize=True — Given the division-by-zero issue above, this path needs coverage.

  4. Hardcoded seeds — Per project conventions, please use RANDOM_SEED from tests/const.py instead of hardcoded 42/0. Same for CONTROL_NAME and TREATMENT_NAMES if applicable.

  5. Test that TOC ends at zero — At q=1, TOC should be 0 by definition. There's a test for TOC starting at zero but not ending at zero.
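To make item 1 concrete, a minimal sketch of the max-absolute-value normalization it suggests (a hypothetical helper, not the PR's code):

```python
import numpy as np
import pandas as pd

def normalize_toc(toc: pd.DataFrame) -> pd.DataFrame:
    """Normalize each TOC column by its maximum absolute value.

    Avoids dividing by TOC(q=1), which is 0 by definition and would
    produce inf/NaN under the toc.iloc[-1, :] reference point.
    """
    denom = toc.abs().max(axis=0)    # per-column max |TOC|
    denom = denom.replace(0, 1.0)    # all-zero (flat) curve: leave unchanged
    return toc.div(denom, axis=1)
```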

Non-blocking suggestions

  • O(n²) complexity in get_toc — The loop over every data point computes sorted_df.iloc[:top_k].mean() for each k. For large datasets this will be slow. Consider using cumulative sums (like get_qini does) for O(n) performance:

    cumsum_tau = np.cumsum(sorted_tau)                    # running sum of sorted effects
    subset_ate = cumsum_tau / np.arange(1, n_total + 1)   # running mean of the top-k units
  • Integration formula — The weight normalization (weights / weights.sum()) computes a weighted mean rather than a true integral. This preserves model rankings (which is the primary use case, similar to Qini/AUUC), but the absolute values won't exactly match the paper's definition. Worth a brief note in the docstring.

  • Module-level plt.style.use("fivethirtyeight") — This is a side effect at import time that affects global matplotlib state. Consistent with visualize.py but worth noting.

  • pytest.raises(Exception) in test_get_toc_errors_on_nan — Use a more specific exception type (the code raises AssertionError, so use pytest.raises(AssertionError)).

  • Observed-outcome fallback — When t_mask.sum() == 0 or c_mask.sum() == 0 at a quantile, the code silently falls back to overall_ate making TOC(q) = 0. This is reasonable but worth documenting.

@jeongyoonlee jeongyoonlee added the enhancement New feature or request label Mar 13, 2026
@aman-coder03
Contributor Author

Hi @jeongyoonlee, thanks for the thorough review! I've already addressed all of these in the latest commit...

@jeongyoonlee
Collaborator

Thanks @aman-coder03 — excellent work addressing all the feedback from the previous review. The updated implementation is clean and well-tested.

I cross-checked the implementation against the Yadlowsky et al. (2021) paper and the grf package reference. Here's a summary:

Correctness vs. the paper

| Aspect | Correct? | Notes |
| --- | --- | --- |
| TOC definition (oracle mode) | ✓ | Faithful: subset_ATE(top-q) - global_ATE, computed via O(n) cumsum |
| TOC boundary conditions | ✓ | TOC(0) = 0 and TOC(1) = 0 both enforced and tested |
| Sort order (descending by score) | ✓ | Matches the paper's "top-q by score" convention |
| RATE integral (numerical) | ✓ | Midpoint rule; preserves model rankings |
| AUTOC weight α(q) = 1/q | ✓ | Follows the grf software convention |
| Qini weight α(q) = q | ✓ | Matches the paper exactly |
| Normalization | ✓ | Divides by max(abs(TOC)), avoiding the q=1 division by zero |

The core math is correct. Two things worth documenting (not blocking):

1. Observed-outcome mode uses naive difference-in-means, not AIPW

The paper's key contribution is constructing the TOC from doubly-robust AIPW pseudo-outcomes (Section 4), which are valid even under observational confounding. The PR's observed-outcome path computes:

subset_ATE = mean(Y | W=1, top-q) - mean(Y | W=0, top-q)

This is an unbiased estimator for RCT data (where treatment is randomized), but not for observational data. Since causalml users commonly work with both, it would be helpful to add a note in the get_toc docstring, e.g.:

"When using observed outcomes (y and w), the TOC is estimated via naive difference-in-means within each quantile band. This is valid for randomized experimental data (RCTs) but may be biased for observational data. For observational settings, consider using doubly-robust (AIPW) pseudo-outcomes as the treatment_effect_col input."
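For reference, the AIPW pseudo-outcomes mentioned above could be constructed roughly as follows. This is a sketch under standard AIPW assumptions; mu1, mu0, and e stand for separately fitted outcome models and propensity scores, and none of this is part of the PR:

```python
import numpy as np

def aipw_pseudo_outcomes(y, w, mu1, mu0, e):
    """AIPW (doubly-robust) pseudo-outcomes Gamma_i.

    y: observed outcome; w: binary treatment indicator;
    mu1/mu0: fitted estimates of E[Y | X, W=1] and E[Y | X, W=0];
    e: fitted propensity score P(W=1 | X).
    E[Gamma_i] recovers the ATE if either the outcome models
    or the propensity model is correctly specified.
    """
    return (mu1 - mu0
            + w * (y - mu1) / e
            - (1 - w) * (y - mu0) / (1 - e))
```

Feeding these pseudo-outcomes in as treatment_effect_col, per the suggested docstring note, would extend the metric to observational settings.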

2. No standard errors / p-values (fine for v1)

The paper's "most profound theoretical contribution" is the functional CLT enabling exact Gaussian inference (confidence intervals, p-values) via half-sample bootstrap. This PR implements the point estimate only, which is the right scope for an initial contribution. A follow-up issue to add rate_score(..., return_ci=True) with bootstrap inference would be a natural extension.
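Such a follow-up could sketch the half-sample bootstrap along these lines (hypothetical signature; rate_fn stands for any RATE point estimator, and the details would need to follow the paper's inference procedure):

```python
import numpy as np

def rate_half_sample_se(score, tau, rate_fn, n_boot=200, seed=0):
    """Sketch: half-sample bootstrap standard error for a RATE estimate.

    rate_fn(score, tau) -> float is any RATE point estimator.
    Each replicate recomputes RATE on a random half of the data
    drawn without replacement.
    """
    rng = np.random.default_rng(seed)
    n = len(score)
    reps = []
    for _ in range(n_boot):
        idx = rng.choice(n, size=n // 2, replace=False)
        reps.append(rate_fn(score[idx], tau[idx]))
    return float(np.std(reps, ddof=1))
```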


Bottom line: The implementation is mathematically sound for its intended scope. I'm happy to approve once CI passes, with just the docstring note above as a minor request.

@jeongyoonlee (Collaborator) left a comment

LGTM. Left a comment on the future work.

@jeongyoonlee jeongyoonlee merged commit 1c555cf into uber:master Mar 21, 2026
7 checks passed

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

Add the Rank-weighted Average Treatment Effect (RATE) metric

2 participants