Fix GAE masked critic values bootstrapping valid advantages #447

Open: haoyang9804 wants to merge 1 commit into alibaba:main from haoyang9804:fix/gae-mask-critic-values

@haoyang9804

Summary

ROLL's roll.utils.functionals.compute_advantage() can let critic values at masked response positions change valid-token GAE advantages. The trigger is adv_estimator="gae" combined with a padding or filtered token (response_mask=0) whose critic value is non-zero or invalid. The function stored values * response_mask back into the batch but still passed the original unmasked values tensor to compute_gae_advantage_return(), so a masked slot could bootstrap into earlier valid tokens silently, without a crash.

This patch applies response_mask to values before calling the GAE helper. The new test covers the concrete boundary where a masked padding value of 100.0 used to change both valid-token advantages.

Concrete triggering example

response_mask = [[1.0, 1.0, 0.0]]
token_level_rewards = [[0.0, 1.0, 0.0]]
values = [[0.0, 0.0, 100.0]]
gamma = 1.0
lambd = 0.95
adv_estimator = "gae"

Buggy output from upstream main:

{
  "stored_values_after_compute_advantage": [[0.0, 0.0, 0.0]],
  "advantages": [[5.699999809265137, 6.0, 0.0]],
  "returns": [[5.699999809265137, 6.0, 0.0]]
}

Wrong intermediate value: the masked critic value 100.0 at the final response_mask=0 slot is still used as nextvalues while computing the previous valid token, so the valid advantages become 5.699999809265137 and 6.0.

Fixed output:

{
  "stored_values_after_compute_advantage": [[0.0, 0.0, 0.0]],
  "advantages": [[0.949999988079071, 1.0, 0.0]],
  "returns": [[0.949999988079071, 1.0, 0.0]]
}

Fixed value: after masking critic values before GAE, the helper sees [[0.0, 0.0, 0.0]] for values, so the valid advantages are 0.949999988079071 and 1.0.
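
Both sets of numbers can be checked by hand. The following is a worked sketch that assumes compute_gae_advantage_return() follows the standard GAE recursion with a zero value after the last position. That assumption is not quoted from ROLL's source, but it reproduces the outputs above for gamma = 1.0, lambd = 0.95, and rewards (0, 1, 0):

```latex
\begin{aligned}
\delta_t &= r_t + \gamma V_{t+1} - V_t, \qquad A_t = \delta_t + \gamma\lambda A_{t+1}, \qquad V_3 = A_3 = 0 \\[6pt]
&\text{Unmasked } V = (0, 0, 100): \\
\delta_2 &= 0 + 0 - 100 = -100, \qquad A_2 = -100 \\
\delta_1 &= 1 + 100 - 0 = 101, \qquad A_1 = 101 + 0.95\,(-100) = 6 \\
\delta_0 &= 0 + 0 - 0 = 0, \qquad A_0 = 0 + 0.95 \cdot 6 = 5.7 \\[6pt]
&\text{Masked } V = (0, 0, 0): \\
\delta_2 &= 0, \qquad A_2 = 0 \\
\delta_1 &= 1 + 0 - 0 = 1, \qquad A_1 = 1 \\
\delta_0 &= 0, \qquad A_0 = 0.95 \cdot 1 = 0.95
\end{aligned}
```

In the unmasked case the final response_mask multiplication zeroes A_2 = -100, which would explain a -0.0 in that slot, but A_0 and A_1 have already absorbed the padding value by then.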

The shared invariant is that masked values must be masked out before they enter reward, advantage, return, or loss arithmetic. Multiplying by the mask only for storage is not enough when a later helper still receives the unmasked tensor.
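
The difference between masking the helper's output and masking its input can be seen in a few lines of plain Python. This is a minimal sketch: gae() is a hypothetical textbook single-sequence implementation written for illustration, not ROLL's compute_gae_advantage_return(), and it assumes the value after the last position is zero.

```python
from typing import List


def gae(rewards: List[float], values: List[float], gamma: float, lam: float) -> List[float]:
    """Textbook single-sequence GAE; the value after the last step is taken as 0."""
    advantages = [0.0] * len(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    return advantages


mask = [1.0, 1.0, 0.0]
rewards = [0.0, 1.0, 0.0]
values = [0.0, 0.0, 100.0]

# Masking only the output: the raw 100.0 still reaches the recursion.
masked_late = [a * m for a, m in zip(gae(rewards, values, 1.0, 0.95), mask)]

# Masking the input first: the recursion never sees the padding value.
masked_values = [v * m for v, m in zip(values, mask)]
masked_early = [a * m for a, m in zip(gae(rewards, masked_values, 1.0, 0.95), mask)]

print(masked_late)
print(masked_early)
```

Under these assumptions the first list should land near the 5.7 / 6.0 pair from the buggy output and the second near the 0.95 / 1.0 pair from the fixed output, up to float64 versus float32 rounding.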

Real rollout reproduction

This was also validated with a local real model rollout. The runner uses Qwen/Qwen2.5-0.5B-Instruct from the local Hugging Face cache, performs a real AutoModelForCausalLM.generate(), then sends the generated tokens through ROLL's real postprocess_generate(), get_sample_level_mask(), reward_postprocess(), compute_token_reward(), and compute_advantage() path. The only hook is the fault construction step after rollout: it injects a finite critic value into a rollout padding slot where response_mask=0.

Recipe:

{
  "kind": "rl_sentinel_validation_recipe",
  "schema_version": 1,
  "bug_id": "ROLL-GAE-MASKED-VALUE-BOOTSTRAP",
  "target": "roll",
  "validation_mode": "real_hf_model_rollout_plus_roll_training_signal_hook",
  "model": "${MODEL_ID:-Qwen/Qwen2.5-0.5B-Instruct}",
  "requirements": {
    "target_repo": "${TARGET_REPO}",
    "output_dir": "${OUTPUT_DIR}",
    "required_modules": [
      "roll.utils.functionals",
      "roll.distributed.scheduler.protocol",
      "transformers",
      "torch",
      "tensordict"
    ]
  },
  "preserved_infrastructure": [
    "real local HF model generation",
    "roll.utils.functionals.postprocess_generate",
    "roll.utils.functionals.get_sample_level_mask",
    "roll.utils.functionals.reward_postprocess",
    "roll.utils.functionals.compute_token_reward",
    "roll.utils.functionals.compute_advantage",
    "roll.distributed.scheduler.protocol.DataProto",
    "tensordict.TensorDict"
  ],
  "hooked_boundary": "after real model rollout, before critic values enter compute_advantage",
  "constructed_scenario": {
    "prompt": "Answer with one short word: ok",
    "max_new_tokens": 2,
    "extra_response_padding_slots": 3,
    "response_level_reward": 1.0,
    "masked_critic_value": 100.0,
    "gamma": 1.0,
    "lambd": 0.95,
    "adv_estimator": "gae"
  },
  "replaced_component": null
}

Runner:

#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
export TARGET_REPO="${TARGET_REPO:?Set TARGET_REPO to the ROLL checkout}"
export OUTPUT_DIR="${OUTPUT_DIR:-${SCRIPT_DIR}/out}"
export BUG_ID="${BUG_ID:-ROLL-GAE-MASKED-VALUE-BOOTSTRAP}"
export MODEL_ID="${MODEL_ID:-Qwen/Qwen2.5-0.5B-Instruct}"
export PYTHONPATH="${TARGET_REPO}${PYTHONPATH:+:${PYTHONPATH}}"

mkdir -p "${OUTPUT_DIR}"
python3 "${SCRIPT_DIR}/roll_gae_masked_value_real_rollout_hook.py" \
  > "${OUTPUT_DIR}/real_rollout_training_signal_validation.json" \
  2> "${OUTPUT_DIR}/real_rollout_validation.stderr.log"
python3 -m json.tool "${OUTPUT_DIR}/real_rollout_training_signal_validation.json" > /dev/null

Hook:

from __future__ import annotations

import contextlib
import importlib
import json
import os
import subprocess
import sys
from pathlib import Path
from types import SimpleNamespace

import torch
from tensordict import TensorDict
from transformers import AutoModelForCausalLM, AutoTokenizer


def import_from_target(name: str, repo: Path):
    with contextlib.redirect_stdout(sys.stderr):
        module = importlib.import_module(name)
    module_file = Path(module.__file__).resolve()
    if not module_file.is_relative_to(repo):
        raise RuntimeError(f"{name} resolved outside TARGET_REPO: {module_file} not under {repo}")
    return module


class NoopKLController:
    value = 0.0

    def update(self, current, n_steps):
        return None


def tensor_list(tensor: torch.Tensor):
    return tensor.detach().cpu().tolist()


def main() -> int:
    repo = Path(os.environ["TARGET_REPO"]).resolve()
    output_dir = Path(os.environ["OUTPUT_DIR"]).resolve()
    output_dir.mkdir(parents=True, exist_ok=True)
    model_id = os.environ.get("MODEL_ID", "Qwen/Qwen2.5-0.5B-Instruct")
    expect = os.environ.get("EXPECT", "reproduced")

    functionals = import_from_target("roll.utils.functionals", repo)
    protocol = import_from_target("roll.distributed.scheduler.protocol", repo)
    DataProto = protocol.DataProto

    target_commit = subprocess.check_output(["git", "-C", str(repo), "rev-parse", "HEAD"], text=True).strip()
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, local_files_only=True, padding_side="left")
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,
        local_files_only=True,
        torch_dtype=dtype,
    ).to(device)
    model.eval()

    prompt_text = "Answer with one short word: ok"
    encoded = tokenizer([prompt_text], return_tensors="pt", padding=True)
    input_ids = encoded["input_ids"]
    attention_mask = encoded["attention_mask"]
    prompt_length = input_ids.shape[1]
    max_new_tokens = 2
    sequence_length = prompt_length + max_new_tokens + 3
    position_ids = torch.clip(torch.cumsum(attention_mask, dim=-1) - 1, min=0)

    with torch.no_grad():
        generated = model.generate(
            input_ids=input_ids.to(device),
            attention_mask=attention_mask.to(device),
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        ).detach().cpu()

    prompt_batch = TensorDict(
        {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "position_ids": position_ids,
        },
        batch_size=[1],
    )
    rollout = functionals.postprocess_generate(
        prompts=DataProto(batch=prompt_batch),
        output=generated,
        num_return_sequences=1,
        sequence_length=sequence_length,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

    action_mask = rollout.batch["response_mask"][:, 1:].float()
    if action_mask.sum().item() == 0:
        raise RuntimeError("Real rollout produced no trainable response tokens; cannot validate GAE signal.")

    config = SimpleNamespace(
        adv_estimator="gae",
        actor_infer=SimpleNamespace(generating_args=SimpleNamespace(num_return_sequences=1)),
        max_len_mask=False,
        difficulty_mask=False,
        difficulty_low_threshold=0.0,
        difficulty_high_threshold=1.0,
        error_max_len_clip=False,
        error_max_len_threshold=9999999999,
        norm_mean_type=None,
        norm_std_type=None,
        reward_clip=None,
        kl_penalty="kl",
        add_token_level_kl=False,
    )

    rollout.batch["response_level_rewards"] = torch.tensor([1.0], dtype=torch.float32)
    rollout.batch["old_log_probs"] = torch.zeros_like(action_mask, dtype=torch.float32)
    rollout.batch["ref_log_probs"] = torch.zeros_like(action_mask, dtype=torch.float32)

    rollout, mask_metrics = functionals.get_sample_level_mask(rollout, config)
    rollout, reward_metrics = functionals.reward_postprocess(rollout, config, {})
    rollout, token_reward_metrics = functionals.compute_token_reward(rollout, config, NoopKLController())

    final_mask = rollout.batch["final_response_mask"].clone().float()
    masked_candidates = (final_mask[0] == 0).nonzero(as_tuple=False).flatten()
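    valid_positions = (final_mask[0] > 0).nonzero(as_tuple=False).flatten()
    # Guard: .min() on an empty tensor raises an opaque error, so fail with a clear message instead.
    if valid_positions.numel() == 0:
        raise RuntimeError(f"Sample-level masking left no valid response token: {tensor_list(final_mask)}")
    masked_after_valid = masked_candidates[masked_candidates > valid_positions.min()]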
    if len(masked_after_valid) == 0:
        raise RuntimeError(f"Real rollout left no masked action slot after a valid token: {tensor_list(final_mask)}")
    injected_index = int(masked_after_valid[0].item())

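    # Fault construction: one finite critic value in a masked slot; safe_values is the same tensor with the mask applied.
    bad_values = torch.zeros_like(final_mask, dtype=torch.float32)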
    bad_values[0, injected_index] = 100.0
    safe_values = bad_values * final_mask

    def run_advantage(values: torch.Tensor):
        data = DataProto(
            batch=TensorDict(
                {
                    "response_mask": rollout.batch["response_mask"].clone(),
                    "final_response_mask": final_mask.clone(),
                    "token_level_rewards": rollout.batch["token_level_rewards"].clone(),
                    "values": values.clone(),
                    "old_log_probs": rollout.batch["old_log_probs"].clone(),
                    "ref_log_probs": rollout.batch["ref_log_probs"].clone(),
                },
                batch_size=[1],
            ),
            meta_info={},
        )
        return functionals.compute_advantage(
            data=data,
            gamma=torch.tensor(1.0),
            lambd=torch.tensor(0.95),
            adv_estimator="gae",
            advantage_clip=None,
            whiten_advantages=False,
            whiten_rewards=False,
            response_mask=final_mask.clone(),
            pipeline_config=config,
        )

    observed = run_advantage(bad_values)
    expected = run_advantage(safe_values)
    observed_adv = observed.batch["advantages"].detach()
    expected_adv = expected.batch["advantages"].detach()
    observed_returns = observed.batch["returns"].detach()
    expected_returns = expected.batch["returns"].detach()
    valid_delta = (observed_adv - expected_adv) * final_mask
    max_delta = float(torch.max(torch.abs(valid_delta)).item())
    reproduced = max_delta > 1e-6

    payload = {
        "kind": "rl_sentinel_training_signal_validation",
        "schema_version": 1,
        "bug_id": os.environ.get("BUG_ID", "ROLL-GAE-MASKED-VALUE-BOOTSTRAP"),
        "target": "roll",
        "target_repo": str(repo),
        "target_commit": target_commit,
        "status": "reproduced" if reproduced else "fixed",
        "expect": expect,
        "validation_mode": "real_hf_model_rollout_plus_roll_training_signal_hook",
        "real_rollout": {
            "backend": "transformers.AutoModelForCausalLM.generate",
            "model_id": model_id,
            "response": tokenizer.decode(rollout.batch["responses"][0], skip_special_tokens=True),
            "response_mask": tensor_list(rollout.batch["response_mask"]),
            "final_response_mask": tensor_list(final_mask),
        },
        "hook": {
            "hooked_boundary": "post-rollout critic value tensor before roll.utils.functionals.compute_advantage",
            "constructed_scenario": "inject a finite critic value under a response_mask=0 rollout padding slot",
            "injected_masked_value_index": injected_index,
            "injected_masked_value": 100.0,
            "replaced_component": None,
        },
        "pipeline_metrics": {
            "mask_metrics": mask_metrics,
            "reward_metrics": reward_metrics,
            "token_reward_metrics": token_reward_metrics,
        },
        "trigger": {
            "token_level_rewards": tensor_list(rollout.batch["token_level_rewards"]),
            "bad_values": tensor_list(bad_values),
            "safe_values": tensor_list(safe_values),
            "gamma": 1.0,
            "lambd": 0.95,
            "adv_estimator": "gae",
        },
        "observed": {
            "advantages": tensor_list(observed_adv),
            "returns": tensor_list(observed_returns),
            "stored_values_after_compute_advantage": tensor_list(observed.batch["values"]),
        },
        "expected_sanitized_values": {
            "advantages": tensor_list(expected_adv),
            "returns": tensor_list(expected_returns),
            "stored_values_after_compute_advantage": tensor_list(expected.batch["values"]),
        },
        "attack_effect": {
            "valid_token_advantage_delta": tensor_list(valid_delta),
            "max_abs_valid_advantage_delta": max_delta,
            "quiet_non_crash": True,
            "finite_signal_corruption": bool(torch.isfinite(observed_adv).all().item() and reproduced),
        },
        "candidate_bug_reproduced": reproduced,
    }
    print(json.dumps(payload, indent=2, sort_keys=True))

    if expect == "reproduced" and not reproduced:
        return 2
    if expect == "fixed" and reproduced:
        return 3
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

Real rollout output on unpatched alibaba/ROLL main:

{
  "target_commit": "c09bc8bc9f43",
  "status": "reproduced",
  "candidate_bug_reproduced": true,
  "real_rollout": {
    "backend": "transformers.AutoModelForCausalLM.generate",
    "model_id": "Qwen/Qwen2.5-0.5B-Instruct",
    "response": "\n\nSure",
    "final_response_mask": [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]]
  },
  "observed": {
    "advantages": [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.699999809265137, 6.0, -0.0, 0.0, 0.0]]
  },
  "expected_sanitized_values": {
    "advantages": [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.949999988079071, 1.0, 0.0, 0.0, 0.0]]
  },
  "attack_effect": {
    "max_abs_valid_advantage_delta": 5.0,
    "quiet_non_crash": true,
    "finite_signal_corruption": true
  }
}
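
The reported max_abs_valid_advantage_delta can be recomputed by hand from the payload. The sketch below copies the mask and advantage rows from the unpatched output above and mirrors the hook's (observed - expected) * mask comparison in plain Python; max_abs_valid_delta is a hypothetical helper name, not part of ROLL.

```python
from typing import List

# Rows copied from the unpatched real-rollout payload above.
final_response_mask = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
observed = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.699999809265137, 6.0, -0.0, 0.0, 0.0]
expected = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.949999988079071, 1.0, 0.0, 0.0, 0.0]


def max_abs_valid_delta(obs: List[float], exp: List[float], mask: List[float]) -> float:
    """Largest |obs - exp| after zeroing positions where the mask is 0."""
    return max(abs((o - e) * m) for o, e, m in zip(obs, exp, mask))


delta = max_abs_valid_delta(observed, expected, final_response_mask)
# The hook flags the bug as reproduced when this delta exceeds 1e-6.
print(delta, delta > 1e-6)
```

The largest gap comes from the second valid token (6.0 versus 1.0), which gives the 5.0 reported in attack_effect.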

Real rollout output on this branch:

{
  "target_commit": "090ac94658ad",
  "status": "fixed",
  "candidate_bug_reproduced": false,
  "observed": {
    "advantages": [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.949999988079071, 1.0, 0.0, 0.0, 0.0]]
  },
  "expected_sanitized_values": {
    "advantages": [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.949999988079071, 1.0, 0.0, 0.0, 0.0]]
  },
  "attack_effect": {
    "max_abs_valid_advantage_delta": 0.0,
    "quiet_non_crash": true,
    "finite_signal_corruption": false
  }
}

Reproduction recipe

{
  "kind": "rl_sentinel_validation_recipe",
  "schema_version": 1,
  "target": "roll",
  "validation_mode": "real_functionals_boundary_hook",
  "hooked_boundary": "roll.utils.functionals.compute_advantage",
  "requirements": {
    "roll_repo": "${ROLL_REPO}",
    "output_dir": "${OUTPUT_DIR}",
    "required_modules": [
      "roll.utils.functionals",
      "roll.distributed.scheduler.protocol",
      "torch",
      "tensordict"
    ]
  },
  "constructed_scenario": {
    "response_mask": [[1.0, 1.0, 0.0]],
    "token_level_rewards": [[0.0, 1.0, 0.0]],
    "values": [[0.0, 0.0, 100.0]],
    "gamma": 1.0,
    "lambd": 0.95,
    "adv_estimator": "gae"
  },
  "expected_unpatched": {
    "advantages": [[5.699999809265137, 6.0, 0.0]]
  },
  "expected_fixed": {
    "advantages": [[0.949999988079071, 1.0, 0.0]]
  },
  "replaced_component": null
}

Validation runner

Save this as run_gae_masked_value_validation.sh and run it with ROLL_REPO pointing at either an unpatched or patched ROLL checkout.

#!/usr/bin/env bash
set -euo pipefail

ROLL_REPO="${ROLL_REPO:?Set ROLL_REPO to a ROLL checkout}"
OUTPUT_DIR="${OUTPUT_DIR:-./roll-gae-mask-validation}"
mkdir -p "${OUTPUT_DIR}"
export ROLL_REPO OUTPUT_DIR
export PYTHONPATH="${ROLL_REPO}${PYTHONPATH:+:${PYTHONPATH}}"

python3 - <<'PY' > "${OUTPUT_DIR}/gae_masked_value_validation.json"
import contextlib
import importlib
import json
import os
import subprocess
import sys
from pathlib import Path

import torch

repo = Path(os.environ["ROLL_REPO"]).resolve()

def load_module(name):
    with contextlib.redirect_stdout(sys.stderr):
        module = importlib.import_module(name)
    module_file = Path(module.__file__).resolve()
    if not module_file.is_relative_to(repo):
        raise RuntimeError(f"{name} imported from {module_file}, not {repo}")
    return module

functionals = load_module("roll.utils.functionals")
protocol = load_module("roll.distributed.scheduler.protocol")
TensorDict = importlib.import_module("tensordict").TensorDict
DataProto = protocol.DataProto

response_mask = torch.tensor([[1.0, 1.0, 0.0]], dtype=torch.float32)
token_level_rewards = torch.tensor([[0.0, 1.0, 0.0]], dtype=torch.float32)
values = torch.tensor([[0.0, 0.0, 100.0]], dtype=torch.float32)
expected_fixed = torch.tensor([[0.95, 1.0, 0.0]], dtype=torch.float32)

data = DataProto(
    batch=TensorDict(
        {
            "response_mask": response_mask.clone(),
            "token_level_rewards": token_level_rewards.clone(),
            "values": values.clone(),
        },
        batch_size=[1],
    ),
    meta_info={},
)
functionals.compute_advantage(
    data=data,
    gamma=torch.tensor(1.0),
    lambd=torch.tensor(0.95),
    adv_estimator="gae",
    response_mask=response_mask.clone(),
)

advantages = data.batch["advantages"].detach()
returns = data.batch["returns"].detach()
stored_values = data.batch["values"].detach()
valid_delta = (advantages - expected_fixed) * response_mask
fixed = bool(
    torch.max(torch.abs(valid_delta)).item() <= 1e-6
    and torch.equal(stored_values, values * response_mask)
)

print(json.dumps({
    "repo": str(repo),
    "commit": subprocess.check_output(["git", "-C", str(repo), "rev-parse", "HEAD"], text=True).strip(),
    "hooked_boundary": "roll.utils.functionals.compute_advantage",
    "trigger": {
        "response_mask": response_mask.tolist(),
        "token_level_rewards": token_level_rewards.tolist(),
        "values": values.tolist(),
        "gamma": 1.0,
        "lambd": 0.95,
        "adv_estimator": "gae"
    },
    "observed": {
        "stored_values_after_compute_advantage": stored_values.tolist(),
        "advantages": advantages.tolist(),
        "returns": returns.tolist()
    },
    "expected_fixed_advantages": expected_fixed.tolist(),
    "status": "fixed" if fixed else "bug_reproduced"
}, indent=2, sort_keys=True))

if not fixed:
    raise SystemExit(2)
PY

Observed output

On unpatched alibaba/ROLL main, the same boundary reproduced the bug:

{
  "status": "reproduced",
  "observed_unpatched": {
    "stored_values_after_compute_advantage": [[0.0, 0.0, 0.0]],
    "advantages": [[5.699999809265137, 6.0, -0.0]],
    "returns": [[5.699999809265137, 6.0, 0.0]]
  },
  "expected_fixed": {
    "advantages": [[0.949999988079071, 1.0, 0.0]],
    "returns": [[0.949999988079071, 1.0, 0.0]]
  },
  "attack_effect": {
    "max_abs_valid_advantage_delta": 5.0,
    "quiet_non_crash": true,
    "finite_signal_corruption": true
  }
}

On this branch:

{
  "status": "fixed",
  "observed_fixed": {
    "stored_values_after_compute_advantage": [[0.0, 0.0, 0.0]],
    "advantages": [[0.949999988079071, 1.0, 0.0]],
    "raw_advantages_before_final_mask": [[0.949999988079071, 1.0, 0.0]],
    "returns": [[0.949999988079071, 1.0, 0.0]],
    "all_finite": true
  },
  "attack_effect": {
    "candidate_bug_reproduced": false,
    "candidate_bug_fixed": true,
    "max_abs_valid_advantage_delta": 0.0,
    "max_abs_valid_return_delta": 0.0,
    "stored_values_match_response_mask": true
  }
}

Root cause

compute_advantage() did this in the GAE branch:

values = data.batch["values"].float()
data.batch["values"] = values * response_mask
advantages, returns = compute_gae_advantage_return(
    token_level_rewards=token_level_rewards, values=values, gamma=gamma, lambd=lambd
)

The batch stored masked values, but compute_gae_advantage_return() still received the raw values. Because GAE uses nextvalues = values[:, t + 1], a masked value at a later padding or filtered token can bootstrap into earlier valid tokens.
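
The size of the leak can be written in closed form. This is a sketch assuming the standard GAE recursion, delta_t = r_t + gamma V_{t+1} - V_t and A_t = delta_t + gamma lambda A_{t+1}, with a single masked position k holding value v where the masked tensor would hold 0 and every other input unchanged. V_k appears only in delta_{k-1} and delta_k, so:

```latex
\begin{aligned}
\Delta\delta_{k-1} &= \gamma v, \qquad \Delta\delta_k = -v \\
\Delta A_{k-1} &= \Delta\delta_{k-1} + \gamma\lambda\,\Delta\delta_k = \gamma(1-\lambda)\,v \\
\Delta A_{k-j} &= (\gamma\lambda)^{j-1}\,\gamma(1-\lambda)\,v, \qquad j \ge 1
\end{aligned}
```

With gamma = 1.0, lambd = 0.95, and v = 100 this gives shifts of 5 and 4.75 for the two preceding tokens, matching 6.0 - 1.0 and 5.7 - 0.95 in the outputs above. Under the same assumptions the shift vanishes at lambd = 1, where the two contributions cancel.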

Fix

The GAE branch now masks values before both storage and helper invocation:

values = data.batch["values"].float()
values = values * response_mask
data.batch["values"] = values
advantages, returns = compute_gae_advantage_return(
    token_level_rewards=token_level_rewards, values=values, gamma=gamma, lambd=lambd
)

The regression test constructs the exact tensor boundary above and asserts values, advantages, and returns all match the masked-value oracle.

Tests and checks

PYTHONPATH="${REPAIR_REPO}:${PYTHONPATH}" python3 -m pytest -q tests/utils/test_functionals.py

Output:

tests/utils/test_functionals.py::test_traverse_obj PASSED
tests/utils/test_functionals.py::test_divide_by_chunk_size_valid PASSED
tests/utils/test_functionals.py::test_pad_to_length PASSED
tests/utils/test_functionals.py::test_compute_advantage_masks_values_before_gae_bootstrap PASSED
4 passed, 5 warnings in 3.48s

python3 -m pre_commit install
python3 -m pre_commit run --files roll/utils/functionals.py tests/utils/test_functionals.py

Output:

pre-commit installed at .git/hooks/pre-commit
isort....................................................................Passed
autoflake................................................................Passed
black....................................................................Passed
flake8...................................................................Passed

The same pre-commit hooks also passed when run by the commit hook during git commit.

Contribution and duplicate checks

Target upstream repo: alibaba/ROLL.

Contribution files checked locally:

  • README.md
  • .pre-commit-config.yaml
  • pyproject.toml

No root CONTRIBUTING.md was present in the checkout. The relevant local tooling is black, isort, autoflake, and flake8 through pre-commit, with 119-character line length.

Duplicate checks performed:

  • Searched the local BUG_FINDINGS.md ledger for GAE, masked critic values, bootstrap, response_mask, compute_advantage, and related terms.
  • Searched repair repo branches and remote refs for matching gae or masked-value fixes.
  • Searched existing pr_drafts/ for related ROLL masked-value PR drafts.
  • Searched loop artifact history for ROLL-GAE-MASKED-VALUE-BOOTSTRAP and related boundary names.
  • Queried upstream alibaba/ROLL PRs and issues for the boundary and symptom terms.
  • Checked upstream main source and found it still passes raw values into compute_gae_advantage_return() after assigning masked values back to the batch.

Result: no exact upstream issue, PR, branch, PR draft, or ledger duplicate was found for ROLL's finite masked critic-value GAE bootstrap bug.
