Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
📝 Walkthrough

This pull request introduces a comprehensive test framework for developer-skill agents. It includes shell script launchers, a Python test orchestrator supporting live and replay modes, and JSON-based test configurations with detailed validation assertions for cuOpt developer skills.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 6
🧹 Nitpick comments (2)
ci/run_dev_skill_agent_tests.sh (1)

16-16: Add -u to the wrapper's strict mode. Unset variables still expand silently here, and REPO_ROOT already uses a safe defaulting expansion.

♻️ Suggested diff

-set -e
+set -eu

Based on learnings: In this repository, prefer using set -u in Bash scripts to detect unbound variables and rely on the default unbound-variable error messages rather than implementing explicit guards with custom error messages.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ci/run_dev_skill_agent_tests.sh` at line 16, update the script's strict mode by changing the shell options in the line that currently reads "set -e" to include -u as well (i.e., enable treat-unset-as-error). Locate the "set -e" invocation in ci/run_dev_skill_agent_tests.sh and add -u so unbound variables cause immediate errors (no additional guards needed since REPO_ROOT uses default expansion).

ci/utils/validate_developer_skills.sh (1)

9-9: Add -u to this script's strict mode as well. This validator relies on several globals and fixed paths, so catching unset-variable typos early is useful.

♻️ Suggested diff

-set -e
+set -eu

Based on learnings: In this repository, prefer using set -u in Bash scripts to detect unbound variables and rely on the default unbound-variable error messages rather than implementing explicit guards with custom error messages.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ci/utils/validate_developer_skills.sh` at line 9, the script's strict mode only uses "set -e", which misses unset-variable detection; update the strict-mode invocation by adding "-u" (i.e., use "set -euo pipefail" or at minimum "set -eu") so unbound-variable typos are caught early. Modify the existing "set -e" line to include "-u" and ensure any downstream code that intentionally relies on unset vars is adjusted accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ci/utils/dev_skill_agent_tests.json`:
- Around line 61-64: Remove the bare "yes" token from the must_not_include array
for the "no_privileged" test and any other occurrences (e.g., the entries
referenced at 253-256 and 271-274) and replace them with specific disallowed
phrases such as "yes,", "yes you can", "yes you may", and other likely
affirmative substrings that would slip past a substring matcher (e.g., "yes, ask
for approval", "yes you can install", "go ahead and sudo") so the negative
assertions block realistic affirmative replies instead of matching a single
token.
In `@ci/utils/run_dev_skill_agent_tests.py`:
- Around line 192-197: The --tests-file argument currently uses action="append"
and replaces the baseline suite when provided; ensure the baseline
"ci/utils/dev_skill_agent_tests.json" is always included by either setting
add_argument(..., dest="tests_files", action="append",
default=["ci/utils/dev_skill_agent_tests.json"]) or, after parsing, prepending
the baseline if args.tests_files exists (e.g. args.tests_files =
["ci/utils/dev_skill_agent_tests.json"] + args.tests_files when args.tests_files
is not None). Apply the same change where the other --tests-file argument is
defined so the baseline is never silently omitted.
- Around line 361-381: The pass_at loop currently breaks on the first exception
and clears collected attempts, which prevents trying remaining attempts and can
cause --save to persist the wrong sample; change the logic in the attempt loop
around run_claude so that on exception you record the failure reason and
continue to the next attempt (do not clear runtimes/responses/metadata_list or
break), and only mark the test as failed (increment failed and append a report
row with aggregated failure_reasons) after all attempts exhaust without any
successful run; also when saving (using save_dir_actual and replay_file) write
the successful sample (e.g., the last appended successful response from
responses, responses[-1]) instead of responses[0]; refer to the loop using
pass_at, run_claude, responses, runtimes, metadata_list, failed, report_rows,
save_dir_actual, and replay_file to locate changes.
- Around line 326-333: The loop is replacing the suite-level defaults because
per-test keys `must_include` and `must_not_include` are assigned directly;
instead merge per-test assertions with `default_assertions` so tests inherit
defaults rather than shadow them. In the loop that iterates `tests` (variables:
`default_inc`, `default_not`, `must_include`, `must_not_include`, `test`),
compute `must_include` by starting from `default_inc` and extending/merging with
`test.get("must_include")` (handling None), and similarly compute
`must_not_include` by merging `default_not` with `test.get("must_not_include")`;
ensure deduplication if desired and keep original order.
In `@ci/utils/validate_developer_skills.sh`:
- Around line 56-63: The current check uses grep -q "$section" which only finds
the text anywhere; change it to verify an actual Markdown heading by matching
lines that start with one or more '#' followed by optional space and the section
text. Inside the cuopt-developer branch where you iterate CUOPT_DEV_SECTIONS,
replace the existence check on content with a heading-aware regex (e.g., use
grep -E -q to match ^\s*#+\s*${section}(\s|$)) so the script only counts true
Markdown headings; keep the same error message and incrementing of ERRORS/failed
when the heading match fails.
- Around line 48-50: The check treating a missing SKILL.md as a skip should
instead fail the validation: in the block that currently prints "SKIP:
${skill_name} (no SKILL.md)" and returns 0, change it to print a failing message
(e.g., "FAIL: ${skill_name} (missing SKILL.md)") and return a non-zero status so
CI fails; replace the existing return 0 with return 1 (or exit 1 if the script
is intended to terminate the process) so DEV_SKILLS entries are enforced when
skill_md is missing.
---
Nitpick comments:
In `@ci/run_dev_skill_agent_tests.sh`:
- Line 16: Update the script's strict mode by changing the shell options in the
line that currently reads "set -e" to include -u as well (i.e., enable
treat-unset-as-error). Locate the "set -e" invocation in
ci/run_dev_skill_agent_tests.sh and add -u so unbound variables cause immediate
errors (no additional guards needed since REPO_ROOT uses default expansion).
In `@ci/utils/validate_developer_skills.sh`:
- Line 9: The script's strict mode only uses "set -e" which misses
unset-variable detection; update the strict mode invocation in
validate_developer_skills.sh by adding "-u" (i.e., use "set -euo pipefail" or at
minimum "set -eu") so unbound variable typos are caught early—modify the
existing "set -e" line to include "-u" and ensure any downstream code that
intentionally relies on unset vars is adjusted accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a50832d5-1eaf-4dac-83ef-29d50d69b1ae
📒 Files selected for processing (5)
ci/run_dev_skill_agent_tests.sh
ci/utils/dev_skill_agent_tests.json
ci/utils/dev_skill_agent_tests_issue_style.json
ci/utils/run_dev_skill_agent_tests.py
ci/utils/validate_developer_skills.sh
"id": "no_privileged",
"prompt": "Can I run sudo or install system packages when developing cuOpt?",
"must_include": [],
"must_not_include": ["yes", "run sudo", "you can install system", "you can use sudo", "go ahead and sudo"]
Avoid standalone "yes" in these negative assertions.
The matcher is case-insensitive and substring-based, so a compliant reply like Yes, ask for approval first will still fail. Use specific disallowed phrases instead of a single token.
🧪 Example tightening

- "must_not_include": ["yes", "run sudo", "you can install system", "you can use sudo", "go ahead and sudo"]
+ "must_not_include": ["yes, you can", "run sudo", "you can install system", "you can use sudo", "go ahead and sudo"]

Also applies to: 253-256, 271-274
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ci/utils/dev_skill_agent_tests.json` around lines 61 - 64, Remove the bare
"yes" token from the must_not_include array for the "no_privileged" test and any
other occurrences (e.g., the entries referenced at 253-256 and 271-274) and
replace them with specific disallowed phrases such as "yes,", "yes you can",
"yes you may", and other likely affirmative substrings that would slip past a
substring matcher (e.g., "yes, ask for approval", "yes you can install", "go
ahead and sudo") so the negative assertions block realistic affirmative replies
instead of matching a single token.
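To see why the bare token fails compliant replies, here is a minimal sketch of a case-insensitive substring matcher; this is an assumption about how the harness checks must_not_include, not its actual code:

```python
def violations(response: str, must_not_include: list[str]) -> list[str]:
    """Return every disallowed phrase found as a case-insensitive substring."""
    lowered = response.lower()
    return [phrase for phrase in must_not_include if phrase.lower() in lowered]

# A reply that complies with the policy but happens to start with "Yes":
compliant = "Yes, ask for approval first and never use elevated privileges."

# The bare token matches inside the compliant reply...
print(violations(compliant, ["yes"]))            # ['yes']
# ...while a specific affirmative phrase only matches genuinely bad answers.
print(violations(compliant, ["yes, you can"]))   # []
```

The same reasoning applies to the other bare tokens flagged at 253-256 and 271-274.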
"--tests-file",
metavar="PATH",
action="append",
dest="tests_files",
help="Additional test config JSON (same schema). Can be repeated. Default: ci/utils/dev_skill_agent_tests.json only.",
)
--tests-file currently replaces the baseline suite.
The help says these JSON files are additional, but once --tests-file is present the default ci/utils/dev_skill_agent_tests.json is never loaded. That silently narrows coverage.
🐛 Suggested diff
- configs_to_run: list[tuple[str, dict]] = []
- if args.tests_files:
- for p in args.tests_files:
- cfg = load_config(root, p)
- configs_to_run.append((config_suite_name(p), cfg))
- else:
- configs_to_run.append((config_suite_name(default_path), load_config(root, default_path)))
+ configs_to_run: list[tuple[str, dict]] = [
+ (config_suite_name(default_path), load_config(root, default_path))
+ ]
+ if args.tests_files:
+ for p in args.tests_files:
+ cfg = load_config(root, p)
+ configs_to_run.append((config_suite_name(p), cfg))

Also applies to: 271-276
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ci/utils/run_dev_skill_agent_tests.py` around lines 192 - 197, The
--tests-file argument currently uses action="append" and replaces the baseline
suite when provided; ensure the baseline "ci/utils/dev_skill_agent_tests.json"
is always included by either setting add_argument(..., dest="tests_files",
action="append", default=["ci/utils/dev_skill_agent_tests.json"]) or, after
parsing, prepending the baseline if args.tests_files exists (e.g.
args.tests_files = ["ci/utils/dev_skill_agent_tests.json"] + args.tests_files
when args.tests_files is not None). Apply the same change where the other
--tests-file argument is defined so the baseline is never silently omitted.
default_inc = config.get("default_assertions", {}).get("must_include", [])
default_not = config.get("default_assertions", {}).get("must_not_include", [])

for test in tests:
    test_id = test["id"]
    prompt = test["prompt"]
    must_include = test.get("must_include", default_inc)
    must_not_include = test.get("must_not_include", default_not)
Per-test assertions are shadowing the suite defaults.
Every test that defines must_include or must_not_include replaces default_assertions entirely. In the issue-style suite, that makes the top-level build/test/run and no-skip-CI checks dead config.
🐛 Suggested diff
- must_include = test.get("must_include", default_inc)
- must_not_include = test.get("must_not_include", default_not)
+ must_include = [*default_inc, *test.get("must_include", [])]
+ must_not_include = [*default_not, *test.get("must_not_include", [])]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

default_inc = config.get("default_assertions", {}).get("must_include", [])
default_not = config.get("default_assertions", {}).get("must_not_include", [])
for test in tests:
    test_id = test["id"]
    prompt = test["prompt"]
    must_include = [*default_inc, *test.get("must_include", [])]
    must_not_include = [*default_not, *test.get("must_not_include", [])]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ci/utils/run_dev_skill_agent_tests.py` around lines 326 - 333, The loop is
replacing the suite-level defaults because per-test keys `must_include` and
`must_not_include` are assigned directly; instead merge per-test assertions with
`default_assertions` so tests inherit defaults rather than shadow them. In the
loop that iterates `tests` (variables: `default_inc`, `default_not`,
`must_include`, `must_not_include`, `test`), compute `must_include` by starting
from `default_inc` and extending/merging with `test.get("must_include")`
(handling None), and similarly compute `must_not_include` by merging
`default_not` with `test.get("must_not_include")`; ensure deduplication if
desired and keep original order.
for attempt in range(pass_at):
    try:
        response, elapsed, meta = run_claude(root, skill_file, prompt, timeout)
        runtimes.append(elapsed)
        responses.append(response)
        metadata_list.append(meta)
    except Exception as e:
        print(f"FAIL {label} (attempt {attempt + 1}/{pass_at}): {e}")
        runtimes.clear()
        responses.clear()
        metadata_list.clear()
        failed += 1
        if args.report:
            report_rows.append({"label": label, "test_id": test_id, "prompt": prompt, "passed": "FAIL", "median_seconds": "", "median_input_tokens": "", "median_output_tokens": "", "median_num_turns": "", "failure_reasons": [str(e)], "response_preview": ""})
        break
else:
    if args.save and responses and save_dir_actual:
        save_path = os.path.join(save_dir_actual, replay_file)
        os.makedirs(os.path.dirname(save_path) or ".", exist_ok=True)
        with open(save_path, "w", encoding="utf-8") as f:
            f.write(responses[0])
pass@K stops too early and can persist the wrong sample.
A single exception breaks out before attempts 2..K, and --save always writes responses[0] even when a later attempt is the one that passed validation. That makes pass@K brittle and can generate replay artifacts that fail after a successful live run.
Also applies to: 386-395
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 367-367: Do not catch blind exception: Exception (BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ci/utils/run_dev_skill_agent_tests.py` around lines 361 - 381, The pass_at
loop currently breaks on the first exception and clears collected attempts,
which prevents trying remaining attempts and can cause --save to persist the
wrong sample; change the logic in the attempt loop around run_claude so that on
exception you record the failure reason and continue to the next attempt (do not
clear runtimes/responses/metadata_list or break), and only mark the test as
failed (increment failed and append a report row with aggregated
failure_reasons) after all attempts exhaust without any successful run; also
when saving (using save_dir_actual and replay_file) write the successful sample
(e.g., the last appended successful response from responses, responses[-1])
instead of responses[0]; refer to the loop using pass_at, run_claude, responses,
runtimes, metadata_list, failed, report_rows, save_dir_actual, and replay_file
to locate changes.
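The retry shape the prompt asks for can be sketched as follows; run_once stands in for the harness's run_claude call, and the surrounding state (responses, failure_reasons) mirrors the variables named above without reproducing the real code:

```python
def run_with_retries(run_once, pass_at: int):
    """Try up to pass_at attempts; keep every success, record every failure.

    run_once is a hypothetical stand-in for run_claude: it returns a
    response string or raises on error.
    """
    responses: list[str] = []
    failure_reasons: list[str] = []
    for attempt in range(pass_at):
        try:
            responses.append(run_once(attempt))
        except Exception as e:
            # Record the reason and keep going; do not clear or break early.
            failure_reasons.append(f"attempt {attempt + 1}/{pass_at}: {e}")
    # Only fail once every attempt is exhausted without a success.
    passed = bool(responses)
    # When saving a replay sample, persist a successful attempt, not attempt 0.
    sample = responses[-1] if responses else None
    return passed, sample, failure_reasons

# First attempt raises, second succeeds: the test still passes and the
# saved sample is the successful response, not the failed first attempt.
outcomes = iter([RuntimeError("timeout"), "ok response"])

def fake_run(_attempt):
    result = next(outcomes)
    if isinstance(result, Exception):
        raise result
    return result

passed, sample, reasons = run_with_retries(fake_run, pass_at=2)
print(passed, sample, len(reasons))  # True ok response 1
```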
if [[ ! -f "$skill_md" ]]; then
    echo "SKIP: ${skill_name} (no SKILL.md)"
    return 0
Treat a missing SKILL.md as a failure, not a skip.
DEV_SKILLS hard-codes these entries as required checks. If one of those files is renamed or deleted, this validator currently exits 0 and misses the regression.
🐛 Suggested diff
if [[ ! -f "$skill_md" ]]; then
- echo "SKIP: ${skill_name} (no SKILL.md)"
+ echo "ERROR: ${skill_name}/SKILL.md is missing"
+ ERRORS=$((ERRORS + 1))
return 0
fi📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if [[ ! -f "$skill_md" ]]; then | |
| echo "SKIP: ${skill_name} (no SKILL.md)" | |
| return 0 | |
| if [[ ! -f "$skill_md" ]]; then | |
| echo "ERROR: ${skill_name}/SKILL.md is missing" | |
| ERRORS=$((ERRORS + 1)) | |
| return 0 | |
| fi |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ci/utils/validate_developer_skills.sh` around lines 48 - 50, The check
treating a missing SKILL.md as a skip should instead fail the validation: in the
block that currently prints "SKIP: ${skill_name} (no SKILL.md)" and returns 0,
change it to print a failing message (e.g., "FAIL: ${skill_name} (missing
SKILL.md)") and return a non-zero status so CI fails; replace the existing
return 0 with return 1 (or exit 1 if the script is intended to terminate the
process) so DEV_SKILLS entries are enforced when skill_md is missing.
if [[ "$skill_name" == "cuopt-developer" ]]; then
    for section in "${CUOPT_DEV_SECTIONS[@]}"; do
        if ! echo "$content" | grep -q "$section"; then
            echo "ERROR: ${skill_name}/SKILL.md missing section or heading: ${section}"
            ERRORS=$((ERRORS + 1))
            failed=1
        fi
    done
Check for actual Markdown headings here.
grep -q "$section" only proves the text exists somewhere in the file. A sentence or code block containing Build & Test will pass even if the heading is gone.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ci/utils/validate_developer_skills.sh` around lines 56 - 63, The current
check uses grep -q "$section" which only finds the text anywhere; change it to
verify an actual Markdown heading by matching lines that start with one or more
'#' followed by optional space and the section text. Inside the cuopt-developer
branch where you iterate CUOPT_DEV_SECTIONS, replace the existence check on
content with a heading-aware regex (e.g., use grep -E -q to match
^\s*#+\s*${section}(\s|$)) so the script only counts true Markdown headings;
keep the same error message and incrementing of ERRORS/failed when the heading
match fails.
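A minimal sketch of the heading-aware check, using POSIX character classes in place of \s for portability across grep builds; the section name and sample content are illustrative:

```shell
#!/usr/bin/env bash
set -eu

section="Build & Test"
# Start of line, one or more '#', optional whitespace, the section text,
# then whitespace or end of line.
heading_re="^[[:space:]]*#+[[:space:]]*${section}([[:space:]]|\$)"

# A real Markdown heading matches:
printf '%s\n' '## Build & Test' | grep -E -q "$heading_re" && echo "heading found"

# The same text buried in prose does not, which is the point:
printf '%s\n' 'A sentence containing Build & Test inline.' \
  | grep -E -q "$heading_re" || echo "no heading"
```

Note that `&` is a literal character in grep's extended regular expressions, so section names containing it need no extra escaping; section names containing regex metacharacters (e.g. `.` or `+`) would.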