Add dev skill evaluation by rgsl888prabhu · Pull Request #953 · NVIDIA/cuopt

rgsl888prabhu · 2026-03-12T03:18:59Z

Description

Issue

Checklist

I am familiar with the Contributing Guidelines.
Testing
- New or existing tests cover these changes
- Added tests
- Created an issue to follow-up
- NA
Documentation
- The documentation is up to date with these changes
- Added new documentation
- NA

copy-pr-bot · 2026-03-12T03:19:02Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-03-16T16:39:54Z

📝 Walkthrough

Walkthrough

This pull request introduces a comprehensive test framework for developer-skill agents. It includes shell script launchers, a Python test orchestrator supporting live and replay modes, and JSON-based test configurations with detailed validation assertions for cuOpt developer skills.

Changes

Cohort / File(s)	Summary
Test Launcher Scripts `ci/run_dev_skill_agent_tests.sh`, `ci/utils/validate_developer_skills.sh`	Adds shell scripts for initiating developer-skill agent tests and validating SKILL.md files. Launcher delegates to Python runner with argument forwarding; validator checks for required sections and phrases in skill files.
Test Configurations `ci/utils/dev_skill_agent_tests.json`, `ci/utils/dev_skill_agent_tests_issue_style.json`	Introduces JSON-based test case definitions for cuOpt developer skill agent validation. Standard config includes 19 test cases; issue-style config includes 6 test cases with default assertions, timeouts, and must_include/must_not_include criteria per test.
Test Runner `ci/utils/run_dev_skill_agent_tests.py`	Implements Python orchestrator supporting live Claude CLI execution and replay validation modes. Features configuration-driven test loading, negation-aware phrase validation, multi-attempt support, CSV/Markdown reporting, metadata collection (tokens/turns/duration), and graceful error handling.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 63.64% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	❓ Inconclusive	The PR description only contains the repository's contribution template with empty sections and no substantive content describing the changes, their purpose, or rationale.	Add a meaningful description explaining what dev skill evaluation does, why it was added, and how it functions; include details about the new scripts, test configurations, and validation approach.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Add dev skill evaluation' accurately describes the main change: introducing developer skill evaluation infrastructure through new scripts, test configurations, and validation utilities.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📝 Coding Plan

Generate coding plan for human review comments

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Tip

You can enable review details to help with troubleshooting, context usage and more.

Enable the reviews.review_details setting to include review details such as the model used, the time taken for each step and more in the review comments.

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (2)

ci/run_dev_skill_agent_tests.sh (1)
16-16: Add -u to the wrapper's strict mode.

Unset variables still expand silently here, and REPO_ROOT already uses a safe defaulting expansion.
♻️ Suggested diff
-set -e
+set -eu
Based on learnings: In this repository, prefer using set -u in Bash scripts to detect unbound variables and rely on the default unbound-variable error messages rather than implementing explicit guards with custom error messages.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/run_dev_skill_agent_tests.sh` at line 16, Update the script's strict mode
by changing the shell options in the line that currently reads "set -e" to
include -u as well (i.e., enable treat-unset-as-error). Locate the "set -e"
invocation in ci/run_dev_skill_agent_tests.sh and add -u so unbound variables
cause immediate errors (no additional guards needed since REPO_ROOT uses default
expansion).
ci/utils/validate_developer_skills.sh (1)
9-9: Add -u to this script's strict mode as well.

This validator relies on several globals and fixed paths, so catching unset-variable typos early is useful.
♻️ Suggested diff
-set -e
+set -eu
Based on learnings: In this repository, prefer using set -u in Bash scripts to detect unbound variables and rely on the default unbound-variable error messages rather than implementing explicit guards with custom error messages.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/validate_developer_skills.sh` at line 9, The script's strict mode
only uses "set -e" which misses unset-variable detection; update the strict mode
invocation in validate_developer_skills.sh by adding "-u" (i.e., use "set -euo
pipefail" or at minimum "set -eu") so unbound variable typos are caught
early—modify the existing "set -e" line to include "-u" and ensure any
downstream code that intentionally relies on unset vars is adjusted accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ci/utils/dev_skill_agent_tests.json`:
- Around line 61-64: Remove the bare "yes" token from the must_not_include array
for the "no_privileged" test and any other occurrences (e.g., the entries
referenced at 253-256 and 271-274) and replace them with specific disallowed
phrases such as "yes,", "yes you can", "yes you may", and other likely
affirmative substrings that would slip past a substring matcher (e.g., "yes, ask
for approval", "yes you can install", "go ahead and sudo") so the negative
assertions block realistic affirmative replies instead of matching a single
token.

In `@ci/utils/run_dev_skill_agent_tests.py`:
- Around line 192-197: The --tests-file argument currently uses action="append"
and replaces the baseline suite when provided; ensure the baseline
"ci/utils/dev_skill_agent_tests.json" is always included by either setting
add_argument(..., dest="tests_files", action="append",
default=["ci/utils/dev_skill_agent_tests.json"]) or, after parsing, prepending
the baseline if args.tests_files exists (e.g. args.tests_files =
["ci/utils/dev_skill_agent_tests.json"] + args.tests_files when args.tests_files
is not None). Apply the same change where the other --tests-file argument is
defined so the baseline is never silently omitted.
- Around line 361-381: The pass_at loop currently breaks on the first exception
and clears collected attempts, which prevents trying remaining attempts and can
cause --save to persist the wrong sample; change the logic in the attempt loop
around run_claude so that on exception you record the failure reason and
continue to the next attempt (do not clear runtimes/responses/metadata_list or
break), and only mark the test as failed (increment failed and append a report
row with aggregated failure_reasons) after all attempts exhaust without any
successful run; also when saving (using save_dir_actual and replay_file) write
the successful sample (e.g., the last appended successful response from
responses, responses[-1]) instead of responses[0]; refer to the loop using
pass_at, run_claude, responses, runtimes, metadata_list, failed, report_rows,
save_dir_actual, and replay_file to locate changes.
- Around line 326-333: The loop is replacing the suite-level defaults because
per-test keys `must_include` and `must_not_include` are assigned directly;
instead merge per-test assertions with `default_assertions` so tests inherit
defaults rather than shadow them. In the loop that iterates `tests` (variables:
`default_inc`, `default_not`, `must_include`, `must_not_include`, `test`),
compute `must_include` by starting from `default_inc` and extending/merging with
`test.get("must_include")` (handling None), and similarly compute
`must_not_include` by merging `default_not` with `test.get("must_not_include")`;
ensure deduplication if desired and keep original order.

In `@ci/utils/validate_developer_skills.sh`:
- Around line 56-63: The current check uses grep -q "$section" which only finds
the text anywhere; change it to verify an actual Markdown heading by matching
lines that start with one or more '#' followed by optional space and the section
text. Inside the cuopt-developer branch where you iterate CUOPT_DEV_SECTIONS,
replace the existence check on content with a heading-aware regex (e.g., use
grep -E -q to match ^\s*#+\s*${section}(\s|$)) so the script only counts true
Markdown headings; keep the same error message and incrementing of ERRORS/failed
when the heading match fails.
- Around line 48-50: The check treating a missing SKILL.md as a skip should
instead fail the validation: in the block that currently prints "SKIP:
${skill_name} (no SKILL.md)" and returns 0, change it to print a failing message
(e.g., "FAIL: ${skill_name} (missing SKILL.md)") and return a non-zero status so
CI fails; replace the existing return 0 with return 1 (or exit 1 if the script
is intended to terminate the process) so DEV_SKILLS entries are enforced when
skill_md is missing.

---

Nitpick comments:
In `@ci/run_dev_skill_agent_tests.sh`:
- Line 16: Update the script's strict mode by changing the shell options in the
line that currently reads "set -e" to include -u as well (i.e., enable
treat-unset-as-error). Locate the "set -e" invocation in
ci/run_dev_skill_agent_tests.sh and add -u so unbound variables cause immediate
errors (no additional guards needed since REPO_ROOT uses default expansion).

In `@ci/utils/validate_developer_skills.sh`:
- Line 9: The script's strict mode only uses "set -e" which misses
unset-variable detection; update the strict mode invocation in
validate_developer_skills.sh by adding "-u" (i.e., use "set -euo pipefail" or at
minimum "set -eu") so unbound variable typos are caught early—modify the
existing "set -e" line to include "-u" and ensure any downstream code that
intentionally relies on unset vars is adjusted accordingly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a50832d5-1eaf-4dac-83ef-29d50d69b1ae

📥 Commits

Reviewing files that changed from the base of the PR and between d531ad1 and 1e9b4db.

📒 Files selected for processing (5)

ci/run_dev_skill_agent_tests.sh
ci/utils/dev_skill_agent_tests.json
ci/utils/dev_skill_agent_tests_issue_style.json
ci/utils/run_dev_skill_agent_tests.py
ci/utils/validate_developer_skills.sh

coderabbitai · 2026-03-16T16:39:57Z

ci/utils/dev_skill_agent_tests.json

+      "id": "no_privileged",
+      "prompt": "Can I run sudo or install system packages when developing cuOpt?",
+      "must_include": [],
+      "must_not_include": ["yes", "run sudo", "you can install system", "you can use sudo", "go ahead and sudo"]


⚠️ Potential issue | 🟠 Major

Avoid standalone "yes" in these negative assertions.

The matcher is case-insensitive and substring-based, so a compliant reply like Yes, ask for approval first will still fail. Use specific disallowed phrases instead of a single token.

🧪 Example tightening

- "must_not_include": ["yes", "run sudo", "you can install system", "you can use sudo", "go ahead and sudo"] + "must_not_include": ["yes, you can", "run sudo", "you can install system", "you can use sudo", "go ahead and sudo"]

Also applies to: 253-256, 271-274

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ci/utils/dev_skill_agent_tests.json` around lines 61 - 64, Remove the bare "yes" token from the must_not_include array for the "no_privileged" test and any other occurrences (e.g., the entries referenced at 253-256 and 271-274) and replace them with specific disallowed phrases such as "yes,", "yes you can", "yes you may", and other likely affirmative substrings that would slip past a substring matcher (e.g., "yes, ask for approval", "yes you can install", "go ahead and sudo") so the negative assertions block realistic affirmative replies instead of matching a single token.

coderabbitai · 2026-03-16T16:39:57Z

ci/utils/run_dev_skill_agent_tests.py

+        "--tests-file",
+        metavar="PATH",
+        action="append",
+        dest="tests_files",
+        help="Additional test config JSON (same schema). Can be repeated. Default: ci/utils/dev_skill_agent_tests.json only.",
+    )


⚠️ Potential issue | 🟠 Major

--tests-file currently replaces the baseline suite.

The help says these JSON files are additional, but once --tests-file is present the default ci/utils/dev_skill_agent_tests.json is never loaded. That silently narrows coverage.

🐛 Suggested diff

- configs_to_run: list[tuple[str, dict]] = [] - if args.tests_files: - for p in args.tests_files: - cfg = load_config(root, p) - configs_to_run.append((config_suite_name(p), cfg)) - else: - configs_to_run.append((config_suite_name(default_path), load_config(root, default_path))) + configs_to_run: list[tuple[str, dict]] = [ + (config_suite_name(default_path), load_config(root, default_path)) + ] + if args.tests_files: + for p in args.tests_files: + cfg = load_config(root, p) + configs_to_run.append((config_suite_name(p), cfg))

Also applies to: 271-276

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ci/utils/run_dev_skill_agent_tests.py` around lines 192 - 197, The --tests-file argument currently uses action="append" and replaces the baseline suite when provided; ensure the baseline "ci/utils/dev_skill_agent_tests.json" is always included by either setting add_argument(..., dest="tests_files", action="append", default=["ci/utils/dev_skill_agent_tests.json"]) or, after parsing, prepending the baseline if args.tests_files exists (e.g. args.tests_files = ["ci/utils/dev_skill_agent_tests.json"] + args.tests_files when args.tests_files is not None). Apply the same change where the other --tests-file argument is defined so the baseline is never silently omitted.

coderabbitai · 2026-03-16T16:39:57Z

ci/utils/run_dev_skill_agent_tests.py

+        default_inc = config.get("default_assertions", {}).get("must_include", [])
+        default_not = config.get("default_assertions", {}).get("must_not_include", [])
+
+        for test in tests:
+            test_id = test["id"]
+            prompt = test["prompt"]
+            must_include = test.get("must_include", default_inc)
+            must_not_include = test.get("must_not_include", default_not)


⚠️ Potential issue | 🟠 Major

Per-test assertions are shadowing the suite defaults.

Every test that defines must_include or must_not_include replaces default_assertions entirely. In the issue-style suite, that makes the top-level build/test/run and no-skip-CI checks dead config.

🐛 Suggested diff

- must_include = test.get("must_include", default_inc) - must_not_include = test.get("must_not_include", default_not) + must_include = [*default_inc, *test.get("must_include", [])] + must_not_include = [*default_not, *test.get("must_not_include", [])]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

default_inc = config.get("default_assertions", {}).get("must_include", [])

default_not = config.get("default_assertions", {}).get("must_not_include", [])

for test in tests:

test_id = test["id"]

prompt = test["prompt"]

must_include = test.get("must_include", default_inc)

must_not_include = test.get("must_not_include", default_not)

default_inc = config.get("default_assertions", {}).get("must_include", [])

default_not = config.get("default_assertions", {}).get("must_not_include", [])

for test in tests:

test_id = test["id"]

prompt = test["prompt"]

must_include = [*default_inc, *test.get("must_include", [])]

must_not_include = [*default_not, *test.get("must_not_include", [])]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ci/utils/run_dev_skill_agent_tests.py` around lines 326 - 333, The loop is replacing the suite-level defaults because per-test keys `must_include` and `must_not_include` are assigned directly; instead merge per-test assertions with `default_assertions` so tests inherit defaults rather than shadow them. In the loop that iterates `tests` (variables: `default_inc`, `default_not`, `must_include`, `must_not_include`, `test`), compute `must_include` by starting from `default_inc` and extending/merging with `test.get("must_include")` (handling None), and similarly compute `must_not_include` by merging `default_not` with `test.get("must_not_include")`; ensure deduplication if desired and keep original order.

coderabbitai · 2026-03-16T16:39:57Z

ci/utils/run_dev_skill_agent_tests.py

+                for attempt in range(pass_at):
+                    try:
+                        response, elapsed, meta = run_claude(root, skill_file, prompt, timeout)
+                        runtimes.append(elapsed)
+                        responses.append(response)
+                        metadata_list.append(meta)
+                    except Exception as e:
+                        print(f"FAIL {label} (attempt {attempt + 1}/{pass_at}): {e}")
+                        runtimes.clear()
+                        responses.clear()
+                        metadata_list.clear()
+                        failed += 1
+                        if args.report:
+                            report_rows.append({"label": label, "test_id": test_id, "prompt": prompt, "passed": "FAIL", "median_seconds": "", "median_input_tokens": "", "median_output_tokens": "", "median_num_turns": "", "failure_reasons": [str(e)], "response_preview": ""})
+                        break
+                else:
+                    if args.save and responses and save_dir_actual:
+                        save_path = os.path.join(save_dir_actual, replay_file)
+                        os.makedirs(os.path.dirname(save_path) or ".", exist_ok=True)
+                        with open(save_path, "w", encoding="utf-8") as f:
+                            f.write(responses[0])


⚠️ Potential issue | 🟠 Major

pass@K stops too early and can persist the wrong sample.

A single exception breaks out before attempts 2..K, and --save always writes responses[0] even when a later attempt is the one that passed validation. That makes pass@K brittle and can generate replay artifacts that fail after a successful live run.

Also applies to: 386-395

🧰 Tools

🪛 Ruff (0.15.6)

[warning] 367-367: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ci/utils/run_dev_skill_agent_tests.py` around lines 361 - 381, The pass_at loop currently breaks on the first exception and clears collected attempts, which prevents trying remaining attempts and can cause --save to persist the wrong sample; change the logic in the attempt loop around run_claude so that on exception you record the failure reason and continue to the next attempt (do not clear runtimes/responses/metadata_list or break), and only mark the test as failed (increment failed and append a report row with aggregated failure_reasons) after all attempts exhaust without any successful run; also when saving (using save_dir_actual and replay_file) write the successful sample (e.g., the last appended successful response from responses, responses[-1]) instead of responses[0]; refer to the loop using pass_at, run_claude, responses, runtimes, metadata_list, failed, report_rows, save_dir_actual, and replay_file to locate changes.

coderabbitai · 2026-03-16T16:39:57Z

ci/utils/validate_developer_skills.sh

+  if [[ ! -f "$skill_md" ]]; then
+    echo "SKIP: ${skill_name} (no SKILL.md)"
+    return 0


⚠️ Potential issue | 🟠 Major

Treat a missing SKILL.md as a failure, not a skip.

DEV_SKILLS hard-codes these entries as required checks. If one of those files is renamed or deleted, this validator currently exits 0 and misses the regression.

🐛 Suggested diff

if [[ ! -f "$skill_md" ]]; then - echo "SKIP: ${skill_name} (no SKILL.md)" + echo "ERROR: ${skill_name}/SKILL.md is missing" + ERRORS=$((ERRORS + 1)) return 0 fi

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if [[ ! -f "$skill_md" ]]; then

echo "SKIP: ${skill_name} (no SKILL.md)"

return 0

if [[ ! -f "$skill_md" ]]; then

echo "ERROR: ${skill_name}/SKILL.md is missing"

ERRORS=$((ERRORS + 1))

return 0

fi

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ci/utils/validate_developer_skills.sh` around lines 48 - 50, The check treating a missing SKILL.md as a skip should instead fail the validation: in the block that currently prints "SKIP: ${skill_name} (no SKILL.md)" and returns 0, change it to print a failing message (e.g., "FAIL: ${skill_name} (missing SKILL.md)") and return a non-zero status so CI fails; replace the existing return 0 with return 1 (or exit 1 if the script is intended to terminate the process) so DEV_SKILLS entries are enforced when skill_md is missing.

coderabbitai · 2026-03-16T16:39:57Z

ci/utils/validate_developer_skills.sh

+  if [[ "$skill_name" == "cuopt-developer" ]]; then
+    for section in "${CUOPT_DEV_SECTIONS[@]}"; do
+      if ! echo "$content" | grep -q "$section"; then
+        echo "ERROR: ${skill_name}/SKILL.md missing section or heading: ${section}"
+        ERRORS=$((ERRORS + 1))
+        failed=1
+      fi
+    done


⚠️ Potential issue | 🟡 Minor

Check for actual Markdown headings here.

grep -q "$section" only proves the text exists somewhere in the file. A sentence or code block containing Build & Test will pass even if the heading is gone.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ci/utils/validate_developer_skills.sh` around lines 56 - 63, The current check uses grep -q "$section" which only finds the text anywhere; change it to verify an actual Markdown heading by matching lines that start with one or more '#' followed by optional space and the section text. Inside the cuopt-developer branch where you iterate CUOPT_DEV_SECTIONS, replace the existence check on content with a heading-aware regex (e.g., use grep -E -q to match ^\s*#+\s*${section}(\s|$)) so the script only counts true Markdown headings; keep the same error message and incrementing of ERRORS/failed when the heading match fails.

rgsl888prabhu added 5 commits March 11, 2026 12:41

fix

832fbe7

update

ec57400

update

9f7d588

fix test cases

5569482

Address review

b073f29

update status details

8b5ce38

anandhkb added this to the 26.04 milestone Mar 12, 2026

add no skill option

4155648

rgsl888prabhu marked this pull request as ready for review March 16, 2026 16:30

rgsl888prabhu requested a review from a team as a code owner March 16, 2026 16:30

rgsl888prabhu requested a review from msarahan March 16, 2026 16:30

Merge branch 'main' into add_dev_skill_evaluation

1e9b4db

coderabbitai bot reviewed Mar 16, 2026

View reviewed changes

rgsl888prabhu changed the base branch from main to release/26.04 March 19, 2026 16:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add dev skill evaluation#953

Add dev skill evaluation#953
rgsl888prabhu wants to merge 8 commits intoNVIDIA:release/26.04from
rgsl888prabhu:add_dev_skill_evaluation

rgsl888prabhu commented Mar 12, 2026

Uh oh!

copy-pr-bot bot commented Mar 12, 2026

Uh oh!

coderabbitai bot commented Mar 16, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 16, 2026

Uh oh!

coderabbitai bot Mar 16, 2026

Uh oh!

coderabbitai bot Mar 16, 2026

Uh oh!

coderabbitai bot Mar 16, 2026

Uh oh!

coderabbitai bot Mar 16, 2026

Uh oh!

coderabbitai bot Mar 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rgsl888prabhu commented Mar 12, 2026

Description

Issue

Checklist

Uh oh!

copy-pr-bot bot commented Mar 12, 2026

Uh oh!

coderabbitai bot commented Mar 16, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 16, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants