Skip to content

Add dev skill evaluation#953

Open
rgsl888prabhu wants to merge 8 commits intoNVIDIA:release/26.04from
rgsl888prabhu:add_dev_skill_evaluation
Open

Add dev skill evaluation#953
rgsl888prabhu wants to merge 8 commits intoNVIDIA:release/26.04from
rgsl888prabhu:add_dev_skill_evaluation

Conversation

@rgsl888prabhu
Copy link
Collaborator

Description

Issue

Checklist

  • I am familiar with the Contributing Guidelines.
  • Testing
    • New or existing tests cover these changes
    • Added tests
    • Created an issue to follow-up
    • NA
  • Documentation
    • The documentation is up to date with these changes
    • Added new documentation
    • NA

@copy-pr-bot
Copy link

copy-pr-bot bot commented Mar 12, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@anandhkb anandhkb added this to the 26.04 milestone Mar 12, 2026
@rgsl888prabhu rgsl888prabhu marked this pull request as ready for review March 16, 2026 16:30
@rgsl888prabhu rgsl888prabhu requested a review from a team as a code owner March 16, 2026 16:30
@rgsl888prabhu rgsl888prabhu requested a review from msarahan March 16, 2026 16:30
@coderabbitai
Copy link

coderabbitai bot commented Mar 16, 2026

📝 Walkthrough

Walkthrough

This pull request introduces a comprehensive test framework for developer-skill agents. It includes shell script launchers, a Python test orchestrator supporting live and replay modes, and JSON-based test configurations with detailed validation assertions for cuOpt developer skills.

Changes

Cohort / File(s) Summary
Test Launcher Scripts
ci/run_dev_skill_agent_tests.sh, ci/utils/validate_developer_skills.sh
Adds shell scripts for initiating developer-skill agent tests and validating SKILL.md files. Launcher delegates to Python runner with argument forwarding; validator checks for required sections and phrases in skill files.
Test Configurations
ci/utils/dev_skill_agent_tests.json, ci/utils/dev_skill_agent_tests_issue_style.json
Introduces JSON-based test case definitions for cuOpt developer skill agent validation. Standard config includes 19 test cases; issue-style config includes 6 test cases with default assertions, timeouts, and must_include/must_not_include criteria per test.
Test Runner
ci/utils/run_dev_skill_agent_tests.py
Implements Python orchestrator supporting live Claude CLI execution and replay validation modes. Features configuration-driven test loading, negation-aware phrase validation, multi-attempt support, CSV/Markdown reporting, metadata collection (tokens/turns/duration), and graceful error handling.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 63.64% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The PR description only contains the repository's contribution template with empty sections and no substantive content describing the changes, their purpose, or rationale. Add a meaningful description explaining what dev skill evaluation does, why it was added, and how it functions; include details about the new scripts, test configurations, and validation approach.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Add dev skill evaluation' accurately describes the main change: introducing developer skill evaluation infrastructure through new scripts, test configurations, and validation utilities.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can enable review details to help with troubleshooting, context usage and more.

Enable the reviews.review_details setting to include review details such as the model used, the time taken for each step and more in the review comments.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (2)
ci/run_dev_skill_agent_tests.sh (1)

16-16: Add -u to the wrapper's strict mode.

Unset variables still expand silently here, and REPO_ROOT already uses a safe defaulting expansion.

♻️ Suggested diff
-set -e
+set -eu

Based on learnings: In this repository, prefer using set -u in Bash scripts to detect unbound variables and rely on the default unbound-variable error messages rather than implementing explicit guards with custom error messages.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/run_dev_skill_agent_tests.sh` at line 16, Update the script's strict mode
by changing the shell options in the line that currently reads "set -e" to
include -u as well (i.e., enable treat-unset-as-error). Locate the "set -e"
invocation in ci/run_dev_skill_agent_tests.sh and add -u so unbound variables
cause immediate errors (no additional guards needed since REPO_ROOT uses default
expansion).
ci/utils/validate_developer_skills.sh (1)

9-9: Add -u to this script's strict mode as well.

This validator relies on several globals and fixed paths, so catching unset-variable typos early is useful.

♻️ Suggested diff
-set -e
+set -eu

Based on learnings: In this repository, prefer using set -u in Bash scripts to detect unbound variables and rely on the default unbound-variable error messages rather than implementing explicit guards with custom error messages.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/validate_developer_skills.sh` at line 9, The script's strict mode
only uses "set -e" which misses unset-variable detection; update the strict mode
invocation in validate_developer_skills.sh by adding "-u" (i.e., use "set -euo
pipefail" or at minimum "set -eu") so unbound variable typos are caught
early—modify the existing "set -e" line to include "-u" and ensure any
downstream code that intentionally relies on unset vars is adjusted accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ci/utils/dev_skill_agent_tests.json`:
- Around line 61-64: Remove the bare "yes" token from the must_not_include array
for the "no_privileged" test and any other occurrences (e.g., the entries
referenced at 253-256 and 271-274) and replace them with specific disallowed
phrases such as "yes,", "yes you can", "yes you may", and other likely
affirmative substrings that would slip past a substring matcher (e.g., "yes, ask
for approval", "yes you can install", "go ahead and sudo") so the negative
assertions block realistic affirmative replies instead of matching a single
token.

In `@ci/utils/run_dev_skill_agent_tests.py`:
- Around line 192-197: The --tests-file argument currently uses action="append"
and replaces the baseline suite when provided; ensure the baseline
"ci/utils/dev_skill_agent_tests.json" is always included by either setting
add_argument(..., dest="tests_files", action="append",
default=["ci/utils/dev_skill_agent_tests.json"]) or, after parsing, prepending
the baseline if args.tests_files exists (e.g. args.tests_files =
["ci/utils/dev_skill_agent_tests.json"] + args.tests_files when args.tests_files
is not None). Apply the same change where the other --tests-file argument is
defined so the baseline is never silently omitted.
- Around line 361-381: The pass_at loop currently breaks on the first exception
and clears collected attempts, which prevents trying remaining attempts and can
cause --save to persist the wrong sample; change the logic in the attempt loop
around run_claude so that on exception you record the failure reason and
continue to the next attempt (do not clear runtimes/responses/metadata_list or
break), and only mark the test as failed (increment failed and append a report
row with aggregated failure_reasons) after all attempts exhaust without any
successful run; also when saving (using save_dir_actual and replay_file) write
the successful sample (e.g., the last appended successful response from
responses, responses[-1]) instead of responses[0]; refer to the loop using
pass_at, run_claude, responses, runtimes, metadata_list, failed, report_rows,
save_dir_actual, and replay_file to locate changes.
- Around line 326-333: The loop is replacing the suite-level defaults because
per-test keys `must_include` and `must_not_include` are assigned directly;
instead merge per-test assertions with `default_assertions` so tests inherit
defaults rather than shadow them. In the loop that iterates `tests` (variables:
`default_inc`, `default_not`, `must_include`, `must_not_include`, `test`),
compute `must_include` by starting from `default_inc` and extending/merging with
`test.get("must_include")` (handling None), and similarly compute
`must_not_include` by merging `default_not` with `test.get("must_not_include")`;
ensure deduplication if desired and keep original order.

In `@ci/utils/validate_developer_skills.sh`:
- Around line 56-63: The current check uses grep -q "$section" which only finds
the text anywhere; change it to verify an actual Markdown heading by matching
lines that start with one or more '#' followed by optional space and the section
text. Inside the cuopt-developer branch where you iterate CUOPT_DEV_SECTIONS,
replace the existence check on content with a heading-aware regex (e.g., use
grep -E -q to match ^\s*#+\s*${section}(\s|$)) so the script only counts true
Markdown headings; keep the same error message and incrementing of ERRORS/failed
when the heading match fails.
- Around line 48-50: The check treating a missing SKILL.md as a skip should
instead fail the validation: in the block that currently prints "SKIP:
${skill_name} (no SKILL.md)" and returns 0, change it to print a failing message
(e.g., "FAIL: ${skill_name} (missing SKILL.md)") and return a non-zero status so
CI fails; replace the existing return 0 with return 1 (or exit 1 if the script
is intended to terminate the process) so DEV_SKILLS entries are enforced when
skill_md is missing.

---

Nitpick comments:
In `@ci/run_dev_skill_agent_tests.sh`:
- Line 16: Update the script's strict mode by changing the shell options in the
line that currently reads "set -e" to include -u as well (i.e., enable
treat-unset-as-error). Locate the "set -e" invocation in
ci/run_dev_skill_agent_tests.sh and add -u so unbound variables cause immediate
errors (no additional guards needed since REPO_ROOT uses default expansion).

In `@ci/utils/validate_developer_skills.sh`:
- Line 9: The script's strict mode only uses "set -e" which misses
unset-variable detection; update the strict mode invocation in
validate_developer_skills.sh by adding "-u" (i.e., use "set -euo pipefail" or at
minimum "set -eu") so unbound variable typos are caught early—modify the
existing "set -e" line to include "-u" and ensure any downstream code that
intentionally relies on unset vars is adjusted accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a50832d5-1eaf-4dac-83ef-29d50d69b1ae

📥 Commits

Reviewing files that changed from the base of the PR and between d531ad1 and 1e9b4db.

📒 Files selected for processing (5)
  • ci/run_dev_skill_agent_tests.sh
  • ci/utils/dev_skill_agent_tests.json
  • ci/utils/dev_skill_agent_tests_issue_style.json
  • ci/utils/run_dev_skill_agent_tests.py
  • ci/utils/validate_developer_skills.sh

Comment on lines +61 to +64
"id": "no_privileged",
"prompt": "Can I run sudo or install system packages when developing cuOpt?",
"must_include": [],
"must_not_include": ["yes", "run sudo", "you can install system", "you can use sudo", "go ahead and sudo"]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid standalone "yes" in these negative assertions.

The matcher is case-insensitive and substring-based, so a compliant reply like Yes, ask for approval first will still fail. Use specific disallowed phrases instead of a single token.

🧪 Example tightening
-      "must_not_include": ["yes", "run sudo", "you can install system", "you can use sudo", "go ahead and sudo"]
+      "must_not_include": ["yes, you can", "run sudo", "you can install system", "you can use sudo", "go ahead and sudo"]

Also applies to: 253-256, 271-274

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/dev_skill_agent_tests.json` around lines 61 - 64, Remove the bare
"yes" token from the must_not_include array for the "no_privileged" test and any
other occurrences (e.g., the entries referenced at 253-256 and 271-274) and
replace them with specific disallowed phrases such as "yes,", "yes you can",
"yes you may", and other likely affirmative substrings that would slip past a
substring matcher (e.g., "yes, ask for approval", "yes you can install", "go
ahead and sudo") so the negative assertions block realistic affirmative replies
instead of matching a single token.

Comment on lines +192 to +197
"--tests-file",
metavar="PATH",
action="append",
dest="tests_files",
help="Additional test config JSON (same schema). Can be repeated. Default: ci/utils/dev_skill_agent_tests.json only.",
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

--tests-file currently replaces the baseline suite.

The help says these JSON files are additional, but once --tests-file is present the default ci/utils/dev_skill_agent_tests.json is never loaded. That silently narrows coverage.

🐛 Suggested diff
-    configs_to_run: list[tuple[str, dict]] = []
-    if args.tests_files:
-        for p in args.tests_files:
-            cfg = load_config(root, p)
-            configs_to_run.append((config_suite_name(p), cfg))
-    else:
-        configs_to_run.append((config_suite_name(default_path), load_config(root, default_path)))
+    configs_to_run: list[tuple[str, dict]] = [
+        (config_suite_name(default_path), load_config(root, default_path))
+    ]
+    if args.tests_files:
+        for p in args.tests_files:
+            cfg = load_config(root, p)
+            configs_to_run.append((config_suite_name(p), cfg))

Also applies to: 271-276

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/run_dev_skill_agent_tests.py` around lines 192 - 197, The
--tests-file argument currently uses action="append" and replaces the baseline
suite when provided; ensure the baseline "ci/utils/dev_skill_agent_tests.json"
is always included by either setting add_argument(..., dest="tests_files",
action="append", default=["ci/utils/dev_skill_agent_tests.json"]) or, after
parsing, prepending the baseline if args.tests_files exists (e.g.
args.tests_files = ["ci/utils/dev_skill_agent_tests.json"] + args.tests_files
when args.tests_files is not None). Apply the same change where the other
--tests-file argument is defined so the baseline is never silently omitted.

Comment on lines +326 to +333
default_inc = config.get("default_assertions", {}).get("must_include", [])
default_not = config.get("default_assertions", {}).get("must_not_include", [])

for test in tests:
test_id = test["id"]
prompt = test["prompt"]
must_include = test.get("must_include", default_inc)
must_not_include = test.get("must_not_include", default_not)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Per-test assertions are shadowing the suite defaults.

Every test that defines must_include or must_not_include replaces default_assertions entirely. In the issue-style suite, that makes the top-level build/test/run and no-skip-CI checks dead config.

🐛 Suggested diff
-            must_include = test.get("must_include", default_inc)
-            must_not_include = test.get("must_not_include", default_not)
+            must_include = [*default_inc, *test.get("must_include", [])]
+            must_not_include = [*default_not, *test.get("must_not_include", [])]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
default_inc = config.get("default_assertions", {}).get("must_include", [])
default_not = config.get("default_assertions", {}).get("must_not_include", [])
for test in tests:
test_id = test["id"]
prompt = test["prompt"]
must_include = test.get("must_include", default_inc)
must_not_include = test.get("must_not_include", default_not)
default_inc = config.get("default_assertions", {}).get("must_include", [])
default_not = config.get("default_assertions", {}).get("must_not_include", [])
for test in tests:
test_id = test["id"]
prompt = test["prompt"]
must_include = [*default_inc, *test.get("must_include", [])]
must_not_include = [*default_not, *test.get("must_not_include", [])]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/run_dev_skill_agent_tests.py` around lines 326 - 333, The loop is
replacing the suite-level defaults because per-test keys `must_include` and
`must_not_include` are assigned directly; instead merge per-test assertions with
`default_assertions` so tests inherit defaults rather than shadow them. In the
loop that iterates `tests` (variables: `default_inc`, `default_not`,
`must_include`, `must_not_include`, `test`), compute `must_include` by starting
from `default_inc` and extending/merging with `test.get("must_include")`
(handling None), and similarly compute `must_not_include` by merging
`default_not` with `test.get("must_not_include")`; ensure deduplication if
desired and keep original order.

Comment on lines +361 to +381
for attempt in range(pass_at):
try:
response, elapsed, meta = run_claude(root, skill_file, prompt, timeout)
runtimes.append(elapsed)
responses.append(response)
metadata_list.append(meta)
except Exception as e:
print(f"FAIL {label} (attempt {attempt + 1}/{pass_at}): {e}")
runtimes.clear()
responses.clear()
metadata_list.clear()
failed += 1
if args.report:
report_rows.append({"label": label, "test_id": test_id, "prompt": prompt, "passed": "FAIL", "median_seconds": "", "median_input_tokens": "", "median_output_tokens": "", "median_num_turns": "", "failure_reasons": [str(e)], "response_preview": ""})
break
else:
if args.save and responses and save_dir_actual:
save_path = os.path.join(save_dir_actual, replay_file)
os.makedirs(os.path.dirname(save_path) or ".", exist_ok=True)
with open(save_path, "w", encoding="utf-8") as f:
f.write(responses[0])
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

pass@K stops too early and can persist the wrong sample.

A single exception breaks out before attempts 2..K, and --save always writes responses[0] even when a later attempt is the one that passed validation. That makes pass@K brittle and can generate replay artifacts that fail after a successful live run.

Also applies to: 386-395

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 367-367: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/run_dev_skill_agent_tests.py` around lines 361 - 381, The pass_at
loop currently breaks on the first exception and clears collected attempts,
which prevents trying remaining attempts and can cause --save to persist the
wrong sample; change the logic in the attempt loop around run_claude so that on
exception you record the failure reason and continue to the next attempt (do not
clear runtimes/responses/metadata_list or break), and only mark the test as
failed (increment failed and append a report row with aggregated
failure_reasons) after all attempts exhaust without any successful run; also
when saving (using save_dir_actual and replay_file) write the successful sample
(e.g., the last appended successful response from responses, responses[-1])
instead of responses[0]; refer to the loop using pass_at, run_claude, responses,
runtimes, metadata_list, failed, report_rows, save_dir_actual, and replay_file
to locate changes.

Comment on lines +48 to +50
if [[ ! -f "$skill_md" ]]; then
echo "SKIP: ${skill_name} (no SKILL.md)"
return 0
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Treat a missing SKILL.md as a failure, not a skip.

DEV_SKILLS hard-codes these entries as required checks. If one of those files is renamed or deleted, this validator currently exits 0 and misses the regression.

🐛 Suggested diff
   if [[ ! -f "$skill_md" ]]; then
-    echo "SKIP: ${skill_name} (no SKILL.md)"
+    echo "ERROR: ${skill_name}/SKILL.md is missing"
+    ERRORS=$((ERRORS + 1))
     return 0
   fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if [[ ! -f "$skill_md" ]]; then
echo "SKIP: ${skill_name} (no SKILL.md)"
return 0
if [[ ! -f "$skill_md" ]]; then
echo "ERROR: ${skill_name}/SKILL.md is missing"
ERRORS=$((ERRORS + 1))
return 0
fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/validate_developer_skills.sh` around lines 48 - 50, The check
treating a missing SKILL.md as a skip should instead fail the validation: in the
block that currently prints "SKIP: ${skill_name} (no SKILL.md)" and returns 0,
change it to print a failing message (e.g., "FAIL: ${skill_name} (missing
SKILL.md)") and return a non-zero status so CI fails; replace the existing
return 0 with return 1 (or exit 1 if the script is intended to terminate the
process) so DEV_SKILLS entries are enforced when skill_md is missing.

Comment on lines +56 to +63
if [[ "$skill_name" == "cuopt-developer" ]]; then
for section in "${CUOPT_DEV_SECTIONS[@]}"; do
if ! echo "$content" | grep -q "$section"; then
echo "ERROR: ${skill_name}/SKILL.md missing section or heading: ${section}"
ERRORS=$((ERRORS + 1))
failed=1
fi
done
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Check for actual Markdown headings here.

grep -q "$section" only proves the text exists somewhere in the file. A sentence or code block containing Build & Test will pass even if the heading is gone.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/validate_developer_skills.sh` around lines 56 - 63, The current
check uses grep -q "$section" which only finds the text anywhere; change it to
verify an actual Markdown heading by matching lines that start with one or more
'#' followed by optional space and the section text. Inside the cuopt-developer
branch where you iterate CUOPT_DEV_SECTIONS, replace the existence check on
content with a heading-aware regex (e.g., use grep -E -q to match
^\s*#+\s*${section}(\s|$)) so the script only counts true Markdown headings;
keep the same error message and incrementing of ERRORS/failed when the heading
match fails.

@rgsl888prabhu rgsl888prabhu changed the base branch from main to release/26.04 March 19, 2026 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants