Add eval runner script #64

Merged: mattpodwysocki merged 3 commits into main from add-eval-runner on Apr 1, 2026

@mattpodwysocki
Contributor

Summary

  • Adds scripts/run-evals.js, which runs the evals for any skill against Claude twice (once with the SKILL.md as system context and once without), grades each expectation using Claude as a judge, and reports both pass rates and the delta between them in percentage points
  • Adds npm run eval <skill-name> script
  • Adds @anthropic-ai/sdk as a dev dependency
  • Updates CONTRIBUTING.md to replace the skill-creator eval command (not yet implemented in any published version) with the actual npm run eval command, and clarifies the difference between knowledge evals and tool-execution evals
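
As a rough illustration of the flow described in the list above, the sketch below builds the baseline and with-skill requests, a judge request, and a pass rate. Every function name, the judge prompt wording, and the max_tokens values are assumptions made for illustration; this is not the code in scripts/run-evals.js. The request objects follow the parameter shape of the Anthropic Messages API (model, max_tokens, messages, and an optional system prompt).

```javascript
// Hypothetical sketch: names and prompt wording are assumptions,
// not the actual contents of scripts/run-evals.js.

// Build the request for one eval prompt. When skillMd is given it becomes the
// system prompt; when omitted, the request has no system prompt (the baseline).
function buildRequest(model, prompt, skillMd) {
  const request = {
    model,
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  };
  if (skillMd) {
    request.system = skillMd;
  }
  return request;
}

// Build the judge request asking whether a response meets one expectation.
function buildJudgeRequest(model, response, expectation) {
  const content = [
    "Does the response below satisfy the expectation?",
    "Answer with exactly PASS or FAIL.",
    "",
    `Expectation: ${expectation}`,
    "",
    `Response: ${response}`,
  ].join("\n");
  return { model, max_tokens: 5, messages: [{ role: "user", content }] };
}

// Interpret the judge's reply; anything not starting with PASS counts as a failure.
function parseVerdict(judgeText) {
  return judgeText.trim().toUpperCase().startsWith("PASS");
}

// Pass rate as a percentage of graded expectations (0 when nothing was graded).
function passRate(verdicts) {
  if (verdicts.length === 0) return 0;
  const passed = verdicts.filter(Boolean).length;
  return (passed / verdicts.length) * 100;
}
```

Each request object could be handed to client.messages.create(...) from @anthropic-ai/sdk; the network call is left out so the sketch runs offline. Treating any reply other than PASS as a failure is a conservative choice that keeps an ambiguous judge answer from inflating the pass rate.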

Usage

export ANTHROPIC_API_KEY=your-key-here
npm run eval mapbox-location-grounding

Output

Running evals for: mapbox-location-grounding
Model: claude-sonnet-4-6
Evals: 8

Eval 1: What restaurants are near -87.6298, 41.8781?
  Without skill: 17%  |  With skill: 50%  |  Delta: +33pp
  ...

Overall Results:
  Without skill (baseline): 21.6%
  With skill:               59.5%
  Delta:                    +37.8pp

  ✅ Strong skill (+20pp target met)
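
The overall figures above do not subtract exactly (59.5 - 21.6 = 37.9, while the reported delta is +37.8pp), which would be expected if the delta is computed from unrounded pass rates and rounded afterwards. The sketch below shows that arithmetic. The counts (37 expectations, 8 passing without the skill, 22 passing with it) are hypothetical values chosen because they reproduce the reported numbers; they do not come from the PR, and the function name and the way the +20pp target is checked are likewise assumptions.

```javascript
// Hypothetical sketch: counts and names are assumptions chosen to match the
// reported output; they are not taken from scripts/run-evals.js.
const TARGET_PP = 20; // the "+20pp target" mentioned in the output

function summarize(baselinePassed, skillPassed, total) {
  const baseline = (baselinePassed / total) * 100;
  const withSkill = (skillPassed / total) * 100;
  // Subtract the unrounded rates, then round only for display.
  const delta = withSkill - baseline;
  return {
    baseline: baseline.toFixed(1),
    withSkill: withSkill.toFixed(1),
    delta: (delta >= 0 ? "+" : "") + delta.toFixed(1) + "pp",
    targetMet: delta >= TARGET_PP,
  };
}

console.log(summarize(8, 22, 37));
```

With these assumed counts the result is a baseline of 21.6, a with-skill rate of 59.5, and a delta of +37.8pp, with the target met. The delta is expressed in percentage points (an absolute difference between two rates), not as a relative percentage increase.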

🤖 Generated with Claude Code

mattpodwysocki and others added 2 commits March 31, 2026 13:57
Adds scripts/run-evals.js — runs evals for any skill with and without the
SKILL.md as system context, grades each expectation via Claude, and reports
pass rates and delta.

Usage:
  ANTHROPIC_API_KEY=... npm run eval <skill-name>

Also updates CONTRIBUTING.md to:
- Replace the skill-creator eval command (not yet implemented) with the
  actual npm run eval command
- Clarify the difference between knowledge evals and tool-execution evals,
  and how to interpret results from each

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mattpodwysocki mattpodwysocki requested a review from a team as a code owner March 31, 2026 17:57
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mattpodwysocki mattpodwysocki merged commit 11d9305 into main Apr 1, 2026
1 check passed
@mattpodwysocki mattpodwysocki deleted the add-eval-runner branch April 1, 2026 17:10

2 participants