Add eval runner script by mattpodwysocki · Pull Request #64 · mapbox/mapbox-agent-skills

mattpodwysocki · 2026-03-31T17:57:30Z

Summary

Adds scripts/run-evals.js — runs evals for any skill against Claude with and without the SKILL.md as system context, grades each expectation using Claude as a judge, and reports pass rates and delta
Adds npm run eval <skill-name> script
Adds @anthropic-ai/sdk as a dev dependency
Updates CONTRIBUTING.md to replace the skill-creator eval command (not yet implemented in any published version) with the actual npm run eval command, and clarifies the difference between knowledge evals and tool-execution evals
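
The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the contents of scripts/run-evals.js: `runEval`, `callModel`, `judge`, and the `prompt`/`expectations` field names are all hypothetical, and the model and judge are replaced by offline stubs so the sketch runs without an API key.

```javascript
// Sketch of a with/without-skill comparison. `callModel` and `judge` are
// hypothetical injected functions, not the real script's API.
async function runEval(evalCase, skillMd, callModel, judge) {
  const results = {};
  const variants = [
    ["without", undefined], // baseline: no system context
    ["with", skillMd], // SKILL.md supplied as system context
  ];
  for (const [label, system] of variants) {
    // Same prompt both times; only the system context differs.
    const answer = await callModel({ system, prompt: evalCase.prompt });
    let passed = 0;
    for (const expectation of evalCase.expectations) {
      // The judge decides whether the answer satisfies one expectation.
      if (await judge({ answer, expectation })) passed += 1;
    }
    results[label] = { passed, total: evalCase.expectations.length };
  }
  return results;
}

// Stubs standing in for real model calls so the sketch runs offline.
const stubModel = async ({ system, prompt }) =>
  system ? `${prompt} -> use the search tool` : `${prompt} -> not sure`;
const stubJudge = async ({ answer, expectation }) =>
  answer.includes(expectation);

const demoCase = {
  prompt: "What restaurants are near -87.6298, 41.8781?",
  expectations: ["search tool", "->"],
};

runEval(demoCase, "SKILL.md contents", stubModel, stubJudge).then((r) =>
  console.log(r)
);
```

Injecting the model and judge as functions keeps the comparison logic testable on its own. In a real runner both would presumably wrap calls made through @anthropic-ai/sdk.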

Usage

export ANTHROPIC_API_KEY=your-key-here
npm run eval mapbox-location-grounding
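
For illustration, a script invoked this way might read its inputs along these lines. This is a sketch under assumptions: `parseInvocation` is a hypothetical helper, the error messages are invented, and it assumes npm forwards the positional argument to the script unchanged.

```javascript
// Hypothetical argument handling for a script invoked as
// `npm run eval <skill-name>`. Not taken from scripts/run-evals.js.
function parseInvocation(argv, env) {
  const skillName = argv[2]; // argv[0] is node, argv[1] is the script path
  if (!skillName) {
    throw new Error("Usage: npm run eval <skill-name>");
  }
  if (!env.ANTHROPIC_API_KEY) {
    throw new Error("ANTHROPIC_API_KEY is not set");
  }
  return { skillName };
}

console.log(
  parseInvocation(
    ["node", "scripts/run-evals.js", "mapbox-location-grounding"],
    { ANTHROPIC_API_KEY: "your-key-here" }
  )
);
```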

Output

Running evals for: mapbox-location-grounding
Model: claude-sonnet-4-6
Evals: 8

Eval 1: What restaurants are near -87.6298, 41.8781?
  Without skill: 17%  |  With skill: 50%  |  Delta: +33pp
  ...

Overall Results:
  Without skill (baseline): 21.6%
  With skill:               59.5%
  Delta:                    +37.8pp

  ✅ Strong skill (+20pp target met)
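
The overall figures could be derived along the following lines. This is a sketch, assuming pass rates are pooled over all graded expectations and the delta is computed before rounding; `passRate` and `summarize` are hypothetical names, and the counts in the example are not stated in the PR but were chosen only because they reproduce the rounded numbers shown.

```javascript
// Pass rate as a percentage, given counts of passed and total expectations.
function passRate(passed, total) {
  return total === 0 ? 0 : (passed / total) * 100;
}

// Summarize baseline vs. with-skill results. The 20pp default mirrors the
// "+20pp target" in the sample output; everything else is an assumption.
function summarize(baseline, withSkill, targetPp = 20) {
  const base = passRate(baseline.passed, baseline.total);
  const skill = passRate(withSkill.passed, withSkill.total);
  const delta = skill - base; // computed before rounding
  const sign = delta >= 0 ? "+" : "";
  return {
    baseline: `${base.toFixed(1)}%`,
    withSkill: `${skill.toFixed(1)}%`,
    delta: `${sign}${delta.toFixed(1)}pp`,
    targetMet: delta >= targetPp,
  };
}

// Hypothetical counts, chosen only because they reproduce the rounded figures.
console.log(summarize({ passed: 8, total: 37 }, { passed: 22, total: 37 }));
```

Rounding the delta separately from the two rates is one possible reason the reported +37.8pp differs slightly from 59.5 minus 21.6.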

🤖 Generated with Claude Code

Adds scripts/run-evals.js, which runs evals for any skill with and without the SKILL.md as system context, grades each expectation via Claude, and reports pass rates and delta.

Usage: ANTHROPIC_API_KEY=... npm run eval <skill-name>

Also updates CONTRIBUTING.md to:
- Replace the skill-creator eval command (not yet implemented) with the actual npm run eval command
- Clarify the difference between knowledge evals and tool-execution evals, and how to interpret results from each

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

mattpodwysocki and others added 2 commits March 31, 2026 13:57

Fix Prettier formatting in run-evals.js

9305dbb

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

mattpodwysocki requested a review from a team as a code owner March 31, 2026 17:57

Merge main into add-eval-runner

d68199f

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

zmofei approved these changes Apr 1, 2026

mattpodwysocki merged commit 11d9305 into main Apr 1, 2026
1 check passed

mattpodwysocki deleted the add-eval-runner branch April 1, 2026 17:10

mattpodwysocki merged 3 commits into main from add-eval-runner

2 participants
