
Commit 511bfa3

Refactor arteval bench: ae_agent integration and code cleanup
- main.py: Extract `_is_ae_agent(agent)` helper and use it for report/summary writing; use `json.dumps(..., ensure_ascii=False)` for result.jsonl
- run_eval_in_env.py: Remove unused `Path` import in the interactive foreground path; reuse `_get_container_id_from_runtime` for the long-running agent block instead of duplicating container ID resolution
- README: Update usage and JSONL/CLI options
1 parent cefd7c1 commit 511bfa3

9 files changed

Lines changed: 718 additions & 213 deletions
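The commit message above mentions the extracted `_is_ae_agent(agent)` helper and the `ensure_ascii=False` change. A minimal sketch of what these plausibly look like; the helper body, the record fields, and the non-ASCII string are illustrative assumptions, not code from the diff:

```python
import json

def _is_ae_agent(agent: str) -> bool:
    # Assumed body: the benchmark accepts both spellings of the agent name.
    return agent in ('ae_agent', 'ae-agent')

# ensure_ascii=False keeps non-ASCII text in result.jsonl readable
# instead of escaping it to \uXXXX sequences.
record = {'task_id': 'demo', 'summary': '评估完成'}
with open('result.jsonl', 'a', encoding='utf-8') as f:
    f.write(json.dumps(record, ensure_ascii=False) + '\n')
```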


benchmarks/arteval_bench/README.md

Lines changed: 62 additions & 0 deletions
@@ -182,5 +182,67 @@ The benchmark supports multiple AI agents:
 - **Claude Code**: Anthropic's code assistant
 - **Mini SWE Agent**: The compact version of [SWE-agent](https://github.com/SWE-agent) assistant
 - **OpenHands**: Open-source coding agent
+- **ae_agent**: Claude Agent SDK–based agent (same logic as the standalone [artifact-agent](https://github.com/sys-intelligence/artifact-agent) repo), with full support for host/Docker, interactive mode, Skill, Sub-agent, per-task timeout, GPU, and optional container sync/commit/stop.
 
 To add your own agent to the benchmark, see [add_agents.md](add_agents.md).
+
+#### » ae_agent usage and options
+
+When using the **ae_agent** (`-a ae_agent` or `-a ae-agent`), you can pass the following options on the command line and/or set them per task in the JSONL.
+
+**Command-line arguments**
+
+| Argument | Description |
+|----------|-------------|
+| `-i`, `--input_file` | Input JSONL file with tasks (default: `./data/benchmark/arteval_tasks.jsonl`). |
+| `-o`, `--save_path` | Directory for results (default: `./outputs/ae_<model>_ae-agent_<timestamp>`). |
+| `-a`, `--agent` | Agent name; use `ae_agent` or `ae-agent` for this agent. |
+| `-m`, `--model_name` | Model name (e.g. `claude-sonnet-4-5-20250929`). |
+| `--interactive` | After the task completes, keep a session open so you can give further instructions (requires a TTY). In Docker mode the runner is executed in the foreground via `docker exec -it`. |
+| `--enable-skill` | Enable Claude Agent SDK Skill (loaded from `~/.claude/skills/` and `.claude/skills/`). |
+| `--enable-subagent` | Enable Claude Agent SDK Sub-agent (Task tool). |
+
+**JSONL task fields (per line)**
+
+| Field | Description |
+|-------|-------------|
+| `artifact_id` | Unique task identifier. |
+| `artifact_dir` | Artifact directory name (relative to the JSONL file's directory). |
+| `artifact_readme` | Path to the README or task description file (relative to the artifact root). |
+| `artifact_url` | Optional. Git clone URL; used when `artifact_dir` is missing or the path does not exist. |
+| `env` | `"local"` for host; a Docker image name (e.g. `bastoica/ae-agent-ubuntu24.04:latest`) for Docker. |
+| `evaluator` | Command to run after the agent (e.g. `python _agent_eval/main.py`). |
+| `expected_score` | Expected score for this artifact (default 4). |
+| `timeout` | Optional. Per-task timeout in seconds or milliseconds (see utils: values < 86400 are treated as seconds, otherwise as milliseconds). |
+| `gpu` | Optional. When `true`, pass `--gpus all` to Docker (Docker mode only). |
+| `interactive` | Optional. When `true`, enable interactive mode for this task (overrides the CLI default). |
+| `enable_skill` | Optional. When `true`, enable Skill for this task. |
+| `enable_subagent` | Optional. When `true`, enable Sub-agent for this task. |
+| `keep_container` | Optional. When `false` (the default for ae_agent), the workspace is synced from the container to the host after the run, the container is committed as an image, and the container is stopped. When `true`, the container is left running for inspection. |
+
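Taken together, a single task line combining these fields might look like the following. This is a hypothetical example: the `artifact_id`, paths, and values are placeholders, and the image name is the one from the table above.

```json
{"artifact_id": "demo-001", "artifact_dir": "demo-001", "artifact_readme": "README.md", "env": "bastoica/ae-agent-ubuntu24.04:latest", "evaluator": "python _agent_eval/main.py", "expected_score": 4, "timeout": 3600, "gpu": false, "interactive": false, "enable_skill": true, "enable_subagent": false, "keep_container": false}
```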
+**Examples**
+
+```sh
+# Host mode, default options
+python src/main.py -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -o ./outputs/run1
+
+# With interactive mode (TTY required for Docker)
+python src/main.py --interactive -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -o ./outputs/run2
+
+# Enable Skill and Sub-agent
+python src/main.py --enable-skill --enable-subagent -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -o ./outputs/run3
+```
+
+**Outputs (when using ae_agent)**
+
+Results are written under the given `save_path`:
+
+- `result.jsonl` — one JSON object per task (`task_id`, `status`, `score`, `agent_run_results`, etc.).
+- `avg_score.json` — benchmark summary (`final_score`, `total_tasks`).
+- `ae_report_<artifact_id>.md` — per-task report (status, project path, log file, agent summary, and optional Docker image instructions).
+- `summary.json` — total and successful task counts and the success rate (same format as the standalone artifact-agent).
+- When running via the benchmark entry point, log paths and the agent summary are filled from available data; the standalone `python -m ae_agent.main` also produces `ae_log_<artifact_id>.log`.
+
+**Docker + interactive**
+
+For Docker tasks with `interactive: true` (or `--interactive`), the benchmark runs the agent in the foreground via `docker exec -it` so you can interact in the same terminal. This requires a real TTY (e.g. running `python src/main.py ...` in a terminal, not under CI or with redirected stdin). If stdin is not a TTY, the run falls back to the non-interactive (background) runner and a warning is logged.
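A minimal sketch of that TTY check and fallback, assuming hypothetical names (`exec_runner`, `container_id`, `run_cmd`); the real logic lives in `run_eval_in_env.py`:

```python
import logging
import subprocess
import sys

def exec_runner(container_id: str, run_cmd: list[str], interactive: bool) -> None:
    """Run the in-container runner, foregrounding it only when a real TTY exists."""
    if interactive and sys.stdin.isatty():
        # Foreground, attached to this terminal, so the user can keep
        # giving the agent instructions after the task completes.
        subprocess.run(['docker', 'exec', '-it', container_id, *run_cmd], check=True)
    else:
        if interactive:
            logging.warning('stdin is not a TTY; falling back to the non-interactive runner')
        # Background/non-interactive path: no TTY allocation.
        subprocess.run(['docker', 'exec', container_id, *run_cmd], check=True)
```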

benchmarks/arteval_bench/env.toml

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 AZURE_API_KEY = "XXX"
 AZURE_API_BASE = "XXXX"
 AZURE_API_VERSION = "XXX"
-ANTHROPIC_API_KEY = "sk-XXXX"
+ANTHROPIC_API_KEY = "YOUR_ANTHROPIC_API_KEY"
 
 [hardware]
 use_gpu = false

benchmarks/arteval_bench/src/agents/ae_agent/README.md

Lines changed: 4 additions & 4 deletions
@@ -30,16 +30,16 @@ The benchmark will:
 4. **Evaluation script flow** (same as claude_sdk): after the agent finishes, run the JSONL `evaluator` (test_method), e.g. `cd /repo && python _agent_eval/main.py`, parse output for `score` and write to result.
 5. If set, pass through `ANTHROPIC_API_KEY`, `ANTHROPIC_FOUNDRY_API_KEY`, `ANTHROPIC_FOUNDRY_BASE_URL`, `CLAUDE_CODE_USE_FOUNDRY`.
 
-**Evaluation flow on host**: When `run_on_host=True` and the agent is ae_agent, `run_eval_in_env.run_eval_on_host` calls this packages `run_agent_then_eval()`: run the agent first, then run `test_method` on the host (e.g. `cd project_path && python _agent_eval/main.py`), parse score with `utils.parse_eval_score()`, and return a result with the same shape as the Docker path (`score`, `test_method`, `status`).
+**Evaluation flow on host**: When `run_on_host=True` and the agent is ae_agent, `run_eval_in_env.run_eval_on_host` calls this package's `run_agent_then_eval()`: run the agent first, then run `test_method` on the host (e.g. `cd project_path && python _agent_eval/main.py`), parse the score with `utils.parse_eval_score()`, and return a result with the same shape as the Docker path (`score`, `test_method`, `status`).
 
 ## Dependencies
 
 - Python 3; `claude-agent-sdk` is installed in the container via `install.sh`.
-- When running in Docker via the benchmarks `run_eval_in_env.py`, install `swerex` on the host (the benchmark includes it). When using this directorys `main.py` for Docker mode standalone, you also need `swe-rex`.
+- When running in Docker via the benchmark's `run_eval_in_env.py`, install `swerex` on the host (the benchmark includes it). When using this directory's `main.py` standalone in Docker mode, you also need `swe-rex`.
 
 ## Running on host (local)
 
-You can run tasks on the **host** from this directory (without the benchmarks Docker flow):
+You can run tasks on the **host** from this directory (without the benchmark's Docker flow):
 
 1. **Single or batch via main.py**
    Use a JSONL where each line can set `"env": "local"` or `"run_on_host": true` to run that task on the host; others run in Docker (requires swerex).
@@ -59,4 +59,4 @@ You can run tasks on the **host** from this directory (without the benchmark's Docker flow):
 
 ## Relation to the standalone ae-agent repo
 
-The standalone ae-agent repo provides the same host/Docker CLI. This sub-agent includes both the **in-container** runner (used by the benchmarks `run_eval_in_env.py`) and **host/local** mode via `main.py` and `run_eval.py`.
+The standalone ae-agent repo provides the same host/Docker CLI. This sub-agent includes both the **in-container** runner (used by the benchmark's `run_eval_in_env.py`) and **host/local** mode via `main.py` and `run_eval.py`.
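For reference, a rough sketch of the host-side agent-then-eval flow described above. The agent runner is passed in as a callable, and the score-line format is an assumption (the real parsing is `utils.parse_eval_score()` and the real entry point is `run_agent_then_eval()` in this package):

```python
import re
import subprocess
from typing import Callable

def run_agent_then_eval(project_path: str, test_method: str,
                        run_agent: Callable[[str], None]) -> dict:
    """Sketch of the host path: run the agent first, then the JSONL evaluator."""
    run_agent(project_path)  # e.g. the ae_agent runner (assumed callable)
    # Run the evaluator command on the host, e.g. "python _agent_eval/main.py".
    proc = subprocess.run(test_method, shell=True, cwd=project_path,
                          capture_output=True, text=True)
    # Assumed evaluator output format: a line such as "score: 3".
    m = re.search(r'score\s*[:=]\s*(\d+)', proc.stdout, re.IGNORECASE)
    # Same result shape as the Docker path.
    return {
        'score': int(m.group(1)) if m else 0,
        'test_method': test_method,
        'status': 'success' if proc.returncode == 0 else 'failed',
    }
```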

benchmarks/arteval_bench/src/agents/ae_agent/main.py

Lines changed: 28 additions & 2 deletions
@@ -25,6 +25,8 @@
     Tee,
     compute_and_write_summary,
     docker_image_from_item,
+    enable_skill_from_item,
+    enable_subagent_from_item,
     env_from_item,
     get_task,
     gpu_from_item,
@@ -89,12 +91,16 @@ def _run_single_task(
     save_path: str,
     input_file: str,
     interactive_default: bool,
+    enable_skill_default: bool = False,
+    enable_subagent_default: bool = False,
 ) -> None:
     """Process a single JSONL task: parse, run, write results and report."""
     env = env_from_item(item)
     docker_image = docker_image_from_item(item, env=env)
     use_gpu = gpu_from_item(item)
     interactive = interactive_from_item(item) or interactive_default
+    enable_skill = enable_skill_from_item(item, enable_skill_default)
+    enable_subagent = enable_subagent_from_item(item, enable_subagent_default)
     task_file = item.get('artifact_readme', None)
     task_id = item.get('artifact_id', None)
     timeout_ms = timeout_ms_from_item(item)
@@ -115,7 +121,7 @@
         f.write(task)
 
     timeout_str = str(timeout_ms) if timeout_ms is not None else 'default'
-    print(f'Task {task_id}: env={env}, timeout_ms={timeout_str}, gpu={use_gpu}, interactive={interactive}')
+    print(f'Task {task_id}: env={env}, timeout_ms={timeout_str}, gpu={use_gpu}, interactive={interactive}, enable_skill={enable_skill}, enable_subagent={enable_subagent}')
 
     log_path = os.path.join(save_path, f'ae_log_{safe_id}.log')
     with open(log_path, 'w', encoding='utf-8') as lf:
@@ -143,6 +149,8 @@
                 timeout_ms=timeout_ms,
                 use_gpu=use_gpu,
                 interactive=interactive,
+                enable_skill=enable_skill,
+                enable_subagent=enable_subagent,
             )
     except Exception as e:
         sys.stdout, sys.stderr = old_stdout, old_stderr
@@ -160,7 +168,7 @@
     print(f'Task {task_id} completed. Status: {result.get("status", "unknown")}')
 
 
-def main(input_file, model, agent, save_path, interactive_default: bool = False):
+def main(input_file, model, agent, save_path, interactive_default: bool = False, enable_skill_default: bool = False, enable_subagent_default: bool = False):
     """Main function for running tasks."""
     if not os.path.isfile(input_file):
         logging.error('Input file not found: %s', input_file)
@@ -186,6 +194,8 @@ def main(input_file, model, agent, save_path, interactive_default: bool = False)
             save_path=save_path,
             input_file=input_file,
             interactive_default=interactive_default,
+            enable_skill_default=enable_skill_default,
+            enable_subagent_default=enable_subagent_default,
         )
 
     total_count, success_count = compute_and_write_summary(save_path)
@@ -201,6 +211,8 @@ class _ResolvedConfig:
     agent: str
     save_path: str
     interactive_default: bool
+    enable_skill_default: bool
+    enable_subagent_default: bool
 
 
 def _parse_args() -> argparse.Namespace:
@@ -230,6 +242,16 @@
         action='store_true',
         help='Enable interactive mode (continue giving agent instructions after task completes)',
     )
+    parser.add_argument(
+        '--enable-skill',
+        action='store_true',
+        help='Enable Claude Agent SDK Skill (load from ~/.claude/skills/ and .claude/skills/)',
+    )
+    parser.add_argument(
+        '--enable-subagent',
+        action='store_true',
+        help='Enable Claude Agent SDK Sub-agent (Task tool)',
+    )
     return parser.parse_args()
 
 
@@ -258,6 +280,8 @@ def _resolve_paths(args: argparse.Namespace) -> _ResolvedConfig:
         agent=agent,
         save_path=save_path,
         interactive_default=getattr(args, 'interactive', False),
+        enable_skill_default=getattr(args, 'enable_skill', False),
+        enable_subagent_default=getattr(args, 'enable_subagent', False),
     )
 
 
@@ -274,6 +298,8 @@ def cli_main():
         config.agent,
         config.save_path,
         interactive_default=config.interactive_default,
+        enable_skill_default=config.enable_skill_default,
+        enable_subagent_default=config.enable_subagent_default,
    )

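The diff consumes several `utils` helpers whose bodies are not shown. A plausible sketch with the names from the diff but assumed bodies, including the seconds-vs-milliseconds rule the README mentions:

```python
from typing import Optional

def enable_skill_from_item(item: dict, default: bool = False) -> bool:
    # Per-task `enable_skill` overrides the CLI default (assumed behavior).
    return bool(item.get('enable_skill', default))

def enable_subagent_from_item(item: dict, default: bool = False) -> bool:
    # Per-task `enable_subagent` overrides the CLI default (assumed behavior).
    return bool(item.get('enable_subagent', default))

def timeout_ms_from_item(item: dict) -> Optional[int]:
    # Normalize the JSONL `timeout` field to milliseconds: per the README,
    # values < 86400 are seconds, larger values are already milliseconds.
    raw = item.get('timeout')
    if raw is None:
        return None
    raw = int(raw)
    return raw * 1000 if raw < 86400 else raw
```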