
Commit 511bfa3

Refactor arteval bench: ae_agent integration and code cleanup
- main.py: Extract `_is_ae_agent(agent)` helper and use it for report/summary writing; use `json.dumps(..., ensure_ascii=False)` for result.jsonl
- run_eval_in_env.py: Remove unused `Path` import in the interactive foreground path; reuse `_get_container_id_from_runtime` for the long-running agent block instead of duplicating container ID resolution
- README: Update usage and JSONL/CLI options
1 parent cefd7c1 commit 511bfa3

9 files changed

Lines changed: 718 additions & 213 deletions
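The commit message above mentions the extracted `_is_ae_agent(agent)` helper and the `ensure_ascii=False` change. A minimal sketch of what these plausibly look like; the helper body, the record fields, and the non-ASCII string are illustrative assumptions, not code from the diff:

```python
import json

def _is_ae_agent(agent: str) -> bool:
    # Assumed body: the benchmark accepts both spellings of the agent name.
    return agent in ('ae_agent', 'ae-agent')

# ensure_ascii=False keeps non-ASCII text in result.jsonl readable
# instead of escaping it to \uXXXX sequences.
record = {'task_id': 'demo', 'summary': '评估完成'}
with open('result.jsonl', 'a', encoding='utf-8') as f:
    f.write(json.dumps(record, ensure_ascii=False) + '\n')
```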


benchmarks/arteval_bench/README.md

Lines changed: 62 additions & 0 deletions
@@ -182,5 +182,67 @@ The benchmark supports multiple AI agents:
 - **Claude Code**: Anthropic's code assistant
 - **Mini SWE Agent**: The compact version of [SWE-agent](https://github.com/SWE-agent) assistant
 - **OpenHands**: Open-source coding agent
+- **ae_agent**: Claude Agent SDK–based agent (same logic as the standalone [artifact-agent](https://github.com/sys-intelligence/artifact-agent) repo), with full support for host/Docker, interactive mode, Skill, Sub-agent, per-task timeout, GPU, and optional container sync/commit/stop.
 
 To add your own agent to the benchmark, see [add_agents.md](add_agents.md).
+
+#### » ae_agent usage and options
+
+When using the **ae_agent** (`-a ae_agent` or `-a ae-agent`), you can pass the following options on the command line and/or set them per task in the JSONL.
+
+**Command-line arguments**
+
+| Argument | Description |
+|----------|-------------|
+| `-i`, `--input_file` | Input JSONL file with tasks (default: `./data/benchmark/arteval_tasks.jsonl`). |
+| `-o`, `--save_path` | Directory for results (default: `./outputs/ae_<model>_ae-agent_<timestamp>`). |
+| `-a`, `--agent` | Agent name; use `ae_agent` or `ae-agent` for this agent. |
+| `-m`, `--model_name` | Model name (e.g. `claude-sonnet-4-5-20250929`). |
+| `--interactive` | After the task completes, keep a session open so you can give further instructions (requires a TTY). In Docker mode the runner is executed in the foreground via `docker exec -it`. |
+| `--enable-skill` | Enable Claude Agent SDK Skill (loaded from `~/.claude/skills/` and `.claude/skills/`). |
+| `--enable-subagent` | Enable Claude Agent SDK Sub-agent (Task tool). |
+
+**JSONL task fields (per line)**
+
+| Field | Description |
+|-------|-------------|
+| `artifact_id` | Unique task identifier. |
+| `artifact_dir` | Artifact directory name (relative to the JSONL file's directory). |
+| `artifact_readme` | Path to the README or task description file (relative to the artifact root). |
+| `artifact_url` | Optional. Git clone URL; used when `artifact_dir` is missing or the path does not exist. |
+| `env` | `"local"` for host; a Docker image name (e.g. `bastoica/ae-agent-ubuntu24.04:latest`) for Docker. |
+| `evaluator` | Command to run after the agent (e.g. `python _agent_eval/main.py`). |
+| `expected_score` | Expected score for this artifact (default 4). |
+| `timeout` | Optional. Per-task timeout in seconds or milliseconds (see utils: values < 86400 are treated as seconds, otherwise as milliseconds). |
+| `gpu` | Optional. When `true`, pass `--gpus all` to Docker (Docker mode only). |
+| `interactive` | Optional. When `true`, enable interactive mode for this task (overrides the CLI default). |
+| `enable_skill` | Optional. When `true`, enable Skill for this task. |
+| `enable_subagent` | Optional. When `true`, enable Sub-agent for this task. |
+| `keep_container` | Optional. When `false` (the default for ae_agent), the workspace is synced from the container to the host after the run, the container is committed as an image, and the container is stopped. When `true`, the container is left running for inspection. |
+
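Taken together, a single task line combining these fields might look like the following. This is a hypothetical example: the `artifact_id`, paths, and values are placeholders, and the image name is the one from the table above.

```json
{"artifact_id": "demo-001", "artifact_dir": "demo-001", "artifact_readme": "README.md", "env": "bastoica/ae-agent-ubuntu24.04:latest", "evaluator": "python _agent_eval/main.py", "expected_score": 4, "timeout": 3600, "gpu": false, "interactive": false, "enable_skill": true, "enable_subagent": false, "keep_container": false}
```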
+**Examples**
+
+```sh
+# Host mode, default options
+python src/main.py -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -o ./outputs/run1
+
+# With interactive mode (TTY required for Docker)
+python src/main.py --interactive -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -o ./outputs/run2
+
+# Enable Skill and Sub-agent
+python src/main.py --enable-skill --enable-subagent -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -o ./outputs/run3
+```
+
+**Outputs (when using ae_agent)**
+
+Results are written under the given `save_path`:
+
+- `result.jsonl` — one JSON object per task (`task_id`, `status`, `score`, `agent_run_results`, etc.).
+- `avg_score.json` — benchmark summary (`final_score`, `total_tasks`).
+- `ae_report_<artifact_id>.md` — per-task report (status, project path, log file, agent summary, and optional Docker image instructions).
+- `summary.json` — total and successful task counts and the success rate (same format as the standalone artifact-agent).
+- When running via the benchmark entry point, log paths and the agent summary are filled from available data; the standalone `python -m ae_agent.main` also produces `ae_log_<artifact_id>.log`.
+
+**Docker + interactive**
+
+For Docker tasks with `interactive: true` (or `--interactive`), the benchmark runs the agent in the foreground via `docker exec -it` so you can interact in the same terminal. This requires a real TTY (e.g. running `python src/main.py ...` in a terminal, not under CI or with redirected stdin). If stdin is not a TTY, the run falls back to the non-interactive (background) runner and a warning is logged.
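A minimal sketch of that TTY check and fallback, assuming hypothetical names (`exec_runner`, `container_id`, `run_cmd`); the real logic lives in `run_eval_in_env.py`:

```python
import logging
import subprocess
import sys

def exec_runner(container_id: str, run_cmd: list[str], interactive: bool) -> None:
    """Run the in-container runner, foregrounding it only when a real TTY exists."""
    if interactive and sys.stdin.isatty():
        # Foreground, attached to this terminal, so the user can keep
        # giving the agent instructions after the task completes.
        subprocess.run(['docker', 'exec', '-it', container_id, *run_cmd], check=True)
    else:
        if interactive:
            logging.warning('stdin is not a TTY; falling back to the non-interactive runner')
        # Background/non-interactive path: no TTY allocation.
        subprocess.run(['docker', 'exec', container_id, *run_cmd], check=True)
```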

benchmarks/arteval_bench/env.toml

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 AZURE_API_KEY = "XXX"
 AZURE_API_BASE = "XXXX"
 AZURE_API_VERSION = "XXX"
-ANTHROPIC_API_KEY = "sk-XXXX"
+ANTHROPIC_API_KEY = "YOUR_ANTHROPIC_API_KEY"
 
 [hardware]
 use_gpu = false

benchmarks/arteval_bench/src/agents/ae_agent/README.md

Lines changed: 4 additions & 4 deletions
@@ -30,16 +30,16 @@ The benchmark will:
 4. **Evaluation script flow** (same as claude_sdk): after the agent finishes, run the JSONL `evaluator` (test_method), e.g. `cd /repo && python _agent_eval/main.py`, parse output for `score` and write to result.
 5. If set, pass through `ANTHROPIC_API_KEY`, `ANTHROPIC_FOUNDRY_API_KEY`, `ANTHROPIC_FOUNDRY_BASE_URL`, `CLAUDE_CODE_USE_FOUNDRY`.
 
-**Evaluation flow on host**: When `run_on_host=True` and the agent is ae_agent, `run_eval_in_env.run_eval_on_host` calls this packages `run_agent_then_eval()`: run the agent first, then run `test_method` on the host (e.g. `cd project_path && python _agent_eval/main.py`), parse score with `utils.parse_eval_score()`, and return a result with the same shape as the Docker path (`score`, `test_method`, `status`).
+**Evaluation flow on host**: When `run_on_host=True` and the agent is ae_agent, `run_eval_in_env.run_eval_on_host` calls this package's `run_agent_then_eval()`: run the agent first, then run `test_method` on the host (e.g. `cd project_path && python _agent_eval/main.py`), parse the score with `utils.parse_eval_score()`, and return a result with the same shape as the Docker path (`score`, `test_method`, `status`).
 
 ## Dependencies
 
 - Python 3; `claude-agent-sdk` is installed in the container via `install.sh`.
-- When running in Docker via the benchmarks `run_eval_in_env.py`, install `swerex` on the host (the benchmark includes it). When using this directorys `main.py` for Docker mode standalone, you also need `swe-rex`.
+- When running in Docker via the benchmark's `run_eval_in_env.py`, install `swerex` on the host (the benchmark includes it). When using this directory's `main.py` standalone in Docker mode, you also need `swe-rex`.
 
 ## Running on host (local)
 
-You can run tasks on the **host** from this directory (without the benchmarks Docker flow):
+You can run tasks on the **host** from this directory (without the benchmark's Docker flow):
 
 1. **Single or batch via main.py**
    Use a JSONL where each line can set `"env": "local"` or `"run_on_host": true` to run that task on the host; others run in Docker (requires swerex).
@@ -59,4 +59,4 @@ You can run tasks on the **host** from this directory (without the benchmark's Docker flow):
 
 ## Relation to the standalone ae-agent repo
 
-The standalone ae-agent repo provides the same host/Docker CLI. This sub-agent includes both the **in-container** runner (used by the benchmarks `run_eval_in_env.py`) and **host/local** mode via `main.py` and `run_eval.py`.
+The standalone ae-agent repo provides the same host/Docker CLI. This sub-agent includes both the **in-container** runner (used by the benchmark's `run_eval_in_env.py`) and **host/local** mode via `main.py` and `run_eval.py`.
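For reference, a rough sketch of the host-side agent-then-eval flow described above. The agent runner is passed in as a callable, and the score-line format is an assumption (the real parsing is `utils.parse_eval_score()` and the real entry point is `run_agent_then_eval()` in this package):

```python
import re
import subprocess
from typing import Callable

def run_agent_then_eval(project_path: str, test_method: str,
                        run_agent: Callable[[str], None]) -> dict:
    """Sketch of the host path: run the agent first, then the JSONL evaluator."""
    run_agent(project_path)  # e.g. the ae_agent runner (assumed callable)
    # Run the evaluator command on the host, e.g. "python _agent_eval/main.py".
    proc = subprocess.run(test_method, shell=True, cwd=project_path,
                          capture_output=True, text=True)
    # Assumed evaluator output format: a line such as "score: 3".
    m = re.search(r'score\s*[:=]\s*(\d+)', proc.stdout, re.IGNORECASE)
    # Same result shape as the Docker path.
    return {
        'score': int(m.group(1)) if m else 0,
        'test_method': test_method,
        'status': 'success' if proc.returncode == 0 else 'failed',
    }
```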

benchmarks/arteval_bench/src/agents/ae_agent/main.py

Lines changed: 28 additions & 2 deletions
@@ -25,6 +25,8 @@
     Tee,
     compute_and_write_summary,
     docker_image_from_item,
+    enable_skill_from_item,
+    enable_subagent_from_item,
     env_from_item,
     get_task,
     gpu_from_item,
@@ -89,12 +91,16 @@ def _run_single_task(
     save_path: str,
     input_file: str,
     interactive_default: bool,
+    enable_skill_default: bool = False,
+    enable_subagent_default: bool = False,
 ) -> None:
     """Process a single JSONL task: parse, run, write results and report."""
     env = env_from_item(item)
     docker_image = docker_image_from_item(item, env=env)
     use_gpu = gpu_from_item(item)
     interactive = interactive_from_item(item) or interactive_default
+    enable_skill = enable_skill_from_item(item, enable_skill_default)
+    enable_subagent = enable_subagent_from_item(item, enable_subagent_default)
     task_file = item.get('artifact_readme', None)
     task_id = item.get('artifact_id', None)
     timeout_ms = timeout_ms_from_item(item)
@@ -115,7 +121,7 @@
         f.write(task)
 
     timeout_str = str(timeout_ms) if timeout_ms is not None else 'default'
-    print(f'Task {task_id}: env={env}, timeout_ms={timeout_str}, gpu={use_gpu}, interactive={interactive}')
+    print(f'Task {task_id}: env={env}, timeout_ms={timeout_str}, gpu={use_gpu}, interactive={interactive}, enable_skill={enable_skill}, enable_subagent={enable_subagent}')
 
     log_path = os.path.join(save_path, f'ae_log_{safe_id}.log')
     with open(log_path, 'w', encoding='utf-8') as lf:
@@ -143,6 +149,8 @@
                 timeout_ms=timeout_ms,
                 use_gpu=use_gpu,
                 interactive=interactive,
+                enable_skill=enable_skill,
+                enable_subagent=enable_subagent,
             )
     except Exception as e:
         sys.stdout, sys.stderr = old_stdout, old_stderr
@@ -160,7 +168,7 @@
     print(f'Task {task_id} completed. Status: {result.get("status", "unknown")}')
 
 
-def main(input_file, model, agent, save_path, interactive_default: bool = False):
+def main(input_file, model, agent, save_path, interactive_default: bool = False, enable_skill_default: bool = False, enable_subagent_default: bool = False):
     """Main function for running tasks."""
     if not os.path.isfile(input_file):
         logging.error('Input file not found: %s', input_file)
@@ -186,6 +194,8 @@ def main(input_file, model, agent, save_path, interactive_default: bool = False)
             save_path=save_path,
             input_file=input_file,
             interactive_default=interactive_default,
+            enable_skill_default=enable_skill_default,
+            enable_subagent_default=enable_subagent_default,
         )
 
     total_count, success_count = compute_and_write_summary(save_path)
@@ -201,6 +211,8 @@ class _ResolvedConfig:
     agent: str
     save_path: str
     interactive_default: bool
+    enable_skill_default: bool
+    enable_subagent_default: bool
 
 
 def _parse_args() -> argparse.Namespace:
@@ -230,6 +242,16 @@
         action='store_true',
         help='Enable interactive mode (continue giving agent instructions after task completes)',
     )
+    parser.add_argument(
+        '--enable-skill',
+        action='store_true',
+        help='Enable Claude Agent SDK Skill (load from ~/.claude/skills/ and .claude/skills/)',
+    )
+    parser.add_argument(
+        '--enable-subagent',
+        action='store_true',
+        help='Enable Claude Agent SDK Sub-agent (Task tool)',
+    )
     return parser.parse_args()
 
 
@@ -258,6 +280,8 @@ def _resolve_paths(args: argparse.Namespace) -> _ResolvedConfig:
         agent=agent,
         save_path=save_path,
         interactive_default=getattr(args, 'interactive', False),
+        enable_skill_default=getattr(args, 'enable_skill', False),
+        enable_subagent_default=getattr(args, 'enable_subagent', False),
     )
 
 
@@ -274,6 +298,8 @@ def cli_main():
         config.agent,
         config.save_path,
         interactive_default=config.interactive_default,
+        enable_skill_default=config.enable_skill_default,
+        enable_subagent_default=config.enable_subagent_default,
    )

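The diff consumes several `utils` helpers whose bodies are not shown. A plausible sketch with the names from the diff but assumed bodies, including the seconds-vs-milliseconds rule the README mentions:

```python
from typing import Optional

def enable_skill_from_item(item: dict, default: bool = False) -> bool:
    # Per-task `enable_skill` overrides the CLI default (assumed behavior).
    return bool(item.get('enable_skill', default))

def enable_subagent_from_item(item: dict, default: bool = False) -> bool:
    # Per-task `enable_subagent` overrides the CLI default (assumed behavior).
    return bool(item.get('enable_subagent', default))

def timeout_ms_from_item(item: dict) -> Optional[int]:
    # Normalize the JSONL `timeout` field to milliseconds: per the README,
    # values < 86400 are seconds, larger values are already milliseconds.
    raw = item.get('timeout')
    if raw is None:
        return None
    raw = int(raw)
    return raw * 1000 if raw < 86400 else raw
```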