Formal models are essential to specifying large, complex computer systems and verifying their correctness, but are notoriously expensive to write and maintain. SysMoBench evaluates whether LLM-powered agents can translate realistic system artifacts into executable TLA+ specifications that pass rigorous syntactic, semantic, and invariant checks.
- Input
  - System source code plus task-specific prompts and invariants from `sysmobench_core/tla_eval/tasks/<task_name>/`
  - Agent parameters (method, model name, iteration budget)
  - TLA+ evaluation configuration and trace data
- Output
  - Generated TLA+ specification (`GenerationResult.tla_specification`)
  - Iterative correction summaries with compilation/runtime verdicts
- Evaluation
  - `compilation_check` / `action_decomposition` (syntax correctness via SANY)
  - `runtime_coverage` (TLC simulation coverage)
  - `trace_validation` / `pgo_trace_validation` (conformance to real system traces)
  - `invariant_verification` (system-specific safety/liveness properties)
- Automated four-phase evaluation pipeline covering syntax, runtime, trace conformance, and invariants
- Nine real-world concurrent and distributed systems (Etcd Raft, Redis Raft, Asterinas primitives, Xline CURP, PGo suites)
- Extensible configuration for adding new systems via prompts, traces, and invariant templates
- Python 3.9+
- Java 11+ (SANY/TLC binaries are downloaded during installation)
- Anthropic evaluator key (required for trace/invariant workflows)
- LLM credentials for the model under test (OpenAI, Azure OpenAI, Anthropic, etc.)
Edit `benchmarks/sysmobench/env.toml`:

```toml
[evaluator_api_keys]
ANTHROPIC_API_KEY = "your_evaluator_key"

[llm]
OPENAI_API_KEY = "your_openai_key"
AZURE_API_KEY = "your_azure_key"
AZURE_API_BASE = "https://your-azure-endpoint.openai.azure.com"
AZURE_API_VERSION = "2024-05-01-preview"
ANTHROPIC_API_KEY = "your_model_key"

[hardware]
use_gpu = false

[env-docker]
image = "default"
entrypoint = "./run.sh"
```
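Because missing credentials only surface partway through a run, it can help to sanity-check `env.toml` before launching anything. The snippet below is an optional sketch (not part of SysMoBench): it only verifies that the evaluator key is set, using the standard-library `tomllib` on Python 3.11+ and falling back to the third-party `tomli` package on the 3.9/3.10 interpreters the benchmark also supports.

```python
from pathlib import Path

try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # `pip install tomli` on Python 3.9/3.10

# Path relative to the repository root; adjust if running from another directory.
ENV_TOML = Path("benchmarks/sysmobench/env.toml")

with ENV_TOML.open("rb") as f:
    config = tomllib.load(f)

# The evaluator key is required for the trace/invariant workflows.
evaluator_key = config.get("evaluator_api_keys", {}).get("ANTHROPIC_API_KEY", "")
if not evaluator_key or evaluator_key.startswith("your_"):
    raise SystemExit("Set ANTHROPIC_API_KEY under [evaluator_api_keys] in env.toml")

# List which model-under-test credentials are present in the [llm] section.
llm_keys = ", ".join(sorted(config.get("llm", {}))) or "none"
print("Evaluator key found; [llm] entries:", llm_keys)
```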
```bash
cd benchmarks/sysmobench
./install.sh
```

This script installs OpenJDK when necessary, creates `.venv/`, installs `benchmarks/sysmobench/requirements.txt`, registers `sysmobench_core` in editable mode (exposing the `sysmobench` / `sysmobench-setup` commands), and downloads the TLA+ toolchain.
```bash
./run.sh <model_name>
```

The wrapper executes `src/main.py` with the default task list from `data/benchmark/tasks.jsonl`, iterating up to three correction rounds per task. Results are stored at:

```
outputs/sysmobench__<model>__agent_based__<timestamp>/
├── result.jsonl     # Line-delimited iteration summaries per task
└── avg_score.json   # Final averaged score across tasks
```
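Once a run finishes, the aggregate score and per-task summaries can also be inspected programmatically. The sketch below is illustrative only: it assumes each line of `result.jsonl` is a JSON object (the exact field names are not documented here, so it simply prints the keys of each record), and it takes the output directory as a command-line argument.

```python
import json
import sys
from pathlib import Path

def summarize(output_dir: str) -> None:
    """Print the averaged score and the schema of each per-task record."""
    out = Path(output_dir)

    # avg_score.json holds the final averaged score across tasks.
    with open(out / "avg_score.json") as f:
        print("Average score:", json.load(f))

    # result.jsonl holds one JSON record per line; field names may vary,
    # so just surface the keys for inspection.
    with open(out / "result.jsonl") as f:
        for i, line in enumerate(f, start=1):
            record = json.loads(line)
            print(f"record {i}: keys={sorted(record.keys())}")

if __name__ == "__main__":
    summarize(sys.argv[1])
```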
To orchestrate SysMoBench alongside other benchmarks:
```bash
cd cli
./run_all_local.sh <model_name>
```

To run individual tasks directly with the `sysmobench` CLI:

```bash
cd benchmarks/sysmobench
source .venv/bin/activate
sysmobench --task spin --method agent_based --model <model> --metric compilation_check
sysmobench --list-tasks
deactivate
```

For exhaustive CLI flag descriptions, see the Usage Guide.
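If you want to script the CLI rather than invoke it by hand, a thin wrapper like the sketch below can loop over the metrics listed in the Evaluation section for a single task. It only shells out to `sysmobench` with the flags shown above, and assumes the benchmark's virtual environment is already active; the task and model names are placeholders.

```python
import subprocess

# Metric names from the Evaluation section; swap in action_decomposition or
# pgo_trace_validation where appropriate for a given task.
METRICS = [
    "compilation_check",
    "runtime_coverage",
    "trace_validation",
    "invariant_verification",
]

def run_metric(task: str, model: str, metric: str) -> int:
    """Invoke the sysmobench CLI for one task/metric pair and return its exit code."""
    cmd = [
        "sysmobench",
        "--task", task,
        "--method", "agent_based",
        "--model", model,
        "--metric", metric,
    ]
    print("Running:", " ".join(cmd))
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    for metric in METRICS:
        code = run_metric("spin", "your-model-name", metric)
        print(f"{metric}: exit code {code}")
```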
SysMoBench includes 9 diverse real-world system artifacts from concurrent and distributed systems:
| System | Type | Description | Source Lang. | Source LoC | TLA+ LoC |
|---|---|---|---|---|---|
| Asterinas Spinlock | Concurrent | Synchronization | Rust | 213 | 151 |
| Asterinas Mutex | Concurrent | Synchronization | Rust | 186 | 219 |
| Asterinas Rwmutex | Concurrent | Synchronization | Rust | 395 | 250 |
| Etcd Raft | Distributed | Consensus (Raft) | Go | 2,159 | 385 |
| Redis Raft | Distributed | Consensus (Raft) | C | 2,394 | 349 |
| Xline CURP | Distributed | Replication (CURP) | Rust | 4,064 | 100 |
| PGo dqueue | Distributed | Distributed Queue | Go | 175 | 75 |
| PGo locksvc | Distributed | Lock Server | Go | 281 | 93 |
| PGo raftkvs | Distributed | Consensus (Raft) | Go | 3,163 | 508 |
To list all available tasks:
```bash
sysmobench --list-tasks
```

SysMoBench provides four automated evaluation phases, each with its own metrics: syntax correctness, runtime correctness, conformance to the system implementation, and invariant correctness.
SysMoBench is designed to be extensible. To add a new system artifact:
- Prepare system artifact: collect repository links, branch names, and any other relevant materials
- Create task definition: specify the modeling requirements, task configuration, and related files in `task.yaml`, and define invariant templates for the correctness properties
- Instrument for trace collection: add logging statements to the system code to collect execution traces for conformance validation
For detailed instructions, see Adding New Systems Guide.
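A small scaffolding script can save some typing when starting a new task directory. The sketch below only mirrors the on-disk layout shown in the project structure (a `prompts/` folder next to a `task.yaml`); the keys written into the stub `task.yaml` are placeholders for illustration, not the benchmark's actual schema, so consult the Adding New Systems Guide for the real fields.

```python
from pathlib import Path

# Task directory referenced earlier in this README; adjust to your checkout.
TASKS_ROOT = Path("sysmobench_core/tla_eval/tasks")

# Placeholder task.yaml contents -- illustrative keys only, NOT the real schema.
TASK_YAML_STUB = """\
# TODO: replace with the fields required by the Adding New Systems Guide
system_name: my_new_system
repository: https://github.com/example/my-new-system
branch: main
"""

def scaffold(task_name: str) -> Path:
    """Create <tasks>/<task_name>/ with an empty prompts/ dir and a stub task.yaml."""
    task_dir = TASKS_ROOT / task_name
    (task_dir / "prompts").mkdir(parents=True, exist_ok=True)
    yaml_path = task_dir / "task.yaml"
    if not yaml_path.exists():
        yaml_path.write_text(TASK_YAML_STUB)
    return task_dir

if __name__ == "__main__":
    print("Scaffolded", scaffold("my_new_system"))
```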
```
LLM_Gen_TLA_benchmark_framework/
├── scripts/
│   └── run_benchmark.py       # Main entry script for running benchmarks
├── tla_eval/
│   ├── tasks/                 # Task definitions for each system artifact
│   │   ├── spin/              # Spinlock task with prompts, configs
│   │   │   ├── prompts/       # System-specific prompts
│   │   │   └── task.yaml      # Task configuration (system info)
│   │   ├── mutex/
│   │   └── ...
│   ├── models/                # LLM model interfaces and wrappers
│   ├── evaluation/            # Evaluator implementations organized by metric type
│   └── config.py              # Configuration management (API keys, LLM model configs)
├── data/
│   ├── invariant_templates/   # Expert-written invariant templates for each system
│   └── traces/                # System execution traces for conformance evaluation
└── lib/                       # TLA+ toolchain (tla2tools.jar for SANY and TLC)
```

