SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems

Scenario Description

Formal models are essential for specifying large, complex computer systems and verifying their correctness, but they are notoriously expensive to write and maintain. SysMoBench evaluates whether LLM-powered agents can translate realistic system artifacts into executable TLA+ specifications that pass rigorous syntactic, semantic, and invariant checks.

[Figure: SysMoBench overview]

Task Details

  • Input
    • System source code plus task-specific prompts and invariants from sysmobench_core/tla_eval/tasks/<task_name>/
    • Agent parameters (method, model name, iteration budget)
    • TLA+ evaluation configuration and trace data
  • Output
    • Generated TLA+ specification (GenerationResult.tla_specification; see the sketch after this list)
    • Iterative correction summaries with compilation/runtime verdicts
  • Evaluation
    • compilation_check / action_decomposition (syntax correctness via SANY)
    • runtime_coverage (TLC simulation coverage)
    • trace_validation / pgo_trace_validation (conformance to real system traces)
    • invariant_verification (system-specific safety/liveness properties)
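
Only the tla_specification field of the result object is named in this README; as a rough sketch of the shape the list above implies, with every other field an assumption:

from dataclasses import dataclass, field

@dataclass
class GenerationResult:
    """Sketch of the result record implied above. Only tla_specification
    is named in this README; every other field is an assumption."""
    tla_specification: str                   # generated TLA+ module text
    iterations: list[dict] = field(default_factory=list)  # per-round correction summaries (assumed)
    compiled: bool = False                   # SANY verdict (assumed)

result = GenerationResult(tla_specification="---- MODULE Spin ----\n====")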

Key Features

  • Automated four-phase evaluation pipeline covering syntax, runtime, trace conformance, and invariants
  • Nine real-world concurrent and distributed systems (Etcd Raft, Redis Raft, Asterinas primitives, Xline CURP, PGo suites)
  • Extensible configuration for adding new systems via prompts, traces, and invariant templates

Benchmark Setup

Prerequisites

  • Python 3.9+
  • Java 11+ (SANY/TLC binaries are downloaded during installation)
  • Anthropic evaluator key (required for trace/invariant workflows)
  • LLM credentials for the model under test (OpenAI, Azure OpenAI, Anthropic, etc.)

Configuration

Edit benchmarks/sysmobench/env.toml:

[evaluator_api_keys]
ANTHROPIC_API_KEY = "your_evaluator_key"

[llm]
OPENAI_API_KEY = "your_openai_key"
AZURE_API_KEY = "your_azure_key"
AZURE_API_BASE = "https://your-azure-endpoint.openai.azure.com"
AZURE_API_VERSION = "2024-05-01-preview"
ANTHROPIC_API_KEY = "your_model_key"

[hardware]
use_gpu = false

[env-docker]
image = "default"
entrypoint = "./run.sh"
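
For orientation, a minimal sketch of reading this file from Python (the framework's actual loader lives in tla_eval/config.py; the section and key names below are exactly those shown above):

import tomllib  # stdlib on Python 3.11+; on 3.9/3.10, "import tomli as tomllib"

with open("benchmarks/sysmobench/env.toml", "rb") as f:
    cfg = tomllib.load(f)

evaluator_key = cfg["evaluator_api_keys"]["ANTHROPIC_API_KEY"]  # evaluator credentials
model_key = cfg["llm"]["OPENAI_API_KEY"]                        # model-under-test credentials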

Install Dependencies

cd benchmarks/sysmobench
./install.sh

This script installs OpenJDK when necessary, creates .venv/, installs benchmarks/sysmobench/requirements.txt, registers sysmobench_core in editable mode (exposing the sysmobench and sysmobench-setup commands), and downloads the TLA+ toolchain.

Run the Benchmark

./run.sh <model_name>

The wrapper executes src/main.py with the default task list from data/benchmark/tasks.jsonl, iterating up to three correction rounds per task. Results are stored at:

outputs/sysmobench__<model>__agent_based__<timestamp>/
├── result.jsonl   # Line-delimited iteration summaries per task
└── avg_score.json # Final averaged score across tasks
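
To inspect a run programmatically, a minimal sketch (only the two file names and their formats come from the layout above; the fields inside each record vary by task):

import json
from pathlib import Path

# Pick the most recent run directory, matching the naming pattern above.
run_dir = sorted(Path("outputs").glob("sysmobench__*__agent_based__*"))[-1]

# result.jsonl: one JSON object per line, one iteration summary per task.
for line in (run_dir / "result.jsonl").read_text().splitlines():
    print(json.loads(line))

# avg_score.json: the final averaged score across tasks.
print(json.loads((run_dir / "avg_score.json").read_text()))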

Use the System Intelligence CLI (optional)

To orchestrate SysMoBench alongside other benchmarks:

cd cli
./run_all_local.sh <model_name>

Use the upstream SysMoBench CLI (optional)

cd benchmarks/sysmobench
source .venv/bin/activate
sysmobench --task spin --method agent_based --model <model> --metric compilation_check
sysmobench --list-tasks
deactivate

For exhaustive CLI flag descriptions, see the Usage Guide.
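
The same CLI can also be scripted, for example to sweep all four metrics over one task (a sketch reusing only the flags shown above; run it with the virtual environment activated):

import subprocess

model = "my-model"  # substitute the model under test, as with --model above

# The four metric names listed under "Task Details".
for metric in ("compilation_check", "runtime_coverage",
               "trace_validation", "invariant_verification"):
    subprocess.run(
        ["sysmobench", "--task", "spin", "--method", "agent_based",
         "--model", model, "--metric", metric],
        check=True,
    )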

Benchmark Tasks

SysMoBench includes nine diverse real-world system artifacts from concurrent and distributed systems:

| System | Type | Description | Source Lang. | Source LoC | TLA+ LoC |
| --- | --- | --- | --- | --- | --- |
| Asterinas Spinlock | Concurrent | Synchronization | Rust | 213 | 151 |
| Asterinas Mutex | Concurrent | Synchronization | Rust | 186 | 219 |
| Asterinas Rwmutex | Concurrent | Synchronization | Rust | 395 | 250 |
| Etcd Raft | Distributed | Consensus (Raft) | Go | 2,159 | 385 |
| Redis Raft | Distributed | Consensus (Raft) | C | 2,394 | 349 |
| Xline CURP | Distributed | Replication (CURP) | Rust | 4,064 | 100 |
| PGo dqueue | Distributed | Distributed Queue | Go | 175 | 75 |
| PGo locksvc | Distributed | Lock Server | Go | 281 | 93 |
| PGo raftkvs | Distributed | Consensus (Raft) | Go | 3,163 | 508 |

To list all available tasks:

sysmobench --list-tasks

Evaluation Metrics

SysMoBench evaluates AI-generated TLA+ models in four automated phases, each with its own metrics: syntax correctness, runtime correctness, conformance to the system implementation, and invariant correctness.
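
The phases gate one another: a spec that SANY rejects cannot be simulated, traced, or checked for invariants. The evaluators in tla_eval/evaluation/ drive all of this automatically, but as a sketch, the first two phases can be exercised by hand with the bundled toolchain (MySpec.tla is a placeholder):

import subprocess

JAR = "lib/tla2tools.jar"  # TLA+ toolchain location, per the project tree below

def sany_ok(spec: str) -> bool:
    """Phase 1: syntax check; SANY exits 0 iff the spec parses."""
    return subprocess.run(["java", "-cp", JAR, "tla2sany.SANY", spec],
                          capture_output=True).returncode == 0

# Phase 2 (TLC simulation) only makes sense for a spec that parses.
if sany_ok("MySpec.tla"):
    subprocess.run(["java", "-cp", JAR, "tlc2.TLC", "-simulate", "MySpec.tla"])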

[Figure: Evaluation workflow]

Adding New Systems

SysMoBench is designed to be extensible. To add a new system artifact:

  1. Prepare system artifact: Collect repository links, branch names, and any relevant materials
  2. Create task definition: Specify the modeling requirements, task configuration, and related files in task.yaml, and define invariant templates for the correctness properties (a minimal skeleton is sketched after this list)
  3. Instrument for trace collection: Add logging statements to system code to collect execution traces for conformance validation
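
As an illustration of step 2, a new task's skeleton could be laid down as follows. Every field name in the dictionary is hypothetical (consult the guide below for the real schema); only the tasks/<task_name>/ path convention comes from this README:

import yaml  # PyYAML
from pathlib import Path

# Hypothetical task description; the real task.yaml schema is documented
# in the Adding New Systems Guide.
task = {
    "name": "myqueue",
    "repository": "https://github.com/example/myqueue",  # placeholder repo
    "branch": "main",
    "language": "Go",
}

task_dir = Path("sysmobench_core/tla_eval/tasks/myqueue")
task_dir.mkdir(parents=True, exist_ok=True)
(task_dir / "task.yaml").write_text(yaml.safe_dump(task, sort_keys=False))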

For detailed instructions, see Adding New Systems Guide.

Project Structure

LLM_Gen_TLA_benchmark_framework/
├── scripts/
│   └── run_benchmark.py          # Main entry script for running benchmarks
├── tla_eval/
│   ├── tasks/                    # Task definitions for each system artifact
│   │   ├── spin/                 # Spinlock task with prompts, configs
│   │   │   ├── prompts/          # System-specific prompts
│   │   │   └── task.yaml         # Task configuration (system info)
│   │   ├── mutex/
│   │   └── ...
│   ├── models/                   # LLM model interfaces and wrappers
│   ├── evaluation/               # Evaluator implementations organized by metric type
│   └── config.py                 # Configuration management (API keys, LLM model configs)
├── data/
│   ├── invariant_templates/      # Expert-written invariant templates for each system
│   └── traces/                   # System execution traces for conformance evaluation
└── lib/                          # TLA+ toolchain (tla2tools.jar for SANY and TLC)