MultiAgent OS 🤖

A production-grade, containerized multi-agent orchestration system with a self-improving evaluation loop, dynamic tool orchestration, and adversarial robustness testing.

⚡ Quick Start (5 minutes)

git clone https://github.com/YOUR_USERNAME/multiagent-eval-system.git
cd multiagent-eval-system

# 1. Copy and configure environment variables
cp .env.example .env
# Open .env and set OPENAI_API_KEY=your-actual-key

# 2. Launch all services
docker compose up --build

# 3. Open the UI
open http://localhost:3000   # macOS; on Linux use xdg-open, or paste the URL into a browser
# API docs: http://localhost:8000/docs

That's it. Once .env is configured, docker compose up starts every service with no further manual steps.

🏗 Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Client (Browser)                          │
│              SSE Stream ←─────────── POST /api/v1/query          │
└──────────────────────────┬──────────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────────┐
│                     FastAPI Server (:8000)                        │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  5 Endpoints: /query, /trace, /evals/latest,             │    │
│  │               /proposals/:id/review, /evals/trigger       │    │
│  └────────────────────────┬────────────────────────────────┘    │
│                            │ Celery Task Queue                    │
└────────────────────────────┼────────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────────┐
│                    Celery Worker                                  │
│                                                                   │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                  Master Orchestrator                       │   │
│  │   LLM-driven dynamic routing (no hardcoded chain)         │   │
│  │   Logs every routing decision with justification          │   │
│  │                                                            │   │
│  │  ┌────────────┐  ┌──────────┐  ┌─────────┐  ┌────────┐  │   │
│  │  │Decomposition│  │Retrieval │  │Critique │  │Synthesis│  │   │
│  │  │  Agent     │  │  Agent   │  │  Agent  │  │  Agent  │  │   │
│  │  └────────────┘  └──────────┘  └─────────┘  └────────┘  │   │
│  │          ↑ SharedContext (Pydantic) ↑                     │   │
│  │  ┌─────────────┐              ┌──────────────────────┐   │   │
│  │  │ Compression │              │    Meta Agent         │   │   │
│  │  │   Agent     │              │  (eval loop only)     │   │   │
│  │  └─────────────┘              └──────────────────────┘   │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                   │
│  Tools: WebSearch | CodeExec | StructuredData | SelfReflection   │
└───────────────────────┬─────────────────────────────────────────┘
                        │
        ┌───────────────┴──────────────┐
        │                              │
┌───────▼──────┐             ┌─────────▼────────┐
│  PostgreSQL  │             │      Redis        │
│  (Postgres)  │             │  (Broker + PubSub)│
└──────────────┘             └──────────────────┘

🤖 Agents

| Agent | Role | Decision Boundary |
| --- | --- | --- |
| Master Orchestrator | Dynamic routing brain | Uses LLM to evaluate `SharedContext` state and select the next agent. Never follows a fixed chain. Max 12 turns. |
| Decomposition Agent | Query analysis | Breaks queries into typed sub-tasks (`factual`, `analytical`, `creative`, `retrieval`) with explicit JSON dependency graphs. Dependent tasks are gated by the orchestrator. |
| Retrieval Agent | Multi-hop RAG | Retrieves ≥2 chunks and requires cross-chunk reasoning. Single-hop answers are not accepted. Cites `[chunk_id]` per answer span. |
| Critique Agent | Output validation | Reviews synthesis/retrieval outputs claim-by-claim. Assigns confidence scores per claim (not globally). Flags specific text spans with reasons. |
| Synthesis Agent | Final answer | Merges all sub-agent outputs. Resolves contradictions flagged by the critique agent. Produces a provenance map linking each sentence to its source agent + chunk. |
| Compression Agent | Context management | Triggered when any agent exceeds 90% of its declared token budget. Losslessly preserves structured data (JSON, citations, scores). Lossy only on conversational filler. |
| Meta Agent | Self-improvement | Runs after every eval. Analyzes worst-performing dimension, proposes a prompt rewrite as a unified diff. Proposal is stored but NOT auto-applied. |
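
The orchestrator's turn-capped routing loop can be sketched roughly as follows. This is a minimal illustration, not the project's code: `SharedContext` here is a plain dataclass standing in for the Pydantic model, and `rule_based_router` is a hypothetical deterministic stand-in for the LLM routing call so that the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

MAX_TURNS = 12  # turn cap stated in the agents table


@dataclass
class SharedContext:
    """Simplified stand-in for the Pydantic SharedContext model."""
    query: str
    outputs: Dict[str, str] = field(default_factory=dict)  # agent name -> output
    routing_log: List[tuple] = field(default_factory=list)  # (turn, agent, justification)


# A router returns (next_agent, justification); next_agent=None means "stop".
Router = Callable[[SharedContext], Tuple[Optional[str], str]]


def run_orchestrator(ctx: SharedContext, choose_next: Router,
                     agents: Dict[str, Callable[[SharedContext], str]]) -> SharedContext:
    for turn in range(1, MAX_TURNS + 1):
        agent_name, why = choose_next(ctx)
        # Every routing decision is logged together with its justification.
        ctx.routing_log.append((turn, agent_name, why))
        if agent_name is None:
            break
        ctx.outputs[agent_name] = agents[agent_name](ctx)
    return ctx


def rule_based_router(ctx: SharedContext) -> Tuple[Optional[str], str]:
    """Hypothetical stand-in for the LLM: picks the first agent with no output yet."""
    for name in ("decomposition", "retrieval", "critique", "synthesis"):
        if name not in ctx.outputs:
            return name, f"{name} has not run yet"
    return None, "all agents have produced output"


agents = {name: (lambda ctx, n=name: f"{n} output for: {ctx.query}")
          for name in ("decomposition", "retrieval", "critique", "synthesis")}

ctx = run_orchestrator(SharedContext(query="What is RAG?"), rule_based_router, agents)
for entry in ctx.routing_log:
    print(entry)
```

The stand-in router happens to visit agents in a fixed order. In the real system the choice is made by the LLM from the current `SharedContext`, so only the loop shape, the logging, and the turn cap carry over.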

🔧 Tools

| Tool | Description | Failure Contracts |
| --- | --- | --- |
| `WebSearchTool` | Returns mocked search results with URLs and relevance scores | `timeout` → retry with 1.5× delay; `empty` → broaden query keyword; `malformed` → immediate fail |
| `CodeExecutionTool` | Runs Python in ephemeral Docker containers, returns stdout/stderr/exit code | `timeout` → logged + retry with longer timeout; `empty` → logged; `malformed` → immediate fail |
| `StructuredDataTool` | NL→SQL translation against Postgres (Employee/Department schema), injection guarded | `timeout` → retry; `empty` → logged; `malformed` → immediate fail |
| `SelfReflectionTool` | Re-reads previous agent outputs in the session, identifies contradictions | `timeout` → retry; `empty` → logged gracefully; invalid job_id → `malformed` |

Every tool allows up to 2 retries. Each attempt is logged separately, together with the inputs it used, which a retry may modify. The retry strategy depends on the failure type and is implemented explicitly in code, not in prompt instructions.
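
A failure-type-specific retry policy of this kind might look like the sketch below, loosely modeled on the `WebSearchTool` contract. The names `ToolFailure`, `AttemptLog`, `next_inputs`, and `call_with_retries` are hypothetical, the "drop the last keyword" broadening heuristic is an assumption, and the delay is only recorded in the inputs rather than actually slept.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

MAX_RETRIES = 2  # "up to 2 retries" means at most 3 attempts in total


class ToolFailure(Exception):
    def __init__(self, kind: str):
        super().__init__(kind)
        self.kind = kind  # "timeout", "empty", or "malformed"


@dataclass
class AttemptLog:
    attempt: int
    inputs: dict
    outcome: str  # "ok" or the failure kind


def next_inputs(kind: str, inputs: dict) -> Optional[dict]:
    """Per-failure-type strategy; returning None means fail immediately."""
    if kind == "timeout":
        return {**inputs, "delay_s": inputs["delay_s"] * 1.5}
    if kind == "empty":
        # Assumed heuristic: broaden the query by dropping its last keyword.
        words = inputs["query"].split()
        return {**inputs, "query": " ".join(words[:-1]) or inputs["query"]}
    return None  # "malformed"


def call_with_retries(tool: Callable[[dict], list],
                      inputs: dict) -> Tuple[Optional[list], List[AttemptLog]]:
    log: List[AttemptLog] = []
    for attempt in range(1, MAX_RETRIES + 2):
        try:
            result = tool(inputs)
            log.append(AttemptLog(attempt, inputs, "ok"))
            return result, log
        except ToolFailure as exc:
            log.append(AttemptLog(attempt, inputs, exc.kind))
            inputs = next_inputs(exc.kind, inputs)
            if inputs is None:
                break
    return None, log


def flaky_search(inputs: dict) -> list:
    """Mock tool: returns nothing for queries longer than two words."""
    if len(inputs["query"].split()) > 2:
        raise ToolFailure("empty")
    return [f"result for {inputs['query']}"]


result, log = call_with_retries(flaky_search, {"query": "rare obscure topic", "delay_s": 1.0})
for entry in log:
    print(entry)
```

Keeping the policy in a plain function like `next_inputs` makes each retry decision testable without involving the LLM.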

📊 Evaluation

15 test cases across 3 categories:

Baseline (5): Known correct answers — validates basic pipeline correctness
Ambiguous (5): Underspecified queries — tests decomposition quality and multi-hop retrieval
Adversarial (5): Prompt injections, factually wrong premises, and forced critique/synthesis contradictions

6 Scoring Dimensions (each produces a numeric score + written justification string):

correctness — answer accuracy vs expected; wrong premises penalized
citation_accuracy — citation presence, relevance, and multi-hop requirement
contradiction_resolution — were critique flags resolved in synthesis?
tool_efficiency — unnecessary tool calls are penalized
budget_compliance — policy violations lower the score
critique_agreement — final output alignment with critique validation

Every eval run is stored in full: exact prompts, tool calls, outputs, scores, and timestamps. Re-running on the same inputs therefore produces output that can be diffed against an earlier run.
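
To make the "numeric score + written justification" shape concrete, here is a small sketch of two dimension scorers. The `DimensionScore` type, the [0, 1] score range, the 0.2-per-call penalty, and the 0.5 cap for single-chunk citations are all assumptions chosen for illustration; the actual scoring rules live in the eval harness.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DimensionScore:
    dimension: str
    score: float        # assumed to lie in [0, 1]
    justification: str


def score_tool_efficiency(calls_made: int, calls_needed: int) -> DimensionScore:
    """Penalize each unnecessary tool call (0.2 per call is an arbitrary choice)."""
    extra = max(0, calls_made - calls_needed)
    score = round(max(0.0, 1.0 - 0.2 * extra), 2)
    return DimensionScore(
        "tool_efficiency", score,
        f"{calls_made} calls made, {calls_needed} needed, {extra} unnecessary")


def score_citation_accuracy(cited: List[str], relevant: List[str]) -> DimensionScore:
    """Fraction of cited chunks that are relevant, capped when fewer than 2 chunks are cited."""
    cited_set = set(cited)
    if not cited_set:
        return DimensionScore("citation_accuracy", 0.0, "no citations present")
    precision = len(cited_set & set(relevant)) / len(cited_set)
    multi_hop = len(cited_set) >= 2
    score = precision if multi_hop else min(precision, 0.5)
    note = "multi-hop satisfied" if multi_hop else "single chunk cited, score capped at 0.5"
    return DimensionScore(
        "citation_accuracy", round(score, 2),
        f"{len(cited_set & set(relevant))}/{len(cited_set)} cited chunks relevant; {note}")


print(score_tool_efficiency(calls_made=5, calls_needed=3))
print(score_citation_accuracy(["c1"], ["c1", "c2"]))
```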

🔄 Self-Improving Loop

Eval harness runs → scores per case + dimension stored in eval_test_cases
Meta Agent identifies worst-performing (agent, dimension) pair
Meta Agent proposes a rewritten prompt with unified diff + justification
Proposal is stored in prompt_proposals with status pending
Human reviews at POST /api/v1/prompts/proposals/{id}/review
On approval: old prompt deactivated, new version created, re-eval can be triggered
Performance delta is recorded on the next POST /api/v1/evals/trigger

Every step is auditable — timestamps, diffs, decisions, and deltas are all queryable.
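
The selection and diff steps of the loop can be sketched with the standard library alone. The score tuples, the example prompts, and the proposal dictionary layout below are hypothetical; only `difflib.unified_diff` and the `pending` status come from known sources.

```python
import difflib
from collections import defaultdict
from typing import Iterable, Tuple


def worst_pair(scores: Iterable[Tuple[str, str, float]]) -> Tuple[str, str]:
    """Return the (agent, dimension) pair with the lowest mean score."""
    buckets = defaultdict(list)
    for agent, dimension, score in scores:
        buckets[(agent, dimension)].append(score)
    return min(buckets, key=lambda k: sum(buckets[k]) / len(buckets[k]))


def prompt_diff(old: str, new: str, agent: str) -> str:
    """Render a prompt rewrite as a unified diff."""
    return "".join(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"{agent}/prompt@current",
        tofile=f"{agent}/prompt@proposed"))


scores = [
    ("retrieval", "citation_accuracy", 0.4),
    ("retrieval", "citation_accuracy", 0.6),
    ("synthesis", "correctness", 0.8),
    ("critique", "critique_agreement", 0.7),
]
agent, dimension = worst_pair(scores)

old_prompt = "Answer the question.\nCite sources.\n"
new_prompt = "Answer the question.\nCite a [chunk_id] for every answer span.\n"

# The proposal is only stored; nothing is applied until a human approves it.
proposal = {
    "agent": agent,
    "dimension": dimension,
    "diff": prompt_diff(old_prompt, new_prompt, agent),
    "justification": f"{dimension} is the lowest-scoring dimension for {agent}",
    "status": "pending",
}
print(proposal["diff"])
```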

🌐 API Reference

| Method | Endpoint | Description |
| --- | --- | --- |
| `POST` | `/api/v1/query` | Submit query → SSE stream of real-time agent activity |
| `GET` | `/api/v1/jobs/{job_id}/trace` | Full execution trace for a job |
| `GET` | `/api/v1/evals/latest` | Latest eval run summary by category and dimension |
| `POST` | `/api/v1/prompts/proposals/{id}/review` | Human approve/reject for a prompt rewrite |
| `POST` | `/api/v1/evals/trigger` | Trigger targeted re-eval with latest approved prompts |
| `GET` | `/api/v1/prompts/proposals` | List all prompt proposals (UI helper) |

Full interactive docs at http://localhost:8000/docs.

Error responses always include:

{
  "error_code": "MACHINE_READABLE_CODE",
  "message": "Human-readable message",
  "job_id": "uuid-if-applicable"
}
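
A client could validate this envelope before acting on it. The sketch below is one way to do that; `ApiError`, `parse_error`, and the `JOB_NOT_FOUND` code are hypothetical, and it assumes `job_id` may be absent or null when no job applies.

```python
import json
from dataclasses import dataclass
from typing import Optional

REQUIRED_FIELDS = ("error_code", "message")


@dataclass
class ApiError:
    error_code: str
    message: str
    job_id: Optional[str] = None  # assumed optional: "uuid-if-applicable"


def parse_error(body: str) -> ApiError:
    """Parse an error response body, rejecting envelopes that lack required fields."""
    data = json.loads(body)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"error response is missing fields: {missing}")
    return ApiError(data["error_code"], data["message"], data.get("job_id"))


err = parse_error('{"error_code": "JOB_NOT_FOUND", "message": "No such job", "job_id": null}')
print(err)
```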

⚙️ Environment Variables

All configuration is via environment variables. No credentials are hardcoded.

| Variable | Default | Description |
| --- | --- | --- |
| `OPENAI_API_KEY` | (required) | API key for LLM calls via litellm |
| `LLM_MODEL` | `gpt-4o-mini` | Any litellm-compatible model (e.g. `claude-3-5-sonnet-20241022`) |
| `LLM_TEMPERATURE` | `0.2` | Sampling temperature |
| `DATABASE_URL` | `postgresql+asyncpg://...` | Postgres connection string |
| `REDIS_URL` | `redis://redis:6379/0` | Redis connection string |
| `USE_DOCKER_SANDBOX` | `true` | Use Docker for code execution isolation |
| `CODE_EXEC_TIMEOUT` | `10` | Code execution timeout in seconds |
| `POSTGRES_USER` | `admin` | Postgres user |
| `POSTGRES_PASSWORD` | `admin` | Postgres password |
| `POSTGRES_DB` | `multiagent_db` | Postgres DB name |
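
As an illustration of how these variables and defaults might be read at startup, here is a small loader sketch. The `Settings` class and `load_settings` function are hypothetical, the boolean parsing rule is an assumption, and `DATABASE_URL` is left out because its default is elided in the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    llm_model: str
    llm_temperature: float
    redis_url: str
    use_docker_sandbox: bool
    code_exec_timeout: int  # seconds


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from an environment mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is required")
    return Settings(
        openai_api_key=api_key,
        llm_model=env.get("LLM_MODEL", "gpt-4o-mini"),
        llm_temperature=float(env.get("LLM_TEMPERATURE", "0.2")),
        redis_url=env.get("REDIS_URL", "redis://redis:6379/0"),
        # Assumed parsing rule: "1", "true", or "yes" (case-insensitive) enable the sandbox.
        use_docker_sandbox=env.get("USE_DOCKER_SANDBOX", "true").strip().lower()
        in {"1", "true", "yes"},
        code_exec_timeout=int(env.get("CODE_EXEC_TIMEOUT", "10")),
    )


# Passing an explicit mapping keeps the example independent of the real environment.
settings = load_settings({"OPENAI_API_KEY": "your-actual-key", "CODE_EXEC_TIMEOUT": "30"})
print(settings.llm_model, settings.code_exec_timeout)
```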

⚠️ Known Limitations

LLM JSON reliability: The system depends on LLMs returning valid JSON. parse_json_response() includes fallback extraction, but adversarial LLM outputs can still cause parse failures. Production use would add a structured outputs API (OpenAI response_format: json_schema).
Mock retrieval: The knowledge base is a fixed 10-document corpus with keyword-based retrieval, not a real vector database. Real deployment should use pgvector or Pinecone with embeddings.
Web search stub: The web search tool is deterministic and mocked. It does not make real HTTP calls. Real deployment would integrate Tavily or Brave Search.
Sequential eval runs: The eval harness runs the 15 cases sequentially, with no parallelism, to stay within LLM API rate limits. A full eval run is therefore slow (~5-10 minutes with a real LLM API).
Meta-agent proposals are shallow: The meta-agent proposes one prompt rewrite per eval run, targeting only the worst dimension. Real-world improvement loops would benefit from A/B testing and statistical significance testing.
Context compression is advisory: The compression agent receives context and returns a compressed version, but the system doesn't fully reconstruct state from the compressed text (it adjusts token counters). A production system would need structured context restoration.
Docker-in-Docker sandbox: Code execution requires mounting the Docker socket, which has security implications in shared environments. Use a dedicated network-isolated sandbox service in production.
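
For the JSON reliability limitation, the sketch below shows one common shape for fallback extraction. It is not the project's `parse_json_response()`; the function name `extract_json` and the three-stage order (direct parse, fenced block, outermost braces) are assumptions.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text: str):
    """Try progressively looser strategies to recover JSON from an LLM reply."""
    candidates = [text]

    # Strategy 2: the contents of the first fenced code block, if any.
    fenced = FENCED_BLOCK.search(text)
    if fenced:
        candidates.append(fenced.group(1))

    # Strategy 3: everything between the first "{" and the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])

    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    raise ValueError("no valid JSON found in LLM response")


reply = 'Sure, here is the plan: {"next_agent": "retrieval", "reason": "needs sources"} Hope that helps.'
print(extract_json(reply))
```

Even with fallbacks like these, a reply containing no well-formed JSON still raises, which is why the limitation above points toward schema-constrained structured outputs.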

🚀 What I'd Build Next

pgvector integration for real semantic retrieval with embedding-based multi-hop
Parallel agent execution for independent sub-tasks (async DAG runner)
Structured output schemas using OpenAI's response_format: json_schema for zero-parse-failure agents
A/B prompt testing framework that runs both old and new prompts on a holdout split before accepting rewrites
Agent memory via a persistent vector store for session-level and cross-session context
Human-in-the-loop interrupts — allow a human to pause the pipeline mid-run and inject feedback
Distributed tracing with OpenTelemetry for cross-service span correlation
Adversarial red-teaming module that continuously generates new injection patterns using the LLM itself

🧪 Running Tests

cd backend
pip install -r requirements.txt
pytest tests/ -v
