A production-grade, containerized multi-agent orchestration system with a self-improving evaluation loop, dynamic tool orchestration, and adversarial robustness testing.
git clone https://github.com/YOUR_USERNAME/multiagent-eval-system.git
cd multiagent-eval-system
# 1. Copy and configure environment variables
cp .env.example .env
# Open .env and set OPENAI_API_KEY=your-actual-key
# 2. Launch all services
docker compose up --build
# 3. Open the UI
open http://localhost:3000
# API docs: http://localhost:8000/docs

That's it. Once `.env` is configured, `docker compose up` starts every service with no further manual steps.
┌─────────────────────────────────────────────────────────────────┐
│                        Client (Browser)                         │
│            SSE Stream ◄────────── POST /api/v1/query            │
└───────────────────────────┬─────────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────────┐
│                     FastAPI Server (:8000)                      │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  5 Endpoints: /query, /trace, /evals/latest,              │  │
│  │  /proposals/:id/review, /evals/trigger                    │  │
│  └────────────────────────┬──────────────────────────────────┘  │
│                           │  Celery Task Queue                  │
└───────────────────────────┼─────────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────────┐
│                          Celery Worker                          │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                    Master Orchestrator                    │  │
│  │  LLM-driven dynamic routing (no hardcoded chain)          │  │
│  │  Logs every routing decision with justification           │  │
│  │                                                           │  │
│  │  ┌─────────────┐ ┌───────────┐ ┌──────────┐ ┌───────────┐ │  │
│  │  │Decomposition│ │ Retrieval │ │ Critique │ │ Synthesis │ │  │
│  │  │    Agent    │ │   Agent   │ │  Agent   │ │   Agent   │ │  │
│  │  └─────────────┘ └───────────┘ └──────────┘ └───────────┘ │  │
│  │               ↕ SharedContext (Pydantic) ↕                │  │
│  │  ┌─────────────┐   ┌──────────────────────┐               │  │
│  │  │ Compression │   │      Meta Agent      │               │  │
│  │  │    Agent    │   │   (eval loop only)   │               │  │
│  │  └─────────────┘   └──────────────────────┘               │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  Tools: WebSearch | CodeExec | StructuredData | SelfReflection  │
└───────────────────────────┬─────────────────────────────────────┘
                            │
             ┌──────────────┴──────────────┐
             │                             │
     ┌───────▼───────┐           ┌─────────▼─────────┐
     │  PostgreSQL   │           │       Redis       │
     │  (Postgres)   │           │ (Broker + PubSub) │
     └───────────────┘           └───────────────────┘
| Agent | Role | Decision Boundary |
|---|---|---|
| Master Orchestrator | Dynamic routing brain | Uses LLM to evaluate SharedContext state and select the next agent. Never follows a fixed chain. Max 12 turns. |
| Decomposition Agent | Query analysis | Breaks queries into typed sub-tasks (factual, analytical, creative, retrieval) with explicit JSON dependency graphs. Dependent tasks are gated by the orchestrator. |
| Retrieval Agent | Multi-hop RAG | Retrieves ≥2 chunks and requires cross-chunk reasoning. Single-hop answers are not accepted. Cites [chunk_id] per answer span. |
| Critique Agent | Output validation | Reviews synthesis/retrieval outputs claim-by-claim. Assigns confidence scores per claim (not globally). Flags specific text spans with reasons. |
| Synthesis Agent | Final answer | Merges all sub-agent outputs. Resolves contradictions flagged by the critique agent. Produces a provenance map linking each sentence to its source agent + chunk. |
| Compression Agent | Context management | Triggered when any agent exceeds 90% of its declared token budget. Losslessly preserves structured data (JSON, citations, scores). Lossy only on conversational filler. |
| Meta Agent | Self-improvement | Runs after every eval. Analyzes worst-performing dimension, proposes a prompt rewrite as a unified diff. Proposal is stored but NOT auto-applied. |
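
The routing behaviour described for the Master Orchestrator can be sketched as a bounded loop. This is a minimal illustration, not the project's implementation: `SharedContext` here is a plain dataclass standing in for the Pydantic model, and `run_orchestrator`, `toy_router`, and the agent callables are hypothetical names. In the real system the router would be an LLM call; a deterministic function is substituted so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

MAX_TURNS = 12  # turn cap stated in the table above


@dataclass
class SharedContext:
    """Hypothetical stand-in for the Pydantic SharedContext model."""
    query: str
    outputs: dict[str, str] = field(default_factory=dict)
    routing_log: list[dict] = field(default_factory=list)


# A router returns (next agent name, or None to stop) plus a justification.
Router = Callable[[SharedContext], tuple[Optional[str], str]]
Agent = Callable[[SharedContext], str]


def run_orchestrator(ctx: SharedContext, router: Router,
                     agents: dict[str, Agent]) -> SharedContext:
    """Ask the router for the next agent each turn and log every decision."""
    for turn in range(1, MAX_TURNS + 1):
        next_agent, why = router(ctx)
        ctx.routing_log.append(
            {"turn": turn, "agent": next_agent, "justification": why}
        )
        if next_agent is None:  # the router decided the job is finished
            break
        ctx.outputs[next_agent] = agents[next_agent](ctx)
    return ctx


def toy_router(ctx: SharedContext) -> tuple[Optional[str], str]:
    """Deterministic placeholder for the LLM routing call (assumption)."""
    for name in ("decomposition", "retrieval", "synthesis"):
        if name not in ctx.outputs:
            return name, f"{name} has not produced output yet"
    return None, "all required outputs are present"


agents = {
    name: (lambda ctx, n=name: f"{n} output")
    for name in ("decomposition", "retrieval", "synthesis")
}
result = run_orchestrator(SharedContext(query="example"), toy_router, agents)
```

Because the loop is bounded by `MAX_TURNS`, a router that never returns `None` still terminates after 12 logged decisions.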
| Tool | Description | Failure Contracts |
|---|---|---|
| `WebSearchTool` | Returns mocked search results with URLs and relevance scores | timeout → retry with 1.5× delay; empty → broaden query keyword; malformed → immediate fail |
| `CodeExecutionTool` | Runs Python in ephemeral Docker containers, returns stdout/stderr/exit code | timeout → logged + retry with longer timeout; empty → logged; malformed → immediate fail |
| `StructuredDataTool` | NL→SQL translation against Postgres (Employee/Department schema), injection guarded | timeout → retry; empty → logged; malformed → immediate fail |
| `SelfReflectionTool` | Re-reads previous agent outputs in the session, identifies contradictions | timeout → retry; empty → logged gracefully; invalid job_id → malformed |
All tools support up to 2 retries. Each attempt is logged separately, together with its (possibly modified) inputs. The retry strategy differs by failure type and is explicit in code, not in prompt instructions.
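
A failure contract of this kind might look roughly like the sketch below, modelled on the `WebSearchTool` row. The names `Failure`, `ToolError`, `broaden`, and `call_with_retries` are hypothetical, and the broadening rule (drop the last keyword) is an assumption made only so the example is concrete.

```python
import time
from enum import Enum

MAX_RETRIES = 2  # "up to 2 retries": at most 3 attempts in total


class Failure(Enum):
    TIMEOUT = "timeout"
    EMPTY = "empty"
    MALFORMED = "malformed"


class ToolError(Exception):
    """Hypothetical exception carrying the failure type of a tool call."""

    def __init__(self, kind: Failure):
        super().__init__(kind.value)
        self.kind = kind


def broaden(query: str) -> str:
    """Assumed broadening rule: drop the last keyword, keep at least one."""
    words = query.split()
    return " ".join(words[:-1]) if len(words) > 1 else query


def call_with_retries(tool, query, log, delay=1.0, sleep=time.sleep):
    """Call `tool(query)`, retrying according to the failure type.

    Every attempt is appended to `log` with the inputs it actually used.
    """
    for attempt in range(MAX_RETRIES + 1):
        try:
            result = tool(query)
        except ToolError as err:
            log.append({"attempt": attempt, "query": query,
                        "outcome": err.kind.value})
            # Malformed input fails immediately; so does the last attempt.
            if err.kind is Failure.MALFORMED or attempt == MAX_RETRIES:
                raise
            if err.kind is Failure.TIMEOUT:
                delay *= 1.5          # wait 1.5x longer before retrying
                sleep(delay)
            else:                     # Failure.EMPTY
                query = broaden(query)
        else:
            log.append({"attempt": attempt, "query": query, "outcome": "ok"})
            return result
```

Keeping the branch on `err.kind` in ordinary code means the retry behaviour can be unit-tested without involving an LLM.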
15 test cases across 3 categories:
- Baseline (5): Known correct answers. Validates basic pipeline correctness.
- Ambiguous (5): Underspecified queries. Tests decomposition quality and multi-hop retrieval.
- Adversarial (5): Prompt injections, factually wrong premises, and forced critique/synthesis contradictions
6 Scoring Dimensions (each produces a numeric score + written justification string):
- `correctness`: answer accuracy vs. expected; wrong premises penalized
- `citation_accuracy`: citation presence, relevance, and multi-hop requirement
- `contradiction_resolution`: were critique flags resolved in synthesis?
- `tool_efficiency`: unnecessary tool calls are penalized
- `budget_compliance`: policy violations lower the score
- `critique_agreement`: final output alignment with critique validation
Every eval run is stored in full: exact prompts, tool calls, outputs, scores, and timestamps. Re-running on the same inputs produces output that can be diffed against the previous run.
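
One way to make stored scores diffable is to serialise them deterministically and compare runs line by line. The sketch below assumes a record shape (`DimensionScore`) and helper names (`render_run`, `diff_runs`) that are illustrative only; the actual storage schema is not shown in this README.

```python
import difflib
import json
from dataclasses import asdict, dataclass

DIMENSIONS = (
    "correctness", "citation_accuracy", "contradiction_resolution",
    "tool_efficiency", "budget_compliance", "critique_agreement",
)


@dataclass(frozen=True)
class DimensionScore:
    """Hypothetical record: one numeric score plus its written justification."""
    case_id: str
    dimension: str
    score: float
    justification: str


def render_run(scores: list[DimensionScore]) -> list[str]:
    """Serialise a run in a stable order so equal runs render identically."""
    rows = sorted((asdict(s) for s in scores),
                  key=lambda r: (r["case_id"], r["dimension"]))
    return json.dumps(rows, indent=2, sort_keys=True).splitlines()


def diff_runs(before: list[DimensionScore],
              after: list[DimensionScore]) -> list[str]:
    """Return a unified diff between two rendered eval runs."""
    return list(difflib.unified_diff(
        render_run(before), render_run(after),
        fromfile="run_a", tofile="run_b", lineterm="",
    ))


run_a = [DimensionScore("baseline-1", "correctness", 0.4, "missed one fact")]
run_b = [DimensionScore("baseline-1", "correctness", 0.9, "all facts present")]
changes = diff_runs(run_a, run_b)
```

Sorting rows and keys before rendering is what keeps the diff limited to genuine changes rather than ordering noise.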
- Eval harness runs → scores per case + dimension stored in `eval_test_cases`
- Meta Agent identifies worst-performing `(agent, dimension)` pair
- Meta Agent proposes a rewritten prompt with unified diff + justification
- Proposal is stored in `prompt_proposals` with status `pending`
- Human reviews at `POST /api/v1/prompts/proposals/{id}/review`
- On approval: old prompt deactivated, new version created, `re-eval` can be triggered
- Performance delta is recorded on the next `POST /api/v1/evals/trigger`
Every step is auditable: timestamps, diffs, decisions, and deltas are all queryable.
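
The proposal step of this loop can be illustrated with the standard library alone. The sketch below is an assumption-laden outline: `PromptProposal`, `worst_pair`, and `review` are hypothetical names, and the `approved`/`rejected` status values are guesses (only `pending` is stated above).

```python
import difflib
from dataclasses import dataclass


@dataclass
class PromptProposal:
    """Hypothetical in-memory shape of a row in `prompt_proposals`."""
    agent: str
    dimension: str
    old_prompt: str
    new_prompt: str
    justification: str
    status: str = "pending"  # proposals are stored, never auto-applied

    @property
    def diff(self) -> str:
        """Unified diff between the active prompt and the proposed rewrite."""
        return "\n".join(difflib.unified_diff(
            self.old_prompt.splitlines(), self.new_prompt.splitlines(),
            fromfile="active", tofile="proposed", lineterm="",
        ))


def worst_pair(mean_scores: dict[tuple[str, str], float]) -> tuple[str, str]:
    """Pick the (agent, dimension) pair with the lowest mean score."""
    return min(mean_scores, key=mean_scores.get)


def review(proposal: PromptProposal, approve: bool) -> PromptProposal:
    """Record the human decision; a proposal can be reviewed only once."""
    if proposal.status != "pending":
        raise ValueError("proposal has already been reviewed")
    proposal.status = "approved" if approve else "rejected"  # assumed values
    return proposal


agent, dimension = worst_pair({
    ("retrieval", "citation_accuracy"): 0.4,
    ("synthesis", "correctness"): 0.8,
})
proposal = PromptProposal(
    agent=agent,
    dimension=dimension,
    old_prompt="Answer the question.\nCite sources.",
    new_prompt="Answer the question.\nCite a [chunk_id] for every answer span.",
    justification="citation_accuracy was the lowest-scoring dimension",
)
```

Storing the diff alongside the decision is what lets a reviewer see exactly which prompt lines would change before approving.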
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/api/v1/query` | Submit query → SSE stream of real-time agent activity |
| `GET` | `/api/v1/jobs/{job_id}/trace` | Full execution trace for a job |
| `GET` | `/api/v1/evals/latest` | Latest eval run summary by category and dimension |
| `POST` | `/api/v1/prompts/proposals/{id}/review` | Human approve/reject for a prompt rewrite |
| `POST` | `/api/v1/evals/trigger` | Trigger targeted re-eval with latest approved prompts |
| `GET` | `/api/v1/prompts/proposals` | List all prompt proposals (UI helper) |
Full interactive docs at http://localhost:8000/docs.
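
A client for the streaming query endpoint could be written with the standard library along these lines. Treat it as a sketch under assumptions: the request body shape (`{"query": ...}`) and the payload format of each event are not documented here, so check the interactive docs before relying on them. `iter_sse_data` and `stream_query` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the data payload of each server-sent event.

    Events are separated by blank lines; lines starting with ":" are
    comments (often keep-alives) and are ignored. An event that is not
    terminated by a blank line is discarded.
    """
    buffer: list[str] = []
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        if line.startswith("data:"):
            # Strip at most one leading space after the colon.
            buffer.append(line[5:].removeprefix(" "))
        elif line == "" and buffer:
            yield "\n".join(buffer)
            buffer = []


def stream_query(query: str,
                 base_url: str = "http://localhost:8000") -> Iterator[str]:
    """POST a query and yield raw event payloads as they arrive."""
    request = urllib.request.Request(
        f"{base_url}/api/v1/query",
        data=json.dumps({"query": query}).encode("utf-8"),  # assumed body shape
        headers={"Content-Type": "application/json",
                 "Accept": "text/event-stream"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        yield from iter_sse_data(response)
```

With the stack running, `for payload in stream_query("..."): print(payload)` would print each event as it arrives.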
Error responses always include:
{
"error_code": "MACHINE_READABLE_CODE",
"message": "Human-readable message",
"job_id": "uuid-if-applicable"
}

All configuration is via environment variables. No credentials are hardcoded.
| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | (required) | API key for LLM calls via litellm |
| `LLM_MODEL` | `gpt-4o-mini` | Any litellm-compatible model (e.g. `claude-3-5-sonnet-20241022`) |
| `LLM_TEMPERATURE` | `0.2` | Sampling temperature |
| `DATABASE_URL` | `postgresql+asyncpg://...` | Postgres connection string |
| `REDIS_URL` | `redis://redis:6379/0` | Redis connection string |
| `USE_DOCKER_SANDBOX` | `true` | Use Docker for code execution isolation |
| `CODE_EXEC_TIMEOUT` | `10` | Code execution timeout in seconds |
| `POSTGRES_USER` | `admin` | Postgres user |
| `POSTGRES_PASSWORD` | `admin` | Postgres password |
| `POSTGRES_DB` | `multiagent_db` | Postgres DB name |
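
For reference, reading a subset of these variables with their documented defaults might look like the sketch below. `Settings` and `load_settings` are hypothetical names, the set of strings accepted as "true" is an assumption, and `DATABASE_URL` is left out because its default is abbreviated in the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical typed view over part of the environment configuration."""
    openai_api_key: str
    llm_model: str
    llm_temperature: float
    redis_url: str
    use_docker_sandbox: bool
    code_exec_timeout: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from the environment, failing fast on a missing key."""
    try:
        api_key = env["OPENAI_API_KEY"]
    except KeyError:
        raise RuntimeError("OPENAI_API_KEY is required") from None
    return Settings(
        openai_api_key=api_key,
        llm_model=env.get("LLM_MODEL", "gpt-4o-mini"),
        llm_temperature=float(env.get("LLM_TEMPERATURE", "0.2")),
        redis_url=env.get("REDIS_URL", "redis://redis:6379/0"),
        # Assumed truthy spellings; the project may accept a different set.
        use_docker_sandbox=env.get("USE_DOCKER_SANDBOX", "true").strip().lower()
        in {"1", "true", "yes"},
        code_exec_timeout=int(env.get("CODE_EXEC_TIMEOUT", "10")),
    )
```

Passing the environment as a mapping keeps the loader easy to test, for example `load_settings({"OPENAI_API_KEY": "test-key"})`.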
- LLM JSON reliability: The system depends on LLMs returning valid JSON. `parse_json_response()` includes fallback extraction, but adversarial LLM outputs can still cause parse failures. Production use would add a structured outputs API (OpenAI `response_format: json_schema`).
- Mock retrieval: The knowledge base is a fixed 10-document corpus with keyword-based retrieval, not a real vector database. A real deployment should use `pgvector` or Pinecone with embeddings.
- Web search stub: The web search tool is deterministic and mocked; it makes no real HTTP calls. A real deployment would integrate Tavily or Brave Search.
- Sequential eval runs: The eval harness runs all 15 cases sequentially (no parallelism) to stay within LLM API rate limits. This makes full eval runs slow (~5-10 minutes with a real LLM API).
- Meta-agent proposals are shallow: The meta-agent proposes one prompt rewrite per eval run, targeting only the worst dimension. Real-world improvement loops would benefit from A/B testing and statistical significance testing.
- Context compression is advisory: The compression agent receives context and returns a compressed version, but the system doesn't fully reconstruct state from the compressed text; it only adjusts token counters. A production system would need structured context restoration.
- Docker-in-Docker sandbox: Code execution requires mounting the Docker socket, which has security implications in shared environments. Use a dedicated network-isolated sandbox service in production.
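
To make the first limitation concrete, fallback extraction of JSON from model output generally looks something like the sketch below. This is not the project's `parse_json_response()`; `parse_json_with_fallback` is a hypothetical function showing one common approach (try the raw text, then fenced blocks, then the outermost braces), and it shares the same weakness: output with no recoverable JSON still fails.

```python
import json
import re
from typing import Any

# Matches a Markdown code fence, optionally tagged "json".
_FENCE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)


def parse_json_with_fallback(text: str) -> Any:
    """Try progressively looser ways of finding JSON in model output."""
    candidates = [text.strip()]
    candidates += [match.strip() for match in _FENCE.findall(text)]

    # Last resort: everything between the first "{" and the last "}".
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start:end + 1])

    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    raise ValueError("no valid JSON found in model output")
```

A schema-constrained structured outputs API removes the need for this kind of guessing, which is why it appears in the list that follows.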
- pgvector integration for real semantic retrieval with embedding-based multi-hop
- Parallel agent execution for independent sub-tasks (async DAG runner)
- Structured output schemas using OpenAI's `response_format: json_schema` for zero-parse-failure agents
- A/B prompt testing framework that runs both old and new prompts on a holdout split before accepting rewrites
- Agent memory via a persistent vector store for session-level and cross-session context
- Human-in-the-loop interrupts: allow a human to pause the pipeline mid-run and inject feedback
- Distributed tracing with OpenTelemetry for cross-service span correlation
- Adversarial red-teaming module that continuously generates new injection patterns using the LLM itself
cd backend
pip install -r requirements.txt
pytest tests/ -v