Rkvishnu/multiagent-eval-system

MultiAgent OS 🤖

A production-grade, containerized multi-agent orchestration system with a self-improving evaluation loop, dynamic tool orchestration, and adversarial robustness testing.


⚑ Quick Start (5 minutes)

git clone https://github.com/YOUR_USERNAME/multiagent-eval-system.git
cd multiagent-eval-system

# 1. Copy and configure environment variables
cp .env.example .env
# Open .env and set OPENAI_API_KEY=your-actual-key

# 2. Launch all services
docker compose up --build

# 3. Open the UI (`open` is the macOS command; on Linux use xdg-open or paste the URL into a browser)
open http://localhost:3000
# API docs: http://localhost:8000/docs

That's it. Once .env contains your API key, docker compose up --build starts every service with no further manual steps.


🏗 Architecture

┌──────────────────────────────────────────────────────────────────┐
│                         Client (Browser)                         │
│            SSE Stream ←─────────── POST /api/v1/query            │
└──────────────────────────┬───────────────────────────────────────┘
                           │
┌──────────────────────────▼───────────────────────────────────────┐
│                      FastAPI Server (:8000)                      │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │  5 Endpoints: /query, /trace, /evals/latest,             │    │
│  │               /proposals/:id/review, /evals/trigger      │    │
│  └─────────────────────────┬────────────────────────────────┘    │
│                            │ Celery Task Queue                   │
└────────────────────────────┼─────────────────────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────────────┐
│                    Celery Worker                                 │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                  Master Orchestrator                       │  │
│  │   LLM-driven dynamic routing (no hardcoded chain)          │  │
│  │   Logs every routing decision with justification           │  │
│  │                                                            │  │
│  │  ┌─────────────┐  ┌──────────┐  ┌─────────┐  ┌─────────┐   │  │
│  │  │Decomposition│  │Retrieval │  │Critique │  │Synthesis│   │  │
│  │  │    Agent    │  │  Agent   │  │  Agent  │  │  Agent  │   │  │
│  │  └─────────────┘  └──────────┘  └─────────┘  └─────────┘   │  │
│  │          ↑ SharedContext (Pydantic) ↑                      │  │
│  │  ┌─────────────┐              ┌──────────────────────┐     │  │
│  │  │ Compression │              │    Meta Agent        │     │  │
│  │  │   Agent     │              │  (eval loop only)    │     │  │
│  │  └─────────────┘              └──────────────────────┘     │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  Tools: WebSearch | CodeExec | StructuredData | SelfReflection   │
└───────────────────────┬──────────────────────────────────────────┘
                        │
        ┌───────────────┴──────────────┐
        │                              │
┌───────▼──────┐             ┌─────────▼────────┐
│  PostgreSQL  │             │      Redis       │
│  (Postgres)  │             │ (Broker + PubSub)│
└──────────────┘             └──────────────────┘

🤖 Agents

| Agent | Role | Decision Boundary |
|---|---|---|
| Master Orchestrator | Dynamic routing brain | Uses LLM to evaluate SharedContext state and select the next agent. Never follows a fixed chain. Max 12 turns. |
| Decomposition Agent | Query analysis | Breaks queries into typed sub-tasks (factual, analytical, creative, retrieval) with explicit JSON dependency graphs. Dependent tasks are gated by the orchestrator. |
| Retrieval Agent | Multi-hop RAG | Retrieves ≥2 chunks and requires cross-chunk reasoning. Single-hop answers are not accepted. Cites [chunk_id] per answer span. |
| Critique Agent | Output validation | Reviews synthesis/retrieval outputs claim-by-claim. Assigns confidence scores per claim (not globally). Flags specific text spans with reasons. |
| Synthesis Agent | Final answer | Merges all sub-agent outputs. Resolves contradictions flagged by the critique agent. Produces a provenance map linking each sentence to its source agent + chunk. |
| Compression Agent | Context management | Triggered when any agent exceeds 90% of its declared token budget. Losslessly preserves structured data (JSON, citations, scores). Lossy only on conversational filler. |
| Meta Agent | Self-improvement | Runs after every eval. Analyzes worst-performing dimension, proposes a prompt rewrite as a unified diff. Proposal is stored but NOT auto-applied. |
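
The routing rules in the table (no fixed chain, a 12-turn cap, compression once usage passes 90% of a budget) can be sketched roughly as below. This is an illustration only, not the repository's code: `SharedContext` here is a plain dataclass standing in for the real Pydantic model, the single `token_budget` replaces the per-agent budgets, and `run`, `stub_router`, and the agent callables are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

MAX_TURNS = 12                # turn cap stated in the table above
COMPRESSION_THRESHOLD = 0.90  # compress once usage passes 90% of the budget


@dataclass
class SharedContext:
    """Hypothetical stand-in for the project's Pydantic SharedContext."""
    query: str
    tokens_used: int = 0
    token_budget: int = 1000  # assumption: one shared budget, not per-agent
    routing_log: List[Tuple[int, str, str]] = field(default_factory=list)
    done: bool = False


def needs_compression(ctx: SharedContext) -> bool:
    return ctx.tokens_used > COMPRESSION_THRESHOLD * ctx.token_budget


def run(
    ctx: SharedContext,
    choose_next: Callable[[SharedContext], Tuple[str, str]],
    agents: Dict[str, Callable[[SharedContext], None]],
) -> SharedContext:
    """Route turn by turn; every decision is logged with its justification."""
    for turn in range(MAX_TURNS):
        if needs_compression(ctx):
            name, reason = "compression", "token usage above 90% of budget"
        else:
            name, reason = choose_next(ctx)  # an LLM call in the real system
        ctx.routing_log.append((turn, name, reason))
        agents[name](ctx)
        if ctx.done:
            break
    return ctx


def stub_router(ctx: SharedContext) -> Tuple[str, str]:
    """Deterministic replacement for the LLM router, for demonstration."""
    plan = ["decomposition", "retrieval", "synthesis"]
    completed = [n for _, n, _ in ctx.routing_log if n != "compression"]
    return plan[len(completed)], "next stage in the stub plan"


def spend(tokens: int) -> Callable[[SharedContext], None]:
    return lambda ctx: setattr(ctx, "tokens_used", ctx.tokens_used + tokens)


demo_agents = {
    "decomposition": spend(500),
    "retrieval": spend(450),
    "compression": lambda ctx: setattr(ctx, "tokens_used", 300),
    "synthesis": lambda ctx: setattr(ctx, "done", True),
}
demo = run(SharedContext(query="example"), stub_router, demo_agents)
print([name for _, name, _ in demo.routing_log])
```

In this sketch the compression step pre-empts the router's choice for one turn, so it counts against the 12-turn cap; whether the real orchestrator counts it that way is not stated here.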

🔧 Tools

| Tool | Description | Failure Contracts |
|---|---|---|
| WebSearchTool | Returns mocked search results with URLs and relevance scores | timeout → retry with 1.5× delay; empty → broaden query keyword; malformed → immediate fail |
| CodeExecutionTool | Runs Python in ephemeral Docker containers, returns stdout/stderr/exit code | timeout → logged + retry with longer timeout; empty → logged; malformed → immediate fail |
| StructuredDataTool | NL→SQL translation against Postgres (Employee/Department schema), injection guarded | timeout → retry; empty → logged; malformed → immediate fail |
| SelfReflectionTool | Re-reads previous agent outputs in the session, identifies contradictions | timeout → retry; empty → logged gracefully; invalid job_id → malformed |

All tools support up to 2 retries. Each attempt is logged separately with modified inputs. Retry strategy differs by failure type; it is explicit in code, not in prompt instructions.
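
As a rough illustration of what a failure contract "explicit in code" might look like, here is a minimal sketch of the WebSearchTool row: timeouts back off by 1.5×, empty results broaden the query, and malformed input fails immediately. The exception classes, `search_with_contract`, and the `broaden` rule (drop the last keyword) are assumptions made for this example, not the repository's actual names or behavior.

```python
import time
from typing import Callable, Dict, List, Tuple

MAX_RETRIES = 2  # "All tools support up to 2 retries"


class ToolTimeout(Exception):
    """Hypothetical: the tool did not answer in time."""


class EmptyResult(Exception):
    """Hypothetical: the tool answered with no results."""


class MalformedInput(Exception):
    """Hypothetical: the request itself was invalid."""


def broaden(query: str) -> str:
    """Assumed broadening rule: drop the last keyword."""
    words = query.split()
    return " ".join(words[:-1]) if len(words) > 1 else query


def search_with_contract(
    search: Callable[[str], List[str]],
    query: str,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[str], List[Dict[str, object]]]:
    """Return (results, attempt log); each attempt records its modified inputs."""
    attempts: List[Dict[str, object]] = []
    for attempt in range(1 + MAX_RETRIES):
        attempts.append({"attempt": attempt + 1, "query": query, "delay_s": delay_s})
        try:
            results = search(query)
            if not results:
                raise EmptyResult(query)
            return results, attempts
        except ToolTimeout:
            if attempt == MAX_RETRIES:
                raise
            sleep(delay_s)   # wait, then retry with a 1.5x longer delay
            delay_s *= 1.5
        except EmptyResult:
            if attempt == MAX_RETRIES:
                raise
            query = broaden(query)  # retry with a broader query
        # MalformedInput is deliberately not caught: it fails immediately.
    raise AssertionError("unreachable: the loop returns or raises")


def fake_search(q: str) -> List[str]:
    """Stand-in for the mocked search backend."""
    return ["https://example.com/doc"] if q == "multi agent eval" else []


hits, log = search_with_contract(fake_search, "multi agent eval harness")
print(hits, [entry["query"] for entry in log])
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting.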


📊 Evaluation

15 test cases across 3 categories:

  • Baseline (5): Known correct answers; validates basic pipeline correctness
  • Ambiguous (5): Underspecified queries; tests decomposition quality and multi-hop retrieval
  • Adversarial (5): Prompt injections, factually wrong premises, and forced critique/synthesis contradictions

6 Scoring Dimensions (each produces a numeric score + written justification string):

  1. correctness: answer accuracy vs expected; wrong premises penalized
  2. citation_accuracy: citation presence, relevance, and multi-hop requirement
  3. contradiction_resolution: were critique flags resolved in synthesis?
  4. tool_efficiency: unnecessary tool calls are penalized
  5. budget_compliance: policy violations lower the score
  6. critique_agreement: final output alignment with critique validation

Every eval run is stored in full: exact prompts, tool calls, outputs, scores, and timestamps. Re-running on the same inputs produces diff-able output.
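
To make the "numeric score + written justification" shape concrete, the sketch below models one score record and averages scores per (category, dimension). The `DimensionScore` fields, the 0.0 to 1.0 score range, and the `summarize` helper are assumptions for illustration; the real schema may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# The six dimension names listed above.
DIMENSIONS = (
    "correctness",
    "citation_accuracy",
    "contradiction_resolution",
    "tool_efficiency",
    "budget_compliance",
    "critique_agreement",
)


@dataclass(frozen=True)
class DimensionScore:
    """Hypothetical record: one dimension of one test case."""
    case_id: str
    category: str        # "baseline", "ambiguous", or "adversarial"
    dimension: str
    score: float         # assumed to lie in [0.0, 1.0]
    justification: str   # the written explanation stored with the score


def summarize(scores: Iterable[DimensionScore]) -> Dict[Tuple[str, str], float]:
    """Mean score per (category, dimension), in a stable sorted order."""
    buckets: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for s in scores:
        if s.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {s.dimension}")
        buckets[(s.category, s.dimension)].append(s.score)
    return {key: mean(values) for key, values in sorted(buckets.items())}


sample = [
    DimensionScore("b1", "baseline", "correctness", 1.0, "matches expected answer"),
    DimensionScore("b2", "baseline", "correctness", 0.5, "partially correct"),
    DimensionScore("a1", "adversarial", "correctness", 0.0, "accepted a false premise"),
]
summary = summarize(sample)
print(summary)
```

Sorting the keys keeps the summary stable between runs, which helps when two runs are diffed.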


🔄 Self-Improving Loop

  1. Eval harness runs → scores per case + dimension stored in eval_test_cases
  2. Meta Agent identifies worst-performing (agent, dimension) pair
  3. Meta Agent proposes a rewritten prompt with unified diff + justification
  4. Proposal is stored in prompt_proposals with status pending
  5. Human reviews at POST /api/v1/prompts/proposals/{id}/review
  6. On approval: old prompt deactivated, new version created, re-eval can be triggered
  7. Performance delta is recorded on the next POST /api/v1/evals/trigger

Every step is auditable: timestamps, diffs, decisions, and deltas are all queryable.
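
Steps 2 to 6 can be sketched with the standard library alone: pick the lowest-scoring (agent, dimension) pair, render the prompt rewrite as a unified diff with `difflib`, and keep the proposal `pending` until a human decides. The `Proposal` dataclass, the `approved`/`rejected` status names, and the example prompts are hypothetical; only the `pending` status comes from the list above.

```python
import difflib
from dataclasses import dataclass, replace
from typing import Dict, Tuple


def worst_pair(scores: Dict[Tuple[str, str], float]) -> Tuple[str, str]:
    """Return the (agent, dimension) pair with the lowest mean score."""
    return min(scores, key=lambda pair: scores[pair])


def prompt_diff(old: str, new: str, agent: str) -> str:
    """Unified diff between the current and the proposed prompt."""
    return "\n".join(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile=f"{agent}/prompt@current",
            tofile=f"{agent}/prompt@proposed",
            lineterm="",
        )
    )


@dataclass(frozen=True)
class Proposal:
    """Hypothetical shape of one row in prompt_proposals."""
    agent: str
    dimension: str
    diff: str
    justification: str
    status: str = "pending"  # stored, never auto-applied


def review(proposal: Proposal, approve: bool) -> Proposal:
    """Human decision; 'approved'/'rejected' are assumed status names."""
    return replace(proposal, status="approved" if approve else "rejected")


scores = {
    ("retrieval", "citation_accuracy"): 0.4,
    ("synthesis", "correctness"): 0.8,
}
agent, dimension = worst_pair(scores)
old_prompt = "Answer the question.\nUse the retrieved chunks."
new_prompt = "Answer the question.\nCite [chunk_id] for every claim."
proposal = Proposal(
    agent=agent,
    dimension=dimension,
    diff=prompt_diff(old_prompt, new_prompt, agent),
    justification="citation_accuracy was the lowest-scoring dimension",
)
print(proposal.diff)
```

Keeping `Proposal` immutable means a review produces a new record rather than mutating the old one, which suits an audit trail.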


🌐 API Reference

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/query | Submit query → SSE stream of real-time agent activity |
| GET | /api/v1/jobs/{job_id}/trace | Full execution trace for a job |
| GET | /api/v1/evals/latest | Latest eval run summary by category and dimension |
| POST | /api/v1/prompts/proposals/{id}/review | Human approve/reject for a prompt rewrite |
| POST | /api/v1/evals/trigger | Trigger targeted re-eval with latest approved prompts |
| GET | /api/v1/prompts/proposals | List all prompt proposals (UI helper) |

Full interactive docs at http://localhost:8000/docs.
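
Since POST /api/v1/query answers with a Server-Sent Events stream, a client has to split that stream into events. The sketch below parses the standard SSE line format (`event:` and `data:` fields, with a blank line ending each event) from any iterable of text lines. The event names and JSON payloads in the sample are invented for illustration; the actual event schema is described in the interactive docs.

```python
import json
from typing import Dict, Iterable, Iterator, List


def iter_sse_events(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield {'event': ..., 'data': ...} for each complete SSE event."""
    event, data = "message", []  # "message" is the SSE default event type
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            name, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space is not part of the value
            if name == "event":
                event = value
            elif name == "data":
                data.append(value)


# Hypothetical stream contents; real event names and fields may differ.
sample_stream: List[str] = [
    ": keep-alive\n",
    "event: agent_started\n",
    'data: {"agent": "decomposition"}\n',
    "\n",
    'data: {"agent": "retrieval"}\n',
    "\n",
]
events = list(iter_sse_events(sample_stream))
payloads = [json.loads(e["data"]) for e in events]
print(events)
```

With an HTTP client that exposes the response body line by line, the same generator can consume the live stream instead of a list.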

Error responses always include:

{
  "error_code": "MACHINE_READABLE_CODE",
  "message": "Human-readable message",
  "job_id": "uuid-if-applicable"
}
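
A client can rely on this envelope by validating it once at the boundary. The following sketch is one way to do that; `ApiError`, `parse_error`, and the `JOB_NOT_FOUND` code are hypothetical, and it assumes `job_id` is simply absent or null when no job applies.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ApiError:
    """Mirror of the error envelope shown above."""
    error_code: str
    message: str
    job_id: Optional[str] = None  # assumption: null/absent when not applicable

    def to_json(self) -> str:
        return json.dumps(asdict(self))


def parse_error(body: str) -> ApiError:
    """Parse an error response body, rejecting envelopes missing required keys."""
    payload = json.loads(body)
    missing = {"error_code", "message"} - set(payload)
    if missing:
        raise ValueError(f"error response missing fields: {sorted(missing)}")
    return ApiError(payload["error_code"], payload["message"], payload.get("job_id"))


# "JOB_NOT_FOUND" is an invented code used only for this example.
original = ApiError("JOB_NOT_FOUND", "No job with that id", job_id=None)
round_tripped = parse_error(original.to_json())
print(round_tripped)
```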

⚙️ Environment Variables

All configuration is via environment variables. No credentials are hardcoded.

| Variable | Default | Description |
|---|---|---|
| OPENAI_API_KEY | (required) | API key for LLM calls via litellm |
| LLM_MODEL | gpt-4o-mini | Any litellm-compatible model (e.g. claude-3-5-sonnet-20241022) |
| LLM_TEMPERATURE | 0.2 | Sampling temperature |
| DATABASE_URL | postgresql+asyncpg://... | Postgres connection string |
| REDIS_URL | redis://redis:6379/0 | Redis connection string |
| USE_DOCKER_SANDBOX | true | Use Docker for code execution isolation |
| CODE_EXEC_TIMEOUT | 10 | Code execution timeout in seconds |
| POSTGRES_USER | admin | Postgres user |
| POSTGRES_PASSWORD | admin | Postgres password |
| POSTGRES_DB | multiagent_db | Postgres DB name |
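
For reference, the table could be loaded along these lines. This is a sketch under assumptions, not the project's settings module: `Settings` and `load_settings` are hypothetical names, the set of accepted boolean spellings is a guess, and `DATABASE_URL` is treated as optional here because its default is elided in the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    llm_model: str
    llm_temperature: float
    database_url: Optional[str]  # default is elided in the table, so none is assumed
    redis_url: str
    use_docker_sandbox: bool
    code_exec_timeout: int       # seconds


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from the environment, applying the documented defaults."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is required")
    sandbox = env.get("USE_DOCKER_SANDBOX", "true").strip().lower()
    return Settings(
        openai_api_key=api_key,
        llm_model=env.get("LLM_MODEL", "gpt-4o-mini"),
        llm_temperature=float(env.get("LLM_TEMPERATURE", "0.2")),
        database_url=env.get("DATABASE_URL"),
        redis_url=env.get("REDIS_URL", "redis://redis:6379/0"),
        use_docker_sandbox=sandbox in {"1", "true", "yes"},  # assumed spellings
        code_exec_timeout=int(env.get("CODE_EXEC_TIMEOUT", "10")),
    )


# Passing a plain dict keeps the example independent of the real environment.
settings = load_settings({"OPENAI_API_KEY": "test-key", "CODE_EXEC_TIMEOUT": "30"})
print(settings)
```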

⚠️ Known Limitations

  1. LLM JSON reliability: The system depends on LLMs returning valid JSON. parse_json_response() includes fallback extraction, but adversarial LLM outputs can still cause parse failures. Production use would add a structured outputs API (OpenAI response_format: json_schema).

  2. Mock retrieval: The knowledge base is a fixed 10-document corpus with keyword-based retrieval, not a real vector database. Real deployment should use pgvector or Pinecone with embeddings.

  3. Web search stub: The web search tool is deterministic and mocked. It does not make real HTTP calls. Real deployment would integrate Tavily or Brave Search.

  4. Sequential eval runs: The eval harness runs 15 cases sequentially (no parallelism) to avoid overwhelming the LLM API rate limits. This makes full eval runs slow (~5-10 minutes with a real LLM API).

  5. Meta-agent proposals are shallow: The meta-agent proposes one prompt rewrite per eval run, targeting only the worst dimension. Real-world improvement loops would benefit from A/B testing and statistical significance testing.

  6. Context compression is advisory: The compression agent receives context and returns a compressed version, but the system doesn't fully reconstruct state from the compressed text (it adjusts token counters). A production system would need structured context restoration.

  7. Docker-in-Docker sandbox: Code execution requires mounting the Docker socket, which has security implications in shared environments. Use a dedicated network-isolated sandbox service in production.
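
Limitation 1 mentions fallback extraction in parse_json_response(). Its implementation is not shown in this README, so the sketch below is only one plausible approach under a different, hypothetical name: try strict parsing first, then scan for the first decodable JSON object embedded in surrounding prose. It assumes agents are expected to return JSON objects rather than bare arrays or scalars.

```python
import json
from typing import Any


def parse_json_with_fallback(text: str) -> Any:
    """Parse model output as JSON, tolerating prose around a JSON object."""
    try:
        return json.loads(text)  # fast path: the output is clean JSON
    except json.JSONDecodeError:
        pass

    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one value starting at `start` and ignores
            # whatever follows it, such as trailing commentary.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


reply = 'Sure! Here is the routing decision: {"next_agent": "retrieval", "reason": "needs sources"} Hope that helps.'
decision = parse_json_with_fallback(reply)
print(decision)
```

A scan like this still fails on truncated or structurally invalid output, which is why the limitation points toward schema-constrained structured outputs.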


🚀 What I'd Build Next

  • pgvector integration for real semantic retrieval with embedding-based multi-hop
  • Parallel agent execution for independent sub-tasks (async DAG runner)
  • Structured output schemas using OpenAI's response_format: json_schema for zero-parse-failure agents
  • A/B prompt testing framework that runs both old and new prompts on a holdout split before accepting rewrites
  • Agent memory via a persistent vector store for session-level and cross-session context
  • Human-in-the-loop interrupts: allow a human to pause the pipeline mid-run and inject feedback
  • Distributed tracing with OpenTelemetry for cross-service span correlation
  • Adversarial red-teaming module that continuously generates new injection patterns using the LLM itself

🧪 Running Tests

cd backend
pip install -r requirements.txt
pytest tests/ -v
