DonkeyWork CodeSandbox

A unified monorepo containing both the Manager API (Kata container orchestration) and the Executor API (sandboxed code execution with Python, Node.js, and Bash support).

Components

  • Manager API (src/DonkeyWork.CodeSandbox.Manager): REST API service for managing Kata containers in a Kubernetes cluster
  • Executor API (src/DonkeyWork.CodeSandbox.Server): HTTP+SSE server for executing commands inside sandboxed containers
  • Shared Contracts (src/DonkeyWork.CodeSandbox.Contracts): Common models and contracts
  • Client Library (src/DonkeyWork.CodeSandbox.Client): .NET client for consuming the Executor API

Features

  • Create Kata Containers: Dynamically create VM-isolated containers with custom configuration
  • Warm Pool Management: Maintains a pool of pre-warmed containers for instant allocation
  • High Availability: Stateless managers with Kubernetes-native leader election
  • Automatic Cleanup: Idle and expired containers are automatically cleaned up
  • Container Limits: Configurable maximum container limit prevents resource exhaustion
  • List Containers: Retrieve all Kata containers with their status and metadata
  • Get Container Details: Fetch detailed information about a specific container
  • Delete Containers: Remove containers and terminate their associated VMs
  • Health Checks: Built-in health check endpoint for monitoring
  • OpenAPI Documentation: Interactive API documentation via Scalar

High Availability Architecture

The CodeSandbox Manager is designed for high availability with no external dependencies beyond Kubernetes itself:

  • Stateless Managers: Multiple manager instances can run simultaneously with no shared state
  • Kubernetes as Source of Truth: All container state (timestamps, allocation status) is stored as Kubernetes annotations and labels
  • Leader Election: Uses Kubernetes Lease objects for leader election; only the leader performs pool backfill operations
  • Optimistic Locking: Container allocation uses the Kubernetes resourceVersion for conflict detection and automatic retry (see the sketch after this list)
  • Distributed Monitoring: All instances monitor and clean up containers independently
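
A minimal sketch of the optimistic allocation pattern referenced above, using the official Kubernetes C# client. The annotation key "donkeywork/allocated" and the method shape are illustrative assumptions, not the service's actual scheme:

using System.Net;
using k8s;
using k8s.Autorest;

var client = new Kubernetes(KubernetesClientConfiguration.InClusterConfig());
var claimed = await TryAllocateAsync(client, "kata-sandbox-a1b2c3d4", "sandbox-containers");
Console.WriteLine(claimed ? "allocated" : "lost the race; try another warm pod");

// Claim a warm pod by writing an allocation annotation. The key
// "donkeywork/allocated" is hypothetical; the real service defines its own scheme.
async Task<bool> TryAllocateAsync(IKubernetes api, string podName, string ns)
{
    var pod = await api.CoreV1.ReadNamespacedPodAsync(podName, ns);
    pod.Metadata.Annotations ??= new Dictionary<string, string>();
    if (pod.Metadata.Annotations.ContainsKey("donkeywork/allocated"))
        return false; // already claimed by another manager

    pod.Metadata.Annotations["donkeywork/allocated"] = DateTimeOffset.UtcNow.ToString("O");
    try
    {
        // The replace carries the resourceVersion we read; the API server
        // rejects it with 409 Conflict if another manager wrote in between.
        await api.CoreV1.ReplaceNamespacedPodAsync(pod, podName, ns);
        return true;
    }
    catch (HttpOperationException e) when (e.Response.StatusCode == HttpStatusCode.Conflict)
    {
        return false; // conflict detected; caller retries
    }
}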

Leader Election

flowchart TB
    subgraph Managers["Manager Instances"]
        M1["Manager 1<br/>(Leader)"]
        M2["Manager 2<br/>(Follower)"]
        M3["Manager 3<br/>(Follower)"]
    end

    subgraph K8s["Kubernetes"]
        Lease["Lease Object<br/>pool-backfill-leader"]
        Pods["Pod Resources<br/>(annotations/labels)"]
    end

    M1 -->|"Renews lease"| Lease
    M1 -->|"Backfills pool"| Pods
    M2 -.->|"Monitors lease"| Lease
    M3 -.->|"Monitors lease"| Lease
    M2 -->|"Allocates/Cleanup"| Pods
    M3 -->|"Allocates/Cleanup"| Pods

Key behaviors:

  • Only the leader creates new warm pool containers (prevents duplicate creation)
  • All managers can allocate containers and handle cleanup (distributed workload)
  • If the leader fails, another manager automatically takes over within the 15-second lease window (see the sketch below)
  • State survives manager restarts because Kubernetes annotations are the source of truth
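
The official C# client ships Lease-based election helpers in k8s.LeaderElection. A minimal sketch of how the pool-backfill lease could be wired, assuming the lease name from the diagram and the 15-second lease duration; this is an illustration, not the repository's actual wiring:

using k8s;
using k8s.LeaderElection;
using k8s.LeaderElection.ResourceLock;

var client = new Kubernetes(KubernetesClientConfiguration.InClusterConfig());

// One Lease object ("pool-backfill-leader") arbitrates which instance backfills;
// the last argument identifies this manager instance.
var leaseLock = new LeaseLock(client, "sandbox-containers", "pool-backfill-leader",
    Environment.MachineName);

var config = new LeaderElectionConfig(leaseLock)
{
    LeaseDuration = TimeSpan.FromSeconds(15), // matches LeaderLeaseDurationSeconds
    RenewDeadline = TimeSpan.FromSeconds(10),
    RetryPeriod = TimeSpan.FromSeconds(2),
};

var elector = new LeaderElector(config);
elector.OnStartedLeading += () => Console.WriteLine("acquired lease: backfilling warm pool");
elector.OnStoppedLeading += () => Console.WriteLine("lost lease: backfill paused");

await elector.RunAsync(); // campaign; followers block here until the lease frees up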

Architecture

  • Framework: ASP.NET Core 10.0 (Minimal APIs)
  • Kubernetes Client: Official Kubernetes C# client (v18.0.13)
  • Logging: Serilog with structured logging
  • Configuration: IOptions with data validation
  • Container Runtime: Kata Containers (kata-qemu)

System Overview

flowchart TB
    subgraph Client["Client Application"]
        CA[API Consumer]
    end

    subgraph Manager["Manager API :8668"]
        ME["/api/kata endpoints"]
        KCS[KataContainerService]
    end

    subgraph K8s["Kubernetes Cluster"]
        API[Kubernetes API Server]
        subgraph NS["sandbox-containers namespace"]
            subgraph KP1["Kata Pod 1"]
                VM1["Kata VM"]
                EX1["Executor API :8666"]
            end
            subgraph KP2["Kata Pod 2"]
                VM2["Kata VM"]
                EX2["Executor API :8666"]
            end
        end
    end

    CA -->|"REST/SSE"| ME
    ME --> KCS
    KCS -->|"Create/List/Delete Pods"| API
    API --> KP1
    API --> KP2
    KCS -->|"Execute Commands"| EX1
    KCS -->|"Execute Commands"| EX2

Container Creation Flow

sequenceDiagram
    participant Client
    participant Manager as Manager API
    participant K8s as Kubernetes API
    participant Kata as Kata Pod
    participant Executor as Executor API

    Client->>Manager: POST /api/kata
    Manager->>Manager: Generate unique pod name
    Manager->>K8s: Create Pod (kata-qemu runtime)
    K8s-->>Manager: Pod created (Pending)
    Manager-->>Client: SSE: created event

    loop Wait for Pod Ready
        Manager->>K8s: Get Pod status
        K8s-->>Manager: Pod status
        Manager-->>Client: SSE: waiting event
    end

    K8s->>Kata: Start Kata VM
    Kata->>Kata: Boot VM + Start Executor API

    loop Health Check Executor API
        Manager->>Executor: GET /healthz
        alt Healthy
            Executor-->>Manager: 200 OK
            Manager-->>Client: SSE: healthcheck (healthy)
        else Not Ready
            Executor-->>Manager: Connection refused / timeout
            Manager-->>Client: SSE: healthcheck (unhealthy)
            Manager-->>Client: SSE: waiting event
        end
    end

    Manager-->>Client: SSE: ready event
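
A condensed sketch of the readiness-polling loop above, assuming the executor port (8666) and /healthz path shown in the diagrams; the real service additionally emits the SSE waiting/healthcheck events:

using var http = new HttpClient { Timeout = TimeSpan.FromSeconds(2) };

var ready = await WaitForExecutorAsync("10.42.1.15", TimeSpan.FromSeconds(90)); // PodReadyTimeoutSeconds
Console.WriteLine(ready ? "ready" : "timed out");

async Task<bool> WaitForExecutorAsync(string podIp, TimeSpan timeout)
{
    var deadline = DateTimeOffset.UtcNow + timeout;
    while (DateTimeOffset.UtcNow < deadline)
    {
        try
        {
            // The Executor API listens on :8666 inside the Kata pod.
            var res = await http.GetAsync($"http://{podIp}:8666/healthz");
            if (res.IsSuccessStatusCode) return true; // emit the "ready" SSE event
        }
        catch (HttpRequestException) { /* connection refused: VM still booting */ }
        catch (TaskCanceledException) { /* request timed out */ }

        await Task.Delay(TimeSpan.FromSeconds(1)); // emit a "waiting" SSE event
    }
    return false; // PodReadyTimeoutSeconds exceeded
}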

Command Execution Flow

sequenceDiagram
    participant Client
    participant Manager as Manager API
    participant Executor as Executor API (in Kata Pod)
    participant Process as Bash Process

    Client->>Manager: POST /api/kata/{sandboxId}/execute
    Manager->>Manager: Lookup Pod IP
    Manager->>Executor: POST /api/execute (SSE)

    Executor->>Process: Spawn /bin/bash -c "command"

    loop Stream Output
        Process-->>Executor: stdout/stderr
        Executor-->>Manager: SSE: OutputEvent
        Manager-->>Client: SSE: OutputEvent
    end

    Process-->>Executor: Exit code
    Executor-->>Manager: SSE: CompletedEvent
    Manager-->>Client: SSE: CompletedEvent
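
A condensed sketch of the executor-side streaming pattern, not the actual ManagedProcess implementation: spawn /bin/bash -c and forward stdout/stderr lines as they arrive. The command string is illustrative; the real executor receives it from the ExecuteCommand request and wraps the output in SSE frames:

using System.Diagnostics;

var psi = new ProcessStartInfo("/bin/bash", "-c \"echo hello; sleep 1; echo done\"")
{
    RedirectStandardOutput = true,
    RedirectStandardError = true,
    UseShellExecute = false,
};

using var process = Process.Start(psi)!;
process.OutputDataReceived += (_, e) => { if (e.Data is not null) Console.WriteLine($"stdout: {e.Data}"); }; // -> OutputEvent
process.ErrorDataReceived += (_, e) => { if (e.Data is not null) Console.WriteLine($"stderr: {e.Data}"); };  // -> OutputEvent
process.BeginOutputReadLine();
process.BeginErrorReadLine();

await process.WaitForExitAsync();
Console.WriteLine($"exit code: {process.ExitCode}"); // -> CompletedEvent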

Component Interaction

flowchart LR
    subgraph Manager["Manager API"]
        direction TB
        EP[Endpoints]
        SVC[KataContainerService]
        CFG[Configuration]
    end

    subgraph Executor["Executor API"]
        direction TB
        CTRL[ExecutionController]
        MP[ManagedProcess]
    end

    subgraph Shared["Contracts"]
        EE[ExecutionEvent]
        OE[OutputEvent]
        CE[CompletedEvent]
    end

    EP --> SVC
    SVC --> CFG
    SVC -.->|HTTP/SSE| CTRL
    CTRL --> MP
    MP --> OE
    MP --> CE
    OE --> EE
    CE --> EE

Prerequisites

  1. Kubernetes Cluster: k3s v1.33.5+ with Kata Containers enabled
  2. Namespace: sandbox-containers namespace must exist
  3. RBAC: ServiceAccount with appropriate permissions (see k8s/ folder)
  4. Runtime Class: kata-qemu RuntimeClass configured in the cluster

Configuration

The service is configured via appsettings.json:

{
  "KataContainerManager": {
    "TargetNamespace": "sandbox-containers",
    "RuntimeClassName": "kata-qemu",
    "DefaultResourceRequests": {
      "MemoryMi": 128,
      "CpuMillicores": 250
    },
    "DefaultResourceLimits": {
      "MemoryMi": 512,
      "CpuMillicores": 1000
    },
    "PodNamePrefix": "kata-sandbox",
    "CleanupCompletedPods": true,
    "PodReadyTimeoutSeconds": 90,
    "IdleTimeoutMinutes": 5,
    "MaxContainerLifetimeMinutes": 15,
    "MaxTotalContainers": 50,
    "WarmPoolSize": 10,
    "PoolBackfillCheckIntervalSeconds": 30,
    "LeaderLeaseDurationSeconds": 15
  }
}
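
A sketch of how such a section is typically bound and validated at startup with the IOptions pattern; the options type below is abridged and its property set illustrative:

using System.ComponentModel.DataAnnotations;

var builder = WebApplication.CreateBuilder(args);

// Bind the "KataContainerManager" section, validate the data annotations,
// and fail fast at startup if any value is missing or out of range.
builder.Services.AddOptions<KataContainerManagerOptions>()
    .BindConfiguration(KataContainerManagerOptions.SectionName)
    .ValidateDataAnnotations()
    .ValidateOnStart();

var app = builder.Build();
app.Run();

// Abridged, illustrative options type; the real configuration class carries
// the full property set listed in the table below.
public class KataContainerManagerOptions
{
    public const string SectionName = "KataContainerManager";

    [Required]
    public string TargetNamespace { get; set; } = "sandbox-containers";

    [Range(30, 300)]
    public int PodReadyTimeoutSeconds { get; set; } = 90;

    [Range(0, 100)]
    public int WarmPoolSize { get; set; } = 10;
}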

Configuration Options

| Option | Default | Range | Description |
| --- | --- | --- | --- |
| TargetNamespace | sandbox-containers | - | Kubernetes namespace for containers |
| RuntimeClassName | kata-qemu | - | Runtime class for Kata isolation |
| PodNamePrefix | kata-sandbox | - | Prefix for generated pod names |
| PodReadyTimeoutSeconds | 90 | 30-300 | Timeout waiting for pods to become ready |
| IdleTimeoutMinutes | 5 | 1-1440 | Delete allocated containers after this idle time |
| MaxContainerLifetimeMinutes | 15 | 1-1440 | Maximum lifetime for any allocated container |
| MaxTotalContainers | 50 | 1-500 | Maximum total containers (warm + allocated + manual) |
| WarmPoolSize | 10 | 0-100 | Target number of pre-warmed containers |
| PoolBackfillCheckIntervalSeconds | 30 | 10-300 | How often to check and backfill the pool |
| LeaderLeaseDurationSeconds | 15 | 5-60 | Leader election lease duration |
| CleanupCheckIntervalMinutes | 1 | 1-60 | How often to check for idle/expired containers |

API Endpoints

All endpoints are prefixed with /api/kata to support future multi-runtime capabilities (Kata, gVisor, etc.).

POST /api/kata

Create a new Kata container. The container image is fixed to the configured default executor image for security.

Request Body:

{
  "labels": {
    "environment": "sandbox",
    "project": "test"
  },
  "environmentVariables": {
    "KEY": "value"
  },
  "resources": {
    "requests": {
      "memoryMi": 256,
      "cpuMillicores": 500
    },
    "limits": {
      "memoryMi": 1024,
      "cpuMillicores": 2000
    }
  },
  "waitForReady": true
}

Response: Server-Sent Events (SSE) stream with creation progress:

data: {"eventType":"created","podName":"kata-sandbox-a1b2c3d4","phase":"Pending"}

data: {"eventType":"waiting","podName":"kata-sandbox-a1b2c3d4","attemptNumber":1,"phase":"Pending","message":"Waiting for pod to be ready"}

data: {"eventType":"ready","podName":"kata-sandbox-a1b2c3d4","containerInfo":{...},"elapsedSeconds":15.2}
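
A minimal client-side sketch of consuming this stream with plain HttpClient; the DonkeyWork.CodeSandbox.Client library wraps this, and the base address and request payload here are illustrative:

using System.Text;
using System.Text.Json;

using var http = new HttpClient { BaseAddress = new Uri("http://localhost:8668") };

var request = new HttpRequestMessage(HttpMethod.Post, "/api/kata")
{
    Content = new StringContent(
        JsonSerializer.Serialize(new { waitForReady = true }),
        Encoding.UTF8,
        "application/json"),
};

// ResponseHeadersRead streams SSE events as they arrive instead of
// buffering the whole response body.
using var response = await http.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
response.EnsureSuccessStatusCode();

using var reader = new StreamReader(await response.Content.ReadAsStreamAsync());
while (await reader.ReadLineAsync() is { } line)
{
    if (!line.StartsWith("data: ")) continue; // skip blank separator lines

    using var doc = JsonDocument.Parse(line["data: ".Length..]);
    var eventType = doc.RootElement.GetProperty("eventType").GetString();
    Console.WriteLine($"event: {eventType}");
    if (eventType == "ready") break; // the container is now usable
}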

GET /api/kata

List all Kata containers.

Response: 200 OK

[
  {
    "name": "kata-sandbox-a1b2c3d4",
    "phase": "Running",
    "isReady": true,
    "createdAt": "2026-01-13T10:30:00Z",
    "nodeName": "office1",
    "podIP": "10.42.1.15",
    "labels": {
      "app": "kata-manager",
      "runtime": "kata"
    },
    "image": "nginx:alpine"
  }
]

GET /api/kata/{podName}

Get details of a specific container.

Response: 200 OK (same structure as list response)

DELETE /api/kata/{podName}

Delete a Kata container.

Response: 200 OK

{
  "success": true,
  "message": "Container kata-sandbox-a1b2c3d4 deleted successfully",
  "podName": "kata-sandbox-a1b2c3d4"
}

GET /healthz

Health check endpoint.

Response: 200 OK (Healthy) or 503 Service Unavailable (Unhealthy)

Development

Project Structure

DonkeyWork-CodeSandbox-Manager/
├── src/
│   ├── DonkeyWork.CodeSandbox.Manager/      # Manager API (container orchestration)
│   │   ├── Configuration/
│   │   │   └── KataContainerManager.cs      # Configuration models with validation
│   │   ├── Endpoints/
│   │   │   └── KataContainerEndpoints.cs    # Minimal API endpoints (/api/kata)
│   │   ├── Models/
│   │   │   ├── CreateContainerRequest.cs    # Request DTOs
│   │   │   ├── KataContainerInfo.cs         # Response DTOs
│   │   │   └── DeleteContainerResponse.cs
│   │   ├── Services/
│   │   │   ├── IKataContainerService.cs     # Service interface
│   │   │   └── KataContainerService.cs      # Kubernetes operations
│   │   ├── Program.cs                       # Application entry point
│   │   └── appsettings.json                 # Configuration
│   ├── DonkeyWork.CodeSandbox.Server/       # Executor API (code execution)
│   │   ├── Controllers/
│   │   │   └── ExecutionController.cs       # /api/execute endpoint
│   │   ├── Services/
│   │   │   └── ManagedProcess.cs            # Process management with streaming
│   │   └── Program.cs
│   ├── DonkeyWork.CodeSandbox.Contracts/    # Shared models
│   │   ├── Events/
│   │   │   └── ExecutionEvent.cs            # OutputEvent, CompletedEvent
│   │   └── Requests/
│   │       └── ExecuteCommand.cs
│   └── DonkeyWork.CodeSandbox.Client/       # .NET client library
├── test/
│   ├── DonkeyWork.CodeSandbox.Manager.Tests/
│   └── DonkeyWork.CodeSandbox.Server.IntegrationTests/
├── Dockerfile                               # Manager API container
├── docker-compose.yml                       # Local development setup
└── .github/workflows/
    ├── pr-build-test.yml                    # PR validation workflow
    └── release.yml                          # Release automation workflow

Key Design Decisions

  1. Minimal APIs: Uses ASP.NET Core minimal APIs for simpler, more performant endpoints (see the sketch after this list)
  2. IOptions Pattern: Configuration is validated at startup using data annotations
  3. In-Cluster Auth: Automatically uses ServiceAccount tokens when running in Kubernetes
  4. Scoped Services: KataContainerService is scoped to match request lifetime
  5. Structured Logging: Serilog provides structured logging with context
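
A compressed sketch of how decisions 1, 4, and 5 compose in Program.cs; the IKataContainerService members shown are illustrative stand-ins, not the repository's actual interface:

using Serilog;

var builder = WebApplication.CreateBuilder(args);

// Decision 5: structured logging via Serilog, with sink setup read from
// configuration (Serilog.Settings.Configuration).
builder.Host.UseSerilog((ctx, cfg) => cfg.ReadFrom.Configuration(ctx.Configuration));

// Decision 4: the Kubernetes-facing service is scoped to the request.
builder.Services.AddScoped<IKataContainerService, KataContainerService>();

var app = builder.Build();

// Decision 1: a minimal API endpoint delegating straight to the service.
app.MapGet("/api/kata", (IKataContainerService svc, CancellationToken ct) => svc.ListAsync(ct));

app.Run();

// Illustrative stand-ins for the repository's actual types.
public interface IKataContainerService
{
    Task<IReadOnlyList<string>> ListAsync(CancellationToken ct);
}

public class KataContainerService : IKataContainerService
{
    public Task<IReadOnlyList<string>> ListAsync(CancellationToken ct)
        => Task.FromResult<IReadOnlyList<string>>(new[] { "kata-sandbox-a1b2c3d4" });
}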

Troubleshooting

Container fails to create

  • Verify the image exists and is accessible
  • Check that the sandbox-containers namespace exists
  • Ensure the RuntimeClass kata-qemu is configured
  • Check RBAC permissions for the ServiceAccount

Permission denied errors

  • Verify Role and RoleBinding are correctly applied
  • Ensure ServiceAccount is attached to the pod
  • Check that the service can reach the Kubernetes API server

Pods stuck in Pending

  • Check cluster node capacity
  • Verify Kata is installed on nodes
  • Check pod events: kubectl describe pod <pod-name> -n sandbox-containers

Health check failures

  • Verify the application is running: kubectl logs <pod-name>
  • Check if the service can connect to Kubernetes API
  • Review configuration validation errors in logs

Security Considerations

  1. Fixed Container Image: Only the configured default executor image can be used (no arbitrary images)
  2. Least Privilege: Role only grants permissions in sandbox-containers namespace
  3. Non-Root Container: Dockerfile creates and runs as non-root user
  4. Resource Limits: All containers have resource limits to prevent exhaustion
  5. VM Isolation: Kata containers provide hardware-level isolation

Performance

  • Pod Creation: 12-25 seconds (includes VM boot time)
  • Overhead: +160Mi RAM, +250m CPU per Kata container
  • Recommended Rate: 5-10 Kata pods per minute
  • Cluster Capacity: 4 nodes, each supporting multiple Kata VMs
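
At the default MaxTotalContainers of 50, for example, Kata overhead alone comes to roughly 50 × 160Mi ≈ 8Gi of RAM and 50 × 250m = 12.5 CPU cores across the cluster, before workload requests are counted.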

License

MIT
