1 change: 1 addition & 0 deletions docs/index.md
@@ -37,6 +37,7 @@ specific thing (drain a queue, recover from a stuck row, etc.).
| Doc | Purpose |
|---|---|
| [cascade_runbook.md](cascade_runbook.md) | Cascade subsystem ops: drain queue, recover stuck rows |
| [multimodal.md](multimodal.md) | Ingest images, PDFs, audio, and office docs into memory |

## Engineering / Internal

318 changes: 318 additions & 0 deletions docs/multimodal.md
@@ -0,0 +1,318 @@
# Multimodal Memory

EverOS turns non-text content (images, PDFs, audio, office documents,
HTML, email) into the **same structured, searchable memory** as plain
text. You attach the asset to a message at ingest time; a vision- or
audio-capable LLM parses it into text, and from there it flows through the
same extraction → markdown → index pipeline as any text turn. The
result is fully retrievable with the same `/search` stack.

## Table of contents

- [How it works](#how-it-works)
- [Prerequisites](#prerequisites)
  - [Install the extra](#install-the-extra)
  - [LibreOffice (office documents only)](#libreoffice-office-documents-only)
  - [Configure the multimodal LLM](#configure-the-multimodal-llm)
- [Supported modalities](#supported-modalities)
- [Sending multimodal content](#sending-multimodal-content)
  - [Payload: `uri` vs `base64`](#payload-uri-vs-base64)
  - [Example: image by URL](#example-image-by-url)
  - [Example: mixed text + image in one turn](#example-mixed-text--image-in-one-turn)
  - [Example: inline PDF via base64](#example-inline-pdf-via-base64)
  - [Example: local file via `file://`](#example-local-file-via-file)
  - [Calling from Python (plain HTTP)](#calling-from-python-plain-http)
- [Configuration reference](#configuration-reference)
- [Errors and limits](#errors-and-limits)
- [Searching multimodal memory](#searching-multimodal-memory)

## How it works

```
POST /api/v1/memory/add
  messages[].content = [ ContentItem, ContentItem, ... ]
        │
        │  text items      → used verbatim
        │  non-text items  → multimodal LLM (everalgo-parser)
        ▼
  parsed text merged back into the session buffer (in original order)
        │
        ▼
  boundary detector → extraction LLM → MemCell
        │
        ▼
  markdown (truth) + SQLite (state) + LanceDB (vector + BM25)
        │
        ▼
  retrievable via /search and /get like any text memory
```

Each **non-text** `ContentItem` is routed through the parser, which calls
a separate, vision/audio capable LLM (configured independently from the
main extraction `[llm]`, so parsing can target a multimodal endpoint
without changing boundary or extraction behaviour). Visual/audio formats
(image / pdf / audio / office) always go through that LLM; a few
text-bearing formats can be parsed without it (e.g. a plain email with no
inline images). The parser returns text; that text takes the place of the
asset in the message buffer. Nothing downstream of the parser
knows or cares that the content originated as an image or PDF, and the raw
bytes are **not** persisted past extraction (the episode and memory cell
store only the parsed text).
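
The substitution described above can be sketched in a few lines. This is an illustrative model only, not EverOS's implementation: `merge_content`, `fake_parser`, and the newline join are hypothetical, and the real parser calls a multimodal LLM rather than a local function.

```python
from typing import Callable


def merge_content(items: list[dict], parse_asset: Callable[[dict], str]) -> str:
    """Replace each non-text item with its parsed text, preserving order."""
    parts = []
    for item in items:
        if item["type"] == "text":
            parts.append(item["text"])       # text items are used verbatim
        else:
            parts.append(parse_asset(item))  # stand-in for the multimodal LLM call
    return "\n".join(parts)


def fake_parser(item: dict) -> str:
    # Hypothetical stand-in: a real parser would describe the asset's contents.
    return f"[parsed {item['type']} from {item['uri']}]"


content = [
    {"type": "text", "text": "Here's the whiteboard."},
    {"type": "image", "uri": "https://example.com/whiteboard.png"},
]
merged = merge_content(content, fake_parser)
print(merged)
```

The point of the sketch is the ordering guarantee: parsed text lands in the slot the asset occupied, so surrounding text keeps its context.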

## Prerequisites

### Install the extra

Multimodal parsing lives behind an optional dependency group so the base
install stays lean:

```bash
uv pip install 'everos[multimodal]' # or: pip install 'everos[multimodal]'
```

This pulls in `everalgo-parser[svg]`; the `[svg]` bundle adds `cairosvg`
so SVG works out of the box.

### LibreOffice (office documents only)

Office formats (`.doc` / `.docx` / `.ppt` / `.pptx` / `.xls` / `.xlsx`)
are converted to PDF before being fed to the multimodal LLM. The parser
shells out to `soffice`, LibreOffice's headless renderer, so LibreOffice
must be present on the **server** host:

```bash
brew install --cask libreoffice # macOS
sudo apt-get install -y libreoffice # Debian / Ubuntu
```

Without LibreOffice, **office uploads return `415`** with a clear error;
image / PDF / audio / HTML / email parsing is unaffected.

### Configure the multimodal LLM

The parser uses its own LLM section, independent from `[llm]`. The model
must accept OpenAI `image_url` parts. `everos init` writes these into the
generated `.env`:

```bash
EVEROS_MULTIMODAL__MODEL=google/gemini-3-flash-preview
EVEROS_MULTIMODAL__API_KEY=<your key>
EVEROS_MULTIMODAL__BASE_URL=https://openrouter.ai/api/v1
```

The default targets Gemini via OpenRouter so a single key covers both
chat extraction and multimodal parsing. See
[Configuration reference](#configuration-reference) for the full list.

## Supported modalities

| `type` | Typical formats | Payload | Notes |
|---|---|---|---|
| `text` | - | `text` | Plain text; the string shorthand also maps here |
| `image` | PNG / JPG / GIF / WebP / SVG | `uri` or `base64` | SVG via the bundled `cairosvg` |
| `pdf` | PDF | `uri` or `base64` | - |
| `audio` | MP3 / WAV / … | `uri` or `base64` | Endpoint must accept audio parts |
| `doc` | DOC / DOCX / PPT / PPTX / XLS / XLSX | `uri` or `base64` | **Requires LibreOffice** (converted to PDF first) |
| `html` | HTML | `uri` or `base64` | To inline HTML as plain text instead, send it as `type: "text"` |
| `email` | EML / MSG | `uri` or `base64` | - |

A **non-text** item must carry a fetchable/decodable payload (`uri` or
`base64`). A non-text item that only carries `text` returns `415`, because
the parser has nothing to parse.

## Sending multimodal content

Multimodal input is a `content` **array** of `ContentItem` objects on a
[MessageItem](api.md#messageitem). A bare string `content` is shorthand
for a single text item; switch to the array form when you mix text with
non-text assets. Field-level rules are in
[api.md → ContentItem](api.md#contentitem); the essentials:

| Field | Purpose |
|---|---|
| `type` | One of the modalities above |
| `text` | The literal text; **only** for `type: "text"` |
| `uri` | `http(s)://` (fetched server-side) or `file://` (read from the server fs) |
| `base64` | Inline payload, plain base64 (no `data:` prefix) |
| `ext` | Extension hint (`"pdf"`, `"png"`, …); effectively required for `base64` |
| `name` | Display filename for logs |

Carry the payload in exactly **one** of `text` / `uri` / `base64`.
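
A client-side pre-flight check for these rules might look like the sketch below. `check_content_item` is a hypothetical helper, not part of EverOS; the server remains the authority, and the `ext` check here is deliberately stricter than the documented behaviour (the server may still succeed through MIME inference).

```python
NON_TEXT_TYPES = {"image", "pdf", "audio", "doc", "html", "email"}
PAYLOAD_FIELDS = ("text", "uri", "base64")


def check_content_item(item: dict) -> list[str]:
    """Return a list of problems; an empty list means the item looks well-formed."""
    problems = []
    present = [f for f in PAYLOAD_FIELDS if item.get(f)]
    if len(present) != 1:
        problems.append(f"expected exactly one of text/uri/base64, got {present}")
    kind = item.get("type")
    if kind == "text" and "text" not in present:
        problems.append("text item must carry 'text'")
    if kind in NON_TEXT_TYPES and "text" in present:
        problems.append("non-text item must carry 'uri' or 'base64', not 'text'")
    if "base64" in present and not item.get("ext"):
        problems.append("base64 payload should include an 'ext' hint")
    return problems


good = check_content_item({"type": "image", "uri": "https://example.com/whiteboard.png"})
text_only_pdf = check_content_item({"type": "pdf", "text": "hello"})
no_ext = check_content_item({"type": "pdf", "base64": "JVBERi0xLjQK"})
print(good, text_only_pdf, no_ext, sep="\n")
```

Catching these cases before sending is worthwhile because, per the error table further down, a single malformed item aborts the whole `/add` batch.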

### Payload: `uri` vs `base64`

| | `uri` (`http(s)://`) | `base64` |
|---|---|---|
| Where the bytes live | Fetched transiently at parse time | Held verbatim in the SQLite session buffer until flush |
| Wire size | URL only | ~4/3× the raw size (base64 inflation) |
| Best for | Large assets, S3/OSS presigned URLs | Small assets, or when no reachable URL exists |

**Prefer `uri` for anything large.** A multi-MB base64 blob becomes
multi-MB of SQLite buffer text for the buffer's lifetime and slows
request parsing. The bytes are never persisted past extraction either
way; only the parsed text is.
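
To make the 4/3 inflation concrete, the snippet below compares a raw payload with its encoded form. It uses only the standard library; the 5 MiB size is an arbitrary illustration.

```python
import base64
import math


def base64_len(raw_bytes: int) -> int:
    """Length of standard, padded base64 for a payload of raw_bytes bytes."""
    return 4 * math.ceil(raw_bytes / 3)


raw = bytes(5 * 1024 * 1024)      # a 5 MiB asset (all zero bytes)
encoded = base64.b64encode(raw)   # what would travel on the wire
ratio = len(encoded) / len(raw)
print(len(raw), len(encoded), round(ratio, 3))
```

Every 3 raw bytes become 4 base64 characters, so the request body (and the buffered copy) grows by roughly a third before any JSON overhead.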

### Example: image by URL

```bash
TS=$(($(date +%s) * 1000))   # v1 contract: timestamp in ms
curl -X POST http://127.0.0.1:8000/api/v1/memory/add \
  -H 'Content-Type: application/json' \
  -d "{
    \"session_id\": \"mm-001\",
    \"messages\": [
      {
        \"sender_id\": \"alice\",
        \"role\": \"user\",
        \"timestamp\": $TS,
        \"content\": [
          { \"type\": \"image\", \"uri\": \"https://example.com/whiteboard.png\" }
        ]
      }
    ]
  }"
```

### Example: mixed text + image in one turn

```json
{
  "session_id": "mm-001",
  "messages": [
    {
      "sender_id": "alice",
      "role": "user",
      "timestamp": 1748390400000,
      "content": [
        { "type": "text", "text": "Here's the whiteboard from today's planning session." },
        { "type": "image", "uri": "https://example.com/whiteboard.png", "name": "whiteboard.png" }
      ]
    }
  ]
}
```

### Example: inline PDF via base64

```json
{
  "session_id": "mm-001",
  "messages": [
    {
      "sender_id": "alice",
      "role": "user",
      "timestamp": 1748390400000,
      "content": [
        { "type": "text", "text": "Quarterly report attached." },
        { "type": "pdf", "base64": "JVBERi0xLjQK...", "ext": "pdf", "name": "q3.pdf" }
      ]
    }
  ]
}
```

`ext` is effectively **required** for `base64` payloads because it drives
modality dispatch. Without it, the server falls back to MIME inference and
returns `415` if that cannot resolve a modality either.
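
Building such an item from bytes in Python could look like the following sketch. `base64_item` is a hypothetical helper; deriving `ext` from the filename suffix is an assumption about what makes a sensible hint, not a documented rule.

```python
import base64
from pathlib import PurePath


def base64_item(data: bytes, filename: str, modality: str) -> dict:
    """Wrap raw bytes as a ContentItem-shaped dict with an explicit ext hint."""
    ext = PurePath(filename).suffix.lstrip(".").lower()
    return {
        "type": modality,
        "base64": base64.b64encode(data).decode("ascii"),  # plain base64, no data: prefix
        "ext": ext,
        "name": filename,
    }


# The first bytes of a PDF; a real call would pass the whole file's contents.
item = base64_item(b"%PDF-1.4\n", "q3.pdf", "pdf")
print(item)
```

Note that `b64encode` returns bytes, so the `.decode("ascii")` step is what makes the value JSON-serializable.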

### Example: local file via `file://`

A `file://` URI is read from the **server's** local filesystem (the path
must be reachable by the server process), guardrailed by size and an
optional allowlist:

```json
{ "type": "pdf", "uri": "file:///srv/uploads/q3.pdf" }
```

Guardrails (a violation surfaces as `415`):

- the resolved path (symlinks followed) must be an existing regular file;
- size ≤ `EVEROS_MULTIMODAL__FILE_URI_MAX_BYTES` (default 50 MiB);
- if `EVEROS_MULTIMODAL__FILE_URI_ALLOW_DIRS` is set, the path must lie
  within one of the listed roots. When it is unset, any readable file is
  allowed (the local-first default), so confine this when exposing the API
  beyond loopback.
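
The three checks above can be modelled as follows. This is a sketch of the documented rules, not the server's actual code: `check_file_uri_path` and its return strings are hypothetical, and it takes a filesystem path rather than parsing the `file://` URI.

```python
import tempfile
from pathlib import Path
from typing import Optional

MAX_BYTES = 50 * 1024 * 1024  # mirrors the documented 50 MiB default


def check_file_uri_path(
    path: str, allow_dirs: list[str], max_bytes: int = MAX_BYTES
) -> Optional[str]:
    """Return a rejection reason, or None if the path passes all three checks."""
    resolved = Path(path).resolve()  # follows symlinks
    if not resolved.is_file():
        return "not an existing regular file"
    if resolved.stat().st_size > max_bytes:
        return "file too large"
    # An empty allowlist means "any readable file", matching the unset default.
    if allow_dirs and not any(
        resolved.is_relative_to(Path(d).resolve()) for d in allow_dirs
    ):
        return "outside allowlisted directories"
    return None


with tempfile.TemporaryDirectory() as root, tempfile.TemporaryDirectory() as other:
    inside = Path(root) / "q3.pdf"
    inside.write_bytes(b"%PDF-1.4\n")
    ok = check_file_uri_path(str(inside), [root])
    missing = check_file_uri_path(str(Path(root) / "nope.pdf"), [root])
    too_big = check_file_uri_path(str(inside), [root], max_bytes=4)
    outside = check_file_uri_path(str(inside), [other])
    unrestricted = check_file_uri_path(str(inside), [])

print(ok, missing, too_big, outside, unrestricted, sep="\n")
```

Resolving the path before the allowlist comparison matters: a symlink inside an allowed directory that points elsewhere is judged by its target, not by where the link lives.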

### Calling from Python (plain HTTP)

There is no EverOS Python client; call the HTTP API directly with any
HTTP library:

```python
import httpx

# Same payload shape as the JSON examples above, sent with httpx.
httpx.post(
    "http://127.0.0.1:8000/api/v1/memory/add",
    json={
        "session_id": "mm-001",
        "messages": [
            {
                "sender_id": "alice",
                "role": "user",
                "timestamp": 1748390400000,
                "content": [
                    {"type": "text", "text": "Here's the whiteboard from today's meeting."},
                    {"type": "image", "uri": "https://example.com/whiteboard.png"},
                ],
            }
        ],
    },
)
```

## Configuration reference

All fields bind from the environment via the parent `Settings`
(`EVEROS_MULTIMODAL__<FIELD>`) or the `[multimodal]` TOML section.

| Env var | Default | Meaning |
|---|---|---|
| `EVEROS_MULTIMODAL__MODEL` | `google/gemini-3-flash-preview` | Parsing model; must accept `image_url` parts |
| `EVEROS_MULTIMODAL__API_KEY` | - | API key for the multimodal endpoint |
| `EVEROS_MULTIMODAL__BASE_URL` | `https://openrouter.ai/api/v1` | OpenAI-compatible base URL |
| `EVEROS_MULTIMODAL__MAX_CONCURRENCY` | `4` | Cap on parallel multimodal calls within one extraction |
| `EVEROS_MULTIMODAL__FILE_URI_MAX_BYTES` | `52428800` (50 MiB) | Max size of a `file://` asset |
| `EVEROS_MULTIMODAL__FILE_URI_ALLOW_DIRS` | `[]` (any) | JSON list of allowlisted base dirs for `file://` URIs |
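
Because `FILE_URI_ALLOW_DIRS` is a JSON list and `FILE_URI_MAX_BYTES` is a raw byte count, it can help to generate both values rather than type them by hand. The directories below are hypothetical, and setting `os.environ` only affects this process and its children, so in practice the values would go into the `.env` file or the server's launch environment.

```python
import json
import os

allow_dirs = ["/srv/uploads", "/data/inbox"]  # hypothetical directories
allow_dirs_value = json.dumps(allow_dirs)     # the JSON list the setting expects
max_bytes_value = str(50 * 1024 * 1024)       # 50 MiB expressed in bytes

os.environ["EVEROS_MULTIMODAL__FILE_URI_ALLOW_DIRS"] = allow_dirs_value
os.environ["EVEROS_MULTIMODAL__FILE_URI_MAX_BYTES"] = max_bytes_value

print(allow_dirs_value)
print(max_bytes_value)
```

When writing the list into a shell or `.env` file directly, quote it so the brackets and inner double quotes survive intact.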

## Errors and limits

Two failure classes behave differently. **Deterministic** problems
(nothing to parse, no handler, missing system dependency) **abort the
whole `/add` batch with `415`**. A **transient** multimodal-LLM failure
(timeout, rate-limit, the model rejecting the asset) **degrades just that
item**: the request still returns `200`, the item is marked
`parse_status="failed"` and contributes no text, and the rest of the
batch extracts normally.

| Condition | Result |
|---|---|
| Non-text item carries only `text` (no `uri` / `base64`) | `415` (batch aborted) |
| Extension / modality the parser has no handler for | `415` (batch aborted) |
| `base64` without a resolvable `ext` / MIME to dispatch on | `415` (batch aborted) |
| Office document but no LibreOffice (`soffice`) on host | `415` (batch aborted) |
| `file://` fails a guardrail (missing / non-regular / too large / outside allowlist) | `415` (batch aborted) |
| Multimodal **LLM call** fails (timeout / rate-limit / model rejects the asset) | **`200`**; that item is skipped (`parse_status="failed"`), the rest of the batch still extracts |

The `415` body uses the standard error envelope with the parse-failure
reason in `error.message`; see
[api.md → POST /add](api.md#post-apiv1memoryadd).
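
A caller might branch on the two failure classes roughly as follows. `summarize_add_response` is a hypothetical helper, and the sample error body is invented for illustration; only the `error.message` field comes from the text above. Where `parse_status` appears in a `200` response is not specified here, so the sketch does not try to read it.

```python
def summarize_add_response(status_code: int, body: dict) -> str:
    """Turn an /add response into a one-line summary for logs."""
    if status_code == 415:
        # Deterministic failure: the whole batch was rejected.
        reason = body.get("error", {}).get("message", "unknown reason")
        return f"batch aborted: {reason}"
    if status_code == 200:
        # Accepted; individual items may still have failed to parse.
        return "accepted"
    return f"unexpected status {status_code}"


# Invented example body, shaped after the documented error.message field.
rejected = summarize_add_response(415, {"error": {"message": "no handler for ext 'xyz'"}})
accepted = summarize_add_response(200, {})
print(rejected)
print(accepted)
```

Since a `415` is deterministic, retrying the same payload unchanged will fail again; fix the offending item first.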

## Searching multimodal memory

Nothing special is required. Because parsed text is folded into the same
episodes and memory cells as text turns, every retrieval method works
across multimodal-derived memory unchanged:

```bash
curl -X POST http://127.0.0.1:8000/api/v1/memory/search \
  -H 'Content-Type: application/json' \
  -d '{
    "user_id": "alice",
    "query": "whiteboard from the planning session",
    "method": "hybrid"
  }'
```

`keyword`, `vector`, `hybrid` (default), and `agentic` all apply; see
[api.md → SearchMethod](api.md#searchmethod).