diff --git a/examples/README.md b/examples/README.md index cb92870e..2f80dc5f 100644 --- a/examples/README.md +++ b/examples/README.md @@ -18,6 +18,7 @@ service keys. | [Build a multimodal wine recommender with OCR](./wine-recommender) | Combining preference-based retrieval with OCR-driven label detection in one UI | `encode`, `score`, `extract` | Docker Compose app plus local SIE endpoint; API key optional for unauthenticated SIE | Runnable demo | | [Build a multi-modal product classifier with embeddings](./taxonomy-classification) | Evaluating text, image, NLI, and reranking approaches for hierarchical product taxonomy classification | `extract`, `encode`, `score` | SIE endpoint, Shopify dataset prep via `uv run` scripts, standalone `uv` project | Runnable evaluation example | | [Swap an OCR model with one identifier change](./document-ocr) | Driving recognition (VLM-OCR), structured extraction (Donut), and zero-shot NER (GLiNER) through the same `extract` call by swapping the model ID | `extract` | Docker Compose plus Node UI, no API key required, hosted version on [Hugging Face Spaces](https://huggingface.co/spaces/superlinked/document-ocr) | Runnable demo | +| [Vision-first document RAG](./vision-doc-rag) | Retrieving and answering questions over a multi-tenant page corpus by looking at page images, with OCR kept out of the score path | `encode`, `extract`, `score` (optional) | SIE endpoint with a GPU recommended for ColQwen2.5 + Florence-2-DocVQA | Runnable demo | For docs publishing, lead with the quickest runnable demos, then use the benchmark and evaluation examples for deeper technical users. 
diff --git a/examples/vision-doc-rag/.gitignore b/examples/vision-doc-rag/.gitignore new file mode 100644 index 00000000..a787e920 --- /dev/null +++ b/examples/vision-doc-rag/.gitignore @@ -0,0 +1,6 @@ +.venv/ +__pycache__/ +data/pages.json +data/pages/ +data/multivectors.npz +data/metadata.json diff --git a/examples/vision-doc-rag/README.md b/examples/vision-doc-rag/README.md new file mode 100644 index 00000000..f179051c --- /dev/null +++ b/examples/vision-doc-rag/README.md @@ -0,0 +1,209 @@ +# Vision-first document RAG + +Retrieve by image, answer by image. ColQwen2.5 reads each page as a picture +and ranks them via late interaction; Florence-2-DocVQA reads the winning +page and produces the textual answer. OCR never enters the score path, so +charts, screenshots, tables, and any other layout cue that would die in a +text round-trip still drives ranking. Everything runs on one SIE endpoint. + +Each page also carries a `client` tag, so the same corpus serves multiple +tenants from one index — queries scoped to `acme-corp` cannot retrieve a +`globex` page, no separate index per tenant required. + +## SIE features used + +- `encode` — `vidore/colqwen2.5-v0.2` on page images at ingest and on the + query text at search time. Output is a `[tokens, 128]` multivector. Late + interaction (`sie_sdk.scoring.maxsim`) is the only ranking signal. +- `extract` — `mynkchaudhry/Florence-2-FT-DocVQA`. Called twice, with two + jobs: with `instruction` set to the question to get a textual answer for the + top page, and without `instruction` to OCR the same page for a display + snippet. The OCR snippet is UX-only — it never enters the score path. +- `score` *(optional)* — `Qwen/Qwen3-VL-Reranker-2B` second-stage rerank + over `(query text, page image)`. Off by default while we wait for an + upstream adapter fix; flip `search.visual_rerank: true` in `config.yaml` + to enable it on a cluster that's ready. 
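Reviewer sketch (not part of the patch): the `encode` bullet leans on late interaction, so here is a minimal NumPy illustration of what a MaxSim score over `[tokens, dim]` multivectors computes, combined with the tenant filter applied before scoring. It is an assumption-laden stand-in, not the `sie_sdk.scoring.maxsim` implementation, whose normalization and return type may differ. The names `maxsim_numpy`, `rank_pages`, `DEMO_QUERY`, and `DEMO_PAGES` are hypothetical, and the toy vectors are 2-dimensional rather than 128.

```python
from __future__ import annotations

import numpy as np


def maxsim_numpy(query_mv: np.ndarray, doc_mv: np.ndarray) -> float:
    """For each query token take its best dot product over page tokens, then sum."""
    sims = query_mv @ doc_mv.T  # shape [query_tokens, page_tokens]
    return float(sims.max(axis=1).sum())


def rank_pages(query_mv: np.ndarray, pages: list[dict],
               client: str | None = None, top_k: int = 5) -> list[tuple[float, str]]:
    """Filter by tenant first, then score only the surviving pages."""
    corpus = [p for p in pages if client is None or p["client"] == client]
    scored = [(maxsim_numpy(query_mv, p["mv"]), p["page_id"]) for p in corpus]
    scored.sort(reverse=True)  # highest score first
    return scored[:top_k]


# Hypothetical toy data: two query tokens, three pages across two tenants.
DEMO_QUERY = np.array([[1.0, 0.0], [0.0, 1.0]], dtype=np.float32)
DEMO_PAGES = [
    {"client": "acme-corp", "page_id": "ACME-A",
     "mv": np.array([[1.0, 0.0], [0.0, 1.0]], dtype=np.float32)},
    {"client": "acme-corp", "page_id": "ACME-B",
     "mv": np.array([[0.5, 0.5]], dtype=np.float32)},
    {"client": "globex", "page_id": "GLOBEX-C",
     "mv": np.array([[2.0, 0.0], [0.0, 2.0]], dtype=np.float32)},
]

if __name__ == "__main__":
    # Scoped to acme-corp, the higher-scoring globex page is never a candidate.
    print(rank_pages(DEMO_QUERY, DEMO_PAGES, client="acme-corp"))
```

Because the filter runs before scoring, a page from another tenant cannot appear in the results however well it matches, which is the isolation property the README describes.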
+ +## Why vision end-to-end + +OCR-then-text-rerank throws away the exact signal we pick ColQwen for — +charts, screenshots, tables, callouts, and the spatial layout that tells +a wiki page apart from a checklist. The rerank stays visual or doesn't +happen. The OCR step shows on-screen text next to the page image so the +user can copy/paste from the result, nothing more. + +## Multi-tenant by construction + +Every page carries a `client` field in `data/pages.json`. The metadata list +loaded by `python/search.py` is filtered by `client_name` before MaxSim +runs, so a query scoped to `acme-corp` cannot retrieve a `globex` page. +Real deployments would push `client` down into the multivector store's +filter expression; the demo keeps everything in memory because the corpus +is tiny. + +## Run it + +You need Python 3.12 and a reachable SIE cluster (or local `docker run`). + +```bash +# 1. SIE locally (or point SIE_CLUSTER_URL / SIE_API_KEY at a managed cluster). +docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default + +# 2. Generate the synthetic corpus and render each page to a PNG. +cd examples/vision-doc-rag +pip install -r python/requirements.txt +python data/fetch_dataset.py +python data/render_pages.py + +# 3. Encode every page with ColQwen2.5 and save the multivectors. +python python/ingest.py + +# 4a. CLI demo — runs four scoped queries and prints results. +python python/search.py + +# 4b. Or start the UI. +uvicorn --app-dir python server:app --port 8888 +open http://localhost:8888 +``` + +First run on a cold cluster pays a one-time model load: ColQwen2.5 and +Florence-2 are both several GB, expect roughly a minute on CPU and a few +seconds on GPU before the warm path kicks in. + +### Pointing at a managed cluster + +```bash +export SIE_CLUSTER_URL="https://your-cluster-host:8080" +export SIE_API_KEY="SL-..." 
+``` + +The defaults in `config.yaml` point at `http://localhost:8080` so the env +vars only matter when you're hitting something remote. Set `cluster.gpu` +to a profile name like `l4-spot` if the cluster needs an explicit GPU +class. + +## Try these queries + +| Tenant | Query | Why it's interesting | +|---|---|---| +| `acme-corp` | how do I sign in to the VPN? | Visual layout match — the page is titled "VPN setup for new engineers" with a bulleted body, and ColQwen2.5 picks it without keyword overlap with "sign in". DocVQA reads the page and answers with the client name and the auth method. | +| `globex` | what is the parental leave policy? | Disambiguates from "time off" — the right page mentions parental leave only halfway down the body. The textual answer cites the week count. | +| `initech` | audit prep evidence and walkthroughs | All three Initech pages are compliance-flavored; the visual model breaks the tie by reading the checklist layout. | +| `globex` | how do I sign in to the VPN? | Tenant filter — even though the same query hit acme-corp earlier, scoping to globex returns the closest globex page (Wi-Fi guide) and never leaks acme content. | + +## API + +### `GET /api/search` + +| Parameter | Required | Description | +|---|---|---| +| `q` | yes | Search query | +| `client` | no | Tenant filter (e.g. `acme-corp`). Omitted ⇒ search runs across all tenants. 
| + +```bash +curl "http://localhost:8888/api/search?q=how+do+I+sign+in+to+the+VPN&client=acme-corp" +``` + +```json +{ + "query": "how do I sign in to the VPN", + "client": "acme-corp", + "answer": "Okta credentials with Duo Push for 2FA", + "timings": { + "encode_query_s": 0.12, + "maxsim_s": 0.003, + "docvqa_s": 0.91, + "ocr_snippet_s": 0.84 + }, + "results": [ + { + "page_id": "ACME-101", + "client": "acme-corp", + "title": "VPN setup for new engineers", + "space": "Engineering", + "author": "alice@acme", + "web_url": "https://acme.atlassian.net/wiki/spaces/ENG/pages/101", + "page_image": "/pages/ACME-101.png", + "ocr_snippet": "VPN Setup for New Engineers · ...", + "scores": { "maxsim": 14.44, "rerank": null } + } + ] +} +``` + +### `GET /api/clients`, `GET /api/stats` + +Tenant list and runtime config (active models, rerank on/off, page count). + +## How it works + +``` + ┌──────────────────────────────────────────────────────────────┐ + │ ingest.py (once per corpus) │ + │ pages.json ─▶ render_pages.py ─▶ data/pages/*.png │ + │ ─▶ SIE.encode(ColQwen2.5, images, multivector) │ + │ ─▶ data/multivectors.npz + data/metadata.json │ + └──────────────────────────────────────────────────────────────┘ + │ + ▼ + ┌──────────────────────────────────────────────────────────────┐ + │ search.py / server.py (per query) │ + │ q ─▶ SIE.encode(ColQwen2.5, text, is_query=True) │ + │ ─▶ filter metadata by tenant │ + │ ─▶ sie_sdk.scoring.maxsim → top_k_candidates │ + │ ─▶ [optional] SIE.score(Qwen3-VL-Reranker, q, images) │ + │ ─▶ SIE.extract(Florence-2-DocVQA, instruction=q, │ + │ images=[top_page]) ⇒ textual answer │ + │ ─▶ SIE.extract(Florence-2-DocVQA, images=[top_page]) │ + │ ⇒ OCR snippet (UI) │ + └──────────────────────────────────────────────────────────────┘ +``` + +OCR is never on the score path. The visual reranker (when enabled) ranks +over the same modality as retrieval, so layout cues survive both stages. + +The corpus is small enough that MaxSim runs in Python. 
For thousands of +pages, hand the multivectors to LanceDB or Vespa; only the SIE calls stay +the same. + +## Customize + +`config.yaml` is the single tuning surface: + +```yaml +models: + retriever: "vidore/colqwen2.5-v0.2" # smaller: vidore/colpali-v1.3-hf + docvqa: "mynkchaudhry/Florence-2-FT-DocVQA" + reranker: "Qwen/Qwen3-VL-Reranker-2B" # used only when search.visual_rerank: true +search: + top_k_candidates: 5 + top_k_results: 3 + visual_rerank: false + answer: true + ocr_snippet: true +``` + +Swap any model for another from the +[SIE model catalog](https://superlinked.com/models) and the pipeline keeps +working. + +## Project layout + +```text +examples/vision-doc-rag/ +├── config.yaml +├── data/ +│ ├── fetch_dataset.py # synthetic 3-tenant page corpus +│ ├── render_pages.py # pages.json → PNG screenshots +│ ├── pages.json # generated +│ ├── pages/ # generated PNGs +│ ├── metadata.json # generated by ingest +│ └── multivectors.npz # generated by ingest +├── python/ +│ ├── ingest.py +│ ├── search.py +│ ├── server.py +│ └── requirements.txt +└── static/ + └── index.html +``` diff --git a/examples/vision-doc-rag/config.yaml b/examples/vision-doc-rag/config.yaml new file mode 100644 index 00000000..8b35ffda --- /dev/null +++ b/examples/vision-doc-rag/config.yaml @@ -0,0 +1,43 @@ +# SIE server (defaults to local Docker: docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default). +# Override with SIE_CLUSTER_URL / SIE_API_KEY env vars when targeting a managed cluster. +cluster: + url: "http://localhost:8080" + api_key: "" + gpu: "" # only set for managed multi-GPU clusters (e.g. "l4-spot"); ignored locally + provision_timeout_s: 600 + +# Models. The retrieval signal is vision end-to-end: ColQwen2.5 reads each page +# as an image and we late-interact (MaxSim) against the same model's text-side +# embedding of the query. 
No OCR is involved in ranking, so charts, screenshots, +# tables, and any other layout cue that wouldn't survive an OCR round-trip +# still contributes to the score. +# +# DocVQA produces a textual answer for the top page. The model takes the page +# image + the user's question (passed via `instruction`) and returns the answer +# as an entity in the response — no separate LLM call needed. +models: + retriever: "vidore/colqwen2.5-v0.2" + docvqa: "mynkchaudhry/Florence-2-FT-DocVQA" + # Optional second-stage cross-encoder rerank. Visual model so we don't have to + # collapse the page through OCR before reranking. Disabled by default while + # we wait for the cluster-side adapter fix to land: + # https://github.com/superlinked/sie-internal/issues/1026 + # Re-enable with search.visual_rerank: true once that ships. + reranker: "Qwen/Qwen3-VL-Reranker-2B" + +# Page rendering (used by data/render_pages.py to turn the synthetic page +# corpus into PNGs; replace with pdf2image, screenshots, or your own files +# for a real deployment). +render: + width: 1024 + height: 1280 + body_font_size: 20 + title_font_size: 30 + +# Retrieval +search: + top_k_candidates: 5 # how many pages survive MaxSim + top_k_results: 3 # how many pages return after optional rerank + visual_rerank: false # see models.reranker note above + answer: true # run DocVQA on the top page for a textual answer + ocr_snippet: true # OCR the top page for a display-only snippet in the UI diff --git a/examples/vision-doc-rag/data/fetch_dataset.py b/examples/vision-doc-rag/data/fetch_dataset.py new file mode 100644 index 00000000..eb901a6c --- /dev/null +++ b/examples/vision-doc-rag/data/fetch_dataset.py @@ -0,0 +1,211 @@ +"""Synthetic multi-tenant page corpus. + +Three fictional clients, each with a handful of pages — engineering runbooks, +HR policies, finance procedures. Small enough to encode in a minute on a warm +GPU cluster, varied enough to make multi-tenant filtering and visual retrieval +meaningful. 
Replace `PAGES` with your own pages (wiki export, Notion dump, +PDF batch, etc.) to point the demo at real content. +""" + +import json +from pathlib import Path + +PAGES = [ + # ── acme-corp: engineering ──────────────────────────────────────────── + { + "client": "acme-corp", + "page_id": "ACME-101", + "title": "VPN setup for new engineers", + "space": "Engineering", + "author": "alice@acme", + "web_url": "https://acme.atlassian.net/wiki/spaces/ENG/pages/101", + "body": [ + "All engineers need to connect through the corporate VPN to reach internal services.", + "We use Cisco AnyConnect on macOS and Windows, and the OpenConnect CLI on Linux.", + "Download the client from it.acme.com/vpn, then sign in with your Okta credentials.", + "Two-factor confirmation goes through Duo Push.", + "If you hit a TLS error on first connection, check that the device certificate from Jamf is installed.", + "For on-call rotations, request the always-on VPN profile from IT — it auto-reconnects after suspend.", + ], + }, + { + "client": "acme-corp", + "page_id": "ACME-102", + "title": "On-call rotation and paging", + "space": "Engineering", + "author": "bob@acme", + "web_url": "https://acme.atlassian.net/wiki/spaces/ENG/pages/102", + "body": [ + "Engineering on-call runs Monday to Monday handovers at 10:00 PT.", + "Primary takes the pager, secondary takes the laptop, both are paid the on-call stipend.", + "Pages route through PagerDuty; the escalation policy is primary -> secondary (15 min) -> manager.", + "During an incident open a Zoom bridge and a Slack channel named #inc-YYYYMMDD-summary.", + "Postmortems are due within five working days and live in the Incidents space.", + ], + }, + { + "client": "acme-corp", + "page_id": "ACME-103", + "title": "Deploying to production with our CI/CD pipeline", + "space": "Engineering", + "author": "carol@acme", + "web_url": "https://acme.atlassian.net/wiki/spaces/ENG/pages/103", + "body": [ + "We use GitHub Actions for CI and ArgoCD for delivery 
to Kubernetes.", + "Merging to main triggers a build, runs the test suite, pushes an image to ECR, and updates the staging manifest.", + "Production rollouts are gated by a manual approval in ArgoCD and require two reviewers from the service team.", + "Use the rolling strategy with maxSurge=25% by default.", + "Hotfix tags follow the pattern v1.2.3-hotfix.N and skip staging only with on-call approval recorded in the PR.", + ], + }, + { + "client": "acme-corp", + "page_id": "ACME-104", + "title": "Local development setup", + "space": "Engineering", + "author": "dan@acme", + "web_url": "https://acme.atlassian.net/wiki/spaces/ENG/pages/104", + "body": [ + "Install mise to manage runtimes — it pins Node, Python, and Go versions per repo.", + "Run `mise install` in the repo root, then `make dev` to spin up Postgres, Redis, and the API gateway in Docker.", + "The seed data covers the last 30 days of staging traffic, sanitized of PII.", + "If port 5432 is already taken, override DEV_PG_PORT in your shell profile.", + ], + }, + # ── globex: HR and admin ────────────────────────────────────────────── + { + "client": "globex", + "page_id": "GLOBEX-201", + "title": "Time off and vacation policy", + "space": "HR", + "author": "hr@globex", + "web_url": "https://globex.atlassian.net/wiki/spaces/HR/pages/201", + "body": [ + "Globex offers 25 working days of paid vacation per year, accruing monthly from the start date.", + "Requests go through Workday at least two weeks in advance for absences longer than three days.", + "Sick leave is separate and uncapped, but anything over three consecutive days requires a doctor's note.", + "Parental leave is 18 weeks at full pay for the primary caregiver and 6 weeks for the secondary, regardless of gender.", + "Unused vacation rolls over up to 10 days into the next calendar year; the rest is paid out.", + ], + }, + { + "client": "globex", + "page_id": "GLOBEX-202", + "title": "Expense reports and reimbursement", + "space": "HR", + "author": 
"finance@globex", + "web_url": "https://globex.atlassian.net/wiki/spaces/HR/pages/202", + "body": [ + "Submit expenses in Expensify within 30 days of the transaction.", + "Receipts are mandatory for any item over $25; below that, a description and category are enough.", + "Travel bookings should go through Navan when possible — direct bookings need pre-approval from your manager.", + "Reimbursements process every Friday and land in your payroll account the following Tuesday.", + "Per diem for international travel is $80 USD equivalent for meals.", + ], + }, + { + "client": "globex", + "page_id": "GLOBEX-203", + "title": "Office perks and meals", + "space": "HR", + "author": "office@globex", + "web_url": "https://globex.atlassian.net/wiki/spaces/HR/pages/203", + "body": [ + "Lunch is catered Monday through Thursday in the main cafe from 12:00 to 14:00.", + "There are always vegetarian, vegan, and gluten-free options labeled at the buffet.", + "Friday is a free-lunch credit you can spend at any partner restaurant in the office app.", + "Snacks and drinks in the micro-kitchens are unlimited; please refill empty trays.", + "The wellness stipend is $100 per month, claimable in Expensify under category Wellness.", + ], + }, + { + "client": "globex", + "page_id": "GLOBEX-204", + "title": "Office Wi-Fi and guest network", + "space": "IT", + "author": "it@globex", + "web_url": "https://globex.atlassian.net/wiki/spaces/IT/pages/204", + "body": [ + "Connect to Globex-Corp for the employee network; sign in with your @globex.com SSO.", + "Globex-Guest is for visitors — the rotating daily password is on the lobby screen.", + "Printing requires the Globex-Print network and a one-time pairing with your laptop using the Mobility Print app.", + "If your laptop will not join, forget the network and rejoin; the cert is renewed weekly and old caches get stuck.", + ], + }, + # ── initech: finance and compliance ─────────────────────────────────── + { + "client": "initech", + "page_id": 
"INIT-301", + "title": "SOX controls and quarterly attestation", + "space": "Compliance", + "author": "compliance@initech", + "web_url": "https://initech.atlassian.net/wiki/spaces/COMP/pages/301", + "body": [ + "Initech is subject to SOX 404 reporting for financial controls over revenue, expense, and access management.", + "Every quarter, control owners attest in AuditBoard that their controls operated as designed.", + "Evidence is automatically collected from Workday, NetSuite, and Okta where possible; manual evidence goes in the AuditBoard Drive folder.", + "External auditors test a sample of controls in Q3; expect requests for screenshots and approver lists.", + "Exceptions must be logged within five business days of detection.", + ], + }, + { + "client": "initech", + "page_id": "INIT-302", + "title": "Vendor onboarding and due diligence", + "space": "Procurement", + "author": "procurement@initech", + "web_url": "https://initech.atlassian.net/wiki/spaces/PROC/pages/302", + "body": [ + "New vendors above $50,000 annual spend require a security review and a SOC 2 Type II report on file.", + "Submit the vendor questionnaire through Vanta; legal will review the MSA within five business days.", + "Payment terms default to Net 60; faster terms require CFO approval and reduce the risk score in NetSuite.", + "Sanctioned-country checks run automatically via the OFAC integration; any hit halts the workflow until cleared.", + "Annual recertification of high-risk vendors happens every January.", + ], + }, + { + "client": "initech", + "page_id": "INIT-303", + "title": "Audit prep checklist", + "space": "Compliance", + "author": "audit@initech", + "web_url": "https://initech.atlassian.net/wiki/spaces/COMP/pages/303", + "body": [ + "Two weeks before the auditors arrive, freeze the control population in AuditBoard and export the evidence index.", + "Confirm with control owners that they will be available for walkthrough interviews — block 60 minutes in their calendars.", + 
"Pull the user access review reports for the prior two quarters from Okta and confirm sign-off in writing.", + "Have the change management JIRA queries ready: filter by label sox-relevant and status Done.", + "If a control failed mid-period, document the compensating control and the date the gap was closed.", + ], + }, + { + "client": "initech", + "page_id": "INIT-304", + "title": "Procurement card limits and exceptions", + "space": "Procurement", + "author": "procurement@initech", + "web_url": "https://initech.atlassian.net/wiki/spaces/PROC/pages/304", + "body": [ + "Procurement cards (P-cards) have a default monthly limit of $5,000 and a single-transaction limit of $1,500.", + "Use them for low-dollar, low-risk purchases — software subscriptions and conference tickets are the common cases.", + "Limit-increase requests need manager and CFO approval and a documented business need.", + "Personal use, cash advances, and split transactions to bypass the single-transaction limit are policy violations.", + "All P-card transactions reconcile in Coupa within 14 days of statement close.", + ], + }, +] + + +def main(): + out = Path(__file__).resolve().parent / "pages.json" + out.write_text(json.dumps(PAGES, indent=2)) + by_client = {} + for p in PAGES: + by_client[p["client"]] = by_client.get(p["client"], 0) + 1 + print(f"Wrote {len(PAGES)} pages to {out}") + for client, n in sorted(by_client.items()): + print(f" {client}: {n} pages") + + +if __name__ == "__main__": + main() diff --git a/examples/vision-doc-rag/data/render_pages.py b/examples/vision-doc-rag/data/render_pages.py new file mode 100644 index 00000000..4043d71b --- /dev/null +++ b/examples/vision-doc-rag/data/render_pages.py @@ -0,0 +1,106 @@ +"""Render the synthetic pages to PNG screenshots. + +Each entry in pages.json becomes one image in data/pages/<page_id>.png. 
The +layout is intentionally plain — a title, a metadata line, and a body block — +so ColQwen2.5 sees the same kind of visual structure it would in real wikis, +docs, or PDFs. Replace this script with `pdf2image` (or screenshots) when +pointing at real content. +""" + +import json +import sys +from pathlib import Path + +import yaml +from PIL import Image, ImageDraw, ImageFont + + +def _font(size: int): + """Try the platform Helvetica, fall back to PIL's default bitmap font.""" + for path in [ + "/System/Library/Fonts/Helvetica.ttc", + "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", + "/usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf", + ]: + if Path(path).exists(): + return ImageFont.truetype(path, size) + return ImageFont.load_default() + + +def _wrap(text: str, font: ImageFont.ImageFont, max_width: int) -> list[str]: + """Greedy word wrap so body paragraphs fit the page width.""" + lines: list[str] = [] + for paragraph in text.split("\n"): + words = paragraph.split() + current = "" + for word in words: + candidate = f"{current} {word}".strip() + if font.getlength(candidate) <= max_width: + current = candidate + else: + if current: + lines.append(current) + current = word + if current: + lines.append(current) + return lines + + +def render_page(page: dict, width: int, height: int, body_size: int, title_size: int) -> Image.Image: + img = Image.new("RGB", (width, height), "white") + draw = ImageDraw.Draw(img) + title_font = _font(title_size) + meta_font = _font(int(body_size * 0.9)) + body_font = _font(body_size) + + margin = 48 + cursor_y = margin + draw.text((margin, cursor_y), page["title"], fill="black", font=title_font) + cursor_y += int(title_size * 1.6) + meta = f"{page['space']} · {page['author']} · {page['page_id']}" + draw.text((margin, cursor_y), meta, fill=(96, 96, 96), font=meta_font) + cursor_y += int(title_size * 1.2) + draw.line([(margin, cursor_y), (width - margin, cursor_y)], fill=(200, 200, 200), width=2) + cursor_y += 
int(body_size * 1.2) + + max_text_width = width - 2 * margin + line_gap = int(body_size * 1.5) + for bullet in page["body"]: + # Render each body line as a wrapped paragraph block. + lines = _wrap(bullet, body_font, max_text_width) + for line in lines: + draw.text((margin, cursor_y), line, fill="black", font=body_font) + cursor_y += line_gap + cursor_y += int(line_gap * 0.4) # paragraph spacing + + return img + + +def main(): + here = Path(__file__).resolve().parent + pages_path = here / "pages.json" + if not pages_path.exists(): + print("pages.json not found; run fetch_dataset.py first", file=sys.stderr) + sys.exit(1) + config = yaml.safe_load((here.parent / "config.yaml").read_text()) + render = config["render"] + out_dir = here / "pages" + out_dir.mkdir(exist_ok=True) + + pages = json.loads(pages_path.read_text()) + for p in pages: + img = render_page( + p, + width=render["width"], + height=render["height"], + body_size=render["body_font_size"], + title_size=render["title_font_size"], + ) + out = out_dir / f"{p['page_id']}.png" + img.save(out) + print(f" {p['client']:10s} {p['page_id']:10s} -> {out.relative_to(here.parent)}") + print(f"Rendered {len(pages)} pages to {out_dir}") + + +if __name__ == "__main__": + main() diff --git a/examples/vision-doc-rag/python/ingest.py b/examples/vision-doc-rag/python/ingest.py new file mode 100644 index 00000000..15607f30 --- /dev/null +++ b/examples/vision-doc-rag/python/ingest.py @@ -0,0 +1,119 @@ +"""Build the per-tenant visual index. + +For every page PNG we ask SIE to encode the image with vidore/colqwen2.5-v0.2, +which returns a [tokens, 128] multivector. Each page's multivector goes into a +single .npz on disk, alongside a metadata.json that keeps the client name, +page id, title, and source url for routing and filtering at query time. + +There is no vector database here. MaxSim at the scale of one team's wiki +(hundreds to thousands of pages) is cheap and avoids the indexing step. 
+For larger corpora swap the .npz for a multivector store (LanceDB, Vespa, +Turbopuffer); the encode call is the same. +""" + +from __future__ import annotations + +import json +import os +import time +from pathlib import Path + +import numpy as np +import yaml + +from sie_sdk import SIEClient +from sie_sdk.types import Item + + +def load_config(): + return yaml.safe_load((Path(__file__).resolve().parent.parent / "config.yaml").read_text()) + + +def load_pages(): + pages_path = Path(__file__).resolve().parent.parent / "data" / "pages.json" + if not pages_path.exists(): + raise FileNotFoundError( + "data/pages.json not found. Run `python data/fetch_dataset.py` " + "and `python data/render_pages.py` first." + ) + return json.loads(pages_path.read_text()) + + +def encode_pages(client: SIEClient, model: str, pages: list[dict], gpu: str, timeout: float): + pages_dir = Path(__file__).resolve().parent.parent / "data" / "pages" + multivectors: list[np.ndarray] = [] + metadata: list[dict] = [] + + for i, page in enumerate(pages, 1): + image_path = pages_dir / f"{page['page_id']}.png" + if not image_path.exists(): + raise FileNotFoundError(f"Missing page image: {image_path}. 
Run data/render_pages.py.") + + start = time.time() + result = client.encode( + model, + Item(id=page["page_id"], images=[str(image_path)]), + output_types=["multivector"], + gpu=gpu, + wait_for_capacity=True, + provision_timeout_s=timeout, + ) + elapsed = time.time() - start + mv = result["multivector"].astype(np.float32) + multivectors.append(mv) + metadata.append( + { + "page_id": page["page_id"], + "client": page["client"], + "title": page["title"], + "space": page["space"], + "author": page["author"], + "web_url": page["web_url"], + # Stored relative to data/ because search.py joins it onto the data directory. + "image_path": str(image_path.relative_to(image_path.parent.parent)), + "num_tokens": int(mv.shape[0]), + } + ) + print(f" [{i}/{len(pages)}] {page['page_id']:10s} {page['client']:10s} {mv.shape} in {elapsed:.1f}s") + + return multivectors, metadata + + +def main(): + config = load_config() + pages = load_pages() + print(f"Loaded {len(pages)} pages") + + cluster_url = os.environ.get("SIE_CLUSTER_URL", config["cluster"]["url"]) + api_key = os.environ.get("SIE_API_KEY", config["cluster"]["api_key"]) + gpu = config["cluster"]["gpu"] + timeout = config["cluster"]["provision_timeout_s"] + model = config["models"]["retriever"] + + print(f"\n--- Encoding pages with {model} ---") + with SIEClient(cluster_url, api_key=api_key) as client: + multivectors, metadata = encode_pages(client, model, pages, gpu, timeout) + + data_dir = Path(__file__).resolve().parent.parent / "data" + # np.savez stores variable-length multivectors as one entry per array; we + # key them by page_id so the search side can reload without an extra index. 
+ np.savez( + data_dir / "multivectors.npz", + **{m["page_id"]: mv for m, mv in zip(metadata, multivectors)}, + ) + (data_dir / "metadata.json").write_text(json.dumps(metadata, indent=2)) + + total_tokens = sum(m["num_tokens"] for m in metadata) + by_client: dict[str, int] = {} + for m in metadata: + by_client[m["client"]] = by_client.get(m["client"], 0) + 1 + + print(f"\n Saved {len(metadata)} multivectors to data/multivectors.npz") + print(f" Saved metadata to data/metadata.json") + print(f" Total visual tokens: {total_tokens}") + print(" Pages per tenant:") + for client_name in sorted(by_client): + print(f" {client_name}: {by_client[client_name]}") + + +if __name__ == "__main__": + main() diff --git a/examples/vision-doc-rag/python/requirements.txt b/examples/vision-doc-rag/python/requirements.txt new file mode 100644 index 00000000..bd32dcbc --- /dev/null +++ b/examples/vision-doc-rag/python/requirements.txt @@ -0,0 +1,6 @@ +sie-sdk==0.1.10 +fastapi>=0.115.0 +uvicorn>=0.30.0 +numpy>=1.26.0 +pyyaml>=6.0 +Pillow>=10.3.0 diff --git a/examples/vision-doc-rag/python/search.py b/examples/vision-doc-rag/python/search.py new file mode 100644 index 00000000..52dd2211 --- /dev/null +++ b/examples/vision-doc-rag/python/search.py @@ -0,0 +1,243 @@ +"""Visual document search + question answering, vision end-to-end. + +Pipeline per query: + 1. encode(ColQwen2.5, text) — query multivector + 2. sie_sdk.scoring.maxsim — late interaction against page images + 3. score(Qwen3-VL-Reranker, query, images) — optional, off by default + 4. extract(Florence-2-FT-DocVQA, instruction=query, images=[top page]) + — textual answer + citation + 5. extract(Florence-2-FT-DocVQA, images=[top page]) + — OCR snippet for the UI (display only, + NOT in the ranking path) + +The ranking is decided by a vision model looking at the page image, so charts, +screenshots, tables, and any other visual signal that OCR would erase still +contributes. 
OCR runs only on the chosen page, only to provide on-screen text +the user can read or copy. + +Multi-tenant isolation is a Python filter on metadata before MaxSim, so a +query scoped to one client never sees another client's pages. +""" + +from __future__ import annotations + +import json +import os +import time +from pathlib import Path + +import numpy as np +import yaml + +from sie_sdk import SIEClient +from sie_sdk.scoring import maxsim +from sie_sdk.types import Item + + +def load_config(): + return yaml.safe_load((Path(__file__).resolve().parent.parent / "config.yaml").read_text()) + + +def load_index(): + data_dir = Path(__file__).resolve().parent.parent / "data" + if not (data_dir / "multivectors.npz").exists(): + raise FileNotFoundError("data/multivectors.npz missing. Run `python python/ingest.py` first.") + npz = np.load(data_dir / "multivectors.npz") + metadata = json.loads((data_dir / "metadata.json").read_text()) + multivectors = {m["page_id"]: npz[m["page_id"]] for m in metadata} + return multivectors, metadata + + +def _ocr_snippet(entities: list[dict], max_chars: int = 400) -> str: + """Concatenate OCR text regions into a single readable snippet.""" + pieces = [] + for e in entities or []: + text = (e.get("text") or "").replace("", "").strip() + if text: + pieces.append(text) + joined = " · ".join(pieces) + if len(joined) > max_chars: + return joined[: max_chars - 1] + "…" + return joined + + +def _docvqa_answer(entities: list[dict]) -> str: + """Pick the answer string out of a Florence-2 DocVQA response. + + Florence-2 returns the answer as an entity (often the single one when the + DocVQA task token is dispatched). We take the first non-empty text. 
+    """
+    for e in entities or []:
+        text = (e.get("text") or "").replace("", "").strip()
+        if text:
+            return text
+    return ""
+
+
+def search(
+    client: SIEClient,
+    config: dict,
+    multivectors: dict[str, np.ndarray],
+    metadata: list[dict],
+    query: str,
+    client_filter: str | None = None,
+) -> dict:
+    gpu = config["cluster"]["gpu"]
+    timeout = config["cluster"]["provision_timeout_s"]
+    top_k_candidates = config["search"]["top_k_candidates"]
+    top_k_results = config["search"]["top_k_results"]
+    do_visual_rerank = config["search"].get("visual_rerank", False)
+    do_answer = config["search"].get("answer", True)
+    do_ocr_snippet = config["search"].get("ocr_snippet", True)
+
+    corpus = [m for m in metadata if not client_filter or m["client"] == client_filter]
+    if not corpus:
+        return {"results": [], "answer": None, "timings": {}}
+
+    timings: dict[str, float | str] = {}
+    pages_root = Path(__file__).resolve().parent.parent / "data"
+
+    # 1. Encode query (text side of ColQwen2.5).
+    t0 = time.time()
+    q_result = client.encode(
+        config["models"]["retriever"],
+        Item(text=query),
+        output_types=["multivector"],
+        is_query=True,
+        gpu=gpu,
+        wait_for_capacity=True,
+        provision_timeout_s=timeout,
+    )
+    timings["encode_query_s"] = round(time.time() - t0, 3)
+    query_mv = q_result["multivector"].astype(np.float32)
+
+    # 2. MaxSim against in-memory multivectors.
+    doc_mvs = [multivectors[m["page_id"]] for m in corpus]
+    t0 = time.time()
+    maxsim_scores = maxsim(query_mv, doc_mvs)
+    timings["maxsim_s"] = round(time.time() - t0, 3)
+
+    order = np.argsort(maxsim_scores)[::-1][:top_k_candidates]
+    candidates: list[dict] = []
+    for idx in order:
+        c = dict(corpus[idx])
+        c["_maxsim_score"] = float(maxsim_scores[idx])
+        c["_rerank_score"] = None
+        candidates.append(c)
+
+    # 3. Optional visual rerank. Image-in cross-encoder so OCR never enters the
+    #    ranking path. Disabled by default — see config.yaml for the cluster
+    #    bug we're waiting on.
+    if do_visual_rerank and candidates:
+        try:
+            t0 = time.time()
+            rerank_items = [
+                Item(id=c["page_id"], images=[str(pages_root / c["image_path"])])
+                for c in candidates
+            ]
+            rerank = client.score(
+                config["models"]["reranker"],
+                Item(text=query),
+                rerank_items,
+                gpu=gpu,
+                wait_for_capacity=True,
+                provision_timeout_s=timeout,
+            )
+            timings["visual_rerank_s"] = round(time.time() - t0, 3)
+            rerank_by_id = {s["item_id"]: s for s in rerank["scores"]}
+            for c in candidates:
+                s = rerank_by_id.get(c["page_id"])
+                c["_rerank_score"] = float(s["score"]) if s else 0.0
+            candidates.sort(key=lambda c: c["_rerank_score"] or 0.0, reverse=True)
+        except Exception as exc:
+            # Cluster adapter bug fallback: keep MaxSim ordering, surface the
+            # failure to the caller. See sie-internal#1026.
+            timings["visual_rerank_error"] = type(exc).__name__
+
+    results = candidates[:top_k_results]
+
+    # 4. DocVQA answer from the top page image. instruction= goes in as the
+    #    plain question; the adapter prepends Florence-2's `` task
+    #    token. See superlinked.com/docs/extract/vision.
+    answer = None
+    if do_answer and results:
+        top = results[0]
+        try:
+            t0 = time.time()
+            qa = client.extract(
+                config["models"]["docvqa"],
+                Item(images=[str(pages_root / top["image_path"])]),
+                instruction=query,
+                gpu=gpu,
+                wait_for_capacity=True,
+                provision_timeout_s=timeout,
+            )
+            timings["docvqa_s"] = round(time.time() - t0, 3)
+            answer = _docvqa_answer(qa[0].get("entities", []) if qa else [])
+        except Exception as exc:
+            timings["docvqa_error"] = type(exc).__name__
+
+    # 5. OCR snippet for display — only on the top result so users see the
+    #    text on the page they're being shown. Never used as a ranking signal.
+    if do_ocr_snippet and results:
+        top = results[0]
+        try:
+            t0 = time.time()
+            ocr = client.extract(
+                config["models"]["docvqa"],  # same model, no `instruction` ⇒ OCR mode
+                Item(images=[str(pages_root / top["image_path"])]),
+                gpu=gpu,
+                wait_for_capacity=True,
+                provision_timeout_s=timeout,
+            )
+            timings["ocr_snippet_s"] = round(time.time() - t0, 3)
+            top["ocr_snippet"] = _ocr_snippet(ocr[0].get("entities", []) if ocr else [])
+        except Exception as exc:
+            timings["ocr_snippet_error"] = type(exc).__name__
+
+    return {"results": results, "answer": answer, "timings": timings}
+
+
+def print_run(out: dict, query: str, client_filter: str | None):
+    scope = client_filter or "all clients"
+    print(f'\n Query: "{query}" ({scope})')
+    print(f" Timings: {out['timings']}")
+    if out["answer"]:
+        print(f"\n Answer: {out['answer']}")
+    if not out["results"]:
+        print(" No results.")
+        return
+    for i, r in enumerate(out["results"], 1):
+        rerank = r.get("_rerank_score")
+        rerank_str = f"rerank={rerank:.4f}" if rerank is not None else "rerank=—"
+        print(f"\n {i}. [{r['client']}] {r['title']}")
+        print(f" {r['page_id']} · {r['space']} · {r['author']}")
+        print(f" maxsim={r['_maxsim_score']:.3f} {rerank_str}")
+        if r.get("ocr_snippet"):
+            print(f" OCR snippet: {r['ocr_snippet'][:200]}")
+        print(f" url: {r['web_url']}")
+
+
+def main():
+    config = load_config()
+    multivectors, metadata = load_index()
+    print(f"Loaded index: {len(metadata)} pages")
+
+    cluster_url = os.environ.get("SIE_CLUSTER_URL", config["cluster"]["url"])
+    api_key = os.environ.get("SIE_API_KEY", config["cluster"]["api_key"])
+
+    demo = [
+        ("how do I sign in to the VPN?", "acme-corp"),
+        ("what is the parental leave policy?", "globex"),
+        ("audit prep evidence and walkthroughs", "initech"),
+        # No tenant filter: shows the query routes across tenants.
+        ("expense reports and per diem", None),
+    ]
+    with SIEClient(cluster_url, api_key=api_key) as client:
+        for query, tenant in demo:
+            out = search(client, config, multivectors, metadata, query, tenant)
+            print_run(out, query, tenant)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/examples/vision-doc-rag/python/server.py b/examples/vision-doc-rag/python/server.py
new file mode 100644
index 00000000..d61e5962
--- /dev/null
+++ b/examples/vision-doc-rag/python/server.py
@@ -0,0 +1,96 @@
+"""FastAPI backend for the multi-tenant visual-document search + QA demo."""
+
+from __future__ import annotations
+
+import os
+from contextlib import asynccontextmanager
+from pathlib import Path
+
+import yaml
+from fastapi import FastAPI, Query
+from fastapi.responses import FileResponse
+from fastapi.staticfiles import StaticFiles
+
+from sie_sdk import SIEClient
+
+from search import load_index, search
+
+config = None
+multivectors = None
+metadata = None
+client = None
+clients_index: list[str] = []
+
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    global config, multivectors, metadata, client, clients_index
+    root = Path(__file__).resolve().parent.parent
+    config = yaml.safe_load((root / "config.yaml").read_text())
+    multivectors, metadata = load_index()
+    cluster_url = os.environ.get("SIE_CLUSTER_URL", config["cluster"]["url"])
+    api_key = os.environ.get("SIE_API_KEY", config["cluster"]["api_key"])
+    client = SIEClient(cluster_url, api_key=api_key)
+    clients_index = sorted({m["client"] for m in metadata})
+    yield
+    client.close()
+
+
+app = FastAPI(title="SIE Vision-First Document RAG", lifespan=lifespan)
+
+root = Path(__file__).resolve().parent.parent
+static_dir = root / "static"
+app.mount("/static", StaticFiles(directory=str(static_dir)), name="static")
+app.mount("/pages", StaticFiles(directory=str(root / "data" / "pages")), name="pages")
+
+
+@app.get("/")
+def index():
+    return FileResponse(str(static_dir / "index.html"))
+
+
+@app.get("/api/clients")
+def api_clients():
+    return clients_index
+
+
+@app.get("/api/stats")
+def api_stats():
+    return {
+        "total_pages": len(metadata),
+        "clients": clients_index,
+        "models": config["models"],
+        "visual_rerank": config["search"].get("visual_rerank", False),
+        "answer": config["search"].get("answer", True),
+    }
+
+
+@app.get("/api/search")
+def api_search(
+    q: str = Query(..., min_length=1),
+    client_name: str | None = Query(None, alias="client"),
+):
+    out = search(client, config, multivectors, metadata, q, client_name)
+    return {
+        "query": q,
+        "client": client_name,
+        "answer": out["answer"],
+        "timings": out["timings"],
+        "results": [
+            {
+                "page_id": r["page_id"],
+                "client": r["client"],
+                "title": r["title"],
+                "space": r["space"],
+                "author": r["author"],
+                "web_url": r["web_url"],
+                "page_image": f"/pages/{r['page_id']}.png",
+                "ocr_snippet": r.get("ocr_snippet", ""),
+                "scores": {
+                    "maxsim": round(r["_maxsim_score"], 4),
+                    "rerank": round(r["_rerank_score"], 4) if r.get("_rerank_score") is not None else None,
+                },
+            }
+            for r in out["results"]
+        ],
+    }
diff --git a/examples/vision-doc-rag/static/index.html b/examples/vision-doc-rag/static/index.html
new file mode 100644
index 00000000..392c8791
--- /dev/null
+++ b/examples/vision-doc-rag/static/index.html
@@ -0,0 +1,190 @@
+
+
+
+
+
+ Vision-First Document RAG · SIE
+
+
+
+
+

Multi-Tenant Visual Doc Search + QA

+

ColQwen2.5 ranks pages by looking at the images. Florence-2-DocVQA reads the top page and answers the question. All on one SIE endpoint.

+
+
+
+ + + +
+
+
+
+
+ + +