Skip to content

[Python] Add agent-framework-azure-ai-contentunderstanding package#4829

Open
yungshinlintw wants to merge 72 commits intomicrosoft:mainfrom
yungshinlintw:yslin/contentunderstanding-context-provider
Open

[Python] Add agent-framework-azure-ai-contentunderstanding package#4829
yungshinlintw wants to merge 72 commits intomicrosoft:mainfrom
yungshinlintw:yslin/contentunderstanding-context-provider

Conversation

@yungshinlintw
Copy link
Copy Markdown
Member

@yungshinlintw yungshinlintw commented Mar 22, 2026

Reviewer's Guide

Closes #4942

This package adds a BaseContextProvider implementation that bridges Azure Content Understanding (CU) with the Agent Framework. When a user sends file attachments (PDF, images, audio, video), the provider intercepts them in before_run(), sends them to CU for analysis, and injects the structured results (markdown + extracted fields) back into the LLM context — so the agent can answer questions about the files without the developer writing any extraction code.

Quick usage:

cu = ContentUnderstandingContextProvider(
    endpoint="https://my-resource.services.ai.azure.com/",
    credential=AzureCliCredential(),
)
agent = Agent(
    client=client,
    name="DocQA",
    instructions="You are a document analyst.",
    context_providers=[cu],
)
# Files in Message.contents are auto-analyzed; results injected into LLM context
response = await agent.run(
    Message(role="user", contents=[
        Content.from_text("What's on this invoice?"),
        Content.from_uri("https://example.com/invoice.pdf", media_type="application/pdf",
                         additional_properties={"filename": "invoice.pdf"}),
    ]),
    session=session,
)

Suggested review order

1. Start with samples — they show the feature set and usage patterns end-to-end:

Sample What it demonstrates
01_document_qa.py Simplest flow — upload a PDF via URL, ask a question about it. Shows Content.from_uri(), context_providers=[cu], and how CU results appear in the agent's response.
02_multi_turn_session.py AgentSession persistence — upload a file on turn 1, ask follow-up questions on turns 2–3 without re-uploading. Shows how state["documents"] carries across turns.
03_multimodal_chat.py PDF + audio + video in a single session (5 turns). Shows auto-detection of media types, parallel analysis, and multi-segment video output with per-segment fields.
04_invoice_processing.py Per-file analyzer override — uses additional_properties={"analyzer_id": "prebuilt-invoice"} to extract structured invoice fields (vendor, total, line items) instead of generic markdown.
05_background_analysis.py Non-blocking analysis with max_wait=0.5 — file starts analyzing in the background while the agent responds immediately. Next turn resolves the pending result. Shows the analyzingready status flow.
06_large_doc_file_search.py CU extraction + OpenAI vector store for RAG — large documents are analyzed by CU, uploaded to a vector store, and retrieved via file_search tool instead of injecting full content into context.

2. Then review the core implementation:

Priority File Why
🔴 High _context_provider.py (1087 lines) Core logic — before_run() hook, file detection/stripping, CU analysis with timeout + background fallback, output formatting, tool registration. Most important file to review.
🔴 High _models.py Public API surface — DocumentEntry, DocumentStatus, AnalysisSection, FileSearchConfig TypedDicts and enums exposed to users
🟡 Medium _file_search.py FileSearchBackend protocol + OpenAI/Foundry factory methods for vector store integration
🟡 Medium __init__.py Public exports — verify the right symbols are exposed
🟡 Medium pyproject.toml Package metadata, dependencies, version constraints
🟢 Low tests/ 78 unit tests + 5 live integration tests

MAF API usage (needs team alignment)

This package uses the following internal/private MAF APIs — if any of these are changing or not intended for external use, this package may need updates:

  • BaseContextProvider and its before_run() hook
  • SessionContext.extend_instructions(), extend_messages(), extend_tools()
  • Content.from_data(), Content.from_uri(), Content.type, Content.media_type, Content.additional_properties
  • FunctionTool for registering list_documents()
  • agent_framework._sessions.AgentSession
  • agent_framework._settings.load_settings()

This PR adds agent-framework-azure-ai-contentunderstanding, an optional connector package that integrates Azure Content Understanding (CU) into the Agent Framework as a context provider.

What's Included

Core (_context_provider.py, _models.py, _file_search.py)

  • ContentUnderstandingContextProvider -- auto-analyzes file attachments (PDF, images, audio, video) via Azure CU and injects structured results (markdown, fields) into LLM context
  • Auto-detects media type and selects the right CU analyzer (prebuilt-documentSearch, prebuilt-audioSearch, prebuilt-videoSearch)
  • Multi-document session state with status tracking (analyzing/uploading/ready/failed)
  • Configurable timeout (max_wait) with async background fallback
  • Output filtering (>90% token reduction) via AnalysisSection enum
  • Auto-registered list_documents() tool for status queries
  • Document content injected into conversation history for follow-up turns
  • Multi-segment video/audio: per-segment fields with time ranges
  • MIME sniffing for misidentified files (application/octet-stream)
  • Per-file analyzer ID override via Content.additional_properties["analyzer_id"] -- mix different analyzers in the same turn (e.g., prebuilt-invoice for invoices alongside prebuilt-documentSearch for general docs)
  • Duplicate filename rejection (filenames must be unique within a session)
  • Optional FileSearchConfig for vector store integration (OpenAI/Foundry backends)

Samples (6 scripts + 3 DevUI)

  • 01_document_qa.py -- Single PDF upload + Q&A
  • 02_multi_turn_session.py -- AgentSession persistence across turns
  • 03_multimodal_chat.py -- PDF + audio + video parallel analysis (5 turns)
  • 04_invoice_processing.py -- Structured field extraction with prebuilt-invoice
  • 05_background_analysis.py -- Non-blocking analysis with max_wait + status tracking
  • 06_large_doc_file_search.py -- CU extraction + vector store RAG
  • 02-devui/01-multimodal_agent -- Interactive web UI for uploading and chatting with documents/audio/video
  • 02-devui/02-file_search_agent/azure_openai_backend -- DevUI with CU + Azure OpenAI file_search RAG
  • 02-devui/02-file_search_agent/foundry_backend -- DevUI with CU + Foundry file_search RAG

Tests

  • 66 unit tests covering all major flows
  • 5 live integration tests (CU endpoint required)
  • Test fixtures for PDF, audio, video, image, invoice modalities

Add Azure Content Understanding integration as a context provider for the
Agent Framework. The package automatically analyzes file attachments
(documents, images, audio, video) using Azure CU and injects structured
results (markdown, fields) into the LLM context.

Key features:
- Multi-document session state with status tracking (pending/ready/failed)
- Configurable timeout with async background fallback for large files
- Output filtering via AnalysisSection enum
- Auto-registered list_documents() and get_analyzed_document() tools
- Supports all CU modalities: documents, images, audio, video
- Content limits enforcement (pages, file size, duration)
- Binary stripping of supported files from input messages

Public API:
- ContentUnderstandingContextProvider (main class)
- AnalysisSection (output section selector enum)
- ContentLimits (configurable limits dataclass)

Tests: 46 unit tests, 91% coverage, all linting and type checks pass.
- Replace synthetic fixtures with real CU API responses (sanitized)
- Update test assertions to match real data (Contoso vs CONTOSO,
  TotalAmount vs InvoiceTotal, field values from real analysis)
- Add --pre install note in README (preview package)
- Document unenforced ContentLimits fields (max_pages, duration)
@markwallace-microsoft markwallace-microsoft added documentation Improvements or additions to documentation python labels Mar 22, 2026
@github-actions github-actions bot changed the title [WIP] [Python] Add agent-framework-azure-contentunderstanding package (DO NOT REVIEW) Python: [WIP] [Python] Add agent-framework-azure-contentunderstanding package (DO NOT REVIEW) Mar 22, 2026
Align naming with Azure SDK convention and AF pattern:
- Directory: azure-contentunderstanding -> azure-ai-contentunderstanding
- PyPI: agent-framework-azure-contentunderstanding -> agent-framework-azure-ai-contentunderstanding
- Module: agent_framework_azure_contentunderstanding -> agent_framework_azure_ai_contentunderstanding

CI fixes:
- Inline conftest helpers to avoid cross-package import collision in xdist
- Remove PyPI badge and dead API reference link from README (package not published yet)
@markwallace-microsoft
Copy link
Copy Markdown
Member

markwallace-microsoft commented Mar 23, 2026

Python Test Coverage

Python Test Coverage Report •
FileStmtsMissCoverMissing
packages/azure-ai-contentunderstanding/agent_framework_azure_ai_contentunderstanding
   _constants.py50100% 
   _context_provider.py2694085%276–277, 279, 283–284, 287, 291, 347, 349, 356, 402, 406, 410, 418, 485, 489, 532, 553, 580, 585, 642, 646, 733, 737, 743, 763–768, 770–775, 783, 792–793
   _detection.py791186%97, 103, 108–113, 172–174
   _extraction.py1741293%40, 122, 125, 146, 230, 262, 283–287, 298
   _file_search.py23482%58, 65, 69, 72
   _models.py40197%113
TOTAL28555342288% 

Python Unit Test Overview

Tests Skipped Failures Errors Time
5612 20 💤 0 ❌ 0 🔥 1m 27s ⏱️

yungshinlintw and others added 19 commits March 23, 2026 14:49
- document_qa.py: Single PDF upload, CU context provider, follow-up Q&A
- invoice_processing.py: Structured field extraction with prebuilt-invoice
- multimodal_chat.py: Multi-file session with status tracking
- Add ruff per-file-ignores for samples/ directory
- Update README with samples section, env vars, and run instructions
…earch)

- S3: devui_multimodal_agent/ — DevUI web UI with CU-powered file analysis
- S4: large_doc_file_search.py — CU extraction + OpenAI vector store RAG
- Update README and samples/README.md with all 5 samples
Add FileSearchConfig — when provided, CU-extracted markdown is automatically
uploaded to an OpenAI vector store and a file_search tool is registered on
the context. This enables token-efficient RAG retrieval for large documents
without users needing to manage vector stores manually.

- FileSearchConfig dataclass (openai_client, vector_store_name)
- Auto-create vector store, upload markdown, register file_search tool
- Auto-cleanup on close()
- When file_search is enabled, skip full content injection (use RAG instead)
- Update large_doc_file_search sample to use the integration
- 4 new tests (50 total, 90% coverage)
Follow established AF pattern: check for API key env var first,
fall back to AzureCliCredential. Supports AZURE_OPENAI_API_KEY and
AZURE_CONTENTUNDERSTANDING_API_KEY environment variables.
…zy init

_context_provider.py:
- Make analyzer_id optional (default None) with auto-detection by media
  type prefix: audio->audioSearch, video->videoSearch, else documentSearch
- Add _ensure_initialized() for lazy client creation in before_run()
- Add FileSearchConfig-based vector store upload
- Fix: background-completed docs in file_search mode now upload to vector
  store instead of injecting full markdown into context messages
- Add _pending_uploads queue for deferred vector store uploads

devui_file_search_agent/ (new sample):
- DevUI agent combining CU extraction + OpenAI file_search RAG

azure_responses_agent (existing sample fix):
- Add AzureCliCredential support and AZURE_AI_PROJECT_ENDPOINT fallback

Tests (19 new), Docs updated (AGENTS.md, README.md)
…tor store expiration

- Add three-layer MIME detection (fast path → filetype binary sniff → filename
  fallback) to handle unreliable upstream MIME types (e.g. mp4 sent as
  application/octet-stream). Adds filetype>=1.2,<2 dependency.
- Media-aware output formatting: video shows duration/resolution + all fields
  as JSON; audio promotes Summary as prose; document unchanged.
- Unified timeout for all media types (removed file_search special-case that
  waited indefinitely for video/audio). All files use max_wait with background
  polling fallback.
- Vector store created with expires_after=1 day as crash safety net.
- Add 8 MIME sniffing tests (TestMimeSniffing class).
CU's prebuilt-videoSearch and prebuilt-audioSearch analyzers split long
media files into multiple `contents[]` segments. Previously,
`_extract_sections()` only read `contents[0]`, causing truncated
duration, missing transcript, and incomplete fields for any video/audio
longer than a single scene.

Now iterates all segments and merges:
- duration: global min(startTimeMs) → max(endTimeMs)
- markdown: concatenated with `---` separators
- fields: same-named fields collected into per-segment list
- metadata (kind, resolution): taken from first segment

Single-segment results (documents, short audio) are unaffected.

Update test fixture to realistic 3-segment video structure and expand
assertions to verify multi-segment merging. Add documentation for
multi-segment processing and speaker diarization limitation.
- Improve class docstring: clarify endpoint (Azure AI Foundry URL with
  example), credential (AzureKeyCredential vs Entra ID), and analyzer_id
  (prebuilt/custom with auto-selection behavior and reference links)
- Add SUPPORTED_MEDIA_TYPES comments explaining MIME-based matching
  behavior and add missing file types per CU service docs
- Use namespaced logger to align with other packages
- Remove ContentLimits and related code/tests
- Rename DEFAULT_MAX_WAIT to DEFAULT_MAX_WAIT_SECONDS for clarity
- Add vector_store_id field to FileSearchConfig (None = auto-create)
- Track _owns_vector_store to only delete auto-created stores on close()
- Remove vector_store_name; use internal _DEFAULT_VECTOR_STORE_NAME
- Add inline comments for private state fields
- Document output_sections default in docstring
- Update AGENTS.md, samples, and tests
Resolve conflict in azure_responses_agent/agent.py by taking upstream
(AzureOpenAIResponsesClient -> FoundryChatClient rename)
Follow Azure AI Search provider pattern: create the client eagerly in
__init__, make __aenter__ a no-op. This ensures __aexit__/close() is
always safe to call and eliminates the _ensure_initialized() workaround.
Replace direct OpenAI client usage with FileSearchBackend ABC:
- OpenAIFileSearchBackend: for OpenAIChatClient (Responses API)
- FoundryFileSearchBackend: for FoundryChatClient (Azure Foundry)
- Shared base _OpenAICompatBackend for common vector store CRUD

FileSearchConfig now takes a backend instead of openai_client.
Factory methods from_openai() and from_foundry() for convenience.

BREAKING: FileSearchConfig(openai_client=...) -> FileSearchConfig.from_openai(...)
- Poll vector store indexing (create_and_poll) to ensure file_search
  returns results immediately after upload
- Set status to failed when vector store upload fails
- Skip get_analyzed_document tool in file_search mode to prevent
  LLM from bypassing RAG
- Simplify sample auth: single credential, direct parameters
- Use from_foundry backend for Foundry project endpoints
@yungshinlintw yungshinlintw changed the title Python: [WIP] [Python] Add agent-framework-azure-contentunderstanding package (DO NOT REVIEW) Python: [WIP] [Python] Add agent-framework-azure-ai-contentunderstanding package (DO NOT REVIEW) Mar 26, 2026
- Add module-level docstrings to __init__.py and _context_provider.py
- Use Self return type for __aenter__ (with typing_extensions fallback)
- Use explicit typed params for __aexit__ signature
- Add sync TokenCredential to AzureCredentialTypes union
- Pass AGENT_FRAMEWORK_USER_AGENT to ContentUnderstandingClient
- Remove unused ContentLimits from public API and tests
- Fix FileSearchConfig tests to match refactored backend API
- Fix lifecycle tests to match eager client initialization
- Refactor _analyze_file to return DocumentEntry instead of mutating dict
- Remove TokenCredential from AzureCredentialTypes (fixes mypy/pyright CI)
- Remove OpenAIFileSearchBackend/FoundryFileSearchBackend from public API
  (internal to FileSearchConfig factory methods)
- Remove DocumentStatus from public exports (implementation detail)
- Update file_search comments to reflect backend-agnostic design
- Add DocumentStatus enum, analysis/upload duration tracking
- Add combined timeout for CU analysis + vector store upload
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 36 out of 38 changed files in this pull request and generated 3 comments.

…led tasks

- _file_search.py: Remove unused logger and logging import
- 01-multimodal_agent/README.md: Remove accidentally pasted Python script
- _context_provider.py close(): Await cancelled tasks before closing
  client to prevent 'Task destroyed but pending' warnings
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 36 out of 38 changed files in this pull request and generated 3 comments.

- Add _sanitize_doc_key() to strip control characters, collapse
  whitespace, and cap length at 255 chars — prevents prompt injection
  via crafted filenames in extend_instructions() calls.
- Track accepted doc_keys in step 3 so step 5 only injects content
  for files actually analyzed this turn, not pre-existing duplicates.
- Soften duplicate upload instruction wording (remove IMPORTANT/caps).
Previously _pending_tasks, _pending_uploads, and _uploaded_file_ids
were stored on self, shared across all sessions. This caused
cross-session leakage: Session A's background task results could be
injected into Session B's context.

Now these are stored in the per-session state dict. Global copies
(_all_pending_tasks, _all_uploaded_file_ids) are kept on self only
for best-effort cleanup in close().

Add 2 new TestSessionIsolation tests verifying that background tasks
and resolved content stay within their originating session.
Only MARKDOWN and FIELDS are handled by _extract_sections().
Remove FIELD_GROUNDING, TABLES, PARAGRAPHS, SECTIONS to avoid
exposing dead options to users.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 36 out of 38 changed files in this pull request and generated 1 comment.

- Use SDK .value property with recursive extraction for object/array fields
- Object: AmountDue -> {Amount: 610, CurrencyCode: USD} (was raw SDK dict)
- Array: LineItems -> list of flattened items (was raw SDK list)
- Update invoice fixture with object/array fields from prebuilt-invoice
- Add 3 unit tests for object, array, and nested object field extraction
- Use SDK AnalysisInput model instead of raw body dict for begin_analyze
- Forward content_range from additional_properties to CU (page/time ranges)
- Extract CU warnings with code/message/target (ODataV4Format) into output
- Include content-level category from classifier analyzers
- Add 5 new tests: warnings, category, content_range forwarding
- Fix pyright with explicit casts; fix en-dash lint (RUF002)
@yungshinlintw yungshinlintw force-pushed the yslin/contentunderstanding-context-provider branch from 5677c0b to 9f31124 Compare March 27, 2026 21:52
yungshinlin and others added 5 commits March 27, 2026 15:00
- Fix start_time_ms=0 treated as falsy by 'or' short-circuit, use
  'is None' checks instead for duration and segment time extraction
- Update warnings test to use RAI ContentFiltered codes
- Enrich warnings extraction to include code/message/target (ODataV4Format)
- Add multi-segment video category test with per-segment assertions
- Extract _constants.py: SUPPORTED_MEDIA_TYPES, MIME_ALIASES, analyzer maps
- Extract _detection.py: file detection, MIME sniffing, doc key derivation
- Extract _extraction.py: result extraction, field flattening, LLM formatting
- _context_provider.py delegates via thin wrappers (793 lines, was 1255)
- Update test imports to use _constants.py for SUPPORTED_MEDIA_TYPES
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

.NET: [Feature]: Azure Content Understanding context provider for multimodal document analysis

5 participants