Python: Foundry Evals integration for Python by alliscode · Pull Request #4750 · microsoft/agent-framework

alliscode · 2026-03-17T21:15:01Z

Add evaluation framework with local and Foundry-hosted evaluator support:

EvalItem/EvalResult core types with conversation splitting strategies
@evaluator decorator for defining custom evaluation functions
LocalEvaluator for running evaluations locally
FoundryEvals provider for Azure AI Foundry hosted evaluations
evaluate_agent() orchestration with expected values support
evaluate_workflow() for multi-agent workflow evaluation
Comprehensive test suite and evaluation samples

Contribution Checklist

The code builds clean without any errors or warnings
The PR follows the Contribution Guidelines
All unit tests pass, and I have added new tests where possible
Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

python/packages/core/agent_framework/_evaluation.py

python/packages/core/agent_framework/__init__.py

python/packages/core/agent_framework/_eval.py

python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py

markwallace-microsoft · 2026-03-19T20:44:50Z

Python Test Coverage Report •

File	Stmts	Miss	Cover	Missing
packages/core/agent_framework
_evaluation.py	622	73	88%	153, 161, 475, 477, 586, 589, 668–670, 675, 712–715, 771–772, 775, 781–783, 787, 820–822, 876, 911, 923–925, 930, 954–959, 1050, 1128–1129, 1131–1135, 1141, 1179, 1524, 1526, 1534, 1544, 1548, 1593, 1611–1612, 1683, 1689, 1704, 1708–1710, 1740, 1746–1750, 1782, 1813–1814, 1816, 1841–1842, 1847
packages/foundry/agent_framework_foundry
_foundry_evals.py	256	4	98%	451, 456, 626, 691
TOTAL	28846	3425	88%

Python Unit Test Overview

Tests	Skipped	Failures	Errors	Time
5715	20 💤	0 ❌	0 🔥	1m 31s ⏱️

Copilot

Pull request overview

This PR adds a provider-agnostic evaluation framework to the Python Agent Framework, with both local (no-API) evaluators and an Azure AI Foundry-backed provider, plus end-to-end samples that demonstrate agent and workflow evaluation patterns.

Changes:

Introduces core evaluation types and orchestration (EvalItem, EvalResults, evaluate_agent(), evaluate_workflow()) plus local checks (LocalEvaluator, @evaluator).
Adds Azure AI Foundry provider integration (FoundryEvals) and trace/target evaluation helpers.
Adds/updates evaluation samples (Foundry evals patterns + self-reflection groundedness) and expands test coverage for local evaluation.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 12 comments.

Show a summary per file

File	Description
python/samples/05-end-to-end/evaluation/self_reflection/self_reflection.py	Migrates groundedness scoring to `FoundryEvals` and updates batch runner.
python/samples/05-end-to-end/evaluation/self_reflection/README.md	Updates self-reflection sample documentation for Foundry Evals usage and env vars.
python/samples/05-end-to-end/evaluation/self_reflection/.env.example	Updates env var example to `FOUNDRY_PROJECT_ENDPOINT`.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_workflow_sample.py	New sample: evaluate multi-agent workflows with Foundry evaluators.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_traces_sample.py	New sample: evaluate existing responses / traces via Foundry.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_multiturn_sample.py	New sample: demonstrate conversation split strategies for eval.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_mixed_sample.py	New sample: mix `LocalEvaluator` with Foundry evaluators in one call.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_all_patterns_sample.py	New “kitchen sink” sample covering all evaluation patterns.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_agent_sample.py	New sample: evaluate_agent patterns + direct `FoundryEvals.evaluate()`.
python/samples/05-end-to-end/evaluation/foundry_evals/README.md	New README describing Foundry eval samples and entry points.
python/samples/05-end-to-end/evaluation/foundry_evals/.env.example	New env example for Foundry eval samples.
python/samples/03-workflows/evaluation/evaluate_workflow.py	New workflow evaluation sample using local checks.
python/samples/02-agents/evaluation/evaluate_with_expected.py	New sample demonstrating expected outputs/tool call expectations.
python/samples/02-agents/evaluation/evaluate_agent.py	New sample demonstrating basic local evaluation for agents.
python/packages/core/tests/core/test_observability.py	Adjusts OTLP exporter-related test skipping.
python/packages/core/tests/core/test_local_eval.py	Adds a comprehensive test suite for local eval framework behaviors.
python/packages/core/agent_framework/_evaluation.py	Adds the provider-agnostic evaluation framework implementation.
python/packages/core/agent_framework/init.py	Re-exports evaluation APIs/types from the package root.
python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py	Adds the Foundry-backed `FoundryEvals` provider + trace/target helpers.
python/packages/azure-ai/agent_framework_azure_ai/init.py	Exposes `FoundryEvals` and helper functions from the azure-ai package.

python/packages/core/tests/core/test_observability.py

python/packages/foundry/agent_framework_foundry/_foundry_evals.py

python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py

python/samples/05-end-to-end/evaluation/self_reflection/self_reflection.py

python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_traces_sample.py

python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_mixed_sample.py

python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_agent_sample.py

python/samples/05-end-to-end/evaluation/self_reflection/self_reflection.py

python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_workflow_sample.py

python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_traces_sample.py

python/packages/foundry/agent_framework_foundry/_foundry_evals.py

python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py

python/packages/core/agent_framework/_evaluation.py

TaoChenOSU · 2026-03-26T20:19:34Z

python/packages/foundry/agent_framework_foundry/_foundry_evals.py

+    Raises:
+        ValueError: If the name is not recognized.
+    """
+    if name.startswith("builtin."):


nit: should we also check if the name is actually one of the builtins?

Copilot on behalf of alliscode: Good call. We considered strict validation but decided to pass through any builtin.* name as-is since Foundry may add new evaluators that aren't in our local mapping yet. Only unqualified shorthand names (relevance, coherence, etc.) are validated against the known set. Added a comment explaining this rationale — see bb5f546.

I'm working with the Foundry Evals team to get these names defined in the Foundry SDK so this mapping issue should be resolved soon.

TaoChenOSU · 2026-03-26T20:24:39Z

python/packages/foundry/agent_framework_foundry/_foundry_evals.py

+        qualified = _resolve_evaluator(name)
+        short = name if not name.startswith("builtin.") else name.split(".")[-1]
+
+        entry: dict[str, Any] = {


nit: is there a model defined somewhere with all the parameters documented and typed? The dictionary is hard to reason about and creates potential maintenance overhead in the future.

Copilot on behalf of alliscode: The testing_criteria dict structure is dictated by the OpenAI SDK's evals.create() API — it accepts dict[str, Any] with no typed model yet. We could introduce a TypedDict for internal representation, but it would just wrap the same dict. For now, added an inline comment with a link to the API reference docs. If the SDK adds typed models in the future, we'll adopt them. See bb5f546.

TaoChenOSU · 2026-03-26T20:26:08Z

python/packages/foundry/agent_framework_foundry/_foundry_evals.py

+            if qualified in _AGENT_EVALUATORS:
+                # Agent evaluators: query/response as conversation arrays
+                mapping: dict[str, str] = {
+                    "query": "{{item.query_messages}}",


{{item.query_messages}}

Are these placeholders?

Copilot on behalf of alliscode: Yes, these are Mustache-style template placeholders used by the OpenAI evals API data_mapping feature. They reference fields in the JSONL data items (e.g. {{item.query_messages}} maps to the query_messages field in each item). Added an inline comment explaining this — see bb5f546.

TaoChenOSU · 2026-03-26T20:32:55Z

python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py

+        if remaining <= 0:
+            return EvalResults(provider=provider, eval_id=eval_id, run_id=run_id, status="timeout")
+        logger.debug("Eval run %s status: %s (%.0fs remaining)", run_id, run.status, remaining)
+        await asyncio.sleep(min(max(poll_interval, 1.0), remaining, 60.0))


If remaining is less than the poll interval, should we still go for another round? Also, there is the 60 second implicit upper bound on the poll interval, which may throw people off.

Removed 60 second upper bound

TaoChenOSU · 2026-03-26T20:34:06Z

python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py

+        await asyncio.sleep(min(max(poll_interval, 1.0), remaining, 60.0))
+
+
+def _extract_result_counts(run: Any) -> dict[str, int] | None:


nit: is it possible to narrow down the type of run?

Copilot on behalf of alliscode: Updated — removed the | Any union entirely. This is a brand new feature so no backward compat needed. The type is now just RunRetrieveResponse. See d05d032.

TaoChenOSU · 2026-03-26T20:36:03Z

python/packages/foundry/agent_framework_foundry/_foundry_evals.py

+
+def _extract_result_counts(run: Any) -> dict[str, int] | None:
+    """Safely extract result_counts from an eval run object."""
+    counts = getattr(run, "result_counts", None)


If we can tighten the type, we can get rid of a lot of the getattr in the code.

Copilot on behalf of alliscode: Agreed — done. Replaced most getattr calls with direct field access using the typed RunRetrieveResponse and PerTestingCriteriaResult types from the OpenAI SDK. Still using getattr for result_counts since the return value is a Pydantic model that needs extraction, but the individual fields (errored/failed/passed/total) are now accessed by known name. See bb5f546.

TaoChenOSU · 2026-03-26T20:48:02Z

python/packages/foundry/agent_framework_foundry/_foundry_evals.py

+            Provide this or *openai_client*.
+        openai_client: An ``AsyncOpenAI`` client with evals API.
+        model_deployment: Model deployment name for the evaluator LLM judge.
+        evaluators: Evaluator names (e.g. ``["relevance", "tool_call_accuracy"]``).


Question: are all the Foundry evaluators supported by a bare openai client?

Copilot on behalf of alliscode: Great point — updated. FoundryEvals now accepts client: FoundryChatClient instead of openai_client: AsyncOpenAI. Since the builtin.* evaluators require a Foundry endpoint, the type should reflect that. The project_client fallback is retained for the standalone functions (evaluate_traces, evaluate_foundry_target). All samples updated. See 669d67c.

python/packages/foundry/agent_framework_foundry/_foundry_evals.py

python/samples/02-agents/evaluation/evaluate_with_expected.py

eavanvalkenburg · 2026-03-27T09:35:09Z

python/packages/foundry/agent_framework_foundry/_foundry_evals.py

we should move this to the foundry project

Copilot on behalf of alliscode: Agree this belongs in the foundry package long-term. Tracking as a follow-up — it's a bigger structural change that we'd like to handle in a separate PR to keep this one focused on the eval integration itself.

Copilot on behalf of alliscode: Update — done! Moved _foundry_evals.py from agent_framework_azure_ai to agent_framework_foundry package. All imports now use from agent_framework.foundry import FoundryEvals (via the lazy re-export namespace). Tests moved to foundry/tests/, all 134 pass. See 398621f.

python/samples/02-agents/evaluation/evaluate_agent.py

eavanvalkenburg · 2026-03-27T09:47:54Z

python/samples/02-agents/evaluation/evaluate_agent.py

+    LocalEvaluator,
+    evaluate_agent,
+    evaluator,
+    keyword_check,


I'm a bit worried about discovery of these kinds of checks, could we think about doing something like a helper class with static methods or adding static methods on LocalEvaluator for these? Or some other way, we already have a lot in the main namespace and these might get lost, since they also do not have a clear prefix (like evaluat...)

Copilot on behalf of alliscode: Good point on discoverability. The checks are currently top-level functions (keyword_check, tool_called_check, etc.) importable from agent_framework. We could namespace them under a checks module or helper class — will discuss with the team and address in a follow-up.

eavanvalkenburg · 2026-03-27T09:49:28Z

python/samples/02-agents/evaluation/evaluate_agent.py

+            "Will it rain in London tomorrow?",
+            "What should I wear for 30°C weather?",
+        ],
+        evaluators=local,


could we also allow you to pass in a list of evaluators? (and then we wrap them in a LocalEvaluator)? or does that not make sense?

Copilot on behalf of alliscode: Agree — accepting a list of check functions and auto-wrapping in LocalEvaluator would reduce boilerplate. Tracking for a follow-up.

Copilot on behalf of alliscode: Already supported — _resolve_evaluators() auto-wraps bare EvalCheck callables into LocalEvaluator. You can pass evaluators=[keyword_check('hello'), tool_called_check('get_weather')] directly.

python/samples/05-end-to-end/evaluation/foundry_evals/.env.example

eavanvalkenburg · 2026-03-27T13:42:42Z

python/packages/core/agent_framework/_evaluation.py

+            eval_id="local",
+            run_id=eval_name,
+            status="completed",
+            result_counts={"passed": passed, "failed": failed, "errored": 0},


since we don't catch any errors, how would errored ever be set?

Copilot on behalf of alliscode: Done — removed the errored property. The errored count in result_counts dict is still populated by Foundry (from the API response) but we no longer expose a dedicated property for it. See 5b54c3f.

eavanvalkenburg · 2026-03-27T13:44:10Z

python/packages/core/agent_framework/_evaluation.py

+                    )
+                )
+
+            if item_passed:


this can be calculated from the underlying per_check counters, or not?

Copilot on behalf of alliscode: The per_check counters track per-evaluator stats (each check's passed/failed count), while passed/failed at the top level track per-item stats (an item passes only if ALL checks pass). These are different dimensions, so computing total from per_check would be incorrect. The current total = passed + failed (properties computed from result_counts) is the right approach.

eavanvalkenburg · 2026-03-27T13:44:24Z

python/packages/core/agent_framework/_evaluation.py

+
+async def evaluate_agent(
+    *,
+    agent: Any | None = None,


Suggested change

agent: Any | None = None,

agent: SupportsAgentRun | None = None,

Copilot on behalf of alliscode: Done — changed agent: Any | None to agent: SupportsAgentRun | None. See 5b54c3f.

eavanvalkenburg · 2026-03-27T15:56:13Z

python/packages/core/agent_framework/_evaluation.py

+# region Public orchestration functions
+
+
+async def evaluate_agent(


maybe add a overload for single query, response, expected tools calls, etc. vs sequences of them

Copilot on behalf of alliscode: Good suggestion — a single-query overload would simplify the common case. Will consider adding this in a follow-up.

Copilot on behalf of alliscode: Already supported — queries accepts str | Sequence[str], so evaluate_agent(queries='single question', ...) works out of the box.

eavanvalkenburg · 2026-03-27T16:01:52Z

python/packages/core/agent_framework/_evaluation.py

+    """
+    import warnings
+
+    warnings.warn(


why do we need to deprecate something new, just remove

Copilot on behalf of alliscode: Done — removed evaluate_response() entirely (was brand new, no need to deprecate). See 5b54c3f.

Merged and refactored eval module per Eduard's PR review: - Merge _eval.py + _local_eval.py into single _evaluation.py - Convert EvalItem from dataclass to regular class - Rename to_dict() to to_eval_data() - Convert _AgentEvalData to TypedDict - Simplify check system: unified async pattern with isawaitable - Parallelize checks and evaluators with asyncio.gather - Add all/any mode to tool_called_check - Fix bool(passed) truthy bug in _coerce_result - Remove deprecated function_evaluator/async_function_evaluator aliases - Remove _MinimalAgent, tighten evaluate_agent signature - Set self.name in __init__ (LocalEvaluator, FoundryEvals) - Limit FoundryEvals to AsyncOpenAI only - Type project_client as AIProjectClient - Remove NotImplementedError continuous eval code - Add evaluation samples in 02-agents/ and 03-workflows/ - Update all imports and tests (167 passing) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Use cast(list[Any], x) with type: ignore[redundant-cast] comments to satisfy both mypy (which considers casting Any redundant) and pyright strict mode (which needs explicit casts to narrow Unknown types). Also fix evaluator decorator check_name type annotation to be explicitly str, resolving mypy str|Any|None mismatch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Replace module-level assert with if/raise for evaluator set consistency checks (ruff S101 disallows bare assert) - Add type: ignore[arg-type] and pyright: ignore[reportArgumentType] on OpenAI SDK evals API calls that pass dicts where typed params are expected (SDK accepts dicts at runtime) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix all_passed ignoring parent result_counts when sub_results present - Fix _extract_tool_calls: parse string arguments via json.loads before falling back to None (real LLM responses use string arguments) - Sanitize _raw_arguments to '[unparseable]' to avoid leaking sensitive tool-call data to external evaluation services - Add NOTE comment on to_eval_data message serialization dropping non-text content (tool calls, results) - Eliminate double conversation split in _evaluate_via_dataset: build JSONL dicts directly from split_messages + AgentEvalConverter - Raise poll_interval floor from 0.1s to 1.0s to prevent rate-limit exhaustion - Fix MagicMock(name=...) bug in test: sets display name not .name attr - Fix mock_output_item.sample: use MagicMock object instead of dict so _fetch_output_items exercises error/usage/input/output extraction Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Code fixes: - Move import-time RuntimeError checks to unit tests (avoids breaking imports for all users on developer set-drift mistake) - _filter_tool_evaluators now raises ValueError when all evaluators require tools but no items have tools (was silently substituting) - Add poll_interval upper bound (60s) to prevent single-iteration sleep - Log exc_info=True in _fetch_output_items for debugging API changes - Fix evaluate() docstring: remove claim about Responses API optimization - Validate target dict has 'type' key in evaluate_foundry_target - Document to_eval_data() limitation: non-text content is omitted Tests: - TestEvaluatorSetConsistency: verify _AGENT/_TOOL subsets of _BUILTIN - TestEvaluateTracesAgentId: agent_id-only path with lookback_hours - TestFilterToolEvaluatorsRaises: ValueError on all-tool no-items - TestEvaluateFoundryTargetValidation: target without 'type' key - Assert items==[] on failed/canceled poll results - Mock output_items.list in response_ids test for full flow - TestAllPassedSubResults: result_counts=None + sub_results delegation and parent failures override sub_results - TestBuildOverallItemEmpty: empty workflow outputs returns None Skipped r5-07 (_raw_arguments length hint): marginal debugging value, could leak content size information. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…s=...) The referenced function doesn't exist; the correct API is evaluate_traces(response_ids=...) from the azure-ai package. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Remove to_eval_data() from EvalItem (dead code after r4-05 JSONL refactor) - Migrate 15 tests from to_eval_data() to split_messages() - Update sample to use split_messages() + Message properties - Remove unimplemented Responses API optimization docstring claim - Update split_messages() docstring to not reference removed method Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The method was never called — evaluate() uses _evaluate_via_dataset, and evaluate_traces() calls _evaluate_via_responses_impl directly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…format - Remove import of non-existent _foundry_memory_provider module (incorrectly kept during rebase conflict resolution) - Apply ruff formatter to test_local_eval.py and get-started samples Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The upstream provider-leading client refactor (microsoft#4818) made client= a required parameter on Agent(). Update the three getting-started eval samples to use FoundryChatClient with FOUNDRY_PROJECT_ENDPOINT, matching the standard pattern from 01-get-started samples. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace ~80 lines of manual OpenAI evals API code (create_eval, run_eval, manual polling, raw JSONL params) with FoundryEvals: - evaluate_groundedness() uses FoundryEvals.evaluate() with EvalItem - Remove create_openai_client(), create_eval(), run_eval() functions - Remove openai SDK type imports (DataSourceConfigCustom, etc.) - run_self_reflection_batch creates FoundryEvals instance once, reuses it for all iterations across all prompts Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Migrate all foundry_evals samples from AzureOpenAIResponsesClient to FoundryChatClient - Update env var from AZURE_AI_PROJECT_ENDPOINT to FOUNDRY_PROJECT_ENDPOINT - Use AzureCliCredential consistently across all samples - Fix README.md: correct function names (evaluate_dataset -> FoundryEvals.evaluate, evaluate_responses -> evaluate_traces) - Update self_reflection .env.example and README.md Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…jectClient AIProjectClient from azure.ai.projects.aio requires an async credential. Switch all foundry_evals samples from azure.identity.AzureCliCredential to azure.identity.aio.AzureCliCredential. Also pass project_client to FoundryChatClient instead of duplicating endpoint+credential. Close credential in self_reflection sample to avoid resource leak. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Convert all Example:: / Typical usage:: code blocks to .. code-block:: python format matching codebase convention (both _evaluation.py and _foundry_evals.py) - Add async pagination in _fetch_output_items via async for (handles large result sets) - Replace hasattr(__aenter__) with isinstance(client, AsyncOpenAI) in _resolve_openai_client - Move AsyncOpenAI import from TYPE_CHECKING to runtime (needed for isinstance) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix tests: use MagicMock(spec=AsyncOpenAI) for project_client mocks (isinstance check now requires proper type, not duck-typing) - Fix tests: replace mock_page.__iter__ with _AsyncPage helper for async for - Fix evaluate_response: auto-extract queries from response messages when query is not provided (previously always raised ValueError) - Add debug logging when skipping internal _-prefixed executor IDs Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- T1: Add comment explaining builtin.* pass-through in _resolve_evaluator - T2: Add comment referencing OpenAI evals API for testing_criteria dict - T3: Document Mustache-style {{item.*}} template placeholders - T4: Document poll loop 60s sleep upper bound rationale - T5: Narrow run type to RunRetrieveResponse, use typed field access instead of vars()/getattr dance in _extract_result_counts and _extract_per_evaluator; use run.error and run.report_url directly - T6: Clarify openai_client docstring re: Azure Foundry endpoint - T8: Remove misleading empty expected_tool_calls from sample - Update tests to match real SDK PerTestingCriteriaResult shape Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

RunRetrieveResponse is the correct type — no backward compat needed for a brand new feature. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

FoundryEvals now takes client: FoundryChatClient as its primary parameter instead of openai_client: AsyncOpenAI. The builtin.* evaluators require a Foundry endpoint, so the type should reflect that. - FoundryEvals.__init__: client: FoundryChatClient replaces openai_client - evaluate_traces / evaluate_foundry_target: same change - _resolve_openai_client: extracts .client from FoundryChatClient - project_client fallback retained for standalone functions - All samples updated to construct FoundryChatClient and pass as client= - Tests updated (openai_client= → client=) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

If a developer sets a higher poll_interval, respect it. Only clamp to remaining time and enforce a 1s minimum for rate-limit protection. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…mple Co-authored-by: Eduard van Valkenburg <eavanvalkenburg@users.noreply.github.com>

Co-authored-by: Eduard van Valkenburg <eavanvalkenburg@users.noreply.github.com>

- Rename model_deployment -> model across FoundryEvals and all samples - Make model param optional, resolves from client.model - Convert EvalResults from dataclass to regular class - Remove deprecated evaluate_response() function - Refactor splitters: BUILT_IN_SPLITTERS dict + standalone functions - Change per_turn_items from classmethod to staticmethod - Simplify EvalCheck type alias to use Awaitable[CheckResult] - Remove errored property from EvalResults - Remove default value from Evaluator protocol eval_name - Rename assert_passed -> raise_for_status, add EvalNotPassedError - Type agent param as SupportsAgentRun | None - Fix Arguments docstring - Update __init__.py exports - Update all tests and samples Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Move _foundry_evals.py from azure-ai to foundry package - Move test_foundry_evals.py to foundry/tests/ - Update lazy re-exports in agent_framework.foundry namespace - Update .pyi type stubs - All samples now import from agent_framework.foundry - Split tool-call evaluation into evaluate_tool_calls_sample.py - Fix all_passed to check errored count from result_counts - Fix raise_for_status to include errored item details Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

FoundryEvals() now works zero-config when FOUNDRY_PROJECT_ENDPOINT and FOUNDRY_MODEL environment variables are set. Auto-creates a FoundryChatClient under the hood, matching the established env var pattern. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…rror check - Remove unused _normalize_queries function and its tests - Add pyright ignore for EvalAPIError None check (defensive guard) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

markwallace-microsoft added documentation Improvements or additions to documentation python labels Mar 17, 2026

github-actions bot changed the title ~~Foundry Evals integration for Python~~ Python: Foundry Evals integration for Python Mar 17, 2026

alliscode force-pushed the af-foundry-evals-python branch from a0edd5f to fe9e621 Compare March 17, 2026 21:21

eavanvalkenburg reviewed Mar 18, 2026

View reviewed changes

alliscode force-pushed the af-foundry-evals-python branch 6 times, most recently from 15d8640 to aad92ac Compare March 19, 2026 20:41

alliscode force-pushed the af-foundry-evals-python branch 8 times, most recently from a74c9d1 to 8d8b6e8 Compare March 25, 2026 17:55

alliscode marked this pull request as ready for review March 25, 2026 19:43

Copilot AI review requested due to automatic review settings March 25, 2026 19:43

Copilot started reviewing on behalf of alliscode March 25, 2026 19:46 View session

Copilot AI reviewed Mar 25, 2026

View reviewed changes

alliscode force-pushed the af-foundry-evals-python branch from d266ee2 to 997a379 Compare March 25, 2026 20:01

moonbox3 reviewed Mar 26, 2026

View reviewed changes

TaoChenOSU reviewed Mar 26, 2026

View reviewed changes

eavanvalkenburg reviewed Mar 27, 2026

View reviewed changes

alliscode and others added 2 commits March 27, 2026 11:05

alliscode and others added 27 commits March 27, 2026 11:05

Fix error message: evaluate_responses() → evaluate_traces(response_id…

e1e232b

…s=...) The referenced function doesn't exist; the correct API is evaluate_traces(response_ids=...) from the azure-ai package. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Reduce default eval timeout from 600s to 180s (3 minutes)

834fd07

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove dead _evaluate_via_responses method from FoundryEvals

b00412a

The method was never called — evaluate() uses _evaluate_via_dataset, and evaluate_traces() calls _evaluate_via_responses_impl directly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Revert unrelated formatting changes to get-started samples

b568898

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fix lint errors in eval samples (E501, ASYNC240, formatting)

641c25a

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove evaluate_all_patterns_sample.py (redundant with focused samples)

8288bd9

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Revert test_observability.py to upstream/main (not our test)

9c050ef

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove unnecessary Any union from run type annotations

1af02d0

RunRetrieveResponse is the correct type — no backward compat needed for a brand new feature. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove implicit 60s upper bound on poll interval

1156a34

If a developer sets a higher poll_interval, respect it. Only clamp to remaining time and enforce a 1s minimum for rate-limit protection. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove 1s floor on poll interval — let the developer control it

b5142f1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Update python/samples/05-end-to-end/evaluation/foundry_evals/.env.exa…

d0a57ef

…mple Co-authored-by: Eduard van Valkenburg <eavanvalkenburg@users.noreply.github.com>

Update python/samples/02-agents/evaluation/evaluate_agent.py

2d4fb5f

Co-authored-by: Eduard van Valkenburg <eavanvalkenburg@users.noreply.github.com>

alliscode force-pushed the af-foundry-evals-python branch from 398621f to b63dd34 Compare March 27, 2026 18:08

alliscode and others added 2 commits March 27, 2026 11:17

Fix pyright errors: remove dead _normalize_queries, suppress EvalAPIE…

fde1bb9

…rror check - Remove unused _normalize_queries function and its tests - Add pyright ignore for EvalAPIError None check (defensive guard) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

		await asyncio.sleep(min(max(poll_interval, 1.0), remaining, 60.0))


		def _extract_result_counts(run: Any) -> dict[str, int] \| None:

	agent: Any \| None = None,
	agent: SupportsAgentRun \| None = None,

		# region Public orchestration functions


		async def evaluate_agent(

Conversation

alliscode commented Mar 17, 2026

Contribution Checklist

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

markwallace-microsoft commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Python Unit Test Overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

markwallace-microsoft commented Mar 19, 2026 •

edited

Loading