How to run evals for the model router

One endpoint. Smarter spend. Model router in Foundry Models picks the optimal LLM for every prompt in real time based on signals like complexity, reasoning, and task type. Now with access to 28 frontier models, the model router makes model selection easier for developers and reduces manual overhead. This article walks you through how to run evaluations using a new open-source GitHub repo designed explicitly for the model router. The eval repo is on GitHub at https://github.com/microsoft/foundry-model-router-autoeval.

Before diving in, here are a few notes that developers should consider when building with the model router:

The effective context window equals the smallest underlying model’s window. Oversized prompts only succeed if the router happens to pick a model that can handle them.
Claude models require separate deployment before the router can route to them.
Routing decisions are text-only. Vision inputs are accepted, but images don’t influence the routing decision and audio isn’t supported.
Available regions are currently East US 2 and Sweden Central, in Global Standard and Data Zone Standard deployment types.
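
To make the context-window note concrete, here is a minimal sketch of checking whether a prompt is safe regardless of which model the router picks. The model names and window sizes are placeholders invented for illustration, not published limits, and the token count is assumed to come from whatever tokenizer you already use.

```python
from typing import Dict

# Hypothetical context-window sizes in tokens. These are placeholders,
# not published limits for any real model.
CONTEXT_WINDOWS: Dict[str, int] = {
    "small-model": 128_000,
    "medium-model": 200_000,
    "large-model": 400_000,
}


def effective_context_window(windows: Dict[str, int]) -> int:
    """Return the smallest window among the models the router may select."""
    if not windows:
        raise ValueError("at least one underlying model is required")
    return min(windows.values())


def fits_every_model(prompt_tokens: int, windows: Dict[str, int]) -> bool:
    """True if the prompt fits no matter which model the router picks."""
    return prompt_tokens <= effective_context_window(windows)
```

A prompt that fails `fits_every_model` may still succeed on some requests, but only when the router happens to select a model with a large enough window.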

Doesn’t Foundry already have benchmarks? What does this do?

Microsoft Foundry already provides enterprise-grade evaluations. This repo is an open-source alternative that can be used alongside Foundry’s benchmarking tools. It’s especially useful before you’re ready to operationalize and when you just need a fast, defensible answer on whether the model router belongs in your stack.

As a developer integrating the model router, you need hard answers to questions like:

Quality: On my prompts, does the model router’s auto-selected model match or beat the single model I’d otherwise pick?

Cost: Including the model router’s own input-prompt billing on top of underlying model costs, am I actually saving money end-to-end, or just shifting spend around?

Latency: Do the routing step and the selected model’s response time cancel out the savings from using a smaller model?

Subset impact: If I lock down a model subset for compliance, what does that cost me in quality and price?

The repo gives you:

A local, scriptable pipeline that measures quality, cost, and latency in one run.
Router-aware cost math — input-prompt markup + underlying model pricing, resolved per-response from the model the router selected.
Bias-controlled LLM-as-a-judge scoring (dual-ordered pairwise to cancel position bias).
Value and efficiency composites — quality-per-dollar and quality-per-second — so trade-offs are explicit.
Model-distribution reporting so you can see which underlying models the model router is reaching for. Useful for sanity-checking Balanced vs Cost vs Quality runs, and for validating model subsets.
An optional hand-off (run_foundry_eval.py) that submits results back into Foundry’s enterprise tooling for cloud-graded quality, governance, and portal visibility — best of both worlds.
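
If dual-ordered pairwise judging is new to you, the idea can be sketched in a few lines: ask the judge twice with the answer order swapped, and only count a win when both orderings agree. This is an illustration of the general technique, not the repo's implementation. The `Judge` signature, the `"first"`/`"second"`/`"tie"` labels, and the rule that disagreement counts as a tie are all assumptions made for this example.

```python
from typing import Callable

# Assumed judge interface: (prompt, first_answer, second_answer) -> verdict,
# where the verdict is "first", "second", or "tie". In a real pipeline this
# would wrap a call to the judge LLM.
Judge = Callable[[str, str, str], str]

# Map a positional verdict back to the system that produced the answer.
_ROUTER_SHOWN_FIRST = {"first": "router", "second": "baseline"}
_BASELINE_SHOWN_FIRST = {"first": "baseline", "second": "router"}


def dual_ordered_verdict(
    judge: Judge, prompt: str, router_answer: str, baseline_answer: str
) -> str:
    """Return "router", "baseline", or "tie" after judging in both orders."""
    forward = _ROUTER_SHOWN_FIRST.get(
        judge(prompt, router_answer, baseline_answer), "tie"
    )
    reverse = _BASELINE_SHOWN_FIRST.get(
        judge(prompt, baseline_answer, router_answer), "tie"
    )
    # A judge that favors a position will contradict itself when the order
    # flips, so disagreement is treated as a tie rather than a win.
    return forward if forward == reverse else "tie"
```

A judge that always prefers whichever answer it sees first yields a tie on every prompt here, which is exactly the position bias the swap is meant to cancel.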

A step-by-step guide to running evals

Quick preview (no API keys needed)

Just want to see what the output looks like before wiring anything up? Either open WALKTHROUGH.ipynb in Jupyter and click Run All, or run the demo script:

# macOS / Linux
bash scripts/demo.sh

# Windows
.\scripts\demo.ps1

This uses mock data, so you can preview the full dashboard before you’ve touched a single API key.

Step 1 — Install

git clone https://github.com/microsoft/foundry-model-router-autoeval.git
cd foundry-model-router-autoeval
pip install -e ".[dev]"

Requires Python 3.9+. Make sure your model router deployment lives in East US 2 or Sweden Central — those are the only supported regions today.

Step 2 — Set up credentials

cp .env.example .env

Fill in .env with three sets of Azure endpoints and keys:

Model router (the thing you’re testing) — AZURE_MODEL_ROUTER_*
Baseline model (what you’re comparing against, e.g. GPT-5) — AZURE_OPENAI_* and AZURE_BASELINE_DEPLOYMENT
Judge model (the LLM that scores answers) — AZURE_JUDGE_*

If you plan to let the router reach Claude models, make sure you’ve deployed those separately from the Foundry model catalog first — the model router won’t deploy them for you.
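
Before your first run, it can help to confirm that all three credential groups are actually filled in. The sketch below is a standalone check, not part of the repo. It only looks for the prefixes and the one full variable name mentioned above, and it assumes .env uses plain KEY=VALUE lines.

```python
from typing import Dict, List, Mapping

# Prefixes and names taken from the credential groups listed above. The exact
# variable names under each prefix are defined by the repo's .env.example.
REQUIRED_PREFIXES = ("AZURE_MODEL_ROUTER_", "AZURE_OPENAI_", "AZURE_JUDGE_")
REQUIRED_NAMES = ("AZURE_BASELINE_DEPLOYMENT",)


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_credentials(env: Mapping[str, str]) -> List[str]:
    """List every credential group or variable that has no non-empty value."""
    missing: List[str] = []
    for prefix in REQUIRED_PREFIXES:
        if not any(key.startswith(prefix) and value for key, value in env.items()):
            missing.append(prefix + "*")
    for name in REQUIRED_NAMES:
        if not env.get(name):
            missing.append(name)
    return missing
```

For example, `missing_credentials(parse_env_text(open(".env").read()))` returns an empty list when every group has at least one value set.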

Step 3 — Configure

Open configs/default.yaml and tweak the baseline deployment name, the judge model and its concurrency, and pricing. Alternatively, pick a preset: quick_test.yaml, large_scale.yaml, or foundry.yaml.
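
The pricing values matter because router cost has two parts: the router's own charge on the input prompt, plus the price of whichever underlying model handled the request. Here is a rough sketch of that arithmetic, assuming the router charge is a per-input-token rate. Every price and model name below is a made-up placeholder, and the repo's own calculation may be structured differently.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Price:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# Hypothetical prices for illustration only. Use real pricing in your config.
MODEL_PRICES: Dict[str, Price] = {
    "small-model": Price(input_per_million=0.25, output_per_million=2.00),
    "large-model": Price(input_per_million=1.25, output_per_million=10.00),
}
ROUTER_INPUT_PER_MILLION = 0.14  # hypothetical router charge on input tokens


def response_cost(selected_model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one routed response: router charge plus model price."""
    price = MODEL_PRICES[selected_model]  # resolved from the model the router chose
    model_cost = (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000
    router_cost = input_tokens * ROUTER_INPUT_PER_MILLION / 1_000_000
    return model_cost + router_cost


def quality_per_dollar(quality_score: float, cost_usd: float) -> float:
    """Value composite: quality divided by cost."""
    if cost_usd <= 0:
        raise ValueError("cost must be positive")
    return quality_score / cost_usd
```

Because the price is looked up per response, two prompts of the same length can cost very different amounts depending on where the router sends them.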

Step 4 — Bring your prompts

Format them as JSONL, CSV, or a SQL database. Minimum required fields: id and prompt.

{"id": "001", "prompt": "Explain quantum entanglement in simple terms."}

Keep prompts within the smallest underlying model’s context window unless you’re explicitly testing model-subset behavior — otherwise you’ll get context-exceeded errors on prompts that route to smaller models.
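
If your prompts come from an export or a script, a quick pre-check of the JSONL file can catch problems before you spend any API calls. This is a standalone sketch rather than the repo's own validation (the `--dry-run` flag in Step 5 covers that). It checks only the two required fields named above.

```python
import json
from typing import Iterable, List, Tuple

REQUIRED_FIELDS = ("id", "prompt")


def validate_jsonl(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs. An empty list means no problems."""
    problems: List[Tuple[int, str]] = []
    for number, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            if record.get(field) in (None, ""):
                problems.append((number, f"missing or empty field: {field}"))
    return problems
```

To check a file, pass the open file handle directly, for example `validate_jsonl(open("my_prompts.jsonl", encoding="utf-8"))`.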

Step 5 — Run the eval

# Validate first (no API calls made)
python scripts/run_eval.py --dry-run

# Real run with your data
python scripts/run_eval.py --dataset my_prompts.jsonl --sample-size 100

# Interrupted? Just resume
python scripts/run_eval.py --resume --output-dir results/my-run

For 500–1,000+ prompts, see docs/how-to-resume-and-scale.md. Mind the rate limits: the default on Global Standard for the 2025-11-18 model router is 250 requests per minute (RPM) and 250k tokens per minute (TPM), so the concurrency settings in your YAML config should stay within that.
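
To turn those limits into a concurrency number, a back-of-the-envelope estimate is usually enough. The sketch below uses the 250 RPM / 250k TPM defaults quoted above. The average tokens per request and average latency are values you would measure yourself, and how the result maps onto the repo's YAML settings is left to you.

```python
def max_requests_per_minute(
    avg_tokens_per_request: int,
    rpm_limit: int = 250,
    tpm_limit: int = 250_000,
) -> int:
    """Sustainable request rate: whichever of RPM or TPM binds first."""
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    return min(rpm_limit, tpm_limit // avg_tokens_per_request)


def safe_concurrency(
    avg_tokens_per_request: int,
    avg_latency_seconds: float,
    rpm_limit: int = 250,
    tpm_limit: int = 250_000,
) -> int:
    """Estimate in-flight requests via Little's law: rate times latency."""
    per_minute = max_requests_per_minute(avg_tokens_per_request, rpm_limit, tpm_limit)
    return max(1, int(per_minute * avg_latency_seconds / 60))
```

With short prompts the request limit binds first. Once an average request passes 1,000 tokens, the token limit becomes the constraint.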

Step 6 — View the results

Look in results/<run-name>/. The main artifact is dashboard.html — a self-contained report with 8 charts including model-selection distribution. Open it in any browser.

File	Description
`dashboard.html`	Self-contained report with 8 charts including model-selection distribution
`report.md`	Markdown summary
`results.json`	Machine-readable results
`detailed_results.csv`	Per-prompt detail, including which underlying model handled each request
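
If you want the model-selection distribution outside the dashboard, the per-prompt CSV is the place to get it. The sketch below tallies a column of model names. The column name `selected_model` is a guess made for illustration, so check the header row of your own detailed_results.csv and pass the real name.

```python
import csv
import io
from collections import Counter
from typing import Dict


def model_distribution(csv_text: str, column: str = "selected_model") -> Dict[str, float]:
    """Share of requests handled by each model, most frequent first.

    `column` is a hypothetical header name. Replace it with the one your
    results file actually uses.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[column] for row in reader)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {model: count / total for model, count in counts.most_common()}
```

Reading the file is then one line, for example `model_distribution(open("results/my-run/detailed_results.csv", encoding="utf-8").read())`.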

Step 7 (optional) — Compare runs or push to Foundry

# Diff two runs side-by-side (Balanced vs Cost vs Quality)
python scripts/compare_results.py results/run-a results/run-b

# Submit to Microsoft Foundry for cloud-graded quality
pip install -e ".[foundry]"
az login
python scripts/run_foundry_eval.py --input-dir results/full-eval

Getting started

Everything you need is in two places:

The Model Router documentation on Microsoft Learn covers deployment, configuration, supported models, and pricing.
The foundry-model-router-autoeval repo on GitHub has the eval pipeline, walkthrough notebook, and preset configs.
