One endpoint. Smarter spend. The model router in Foundry Models picks the optimal LLM for every prompt in real time, based on signals like complexity, reasoning, and task type. Now with access to 28 frontier models, the model router makes model selection easier for developers and reduces manual overhead. This article walks you through running evaluations with a new open-source GitHub repo designed explicitly for the model router.
Before diving in, here are a few notes that developers should consider when building with the model router:
- The effective context window equals the smallest underlying model’s window. Oversized prompts only succeed if the router happens to pick a model that can handle them.
- Claude models require separate deployment before the router can route to them.
- Routing decisions are text-only. Vision inputs are accepted, but images don’t influence the routing decision and audio isn’t supported.
- Available regions are currently East US 2 and Sweden Central, in Global Standard and Data Zone Standard deployment types.
Doesn’t Foundry already have benchmarks? What does this do?
Microsoft Foundry already provides enterprise-grade evaluations. This repo is an open-source alternative that can be used alongside Foundry’s benchmarking tools. It’s especially useful before you’re ready to operationalize and when you just need a fast, defensible answer on whether the model router belongs in your stack.
As a developer integrating the model router, you need hard answers to questions like:
Quality: On my prompts, does the model router’s auto-selected model match or beat the single model I’d otherwise pick?
Cost: Including the model router’s own input-prompt billing on top of underlying model costs, am I actually saving money end-to-end, or just shifting spend around?
Latency: Do the routing step and the selected model’s response time cancel out the savings from using a smaller model?
Subset impact: If I lock down a model subset for compliance, what does that cost me in quality and price?
The repo gives you:
- A local, scriptable pipeline that measures quality, cost, and latency in one run.
- Router-aware cost math — input-prompt markup + underlying model pricing, resolved per-response from the model the router selected.
- Bias-controlled LLM-as-a-judge scoring (dual-ordered pairwise to cancel position bias).
- Value and efficiency composites — quality-per-dollar and quality-per-second — so trade-offs are explicit.
- Model-distribution reporting so you can see which underlying models the model router is reaching for. Useful for sanity-checking Balanced vs Cost vs Quality runs, and for validating model subsets.
- An optional hand-off (run_foundry_eval.py) that submits results back into Foundry’s enterprise tooling for cloud-graded quality, governance, and portal visibility, giving you the best of both worlds.
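To make the dual-ordered pairwise idea concrete, here is a minimal sketch. It is not the repo’s implementation: the judge signature, the verdict labels, and the rule that disagreement between the two orderings counts as a tie are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> verdict,
# where the verdict is "first", "second", or "tie". A real judge would call an LLM.
Judge = Callable[[str, str, str], str]


def dual_ordered_verdict(judge: Judge, prompt: str,
                         router_answer: str, baseline_answer: str) -> str:
    """Ask the judge twice with the answer order swapped, then reconcile."""
    # Pass 1: the router answer is shown first.
    pass_1 = judge(prompt, router_answer, baseline_answer)
    # Pass 2: the baseline answer is shown first.
    pass_2 = judge(prompt, baseline_answer, router_answer)

    # Map positional verdicts back to the system that produced each answer.
    winner_1 = {"first": "router", "second": "baseline"}.get(pass_1, "tie")
    winner_2 = {"first": "baseline", "second": "router"}.get(pass_2, "tie")

    # Assumption: only count a win when both orderings agree.
    return winner_1 if winner_1 == winner_2 else "tie"


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with maximal position bias: always prefers the first answer."""
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """Toy content-based judge: prefers the longer answer, wherever it appears."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

With the purely position-biased toy judge, the two passes contradict each other and the verdict collapses to a tie. With the content-based toy judge, both passes agree and the win stands. That is the sense in which asking in both orders cancels position bias.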
A step-by-step guide to running evals
Quick preview (no API keys needed)
Just want to see what the output looks like before wiring anything up? Either open WALKTHROUGH.ipynb in Jupyter and click Run All, or run the demo script:
# macOS / Linux
bash scripts/demo.sh
# Windows
.\scripts\demo.ps1
This uses mock data, so you can preview the full dashboard before you’ve touched a single API key.
Step 1 — Install
git clone https://github.com/microsoft/foundry-model-router-autoeval.git
cd foundry-model-router-autoeval
pip install -e ".[dev]"
Requires Python 3.9+. Make sure your model router deployment lives in East US 2 or Sweden Central — those are the only supported regions today.
Step 2 — Set up credentials
cp .env.example .env
Fill in .env with three sets of Azure endpoints and keys:
- Model router (the thing you’re testing): AZURE_MODEL_ROUTER_*
- Baseline model (what you’re comparing against, e.g. GPT-5): AZURE_OPENAI_* and AZURE_BASELINE_DEPLOYMENT
- Judge model (the LLM that scores answers): AZURE_JUDGE_*
If you plan to let the router reach Claude models, make sure you’ve deployed those separately from the Foundry model catalog first — the model router won’t deploy them for you.
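Before the first run, a quick sanity check on the credentials can save a failed batch. The sketch below is an assumption-laden helper, not part of the repo: it only knows the variable prefixes and the one full name listed above, and it reports any group that has no non-empty value.

```python
import os
from typing import List, Mapping

# Prefixes and the single full name come from the credential list above.
# The exact variable names behind each prefix are defined in .env.example.
REQUIRED_PREFIXES = ("AZURE_MODEL_ROUTER_", "AZURE_OPENAI_", "AZURE_JUDGE_")
REQUIRED_NAMES = ("AZURE_BASELINE_DEPLOYMENT",)


def missing_credentials(env: Mapping[str, str]) -> List[str]:
    """Return the credential groups that have no non-empty value in env."""
    missing = []
    for prefix in REQUIRED_PREFIXES:
        has_value = any(
            key.startswith(prefix) and value.strip()
            for key, value in env.items()
        )
        if not has_value:
            missing.append(prefix + "*")
    for name in REQUIRED_NAMES:
        if not env.get(name, "").strip():
            missing.append(name)
    return missing


if __name__ == "__main__":
    # os.environ only reflects .env values once something has loaded
    # that file into the process environment.
    problems = missing_credentials(os.environ)
    print("Missing credential groups:", ", ".join(problems) if problems else "none")
```

This only checks that something is set for each group. It does not verify that the endpoints or keys are valid.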
Step 3 — Configure
Open configs/default.yaml and tweak: baseline deployment name, judge model + concurrency, and pricing. Or pick a preset: quick_test.yaml, large_scale.yaml, or foundry.yaml.
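The pricing values matter because they drive the cost side of the report. As a rough sketch of the router-aware cost math described earlier (the router’s input-prompt charge plus the selected model’s own pricing, resolved per response), consider the following. Every price, model name, and field name here is a made-up placeholder, and the per-1M-token unit is an assumption; the repo’s actual config schema may differ.

```python
from dataclasses import dataclass

# Placeholder prices in USD per 1M tokens. These are NOT real rates.
ROUTER_INPUT_PRICE_PER_M = 0.10
MODEL_PRICES_PER_M = {
    "small-model": {"input": 0.50, "output": 2.00},
    "large-model": {"input": 2.00, "output": 8.00},
}


@dataclass
class RoutedResponse:
    selected_model: str  # underlying model reported for this response
    input_tokens: int
    output_tokens: int


def response_cost(resp: RoutedResponse) -> float:
    """Router input-prompt charge plus the selected model's token cost, in USD."""
    prices = MODEL_PRICES_PER_M[resp.selected_model]
    router_fee = resp.input_tokens * ROUTER_INPUT_PRICE_PER_M
    model_cost = (resp.input_tokens * prices["input"]
                  + resp.output_tokens * prices["output"])
    return (router_fee + model_cost) / 1_000_000


def quality_per_dollar(score: float, cost_usd: float) -> float:
    """A simple value composite: judge score divided by cost."""
    if cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return score / cost_usd
```

Because the price is looked up from the model that handled each response, a run that routes mostly to smaller models shows up as cheaper even after the router’s own input charge is added.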
Step 4 — Bring your prompts
Format them as JSONL, CSV, or a SQL database. Minimum required fields: id and prompt.
{"id": "001", "prompt": "Explain quantum entanglement in simple terms."}
Keep prompts within the smallest underlying model’s context window unless you’re explicitly testing model-subset behavior — otherwise you’ll get context-exceeded errors on prompts that route to smaller models.
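A small pre-flight check on a JSONL file can catch both problems, missing fields and oversized prompts, before any API call. The sketch below is illustrative only: the 4-characters-per-token estimate is a crude heuristic, and the window size is a placeholder rather than a documented limit.

```python
import json
from typing import Iterable, List, Tuple

CHARS_PER_TOKEN = 4                       # rough heuristic, not a tokenizer
SMALLEST_CONTEXT_WINDOW_TOKENS = 128_000  # placeholder; check the router docs


def check_prompts(lines: Iterable[str],
                  max_tokens: int = SMALLEST_CONTEXT_WINDOW_TOKENS
                  ) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for a JSONL prompt file."""
    problems: List[Tuple[int, str]] = []
    seen_ids = set()
    for line_no, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_no, "record is not a JSON object"))
            continue
        missing = [field for field in ("id", "prompt") if field not in record]
        if missing:
            problems.append((line_no, "missing field(s): " + ", ".join(missing)))
            continue
        record_id = str(record["id"])
        if record_id in seen_ids:
            problems.append((line_no, f"duplicate id: {record_id}"))
        seen_ids.add(record_id)
        estimated_tokens = len(str(record["prompt"])) // CHARS_PER_TOKEN
        if estimated_tokens > max_tokens:
            problems.append(
                (line_no, f"prompt is roughly {estimated_tokens} tokens, over {max_tokens}")
            )
    return problems
```

To use it on a file, pass the open file handle: `check_prompts(open("my_prompts.jsonl", encoding="utf-8"))`. An empty list means no problems were found by these checks.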
Step 5 — Run the eval
# Validate first (no API calls made)
python scripts/run_eval.py --dry-run
# Real run with your data
python scripts/run_eval.py --dataset my_prompts.jsonl --sample-size 100
# Interrupted? Just resume
python scripts/run_eval.py --resume --output-dir results/my-run
For 500–1,000+ prompts, see docs/how-to-resume-and-scale.md. Mind the rate limits: the default on Global Standard for the 2025-11-18 model router is 250 RPM / 250k TPM — concurrency settings in YAML should respect that.
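One way to pick a concurrency value that respects those limits is a back-of-the-envelope estimate: find the request rate that both limits allow, then convert it to in-flight requests using average latency. The function below is a sketch of that arithmetic, not a repo utility. The headroom factor and the steady-traffic assumption are mine, and retries or bursty traffic will change the picture.

```python
import math


def max_safe_concurrency(rpm_limit: float,
                         tpm_limit: float,
                         avg_tokens_per_request: float,
                         avg_latency_seconds: float,
                         headroom: float = 0.8) -> int:
    """Estimate how many requests can be in flight without exceeding rate limits."""
    if avg_tokens_per_request <= 0 or avg_latency_seconds <= 0:
        raise ValueError("token and latency averages must be positive")
    # The binding limit is whichever allows fewer requests per minute.
    sustainable_rpm = min(rpm_limit, tpm_limit / avg_tokens_per_request) * headroom
    # In-flight requests ~= arrival rate (per second) x average latency.
    concurrency = math.floor(sustainable_rpm * avg_latency_seconds / 60)
    return max(1, concurrency)
```

For example, at 250 RPM / 250k TPM with requests averaging 2,000 tokens and 6 seconds, the token limit binds at 125 requests per minute, and with 20% headroom the estimate comes out to 10 concurrent requests.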
Step 6 — View the results
Look in results/<run-name>/. The main artifact is dashboard.html — a self-contained report with 8 charts including model-selection distribution. Open it in any browser.
| File | Description |
|---|---|
| dashboard.html | Self-contained report with 8 charts including model-selection distribution |
| report.md | Markdown summary |
| results.json | Machine-readable results |
| detailed_results.csv | Per-prompt detail, including which underlying model handled each request |
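If you want the model-selection distribution outside the dashboard, the per-prompt CSV is the natural place to look. The sketch below tallies it with the standard library. The column name `selected_model` is a guess, since the article does not list the CSV headers, so check the actual header row and adjust.

```python
import csv
from collections import Counter
from typing import Dict, Iterable


def model_distribution(csv_lines: Iterable[str],
                       column: str = "selected_model") -> Dict[str, float]:
    """Share of requests handled by each underlying model, most frequent first."""
    reader = csv.DictReader(csv_lines)
    # Skip rows where the column is absent or empty.
    counts = Counter(row[column] for row in reader if row.get(column))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {model: count / total for model, count in counts.most_common()}
```

Pass it an open file handle, for example `model_distribution(open("results/my-run/detailed_results.csv", newline="", encoding="utf-8"))`. Comparing the shares across Balanced, Cost, and Quality runs is a quick way to confirm that the routing mode is shifting traffic the way you expect.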
Step 7 (optional) — Compare runs or push to Foundry
# Diff two runs side-by-side (Balanced vs Cost vs Quality)
python scripts/compare_results.py results/run-a results/run-b
# Submit to Microsoft Foundry for cloud-graded quality
pip install -e ".[foundry]"
az login
python scripts/run_foundry_eval.py --input-dir results/full-eval
Getting started
Everything you need is in two places:
- The Model Router documentation on Microsoft Learn covers deployment, configuration, supported models, and pricing.
- The foundry-model-router-autoeval repo on GitHub has the eval pipeline, walkthrough notebook, and preset configs.