{"id":2266,"date":"2026-05-19T09:34:07","date_gmt":"2026-05-19T16:34:07","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/foundry\/?p=2266"},"modified":"2026-05-19T09:34:07","modified_gmt":"2026-05-19T16:34:07","slug":"how-to-run-evals-for-model-router","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/foundry\/how-to-run-evals-for-model-router\/","title":{"rendered":"How to run evals for the model router"},"content":{"rendered":"<p>One endpoint. Smarter spend. Model router in Foundry Models picks the optimal LLM for every prompt in real time based on signals like complexity, reasoning, and task type. Now with access to 28 frontier models, the model router makes model selection easier for developers and reduces manual overhead. This article walks you through how to run evaluations using a new open-source GitHub repo designed explicitly for the model router. <a class=\"cta_button_link\" href=\"https:\/\/aka.ms\/modelrouter\/evaluations\" target=\"_blank\" rel=\"noopener\">Access the Eval Repo on GitHub<\/a><\/p>\n<p>Before diving in, here are a few notes that developers should consider when building with the model router:<\/p>\n<ul>\n<li>The effective context window equals the smallest underlying model&#8217;s window. Oversized prompts only succeed if the router happens to pick a model that can handle them.<\/li>\n<li>Claude models require separate deployment before the router can route to them.<\/li>\n<li>Routing decisions are text-only. Vision inputs are accepted, but images don&#8217;t influence the routing decision and audio isn&#8217;t supported.<\/li>\n<li>Available regions are currently East US 2 and Sweden Central, in Global Standard and Data Zone Standard deployment types.<\/li>\n<\/ul>\n<h2>Doesn&#8217;t Foundry already have benchmarks? What does this do?<\/h2>\n<p>Microsoft Foundry already provides enterprise-grade evaluations. 
This <a href=\"https:\/\/aka.ms\/modelrouter\/evaluations\">repo<\/a> is an open-source alternative that can be used alongside Foundry&#8217;s benchmarking tools. It&#8217;s especially useful before you&#8217;re ready to operationalize and when you just need a fast, defensible answer on whether the model router belongs in your stack.<\/p>\n<p>As a developer integrating the model router, you need hard answers to questions like:<\/p>\n<p><strong>Quality:<\/strong> On my prompts, does the model router&#8217;s auto-selected model match or beat the single model I&#8217;d otherwise pick?<\/p>\n<p><strong>Cost:<\/strong> Including the model router&#8217;s own input-prompt billing on top of underlying model costs, am I actually saving money end-to-end, or just shifting spend around?<\/p>\n<p><strong>Latency:<\/strong> Do the routing step and the selected model&#8217;s response time cancel out the savings from using a smaller model?<\/p>\n<p><strong>Subset impact:<\/strong> If I lock down a model subset for compliance, what does that cost me in quality and price?<\/p>\n<p>The repo gives you:<\/p>\n<ul>\n<li>A local, scriptable pipeline that measures quality, cost, and latency in one run.<\/li>\n<li>Router-aware cost math \u2014 input-prompt markup + underlying model pricing, resolved per-response from the model the router selected.<\/li>\n<li>Bias-controlled LLM-as-a-judge scoring (dual-ordered pairwise to cancel position bias).<\/li>\n<li>Value and efficiency composites \u2014 quality-per-dollar and quality-per-second \u2014 so trade-offs are explicit.<\/li>\n<li>Model-distribution reporting so you can see which underlying models the model router is reaching for. 
Useful for sanity-checking Balanced vs Cost vs Quality runs, and for validating model subsets.<\/li>\n<li>An optional hand-off (<code>run_foundry_eval.py<\/code>) that submits results back into Foundry&#8217;s enterprise tooling for cloud-graded quality, governance, and portal visibility \u2014 best of both worlds.<\/li>\n<\/ul>\n<h2>A step-by-step guide to running evals<\/h2>\n<h3>Quick preview (no API keys needed)<\/h3>\n<p>Just want to see what the output looks like before wiring anything up? Either open <code>WALKTHROUGH.ipynb<\/code> in Jupyter and click <strong>Run All<\/strong>, or run the demo script:<\/p>\n<pre><code class=\"language-bash\"># macOS \/ Linux\r\nbash scripts\/demo.sh\r\n\r\n# Windows\r\n.\\scripts\\demo.ps1<\/code><\/pre>\n<p>This uses mock data, so you can preview the full dashboard before you&#8217;ve touched a single API key.<\/p>\n<h3>Step 1 \u2014 Install<\/h3>\n<pre><code class=\"language-bash\">git clone https:\/\/github.com\/microsoft\/foundry-model-router-autoeval.git\r\ncd foundry-model-router-autoeval\r\npip install -e \".[dev]\"<\/code><\/pre>\n<p>Requires Python 3.9+. Make sure your model router deployment lives in East US 2 or Sweden Central \u2014 those are the only supported regions today.<\/p>\n<h3>Step 2 \u2014 Set up credentials<\/h3>\n<pre><code class=\"language-bash\">cp .env.example .env<\/code><\/pre>\n<p>Fill in <code>.env<\/code> with three sets of Azure endpoints and keys:<\/p>\n<ul>\n<li><strong>Model router<\/strong> (the thing you&#8217;re testing) \u2014 <code>AZURE_MODEL_ROUTER_*<\/code><\/li>\n<li><strong>Baseline model<\/strong> (what you&#8217;re comparing against, e.g. 
GPT-5) \u2014 <code>AZURE_OPENAI_*<\/code> and <code>AZURE_BASELINE_DEPLOYMENT<\/code><\/li>\n<li><strong>Judge model<\/strong> (the LLM that scores answers) \u2014 <code>AZURE_JUDGE_*<\/code><\/li>\n<\/ul>\n<p>If you plan to let the router reach Claude models, make sure you&#8217;ve deployed those separately from the Foundry model catalog first \u2014 the model router won&#8217;t deploy them for you.<\/p>\n<h3>Step 3 \u2014 Configure<\/h3>\n<p>Open <code>configs\/default.yaml<\/code> and tweak: baseline deployment name, judge model + concurrency, and pricing. Or pick a preset: <code>quick_test.yaml<\/code>, <code>large_scale.yaml<\/code>, or <code>foundry.yaml<\/code>.<\/p>\n<h3>Step 4 \u2014 Bring your prompts<\/h3>\n<p>Format them as JSONL, CSV, or a SQL database. Minimum required fields: <code>id<\/code> and <code>prompt<\/code>.<\/p>\n<pre><code class=\"language-json\">{\"id\": \"001\", \"prompt\": \"Explain quantum entanglement in simple terms.\"}<\/code><\/pre>\n<p>Keep prompts within the smallest underlying model&#8217;s context window unless you&#8217;re explicitly testing model-subset behavior \u2014 otherwise you&#8217;ll get context-exceeded errors on prompts that route to smaller models.<\/p>\n<h3>Step 5 \u2014 Run the eval<\/h3>\n<pre><code class=\"language-bash\"># Validate first (no API calls made)\r\npython scripts\/run_eval.py --dry-run\r\n\r\n# Real run with your data\r\npython scripts\/run_eval.py --dataset my_prompts.jsonl --sample-size 100\r\n\r\n# Interrupted? Just resume\r\npython scripts\/run_eval.py --resume --output-dir results\/my-run<\/code><\/pre>\n<p>For 500\u20131,000+ prompts, see <code>docs\/how-to-resume-and-scale.md<\/code>. Mind the rate limits: the default on Global Standard for the 2025-11-18 model router is 250 RPM \/ 250k TPM \u2014 concurrency settings in YAML should respect that.<\/p>\n<h3>Step 6 \u2014 View the results<\/h3>\n<p>Look in <code>results\/&lt;run-name&gt;\/<\/code>. 
The main artifact is <code>dashboard.html<\/code> \u2014 a self-contained report with 8 charts including model-selection distribution. Open it in any browser.<\/p>\n<table>\n<thead>\n<tr>\n<th>File<\/th>\n<th>Description<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>dashboard.html<\/code><\/td>\n<td>Self-contained report with 8 charts including model-selection distribution<\/td>\n<\/tr>\n<tr>\n<td><code>report.md<\/code><\/td>\n<td>Markdown summary<\/td>\n<\/tr>\n<tr>\n<td><code>results.json<\/code><\/td>\n<td>Machine-readable results<\/td>\n<\/tr>\n<tr>\n<td><code>detailed_results.csv<\/code><\/td>\n<td>Per-prompt detail, including which underlying model handled each request<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>Step 7 (optional) \u2014 Compare runs or push to Foundry<\/h3>\n<pre><code class=\"language-bash\"># Diff two runs side-by-side (Balanced vs Cost vs Quality)\r\npython scripts\/compare_results.py results\/run-a results\/run-b\r\n\r\n# Submit to Microsoft Foundry for cloud-graded quality\r\npip install -e \".[foundry]\"\r\naz login\r\npython scripts\/run_foundry_eval.py --input-dir results\/full-eval<\/code><\/pre>\n<h2>Getting started<\/h2>\n<p>Everything you need is in two places:<\/p>\n<ul>\n<li>The <a href=\"https:\/\/learn.microsoft.com\/azure\/ai-foundry\/concepts\/model-router\">Model Router documentation on Microsoft Learn<\/a> covers deployment, configuration, supported models, and pricing.<\/li>\n<li>The <a href=\"https:\/\/aka.ms\/modelrouter\/evaluations\">foundry-model-router-autoeval repo on GitHub<\/a> has the eval pipeline, walkthrough notebook, and preset configs.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Walk through running quality, cost, and latency evaluations for the Foundry model router using an open-source GitHub repo designed for router-aware eval 
pipelines.<\/p>\n","protected":false},"author":192942,"featured_media":1563,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[140,66,41,55,139],"class_list":["post-2266","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-microsoft-foundry","tag-benchmarking","tag-evaluations","tag-foundry-models","tag-model-router","tag-open-source"],"acf":[],"blog_post_summary":"<p>Walk through running quality, cost, and latency evaluations for the Foundry model router using an open-source GitHub repo designed for router-aware eval pipelines.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/posts\/2266","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/users\/192942"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/comments?post=2266"}],"version-history":[{"count":2,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/posts\/2266\/revisions"}],"predecessor-version":[{"id":2268,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/posts\/2266\/revisions\/2268"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/media\/1563"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/media?parent=2266"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/categories?post=2266"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/tags?pos
t=2266"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}