{"id":25113,"date":"2026-01-26T09:37:21","date_gmt":"2026-01-26T17:37:21","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/?p=25113"},"modified":"2026-01-26T09:37:21","modified_gmt":"2026-01-26T17:37:21","slug":"introducing-the-evals-for-agent-interop-starter-kit","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/introducing-the-evals-for-agent-interop-starter-kit\/","title":{"rendered":"Introducing the Evals for Agent Interop starter kit"},"content":{"rendered":"<p>As enterprise customers roll out and govern AI agents through <a href=\"https:\/\/aka.ms\/agent365\"><strong>Agent 365<\/strong><\/a>, they have been asking for pre-canned evals they can run out of the box. They want transparent, reproducible evaluations that reflect their own work in realistic environments, including interoperability: how agents connect across stacks and into Agent 365 systems and tools. In response, we are investing in a comprehensive evaluation suite across <a href=\"https:\/\/learn.microsoft.com\/en-us\/microsoft-agent-365\/tooling-servers-overview\"><strong>Agent 365 Tools<\/strong><\/a>, with realistic scenarios, configurable rubrics, and results that stand up to governance and audit as customers deploy agents into production. 
Introducing <strong>Evals for Agent Interop<\/strong>, a way to evaluate those cross-stack connections end to end in realistic scenarios.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2025\/11\/Pillars.webp\"><img decoding=\"async\" class=\"alignnone size-large wp-image-24946\" src=\"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2025\/11\/Pillars-1024x566.webp\" alt=\"Image of the pillars of Agent 365 - registry, access control, visualization, interoperability, security.\" width=\"1024\" height=\"566\" srcset=\"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2025\/11\/Pillars-1024x566.webp 1024w, https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2025\/11\/Pillars-300x166.webp 300w, https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2025\/11\/Pillars-768x424.webp 768w, https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2025\/11\/Pillars.webp 1164w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><\/p>\n<h2 aria-level=\"3\"><span data-contrast=\"none\">Introducing Evals for Agent Interop<\/span><\/h2>\n<p>As a first step, we\u2019re launching \u2018Evals for Agent Interop\u2019, a starter evaluation kit. It provides curated scenarios and representative data that emulate real digital work, along with an evaluation harness that organizations can use to self-run their agents across Microsoft 365 surfaces (Email, Documents, Teams, Calendar, and more). 
It\u2019s designed to be simple to start, yet capable enough to reveal quality, efficiency, robustness, and user experience tradeoffs between agent implementations, so organizations can make informed choices quickly.<\/p>\n<h2 aria-level=\"3\"><iframe title=\"YouTube video player\" src=\"\/\/www.youtube.com\/embed\/Js-XtfNgNLs?si=L9R5CLKCQfCXAnW-\" width=\"560\" height=\"315\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\" data-mce-fragment=\"1\"><\/iframe><\/h2>\n<p><strong>Get started<\/strong>: Download the starter evals and harness from our repo (<a href=\"https:\/\/aka.ms\/EvalsForAgentInterop\"><em>https:\/\/aka.ms\/EvalsForAgentInterop<\/em><\/a>). We currently support Email and Calendar scenarios, and we\u2019re rapidly expanding the kit with new scenarios, richer rubrics, and additional judge options.<\/p>\n<p><strong>Leaderboard<\/strong>: Strawman agents, frameworks, and LLMs<\/p>\n<p>To help organizations benchmark and compare, we\u2019re introducing a leaderboard that reports results for strawman agents built on different stacks: combinations of agent frameworks (e.g., Semantic Kernel, LangGraph) and LLMs (e.g., GPT 5.2). This gives organizations a clear view of how various approaches perform on the same scenarios and rubrics. The leaderboard will evolve as we add more agent types and frameworks, helping organizations determine the right set of agents for their Agent 365 Tools.<\/p>\n<h2 aria-level=\"3\"><span data-contrast=\"none\">Why\u00a0it\u00a0matters<\/span><\/h2>\n<p>Customers want to more easily optimize their AI agents for their unique business. Enterprise AI is shifting from isolated model metrics to customer-informed evaluation. Businesses want to define rubrics, calibrate AI judges, and correlate offline results with production signals, tightening iteration cycles from months to days to hours. 
At Microsoft, we recognize that customers expect to bring their own grading criteria and scrutinize datasets for domain fit before they trust an agent in their environment. \u2018Evals for Agent Interop\u2019 is purpose-built for this new reality, unifying evaluation needs in one path: start with pre-canned evals, then tailor them to your context.<\/p>\n<h2 aria-level=\"3\"><span data-contrast=\"none\">How Evals for Agent Interop works<\/span><\/h2>\n<p>\u2018Evals for Agent Interop\u2019 ships with templated, realistic, declarative evaluation specs. The harness measures programmatically verifiable signals (schema adherence, tool call correctness, policy checks) alongside calibrated AI judge assessments for qualities like helpfulness, coherence, and tone. This yields consistent, transparent, and reproducible results that teams can track over time, compare across agent variants, and share across organizations.<\/p>\n<h2 aria-level=\"3\"><span data-contrast=\"none\">How it will evolve into a full evaluation suite<\/span><span data-ccp-props=\"{&quot;134245418&quot;:true,&quot;134245529&quot;:true,&quot;335559738&quot;:160,&quot;335559739&quot;:80}\">\u00a0<\/span><\/h2>\n<p>We\u2019re building toward a full suite that helps organizations choose the right set of agents to run on their Agent 365 Tools:<\/p>\n<p>Product teams within Microsoft define rubrics, train and calibrate judges, ship scenarios and data, and correlate offline scores with production metrics.<\/p>\n<p>Customers bring their own data and grading logic via a shared evaluation spec that becomes the single source of truth for both offline grading and online guardrails at runtime. We\u2019ll support custom tenant rubrics, with LLM or human grading for ambiguous cases.<\/p>\n<p>Packaged governance includes audit trails, documented rubrics, and privacy posture aligned to usage. 
Over time, we intend to co-publish capability manifests, tool schemas, and calibration methods to foster transparency and community validation.<\/p>\n<h2 aria-level=\"3\"><span data-contrast=\"none\">What can organizations do with the Evals for Agent Interop kit?<\/span><\/h2>\n<p>With \u2018Evals for Agent Interop\u2019, organizations can compare multiple agent candidates head-to-head on the same scenarios and rubrics, quantify quality and risk controls, and verify improvements (for example, a fine-tuned model or a different LLM) before broad rollout. As we expand the suite, these offline signals will align with online evaluation, so organizations can move from confidence to controlled deployment: faster, safer, and with clearer accountability.<\/p>\n<h2 aria-level=\"3\"><span data-contrast=\"none\">Where to start (and what\u2019s next)?<\/span><span data-ccp-props=\"{&quot;134245418&quot;:true,&quot;134245529&quot;:true,&quot;335559738&quot;:160,&quot;335559739&quot;:80}\">\u00a0<\/span><\/h2>\n<p>Clone the GitHub repo (<a href=\"https:\/\/aka.ms\/EvalsForAgentInterop\"><em>https:\/\/aka.ms\/EvalsForAgentInterop<\/em><\/a>) with the starter evals and harness. 
Run the included scenarios to baseline your agents and understand gaps.<\/p>\n<p>Tailor rubrics to your domain, then re-run to see how agent behavior shifts under your constraints.<\/p>\n<p>We\u2019ll expand \u2018Evals for Agent Interop\u2019 with new scenario families (document collaboration, communications, scheduling and tasking), richer scoring, and broader judge options, while integrating more tightly with Agent 365 Tools so evaluations and runtime guardrails share one source of truth.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2025\/12\/Picture2.webp\"><img decoding=\"async\" class=\"alignnone size-large wp-image-25147\" src=\"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2025\/12\/Picture2-1024x485.webp\" alt=\"Picture of Evals for Agent Interoperability\" width=\"1024\" height=\"485\" srcset=\"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2025\/12\/Picture2-1024x485.webp 1024w, https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2025\/12\/Picture2-300x142.webp 300w, https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2025\/12\/Picture2-768x364.webp 768w, https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2025\/12\/Picture2-1536x728.webp 1536w, https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2025\/12\/Picture2-2048x970.webp 2048w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>We\u2019re launching Evals for Agent Interop, a starter evaluation kit that provides curated scenarios and representative data that emulate real digital work, and an evaluation harness that organizations can use to self-run their agents across Microsoft 365 
surfaces.<\/p>\n","protected":false},"author":204405,"featured_media":25122,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[411,1],"tags":[],"class_list":["post-25113","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-microsoft-365-copilot-microsoft-365-developer","category-microsoft-365-developer"],"acf":[],"blog_post_summary":"<p>We\u2019re launching Evals for Agent Interop, a starter evaluation kit that provides curated scenarios and representative data that emulate real digital work, and an evaluation harness that organizations can use to self-run their agents across Microsoft 365 surfaces.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\/v2\/posts\/25113","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\/v2\/users\/204405"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\/v2\/comments?post=25113"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\/v2\/posts\/25113\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\/v2\/media\/25122"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\/v2\/media?parent=25113"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\/v2\/categories?post=25113"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\
/v2\/tags?post=25113"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}