{"id":25662,"date":"2026-05-08T13:51:57","date_gmt":"2026-05-08T20:51:57","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/?p=25662"},"modified":"2026-05-11T09:40:19","modified_gmt":"2026-05-11T16:40:19","slug":"announcing-the-public-preview-of-the-microsoft-365-copilot-agent-evaluations-tool","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/announcing-the-public-preview-of-the-microsoft-365-copilot-agent-evaluations-tool\/","title":{"rendered":"Announcing the public preview of the Microsoft 365 Copilot Agent Evaluations tool"},"content":{"rendered":"<p><span data-contrast=\"auto\">Today we&#8217;re announcing the public preview of the <\/span><a href=\"https:\/\/learn.microsoft.com\/en-us\/microsoft-365\/copilot\/extensibility\/evaluations-cli-overview\"><span data-contrast=\"none\">Microsoft 365 Copilot Agent Evaluations tool<\/span><\/a><span data-contrast=\"auto\">. The Agent Evaluations CLI tool helps developers measure and improve the quality of agents they build for Microsoft 365 Copilot. The tool provides a command-line interface that sends prompts to a deployed agent, captures responses, and scores them with the help of Azure OpenAI models. 
It produces structured reports developers can use in their inner loop of development and in CI\/CD pipelines.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">This release is a step toward making rigorous, repeatable evaluation a standard part of how developers build for Microsoft 365 Copilot, alongside the broader work happening across the platform, from Work IQ to agent creation with the Microsoft 365 Agents Toolkit.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<h2 aria-level=\"2\"><span data-contrast=\"none\">Why evaluations\u00a0matter now<\/span><span data-ccp-props=\"{&quot;134245418&quot;:true,&quot;134245529&quot;:true,&quot;335559738&quot;:160,&quot;335559739&quot;:80}\">\u00a0<\/span><\/h2>\n<p><span data-contrast=\"auto\">Our mission is to enable Microsoft partners, ISVs, and enterprise developers to extend Microsoft 365 Copilot with custom agents, actions, and knowledge so Copilot can reason over any data and take action across any system.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">As agents move from demos into core business workflows, the bar for shipping rises with them. Customers expect agents that are accurate, grounded, and consistent across the breadth of real-world prompts they receive. Meeting that bar requires more than manual testing. It requires an evaluation framework that is objective, repeatable, and integrated into the developer workflow. The Agent Evaluations tool is designed to make that practical.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<h2 aria-level=\"2\"><span data-contrast=\"none\">What&#8217;s in the public preview<\/span><span data-ccp-props=\"{&quot;134245418&quot;:true,&quot;134245529&quot;:true,&quot;335559738&quot;:160,&quot;335559739&quot;:80}\">\u00a0<\/span><\/h2>\n<p><span data-contrast=\"auto\">The public preview brings the full evaluation loop into a simple command-line workflow. 
The CLI is designed to fit naturally into the way Microsoft 365 developers already build agents. Developers can invoke the CLI to evaluate declarative agents right inside the Microsoft 365 Agents Toolkit.<\/span><\/p>\n<ul>\n<li aria-setsize=\"-1\" data-leveltext=\"%1.\" data-font=\"\" data-listid=\"1\" data-list-defn-props=\"{&quot;335552541&quot;:0,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769242&quot;:[65533,0],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;%1.&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" data-aria-posinset=\"1\" data-aria-level=\"1\"><span data-contrast=\"auto\">The tool supports evaluation of single-turn and multi-turn conversations, making it possible to test how an agent retains context, handles follow-ups, and completes end-to-end tasks the way real users interact with it. <\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/li>\n<\/ul>\n<p style=\"padding-left: 40px;\"><a href=\"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2026\/05\/multi-turn-conversation.webp\"><img decoding=\"async\" class=\"alignnone size-full wp-image-25667\" src=\"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2026\/05\/multi-turn-conversation.webp\" alt=\"Example of a multi-turn conversation.\" width=\"988\" height=\"742\" srcset=\"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2026\/05\/multi-turn-conversation.webp 988w, https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2026\/05\/multi-turn-conversation-300x225.webp 300w, https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2026\/05\/multi-turn-conversation-768x577.webp 768w\" sizes=\"(max-width: 988px) 100vw, 988px\" \/><\/a><\/p>\n<ul>\n<li aria-setsize=\"-1\" data-leveltext=\"%1.\" data-font=\"\" data-listid=\"1\" 
data-list-defn-props=\"{&quot;335552541&quot;:0,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769242&quot;:[65533,0],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;%1.&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" data-aria-posinset=\"2\" data-aria-level=\"1\"><span data-contrast=\"auto\">The tool makes it easy to select which agent to run an evaluation against. The interactive agent picker lets testing teams, alongside development teams, evaluate agents. <\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li aria-setsize=\"-1\" data-leveltext=\"%1.\" data-font=\"\" data-listid=\"1\" data-list-defn-props=\"{&quot;335552541&quot;:0,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769242&quot;:[65533,0],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;%1.&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" data-aria-posinset=\"3\" data-aria-level=\"1\"><span data-contrast=\"auto\">Responses are then scored automatically by evaluators such as Coherence and Groundedness (LLM-based), ExactMatch and PartialMatch (code-based), and <\/span><a href=\"https:\/\/learn.microsoft.com\/en-us\/microsoft-365\/copilot\/extensibility\/evaluations-cli-overview#evaluation-metrics\"><span data-contrast=\"none\">more evaluators<\/span><\/a><span data-contrast=\"auto\">.<\/span><\/li>\n<\/ul>\n<ul>\n<li aria-setsize=\"-1\" data-leveltext=\"%1.\" data-font=\"\" data-listid=\"1\" data-list-defn-props=\"{&quot;335552541&quot;:0,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769242&quot;:[65533,0],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;%1.&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" data-aria-posinset=\"4\" data-aria-level=\"1\"><span data-contrast=\"auto\">Results are emitted in an HTML scorecard report. 
Developers can use the scorecard as a shareable artifact that captures objective evidence of agent quality across their inner loop, code reviews, and CI\/CD pipelines. <\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/li>\n<\/ul>\n<p style=\"padding-left: 40px;\"><a href=\"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2026\/05\/scorecard.webp\"><img decoding=\"async\" class=\"alignnone size-large wp-image-25668\" src=\"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2026\/05\/scorecard-1024x327.webp\" alt=\"Agent evaluation scorecard\" width=\"1024\" height=\"327\" srcset=\"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2026\/05\/scorecard-1024x327.webp 1024w, https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2026\/05\/scorecard-300x96.webp 300w, https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2026\/05\/scorecard-768x245.webp 768w, https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2026\/05\/scorecard-1536x490.webp 1536w, https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-content\/uploads\/sites\/73\/2026\/05\/scorecard.webp 1612w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><\/p>\n<p><span data-contrast=\"auto\">You can also access the evaluation skill wherever you vibe-code with your coding agents.<\/span><\/p>\n<h2 aria-level=\"2\"><span data-contrast=\"none\">Get started<\/span><span data-ccp-props=\"{&quot;134245418&quot;:true,&quot;134245529&quot;:true,&quot;335559738&quot;:160,&quot;335559739&quot;:80}\">\u00a0<\/span><\/h2>\n<p><span data-contrast=\"auto\">The\u00a0<\/span><span data-contrast=\"auto\">tool is free to install<\/span><span data-contrast=\"auto\"> during public preview<\/span><span data-contrast=\"auto\">. 
You&#8217;ll need a Microsoft 365 Copilot license, an agent deployed to your tenant, Node.js 24.12.0+, admin consent to run the tool in your tenant, and an Azure OpenAI endpoint for the LLM-judge evaluators. Ask your admin to <a href=\"https:\/\/github.com\/microsoft\/work-iq\/blob\/main\/ADMIN-INSTRUCTIONS.md\">enable the tool for your tenant<\/a> today<\/span><span data-contrast=\"auto\">.<\/span><\/p>\n<p><b><span data-contrast=\"auto\">Get\u00a0started:\u00a0<\/span><\/b><a href=\"https:\/\/learn.microsoft.com\/en-us\/microsoft-365\/copilot\/extensibility\/evaluations-cli-overview\"><span data-contrast=\"none\">Microsoft 365 Copilot Agent Evaluations CLI overview<\/span><\/a><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><b><span data-contrast=\"auto\">Source &amp; samples (GitHub):\u00a0<\/span><\/b><a href=\"https:\/\/github.com\/microsoft\/m365-copilot-eval\"><span data-contrast=\"none\">github.com\/microsoft\/m365-copilot-eval<\/span><\/a><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><b><span data-contrast=\"auto\">Agents Toolkit<\/span><\/b><span data-contrast=\"auto\">:\u00a0<\/span><a href=\"https:\/\/learn.microsoft.com\/en-us\/microsoft-365\/copilot\/extensibility\/build-declarative-agents\"><span data-contrast=\"none\">Create declarative agents using Microsoft 365 Agents Toolkit | Microsoft Learn<\/span><\/a><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><b><span data-contrast=\"auto\">Agent evaluation skill<\/span><\/b><span data-contrast=\"auto\">\u00a0\u2013\u00a0Use the\u00a0<\/span><a href=\"https:\/\/github.com\/microsoft\/work-iq\/tree\/main\/plugins\/microsoft-365-agents-toolkit\"><span data-contrast=\"none\">microsoft-365-agents-toolkit@workiq<\/span><\/a><span data-contrast=\"auto\"> skill for Claude Code and Copilot\u00a0to create and evaluate agents using the Agents Toolkit and the\u00a0Agent Evaluations tool.<\/span><\/p>\n<h2 aria-level=\"1\"><span data-contrast=\"none\">We want your feedback<\/span><span 
data-ccp-props=\"{&quot;134245418&quot;:true,&quot;134245529&quot;:true,&quot;335559738&quot;:360,&quot;335559739&quot;:80}\">\u00a0<\/span><\/h2>\n<p><span data-contrast=\"auto\">During the preview period, <\/span><span data-contrast=\"auto\">we <\/span><span data-contrast=\"auto\">need your voice. Try the tool against your agents, file issues in the <\/span><a href=\"https:\/\/github.com\/microsoft\/m365-copilot-eval\"><span data-contrast=\"none\">GitHub repo<\/span><\/a><span data-contrast=\"auto\">, and tell us which evaluators, integrations, and workflows <\/span><span data-contrast=\"auto\">make the biggest difference for your team. Your feedback will directly shape the path to\u00a0<\/span><span data-contrast=\"auto\">g<\/span><span data-contrast=\"auto\">eneral\u00a0<\/span><span data-contrast=\"auto\">a<\/span><span data-contrast=\"auto\">vailability<\/span><span data-contrast=\"auto\">\u00a0(GA)<\/span><span data-contrast=\"auto\">.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">We can&#8217;t wait to see the high-quality, trustworthy <\/span><span data-contrast=\"auto\">agents\u00a0<\/span><span data-contrast=\"auto\">for Microsoft 365 Copilot <\/span><span data-contrast=\"auto\">you&#8217;ll\u00a0ship next.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Microsoft 365 Copilot Agent Evaluations tool helps developers measure and improve the quality of agents they build for Microsoft 365 
Copilot.<\/p>\n","protected":false},"author":187599,"featured_media":25665,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[411,1],"tags":[],"class_list":["post-25662","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-microsoft-365-copilot-microsoft-365-developer","category-microsoft-365-developer"],"acf":[],"blog_post_summary":"<p>The Microsoft 365 Copilot Agent Evaluations tool helps developers measure and improve the quality of agents they build for Microsoft 365 Copilot.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\/v2\/posts\/25662","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\/v2\/users\/187599"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\/v2\/comments?post=25662"}],"version-history":[{"count":2,"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\/v2\/posts\/25662\/revisions"}],"predecessor-version":[{"id":25685,"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\/v2\/posts\/25662\/revisions\/25685"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\/v2\/media\/25665"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\/v2\/media?parent=25662"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/microsoft365dev\/wp-json\/wp\/v2\/categories?post=25662"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/microso
ft365dev\/wp-json\/wp\/v2\/tags?post=25662"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}