{"id":1844,"date":"2025-07-30T06:20:36","date_gmt":"2025-07-30T10:20:36","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/powerplatform\/?p=1844"},"modified":"2025-07-30T06:20:36","modified_gmt":"2025-07-30T10:20:36","slug":"plan-validation-cat-kit","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/powerplatform\/plan-validation-cat-kit\/","title":{"rendered":"Introducing Plan Validation in Copilot Studio Kit"},"content":{"rendered":"<p>Ever wonder why your agent&#8217;s answers can be right but for the wrong reasons? Let&#8217;s dive into why that matters, with a brief foray into epistemology (bear with me!)<\/p>\n<h2>Knowledge: More Than Just Right Answers<\/h2>\n<p>Imagine asking someone the time, and they confidently reply, \u201c2:30\u202fPM.\u201d They&#8217;re correct\u2014but what if their watch stopped exactly 12 hours ago, and they just happened to get lucky? This illustrates a classic philosophical point: the difference between a <strong>true belief<\/strong> and a <a href=\"https:\/\/plato.stanford.edu\/entries\/knowledge-analysis\/#JustCond\"><strong>justified true belief<\/strong><\/a>.<\/p>\n<p>And no, this isn\u2019t just a theoretical curiosity (yes, philosophy folks are sensitive about that critique). In practice, we <em>do<\/em> care about how people form their beliefs\u2014and, by analogy, how AI agents arrive at their responses.<\/p>\n<p>In LLM system design, we evaluate not only <strong>if<\/strong> an agent is correct, but <strong>how<\/strong> it got there. A key metric here is <a href=\"https:\/\/www.confident-ai.com\/blog\/llm-agent-evaluation-complete-guide\"><strong>tool correctness<\/strong><\/a>\u2014a measure of whether the AI used the right tools, in the right way, to arrive at its conclusion.<\/p>\n<p>This is where the analogy to justification comes in. Tool correctness isn\u2019t about whether the answer looks right, it\u2019s about whether the <strong>reasoning process was sound<\/strong>. It shifts focus from the <strong>content<\/strong> of the response to the <strong>process behind it<\/strong>.<\/p>\n<h2>Enter Plan Validation<\/h2>\n<p>If you\u2019re not familiar with the <a href=\"https:\/\/appsource.microsoft.com\/en-us\/product\/dynamics-365\/microsoftpowercatarch.copilotstudiokit2?tab=overview\"><strong>Copilot Studio Kit<\/strong><\/a> (<em>don\u2019t worry, we won\u2019t judge<\/em>), it\u2019s an open-source toolkit for testing and evaluating Copilot Studio agents. It allows you to define test cases, simulate user queries, and validate agent responses.<\/p>\n<p>One of the built-in test types is <strong>Generative Answers<\/strong>, which uses <a href=\"https:\/\/learn.microsoft.com\/en-us\/ai-builder\/create-a-custom-prompt\">AI Builder<\/a> to LLM-judge whether the agent\u2019s response is semantically correct. That\u2019s important, mainly when you\u2019re evaluating the agent\u2019s <a href=\"https:\/\/learn.microsoft.com\/en-us\/microsoft-copilot-studio\/knowledge-copilot-studio\">knowledge capabilities<\/a>, like whether it retrieved relevant content or phrased an answer in a helpful way.<\/p>\n<p>But sometimes, semantic accuracy isn\u2019t enough.<\/p>\n<p>Some tasks don\u2019t produce meaningful responses at all\u2014like updating a database row or triggering a backend process. In those cases, there may be <strong>no content to evaluate<\/strong> (other than a vague \u201cdone\u201d?). 
## Enter Plan Validation

If you're not familiar with the [**Copilot Studio Kit**](https://appsource.microsoft.com/en-us/product/dynamics-365/microsoftpowercatarch.copilotstudiokit2?tab=overview) (*don't worry, we won't judge*), it's an open-source toolkit for testing and evaluating Copilot Studio agents. It allows you to define test cases, simulate user queries, and validate agent responses.

One of the built-in test types is **Generative Answers**, which uses [AI Builder](https://learn.microsoft.com/en-us/ai-builder/create-a-custom-prompt) to LLM-judge whether the agent's response is semantically correct. That's important, mainly when you're evaluating the agent's [knowledge capabilities](https://learn.microsoft.com/en-us/microsoft-copilot-studio/knowledge-copilot-studio), like whether it retrieved relevant content or phrased an answer in a helpful way.

But sometimes, semantic accuracy isn't enough.

Some tasks don't produce meaningful responses at all, like updating a database row or triggering a backend process. In those cases, there may be **no content to evaluate** (other than a vague "done"?). Other times, the agent provides a fluent, seemingly correct answer but **relies on the wrong tools**, or skips tool usage entirely. The response looks right, but the reasoning process behind it isn't reliable.

Plan Validation is a recently added testing capability in the Copilot Studio Kit that focuses on **tool correctness**. Instead of evaluating what the agent says, it checks whether the **expected tools** were used during the plan.

When defining a Plan Validation test, you specify:

- A test utterance
- A list of **expected tools**
- A **pass threshold**, representing how much deviation you're willing to tolerate from that list

![plan validation test mode](https://devblogs.microsoft.com/powerplatform/wp-content/uploads/sites/79/2025/07/plan-validation-test-mode-1024x633.png)

This allows you to validate the agent's [**orchestrated plan**](https://learn.microsoft.com/en-us/microsoft-copilot-studio/advanced-generative-actions), instead of validating its response. It's about verifying that your agent isn't just saying the right thing, but actually doing the right thing. Plan Validation is a **deterministic test**: it calculates the deviation of the actual tools from the expected tools, with no LLM judgment involved.
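To make the mechanics concrete, here's a minimal Python sketch of such a deterministic check. This is not the Kit's actual implementation; the function names and the choice to count unexpected extra tools against the score are assumptions for illustration.

```python
def plan_deviation(expected: set[str], actual: set[str]) -> float:
    """Score how far a run's tool calls deviate from the expected list.

    Skipped expected tools and unexpected extras both count against
    the run (this weighting is an assumption, not the Kit's spec).
    """
    if not expected:
        return 0.0
    missing = expected - actual  # expected tools the agent never called
    extra = actual - expected    # tools the agent called but wasn't expected to
    return (len(missing) + len(extra)) / len(expected)


def passes(expected: set[str], actual: set[str], threshold: float) -> bool:
    """A run passes when its deviation stays within the pass threshold."""
    return plan_deviation(expected, actual) <= threshold
```

Because the check is pure set arithmetic, the same transcript always yields the same verdict, which is exactly what you want from a regression test.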
## Example: Same Response, Different Reasoning

Let's look at a real test case.

We asked the agent:

> "Hey. I'm in Colorado. Any parks with interesting things to do? Just make sure it has a proper place to camp."

In both test runs, the agent returned a seemingly correct response: it listed Colorado parks, included descriptions of activities, and mentioned camping availability. On the surface, both responses seemed equally valid, but the underlying tool usage tells a different story.

In the **first case**, the agent used **all the expected tools**: `GetParks`, `GetCampgrounds`, and `GetThingsToDoInParks`, to fetch live, authoritative data on camping options.

![accurate plan](https://devblogs.microsoft.com/powerplatform/wp-content/uploads/sites/79/2025/07/accurate-plan-1024x520.png)

In the **second case**, the agent used `GetParks` and `GetThingsToDoInParks`, but **skipped** the `GetCampgrounds` tool. Instead, it relied on **general knowledge the AI was trained on** to generate the camping details.

![inaccurate plan](https://devblogs.microsoft.com/powerplatform/wp-content/uploads/sites/79/2025/07/inaccurate-plan-4-1024x520.png)

That may sound fine, but it's risky. Large language models can produce fluent, confident-sounding answers even when they're wrong. Without grounding in live data, the response might reference campgrounds that are closed, or worse, nonexistent (imagine showing up with your tent and getting soaked in the rain). These failures are especially dangerous because they're hard to detect, as *the response still looks good*.
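Feeding both runs through the hypothetical sketch from earlier makes the difference measurable, even though the two responses read alike:

```python
expected = {"GetParks", "GetCampgrounds", "GetThingsToDoInParks"}

run_1 = {"GetParks", "GetCampgrounds", "GetThingsToDoInParks"}  # all expected tools used
run_2 = {"GetParks", "GetThingsToDoInParks"}                    # GetCampgrounds skipped

print(plan_deviation(expected, run_1))         # 0.0   -> passes any threshold
print(plan_deviation(expected, run_2))         # ~0.33 -> one of three expected tools missed
print(passes(expected, run_2, threshold=0.0))  # False: a strict test flags this run
```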
The difference isn't in the wording; it's in the **process**. Only one of these responses is **tool-correct**, because only one is grounded in the complete and intended tool set.

This is what **Plan Validation** helps uncover: agents that may sound right but skipped critical steps, and those that followed the full reasoning path.

## What's Next for Plan Validation?

Copilot Studio already supports autonomous agents that respond to [event triggers](https://learn.microsoft.com/en-us/microsoft-copilot-studio/authoring-triggers-about), and Plan Validation is a step toward making it easier to evaluate not just what agents say, but how they act. While today it helps validate conversational plans, we're exploring how similar techniques could be extended to autonomous scenarios as well.

Stay tuned for additional testing modes coming soon!