{"id":1175,"date":"2023-09-07T09:21:09","date_gmt":"2023-09-07T16:21:09","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/semantic-kernel\/?p=1175"},"modified":"2023-09-14T04:48:55","modified_gmt":"2023-09-14T11:48:55","slug":"evaluate-your-plugins-and-planners-with-prompt-flow","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/agent-framework\/evaluate-your-plugins-and-planners-with-prompt-flow\/","title":{"rendered":"Evaluate your plugins and planners with Prompt flow"},"content":{"rendered":"<h1 style=\"padding-bottom: 2rem;\"><a href=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2023\/03\/skpatternlarge.png\"><img decoding=\"async\" class=\"aligncenter size-full wp-image-89\" src=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2023\/03\/skpatternlarge.png\" alt=\"Image skpatternlarge\" width=\"1638\" height=\"136\" srcset=\"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/03\/skpatternlarge.png 1638w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/03\/skpatternlarge-300x25.png 300w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/03\/skpatternlarge-1024x85.png 1024w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/03\/skpatternlarge-768x64.png 768w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/03\/skpatternlarge-1536x128.png 1536w\" sizes=\"(max-width: 1638px) 100vw, 1638px\" \/><\/a><\/h1>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2023\/09\/semantic-kernel-in-prompt-flow-1.png\"><img decoding=\"async\" class=\"alignright wp-image-1183 size-large\" src=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2023\/09\/semantic-kernel-in-prompt-flow-1-584x1024.png\" 
alt=\"Image semantic kernel in prompt flow 1\" width=\"292\" height=\"512\" srcset=\"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/semantic-kernel-in-prompt-flow-1-584x1024.png 584w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/semantic-kernel-in-prompt-flow-1-171x300.png 171w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/semantic-kernel-in-prompt-flow-1-768x1346.png 768w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/semantic-kernel-in-prompt-flow-1-877x1536.png 877w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/semantic-kernel-in-prompt-flow-1-1169x2048.png 1169w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/semantic-kernel-in-prompt-flow-1.png 1464w\" sizes=\"(max-width: 292px) 100vw, 292px\" \/><\/a>As you build <a href=\"https:\/\/learn.microsoft.com\/en-us\/semantic-kernel\/ai-orchestration\/plugins\/?tabs=Csharp\">plugins<\/a> and add them to <a href=\"https:\/\/learn.microsoft.com\/en-us\/semantic-kernel\/ai-orchestration\/plugins\/planner?tabs=Csharp\">planners<\/a>, it&#8217;s important to make sure they work as intended. This becomes more important as you add more and more plugins to your planners. With more functions to choose from, your planners have a greater chance of hallucinating and doing incorrect things.<\/p>\n<p>Until now, testing your plugins and planners was a manual process that required you to individually test different user asks. While tools like <a href=\"https:\/\/github.com\/microsoft\/chat-copilot\">Chat Copilot<\/a> can help, it is still a time-consuming task. But what if you could automate testing and evaluation? 
With <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/machine-learning\/prompt-flow\/overview-what-is-prompt-flow?view=azureml-api-2\">Prompt flow<\/a> you can!<\/p>\n<p>With our updated docs, we show you how you can <a href=\"https:\/\/learn.microsoft.com\/en-us\/semantic-kernel\/ai-orchestration\/planners\/evaluate-and-deploy-planners\/create-a-prompt-flow-with-semantic-kernel\">create a new Prompt flow<\/a> with Semantic Kernel, <a href=\"https:\/\/learn.microsoft.com\/en-us\/semantic-kernel\/ai-orchestration\/planners\/evaluate-and-deploy-planners\/running-batches-with-prompt-flow?tabs=gpt-35-turbo\">run batch tests<\/a>, and <a href=\"https:\/\/learn.microsoft.com\/en-us\/semantic-kernel\/ai-orchestration\/planners\/evaluate-and-deploy-planners\/\">run evaluations<\/a> to quantifiably measure the accuracy of your planners and plugins.<\/p>\n<p>Lastly, we also provide <a href=\"https:\/\/learn.microsoft.com\/en-us\/semantic-kernel\/ai-orchestration\/planners\/evaluate-and-deploy-planners\/evaluating-plugins-and-planners-with-prompt-flow?tabs=gpt-35-turbo#improving-your-flow-with-prompt-engineering\">some tips<\/a> for using prompt engineering to improve the quality of your planners to get even better evaluation results. Below are some of the highlights of using Prompt flow with Semantic Kernel.<\/p>\n<h2 style=\"clear: both; padding-top: 2rem;\">Automate batch runs with Prompt flow.<\/h2>\n<p>Instead of manually testing different scenarios one-by-one, you can now automatically run large batches of tests using Prompt flow and benchmark data.<\/p>\n<p>In our <a href=\"https:\/\/learn.microsoft.com\/en-us\/semantic-kernel\/ai-orchestration\/planners\/evaluate-and-deploy-planners\/running-batches-with-prompt-flow?tabs=gpt-35-turbo\">updated docs<\/a>, we demonstrate how you can use this functionality to run batch tests on a planner that uses a math plugin. 
By <a href=\"https:\/\/learn.microsoft.com\/en-us\/semantic-kernel\/ai-orchestration\/planners\/evaluate-and-deploy-planners\/running-batches-with-prompt-flow?tabs=gpt-35-turbo#create-benchmark-data-for-your-prompt-flow\">defining a bunch of word problems<\/a>, we can quickly test any changes we make to our plugins or planners so we can catch regressions early and often.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2023\/09\/using-batch-runs-with-prompt-flow.png\"><img decoding=\"async\" class=\"aligncenter wp-image-1176 size-full\" src=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2023\/09\/using-batch-runs-with-prompt-flow.png\" alt=\"Image using batch runs with prompt flow\" width=\"2500\" height=\"1071\" srcset=\"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/using-batch-runs-with-prompt-flow.png 2500w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/using-batch-runs-with-prompt-flow-300x129.png 300w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/using-batch-runs-with-prompt-flow-1024x439.png 1024w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/using-batch-runs-with-prompt-flow-768x329.png 768w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/using-batch-runs-with-prompt-flow-1536x658.png 1536w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/using-batch-runs-with-prompt-flow-2048x877.png 2048w\" sizes=\"(max-width: 2500px) 100vw, 2500px\" \/><\/a><\/p>\n<p>After defining our benchmark data and setting up the Prompt flow, you can easily run a batch test with the following command.<\/p>\n<pre><code class=\"language-cs language-csharp\">pf run create --flow . 
--data data.jsonl --stream<\/code><\/pre>\n<p>Afterwards, you can view the results of your batch run directly within VS Code with the <a href=\"https:\/\/marketplace.visualstudio.com\/items?itemName=prompt-flow.prompt-flow\">Prompt flow extension<\/a> to start investigating where the planner made mistakes.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2023\/09\/Screenshot-2023-09-07-at-2.42.32-PM.png\"><img decoding=\"async\" class=\"aligncenter wp-image-1222 size-large\" style=\"border: 1px solid #ddd;\" src=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2023\/09\/Screenshot-2023-09-07-at-2.42.32-PM-1024x601.png\" alt=\"Image Screenshot 2023 09 07 at 2 42 32 PM\" width=\"640\" height=\"376\" srcset=\"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/Screenshot-2023-09-07-at-2.42.32-PM-1024x601.png 1024w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/Screenshot-2023-09-07-at-2.42.32-PM-300x176.png 300w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/Screenshot-2023-09-07-at-2.42.32-PM-768x451.png 768w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/Screenshot-2023-09-07-at-2.42.32-PM-1536x902.png 1536w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/Screenshot-2023-09-07-at-2.42.32-PM-2048x1203.png 2048w\" sizes=\"(max-width: 640px) 100vw, 640px\" \/><\/a><\/p>\n<h2 style=\"clear: both; padding-top: 2rem;\">Quantify the accuracy of your planners.<\/h2>\n<p>After you&#8217;ve run a batch, you then need an easy way to see how many of the tests were &#8220;good enough&#8221;. 
With this information, you can start developing accuracy scores that you can try to improve over time.<\/p>\n<p>With <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/machine-learning\/prompt-flow\/how-to-bulk-test-evaluate-flow?view=azureml-api-2\">evaluation flows<\/a>, you can do just that. With the sample evaluation flows provided by Prompt flow, you can measure <a href=\"https:\/\/github.com\/microsoft\/promptflow\/tree\/main\/examples\/flows\/evaluation\/eval-classification-accuracy\">classification accuracy<\/a>, <a href=\"https:\/\/github.com\/microsoft\/promptflow\/tree\/main\/examples\/flows\/evaluation\/eval-perceived-intelligence\">perceived intelligence<\/a>, <a href=\"https:\/\/github.com\/microsoft\/promptflow\/tree\/main\/examples\/flows\/evaluation\/eval-groundedness\">groundedness<\/a>, and more. You can even create your own <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/machine-learning\/prompt-flow\/how-to-develop-an-evaluation-flow?view=azureml-api-2\">custom evaluators<\/a> if you want!<\/p>\n<p>In the evaluation docs for Semantic Kernel, we demonstrate how to use the <a href=\"https:\/\/github.com\/microsoft\/promptflow\/tree\/main\/examples\/flows\/evaluation\/eval-accuracy-maths-to-code\">math accuracy evaluation flow<\/a>\u00a0to test our planner to see how well it solves word problems.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2023\/09\/evaluating-batch-run-with-prompt-flow.png\"><img decoding=\"async\" class=\"aligncenter size-full wp-image-1187\" src=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2023\/09\/evaluating-batch-run-with-prompt-flow.png\" alt=\"Image evaluating batch run with prompt flow\" width=\"2500\" height=\"1064\" srcset=\"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/evaluating-batch-run-with-prompt-flow.png 2500w, 
https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/evaluating-batch-run-with-prompt-flow-300x128.png 300w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/evaluating-batch-run-with-prompt-flow-1024x436.png 1024w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/evaluating-batch-run-with-prompt-flow-768x327.png 768w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/evaluating-batch-run-with-prompt-flow-1536x653.png 1536w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/evaluating-batch-run-with-prompt-flow-2048x871.png 2048w\" sizes=\"(max-width: 2500px) 100vw, 2500px\" \/><\/a><\/p>\n<p>After running the evaluator, you&#8217;ll get a summary back of your metrics. The first time you run it, you may get subpar results that you&#8217;ll want to immediately improve. For example, the math plugin in the docs initially only has an accuracy score of 60% and an error rate of 20%!<\/p>\n<p>We can definitely do better.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2023\/09\/Screenshot-2023-09-07-at-2.21.37-PM.png\"><img decoding=\"async\" class=\"aligncenter size-full wp-image-1192\" src=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2023\/09\/Screenshot-2023-09-07-at-2.21.37-PM.png\" alt=\"Image Screenshot 2023 09 07 at 2 21 37 PM\" width=\"1608\" height=\"1078\" srcset=\"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/Screenshot-2023-09-07-at-2.21.37-PM.png 1608w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/Screenshot-2023-09-07-at-2.21.37-PM-300x201.png 300w, 
https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/Screenshot-2023-09-07-at-2.21.37-PM-1024x686.png 1024w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/Screenshot-2023-09-07-at-2.21.37-PM-768x515.png 768w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2023\/09\/Screenshot-2023-09-07-at-2.21.37-PM-1536x1030.png 1536w\" sizes=\"(max-width: 1608px) 100vw, 1608px\" \/><\/a><\/p>\n<h2 style=\"clear: both; padding-top: 2rem;\">Learn how to make your plugins and planners <em>even better<\/em>.<\/h2>\n<p>If you find that your plugins and planners aren&#8217;t performing as well as they should, there are steps you can take to make them better. In the docs, we walk through a few concrete strategies for improving them. At a high level, though, you should consider the following:<\/p>\n<ol>\n<li>Use a more advanced model like GPT-4 instead of GPT-3.5-turbo.<\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/en-us\/semantic-kernel\/ai-orchestration\/planners\/evaluate-and-deploy-planners\/evaluating-plugins-and-planners-with-prompt-flow?tabs=gpt-35-turbo#improving-the-descriptions-of-your-plugin\">Improve the description of your plugins<\/a> so they&#8217;re easier for the planner to use.<\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/en-us\/semantic-kernel\/ai-orchestration\/planners\/evaluate-and-deploy-planners\/evaluating-plugins-and-planners-with-prompt-flow?tabs=gpt-35-turbo#improving-the-descriptions-of-your-plugin\">Inject additional help to the planner<\/a> when sending the user&#8217;s ask.<\/li>\n<\/ol>\n<p>By doing a combination of these three things, we demonstrate how you can take a failing planner and turn it into a winning one! 
At the end of the walkthrough, you should have a planner that can correctly answer all of the benchmark data.<\/p>\n<h2 style=\"clear: both; padding-top: 2rem;\">Follow along with our docs to get started!<\/h2>\n<p>If you&#8217;re interested in learning more about how you can use Prompt flow to test and evaluate Semantic Kernel, we recommend following along with the new doc articles we created. At each step, we provide sample code and explanations so you can use Prompt flow successfully with Semantic Kernel.<\/p>\n<table style=\"border-collapse: collapse; width: 100%;\">\n<tbody>\n<tr>\n<td style=\"width: 50%;\"><strong>Article<\/strong><\/td>\n<td style=\"width: 50%;\"><strong>Description<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%;\"><a href=\"https:\/\/learn.microsoft.com\/en-us\/semantic-kernel\/ai-orchestration\/planners\/evaluate-and-deploy-planners\/\">Using Prompt flow with Semantic Kernel<\/a><\/td>\n<td style=\"width: 50%;\">Learn more about what Prompt flow is and why you would want to use it with Semantic Kernel.<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%;\"><a href=\"https:\/\/learn.microsoft.com\/en-us\/semantic-kernel\/ai-orchestration\/planners\/evaluate-and-deploy-planners\/create-a-prompt-flow-with-semantic-kernel\">Create a Prompt flow with Semantic Kernel<\/a><\/td>\n<td style=\"width: 50%;\">Follow the steps necessary to create a simple Prompt flow with Semantic Kernel at its core.<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%;\"><a href=\"https:\/\/learn.microsoft.com\/en-us\/semantic-kernel\/ai-orchestration\/planners\/evaluate-and-deploy-planners\/running-batches-with-prompt-flow\">Running batches with Prompt flow<\/a><\/td>\n<td style=\"width: 50%;\">Run a suite of benchmark data on your new Prompt flow.<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 50%;\"><a href=\"https:\/\/learn.microsoft.com\/en-us\/semantic-kernel\/ai-orchestration\/planners\/evaluate-and-deploy-planners\/\">Evaluate your plugins and planners<\/a><\/td>\n<td 
style=\"width: 50%;\">Evaluate your plugins and planners and experiment with ways of improving them.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2 style=\"clear: both; padding-top: 2rem;\">Keep an eye out for more integrations.<\/h2>\n<p><span style=\"font-size: 1rem; text-align: var(--bs-body-text-align);\">The Prompt flow and Semantic Kernel teams aren&#8217;t done with adding more integrations. Keep an eye out for new integrations and reasons to use Prompt flow with Semantic Kernel over the coming weeks.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>As you build plugins and add them to planners, it&#8217;s important to make sure they work as intended. This becomes more important as you add more and more plugins to your planners. With more functions, your planners have a greater chance of hallucinating and doing incorrect things. Until now, testing your plugins and planners was [&hellip;]<\/p>\n","protected":false},"author":121401,"featured_media":1228,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1175","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-semantic-kernel"],"acf":[],"blog_post_summary":"<p>As you build plugins and add them to planners, it&#8217;s important to make sure they work as intended. This becomes more important as you add more and more plugins to your planners. With more functions, your planners have a greater chance of hallucinating and doing incorrect things. 
Until now, testing your plugins and planners was [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/posts\/1175","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/users\/121401"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/comments?post=1175"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/posts\/1175\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/media\/1228"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/media?parent=1175"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/categories?post=1175"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/tags?post=1175"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}