Evaluate your plugins and planners with Prompt flow

Matthew Bolanos

September 7th, 2023

As you build plugins and add them to planners, it’s important to make sure they work as intended, and this matters more with every plugin you add. With more functions to choose from, your planners have a greater chance of hallucinating and producing incorrect plans.

Until now, testing your plugins and planners was a manual process that required you to test different user asks one at a time. While tools like Chat Copilot can help, it is still a time-consuming task. But what if you could automate testing and evaluation? With Prompt flow you can!

Our updated docs show you how to create a new Prompt flow with Semantic Kernel, run batch tests, and run evaluations that quantify the accuracy of your planners and plugins.

We also provide tips for using prompt engineering to improve the quality of your planners and get even better evaluation results. Below are some of the highlights of using Prompt flow with Semantic Kernel.

Automate batch runs with Prompt flow.

Instead of manually testing different scenarios one-by-one, you can now automatically run large batches of tests using Prompt flow and benchmark data.

In our updated docs, we demonstrate how you can use this functionality to run batch tests on a planner that uses a math plugin. By defining a bunch of word problems, we can quickly test any changes we make to our plugins or planners so we can catch regressions early and often.
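As a rough illustration of what such benchmark data could look like, the sketch below writes a few word problems to a JSON Lines file, one JSON object per line. The field names `text` and `answer`, the sample problems, and the helper names are assumptions made for this example; the fields need to match whatever inputs your own flow declares.

```python
import json
from pathlib import Path

# Hypothetical benchmark cases. The field names "text" and "answer" are
# assumptions; they must match the input names your flow expects.
BENCHMARK = [
    {"text": "What is 2 plus 3?", "answer": 5},
    {"text": "If I have 10 apples and eat 4, how many are left?", "answer": 6},
    {"text": "What is 7 times 6?", "answer": 42},
]


def write_jsonl(rows, path):
    """Write one JSON object per line (the JSON Lines format)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    write_jsonl(BENCHMARK, "data.jsonl")
```

Keeping the cases in a small script like this makes it easy to add a new word problem whenever you find a failure, so the benchmark grows along with the planner.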

After defining your benchmark data and setting up the Prompt flow, you can run a batch test with the following command:

pf run create --flow . --data data.jsonl --stream

Afterwards, you can view the results of your batch run directly within VS Code with the Prompt flow extension to start investigating where the planner made mistakes.

Quantify the accuracy of your planners.

After you’ve run a batch, you then need an easy way to see how many of the tests were “good enough”. With this information, you can start developing accuracy scores that you can try to improve over time.

With evaluation flows, you can do just that. With the sample evaluation flows provided by Prompt flow, you can measure classification accuracy, perceived intelligence, groundedness, and more. You can even create your own custom evaluators if you want!

In the evaluation docs for Semantic Kernel, we demonstrate how to use the math accuracy evaluation flow to test our planner to see how well it solves word problems.

After running the evaluator, you’ll get back a summary of your metrics. The first time you run it, you may get subpar results that you’ll want to improve right away. For example, the math plugin in the docs initially has an accuracy score of only 60% and an error rate of 20%!

We can definitely do better.
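To make metrics like these concrete, here is a minimal sketch of how an accuracy score and an error rate could be computed from batch results. The definitions are assumptions for illustration, not the evaluation flow's actual implementation: accuracy is taken as the share of runs whose answer matches the expected value, and error rate as the share of runs that produced no numeric answer at all. The `Result` class and `summarize` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Result:
    expected: float
    predicted: Optional[float]  # None means the run failed to produce a number


def summarize(results: List[Result], tolerance: float = 1e-6) -> Dict[str, float]:
    """Return accuracy and error rate as fractions of all runs (assumed definitions)."""
    total = len(results)
    if total == 0:
        return {"accuracy": 0.0, "error_rate": 0.0}
    # A run counts as an error when it produced no numeric answer.
    errors = sum(1 for r in results if r.predicted is None)
    # A run counts as correct when its answer is within the tolerance.
    correct = sum(
        1
        for r in results
        if r.predicted is not None and abs(r.predicted - r.expected) <= tolerance
    )
    return {"accuracy": correct / total, "error_rate": errors / total}


if __name__ == "__main__":
    sample = (
        [Result(expected=5, predicted=5)] * 6      # correct answers
        + [Result(expected=5, predicted=4)] * 2    # wrong answers
        + [Result(expected=5, predicted=None)] * 2  # failed runs
    )
    print(summarize(sample))
```

Under these assumed definitions, a batch of ten runs with six correct answers, two wrong answers, and two failed runs would score 60% accuracy with a 20% error rate.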

Learn how to make your plugins and planners even better.

If you find that your plugins and planners aren’t performing as well as they should, there are steps you can take to improve them. In the docs, we walk through a few concrete strategies. At a high level, though, you should consider the following:

Use a more advanced model like GPT-4 instead of GPT-3.5-turbo.
Improve the description of your plugins so they’re easier for the planner to use.
Inject additional help to the planner when sending the user’s ask.

By doing a combination of these three things, we demonstrate how you can take a failing planner and turn it into a winning one! At the end of the walkthrough, you should have a planner that can correctly answer all of the benchmark data.
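The third suggestion, injecting additional help alongside the user’s ask, can be sketched without any particular SDK. The helper below is hypothetical and only shows the idea of prepending guidance to the ask before it is handed to a planner; the function name, the wording of the guidance, and the example hint are all assumptions, and the docs may implement this differently.

```python
from typing import Sequence


def build_ask(user_ask: str, hints: Sequence[str]) -> str:
    """Prepend planner guidance to the user's ask (hypothetical helper)."""
    if not hints:
        return user_ask
    lines = ["Keep the following in mind when building the plan:"]
    lines += [f"- {hint}" for hint in hints]
    lines += ["", f"User ask: {user_ask}"]
    return "\n".join(lines)


if __name__ == "__main__":
    ask = build_ask(
        "What is 12 divided by 4, plus 2?",
        ["Use the math plugin for every arithmetic step."],  # example hint (assumption)
    )
    print(ask)
```

Keeping the hints in a list makes it straightforward to try different guidance and rerun the same batch to see whether the scores move.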

Follow along with our docs to get started!

If you’re interested in learning more about how you can use Prompt flow to test and evaluate Semantic Kernel, we recommend following along with the new doc articles we created. At each step, we provide sample code and explanations so you can use Prompt flow successfully with Semantic Kernel.

Article	Description
Using Prompt flow with Semantic Kernel	Learn more about what Prompt flow is and why you would want to use it with Semantic Kernel.
Create a Prompt flow with Semantic Kernel	Follow the steps necessary to create a simple Prompt flow with Semantic Kernel at its core.
Running batches with Prompt flow	Run a suite of benchmark data on your new Prompt flow.
Evaluate your plugins and planners	Evaluate your plugins and planners and experiment with ways of improving them.

Keep an eye out for more integrations.

The Prompt flow and Semantic Kernel teams aren’t done adding integrations. Keep an eye out for new integrations and more reasons to use Prompt flow with Semantic Kernel over the coming weeks.