July 23rd, 2024

The future of Planners in Semantic Kernel

Since the very earliest days of Semantic Kernel, we have shipped experimental “planners” that use prompts to generate multi-step plans. This was extremely powerful because it allowed developers to use LLMs (which were created merely to generate text) to begin automating business processes.

Since then, the Semantic Kernel team has evolved its experimental planners to adopt the latest research from both inside and outside of Microsoft. Most notably, we began leveraging function calling. With function calling, the action planner could be replaced with a single function call request, and the ReAct-based planning of the stepwise planner could be replicated with multiple function calling steps.
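
To make that concrete, here is a minimal sketch of how ReAct-style stepwise planning reduces to a plain function calling loop, using the OpenAI Python SDK directly rather than Semantic Kernel. The get_weather tool and its implementation are invented for illustration:

    import json
    from openai import OpenAI

    client = OpenAI()

    # A hypothetical tool the model can call; a real plan would expose many of these.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    def get_weather(city: str) -> str:
        return f"It is sunny in {city}."  # stub implementation for the sketch

    messages = [{"role": "user", "content": "Should I bring an umbrella in Seattle?"}]
    while True:
        response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
        message = response.choices[0].message
        if not message.tool_calls:
            break  # no more steps; the model has produced its final answer
        messages.append(message)  # keep the model's tool-call request in context
        for call in message.tool_calls:
            args = json.loads(call.function.arguments)
            result = get_weather(**args)  # each tool call is one "step" of the plan
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})

    print(message.content)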

As function calling has become more accurate and efficient, however, additional “planning” logic on top of the model has become less necessary, and in some cases it can actually hurt the speed, cost, and accuracy of a plan.

In this blog post we’ll cover our current experimental planners and what comes next for the function calling stepwise planner and the Handlebars planner. We’ll answer: How do they work? What are they good at? And how do we plan to make them even better?

Keep an eye out for additional blog posts that provide a deep dive on how to use the future versions of the function calling stepwise planner and the Handlebars planner.

Function calling stepwise planner

As we evolved the original stepwise planner, we took advantage of function calling from OpenAI. This allowed us to reduce the size of the original prompt while still increasing accuracy. When function calling was first introduced, however, it wasn’t perfect. The LLMs had difficulty stringing together multiple function calls to complete a multi-step process (or plan) using the ReAct methodology.

To address this shortcoming, we introduced a prompt at the very beginning asking the AI to enumerate the list of steps it needed to complete a customer’s goal. This prompt was relatively lightweight and provided enough structure for function calling to work on more complex tasks.
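
In spirit, the change amounted to injecting a planning preamble before handing control over to function calling. The snippet below is a paraphrase for illustration only; the actual prompt we shipped is shown in the screenshot below:

    # A paraphrased sketch; the real planner prompt differs (see the screenshot below).
    messages = [
        {
            "role": "system",
            "content": (
                "Before calling any functions, enumerate the steps required to "
                "achieve the user's goal. Then complete the steps one at a time, "
                "calling the available functions as needed."
            ),
        },
        {"role": "user", "content": goal},  # 'goal' is the customer's request
    ]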

[Image: The prompt in the function calling stepwise planner]

The “better” way to ReAct

As function calling has gotten better, however, this additional step has become less necessary and has instead gotten in the way of functionality needed in enterprise applications:

  1. Streaming was difficult to implement
  2. The model had a harder time making parallel function calls
  3. Reusing the entire context from a chat history object became difficult
  4. Customization was difficult

So we thought to ourselves… what if we got out of the way entirely and just encouraged customers to use “vanilla” function calling? Over the last few months, the results have been astounding. Customers could achieve everything the function calling stepwise planner could with fewer tokens, more control, and a significantly lower time-to-first-token.
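
In Semantic Kernel terms, “vanilla” function calling means enabling automatic function invocation and letting the model drive. Below is a minimal sketch with the Python SDK; the API names reflect Semantic Kernel Python around the time of writing and may differ in your version, and the Lights plugin is invented for illustration:

    import asyncio

    from semantic_kernel import Kernel
    from semantic_kernel.connectors.ai.function_choice_behavior import FunctionChoiceBehavior
    from semantic_kernel.connectors.ai.open_ai import (
        OpenAIChatCompletion,
        OpenAIChatPromptExecutionSettings,
    )
    from semantic_kernel.contents import ChatHistory
    from semantic_kernel.functions import kernel_function

    class LightsPlugin:
        """A toy plugin; the model can invoke this function automatically."""

        @kernel_function(name="turn_on", description="Turn on the lights.")
        def turn_on(self) -> str:
            return "The lights are now on."

    async def main():
        kernel = Kernel()
        chat = OpenAIChatCompletion(ai_model_id="gpt-4o")  # reads OPENAI_API_KEY from the environment
        kernel.add_service(chat)
        kernel.add_plugin(LightsPlugin(), plugin_name="Lights")

        # "Vanilla" function calling: no planner, just automatic invocation.
        settings = OpenAIChatPromptExecutionSettings(
            function_choice_behavior=FunctionChoiceBehavior.Auto()
        )
        history = ChatHistory()
        history.add_user_message("Please turn on the lights.")

        result = await chat.get_chat_message_content(
            chat_history=history, settings=settings, kernel=kernel
        )
        print(result)

    asyncio.run(main())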

[Image: Streaming with function calling]

Because of this, we’ll be sunsetting the function calling stepwise planner in favor of “vanilla” function calling. Our docs have already been updated to reflect this new recommendation, and shortly, we’ll be publishing a blog detailing how to migrate your code if you already use the function calling stepwise planner.

Handlebars planner

The Handlebars planner was the natural successor to the sequential planner. Both were powerful because they allowed the LLM to generate an entire plan in a single LLM call. This had several benefits: a user could approve an entire “plan” before execution began, and it could theoretically use fewer tokens.

The challenge, however, is telling the LLM how to generate a plan using as few tokens as possible. With the original sequential planner, we had to “teach” the LLM how to generate custom XML within a single prompt. This was relatively expensive and yielded poor results because the LLM couldn’t draw on any prior knowledge to generate this novel XML structure.

[Image: The prompt for the original sequential planner]

The aha moment: code-based planners are more accurate

But what if we could use a language that the LLM already knew how to write natively? Both TaskWeaver and AutoGen had success with this approach, so the Semantic Kernel team had LLMs generate plans in several different programming languages (C#, Python, JavaScript) to see if they fared better. In our tests, every language performed remarkably better. The new challenge? How do you “safely” run code generated by an LLM?

We determined that running LLM-generated code in an enterprise deployment was not possible without extremely secure and limited runtime containers, so we searched for a language that was purposefully very limited so the LLM couldn’t do anything it shouldn’t. We landed on the templating language Handlebars because it could do nothing more than invoke helpers. It also had the benefit of having implementations in almost all languages (meaning we could drive parity across all our SDKs).

The choice of language matters

As more customers used the Handlebars planner, we began to realize its limitations. Because the LLMs had less training data on Handlebars templates, we had to make our prompts increasingly detailed. What originally started off as a cheaper way to generate a plan became just as token-intensive as the original sequential planner.

It was around this time that Azure Container Apps released dynamic sessions, a feature that allows you to spin up locked-down Python containers for the explicit purpose of running LLM-generated code. The same technology powers the Code interpreter in the Azure Assistants API. We finally had an enterprise-ready way to run a proper coding language that LLMs were adept at writing!
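
For a sense of what this looks like in practice, here is a rough sketch of executing a snippet of code against a dynamic sessions pool over its REST API. The pool endpoint, api-version, and session identifier below are placeholders, so check the Azure Container Apps documentation for the current values:

    import requests
    from azure.identity import DefaultAzureCredential

    # Placeholders: substitute your own session pool endpoint and session id.
    POOL_ENDPOINT = "https://<region>.dynamicsessions.io/<your-session-pool>"
    token = DefaultAzureCredential().get_token("https://dynamicsessions.io/.default")

    response = requests.post(
        f"{POOL_ENDPOINT}/code/execute",
        params={"api-version": "2024-02-02-preview", "identifier": "my-session"},
        headers={"Authorization": f"Bearer {token.token}"},
        json={
            "properties": {
                "codeInputType": "inline",
                "executionType": "synchronous",
                "code": "print(1 + 1)",  # the LLM-generated plan would go here
            }
        },
    )
    print(response.json())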

Asking LLMs to generate Python instead

LLMs are particularly good at writing Python since it’s the language of AI researchers. From the very early days of machine learning, models have been trained on Python code, and because OpenAI has chosen Python as the language for its Code interpreter, models’ ability to write Python will only continue to improve.

[Image: Code interpreter in ChatGPT]

Because of this, we will be replacing our Handlebars planner with a Python version later this fall that behaves more like OpenAI’s current Code interpreter, except that our Code interpreter will also allow the LLM to invoke local plugins and functions. While we develop this new planner, we recommend that new customers not use the existing Handlebars planner because it will be deprecated.
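
To give a flavor of the direction, a “plan” in this world is just a short Python script that calls into your functions. Everything below, including the weather and email plugin names and their parameters, is hypothetical:

    # Hypothetical plan an LLM might generate; "weather" and "email" are
    # invented stand-ins for local plugins the planner would expose.
    city = weather.get_current_city()
    forecast = weather.get_forecast(city=city, days=3)
    rainy_days = [day for day in forecast if day.chance_of_rain > 0.5]
    if rainy_days:
        email.send(
            to="me@contoso.com",
            subject=f"Umbrella alert for {city}",
            body=f"Rain expected on {len(rainy_days)} of the next 3 days.",
        )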

It’s important to note that only the Handlebars planner will be deprecated. The Handlebars templating engine will remain for prompt templates.

Will a Python-based planner work with C# and Java?

The greatest concern we’ve heard with moving to a Python-based planner is whether it will work with the C# and Java SDKs for Semantic Kernel, and for that, we have good news! Just as we used Handlebars template generation in all three SDKs, we’ll be able to use Python code generation in all three SDKs as well.

To support local development, we’ll provide an out-of-the-box container for local testing. For production deployments, we’ll provide a connection to Azure Container Apps dynamic sessions.

We’ve also heard concern from C# and Java developers who think this will require them to author Python code. Just like with the Handlebars planner, we do not expect developers to author these plans themselves. This is a language that only the LLM needs to know to create plans for the user at runtime, and Python appears to be the best language LLMs can generate today.

Keep an eye out for an announcement in the next few months for when we release our new Python-based planning solution.

Give us feedback on our transition

If you are a current user of the existing planners and have unique scenarios you want to make sure we support with the updated planners, please reach out to us on our Semantic Kernel GitHub Discussion Channel! We want to ensure that all scenarios of the previous planners continue to be supported.

5 comments


  • José Luis Latorre Millás

    Thanks Matthew & Team,
    While this is clearly understandable, I'd like to have some freedom here. Even if the de-facto and most supported code interpreter is trained for Python, how bad is it at generating .NET code, C# for example? Have you tried some simple, medium, and complex scenarios? How accurate and token-intensive are they compared to Python? (You know, I, along with some of us, love .NET)

    Also, while I love the cloud, I love...

    • Matthew Bolanos (Microsoft employee, Author)

      The main challenge that we've had with languages like .NET and Java is that it's more difficult to get packages to work. With Python, you simply reference a package and the Python runtime is able to grab it. That's less true for other languages, but we're interested in investigating it once we get Python working.

      Regarding containers, as mentioned in the article, we'll provide both a plugin to Azure Container Apps Dynamic Sessions and a local...

      • Paulo Pinto

        Given all the issues Python packages have to deal with (compilers, the various setup scripts, virtual environments, even having to use special distributions like Anaconda and ActivePython), it is quite surprising to see this mentioned as a better experience than using Maven Central or NuGet packages.

      • Matthew Bolanos (Microsoft employee, Author) · Edited

        I definitely agree that the package management provided by Maven or NuGet is better than what’s available in Python. What’s more challenging, however, is that importing a package in Java or C# typically requires updating something like a csproj file. This isn’t necessary for Python, so it’s easier for an LLM to author a script that can be immediately run in a container.