{"id":16462,"date":"2025-11-07T00:00:00","date_gmt":"2025-11-07T08:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/ise\/?p=16462"},"modified":"2025-11-07T04:44:46","modified_gmt":"2025-11-07T12:44:46","slug":"multi-agent-systems-at-scale","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/multi-agent-systems-at-scale\/","title":{"rendered":"Patterns for Building a Scalable Multi-Agent System"},"content":{"rendered":"<h1>Introduction<\/h1>\n<p>LLMs are enabling a new generation of multi-agent systems that tackle complex, multi-step tasks. Moving from prototype to production is hard\u2014especially when orchestrating dozens or hundreds of specialized agents. Commercial viability depends on intentional design, scalability, latency control, and predictable outcomes.<\/p>\n<p>This post walks through patterns and lessons learned while building such a system.<\/p>\n<p>The core requirements are:<\/p>\n<ul>\n<li><strong>Accurate Agent Selection<\/strong>: Identify and select the most relevant agents for the given task.<\/li>\n<li><strong>Optimized LLM Usage<\/strong>: Control latency and token spend as the agent set grows.<\/li>\n<li><strong>Efficient Orchestration<\/strong>: Efficiently coordinate agent interactions and hand-offs while producing a coherent final response.<\/li>\n<li><strong>Scalability<\/strong>: Add new agents quickly without degrading performance.<\/li>\n<\/ul>\n<h2>The Problem Statement<\/h2>\n<p>A leading ecommerce company is aiming to supercharge its customer experience via a newly introduced smart digital voice assistant. This assistant helps customers track orders, manage returns, get product recommendations, answer FAQs, and more\u2014powered by a diverse set of specialized AI agents. The challenge? Delivering intelligence and efficiency at scale, while keeping latency low and performance high. 
Since customers can ask anything, in any order, the system must identify the user\u2019s intent and invoke the right agents to complete the task.<\/p>\n<p>However, if every agent is included in each request, token usage (and thus cost and latency) can skyrocket. It\u2019s simply not practical for all agents defined in the system to participate in every user&#8217;s interaction. Instead, the system must dynamically narrow the agent pool to those relevant to the user\u2019s query, while gracefully handling out-of-scope requests.<\/p>\n<p>Systems of this nature often involve multi-turn clarification, where the assistant may need to ask clarifying questions such as \u201cWhich order?\u201d. This complexity should also be considered.<\/p>\n<h2>Solution Design<\/h2>\n<p>Generally, multi-agent systems define a static orchestration workflow. However, in our use case, the relevant agents must be determined at runtime, based on the user\u2019s intent.<\/p>\n<p>Our solution design centers on &#8220;dynamic agent selection&#8221; and orchestration\u2014optimizing both accuracy and performance.<\/p>\n<h3>Challenge 1: Narrowing Down the Universe of Agents<\/h3>\n<p>Including all agents every time is expensive and inefficient. Here we can leverage a semantic cache\u2013based retrieval layer to solve this.<\/p>\n<p>By embedding every agent&#8217;s name (like &#8220;OrderTrackingAgent&#8221;) along with diverse sample utterances (like &#8220;Track my recent order&#8221;, &#8220;Give me a status summary of my last 3 orders&#8221;) and indexing those vectors in Azure AI Search, incoming queries can be embedded and compared via similarity scores. 
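<\/p>
<p>As a minimal, self-contained sketch of this retrieval step (the toy <code>embed<\/code> function below stands in for <code>text-embedding-3-small<\/code>, and an in-memory dictionary stands in for the Azure AI Search index; all names here are illustrative):<\/p>

```python
import math

def embed(text: str) -> list[float]:
    """Toy character-frequency embedding, for illustration only.
    In production, vectors would come from an embedding model such as
    text-embedding-3-small."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already normalized, so the dot product is the cosine score.
    return sum(x * y for x, y in zip(a, b))

# Each agent is indexed under several diverse sample utterances.
AGENT_UTTERANCES = {
    "OrderTrackingAgent": ["Track my recent order", "Where is my package"],
    "ReturnsAgent": ["I want to return an item", "Start a refund"],
    "PromotionsAgent": ["What are today's deals", "Show current discounts"],
}

def top_agents(query: str, k: int = 2, threshold: float = 0.5) -> list[str]:
    """Return up to k agents whose best utterance similarity beats threshold;
    the threshold lets clearly out-of-scope queries match no agent at all."""
    q = embed(query)
    best = {
        agent: max(cosine(q, embed(u)) for u in utterances)
        for agent, utterances in AGENT_UTTERANCES.items()
    }
    ranked = sorted(best.items(), key=lambda kv: -kv[1])
    return [agent for agent, score in ranked[:k] if score >= threshold]
```

<p>In production, <code>top_agents<\/code> becomes a vector query against the Azure AI Search index, but the shape of the logic is the same: embed the query, score it against every agent\u2019s sample utterances, and keep only the top matches above a confidence threshold.<\/p>
<p>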
This improves agent selection accuracy, narrowing the pool to only a handful of agents that match the user&#8217;s query.<\/p>\n<p>For example:<\/p>\n<ul>\n<li>\u201cTrack my recent order\u201d \u2192 <code>OrderTrackingAgent<\/code><\/li>\n<li>\u201cI want to return an item\u201d \u2192 <code>ReturnsAgent<\/code><\/li>\n<li>\u201cRecommend a laptop under $1000\u201d \u2192 <code>ProductRecommendationAgent<\/code><\/li>\n<li>\u201cWhat are today\u2019s deals?\u201d \u2192 <code>PromotionsAgent<\/code><\/li>\n<\/ul>\n<p>Using a pretrained OpenAI embedding model (like <code>text-embedding-3-small<\/code>), we can support multiple languages (such as English and Korean). So, when a user submits a query, we embed it and retrieve the most relevant agents from Azure AI Search, based on their similarity scores.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/11\/semantic_cache_agent_retrieval.webp\" alt=\"semantic_cache_agent_retrieval.png\" \/><\/p>\n<p><strong>Tip:<\/strong> For each agent, add at least five varied sample utterances to the semantic cache to meaningfully improve retrieval accuracy.<\/p>\n<h3>Challenge 2: Onboarding Agents<\/h3>\n<p>A scalable system needs a standardized, repeatable onboarding path for new agents, especially when integrating numerous agents. Here, we propose two agent onboarding approaches:<\/p>\n<ul>\n<li><strong>Code-Based<\/strong>: Agents are defined in the programming language of the chosen framework. 
For example, here is how a Python-based agent is defined using Microsoft&#8217;s <a href=\"https:\/\/learn.microsoft.com\/en-us\/semantic-kernel\/frameworks\/agent\/agent-orchestration\/?pivots=programming-language-python\">Semantic Kernel<\/a> library.\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/11\/python_based_agent.webp\" alt=\"Python based agents\" \/><\/li>\n<li><strong>Template-Based<\/strong>: Agents are created declaratively using a configuration language like YAML. This abstracts away the programming language or framework constructs and retains only essential agent attributes. This pattern is ideal for defining agents that are simple in nature and vary only in their metadata fields (such as descriptions, LLM prompts, etc.). Template-based agents also make it easy to introduce third-party agents into the system via a configuration-based repository. Here is an example of defining a template-based agent using YAML:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/11\/template_based_agent.webp\" alt=\"Template based agents\" \/><\/li>\n<\/ul>\n<p><strong>Note:<\/strong> Agent onboarding is an important aspect of your application. The process should be described in a Standard Operating Procedure (SOP) detailing all the necessary steps. For example, in our discussed scenario, one such step would be adding the relevant embeddings into the Semantic Cache for that specific agent.<\/p>\n<h3>Challenge 3: Creating Agent Objects<\/h3>\n<p>The Factory Design Pattern is a well-established approach for creating objects without having to specify the exact class of the object to be created. 
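<\/p>
<p>As a hedged sketch (the class names, file layout, and registry shape below are assumptions for illustration, not the actual implementation), an <code>AgentFactory<\/code> that prefers template definitions over code-based ones could look like this:<\/p>

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Agent:
    """Minimal stand-in for a real framework agent object."""
    name: str
    source: str  # "template" or "code"

class AgentFactory:
    """Creates agents by name; template (YAML) definitions take priority
    over code-based ones, so behavior can be overridden via configuration."""

    def __init__(self, template_dir: Path, code_registry: dict):
        self.template_dir = template_dir    # holds files like x_agent.yaml
        self.code_registry = code_registry  # name -> zero-arg agent constructor

    def create(self, name: str) -> Agent:
        yaml_path = self.template_dir / f"{name.lower()}.yaml"
        if yaml_path.exists():
            # A real factory would parse the YAML into prompts, descriptions,
            # etc.; parsing is elided in this sketch.
            return Agent(name=name, source="template")
        if name in self.code_registry:
            return self.code_registry[name]()
        raise KeyError(f"No definition found for agent '{name}'")
```

<p>With this priority rule, dropping an <code>x_agent.yaml<\/code> file into the template directory overrides a code-based <code>X_Agent<\/code> without touching the codebase.<\/p>
<p>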
This pattern is particularly useful in scenarios where the system needs to manage and instantiate a variety of objects dynamically.<\/p>\n<p>As our agents&#8217; definitions can exist either as code (e.g., Python) or as configuration templates (e.g., YAML files), we can leverage the Factory Design Pattern to define an <code>AgentFactory<\/code>. Given the name of an agent, this factory transparently creates the corresponding agent object.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/11\/agent_factory_xl.webp\" alt=\"Agent Factory\" \/><\/p>\n<p>The <code>AgentFactory<\/code> can further be designed to give priority to agents of one type over another. For example, if both <code>X_Agent.py<\/code> &amp; <code>x_agent.yaml<\/code> are defined, our <code>AgentFactory<\/code> can prioritize creating the agent from the template over the code. This approach allows any agent implementation to be easily overridden by adding a new configuration-based definition instead of changing the code.<\/p>\n<h3>Challenge 4: Orchestrating Agent Group Chat<\/h3>\n<p>Our multi-agent system is ready to identify relevant agents for a group chat, thanks to the <code>Semantic Cache<\/code>. However, there is still a challenge with multi-intent queries involving multiple agents.<\/p>\n<p>In particular, instead of returning individual agent responses, we want to deliver a single summarized and coherent response from our system.<\/p>\n<p>Additionally, the group chat should also follow the sequence of the user&#8217;s query. 
If the query is &#8220;If X then do Y,&#8221; it is important to invoke the X Agent first, followed by the Y Agent.<\/p>\n<p>To address these challenges, a special agent called <code>SupervisorAgent<\/code> is introduced.<\/p>\n<p>The <code>SupervisorAgent<\/code> is the central orchestrator: it parses the user\u2019s intent, applies a selection strategy to pick the next agent hand\u2011off, manages an iterative loop until all relevant agents have contributed or clarification is needed, and then ends the exchange via its termination strategy.<\/p>\n<p>Some orchestration frameworks, such as Microsoft Semantic Kernel&#8217;s <a href=\"https:\/\/learn.microsoft.com\/en-us\/semantic-kernel\/frameworks\/agent\/agent-orchestration\/group-chat?pivots=programming-language-python\">Group Chat<\/a>, expose <code>Selection<\/code> and <code>Termination<\/code> as explicit strategy hooks. If your framework lacks those hooks or you follow a different orchestration pattern, you can still implement equivalent selection and termination behavior by encoding that logic in the <code>SupervisorAgent<\/code>\u2019s LLM instructions.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/11\/supervisor_agent_xl.webp\" alt=\"Supervisor Agent Led Group chat\" \/><\/p>\n<h3>Bringing It All Together<\/h3>\n<p>Putting all the components together:<\/p>\n<ol>\n<li><code>Semantic Cache-Based Retrieval<\/code>: Efficiently narrows down the universe of agents to those relevant to the user\u2019s query.<\/li>\n<li><code>Agent Onboarding<\/code>: Simplifies the integration of new agents via code-based or template-based approaches.<\/li>\n<li><code>AgentFactory<\/code>: Dynamically creates agent objects using the Factory Design Pattern.<\/li>\n<li><code>SupervisorAgent<\/code>: Orchestrates group chats, maintaining the sequence of user queries while ensuring coherent responses.<\/li>\n<\/ol>\n<p><img decoding=\"async\" 
src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/11\/group_chat_orchestration.webp\" alt=\"Group Chat Orchestration\" \/><\/p>\n<p>By integrating these components, we achieve a robust multi-agent system capable of dynamically responding to complex user queries with low latency and high accuracy.<\/p>\n<h2>Optimizations<\/h2>\n<p>With the foundational multi-agent system now in place, let&#8217;s look at some areas where we optimized to enhance the overall performance and user experience of the system even further:<\/p>\n<h3>Optimizing for Single-Intent Queries<\/h3>\n<p>We realise that the <code>Selection Strategy<\/code> involving LLM reasoning is not necessary when only one agent is required for a single-intent query with a very high confidence score from the <code>Semantic Cache<\/code>, hence in such cases we simply invoke that agent directly with the given query instead of going via the orchestration path involving <code>SupervisorAgent<\/code>.<\/p>\n<h3>Chattiness Control<\/h3>\n<p>During our development, we also observed an issue of our system becoming too &#8220;chatty&#8221;, where agents continued to converse in loops without concluding the conversation. Generally, the orchestration frameworks expose a property like <code>max_iterations<\/code> to control the group chat iterations.<\/p>\n<p>For example, if we plan to support a maximum of two intents in a user&#8217;s query, the <code>max_iterations<\/code> property can be set around 3, considering interactions with other agents like the <code>SupervisorAgent<\/code>. However, if chattiness persists due to other reasons, debugging should be conducted to understand the interaction dynamics among agents and identify the root cause of the issue.<\/p>\n<h3>LLM Parameter Tuning<\/h3>\n<p>It is crucial to fine-tune the respective LLM parameters to ensure optimal performance and consistency in responses. 
Key parameters to consider include <code>temperature<\/code>, <code>top_p<\/code>, and <code>max_completion_tokens<\/code>.<\/p>\n<p>For example:<\/p>\n<ul>\n<li><code>temperature<\/code> and <code>top_p<\/code>: Setting these values to 0 ensures consistency in agent responses by reducing randomness.<\/li>\n<li><code>max_completion_tokens<\/code>: Set this based on the expected response length to avoid verbosity, ensuring that the generated text is concise and relevant.<\/li>\n<\/ul>\n<h2>Evaluation and Iteration<\/h2>\n<p>We implemented thorough evaluations\u2014both end-to-end and for individual components. Each agent maintains a golden dataset with ground truth for invocation and responses. We use metrics like <code>recall@k<\/code>, <code>precision@k<\/code>, <code>BLEU scores<\/code>, and <code>relevance<\/code> to assess performance and guide improvements.<\/p>\n<p>As part of agent onboarding, new agents must provide non-overlapping, meaningful descriptions and sample utterances to avoid confusion with other agents. Evaluations help ensure that the overall system quality always meets or exceeds the set benchmark as we add new agents.<\/p>\n<h2>Results and Impact<\/h2>\n<p>This architecture yields a scalable, low-latency multi-agent system with controlled token usage and high selection accuracy. Dynamic retrieval plus Supervisor orchestration enables coherent, ordered responses even for multi-intent queries. Optimizations (the single-intent fast path, chattiness limits, and parameter tuning) further improve efficiency. The integration of these components results in a seamless user experience, providing reliable and coherent responses to user queries.<\/p>\n<h2>Summary<\/h2>\n<p>Scalable multi-agent systems benefit from a few core patterns: semantic retrieval for agent narrowing, standardized onboarding, a factory for flexible instantiation, and a supervising orchestrator for sequencing and summarization. 
These patterns generalize beyond voice assistants to any AI workflow requiring dynamic, intent-driven composition of specialized agents.<\/p>\n<p>As agent ecosystems grow, disciplined evaluation and operational playbooks become the backbone of continued performance.<\/p>\n<h2>References<\/h2>\n<ul>\n<li><a href=\"https:\/\/learn.microsoft.com\/en-us\/semantic-kernel\/frameworks\/agent\/agent-orchestration\/?pivots=programming-language-python\">Semantic Kernel Documentation<\/a><\/li>\n<li><a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/search\/\">Azure AI Search Documentation<\/a><\/li>\n<li><a href=\"https:\/\/platform.openai.com\/docs\/api-reference\/responses\/create\">Open AI Model Parameters<\/a><\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/en-us\/agent-framework\/overview\/agent-framework-overview\">Microsoft Agent Framework<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Practical patterns for designing scalable, high-performing multi-agent systems\u2014grounded in real implementation experience.<\/p>\n","protected":false},"author":138329,"featured_media":16463,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[3400],"class_list":["post-16462","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cse","tag-ise"],"acf":[],"blog_post_summary":"<p>Practical patterns for designing scalable, high-performing multi-agent systems\u2014grounded in real implementation 
experience.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/16462","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/138329"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=16462"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/16462\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/16463"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=16462"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=16462"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=16462"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}