{"id":15666,"date":"2024-09-27T00:00:49","date_gmt":"2024-09-27T07:00:49","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/ise\/?p=15666"},"modified":"2024-09-27T02:51:38","modified_gmt":"2024-09-27T09:51:38","slug":"promptflow-performance-testing-analysis","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/promptflow-performance-testing-analysis\/","title":{"rendered":"PromptFlow Serve &#8211; Benchmark Result Analysis"},"content":{"rendered":"<h2>PromptFlow Serve: Benchmark Result Analysis<\/h2>\n<p>My team have been developing a relatively complex PromptFlow application for the last few months and as a part of getting it ready for production scale, did performance testing and optimisations. This blog post summarises our finding from doing performance testing of various runner options of <code>promptflow-serve<\/code> and our recommendations.<\/p>\n<h2>Test Scenario<\/h2>\n<p>Before testing the entire application, we created a sample flow that mimicked <strong>part<\/strong> of our real flow. It contained a fan-out and fan-in flow to replicate LLM nodes like guardrails running in parallelly and a final node that does an API call to an external dependency. This API call was made to a <strong>mock HTTP API<\/strong>. 
A synthetic delay was added to each of the parallel nodes to replicate LLM calls.<\/p>\n<p>The harness contained these components:<\/p>\n<ul>\n<li>A mock HTTP API that acts as a service used by your PromptFlow flow.<\/li>\n<li>The following PromptFlow flows, hosted using <code>pf-serve<\/code>:\n<ol>\n<li>A synchronous flow hosted using <a href=\"https:\/\/wsgi.readthedocs.io\/en\/latest\/what.html\">WSGI<\/a>.<\/li>\n<li>An asynchronous flow hosted using <a href=\"https:\/\/asgi.readthedocs.io\/en\/latest\">ASGI<\/a> and <a href=\"https:\/\/docs.python.org\/3\/library\/asyncio.html\">async Python functions<\/a> as PromptFlow nodes.<\/li>\n<\/ol>\n<\/li>\n<li>A <a href=\"https:\/\/locust.io\/\">Locust<\/a> load test generator.<\/li>\n<li>A Makefile, scripts and a docker-compose file to orchestrate the tests.<\/li>\n<\/ul>\n<p>The test harness and the example used to create the test scenarios discussed here have been contributed to the PromptFlow repository via a pull request <a href=\"https:\/\/github.com\/microsoft\/promptflow\/pull\/3486\">here<\/a>.<\/p>\n<h3>Flow<\/h3>\n<p>The directed acyclic graph (DAG) for both flows is shown below.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2024\/09\/flow.png\" alt=\"flow\" \/><\/p>\n<ul>\n<li>Parallel <code>nodes 1, 2 and 3<\/code> have a synthetic delay of <code>0.25ms<\/code> to simulate an LLM call like guardrails.<\/li>\n<li>The <code>chat<\/code> node makes an HTTP call to the mock API to simulate a downstream service call.<\/li>\n<\/ul>\n<h3>Host Runner Options, Synchronous and Asynchronous Nodes<\/h3>\n<p>The aim of our selected variations was to test how <code>pf-serve<\/code> behaves when using the default <a href=\"https:\/\/wsgi.readthedocs.io\/en\/latest\/what.html\">WSGI<\/a> runner (<a href=\"https:\/\/gunicorn.org\/\">Gunicorn<\/a>) compared to an <a href=\"https:\/\/asgi.readthedocs.io\/en\/latest\/\">ASGI<\/a> runner (<a href=\"https:\/\/fastapi.tiangolo.com\/\">FastAPI<\/a>) in combination with async PromptFlow nodes.<\/p>\n<p>The load was generated using <a href=\"https:\/\/locust.io\/\">Locust<\/a> and had a maximum of <strong>1000 concurrent users<\/strong> with a <strong>hatch rate of 10<\/strong>. The test was run for <strong>5 minutes<\/strong>. We ran each combination with <strong>8 workers and 8 threads per worker<\/strong>. The test was run on WSL which had <strong>16GB of memory and 8 logical processors<\/strong>. The guidance around concurrency can be found <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/machine-learning\/prompt-flow\/how-to-deploy-to-code?view=azureml-api-2&amp;tabs=managed#configure-concurrency-for-deployment\">here<\/a>.<\/p>\n<p>In hindsight, <code>1000<\/code> concurrent users were exhausting the limited resources available on the host environment. We had not only the test harness but also the mock API running on the same host. Ideally the mock API should have been run elsewhere so it would not interfere with the test harness, but our aim was to <strong>find patterns and bottlenecks rather than precisely measure the maximum achievable throughput<\/strong>.<\/p>\n<p>If you want to find a more accurate number of sustainable concurrent users for your environment, run the included load test against the mock API endpoint provided in the benchmark suite. This will give you a sense of what ASGI (<code>FastAPI<\/code>) can support in your environment without PromptFlow in the mix. You can also use this result as a baseline against which to compare the throughput of the sync and async variants (where PromptFlow is in the mix).<\/p>\n<h3>Sync vs Async Nodes<\/h3>\n<p>The PromptFlow <code>@tool<\/code> annotation supports both sync and async functions. 
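Before looking at the two node implementations, the cost of blocking can be sketched with a stdlib-only toy, independent of PromptFlow (the 0.25 s delay and helper names here are purely illustrative): blocking nodes serialize on one thread, while awaitable nodes overlap on the event loop.

```python
import asyncio
import time

DELAY = 0.25  # seconds; illustrative stand-in for a per-node LLM-call delay

def sync_node() -> None:
    # a blocking node holds the thread for the whole delay
    time.sleep(DELAY)

async def async_node() -> None:
    # an async node yields to the event loop while waiting
    await asyncio.sleep(DELAY)

def run_three_sync() -> float:
    start = time.perf_counter()
    for _ in range(3):
        sync_node()
    return time.perf_counter() - start

async def run_three_async() -> float:
    start = time.perf_counter()
    await asyncio.gather(async_node(), async_node(), async_node())
    return time.perf_counter() - start

sync_total = run_three_sync()                 # roughly 3 x DELAY: the calls serialize
async_total = asyncio.run(run_three_async())  # roughly 1 x DELAY: the waits overlap
print(f"sync={sync_total:.2f}s async={async_total:.2f}s")
```

The same effect, multiplied across workers and concurrent users, is what the benchmark below measures.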
Hence both of the code examples below are valid, but as you will see later, the choice has a massive performance impact because the sync example blocks the thread.<\/p>\n<p>The sync example uses the <a href=\"https:\/\/requests.readthedocs.io\/en\/latest\/\">requests<\/a> library to make a synchronous call to the mock API.<\/p>\n<pre><code class=\"language-python\">import os\r\nimport time\r\n\r\nimport requests\r\nfrom promptflow.core import tool\r\n\r\n@tool\r\ndef my_python_tool(node1: str, node2: str, node3: str) -&gt; dict:\r\n\r\n    start_time = time.time()\r\n\r\n    # make a call to the mock endpoint\r\n    url = os.getenv(\"MOCK_API_ENDPOINT\", None)\r\n    if url is None:\r\n        raise RuntimeError(\"Failed to read MOCK_API_ENDPOINT env var.\")\r\n\r\n    # respond with the service call and tool total times\r\n    response = requests.get(url)\r\n    if response.status_code == 200:\r\n        response_dict = response.json()\r\n        end_time = time.time()\r\n        response_dict[\"pf_node_time_sec\"] = end_time - start_time\r\n        response_dict[\"type\"] = \"pf_dag_sync\"\r\n        return response_dict\r\n    else:\r\n        raise RuntimeError(f\"Failed call to {url}: {response.status_code}\")<\/code><\/pre>\n<p>The async example below uses <a href=\"https:\/\/docs.aiohttp.org\/en\/stable\/\">aiohttp<\/a> to make an async call to the mock API, which allows the node function itself to be async.<\/p>\n<pre><code class=\"language-python\">import os\r\nimport time\r\n\r\nimport aiohttp\r\nfrom promptflow.core import tool\r\n\r\n@tool\r\nasync def my_python_tool(node1: str, node2: str, node3: str) -&gt; dict:\r\n\r\n    start_time = time.time()\r\n\r\n    # make a call to the mock endpoint\r\n    url = os.getenv(\"MOCK_API_ENDPOINT\", None)\r\n    if url is None:\r\n        raise RuntimeError(\"Failed to read MOCK_API_ENDPOINT env var.\")\r\n\r\n    async with aiohttp.ClientSession() as session:\r\n        async with session.get(url) as response:\r\n            
if response.status == 200:\r\n                response_dict = await response.json()\r\n                end_time = time.time()\r\n                response_dict[\"pf_node_time_sec\"] = end_time - start_time\r\n                response_dict[\"type\"] = \"pf_dag_async\"\r\n                return response_dict\r\n            else:\r\n                raise RuntimeError(f\"Failed call to {url}: {response.status}\")<\/code><\/pre>\n<h3>Combinations Tested<\/h3>\n<ul>\n<li>WSGI (<code>gunicorn<\/code>) + Sync PF Nodes<\/li>\n<li>ASGI (<code>fastapi<\/code>) + Async PF Nodes<\/li>\n<\/ul>\n<p>It&#8217;s important to note that <a href=\"https:\/\/microsoft.github.io\/promptflow\/how-to-guides\/deploy-a-flow\/deploy-using-docker.html\">PromptFlow Docker image based deployment<\/a> uses the WSGI serving engine (<code>Flask<\/code>) by default. You must <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/machine-learning\/prompt-flow\/how-to-deploy-to-code?view=azureml-api-2&amp;tabs=managed#use-fastapi-serving-engine-preview\">opt in to use FastAPI<\/a>.<\/p>\n<h2>Test Results<\/h2>\n<table>\n<thead>\n<tr>\n<th><strong>Metric<\/strong><\/th>\n<th style=\"text-align: center;\"><strong>WSGI + Sync Nodes<\/strong><\/th>\n<th style=\"text-align: center;\"><strong>ASGI + Async Nodes<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Request Count<\/strong><\/td>\n<td style=\"text-align: center;\">12,157<\/td>\n<td style=\"text-align: center;\">65,554<\/td>\n<\/tr>\n<tr>\n<td><strong>Failure Count<\/strong><\/td>\n<td style=\"text-align: center;\">1<\/td>\n<td style=\"text-align: center;\">43<\/td>\n<\/tr>\n<tr>\n<td><strong>Median Response Time (ms)<\/strong><\/td>\n<td style=\"text-align: center;\">9,900<\/td>\n<td style=\"text-align: center;\">1,400<\/td>\n<\/tr>\n<tr>\n<td><strong>Average Response Time (ms)<\/strong><\/td>\n<td style=\"text-align: center;\">13,779.82<\/td>\n<td style=\"text-align: center;\">1,546.90<\/td>\n<\/tr>\n<tr>\n<td><strong>Min Response Time 
(ms)<\/strong><\/td>\n<td style=\"text-align: center;\">1.07<\/td>\n<td style=\"text-align: center;\">0.73<\/td>\n<\/tr>\n<tr>\n<td><strong>Max Response Time (ms)<\/strong><\/td>\n<td style=\"text-align: center;\">48,101.42<\/td>\n<td style=\"text-align: center;\">4,212.50<\/td>\n<\/tr>\n<tr>\n<td><strong>Requests\/s<\/strong><\/td>\n<td style=\"text-align: center;\">40.60<\/td>\n<td style=\"text-align: center;\">218.85<\/td>\n<\/tr>\n<tr>\n<td><strong>Failures\/s<\/strong><\/td>\n<td style=\"text-align: center;\">0.0033<\/td>\n<td style=\"text-align: center;\">0.1435<\/td>\n<\/tr>\n<tr>\n<td><strong>50% Response Time (ms)<\/strong><\/td>\n<td style=\"text-align: center;\">9,900<\/td>\n<td style=\"text-align: center;\">1,400<\/td>\n<\/tr>\n<tr>\n<td><strong>66% Response Time (ms)<\/strong><\/td>\n<td style=\"text-align: center;\">17,000<\/td>\n<td style=\"text-align: center;\">1,500<\/td>\n<\/tr>\n<tr>\n<td><strong>75% Response Time (ms)<\/strong><\/td>\n<td style=\"text-align: center;\">22,000<\/td>\n<td style=\"text-align: center;\">1,700<\/td>\n<\/tr>\n<tr>\n<td><strong>80% Response Time (ms)<\/strong><\/td>\n<td style=\"text-align: center;\">24,000<\/td>\n<td style=\"text-align: center;\">1,800<\/td>\n<\/tr>\n<tr>\n<td><strong>90% Response Time (ms)<\/strong><\/td>\n<td style=\"text-align: center;\">34,000<\/td>\n<td style=\"text-align: center;\">2,100<\/td>\n<\/tr>\n<tr>\n<td><strong>95% Response Time (ms)<\/strong><\/td>\n<td style=\"text-align: center;\">40,000<\/td>\n<td style=\"text-align: center;\">2,400<\/td>\n<\/tr>\n<tr>\n<td><strong>98% Response Time (ms)<\/strong><\/td>\n<td style=\"text-align: center;\">47,000<\/td>\n<td style=\"text-align: center;\">2,800<\/td>\n<\/tr>\n<tr>\n<td><strong>99% Response Time (ms)<\/strong><\/td>\n<td style=\"text-align: center;\">48,000<\/td>\n<td style=\"text-align: center;\">3,000<\/td>\n<\/tr>\n<tr>\n<td><strong>99.9% Response Time (ms)<\/strong><\/td>\n<td style=\"text-align: center;\">48,000<\/td>\n<td 
style=\"text-align: center;\">4,000<\/td>\n<\/tr>\n<tr>\n<td><strong>99.99% Response Time (ms)<\/strong><\/td>\n<td style=\"text-align: center;\">48,000<\/td>\n<td style=\"text-align: center;\">4,100<\/td>\n<\/tr>\n<tr>\n<td><strong>100% Response Time (ms)<\/strong><\/td>\n<td style=\"text-align: center;\">48,000<\/td>\n<td style=\"text-align: center;\">4,200<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>Throughput Graph<\/h3>\n<h4>WSGI + Sync Nodes<\/h4>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2024\/09\/sync_result_graph.png\" alt=\"sync result\" \/><\/p>\n<h4>ASGI + Async Nodes<\/h4>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2024\/09\/async_result_graph.png\" alt=\"async result\" \/><\/p>\n<h3>Detailed Comparison<\/h3>\n<h4>Request Count and Throughput<\/h4>\n<ul>\n<li><strong>Request Count:<\/strong> Async variant handled approximately 5.4 times more requests than Sync setup.<\/li>\n<li><strong>Requests\/s:<\/strong> Async variant achieved 218.85 requests\/s compared to Sync setup&#8217;s 40.60 requests\/s. This indicates async variant handled requests more efficiently, possibly due to its asynchronous nature.<\/li>\n<\/ul>\n<h4>Response Time<\/h4>\n<ul>\n<li><strong>Median Response Time:<\/strong> Async variant&#8217;s median response time (1,400 ms) was significantly lower than Sync setup&#8217;s (9,900 ms). This shows that most requests in async variant were completed faster.<\/li>\n<li><strong>Average Response Time:<\/strong> Async variant&#8217;s average response time (1,546.90 ms) was also significantly lower than Sync setup&#8217;s (13,779.82 ms). 
This highlights async variant&#8217;s overall better performance in handling requests.<\/li>\n<li><strong>Min Response Time:<\/strong> Async variant&#8217;s minimum response time (0.73 ms) was slightly lower than Sync setup&#8217;s (1.07 ms), indicating faster handling of the quickest requests.<\/li>\n<li><strong>Max Response Time:<\/strong> Async variant&#8217;s maximum response time (4,212.50 ms) was much lower than Sync setup&#8217;s (48,101.42 ms), suggesting async variant managed peak loads more effectively.<\/li>\n<\/ul>\n<h4>Failure Count and Rate<\/h4>\n<ul>\n<li><strong>Failure Count:<\/strong> Sync setup had only 1 failure, whereas async variant had 43 failures. Despite this, async variant&#8217;s higher request count still resulted in a very low failure rate.<\/li>\n<li><strong>Failures\/s:<\/strong> Sync setup&#8217;s failure rate (0.0033 failures\/s) was lower than async variant&#8217;s (0.1435 failures\/s), although this needs to be considered in the context of the much higher load handled by async variant.<\/li>\n<\/ul>\n<h4>Percentiles<\/h4>\n<ul>\n<li><strong>Sync setup:<\/strong> The higher percentiles (75%, 90%, 95%, etc.) showed very high response times, peaking at 48,000 ms. This indicates Sync setup struggled significantly with higher loads, leading to large delays.<\/li>\n<li><strong>Async variant:<\/strong> The percentiles for async variant were much lower, with the 99.9th percentile at 4,000 ms. This shows async variant provided more consistent performance under load.<\/li>\n<\/ul>\n<h2>Analysis of Performance Differences<\/h2>\n<p>The async variant demonstrated significantly better performance compared to the Sync setup application across all key metrics, handling more requests with lower response times and higher throughput. 
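As a sanity check, the headline multiplier follows directly from the results table above:

```python
# Figures taken from the results table
sync_requests, async_requests = 12_157, 65_554
sync_rps, async_rps = 40.60, 218.85

print(round(async_requests / sync_requests, 2))  # -> 5.39
print(round(async_rps / sync_rps, 2))            # -> 5.39
```

Both ratios agree because the two runs lasted the same five minutes.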
The primary reason for this difference is async variant&#8217;s end-to-end asynchronous nature, which allows it to handle I\/O-bound operations more efficiently and manage multiple requests concurrently, unlike Sync setup&#8217;s synchronous handling. While async variant had a slightly higher failure rate, this was relatively minor considering the much higher load it managed to process effectively.<\/p>\n<h3>Evidence Of Backpressure In The Sync Setup<\/h3>\n<p>Backpressure occurs when a system becomes overwhelmed by the volume of incoming requests and cannot process them quickly enough, leading to increased response times and potential failures. Here are the indicators suggesting backpressure in the sync variant:<\/p>\n<ol>\n<li><strong>High Median and Average Response Times<\/strong>:\n<ul>\n<li>Median Response Time: 9,900 ms<\/li>\n<li>Average Response Time: 13,779.82 ms<\/li>\n<\/ul>\n<p>These high response times indicate that the sync variant is taking a long time to process requests, which is a sign that it is struggling to keep up with the incoming load.<\/li>\n<li><strong>Wide Range of Response Times<\/strong>:\n<ul>\n<li>Min Response Time: 1.07 ms<\/li>\n<li>Max Response Time: 48,101.42 ms<\/li>\n<\/ul>\n<p>The vast difference between the minimum and maximum response times shows that while some requests are processed quickly, others take an excessively long time, suggesting the system is experiencing periods of high load that it cannot handle efficiently.<\/li>\n<li><strong>High Percentile Response Times<\/strong>:\n<ul>\n<li>75% Response Time: 22,000 ms<\/li>\n<li>90% Response Time: 34,000 ms<\/li>\n<li>95% Response Time: 40,000 ms<\/li>\n<li>99% Response Time: 48,000 ms<\/li>\n<li>99.9% Response Time: 48,000 ms<\/li>\n<\/ul>\n<p>The high response times at these percentiles indicate that a significant proportion of requests are delayed, further suggesting the application is overwhelmed.<\/li>\n<li><strong>Max Response Time<\/strong>:\n<ul>\n<li>The maximum 
response time of 48,101.42 ms is extremely high and indicates that under peak load, some requests are waiting an excessively long time to be processed.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Low Requests per Second (Requests\/s)<\/strong>:\n<ul>\n<li>WSGI sync setup: 40.60 Requests\/s<\/li>\n<li>ASGI async setup: 218.85 Requests\/s<\/li>\n<\/ul>\n<p>The sync variant handled far fewer requests per second than the async variant, indicating it is less capable of handling high loads efficiently.<\/li>\n<\/ol>\n<p>These metrics collectively suggest that the <strong>sync variant is experiencing backpressure<\/strong>. The high and variable response times, coupled with the lower throughput, indicate that the application cannot process requests quickly enough under high load, resulting in delays and potentially dropped requests. In contrast, the <strong>async variant demonstrates significantly better performance and is more resilient to high loads<\/strong>, thanks to its asynchronous processing model.<\/p>\n<p>The sync setup blocks a thread while waiting for the nodes to finish executing, and <strong>thread pool exhaustion occurs when all available threads in the thread pool are occupied<\/strong> and new requests cannot be processed until some threads are freed up. This situation leads to backpressure, where incoming requests are delayed or queued because the application server cannot handle them immediately. Increasing the worker and thread count may help, but the setup ultimately still suffers from thread-blocking operations.<\/p>\n<p>While the async setup also shows signs of backpressure, such as higher response times for a small percentage of requests and a non-zero failure rate, it performs significantly better than the sync setup under similar conditions. The asynchronous nature allows it to handle higher loads more efficiently, but it is still not entirely immune to the effects of backpressure when pushed to its limits. 
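The thread-blocking mechanism described above can be reproduced in miniature with the Python stdlib; the worker count and timings here are hypothetical, chosen only to make the queueing visible:

```python
import time
from concurrent.futures import ThreadPoolExecutor

TASK_TIME = 0.2  # seconds each blocking "request" holds a thread

def blocking_request() -> float:
    time.sleep(TASK_TIME)  # stand-in for a synchronous HTTP/LLM call
    return time.perf_counter()

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:  # pool smaller than the load
    futures = [pool.submit(blocking_request) for _ in range(6)]
    finish = sorted(f.result() - start for f in futures)

# Requests complete in waves of two; the last wave queued for roughly
# 2 x TASK_TIME before even starting -- that queueing is the backpressure.
print(f"first done at {finish[0]:.2f}s, last done at {finish[-1]:.2f}s")
```

Swapping the blocking sleep for an awaitable one on an event loop removes the queueing without adding threads, which is exactly the sync-versus-async gap the benchmark exposed.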
The signs of backpressure in <code>FastAPI<\/code> are much less severe compared to those of the synchronous <code>gunicorn<\/code> setup, highlighting its superior performance in handling concurrent requests.<\/p>\n<h2>TL;DR<\/h2>\n<blockquote><p>The experiment showed that the bottleneck in the sync variant was the PromptFlow application itself, whereas with the async variant the limiting factor was the system resources. This is an important learning: the <strong>async option achieves more with the same resources<\/strong>. This may seem obvious, but it is your responsibility as a developer to ensure that the Python functions are async compatible (using the async\/await pattern and picking the right libraries for I\/O) so the PromptFlow flow executor and ASGI hosting can take advantage of it.<\/p><\/blockquote>\n<h2>Bonus Reading: Why Are There Relatively High Network Failures In The Async Variant?<\/h2>\n<p>During the test, the async setup resulted in 43 network errors while the sync setup only had 1.<\/p>\n<p>The observed errors were:<\/p>\n<table>\n<thead>\n<tr>\n<th>Error<\/th>\n<th>Description<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>RemoteDisconnected('Remote end closed connection without response')<\/code><\/td>\n<td>This error is raised on the client when the remote end closes the connection without sending a response. In a high-load environment this can happen when the server is too overwhelmed to respond in time and drops the connection.<\/td>\n<\/tr>\n<tr>\n<td><code>ConnectionResetError(104, 'Connection reset by peer')<\/code><\/td>\n<td>This error indicates that the server forcibly reset the connection. 
This can happen if the server is overwhelmed and cannot handle new incoming connections or if it runs out of resources to maintain open connections.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>These errors can be indicative of resource limits on the host.<\/p>\n<ul>\n<li>CPU Limits: If the server&#8217;s CPU is fully utilized, it might not be able to process incoming requests in a timely manner, leading to clients timing out and closing connections.<\/li>\n<li>Memory Limits: If the server runs out of memory, it might kill processes or fail to accept new connections, leading to connection resets.<\/li>\n<\/ul>\n<p>Remember that we ran the mock API in the same environment, and it required CPU, memory and network resources as well. As mentioned earlier, the mock API was competing for resources with the test harness in this shared environment.<\/p>\n<h3>Explaining The Abrupt Changes In Requests Per Second and Response Time<\/h3>\n<p>If you look at the time series graphs closely, you will notice some abrupt changes in the throughput and response times.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2024\/09\/async_exceptions.png\" alt=\"network errors\" \/><\/p>\n<p>We observed that these occurred at the same timestamps as the network errors mentioned above, further indicating that they were caused by resource limitations.<\/p>\n<h2>Summary Of Findings And Recommendations<\/h2>\n<p>As observed, the async setup demonstrated significantly higher throughput and better response times. 
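Put in relative terms using the request counts from the results table, both failure rates are small:

```python
# Failure and request counts from the results table
sync_failures, sync_requests = 1, 12_157
async_failures, async_requests = 43, 65_554

print(f"sync:  {100 * sync_failures / sync_requests:.4f}%")   # -> 0.0082%
print(f"async: {100 * async_failures / async_requests:.4f}%") # -> 0.0656%
```

So even with 43 network errors, the async setup failed on well under a tenth of a percent of its requests.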
The only metric that was worse in the async setup was the number of network exceptions that occurred, but that was most likely due to the memory and CPU limitations of the environment the test was run in.<\/p>\n<ul>\n<li>Utilise an <a href=\"https:\/\/oxylabs.io\/blog\/httpx-vs-requests-vs-aiohttp\">async-capable HTTP client<\/a> like <code>aiohttp<\/code> or <code>httpx<\/code> when calling downstream APIs or LLM endpoints. This allows you to bubble the async\/await pattern up to the node function level so the PromptFlow flow executor can take advantage of it.<\/li>\n<li>Start using <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/machine-learning\/prompt-flow\/how-to-deploy-to-code?view=azureml-api-2&amp;tabs=managed#use-fastapi-serving-engine-preview\">FastAPI as the runner<\/a> for <code>pf-serve<\/code> by setting the <code>PROMPTFLOW_SERVING_ENGINE=fastapi<\/code> environment variable.<\/li>\n<\/ul>\n<p>These recommendations are easy to implement and should be good defaults for most scenarios. We tested the new flex flow from PromptFlow with an async setup and it behaved similarly to the static DAG-based flow.<\/p>\n<h2>Closing<\/h2>\n<p>You can start using the <a href=\"https:\/\/github.com\/microsoft\/promptflow\/tree\/main\/benchmark\/promptflow-serve\">test harness we developed<\/a> to test your own flow if you haven&#8217;t done any form of throughput testing.<\/p>\n<p>It&#8217;s important that you make evidence-based decisions when it comes to performance optimisations. This ensures you invest the effort in the most critical areas. The approach we took, creating a sample representation of the real flow, allowed us to experiment with and isolate different aspects of a complex system. 
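If you adopt the FastAPI recommendation above for a Docker-based deployment, the opt-in is a single environment variable; the image name and port in this sketch are hypothetical placeholders, only the variable itself comes from the documentation:

```shell
# PROMPTFLOW_SERVING_ENGINE is the documented opt-in switch;
# "my-promptflow-app" and port 8080 are placeholders for your own image/port.
docker run -d -p 8080:8080 \
  -e PROMPTFLOW_SERVING_ENGINE=fastapi \
  my-promptflow-app:latest
```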
This approach needs to be continuous as your application evolves, so you can identify new bottlenecks and make trade-offs where required.<\/p>\n<p><em>The feature image was generated with Bing Image Creator using the prompt &#8220;There is a water stream flowing through a lush grassland. There are boulders blocking the flow and a robot is fixing a meter to measure the flow. View from above. Digital art.&#8221; &#8211; terms can be found <a href=\"https:\/\/www.bing.com\/new\/termsofuse?FORM=GENTOS\">here<\/a>.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this post we discuss how to test the throughput of the PromptFlow pf-serve module and key learnings from doing so. We explore the impact that the different WSGI and ASGI hosting methods have on throughput and performance, and the importance of engineering your Python nodes with the async\/await pattern for I\/O.<\/p>\n","protected":false},"author":162372,"featured_media":15667,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,3451],"tags":[3557,3556,3558,3542,3430,3529,3555],"class_list":["post-15666","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cse","category-ise","tag-async","tag-benchmark","tag-fastapi","tag-llm","tag-performance","tag-promptflow","tag-throughput"],"acf":[],"blog_post_summary":"<p>In this post we discuss how to test the throughput of the PromptFlow pf-serve module and key learnings from doing so. 
We explore the impact that the different WSGI and ASGI hosting methods have on throughput and performance, and the importance of engineering your Python nodes with the async\/await pattern for I\/O.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/15666","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/162372"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=15666"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/15666\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/15667"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=15666"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=15666"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=15666"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}