September 27th, 2024

PromptFlow Serve – Benchmark Result Analysis

My team has been developing a relatively complex PromptFlow application over the last few months, and as part of getting it ready for production scale, we carried out performance testing and optimisation. This blog post summarises our findings from performance testing the various runner options of promptflow-serve, along with our recommendations.

Test Scenario

Before testing the entire application, we created a sample flow that mimicked part of our real flow. It contained a fan-out and fan-in structure to replicate LLM nodes such as guardrails running in parallel, and a final node that makes an API call to an external dependency. This API call was made to a mock HTTP API. A synthetic delay was added to each of the parallel nodes to replicate LLM calls.

The harness contained these components:

  • A mock HTTP API that acts as the downstream service used by the PromptFlow flow (a minimal sketch of this is shown below).
  • The following PromptFlow flows, hosted using pf-serve:
    1. Synchronous flow hosted using WSGI.
    2. Asynchronous flow hosted using ASGI and async Python functions as PromptFlow nodes.
  • Locust load test generator.
  • Makefile, scripts and docker-compose file to orchestrate the tests.

The test harness and the example flows used to create these test scenarios have been contributed to the PromptFlow repository via a pull request here.
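For context, the mock API is just a lightweight ASGI endpoint that responds after a small artificial delay. Below is a minimal sketch of that idea using FastAPI; the route, delay value and response shape are illustrative and not the exact harness code.

import asyncio
import time

from fastapi import FastAPI

app = FastAPI()

# Illustrative delay; the real harness may use a different value.
MOCK_DELAY_SEC = 0.25

@app.get("/")
async def mock_endpoint() -> dict:
    start_time = time.time()
    # Simulate a downstream dependency that takes a while to respond,
    # yielding control to the event loop while "waiting".
    await asyncio.sleep(MOCK_DELAY_SEC)
    return {"mock_api_time_sec": time.time() - start_time}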

Flow

The directed acyclic graph (DAG) for both flows is shown below.

[Figure: flow DAG]

  • Parallel nodes 1, 2 and 3 have a synthetic delay of 0.25ms to simulate an LLM call such as a guardrail check (a minimal sketch of such a node is shown below).
  • The chat node makes an HTTP call to the mock API to simulate a downstream service call.
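As an illustration, one of the parallel nodes can be as simple as the following sketch; the node name and the delay constant are placeholders, and the delay should be set to whatever synthetic latency you want to simulate.

import asyncio

from promptflow.core import tool

# Placeholder value: set this to the synthetic latency you want to simulate.
SYNTHETIC_DELAY_SEC = 0.25

@tool
async def parallel_node(input_text: str) -> str:
    # Pretend to call an LLM (e.g. a guardrail check) by sleeping asynchronously,
    # so the worker thread is not blocked while the "call" is in flight.
    await asyncio.sleep(SYNTHETIC_DELAY_SEC)
    return input_text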

Host Runner Options, Synchronous and Asynchronous Nodes

The aim of our selected variations was to test how pf-serve behaves when using the default WSGI runner (Gunicorn) compared to an ASGI runner (FastAPI), in combination with async PromptFlow nodes.

The load was generated using Locust with a maximum of 1,000 concurrent users and a hatch rate of 10, and each test was run for 5 minutes. We ran each combination with 8 workers and 8 threads per worker. The tests were run on WSL with 16GB of memory and 8 logical processors. The guidance around concurrency can be found here.

In hindsight, 1,000 concurrent users exhausted the limited resources of the host environment: not only the test harness but also the mock API were running on the same host. Ideally the mock API should have been run elsewhere so that it did not interfere with the test harness, but our aim was to find patterns and bottlenecks rather than to be precise about the maximum achievable throughput.

If you want a more accurate figure for the number of concurrent users your environment can sustain, run the included load test against the mock API endpoint provided in the benchmark suite. This will give you a sense of what ASGI (FastAPI) can support in your environment without PromptFlow in the mix. You can also use this result as a baseline when comparing the throughput of the sync and async variants (where PromptFlow is in the mix).
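If you want to reproduce a similar load profile, a minimal Locust user class might look like the sketch below; the endpoint path, payload and host are assumptions you would adjust to match your own flow and deployment.

from locust import HttpUser, constant, task

class PromptFlowUser(HttpUser):
    # No think time between requests, to keep constant pressure on the server.
    wait_time = constant(0)

    @task
    def score(self):
        # Hypothetical payload: replace with the inputs your flow expects.
        self.client.post("/score", json={"question": "ping"})

# Example invocation (host and port are assumptions), roughly matching the test above:
#   locust -f locustfile.py --host http://localhost:8080 -u 1000 -r 10 -t 5m --headless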

Sync vs Async Nodes

The PromptFlow @tool annotation supports both sync and async functions. Hence both of the code examples below are valid, but as you will see later, the choice has a massive performance impact because the sync example blocks the thread.

The sync example uses the requests library to make a synchronous call to the mock API.

import os
import time

import requests
from promptflow.core import tool

@tool
def my_python_tool(node1: str, node2: str, node3: str) -> dict:

    start_time = time.time()

    # make a call to the mock endpoint
    url = os.getenv("MOCK_API_ENDPOINT", None)
    if url is None:
        raise RuntimeError("Failed to read MOCK_API_ENDPOINT env var.")

    # respond with the service call and tool total times
    response = requests.get(url)
    if response.status_code == 200:
        response_dict = response.json()
        end_time = time.time()
        response_dict["pf_node_time_sec"] = end_time - start_time
        response_dict["type"] = "pf_dag_sync"
        return response_dict
    else:
        raise RuntimeError(f"Failed call to {url}: {response.status_code}")

The async example below uses aiohttp to make an async call to the mock API, which allows the node function itself to be async as well.

import os
import time

import aiohttp
from promptflow.core import tool

@tool
async def my_python_tool(node1: str, node2: str, node3: str) -> dict:

    start_time = time.time()

    # make a call to the mock endpoint
    url = os.getenv("MOCK_API_ENDPOINT", None)
    if url is None:
        raise RuntimeError("Failed to read MOCK_API_ENDPOINT env var.")

    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            if response.status == 200:
                response_dict = await response.json()
                end_time = time.time()
                response_dict["pf_node_time_sec"] = end_time - start_time
                response_dict["type"] = "pf_dag_async"
                return response_dict
            else:
                raise RuntimeError(f"Failed call to {url}: {response.status}")

Combinations Tested

  • WSGI (gunicorn) + Sync PF Nodes
  • ASGI (fastapi) + Async PF Nodes

It’s important to note that the PromptFlow Docker image based deployment uses the WSGI runner (Flask) by default. You must opt in to use FastAPI.

Test Results

Metric                       | WSGI + Sync Nodes | ASGI + Async Nodes
-----------------------------|-------------------|-------------------
Request Count                | 12,157            | 65,554
Failure Count                | 1                 | 43
Median Response Time (ms)    | 9,900             | 1,400
Average Response Time (ms)   | 13,779.82         | 1,546.90
Min Response Time (ms)       | 1.07              | 0.73
Max Response Time (ms)       | 48,101.42         | 4,212.50
Requests/s                   | 40.60             | 218.85
Failures/s                   | 0.0033            | 0.1435
50% Response Time (ms)       | 9,900             | 1,400
66% Response Time (ms)       | 17,000            | 1,500
75% Response Time (ms)       | 22,000            | 1,700
80% Response Time (ms)       | 24,000            | 1,800
90% Response Time (ms)       | 34,000            | 2,100
95% Response Time (ms)       | 40,000            | 2,400
98% Response Time (ms)       | 47,000            | 2,800
99% Response Time (ms)       | 48,000            | 3,000
99.9% Response Time (ms)     | 48,000            | 4,000
99.99% Response Time (ms)    | 48,000            | 4,100
100% Response Time (ms)      | 48,000            | 4,200

Throughput Graph

WSGI + Sync Nodes

[Figure: sync variant throughput graph]

ASGI + Async Nodes

[Figure: async variant throughput graph]

Detailed Comparison

Request Count and Throughput

  • Request Count: The async variant handled approximately 5.4 times more requests than the sync setup.
  • Requests/s: The async variant achieved 218.85 requests/s compared to the sync setup’s 40.60 requests/s, indicating it handled requests far more efficiently, largely due to its asynchronous nature.

Response Time

  • Median Response Time: The async variant’s median response time (1,400 ms) was significantly lower than the sync setup’s (9,900 ms), showing that most requests in the async variant completed faster.
  • Average Response Time: The async variant’s average response time (1,546.90 ms) was also significantly lower than the sync setup’s (13,779.82 ms), highlighting its overall better performance in handling requests.
  • Min Response Time: The async variant’s minimum response time (0.73 ms) was slightly lower than the sync setup’s (1.07 ms), indicating faster handling of the quickest requests.
  • Max Response Time: The async variant’s maximum response time (4,212.50 ms) was much lower than the sync setup’s (48,101.42 ms), suggesting the async variant managed peak loads more effectively.

Failure Count and Rate

  • Failure Count: The sync setup had only 1 failure, whereas the async variant had 43 failures. Despite this, the async variant’s much higher request count still resulted in a very low failure rate.
  • Failures/s: The sync setup’s failure rate (0.0033 failures/s) was lower than the async variant’s (0.1435 failures/s), although this needs to be considered in the context of the much higher load handled by the async variant.

Percentiles

  • Sync setup: The higher percentiles (75%, 90%, 95%, etc.) showed very high response times, peaking at 48,000 ms. This indicates the sync setup struggled significantly under higher load, leading to large delays.
  • Async variant: The percentiles for the async variant were much lower, with the 99.9th percentile at 4,000 ms. This shows the async variant provided more consistent performance under load.

Analysis of Performance Differences

The async variant demonstrated significantly better performance than the sync setup across all key metrics, handling more requests with lower response times and higher throughput. The primary reason for this difference is the async variant’s end-to-end asynchronous nature, which allows it to handle I/O-bound operations more efficiently and manage multiple requests concurrently, unlike the sync setup’s blocking request handling. While the async variant had a slightly higher failure rate, this was relatively minor considering the much higher load it processed.

Evidence Of Backpressure In The Sync Setup

Backpressure occurs when a system becomes overwhelmed by the volume of incoming requests and cannot process them quickly enough, leading to increased response times and potential failures. Here are the indicators suggesting backpressure in the sync variant:

  1. High Median and Average Response Times:
    • Median Response Time: 9,900 ms
    • Average Response Time: 13,779.82 ms

    These high response times indicate that the sync variant is taking a long time to process requests, which is a sign that it is struggling to keep up with the incoming load.

  2. Wide Range of Response Times:
    • Min Response Time: 1.07 ms
    • Max Response Time: 48,101.42 ms

    The vast difference between the minimum and maximum response times shows that while some requests are processed quickly, others take an excessively long time, suggesting the system is experiencing periods of high load that it cannot handle efficiently.

  3. High Percentile Response Times:
    • 75% Response Time: 22,000 ms
    • 90% Response Time: 34,000 ms
    • 95% Response Time: 40,000 ms
    • 99% Response Time: 48,000 ms
    • 99.9% Response Time: 48,000 ms

    The high response times at these percentiles indicate that a significant proportion of requests are delayed, further suggesting the application is overwhelmed.

  4. Max Response Time:
    • The maximum response time of 48,101.42 ms is extremely high and indicates that under peak load, some requests are waiting an excessively long time to be processed.
  5. Low Requests per Second (Requests/s):
    • WSGI sync setup: 40.60 Requests/s
    • ASGI async setup: 218.85 Requests/s

    The sync variant handled far fewer requests per second compared to the async variant, indicating it is less capable of handling high loads efficiently.

These metrics collectively suggest that the sync variant is experiencing backpressure. The high and variable response times, coupled with the lower throughput, indicate that the application cannot process requests quickly enough under high load, resulting in delays and potentially dropped requests. In contrast, the async variant demonstrates significantly better performance and is more resilient to high loads, thanks to its asynchronous processing model.

The sync setup blocks a thread while waiting for the nodes to finish executing. Thread pool exhaustion occurs when all available threads in the pool are occupied and new requests cannot be processed until some threads are freed up, which leads to backpressure: incoming requests are delayed or queued because the application server cannot handle them immediately. With 8 workers and 8 threads per worker, at most 64 requests can be in flight at once, so if each blocked request takes on the order of one to two seconds (comparable to the async variant’s median), the throughput ceiling is only a few tens of requests per second, which is roughly consistent with the observed 40.60 requests/s. Increasing the worker and thread count may help, but the setup ultimately still suffers from thread-blocking operations.
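To make the blocking behaviour concrete, here is a small, self-contained sketch (not part of the benchmark harness) that contrasts a fixed-size thread pool performing blocking waits with a single event loop overlapping the same waits:

import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

SIMULATED_IO_SEC = 0.25  # pretend each request spends this long waiting on I/O
NUM_REQUESTS = 64

def blocking_request(_: int) -> None:
    time.sleep(SIMULATED_IO_SEC)  # the thread is blocked for the full wait

async def async_request() -> None:
    await asyncio.sleep(SIMULATED_IO_SEC)  # control returns to the event loop while waiting

def run_blocking(pool_size: int) -> float:
    start = time.time()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        list(pool.map(blocking_request, range(NUM_REQUESTS)))
    return time.time() - start

async def run_async() -> float:
    start = time.time()
    await asyncio.gather(*(async_request() for _ in range(NUM_REQUESTS)))
    return time.time() - start

if __name__ == "__main__":
    # 64 blocking requests on 8 threads need ~8 rounds of waiting (~2s here).
    print(f"blocking, 8 threads: {run_blocking(8):.2f}s")
    # A single event loop overlaps all 64 waits, finishing in roughly one round (~0.25s).
    print(f"async, one event loop: {asyncio.run(run_async()):.2f}s")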

While the async setup also shows signs of backpressure, such as higher response times for a small percentage of requests and a non-zero failure rate, it performs significantly better than the sync setup under the same conditions. Its asynchronous nature allows it to handle higher loads more efficiently, but it is still not entirely immune to backpressure when pushed to its limits. The signs of backpressure in the FastAPI setup are much less severe than in the synchronous Gunicorn setup, highlighting its superior handling of concurrent requests.

TL;DR

The experiment showed that the bottleneck in the sync variant was the PromptFlow application itself, whereas in the async variant the limiting factor was the system resources. This is an important lesson: the async option achieves more with the same resources. It may seem obvious, but there is a responsibility on your part as a developer to ensure that the Python functions are async compatible (using the async/await pattern and picking the right libraries for I/O) so that the PromptFlow flow executor and ASGI hosting can take advantage of it.

Bonus Reading: Why Are There Relatively High Network Failures In The Async Variant?

During the test, the async setup resulted in 44 network errors while the sync setup had only 1.

The observed errors were:

  • RemoteDisconnected('Remote end closed connection without response'): This error occurs when the client closes the connection abruptly. In a high-load environment, this can happen if the server is too slow to respond, causing the client to time out and close the connection.
  • ConnectionResetError(104, 'Connection reset by peer'): This error indicates that the server closed the connection without sending a response. This can happen if the server is overwhelmed and cannot handle new incoming connections, or if it runs out of resources to maintain open connections.

These can be indicative of resource limits on the host.

  • CPU Limits: If the server’s CPU is fully utilized, it might not be able to process incoming requests in a timely manner, leading to clients timing out and closing connections.
  • Memory Limits: If the server runs out of memory, it might kill processes or fail to accept new connections, leading to connection resets.

Remember that we ran the mock API in the same environment, and it required CPU, memory and network resources as well. As mentioned earlier, the mock API was competing for resources with the test harness in this shared environment.

Explaining The Abrupt Changes In Requests Per Second and Response Time

If you look at the time series graph closely, you will notice some abrupt changes in the throughput and response times.

[Figure: abrupt changes in throughput coinciding with network errors]

We observed that these occurred at the same timestamps as the network errors mentioned above, further indicating that they were caused by resource limitations.

Summary Of Findings And Recommendations

As observed, the async setup demonstrated significantly higher throughput and better response times. The only metric that was worse in the async setup was the number of network exceptions, and that was most likely due to the memory and CPU limitations of the environment the tests were run in.

  • Utilise an async-capable HTTP client such as aiohttp or httpx when calling downstream APIs or LLM endpoints. This allows you to bubble the async/await pattern up to the node function level so the PromptFlow flow executor can take advantage of it (an httpx sketch is shown after this list).
  • Start using FastAPI as the runner for pf-serve by setting the PROMPTFLOW_SERVING_ENGINE=fastapi environment variable.
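As an example of the first recommendation, the aiohttp node shown earlier could equally be written with httpx. The following is a minimal sketch rather than the exact harness code; the type label is illustrative.

import os
import time

import httpx
from promptflow.core import tool

@tool
async def my_python_tool(node1: str, node2: str, node3: str) -> dict:

    start_time = time.time()

    # make a call to the mock endpoint
    url = os.getenv("MOCK_API_ENDPOINT", None)
    if url is None:
        raise RuntimeError("Failed to read MOCK_API_ENDPOINT env var.")

    # httpx.AsyncClient awaits the request instead of blocking the worker thread
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        response.raise_for_status()
        response_dict = response.json()
        response_dict["pf_node_time_sec"] = time.time() - start_time
        response_dict["type"] = "pf_dag_async_httpx"  # illustrative label
        return response_dict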

These recommendations are easy to implement and should be good defaults for most scenarios. We also tested PromptFlow’s new flex flow with an async setup, and it behaved similarly to the static DAG based flow.

Closing

If you haven’t done any form of throughput testing, you can use the test harness we developed to test your own flow.

It’s important to make evidence-based decisions when it comes to performance optimisation; this ensures you invest effort in the most critical areas. The approach we took, creating a sample representation of the flow, allowed us to experiment with and isolate different aspects of a complex system. This needs to be a continuous practice as your application evolves, so you can identify new bottlenecks and make trade-offs where required.

The feature image was generated using Bing Image Creator using prompt “There is a water stream flowing through a lush grassland. There are boulders blocking the flow and a robot is fixing a meter to measure the flow. View from above. Digital art.” – Terms can be found here.