Introduction

We recently worked on an engagement to build a GenAI gateway with multi-tenancy and quota management as key capabilities. In this post, we share what we learned while solving this problem with Azure API Management (“APIM” for short).

The Problem Statement

The goal was to build a multi-tenant GenAI gateway that load-balances requests across deployments based on each customer’s entitlements (described in the Business Context section) and enforces quotas and rate limits accordingly.

The Business Context

Our customer aimed to build a SaaS-based model for their GenAI resources. Resources here refer to the various LLM deployments. For example, in the case of Azure OpenAI, these are deployments of models such as GPT-4 Turbo and GPT-3.5.

The business intends to offer these GenAI capabilities to their SaaS customers through a tier-based model as defined below:

Freemium Tier: This is for customers who wish to explore the service at no cost. Requests are served by deployments with basic capacity and throughput (for instance, in the context of Azure OpenAI, a pay-as-you-go deployment). The same deployment is shared with other Freemium Tier customers.
Basic Tier: This is aimed at customers who want a better experience than the Freemium Tier (e.g., a higher number of tokens and lower latency). Requests are therefore handled by deployments with greater capacity and throughput (such as a deployment based on provisioned throughput units, or PTUs, in the context of Azure OpenAI). As in the Freemium Tier, a model deployment in the Basic Tier is shared with other Basic Tier customers.
Premium Tier: This is for customers who require the best experience (i.e., the highest number of tokens and the lowest latency) with a dedicated deployment that is not shared with other customers. Requests are therefore handled by dedicated deployments with the highest capacity and throughput (for example, in the context of Azure OpenAI, a PTU instance dedicated to that customer).

Another advantage of a tier-based model is the ability to define tier-specific quotas and rate limits. For example, the Freemium Tier would have the lowest quota and rate limits, and the Premium Tier the highest.

Lastly, there is the concept of entitlements. We can think of an “entitlement” as access to only specific GenAI resources such as chat, image generation, or embeddings. For instance, a customer with a “chat-based entitlement” is limited to chat-based scenarios only. Similarly, a customer with an “image-based entitlement” only has access to image generation APIs. The business can also define further types of entitlements.

Quota and Rate Limits

Before we explore the solution, let’s recap quotas and rate limits.

Quota

Quotas are mechanisms utilized to regulate consumption over an extended period (for instance, a month). They are allocated according to a subscription model and are defined based on metrics such as the “Number of requests”. Quotas are refreshed at the end of the quota period.

Rate Limits

Rate limits safeguard against short, intense spikes of requests and are set for a shorter duration (for instance, a minute). In the context of GenAI gateways, rate limits are defined on metrics like “Tokens per minute”. Like quotas, rate limits are renewed at the end of their period.

For more details on Quota and Rate Limits, please refer to: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/quota

Solution Approaches

Leveraging existing policies

Azure API Management ships with a list of built-in policies, including policies for quotas and rate limiting, which we can use as-is. Furthermore, as announced at Microsoft Build, there are also some dedicated policies for Azure OpenAI. For example, in our case, instead of the general rate-limit policy we can use the azure-openai-token-limit policy, which rate limits requests based on the number of tokens.

Using SubscriptionId as Counter Key

Both the quota and rate limit policies have “-by-key” variants. These variants require a “counter key” (a unique identifier) against which the quota and rate limits are enforced. We can use the “-by-key” variants in our design and employ APIM’s built-in SubscriptionId as the counter key. Subscriptions are a fundamental concept of API Management and play a dual role: the subscription identifies the caller, and its subscription key authenticates API calls. This aligns well with the need for a unique identifier against which quotas can be regulated for a specific customer or tenant.
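To make the idea of a counter key concrete, here is a minimal Python sketch of per-key counting. It is only a simplified model for illustration, not APIM's implementation: the class name, the subscription IDs, and the limit of 2 calls per 60 seconds are all hypothetical.

```python
from collections import defaultdict


class KeyedFixedWindowCounter:
    """Counts calls per counter key inside fixed windows of `period_s` seconds."""

    def __init__(self, limit, period_s):
        self.limit = limit
        self.period_s = period_s
        # (counter_key, window_index) -> calls seen in that window
        self._counts = defaultdict(int)

    def allow(self, counter_key, now_s):
        """Return True and count the call if `counter_key` is under its limit."""
        window = int(now_s // self.period_s)
        if self._counts[(counter_key, window)] >= self.limit:
            return False
        self._counts[(counter_key, window)] += 1
        return True


# Hypothetical subscription IDs used as counter keys.
limiter = KeyedFixedWindowCounter(limit=2, period_s=60)
results = [
    limiter.allow("sub-tenant-a", now_s=0),   # first call for tenant A
    limiter.allow("sub-tenant-a", now_s=10),  # second call for tenant A
    limiter.allow("sub-tenant-a", now_s=20),  # tenant A has used up its window
    limiter.allow("sub-tenant-b", now_s=20),  # tenant B has its own counter
    limiter.allow("sub-tenant-a", now_s=60),  # a new window starts for tenant A
]
print(results)
```

Because each counter is keyed by the subscription, one tenant exhausting its limit does not affect another tenant's calls.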

Solution Approach 1 – Defining all Policies at API Level

One approach would be to create separate APIs for each tier/entitlement and define the quota/rate limiting policies on each API. However, this would quickly become a maintenance problem. We would be duplicating the same APIs for every tier, and we would also be bloating them with “tier-wise” policies that are repeated across multiple APIs.

Solution Approach 2 – Using APIM’s Products and Product Policies

An alternative is to use the “Products” concept of Azure API Management.

We can think of a “Product” as an abstraction, such that it only consists of APIs that should be part of a specific “entitlement”. For example, if we need to define a chat-based entitlement, then we can create a Chat product which contains only the chat APIs.

[Figure: Products and APIs]

We can further define tier-wise policies (such as rate limit/quota policies) at the “Product level”. Any API specific policies (such as common headers, emitting metrics, error handling) can be defined at the “API level”.

Lastly, we can create “subscriptions” at the Product level (which also helps protect our Products from unauthorized access). The user includes the subscription key in each request to Azure APIM. Based on the key in the request, Azure APIM identifies the subscription, maps it to the relevant Product, and applies the Product and API policies accordingly.
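The mapping from subscription to Product to policies can be sketched in a few lines of Python. This is a conceptual model only, not how APIM is implemented or configured: the product names, subscription keys, and limit values below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    name: str
    apis: frozenset          # entitlement: the APIs this product exposes
    tokens_per_minute: int   # tier-wise rate limit (product-level policy)
    requests_per_month: int  # tier-wise quota (product-level policy)


# Hypothetical products: one entitlement/tier combination each.
FREE_CHAT = Product("free-chat", frozenset({"chat"}), 1_000, 10_000)
PREMIUM_IMAGE = Product("premium-image", frozenset({"image"}), 50_000, 1_000_000)

# Each subscription key belongs to exactly one product;
# a product may have many subscriptions.
SUBSCRIPTIONS = {
    "key-tenant-a": FREE_CHAT,
    "key-tenant-b": PREMIUM_IMAGE,
}


def resolve(subscription_key, api):
    """Return the product whose policies apply, or reject the call."""
    product = SUBSCRIPTIONS.get(subscription_key)
    if product is None:
        raise PermissionError("unknown subscription key")
    if api not in product.apis:
        raise PermissionError(f"{api!r} is not part of product {product.name!r}")
    return product


product = resolve("key-tenant-a", "chat")
print(product.name, product.tokens_per_minute)
```

In this model, a tenant holding a chat subscription is rejected when calling an image API, and the limits that apply to an accepted call come from the Product rather than from the API itself.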

The following high-level diagram illustrates this approach further:

[Figure: Quota Management using Products]

Example of the Solution Design in Azure API Management

Let us walk through some example Products and policies defined using the above approach.

API and API Policies

Here’s an example of how an API can be defined with policies specific to the API (i.e., excluding quota and rate limiting policies). We’re also employing Policy fragments to define our various policies, which helps to maintain concise code and prevents the repetition of the same policy code.

[Figure: APIM APIs]

Product and Product Policies

Here’s an example of how an APIM Product can be set up with its respective APIs:

[Figure: Free-Chat-Product-APIs]

Quota and rate limit policies can be implemented at the Product level for all APIs that are part of that Product:

[Figure: Free-Chat-Product-Policies]

Lastly, Products are safeguarded and accessed via their respective subscriptions:

[Figure: APIM-Product-Subscriptions]

Enabling `SubscriptionId` as an Authentication Key for APIs

The subscription check is a built-in APIM security feature that can be switched on to authenticate API calls.

The following snapshot shows how it can be enabled at the API level:

[Figure: APIM-Subscription-check]

Benefits of the Approach using Products and Policies

The overall design of the GenAI gateway becomes simple and easy to understand.
There is a clear separation of concerns between APIs and Products and their respective policies.
Entities can scale independently. You can create as many Products and APIs as you need and combine them as required.
The DRY (Don’t Repeat Yourself) principle is adhered to: there is no duplication of policies and there are no redundant APIs.

Recommendations and Learnings

In addition to the concept of Products and policies, we’d also like to share some other key learnings and insights from working with Azure API Management. We found these useful during design and implementation:

Using APIM Subscriptions

We found subscriptions to be a fundamental concept of API Management, and we placed SubscriptionId at the centre of our solution design (as the counter key for quota and rate limit policies, with the associated subscription key used for authentication). The built-in authentication feature allowed us to secure our APIs and Products without writing any additional code; without it, we would have had to build this authentication layer ourselves. We therefore recommend incorporating SubscriptionId into your APIM solution design. SubscriptionId is also a crucial element in observability, and future APIM capabilities would likely integrate with it.

A note on Products and Subscriptions

APIM currently allows a subscription to be linked to only one Product, meaning a single subscription cannot span multiple Products. However, one Product can have multiple subscriptions.

Policy Fragments

Because there can be numerous policies at both the API and Product level, and the policy code can range from a few lines to many, we recommend defining policies as policy fragments.

Policy fragments greatly simplify the overall policy code, making it more concise and readable. They also provide a single source of truth for each policy, in line with the DRY (Don’t Repeat Yourself) principle.

A note on Policy Fragments Versioning from an Operations (Ops) perspective

Currently, policy fragments, unlike APIs, don’t support versioning. Therefore, if an API revision is needed and it uses policy fragments, either create a new policy fragment for testing or define the policies inline in that revision without using policy fragments.

On the Refresh Time of APIM Quota and Rate Limit Policies

We also observed that, as with other gateways that use fixed windows, the refresh time returned in the response of APIM’s quota and rate limit policies is determined by the gateway’s internal fixed-size window. For instance, with a quota policy allowing 5 calls in 5 minutes, the replenishment time given in the response can be anywhere from 0 to 5 minutes. It depends on when the internal 5-minute window started, which could be at any moment relative to the caller’s first request.
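The arithmetic behind this observation can be sketched as follows. This is a simplified model of a fixed window, assuming the window start is known; the function name and the timestamps are hypothetical, and APIM's internal bookkeeping may differ.

```python
def seconds_until_refresh(window_start_s, period_s, now_s):
    """Time left until the fixed window containing `now_s` rolls over."""
    elapsed = (now_s - window_start_s) % period_s
    return period_s - elapsed


PERIOD_S = 300  # the 5-minute quota period from the example above

# The limit is hit shortly after the window opened: a long wait.
early = seconds_until_refresh(window_start_s=0, period_s=PERIOD_S, now_s=10)

# The limit is hit just before the window closes: a very short wait.
late = seconds_until_refresh(window_start_s=0, period_s=PERIOD_S, now_s=299)

print(early, late)
```

The same 5 calls can therefore lead to a wait of almost the full period or almost none, depending only on where in the window the limit was reached.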

Summary

To summarize, the concept of Products and subscriptions is effective in building a multi-tenant SaaS system with a simple and easy-to-understand design, leveraging the built-in APIM policies for quota and rate limiting.

Building a Multi-tenant GenAI gateway using APIM
