Introduction
We recently worked on an engagement to build a GenAI gateway with multi-tenancy and quota management as key capabilities. In this blog post, we share what we learned and how we solved the problem using Azure API Management (APIM).
The Problem Statement
The goal was to build a multi-tenant GenAI gateway that load-balances requests across deployments based on each customer’s entitlements (described in the Business Context section) and enforces quota and rate limits accordingly.
The Business Context
Our customer aimed to build a SaaS-based model for their GenAI resources. Resources here refer to the various LLM deployments. For example, in the case of Azure, these are deployments such as GPT-4 Turbo and GPT-3.5.
The business intends to offer these GenAI capabilities to their SaaS customers through a tier-based model as defined below:
- Freemium Tier: This is for customers who wish to explore the service at no cost. Requests are served by deployments with basic capacity and throughput (for instance, in the context of Azure OpenAI, a pay-as-you-go deployment). The same deployment is shared with other Freemium Tier customers.
- Basic Tier: This is aimed at customers who want a better experience (e.g., a higher number of tokens and lower latency) than the Freemium Tier. Requests are therefore handled by deployments with greater capacity and throughput (such as a PTU-based deployment in the context of Azure OpenAI). However, as in the Freemium Tier, a model deployment in the Basic Tier is shared with other Basic Tier customers.
- Premium Tier: This is for customers requiring the most premium experience (i.e., the highest number of tokens and lowest latency) with a dedicated deployment (i.e., no sharing with other customers). Requests are therefore handled by dedicated deployments with the highest capacity and throughput (for example, in the context of Azure OpenAI, a dedicated PTU instance for the specific customer).
Another advantage of offering a tier-based model is to be able to define tier-specific quota and rate limits. For example: Freemium Tier would have the lowest quota and rate limits and Premium Tier would have the highest.
Lastly, there’s the concept of entitlements. We can think of an “entitlement” as access to only specific GenAI resources such as chat, image generation, or embeddings. For instance, a customer with a “chat-based entitlement” is limited to chat-based scenarios only. Similarly, a customer with an “image-based entitlement” only has access to image generation APIs. The business is also able to define different types of entitlements.
Quota and Rate Limits
Before we explore the solution, let’s recap on Quotas and Rate Limits.
Quota
Quotas are mechanisms utilized to regulate consumption over an extended period (for instance, a month). They are allocated according to a subscription model and are defined based on metrics such as the “Number of requests”. Quotas are refreshed at the end of the quota period.
Rate Limits
Rate limits are implemented to safeguard against short and intense spikes of requests and are set for a shorter duration (for instance, a minute). In the context of GenAI Gateways, rate limits are defined on metrics like “Tokens per minute”. Like Quotas, Rate limits are also renewed at the end of the rate limit period.
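To make the distinction concrete, here is a minimal Python sketch of a quota and a rate limit modelled as the same kind of counter applied over very different periods. It is only an illustration of the two concepts, not how APIM implements its policies; the class name `FixedWindowLimit`, the helper `admit`, and all the numbers are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class FixedWindowLimit:
    """Allows up to `limit` units per `period_seconds` window (illustrative only)."""
    limit: int
    period_seconds: int
    window_start: float = 0.0
    used: int = 0

    def try_consume(self, amount: int, now: float) -> bool:
        # Start a fresh window once the current one has elapsed.
        if now - self.window_start >= self.period_seconds:
            self.window_start = now
            self.used = 0
        if self.used + amount > self.limit:
            return False
        self.used += amount
        return True


# Hypothetical numbers: 10,000 requests per 30 days, 1,000 tokens per minute.
monthly_quota = FixedWindowLimit(limit=10_000, period_seconds=30 * 24 * 3600)
tokens_per_minute = FixedWindowLimit(limit=1_000, period_seconds=60)


def admit(tokens: int, now: float) -> bool:
    """A request must fit within both the rate limit and the quota."""
    if not tokens_per_minute.try_consume(tokens, now):
        return False
    return monthly_quota.try_consume(1, now)
```

In this sketch a burst of token-heavy requests is rejected by the per-minute limit long before the monthly request quota is touched, which is the division of labour described above.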
For more details on Quota and Rate Limits, please refer to: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/quota
Solution Approaches
Leveraging existing policies
In Azure API Management it is possible to define policies using the available built-in list of policies. We already have policies for quota and rate limiting. We can leverage these policies as-is. Furthermore, as announced at Microsoft Build, we also have some dedicated policies for Azure OpenAI. For example, in our case instead of using the normal rate-limit policy we can use the azure-openai-token-limit policy. This policy can rate limit the requests based on number of tokens.
Using SubscriptionId as Counter Key
Both the quota and rate limit policies have “-by-key” variants. These variants require a “counter key” (a unique identifier) against which the policies enforce the quota and rate limits.
We can use the “-by-key” variant of these policies for our design and employ APIM’s inherent SubscriptionId as our counter key.
SubscriptionId is a fundamental concept of API Management. It also serves a dual role by acting as the authentication key for API calls. This aligns well with the concept of a unique identifier based on which the quotas can be regulated for a specific customer or tenant.
For more on subscriptions, see the Azure API Management documentation.
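The idea of a counter key can be sketched in a few lines of Python: every distinct key gets its own counter, so one tenant exhausting its allowance does not affect another. This is a simplified illustration rather than APIM’s implementation; `KeyedCallCounter` and the subscription IDs are made up, and the window reset is left out for brevity.

```python
from collections import defaultdict


class KeyedCallCounter:
    """Counts calls per counter key and rejects calls beyond `limit` (sketch only)."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self._calls = defaultdict(int)  # counter key -> calls seen so far

    def allow(self, counter_key: str) -> bool:
        if self._calls[counter_key] >= self.limit:
            return False
        self._calls[counter_key] += 1
        return True


counter = KeyedCallCounter(limit=2)
# "sub-tenant-a" and "sub-tenant-b" are hypothetical subscription IDs.
results = [counter.allow("sub-tenant-a") for _ in range(3)]
other = counter.allow("sub-tenant-b")
```

Here the third call for `sub-tenant-a` is rejected while the first call for `sub-tenant-b` still goes through, because the two keys are counted independently.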
Solution Approach 1 – Defining all Policies at API Level
One approach to design the solution could be to create separate APIs for each tier/entitlement and define the quota/rate limiting policies for each API. However, this would quickly become a maintenance problem. We would be repeating the APIs, and we would also be inflating our APIs with “tier-wise” policies that would be shared across multiple APIs.
Solution Approach 2 – Using APIM’s Products and Product Policies
An alternative solution is to leverage the concept of “Products” of Azure API Management.
We can think of a “Product” as an abstraction, such that it only consists of APIs that should be part of a specific “entitlement”. For example, if we need to define a chat-based entitlement, then we can create a Chat product which contains only the chat APIs.
We can further define tier-wise policies (such as rate limit/quota policies) at the “Product level”. Any API specific policies (such as common headers, emitting metrics, error handling) can be defined at the “API level”.
Lastly, we can create “subscriptions” at the Product level (which also helps in protecting our Products from unauthorized access). The user would include the subscription ID as part of their request when interacting with Azure APIM. Based on the subscription ID from the request, Azure APIM will map it to the relevant Product and apply the Product and API policies accordingly.
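The relationship between subscriptions, Products, APIs and tier-wise limits can be sketched as a small lookup in Python. This is a conceptual model only, not APIM configuration; the product names, subscription keys, API paths and limit values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    name: str
    apis: frozenset          # entitlement: the APIs this product exposes
    tokens_per_minute: int   # tier-wise rate limit (hypothetical value)
    monthly_requests: int    # tier-wise quota (hypothetical value)


# Hypothetical products, each combining an entitlement with a tier.
PRODUCTS = {
    "chat-freemium": Product("chat-freemium", frozenset({"/chat/completions"}), 1_000, 500),
    "chat-premium": Product("chat-premium", frozenset({"/chat/completions"}), 100_000, 1_000_000),
    "image-basic": Product("image-basic", frozenset({"/images/generations"}), 10_000, 50_000),
}

# Each subscription key belongs to exactly one product.
SUBSCRIPTIONS = {
    "key-tenant-a": "chat-freemium",
    "key-tenant-b": "image-basic",
}


def resolve(subscription_key: str, api_path: str) -> Product:
    """Return the product whose policies apply, or raise if access is not allowed."""
    product_name = SUBSCRIPTIONS.get(subscription_key)
    if product_name is None:
        raise PermissionError("unknown subscription key")
    product = PRODUCTS[product_name]
    if api_path not in product.apis:
        raise PermissionError(f"{api_path} is not part of product {product.name}")
    return product
```

For example, `resolve("key-tenant-a", "/chat/completions")` yields the freemium chat product and its limits, while the same key calling `/images/generations` is refused because that API is not part of the product.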
The following high-level diagram illustrates this approach further:
Example of the Solution Design in Azure API Management
Let us walk through some example Products and Policies defined using the above approach to illustrate this further.
API and API Policies
Here’s an example of how an API can be defined with policies specific to the API (i.e., excluding quota and rate limiting policies). We’re also employing Policy fragments to define our various policies, which helps to maintain concise code and prevents the repetition of the same policy code.
Product and Product Policies
Here’s an example of how an APIM Product can be set up with its respective APIs:
Quota and Rate Limit Policies can be implemented at the Product Level for all corresponding APIs that are part of that product.
Lastly, Products are safeguarded and accessed via their respective subscriptions.
Enabling SubscriptionId as an Authentication Key for APIs
SubscriptionId is a built-in APIM security feature that can be turned on to authenticate API calls.
The following snapshot shows how this can be enabled at the API level:
Benefits of the Approach using Products and Policies
- The overall design of the GenAI gateway becomes simple and easy to understand.
- There is a clear separation of concerns between APIs and Products and their respective policies.
- Entities can scale independently. You can create as many Products and APIs as you need and combine them as required.
- The DRY (Don’t Repeat Yourself) principle is adhered to. Thus, there is no duplication of policies and no redundant APIs.
Recommendations and Learnings
In addition to the concept of Products and Policies, we’d also like to discuss some other key learnings and insights that we gained while working with Azure API Management. We found these useful during the design and implementation process:
Using APIM Subscriptions
We discovered that SubscriptionId is a fundamental concept of API Management, and we used it at the centre of our solution design (as a key for quota and rate limit policies and for authentication). The built-in authentication feature allowed us to secure our APIs and Products without requiring any additional code. Without a SubscriptionId, one would have to build this authentication layer. We hence recommend incorporating the concept of SubscriptionId as part of your APIM solution design. SubscriptionId also serves as a crucial element in Observability, and any new future capabilities of APIM would likely integrate with SubscriptionId.
A note on Products and Subscriptions
APIM currently restricts each subscription to a single Product, meaning one subscription can only be linked to one Product. However, one Product can have multiple subscriptions.
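One practical consequence, under the Product-per-entitlement design described in this post, is that a tenant needing two entitlements would need two subscriptions. The following Python sketch illustrates the shape of this relationship; the subscription and product names are hypothetical.

```python
from collections import defaultdict

# subscription -> product: a plain dict models "one product per subscription".
subscription_to_product = {
    "sub-a-chat": "chat-basic",
    "sub-a-image": "image-basic",   # the same tenant needs a second subscription
    "sub-b-chat": "chat-basic",
}

# product -> subscriptions: the reverse view is one-to-many.
product_to_subscriptions = defaultdict(list)
for subscription, product in subscription_to_product.items():
    product_to_subscriptions[product].append(subscription)
```

In this example the `chat-basic` product ends up with two subscriptions, while each subscription still points at exactly one product.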
Policy Fragments
Because there can be numerous policies at both the API and Product level, and the policy code can range from a few lines to many, we recommend defining policies as Policy Fragments.
Using Policy Fragments greatly simplifies the overall policy code, making it more concise and readable. Additionally, it helps maintain a single source of truth for policies, thus adhering to the DRY (Don’t Repeat Yourself) principle.
A note on Policy Fragments Versioning from an Operations (Ops) perspective
Currently Policy Fragments, unlike APIs, don’t support versioning. Therefore, if an API revision is needed and we are using policy fragments, a new policy fragment should be created for testing, or alternatively, policies should be defined inline without using policy fragments.
On APIM Quota and Rate Limit Policies’ Refresh Time
We also observed that, as with any other gateway, the refresh time returned in the response of APIM’s quota and rate limit policies is determined by the gateway’s internal fixed-size window. For instance, with a quota policy allowing 5 calls in 5 minutes, the replenishment time given in the response can be anywhere from 0 to 5 minutes, depending on when the internal 5-minute window started, which could be at any moment.
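The arithmetic behind this observation can be sketched in Python. The sketch assumes back-to-back fixed windows of 5 minutes whose start time is not visible to the caller; `seconds_until_reset` and the timestamps are illustrative and not taken from APIM.

```python
WINDOW_SECONDS = 5 * 60  # quota period from the example: 5 calls per 5 minutes


def seconds_until_reset(window_start: float, now: float) -> float:
    """Time remaining in the fixed window that contains `now`."""
    elapsed = (now - window_start) % WINDOW_SECONDS
    return WINDOW_SECONDS - elapsed


# The caller exhausts the quota at t = 1000 s in both cases; only the
# window start, which the caller cannot see, differs.
early_window = seconds_until_reset(window_start=990.0, now=1000.0)  # window just began
late_window = seconds_until_reset(window_start=710.0, now=1000.0)   # window almost over
```

With the window starting 10 seconds before the quota is exhausted, the caller waits 290 seconds; with the window starting 290 seconds earlier, the wait is only 10 seconds, even though the caller behaved identically in both cases.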
Summary
To summarize, the concept of Products and subscriptions is effective in building a multi-tenant SaaS system with a simple and easy-to-understand design, leveraging the built-in APIM policies for quota and rate limiting.