Problem Statement
In our recent engagement, we helped a customer develop a Copilot chatbot solution using the Retrieval Augmented Generation (RAG) pattern. One key benefit of this pattern is the ability to retrieve relevant data or documents before generating a response. To ensure the accuracy of the system, we collect ground truth data to evaluate against. At the most basic level, a ground truth file contains a question-and-answer pair. A more complex ground truth can also contain chat history as well as annotations about the question or answer.
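For illustration only (the field names here are hypothetical and not the customer's actual schema), a single ground truth record could look like this:

# Illustrative ground truth record; field names are hypothetical.
ground_truth_example = {
    "question": "How do I reset my device to factory settings?",
    "answer": "Open Settings > System > Reset, then choose 'Factory reset'.",
    "chat_history": [],  # optional prior turns for multi-turn scenarios
    "annotations": {"topic": "device-reset", "difficulty": "easy"},
}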
With evaluation, we compare the answer from the language model to the ground truth answer in relation to the question. The purpose of the evaluation platform is to help an Engineer or Data Scientist run experiments to validate the best configurations, search index, language models, and prompts.
Before we discuss the requirements for the Evaluation platform, we should note that there are other tasks that we, as Engineers or Data Scientists, need to perform after an experiment, such as summarizing and aggregating the evaluation results for interpretation. This calls for capabilities such as comparing against previous experiment runs and in-depth filtering of the results. We are not covering these here, but they should be considered.
Consider the following steps for running an experiment. These steps cover only the mechanics of running an experiment; before running one, we should also define a hypothesis and the experiment parameters, which we are not covering here.
- Inference – A system that, when given a question, finds relevant articles via a search mechanism, as in the Retrieval-Augmented Generation (RAG) pattern, and then provides those references to an LLM to generate an appropriate answer. For evaluation, the output includes not just the answer, but also which documents were retrieved, which documents were used, which configuration was used (including prompts), and possibly even workflow step details.
- Evaluation – A system that, when given inference output, can determine, through metrics, the quality of the answer delivered by the inference system. Metrics can be computed using deterministic means, such as a mathematical algorithm, or non-deterministic means, such as having another LLM judge the result. It produces an evaluation result that contains the value of each metric and sometimes the justification for why the value was chosen (an illustrative shape is sketched after this list).
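For illustration only (metric and field names are hypothetical), an evaluation result for a single ground truth entry could take a shape like this:

# Illustrative evaluation result; metric names and fields are hypothetical.
evaluation_result_example = {
    "question_id": "gt-0042",
    "metrics": {
        "groundedness": {"score": 4, "justification": "The answer is supported by retrieved document #2."},
        "similarity_to_ground_truth": {"score": 0.87},
    },
}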
The following are the requirements for the Evaluation platform:
- Support a user-defined number of iterations for running the same ground truth against inference, as generated answers may differ each time.
- Store the ground truth, inference results output, and evaluation results output in Azure Blob Storage.
- Support running inference jobs at the same time as evaluation jobs (streaming).
- Ensure jobs are reliable when there are failures such as quota violations (e.g., rate limiting, back-off, retry).
- Allow different developers to run multiple evaluations at the same time.
- Allow developers to see the progress of their long-running evaluations.
Azure Machine Learning (AML)
Before we go over how we are leveraging Azure Machine Learning (AML) to meet the requirements, we should note that we did consider other alternatives, such as Azure AI Studio, but settled on AML because we do not plan to use Promptflow, a key capability of AI Studio. Additionally, we required a deeper level of customization of some of the dependent services, such as inference. It is worth mentioning here that the inference service is built in .NET, leverages the .NET version of Semantic Kernel, and is containerized. We will cover more on this in a later section.
AML consists of two high-level components: the AML workspace, which manages the metadata related to job runs, datastores, compute, etc., and the AML compute cluster, which contains the compute nodes that run the actual jobs. AML provides the Azure Machine Learning client library for Python, which we can use to submit jobs programmatically. As such, we wrote a Python script that is executed on the client machine and is configured to read a user-defined environment file that contains the required parameters for kicking off an experiment/AML job. We have custom Python scripts for each of the inference and evaluation steps that are executed on the compute nodes. These Python scripts are uploaded for each AML job run as part of kicking off the experiment.
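As a rough sketch (the workspace coordinates and the job object are placeholders, not the customer's actual values), the client-side submission using the Azure ML Python SDK v2 could look like the following; the pipeline definition itself is covered in the sections below.

from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

# Workspace coordinates would come from the user-defined environment file.
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE>",
)

# pipeline_job is the experiment definition (inference + evaluation steps).
submitted_job = ml_client.jobs.create_or_update(pipeline_job, experiment_name="rag-evaluation")
print(submitted_job.studio_url)  # link to follow progress in AML Studio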
When an AML job is kicked off, we can view the progress of the job in Azure Machine Learning Studio, which is the portal for AML. We can click on an individual experiment to view the AML job runs for that experiment. Within each job, we can review its steps. For a job that is currently running, we can see which step is currently running, and we can also review the progress of the job through the job_progress_overview.<YYYYMMDD>.txt file, which shows the current number of files processed and whether each was successful.
Parallel Job
For the inference step, we would like to take an Azure storage path and discover all the ground truth files under that path so they can be processed in parallel. In addition, we want to provide the number of iterations so the same ground truth file can be run more than once.
Parallel jobs allow us to pass in a datastore path as an input to the step. A datastore lets us define the storage account, container name, and path to the ground truth within the AML workspace. In the client-side script, we can define how files are loaded. For our use case, given we only have a few hundred ground truth files, we went with downloading the files to the compute target filesystem.
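As a sketch (the datastore and path names are illustrative, not our actual ones), the ground truth input can be declared against a registered datastore with download mode so the files land on the compute target filesystem:

from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# Illustrative datastore and path; the real values come from the user-defined environment file.
ground_truth_input = Input(
    type=AssetTypes.URI_FOLDER,
    path="azureml://datastores/ground_truth_datastore/paths/ground-truth/",
    mode=InputOutputModes.DOWNLOAD,  # copy the files to the compute node before run() is called
)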
The following is a code snippet of the inference Python script.
import argparse
import json
import os

def init():
    global OUTPUT_PATH, ITERATION_COUNT
    ...
    parser = argparse.ArgumentParser()
    parser.add_argument("--aml_output_path", type=str, default="")
    args, _ = parser.parse_known_args()
    OUTPUT_PATH = args.aml_output_path
    ...
    ITERATION_COUNT = int(os.getenv("iteration_count"))

def run(mini_batch):
    ...
    for entry in mini_batch:
        ...
        # entry is the local path to a ground truth file downloaded by AML
        with open(entry, "r") as rf:
            ...
        for i in range(ITERATION_COUNT):
            output_ground_truth_file_path = f"{OUTPUT_PATH}/{output_ground_truth_file_name}"
            # run inference and get inf_result
            ...
            # persist the inference result so AML uploads it with the step output
            with open(output_ground_truth_file_path, "w") as wf:
                json.dump(inf_result, wf)
    ...
With a parallel job, there is a defined interface we need to implement – init and run(mini_batch). init is called once when the process is instantiated. This is useful for defining variables that are used throughout the lifetime of the process. Each entry in mini_batch contains the local path to a ground truth file; the number of entries in mini_batch depends on the mini_batch_size. The path in entry is a local path because, as we noted earlier, we opted to use the download capability to download all files. For output, we opted for the upload capability: the OUTPUT_PATH is passed in via the aml_output_path argument, and we persist each inference result to OUTPUT_PATH, which is then uploaded to the appropriate storage location.
There are two other parameters we should discuss: instance_count and max_concurrency_per_instance. Before that, a quick note: when defining the compute cluster in AML, we can specify the minimum and maximum number of nodes, as well as the cool-down period in minutes before a node is deprovisioned. The instance_count should not exceed the maximum number of nodes.
For the sake of simplicity, if we assume 100 ground truth pairs to process and configure instance_count=2 and max_concurrency_per_instance=2, we have 4 processes in total running in parallel (2 nodes with 2 processes each), and each process handles a batch of 25 ground truth pairs. Within each process, the files are handed to our script in batches through the mini_batch variable.
When considering multiple users running at the same time, we designed with the expectation of supporting at least three users running experiments concurrently. This is where instance_count matters: if the default instance count is 2, then the maximum number of nodes should be set to at least 6 when defining the compute cluster in AML.
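As a minimal sketch of how such a cluster could be defined with the SDK (the cluster name and VM size are illustrative, and ml_client is the client from the earlier submission sketch):

from azure.ai.ml.entities import AmlCompute

# Illustrative cluster definition: scale from 0 up to 6 nodes so that three users
# can each run a job with instance_count=2 at the same time.
cluster = AmlCompute(
    name="evaluation-cluster",
    size="Standard_DS3_v2",
    min_instances=0,
    max_instances=6,
    idle_time_before_scale_down=1800,  # idle time (seconds) before a node is deprovisioned
)
ml_client.compute.begin_create_or_update(cluster)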
For our inference use case, we have kept mini_batch_size at 1, which means each batch has only one entry and run is executed 25 times per process. This design becomes useful when we consider retry for each ground truth file later. instance_count and max_concurrency_per_instance are configurable parameters, which gives us the flexibility to increase parallelism as needed.
The ITERATION_COUNT is passed in as an environment variable. Parallel job exposes an environment_variables setting where we can pass in a dictionary. This allows us to perform a user-defined number of inference runs with the same ground truth question, potentially producing a different answer in each iteration.
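Putting these settings together, the following is a hedged sketch of how the inference parallel step could be declared with parallel_run_function from the SDK v2; the script folder, entry script, environment name, and values are illustrative rather than our exact configuration.

from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.ai.ml.parallel import parallel_run_function, RunFunction

# Illustrative declaration of the inference parallel step.
inference_step = parallel_run_function(
    name="inference",
    input_data="${{inputs.ground_truth}}",
    inputs={
        "ground_truth": Input(type=AssetTypes.URI_FOLDER, mode=InputOutputModes.DOWNLOAD),
    },
    outputs={
        "inference_results": Output(type=AssetTypes.URI_FOLDER, mode=InputOutputModes.UPLOAD),
    },
    instance_count=2,                # number of compute nodes
    max_concurrency_per_instance=2,  # processes per node
    mini_batch_size="1",             # one ground truth file per run() call
    retry_settings={"max_retries": 3, "timeout": 600},
    environment_variables={"iteration_count": "3"},  # read via os.getenv in init()
    task=RunFunction(
        code="./src",
        entry_script="inference.py",
        program_arguments="--aml_output_path ${{outputs.inference_results}}",
        environment="<environment-name>@latest",
    ),
)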
In summary, we would like to call out the following benefits of Parallel job for our use case:
- We are abstracted away from Azure Blob Storage and do not have to concern ourselves with how to connect to, download, or upload files. We simply need to read and write files at the given local paths. This reduces complexity, since writing code to connect to Azure Blob Storage would require additional work.
- We are also abstracted away from having to write any code related to processing files in parallel, both at the compute level (number of instances to spin up and down) and at the code level (number of processes to spin up and manage). The settings for tuning parallelism are all configuration based.
Retry
The bottleneck for inference is the language model we are using. Azure OpenAI models have token and rate limits across deployments within the Azure subscription. If the limits are breached, we will get 429 response codes until the quota window resets. The issue is further exacerbated when multiple users are running experiments at the same time. This may be a case for setting instance_count and max_concurrency_per_instance lower before kicking off an experiment, but that means there needs to be either a manual check or some programmatic way to check for running inference jobs to determine whether too many are running.
This is where retry comes into play. AML supports retry through retry_settings, but it is limited to a simple number of retries. In the .NET inference service, we apply retry with Polly on a per-request basis. Another approach implemented in the .NET inference service is to round-robin requests between regional OpenAI model deployments to spread out the load, which helps minimize hitting the limits. We should note that retry by itself is only useful in a limited setting, because in the case of 429s we should back off by respecting the backoff time that is part of the 429 response.
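As a simple illustration of that idea (a sketch in Python, not the Polly policy used in the .NET service; the function and parameter names are ours), a retry loop that honors the Retry-After header could look like this:

import time
import requests  # assuming the call to the model goes over HTTP

def call_with_backoff(url, payload, max_retries=5):
    """Retry on 429 responses, honoring the Retry-After header when present."""
    for attempt in range(max_retries):
        response = requests.post(url, json=payload)
        if response.status_code != 429:
            return response
        # Respect the server-provided backoff; otherwise fall back to exponential backoff.
        retry_after = response.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else 2 ** attempt
        time.sleep(delay)
    raise RuntimeError("Exhausted retries while being rate limited (429)")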
There is also the approach of rate limiting, where inference service instances coordinate with each other to limit requests to the OpenAI model deployments. For example, when 429s are seen and there are 8 inference processes running, perhaps only 4 of them are allowed to send requests to the OpenAI model deployment at any moment. We did not implement this, as the load is still manageable with the existing retry, but we list it here for additional consideration.
That said, 429s are now less of an issue since we migrated from GPT-3.5 to GPT-4o mini, which uses a global standard deployment and provides much higher token and rate limits, as well as other important benefits such as better accuracy.
Streaming
From a sequential perspective, the evaluation step takes in the inference results produced by the inference step and produces evaluation results. However, we can imagine the inference step constantly producing inference results, with the evaluation step consuming them on demand and processing each inference result file as soon as it is created. This parallel processing allows for faster overall processing than waiting for inference to finish before kicking off evaluation. Unfortunately, AML does not support streaming between steps.
For our use case, it was acceptable to drop this requirement, given that running an experiment ended up taking less than an hour, which was acceptable to the customer.
Other considerations
Although not part of the requirements, there are important design and implementation details we should cover for the Evaluation platform.
AML Custom image
When kicking off a job, we can provide an AML container image as well as a Conda file with the Python dependencies for our scripts using the Environment parameter. When an AML job runs, it takes the AML container image as a base and creates a job-specific runtime image that contains our Python dependencies. AML also allows us to pass in our own custom image as a base.
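As a sketch (the environment name, image tag, and Conda file path are illustrative), registering such an environment with the SDK could look like this, with ml_client being the client from the earlier submission sketch:

from azure.ai.ml.entities import Environment

# Illustrative environment built from our custom base image plus a Conda file
# containing the Python dependencies of the inference and evaluation scripts.
environment = Environment(
    name="rag-evaluation-env",
    image="<ACR_NAME>.azurecr.io/aml-inference-base:latest",
    conda_file="./environment/conda.yaml",
)
ml_client.environments.create_or_update(environment)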
From a security perspective, calling out to the .NET inference service would require us to add authentication logic to the inference Python script, even if the service is hosted privately in the same network. As such, we opted to create a custom AML image in which we enable .NET and copy the .NET bits from the existing .NET inference container image. This setup allowed us to bypass the need for authentication, as the service now runs inside the container itself and is accessed via localhost:port. There is also no need to account for network-related issues in connecting to the service, as it is not external to the container. There are still authentication requirements for the .NET inference service to connect to external services such as Azure OpenAI deployments, but through the managed identity on the node we can configure access to those services using appropriate role-based access control (RBAC). For use cases that require actual keys/secrets, we can leverage RBAC on Azure Key Vault to get the secrets.
The following is an example of the Dockerfile we created.
# Description: Dockerfile for building the inference image with the .NET runtime and azureml
# Stage 1: Get the inference app from a custom image
FROM <ACR_NAME>.azurecr.io/inference:latest AS inference
# Stage 2: Base image for the final stage
FROM mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu22.04
# Install the ASP.NET Core 8.0 runtime using the dotnet-install script
RUN curl -sSL https://dot.net/v1/dotnet-install.sh | bash /dev/stdin --channel 8.0 --runtime aspnetcore --install-dir /usr/share/dotnet
# Set the path to include the .NET runtime
ENV PATH="${PATH}:/usr/share/dotnet"
WORKDIR /inference
COPY --from=inference /app .
Resume from failed evaluation step
When the evaluation step fails but the previous inference step was successful, it would be wasteful to have to start over from the inference step. Having the ability to resume from a step saves both time and cost.
In our design, each AML job run produces a unique identifier. We selected a timestamp as the identifier (job Id), as a GUID would be too long; a timestamp also helps identify when the job was executed, and we found numbers are easier to communicate to others when discussing an experiment run. The job Id is then used in the output path of the job for all steps. As an example, the following represents the output path of inference results for a given job Id: yyyyMMddHHmmss/inference-results/. With this in place, if we need to resume from a previous job run, the user simply passes in the job Id from that run, and the client-side script can infer the path from it. This path is then used as the input to the next step of the job, which is evaluation; essentially, we run a job with only the evaluation step.
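A minimal sketch of the client-side logic (the function names and sample values are illustrative):

from datetime import datetime, timezone

def new_job_id() -> str:
    # Timestamp-based job Id, e.g. "20240615143005".
    return datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")

def inference_results_path(job_id: str) -> str:
    # Inference results are written under the job Id; evaluation reads from here.
    return f"{job_id}/inference-results/"

# Resuming: the user supplies the job Id of a previous run and we infer the path,
# which becomes the input of the evaluation-only job.
evaluation_input_path = inference_results_path("20240614101500")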
There is potential for timestamp collisions if many users run experiments at the same time. For our use case of supporting three users, this is unlikely and works out fine for us.
Using a separate storage account for job runs
There is already an existing storage account associated with the AML workspace that can be used to store ground truths and job run outputs for both inference and evaluation results. However, we noted that the AML workspace creates artifacts for each job run at the container level. If we were using Azure Storage Explorer or the Azure Portal to find the container for ground truths, or the container that stores the outputs of our jobs, it would not be easy to locate as the number of job runs grows. As such, we created a separate storage account to store ground truths and job inference/evaluation results. This made it easier to find the right container and then filter on the path to find the appropriate ground truth files or job inference/evaluation results.
MLFlow
MLFlow is built into AML jobs and allows us to produce logs, metrics, parameters, etc. One use we had for MLFlow is to capture the parameters used for the experiment run. For example, we pass in the name of the search index, the search mode, top_k, the prompt temperature, etc. All of these are captured as part of the job step using the log_param method. When we navigate to the job and drill down to the step in the AML Studio portal, we can review those parameters.
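For illustration, logging these parameters from inside a step script is a one-liner per parameter (the parameter names and values here are examples, not the full set we capture):

import mlflow

# Capture the experiment configuration on the job step so it is visible in AML Studio.
mlflow.log_param("search_index", "<index-name>")
mlflow.log_param("search_mode", "hybrid")
mlflow.log_param("top_k", 5)
mlflow.log_param("temperature", 0.0)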
Summary
Although Azure Machine Learning is known for training and deploying models, we have found it useful for running experiments as well: except for streaming, we were able to meet all the requirements for the Evaluation platform on AML. The Parallel job capability played a big part here, although it has gaps in retry, some of which we were able to overcome at the .NET inference level. We also discussed other useful considerations beyond the basic requirements. We hope you find this article useful.
Note: the picture that illustrates this article has been generated by AI on Bing Image Creator.