TL;DR
Learn best practices for evaluating and optimizing the RAG quality of your agent using evaluations in Foundry Observability. This tutorial demonstrates two best practices to apply before deploying your RAG agent:
1. Evaluate and optimize your RAG agent end to end using the reference-free RAG triad evaluators: the Groundedness and Relevance evaluators.
2. For advanced search use cases that require ground truth and more precise measurement of retrieval quality, optimize your search parameters using golden metrics such as XDCG and Max Relevance with the Document Retrieval evaluator.
Framing Agent Observability
Agents can be powerful productivity enablers: they increasingly understand business context and can plan, make decisions, execute actions, and interact with human stakeholders or other agents to build more complex workflows for business needs. For example, RAG agents can use enterprise documents to ground their responses for relevance. However, the black-box nature of agents presents significant challenges to developers who build and observe them. Developers need tools to assess the quality and safety of agent workflows.
[Demo] How to debug your RAG agent in Microsoft Foundry – framing agent observability
This post will focus on the best practices in RAG quality in agent workflows.
Best Practice 1: Evaluate your RAG end to end
Complex queries are a common use case for RAG agents, and agentic RAG outperforms traditional RAG on them in principle and in practice. You can now use the agentic retrieval API (also known as Knowledge Agent), whose agentic retrieval engine has been shown to deliver up to 40% better relevance than traditional RAG for complex queries.
At a high level, a retrieval-augmented generation (RAG) system tries to generate the most relevant answer, consistent with the grounding documents, in response to a user's query: the query triggers a search retrieval over the corpus of grounding documents, and the retrieved context grounds the AI model's response. It's important to evaluate the following aspects using our evaluators (aka the "RAG triad" metrics):
- Retrieval: Is the search output relevant and useful for resolving the user's query? This measures how relevant the retrieval results are to the query.
- Groundedness: Is the response supported by the grounding source (e.g., the output of a search tool)? This measures how consistent the generated response is with the grounding documents.
- Relevance: After agentic RAG retrieval, is the final response relevant to the user's query? This directly affects the end customer's satisfaction with the RAG application.
In this best practice, we focus on evaluating the end-to-end response of the Knowledge Agent using the Groundedness and Relevance evaluators.
What can Knowledge Agent do? Knowledge Agent is an advanced agentic retrieval pipeline designed to extract grounding information from your knowledge sources. Given the conversation history and retrieval parameters, the agent:
- Analyzes the entire conversation to infer the user’s information need.
- Decomposes compound queries into focused subqueries.
- Executes subqueries concurrently against the configured knowledge sources.
- Uses the semantic ranker to re-rank and filter results.
- Merges and synthesizes the top results into a unified output.
from azure.search.documents.agent import KnowledgeAgentRetrievalClient
from azure.search.documents.agent.models import (
  KnowledgeAgentRetrievalRequest,
  KnowledgeAgentMessage,
  KnowledgeAgentMessageTextContent,
  SearchIndexKnowledgeSourceParams,
)

# Test a knowledge-intensive query or two.
# endpoint, agent_name, credential, knowledge_source_name, and api_version are
# assumed to be configured earlier in the notebook.
agent_client = KnowledgeAgentRetrievalClient(endpoint=endpoint, agent_name=agent_name, credential=credential)

messages = []  # conversation history passed to the Knowledge Agent
query_1 = """
  Why do suburban belts display larger December brightening than urban cores even though absolute light levels are higher downtown?
  Why is the Phoenix nighttime street grid so sharply visible from space, whereas large stretches of the interstate between midwestern cities remain comparatively dim?
  """
messages.append({
  "role": "user",
  "content": query_1
})

# Build the retrieval request from the conversation history (excluding system messages)
retrieval_request = KnowledgeAgentRetrievalRequest(
  messages=[
    KnowledgeAgentMessage(
      role=m["role"],
      content=[KnowledgeAgentMessageTextContent(text=m["content"])]
    ) for m in messages if m["role"] != "system"
  ],
  knowledge_source_params=[
    SearchIndexKnowledgeSourceParams(
      knowledge_source_name=knowledge_source_name,
      kind="searchIndex"
    )
  ]
)

result = agent_client.retrieve(retrieval_request=retrieval_request, api_version=api_version)
print("Retrieve successful")
import json

# Build a simple string for response_content by concatenating all text/value
# fields from the response contents. (The activity and references parts can be
# serialized similarly; see the sketch after the bullets below.)
response_parts = []
if getattr(result, "response", None):
  for resp in result.response:
    for content in getattr(resp, "content", []):
      text = getattr(content, "text", None) or getattr(content, "value", None) or str(content)
      response_parts.append(text)
response_content = "\n\n".join(response_parts) if response_parts else "No response found on 'result'"

# Accumulate per-query responses for the evaluation step
# (response_contents is assumed to be initialized as an empty list before the first query)
response_contents.append(response_content)

# Print the response string
print("response_content:\n", response_content, "\n")
The retrieval result exposes three parts:
- response — a natural-language, synthesized answer (returned in response.content).
- activity — detailed planning and execution information (shows subqueries, reranking decisions, and intermediate steps).
- references — the source documents and chunks that contributed to the answer (present when include_references and include_reference_source_data are enabled).
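If you also want to inspect the planning steps or keep the grounding chunks for evaluation, the activity and references parts can be serialized in the same spirit as the response above. The following is a minimal sketch, assuming the same `result` object and that the SDK models expose `as_dict()` (an assumption; adjust to your SDK version):
import json

# Serialize the planning/execution activity for debugging
activity_content = json.dumps(
  [a.as_dict() if hasattr(a, "as_dict") else str(a) for a in (getattr(result, "activity", None) or [])],
  indent=2,
)

# Serialize the cited source chunks; these become the grounding "context" in the evaluation step
references_content = json.dumps(
  [r.as_dict() if hasattr(r, "as_dict") else str(r) for r in (getattr(result, "references", None) or [])],
  indent=2,
)

# references_contents is assumed to be initialized as an empty list before the
# first query, mirroring response_contents above
references_contents.append(references_content)
print("references_content:\n", references_content[:500], "\n")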
Run Evaluation with Microsoft Foundry
import os, json
from azure.ai.evaluation import (
  AzureOpenAIModelConfiguration,
  GroundednessEvaluator,
  RelevanceEvaluator,
  RetrievalEvaluator,
  evaluate,
)

ai_foundry_project_endpoint = os.environ.get("AI_FOUNDRY_PROJECT_ENDPOINT")
# query_2, references_contents, and response_contents are assumed to have been
# built the same way as query_1 and response_content above, one entry per test query
evaluation_data = []
for q, r, g in zip([query_1, query_2], references_contents, response_contents):
  evaluation_data.append({
    "query": q,
    "response": g,
    "context": r,
  })
filename = "evaluation_data.jsonl"
with open(filename, "w") as f:
  for item in evaluation_data:
    f.write(json.dumps(item) + "\n")
# azure_openai_endpoint, azure_openai_api_version, and azure_openai_gpt_model are
# assumed to be set from your environment or earlier in the notebook
model_config = AzureOpenAIModelConfiguration(
  azure_endpoint=azure_openai_endpoint,
  api_version=azure_openai_api_version,
  azure_deployment=azure_openai_gpt_model
)
# RAG triad metrics
groundedness = GroundednessEvaluator(model_config=model_config)
relevance = RelevanceEvaluator(model_config=model_config)
result = evaluate(
  data=filename,
  evaluators={
    "groundedness": groundedness,
    "relevance": relevance,
  },
  azure_ai_project=ai_foundry_project_endpoint,
)
print("Evaluation complete.")
studio_url = result.get("studio_url")
if studio_url:
  print("AI Foundry Studio URL:", studio_url)
Best Practice 2: Optimize your RAG search parameters
Document retrieval quality is a common bottleneck in RAG search. One best practice is to optimize your RAG search parameters for your enterprise data. For advanced users who can curate ground truth for document retrieval results, i.e., query relevance labels ("qrels" for short), an important scenario is to "sweep" the parameters and optimize them by evaluating document retrieval quality with golden metrics such as XDCG and Max Relevance.
| Metric | Higher is better | Description |
|---|---|---|
| Fidelity | Yes | How well the top n retrieved chunks reflect the content for a given query; the number of good documents returned out of the total number of known good documents in the dataset. |
| NDCG | Yes | How close the ranking is to an ideal order in which all relevant items appear at the top of the list. |
| XDCG | Yes | How good the results are within the top-k documents, regardless of the scores of other documents in the index. |
| Max Relevance N | Yes | The maximum relevance in the top-k chunks. |
| Holes | No | The number of documents with missing query relevance judgments (ground truth). |
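To build intuition for the ranking metrics above, here is a minimal NDCG@k sketch using the standard logarithmic discount; the exact gain and discount functions used by the Document Retrieval evaluator may differ, so treat this as illustrative only.
import math

def dcg(relevance_labels, k):
  # Discounted cumulative gain over the top-k results (rank 1 gets discount log2(2) = 1)
  return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance_labels[:k]))

def ndcg(ranked_labels, k):
  # Normalize by the DCG of the ideal (descending) ordering of the same labels
  ideal_dcg = dcg(sorted(ranked_labels, reverse=True), k)
  return dcg(ranked_labels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance labels (0-5 scale) of the retrieved documents, in retrieved order
print(ndcg([3, 4, 2, 3, 0], k=3))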
[Demo] How to debug your RAG agent in Microsoft Foundry – optimize your RAG
First, prepare some test queries and generate retrieval results with relevance scores using Azure AI Search. Then, for each search query, label the relevance of each search result, typically with a human judge such as a subject matter expert. The other approach is to employ an LLM judge; for example, we can reuse the Relevance evaluator mentioned above to score each text chunk, as sketched below.
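Here is a minimal sketch of the LLM-judge approach, assuming the `model_config` from earlier and a hypothetical `chunks` dictionary mapping document ids to chunk text for one `search_query` (the result key name `relevance` reflects the current azure-ai-evaluation output and may differ by version):
from azure.ai.evaluation import RelevanceEvaluator

relevance_judge = RelevanceEvaluator(model_config=model_config)

llm_judged_labels = []
for doc_id, chunk_text in chunks.items():
  # Score how relevant this chunk is to the query; the evaluator returns a 1-5 score
  judged = relevance_judge(query=search_query, response=chunk_text)
  llm_judged_labels.append({
    "document_id": doc_id,
    "query_relevance_label": int(judged["relevance"]),
  })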
# For each search query, say 5 document ids are retrieved with relevance scores
retrieved_documents = [
  {"document_id": "2", "relevance_score": 0.70},
  {"document_id": "1", "relevance_score": 0.65},
  {"document_id": "5", "relevance_score": 0.6},
  {"document_id": "6", "relevance_score": 0.5},
  {"document_id": "7", "relevance_score": 0.25}
]

# These labels can come from a human judge or an LLM judge;
# let's say we use a 0-5 scale for rating
retrieval_ground_truth = [
  {"document_id": "1", "query_relevance_label": 4},
  {"document_id": "2", "query_relevance_label": 3},
  {"document_id": "3", "query_relevance_label": 3},
  {"document_id": "4", "query_relevance_label": 3},
  {"document_id": "5", "query_relevance_label": 2}
]
from azure.ai.evaluation import DocumentRetrievalEvaluator
import json

doc_retrieval_eval = DocumentRetrievalEvaluator(
  # inform the evaluator about your ground truth rating scale
  ground_truth_label_min=0,
  ground_truth_label_max=5,
  # optionally, customize your pass/fail thresholds for your data
  ndcg_threshold=0.5,
  xdcg_threshold=50.0,
  fidelity_threshold=0.5,
  top1_relevance_threshold=50.0,
  top3_max_relevance_threshold=50.0,
  total_retrieved_documents_threshold=50,
  total_ground_truth_documents_threshold=50
)

results = doc_retrieval_eval(retrieval_ground_truth=retrieval_ground_truth, retrieved_documents=retrieved_documents)
print(json.dumps(results, indent=4))
The results show the quality of one instance of a search retrieval:
{'ndcg@3': 0.31075932533963707, 'xdcg@3': 39.285714285714285, 'fidelity': 0.39285714285714285, 'top1_relevance': 2, 'top3_max_relevance': 3, 'holes': 2, 'holes_ratio': 0.4, 'total_retrieved_documents': 5, 'total_ground_truth_documents': 5, ..., 'ndcg@3_result': 'fail', 'ndcg@3_threshold': 0.5, 'ndcg@3_higher_is_better': True, 'xdcg@3_result': 'fail', 'xdcg@3_threshold': 50.0, 'xdcg@3_higher_is_better': True, 'fidelity_result': 'fail', 'fidelity_threshold': 0.5 }
The metric name reports the numerical score, and the `{metric_name}_result` field returns pass/fail based on the threshold, which you can override. This is helpful when comparing multiple evaluation results across different sets of parameters. For illustration purposes, we prepare four datasets corresponding to different search algorithms (text, semantic, vector, and hybrid search) and submit a batch evaluation to compare their quality. Other parameters worth tuning include `top_k`, `chunk_size`, and the document overlap size used when you create the index. In general, any search engine exposes multiple parameter settings (as long as it returns relevance scores) that you may want to fine-tune for your enterprise dataset.
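Each line of these .jsonl files is assumed to carry the same two fields the evaluator expects, one row per query. A minimal sketch of writing such a row follows; the field names mirror the evaluator inputs above and may differ from the sample notebook:
import json

row = {
  "retrieval_ground_truth": retrieval_ground_truth,
  "retrieved_documents": retrieved_documents,
}
with open("evaluate-text.jsonl", "a") as f:
  f.write(json.dumps(row) + "\n")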
import os
import pandas as pd
from azure.ai.evaluation import DocumentRetrievalEvaluator, evaluate

azure_ai_project = os.getenv("PROJECT_ENDPOINT")

results = {}
for config_name, sample_data_file_name in [
  ("Text Search", "evaluate-text.jsonl"),
  ("Semantic Search", "evaluate-semantic.jsonl"),
  ("Vector Search", "evaluate-vector.jsonl"),
  ("Hybrid Search", "evaluate-hybrid.jsonl")
]:
  doc_retrieval_eval = DocumentRetrievalEvaluator()

  # optional: preview the dataset for this run
  data = pd.read_json(os.path.join(".", sample_data_file_name), lines=True)

  response = evaluate(
    data=sample_data_file_name,
    evaluation_name=f"Doc retrieval eval demo - {config_name} run",
    evaluators={
      "DocumentRetrievalEvaluator": doc_retrieval_eval,
    },
    azure_ai_project=azure_ai_project,
  )
  results[config_name] = response
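You can also compare the runs in code before opening the UI. This is a minimal sketch, assuming each response exposes a `metrics` dictionary whose aggregated metric names contain the evaluator's metric keys (both assumptions about the SDK output shape):
for config_name, response in results.items():
  metrics = response.get("metrics", {})
  ranking_metrics = {k: v for k, v in metrics.items() if "ndcg" in k or "xdcg" in k}
  print(config_name, ranking_metrics)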
This generates multiple evaluation runs in Microsoft Foundry, one per search parameter setting, in this case one per search algorithm. Use the rich visualization for Document Retrieval in the Foundry Evaluation UI (a fast-follow item available within 30 days) to find the optimal search parameter as follows:
1. Select the multiple parameter runs on Foundry and select “Compare”:

2. View the tabular results for all evaluation runs:

3. Find the best parameter for the charts per metric (`xdcg@3` for example):

Having found your optimal parameter for Azure AI Search, you can then confidently plug it back into your RAG agent, now attuned to your enterprise data. Note that this evaluator works for any search engine that returns relevance ranking scores, including LlamaIndex. Follow this notebook for an end-to-end example: https://aka.ms/doc-retrieval-sample.
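For a search engine other than Azure AI Search, the only requirement is to map its hits into the `retrieved_documents` shape used above. A minimal sketch, where `hits` and its `id`/`score` fields are illustrative placeholders for your engine's output:
retrieved_documents = [
  {"document_id": str(hit["id"]), "relevance_score": float(hit["score"])}
  for hit in hits
]
results = doc_retrieval_eval(
  retrieval_ground_truth=retrieval_ground_truth,
  retrieved_documents=retrieved_documents,
)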
Related links:
- Unlocking the Power of Agentic Applications: New Evaluation Metrics for Quality and Safety | Azure AI Foundry Blog
- Achieve End-to-End Observability in Azure AI Foundry | Azure AI Foundry Blog
- Bonus Journey: Agentic RAG – Combining Agents with Retrieval-Augmented Generation
