TL;DR
Learn best practices for evaluating and optimizing the RAG quality of your agent using evaluations in Foundry Observability. This tutorial demonstrates two best practices to apply before deploying your RAG agent:
1. Evaluate and optimize your RAG agent end to end using the reference-free RAG triad evaluators: the Groundedness and Relevance evaluators.
2. For advanced search use cases that require ground truth and more precise measurement of retrieval quality, optimize your search parameters using golden metrics such as XDCG and Max Relevance with the Document Retrieval evaluator.
Framing Agent Observability
Agents can be powerful productivity enablers: they increasingly understand business context and can plan, make decisions, execute actions, and interact with human stakeholders or other agents to build more complex workflows for business needs. For example, RAG agents can use enterprise documents to ground their responses for relevance. However, the black-box nature of agents presents significant challenges to developers who build and observe them. Developers need tools to assess the quality and safety of agent workflows.
[Demo] How to debug your RAG agent in Microsoft Foundry – framing agent observability
This post will focus on the best practices in RAG quality in agent workflows.
Best Practice 1: Evaluate your RAG end to end
Complex queries are a common use case for RAG agents, and agentic RAG outperforms traditional RAG on them in principle and in practice. You can now use the agentic retrieval API (also known as Knowledge Agent), whose agentic retrieval engine has been shown to deliver up to 40% better relevance than traditional RAG for complex queries.
At a high level, a retrieval-augmented generation (RAG) system tries to generate the most relevant answer, consistent with the grounding documents, in response to a user's query: the query triggers a search retrieval over the corpus of grounding documents, and the retrieved context grounds the AI model's response. It's important to evaluate the following aspects using our evaluators (aka the "RAG triad" metrics):
- Retrieval: Is the search output relevant and useful for resolving the user's query? This measures how relevant the retrieval results are to the query.
- Groundedness: Is the response supported by the grounding source (e.g., the output of a search tool)? This measures how consistent the generated response is with the grounding documents.
- Relevance: After agentic RAG retrieval, is the final response relevant to the user's query? This directly affects the end customer's satisfaction with the RAG application.
In this best practice, we focus on evaluating the end-to-end response of the Knowledge Agent using the Groundedness and Relevance evaluators.
What can Knowledge Agent do? Knowledge Agent is an advanced agentic retrieval pipeline designed to extract grounding information from your knowledge sources. Given the conversation history and retrieval parameters, the agent:
- Analyzes the entire conversation to infer the user’s information need.
- Decomposes compound queries into focused subqueries.
- Executes subqueries concurrently against the configured knowledge sources.
- Uses the semantic ranker to re-rank and filter results.
- Merges and synthesizes the top results into a unified output.
from azure.search.documents.agent import KnowledgeAgentRetrievalClient
from azure.search.documents.agent.models import (
  KnowledgeAgentRetrievalRequest,
  KnowledgeAgentMessage,
  KnowledgeAgentMessageTextContent,
  SearchIndexKnowledgeSourceParams,
)

# Test a knowledge-intensive query or two.
# endpoint, agent_name, credential, knowledge_source_name, and api_version are
# assumed to be configured earlier in the notebook.
agent_client = KnowledgeAgentRetrievalClient(endpoint=endpoint, agent_name=agent_name, credential=credential)

messages = []  # conversation history passed to the Knowledge Agent
query_1 = """
  Why do suburban belts display larger December brightening than urban cores even though absolute light levels are higher downtown?
  Why is the Phoenix nighttime street grid so sharply visible from space, whereas large stretches of the interstate between midwestern cities remain comparatively dim?
  """
messages.append({
  "role": "user",
  "content": query_1
})

# Build the retrieval request from the conversation history (excluding system messages)
retrieval_request = KnowledgeAgentRetrievalRequest(
  messages=[
    KnowledgeAgentMessage(
      role=m["role"],
      content=[KnowledgeAgentMessageTextContent(text=m["content"])]
    ) for m in messages if m["role"] != "system"
  ],
  knowledge_source_params=[
    SearchIndexKnowledgeSourceParams(
      knowledge_source_name=knowledge_source_name,
      kind="searchIndex"
    )
  ]
)

result = agent_client.retrieve(retrieval_request=retrieval_request, api_version=api_version)
print("Retrieve successful")
import json

# Build a simple string for response_content by concatenating all text/value
# fields from the response contents. (The activity and references parts can be
# serialized similarly; see the sketch after the bullets below.)
response_parts = []
if getattr(result, "response", None):
  for resp in result.response:
    for content in getattr(resp, "content", []):
      text = getattr(content, "text", None) or getattr(content, "value", None) or str(content)
      response_parts.append(text)
response_content = "\n\n".join(response_parts) if response_parts else "No response found on 'result'"

# Accumulate per-query responses for the evaluation step
# (response_contents is assumed to be initialized as an empty list before the first query)
response_contents.append(response_content)

# Print the response string
print("response_content:\n", response_content, "\n")
The retrieval result exposes three parts:
- response — a natural-language, synthesized answer (returned in response.content).
- activity — detailed planning and execution information (shows subqueries, reranking decisions, and intermediate steps).
- references — the source documents and chunks that contributed to the answer (present when include_references and include_reference_source_data are enabled).
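If you also want to inspect the planning steps or keep the grounding chunks for evaluation, the activity and references parts can be serialized in the same spirit as the response above. The following is a minimal sketch, assuming the same `result` object and that the SDK models expose `as_dict()` (an assumption; adjust to your SDK version):
import json

# Serialize the planning/execution activity for debugging
activity_content = json.dumps(
  [a.as_dict() if hasattr(a, "as_dict") else str(a) for a in (getattr(result, "activity", None) or [])],
  indent=2,
)

# Serialize the cited source chunks; these become the grounding "context" in the evaluation step
references_content = json.dumps(
  [r.as_dict() if hasattr(r, "as_dict") else str(r) for r in (getattr(result, "references", None) or [])],
  indent=2,
)

# references_contents is assumed to be initialized as an empty list before the
# first query, mirroring response_contents above
references_contents.append(references_content)
print("references_content:\n", references_content[:500], "\n")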
Run Evaluation with Microsoft Foundry
import os, json
from azure.ai.evaluation import (
  AzureOpenAIModelConfiguration,
  GroundednessEvaluator,
  RelevanceEvaluator,
  RetrievalEvaluator,
  evaluate,
)

ai_foundry_project_endpoint = os.environ.get("AI_FOUNDRY_PROJECT_ENDPOINT")
# query_2, references_contents, and response_contents are assumed to have been
# built the same way as query_1 and response_content above, one entry per test query
evaluation_data = []
for q, r, g in zip([query_1, query_2], references_contents, response_contents):
  evaluation_data.append({
    "query": q,
    "response": g,
    "context": r,
  })
filename = "evaluation_data.jsonl"
with open(filename, "w") as f:
  for item in evaluation_data:
    f.write(json.dumps(item) + "\n")
# azure_openai_endpoint, azure_openai_api_version, and azure_openai_gpt_model are
# assumed to be set from your environment or earlier in the notebook
model_config = AzureOpenAIModelConfiguration(
  azure_endpoint=azure_openai_endpoint,
  api_version=azure_openai_api_version,
  azure_deployment=azure_openai_gpt_model
)
# RAG triad metrics
groundedness = GroundednessEvaluator(model_config=model_config)
relevance = RelevanceEvaluator(model_config=model_config)
result = evaluate(
  data=filename,
  evaluators={
    "groundedness": groundedness,
    "relevance": relevance,
  },
  azure_ai_project=ai_foundry_project_endpoint,
)
print("Evaluation complete.")
studio_url = result.get("studio_url")
if studio_url:
  print("AI Foundry Studio URL:", studio_url)
Best Practice 2: Optimize your RAG search parameters
Document retrieval quality is a common bottleneck in RAG search. One best practice is to optimize your RAG search parameters for your enterprise data. For advanced users who can curate ground truth for document retrieval results, i.e., query relevance labels ("qrels" for short), an important scenario is to "sweep" the parameters and optimize them by evaluating document retrieval quality with golden metrics such as XDCG and Max Relevance.
| Metric | Higher is better | Description |
|---|---|---|
| Fidelity | Yes | How well the top n retrieved chunks reflect the content for a given query; the number of good documents returned out of the total number of known good documents in the dataset. |
| NDCG | Yes | How close the ranking is to an ideal order in which all relevant items appear at the top of the list. |
| XDCG | Yes | How good the results are within the top-k documents, regardless of the scores of other documents in the index. |
| Max Relevance N | Yes | The maximum relevance in the top-k chunks. |
| Holes | No | The number of documents with missing query relevance judgments (ground truth). |
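To build intuition for the ranking metrics above, here is a minimal NDCG@k sketch using the standard logarithmic discount; the exact gain and discount functions used by the Document Retrieval evaluator may differ, so treat this as illustrative only.
import math

def dcg(relevance_labels, k):
  # Discounted cumulative gain over the top-k results (rank 1 gets discount log2(2) = 1)
  return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance_labels[:k]))

def ndcg(ranked_labels, k):
  # Normalize by the DCG of the ideal (descending) ordering of the same labels
  ideal_dcg = dcg(sorted(ranked_labels, reverse=True), k)
  return dcg(ranked_labels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance labels (0-5 scale) of the retrieved documents, in retrieved order
print(ndcg([3, 4, 2, 3, 0], k=3))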
[Demo] How to debug your RAG agent in Microsoft Foundry – optimize your RAG
First, prepare some test queries and generate retrieval results with relevance scores using Azure AI Search. Then, for each search query, label the relevance of each search result, typically with a human judge such as a subject matter expert. The other approach is to employ an LLM judge; for example, we can reuse the Relevance evaluator mentioned above to score each text chunk, as sketched below.
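Here is a minimal sketch of the LLM-judge approach, assuming the `model_config` from earlier and a hypothetical `chunks` dictionary mapping document ids to chunk text for one `search_query` (the result key name `relevance` reflects the current azure-ai-evaluation output and may differ by version):
from azure.ai.evaluation import RelevanceEvaluator

relevance_judge = RelevanceEvaluator(model_config=model_config)

llm_judged_labels = []
for doc_id, chunk_text in chunks.items():
  # Score how relevant this chunk is to the query; the evaluator returns a 1-5 score
  judged = relevance_judge(query=search_query, response=chunk_text)
  llm_judged_labels.append({
    "document_id": doc_id,
    "query_relevance_label": int(judged["relevance"]),
  })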
# For each search query, say 5 document ids are retrieved with relevance scores
retrieved_documents = [
  {"document_id": "2", "relevance_score": 0.70},
  {"document_id": "1", "relevance_score": 0.65},
  {"document_id": "5", "relevance_score": 0.6},
  {"document_id": "6", "relevance_score": 0.5},
  {"document_id": "7", "relevance_score": 0.25}
]

# These labels can come from a human judge or an LLM judge;
# let's say we use a 0-5 scale for rating
retrieval_ground_truth = [
  {"document_id": "1", "query_relevance_label": 4},
  {"document_id": "2", "query_relevance_label": 3},
  {"document_id": "3", "query_relevance_label": 3},
  {"document_id": "4", "query_relevance_label": 3},
  {"document_id": "5", "query_relevance_label": 2}
]
from azure.ai.evaluation import DocumentRetrievalEvaluator
import json

doc_retrieval_eval = DocumentRetrievalEvaluator(
  # inform the evaluator about your ground truth rating scale
  ground_truth_label_min=0,
  ground_truth_label_max=5,
  # optionally, customize your pass/fail thresholds for your data
  ndcg_threshold=0.5,
  xdcg_threshold=50.0,
  fidelity_threshold=0.5,
  top1_relevance_threshold=50.0,
  top3_max_relevance_threshold=50.0,
  total_retrieved_documents_threshold=50,
  total_ground_truth_documents_threshold=50
)

results = doc_retrieval_eval(retrieval_ground_truth=retrieval_ground_truth, retrieved_documents=retrieved_documents)
print(json.dumps(results, indent=4))
The results show the quality of one instance of a search retrieval:
{'ndcg@3': 0.31075932533963707, 'xdcg@3': 39.285714285714285, 'fidelity': 0.39285714285714285, 'top1_relevance': 2, 'top3_max_relevance': 3, 'holes': 2, 'holes_ratio': 0.4, 'total_retrieved_documents': 5, 'total_ground_truth_documents': 5, ..., 'ndcg@3_result': 'fail', 'ndcg@3_threshold': 0.5, 'ndcg@3_higher_is_better': True, 'xdcg@3_result': 'fail', 'xdcg@3_threshold': 50.0, 'xdcg@3_higher_is_better': True, 'fidelity_result': 'fail', 'fidelity_threshold': 0.5 }
The metric name reports the numerical score, and the `{metric_name}_result` field returns pass/fail based on the threshold, which you can override. This is helpful when comparing multiple evaluation results across different sets of parameters. For illustration purposes, we prepare four datasets corresponding to different search algorithms (text, semantic, vector, and hybrid search) and submit a batch evaluation to compare their quality. Other parameters worth tuning include `top_k`, `chunk_size`, and the document overlap size used when you create the index. In general, any search engine exposes multiple parameter settings (as long as it returns relevance scores) that you may want to fine-tune for your enterprise dataset.
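Each line of these .jsonl files is assumed to carry the same two fields the evaluator expects, one row per query. A minimal sketch of writing such a row follows; the field names mirror the evaluator inputs above and may differ from the sample notebook:
import json

row = {
  "retrieval_ground_truth": retrieval_ground_truth,
  "retrieved_documents": retrieved_documents,
}
with open("evaluate-text.jsonl", "a") as f:
  f.write(json.dumps(row) + "\n")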
import os
import pandas as pd
from azure.ai.evaluation import DocumentRetrievalEvaluator, evaluate

azure_ai_project = os.getenv("PROJECT_ENDPOINT")

results = {}
for config_name, sample_data_file_name in [
  ("Text Search", "evaluate-text.jsonl"),
  ("Semantic Search", "evaluate-semantic.jsonl"),
  ("Vector Search", "evaluate-vector.jsonl"),
  ("Hybrid Search", "evaluate-hybrid.jsonl")
]:
  doc_retrieval_eval = DocumentRetrievalEvaluator()

  # optional: preview the dataset for this run
  data = pd.read_json(os.path.join(".", sample_data_file_name), lines=True)

  response = evaluate(
    data=sample_data_file_name,
    evaluation_name=f"Doc retrieval eval demo - {config_name} run",
    evaluators={
      "DocumentRetrievalEvaluator": doc_retrieval_eval,
    },
    azure_ai_project=azure_ai_project,
  )
  results[config_name] = response
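You can also compare the runs in code before opening the UI. This is a minimal sketch, assuming each response exposes a `metrics` dictionary whose aggregated metric names contain the evaluator's metric keys (both assumptions about the SDK output shape):
for config_name, response in results.items():
  metrics = response.get("metrics", {})
  ranking_metrics = {k: v for k, v in metrics.items() if "ndcg" in k or "xdcg" in k}
  print(config_name, ranking_metrics)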
This generates multiple evaluation runs in Microsoft Foundry, one per search parameter setting, in this case one per search algorithm. Use the rich visualization for Document Retrieval in the Foundry Evaluation UI (a fast-follow item available within 30 days) to find the optimal search parameter as follows:
1. Select the multiple parameter runs on Foundry and select “Compare”:

2. View the tabular results for all evaluation runs:

3. Find the best parameter for the charts per metric (`xdcg@3` for example):

Having found your optimal parameter for Azure AI Search, you can then confidently plug it back into your RAG agent, now attuned to your enterprise data. Note that this evaluator works for any search engine that returns relevance ranking scores, including LlamaIndex. Follow this notebook for an end-to-end example: https://aka.ms/doc-retrieval-sample.
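For a search engine other than Azure AI Search, the only requirement is to map its hits into the `retrieved_documents` shape used above. A minimal sketch, where `hits` and its `id`/`score` fields are illustrative placeholders for your engine's output:
retrieved_documents = [
  {"document_id": str(hit["id"]), "relevance_score": float(hit["score"])}
  for hit in hits
]
results = doc_retrieval_eval(
  retrieval_ground_truth=retrieval_ground_truth,
  retrieved_documents=retrieved_documents,
)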
Related links:
- Unlocking the Power of Agentic Applications: New Evaluation Metrics for Quality and Safety | Azure AI Foundry Blog
- Achieve End-to-End Observability in Azure AI Foundry | Azure AI Foundry Blog
- Bonus Journey: Agentic RAG – Combining Agents with Retrieval-Augmented Generation
