{"id":1118,"date":"2025-11-20T08:00:28","date_gmt":"2025-11-20T16:00:28","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/foundry\/?p=1118"},"modified":"2025-11-19T17:21:27","modified_gmt":"2025-11-20T01:21:27","slug":"how-to-debug-and-optimize-rag-agents-in-azure-ai-foundry","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/foundry\/how-to-debug-and-optimize-rag-agents-in-azure-ai-foundry\/","title":{"rendered":"How to debug and optimize RAG agents in Microsoft Foundry"},"content":{"rendered":"<h3>TL;DR<\/h3>\n<p>Learn best practices for evaluating and optimizing the RAG quality of your agent using evaluations in Foundry Observability. This tutorial demonstrates two best practices to apply before deploying your RAG agent:<\/p>\n<p>1. Evaluate and optimize your RAG agent end to end using the reference-free RAG triad evaluators: <a href=\"https:\/\/aka.ms\/groundedness-doc\">Groundedness<\/a> and <a href=\"https:\/\/aka.ms\/relevance-doc\">Relevance<\/a>;<\/p>\n<p>2. For advanced search use cases requiring ground truths and more precise measurement of retrieval quality, optimize your search parameters using golden metrics such as XDCG and Max Relevance with the <a href=\"https:\/\/aka.ms\/doc-retrieval-evaluator\">Document Retrieval<\/a> evaluator.<\/p>\n<h3>Framing Agent Observability<\/h3>\n<p>Agents can be powerful productivity enablers: they can increasingly understand business context and plan, make decisions, execute actions, or interact with human stakeholders or other agents to create complex workflows for business needs. For example, RAG agents can use enterprise documents to ground their responses for relevance. However, the black-box nature of agents presents significant challenges to developers building and observing them. 
Developers need tools to assess the quality and safety of agent workflows.<\/p>\n<p><a href=\"https:\/\/microsoft-my.sharepoint.com\/:v:\/p\/changliu2\/ERkUvBz4F39GggC5UlPn8REBdsklTEY1w_TCZK_xLNKjCA?e=Y7kuMW\">[Demo] How to debug your RAG agent in Microsoft Foundry &#8211; framing agent observability<\/a><\/p>\n<p>This post focuses on best practices for RAG quality in agent workflows.<\/p>\n<h3>Best Practice 1: Evaluate your RAG end to end<\/h3>\n<p>For the complex queries that are a common use case for RAG agents, we know that <a href=\"https:\/\/techcommunity.microsoft.com\/blog\/azure-ai-services-blog\/bonus-rag-time-journey-agentic-rag\/4404652\">agentic RAG<\/a> is better than traditional RAG in principle and in practice. You can now use the <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/search\/search-agentic-retrieval-how-to-create\">agentic retrieval API<\/a> (aka Knowledge Agent), which has been shown to deliver <a href=\"https:\/\/techcommunity.microsoft.com\/blog\/azure-ai-foundry-blog\/up-to-40-better-relevance-for-complex-queries-with-new-agentic-retrieval-engine\/4413832\">up to 40% better relevance for complex queries<\/a> than traditional RAG.<\/p>\n<p>At a high level, a retrieval-augmented generation (RAG) system tries to generate the most relevant answer, consistent with grounding documents, in response to a user&#8217;s query: the query triggers a search retrieval over the corpus of grounding documents, which provides grounding context for the AI model to generate a response. 
It&#8217;s important to evaluate the following aspects using our evaluators (aka the &#8220;RAG triad&#8221; metrics):<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/08\/RAG-triads.png\"><img decoding=\"async\" class=\" wp-image-1122 aligncenter\" src=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/08\/RAG-triads-300x169.png\" alt=\"RAG triads image\" width=\"653\" height=\"368\" srcset=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/08\/RAG-triads-300x169.png 300w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/08\/RAG-triads-1024x576.png 1024w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/08\/RAG-triads-768x432.png 768w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/08\/RAG-triads.png 1280w\" sizes=\"(max-width: 653px) 100vw, 653px\" \/><\/a><\/p>\n<ul>\n<li><a href=\"http:\/\/aka.ms\/retrieval\">Retrieval<\/a>: Is the search output relevant and useful for resolving the user&#8217;s query? This measures the relevance of the retrieval results to the user&#8217;s query.<\/li>\n<li><a href=\"http:\/\/aka.ms\/groundedness-doc\">Groundedness<\/a>: Is the response supported by the grounding source (e.g. the output of a search tool)? This measures the consistency of the generated response with respect to the grounding documents.<\/li>\n<li><a href=\"http:\/\/aka.ms\/relevance\">Relevance<\/a>: After agentic RAG retrieval, is the response relevant to the user query? The relevance of the final response to the query determines the end-customer&#8217;s satisfaction with the RAG application.<\/li>\n<\/ul>\n<p>In this best practice, we will focus on evaluating the end-to-end response of the Knowledge Agent using the Groundedness and Relevance evaluators.<\/p>\n<p>What can Knowledge Agent do? 
Knowledge Agent is an advanced agentic retrieval pipeline designed to extract grounding information from your knowledge sources. Given the conversation history and retrieval parameters, the agent:<\/p>\n<div>\n<ol>\n<li>Analyzes the entire conversation to infer the user\u2019s information need.<\/li>\n<li>Decomposes compound queries into focused subqueries.<\/li>\n<li>Executes subqueries concurrently against the configured knowledge sources.<\/li>\n<li>Uses the semantic ranker to re-rank and filter results.<\/li>\n<li>Merges and synthesizes the top results into a unified output.<\/li>\n<\/ol>\n<\/div>\n<div>\n<div>\n<div>In synthesis mode, the agent not only serves as a RAG search engine but also performs question answering in knowledge-intensive domains. The following snippet illustrates the idea of complex, contextual queries in a conversation:<\/div>\n<\/div>\n<\/div>\n<div>\n<div>\n<div>\n<pre class=\"prettyprint language-py\"><code class=\"language-py\">from azure.search.documents.agent import KnowledgeAgentRetrievalClient\r\nfrom azure.search.documents.agent.models import (\r\n\u00a0 \u00a0 KnowledgeAgentRetrievalRequest, KnowledgeAgentMessage, KnowledgeAgentMessageTextContent, SearchIndexKnowledgeSourceParams\r\n)\r\n\r\n# test a knowledge-intensive query or two\r\nmessages = []  # conversation history; add a system message here if needed\r\nresponse_contents = []  # collects the synthesized response per query\r\n\r\nagent_client = KnowledgeAgentRetrievalClient(endpoint=endpoint, agent_name=agent_name, credential=credential)\r\nquery_1 = \"\"\"\r\n\u00a0 \u00a0 Why do suburban belts display larger December brightening than urban cores even though absolute light levels are higher downtown?\r\n\u00a0 \u00a0 Why is the Phoenix nighttime street grid so sharply visible from space, whereas large stretches of the interstate between midwestern cities remain comparatively dim?\r\n\u00a0 \u00a0 \"\"\"\r\nmessages.append({\r\n\u00a0 \u00a0 \"role\": \"user\",\r\n\u00a0 \u00a0 \"content\": query_1\r\n})\r\nretrieval_request = KnowledgeAgentRetrievalRequest(\r\n\u00a0 \u00a0 messages=[\r\n\u00a0 \u00a0 \u00a0 \u00a0 
KnowledgeAgentMessage(\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 role=m[\"role\"],\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 content=[KnowledgeAgentMessageTextContent(text=m[\"content\"])]\r\n\u00a0 \u00a0 \u00a0 \u00a0 ) for m in messages if m[\"role\"] != \"system\"\r\n\u00a0 \u00a0 ],\r\n\u00a0 \u00a0 knowledge_source_params=[\r\n\u00a0 \u00a0 \u00a0 \u00a0 SearchIndexKnowledgeSourceParams(\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 knowledge_source_name=knowledge_source_name,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 kind=\"searchIndex\"\r\n\u00a0 \u00a0 \u00a0 \u00a0 )\r\n\u00a0 \u00a0 ]\r\n)\r\nresult = agent_client.retrieve(retrieval_request=retrieval_request, api_version=api_version)\r\nprint(\"Retrieve successful\")\r\n\r\n\r\nimport json\r\n# Build a simple string value for response_content\r\n# Responses -&gt; concatenate all text\/value fields from all response contents\r\nresponse_parts = []\r\nif getattr(result, \"response\", None):\r\n\u00a0 \u00a0 for resp in result.response:\r\n\u00a0 \u00a0 \u00a0 \u00a0 for content in getattr(resp, \"content\", []):\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 text = getattr(content, \"text\", None) or getattr(content, \"value\", None) or str(content)\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 response_parts.append(text)\r\nresponse_content = \"\\n\\n\".join(response_parts) if response_parts else \"No response found on 'result'\"\r\nresponse_contents.append(response_content)\r\n# Print the synthesized response\r\nprint(\"response_content:\\n\", response_content, \"\\n\")<\/code><\/pre>\n<\/div>\n<\/div>\n<\/div>\n<div>The Knowledge Agent returns in response: &#8220;Suburban belts display larger December brightening than urban cores primarily due to the nature of residential areas, which often have more yard space and single-family homes conducive to holiday lighting displays. 
In contrast, urban cores, while having higher absolute light levels, do not experience as significant an increase in lighting during the holiday season, with brightening typically around 20 to 30 percent [ref_id:8].<\/div>\n<div><\/div>\n<div>Regarding the visibility of the Phoenix nighttime street grid from space, it is sharply defined due to the regular grid layout of the city blocks and streets, which is illuminated by street lighting. This grid pattern is particularly evident at night, making it easy to distinguish from a low-Earth-orbit perspective [ref_id:0]. In contrast, large stretches of the interstate between midwestern cities remain comparatively dim because these areas are less densely populated and have fewer urban centers, resulting in less artificial lighting overall [ref_id:9].&#8221;<\/div>\n<div><\/div>\n<div>\n<div>Notice that the synthesized response addresses each query and includes reference tags. Because the agent was configured as an ANSWER_SYNTHESIS knowledge agent, the retrieval response contains three useful parts:<\/div>\n<ul>\n<li>response \u2014 a natural-language, synthesized answer (returned in response.content).<\/li>\n<li>activity \u2014 detailed planning and execution information (shows subqueries, reranking decisions, and intermediate steps).<\/li>\n<li>references \u2014 the source documents and chunks that contributed to the answer (present when include_references and include_reference_source_data are enabled).<\/li>\n<\/ul>\n<div><em><strong>Tip<\/strong><\/em>: retrieval parameters (for example, reranker thresholds, target index params, and knowledge source params) influence how aggressive the agent is in reranking and which sources it queries. 
Inspect activity and references to validate grounding and build traceable citations.<\/div>\n<div>To fine-tune these parameters, we first need to evaluate the quality of the current set of parameters using the end-to-end RAG evaluators Groundedness and Relevance.<\/div>\n<\/div>\n<div><\/div>\n<div>\n<div>\n<div>\n<pre class=\"prettyprint language-py\"><code class=\"language-py\">## Run Evaluation with Microsoft Foundry\r\nimport os, json\r\nfrom azure.ai.evaluation import (\r\n\u00a0 \u00a0 AzureOpenAIModelConfiguration,\r\n\u00a0 \u00a0 GroundednessEvaluator,\r\n\u00a0 \u00a0 RelevanceEvaluator,\r\n\u00a0 \u00a0 evaluate,\r\n)\r\n\r\nai_foundry_project_endpoint = os.environ.get(\"AI_FOUNDRY_PROJECT_ENDPOINT\")\r\nevaluation_data = []\r\nfor q, r, g in zip([query_1, query_2], references_contents, response_contents):\r\n\u00a0 \u00a0 evaluation_data.append({\r\n\u00a0 \u00a0 \u00a0 \u00a0 \"query\": q,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \"response\": g,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \"context\": r,\r\n\u00a0 \u00a0 })\r\nfilename = \"evaluation_data.jsonl\"\r\nwith open(filename, \"w\") as f:\r\n\u00a0 \u00a0 for item in evaluation_data:\r\n\u00a0 \u00a0 \u00a0 \u00a0 f.write(json.dumps(item) + \"\\n\")\r\nmodel_config = AzureOpenAIModelConfiguration(\r\n\u00a0 \u00a0 azure_endpoint=azure_openai_endpoint,\r\n\u00a0 \u00a0 api_version=azure_openai_api_version,\r\n\u00a0 \u00a0 azure_deployment=azure_openai_gpt_model\r\n)\r\n# RAG triad metrics\r\ngroundedness = GroundednessEvaluator(model_config=model_config)\r\nrelevance = RelevanceEvaluator(model_config=model_config)\r\nresult = evaluate(\r\n\u00a0 \u00a0 data=filename,\r\n\u00a0 \u00a0 evaluators={\r\n\u00a0 \u00a0 \u00a0 \u00a0 \"groundedness\": groundedness,\r\n\u00a0 \u00a0 \u00a0 \u00a0 \"relevance\": relevance,\r\n\u00a0 \u00a0 },\r\n\u00a0 \u00a0 azure_ai_project=ai_foundry_project_endpoint,\r\n)\r\nprint(\"Evaluation complete.\")\r\nstudio_url = 
result.get(\"studio_url\")\r\nif studio_url:\r\n\u00a0 \u00a0 print(\"AI Foundry Studio URL:\", studio_url)<\/code><\/pre>\n<\/div>\n<\/div>\n<\/div>\n<div>\n<div>Now you can navigate to the Foundry UI to visualize the batch evaluation results and gauge the answer quality of your Knowledge Agent. The Groundedness and Relevance evaluation results include pass\/fail verdicts and the supporting reasoning. After evaluating one set of parameters for the Knowledge Agent, simply repeat the exercise with another set of parameters of interest, such as the reranker threshold. By evaluating different sets of parameters (A\/B testing), you can <strong>make sure that your Knowledge Agent is attuned to your enterprise data<\/strong>.<\/div>\n<\/div>\n<div><\/div>\n<div>Follow this notebook for an end-to-end example: https:\/\/aka.ms\/knowledge-agent-eval-sample.<\/div>\n<h3>Best Practice 2: Optimize your RAG search parameters<\/h3>\n<p>Document retrieval quality is a common bottleneck in RAG search. One best practice is to optimize your RAG search parameters according to your enterprise data. For advanced users who can curate ground truth for document retrieval results &#8211; query relevance labels (&#8220;qrels&#8221; for short) &#8211; an important scenario is to &#8220;sweep&#8221; the parameters and optimize them by evaluating document retrieval quality with golden metrics such as XDCG and Max Relevance.<\/p>\n<div>\n<div>What are the Document Retrieval Metrics? 
They include the following golden metrics in information retrieval:<\/div>\n<div>\n<table style=\"border-collapse: collapse; width: 100%; height: 144px;\">\n<tbody>\n<tr style=\"height: 24px;\">\n<td style=\"width: 6.55734%; height: 24px;\">Metric<\/td>\n<td style=\"width: 7.94831%; height: 24px;\">Higher is better<\/td>\n<td style=\"width: 85.4942%; height: 24px;\">Description<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 6.55734%; height: 24px;\">Fidelity<\/td>\n<td style=\"width: 7.94831%; height: 24px;\">Yes<\/td>\n<td style=\"width: 85.4942%; height: 24px;\">How well the top k retrieved chunks reflect the content for a given query; the number of good documents returned out of the total number of known good documents in a dataset.<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 6.55734%; height: 24px;\">NDCG<\/td>\n<td style=\"width: 7.94831%; height: 24px;\">Yes<\/td>\n<td style=\"width: 85.4942%; height: 24px;\">How close the ranking is to an ideal order, where all relevant items are at the top of the list.<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 6.55734%; height: 24px;\">XDCG<\/td>\n<td style=\"width: 7.94831%; height: 24px;\">Yes<\/td>\n<td style=\"width: 85.4942%; height: 24px;\">How good the results are in the top-k documents, regardless of the scoring of other index documents.<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 6.55734%; height: 24px;\">Max Relevance N<\/td>\n<td style=\"width: 7.94831%; height: 24px;\">Yes<\/td>\n<td style=\"width: 85.4942%; height: 24px;\">Maximum relevance score in the top-k chunks.<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 6.55734%; height: 24px;\">Holes<\/td>\n<td style=\"width: 7.94831%; height: 24px;\">No<\/td>\n<td style=\"width: 85.4942%; height: 24px;\">Number of documents with missing query relevance judgments (ground truth).<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<div><\/div>\n<div>Given these golden metrics in <a 
href=\"https:\/\/aka.ms\/doc-retrieval-evaluator\">Document Retrieval<\/a> evaluator, we can <strong>enable even more precise measurement and turbo-charge the parameter sweep scenario for any search engine that returns relevance scores<\/strong>.\u00a0 For illustration purposes, we will use <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-foundry\/agents\/how-to\/tools\/azure-ai-search?tabs=azurecli\">Azure AI Search<\/a> as the search engine, but you can use anything from Knowledge Agent (information extraction mode) to LlamaIndex.<\/div>\n<\/div>\n<p><a href=\"https:\/\/microsoft-my.sharepoint.com\/:v:\/p\/changliu2\/ER03-yajg05JvRWcFtNlO_UB57XtJH4ryAJfL91NqTWkxw?e=acg7Pc\">[Demo] How to debug your RAG agent in Microsoft Foundry &#8211; optimize your RAG<\/a><\/p>\n<p>First, prepare some test queries and generate retrieval results with relevance scores using <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-foundry\/agents\/how-to\/tools\/azure-ai-search?tabs=azurecli\">Azure AI Search<\/a>. Then, for each search query, have a human judge, typically a subject matter expert, label the relevance of each search result. Alternatively, employ an LLM judge. 
For example, we can reuse the <a href=\"http:\/\/aka.ms\/relevance-doc\">Relevance<\/a> evaluator as mentioned above to score each text chunk.<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\"># for each search query, say 5 document ids are retrieved with relevance scores\r\nretrieved_documents = [\r\n    {\"document_id\": \"2\", \"relevance_score\": 0.70},\r\n    {\"document_id\": \"1\", \"relevance_score\": 0.65},\r\n    {\"document_id\": \"5\", \"relevance_score\": 0.6},\r\n    {\"document_id\": \"6\", \"relevance_score\": 0.5},\r\n    {\"document_id\": \"7\", \"relevance_score\": 0.25}\r\n]\r\n# these labels can come from a human judge or an LLM-judge\r\n# let's say we use a 0-5 scale for rating\r\nretrieval_ground_truth = [\r\n    {\"document_id\": \"1\", \"query_relevance_label\": 4},\r\n    {\"document_id\": \"2\", \"query_relevance_label\": 3},\r\n    {\"document_id\": \"3\", \"query_relevance_label\": 3},\r\n    {\"document_id\": \"4\", \"query_relevance_label\": 3},\r\n    {\"document_id\": \"5\", \"query_relevance_label\": 2}\r\n]\r\n\r\nfrom azure.ai.evaluation import DocumentRetrievalEvaluator\r\nimport json\r\n\r\ndoc_retrieval_eval = DocumentRetrievalEvaluator(\r\n    # inform the evaluator about your ground truth rating scale\r\n    ground_truth_label_min=0,\r\n    ground_truth_label_max=5,\r\n    # optionally, customize your thresholds for pass or fail with your data\r\n    ndcg_threshold=0.5,\r\n    xdcg_threshold=50.0,\r\n    fidelity_threshold=0.5,\r\n    top1_relevance_threshold=50.0,\r\n    top3_max_relevance_threshold=50.0,\r\n    total_retrieved_documents_threshold=50,\r\n    total_ground_truth_documents_threshold=50\r\n)\r\nresults = doc_retrieval_eval(retrieval_ground_truth=retrieval_ground_truth, retrieved_documents=retrieved_documents)\r\nprint(json.dumps(results, indent=4))<\/code><\/pre>\n<div>\n<div>\n<p>The results show the quality of one instance of a search retrieval:<\/p>\n<pre class=\"prettyprint language-py\"><code class=\"language-py\">{'ndcg@3': 0.31075932533963707, 
'xdcg@3': 39.285714285714285, 'fidelity': 0.39285714285714285, 'top1_relevance': 2, 'top3_max_relevance': 3, 'holes': 2, 'holes_ratio': 0.4, 'total_retrieved_documents': 5, 'total_ground_truth_documents': 5, ..., 'ndcg@3_result': 'fail', 'ndcg@3_threshold': 0.5, 'ndcg@3_higher_is_better': True, 'xdcg@3_result': 'fail', 'xdcg@3_threshold': 50.0, 'xdcg@3_higher_is_better': True, 'fidelity_result': 'fail', 'fidelity_threshold': 0.5 }<\/code><\/pre>\n<\/div>\n<p>Each metric reports a numerical score, and the &#8220;{metric_name}_result&#8221; field returns pass\/fail based on a threshold you can override. This is helpful when we compare multiple evaluation results across different sets of parameters. For illustration purposes, we prepare 4 datasets corresponding to multiple search algorithms as parameters (text, semantic, vector, hybrid search) and submit a batch evaluation to compare their quality. Other parameters of interest to tune include `top_k`, `chunk_size`, or document overlap size as you create the index. 
In general, other search engines (as long as they return relevance scores) expose their own parameter settings that you will want to fine-tune against your enterprise dataset.<\/p>\n<div>\n<pre class=\"prettyprint language-py\"><code class=\"language-py\">import os\r\nfrom azure.ai.evaluation import DocumentRetrievalEvaluator, evaluate\r\n\r\nazure_ai_project = os.getenv(\"PROJECT_ENDPOINT\")\r\nresults = {}\r\nfor config_name, sample_data_file_name in [\r\n\u00a0 \u00a0 (\"Text Search\", \"evaluate-text.jsonl\"),\r\n\u00a0 \u00a0 (\"Semantic Search\", \"evaluate-semantic.jsonl\"),\r\n\u00a0 \u00a0 (\"Vector Search\", \"evaluate-vector.jsonl\"),\r\n\u00a0 \u00a0 (\"Hybrid Search\", \"evaluate-hybrid.jsonl\")\r\n]:\r\n\u00a0 \u00a0 doc_retrieve = DocumentRetrievalEvaluator()\r\n\u00a0 \u00a0 response = evaluate(\r\n\u00a0 \u00a0 \u00a0 \u00a0 data=sample_data_file_name,\r\n\u00a0 \u00a0 \u00a0 \u00a0 evaluation_name=f\"Doc retrieval eval demo - {config_name} run\",\r\n\u00a0 \u00a0 \u00a0 \u00a0 evaluators={\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \"DocumentRetrievalEvaluator\": doc_retrieve,\r\n\u00a0 \u00a0 \u00a0 \u00a0 },\r\n\u00a0 \u00a0 \u00a0 \u00a0 azure_ai_project=azure_ai_project,\r\n\u00a0 \u00a0 )\r\n\u00a0 \u00a0 results[config_name] = response<\/code><\/pre>\n<\/div>\n<\/div>\n<p>This generates multiple evaluation runs in Microsoft Foundry, one per search parameter setting, in this case, per search algorithm. Use the rich visualization for Document Retrieval in the Foundry Evaluation UI (a fast-follow item available within 30 days) to find the optimal search parameter as follows:<\/p>\n<p><strong>1. 
Select the multiple parameter runs in Foundry and choose &#8220;Compare&#8221;:<\/strong>\n<img decoding=\"async\" class=\"alignnone wp-image-766\" src=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/05\/Doc-retrieval-list-view-300x198.png\" alt=\"Doc retrieval list view image\" width=\"1020\" height=\"673\" srcset=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/05\/Doc-retrieval-list-view-300x198.png 300w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/05\/Doc-retrieval-list-view-768x507.png 768w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/05\/Doc-retrieval-list-view.png 985w\" sizes=\"(max-width: 1020px) 100vw, 1020px\" \/><\/p>\n<p><strong>2. View the tabular results for all evaluation runs:<\/strong>\n<img decoding=\"async\" class=\"alignnone wp-image-767\" src=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/05\/Doc-retrieval-compare-view-300x87.png\" alt=\"Doc retrieval compare view image\" width=\"955\" height=\"277\" srcset=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/05\/Doc-retrieval-compare-view-300x87.png 300w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/05\/Doc-retrieval-compare-view-1024x298.png 1024w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/05\/Doc-retrieval-compare-view-768x223.png 768w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/05\/Doc-retrieval-compare-view-1536x447.png 1536w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/05\/Doc-retrieval-compare-view.png 1658w\" sizes=\"(max-width: 955px) 100vw, 955px\" \/><\/p>\n<p><strong>3. 
Find the best parameter for the charts per metric (`xdcg@3` for example):<\/strong>\n<img decoding=\"async\" class=\"alignnone wp-image-765\" src=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/05\/Doc-retrieval-best-parameter-xdcg-chart-300x150.png\" alt=\"Doc retrieval best parameter xdcg chart image\" width=\"1044\" height=\"522\" srcset=\"https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/05\/Doc-retrieval-best-parameter-xdcg-chart-300x150.png 300w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/05\/Doc-retrieval-best-parameter-xdcg-chart-1024x512.png 1024w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/05\/Doc-retrieval-best-parameter-xdcg-chart-768x384.png 768w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/05\/Doc-retrieval-best-parameter-xdcg-chart-1536x768.png 1536w, https:\/\/devblogs.microsoft.com\/foundry\/wp-content\/uploads\/sites\/89\/2025\/05\/Doc-retrieval-best-parameter-xdcg-chart.png 1631w\" sizes=\"(max-width: 1044px) 100vw, 1044px\" \/><\/p>\n<p>Having found your optimal parameter for Azure AI Search, you can then confidently plug it back into your RAG agent, now attuned to your enterprise data. Note that this evaluator works for any search engine that returns relevance ranking scores, including LlamaIndex. 
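Since the Document Retrieval evaluator only needs document ids paired with relevance scores, plugging in a different search engine is mostly a matter of reshaping its results. A minimal sketch, assuming your engine returns (document_id, score) pairs; the helper name and input shape here are illustrative, not part of any SDK:

```python
def to_retrieved_documents(hits):
    """Map (doc_id, score) pairs from any search engine into the
    retrieved_documents shape shown earlier (ids as strings, scores as floats)."""
    return [
        {"document_id": str(doc_id), "relevance_score": float(score)}
        for doc_id, score in hits
    ]

# e.g. hits from a hypothetical engine, highest score first
hits = [("2", 0.70), ("1", 0.65), ("5", 0.60)]
print(to_retrieved_documents(hits)[0])  # -> {'document_id': '2', 'relevance_score': 0.7}
```

The same adapter pattern applies to the qrels side: any labeled judgment set can be mapped into the `retrieval_ground_truth` shape.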
Follow this notebook for an end-to-end example: https:\/\/aka.ms\/doc-retrieval-sample.<\/p>\n<p><strong>Related links:<\/strong><\/p>\n<ul>\n<li><a href=\"https:\/\/devblogs.microsoft.com\/foundry\/evaluation-metrics-azure-ai-foundry\/\">Unlocking the Power of Agentic Applications New Evaluation Metrics for Quality and Safety | Azure AI Foundry Blog<\/a><\/li>\n<li><a href=\"https:\/\/devblogs.microsoft.com\/foundry\/achieve-end-to-end-observability-in-azure-ai-foundry\/\">Achieve End-to-End Observability in Azure AI Foundry | Azure AI Foundry Blog<\/a><\/li>\n<li><a href=\"https:\/\/techcommunity.microsoft.com\/blog\/azure-ai-services-blog\/bonus-rag-time-journey-agentic-rag\/4404652\">Bonus Journey: Agentic RAG &#8211; Combining Agents with Retrieval-Augmented Generation<\/a><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>TL;DR Learn the best practices to evaluate and optimize the RAG quality of your agent using evaluations in Foundry Observability. This tutorial demonstrates these 2 best practices before deploying your RAG agent: 1. Evaluate and optimize end to end your RAG agent using reference-free RAG triad evaluators: Groundedness and Relevance evaluators; 2. For advanced search [&hellip;]<\/p>\n","protected":false},"author":186954,"featured_media":1563,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1118","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-microsoft-foundry"],"acf":[],"blog_post_summary":"<p>TL;DR Learn the best practices to evaluate and optimize the RAG quality of your agent using evaluations in Foundry Observability. This tutorial demonstrates these 2 best practices before deploying your RAG agent: 1. 
Evaluate and optimize end to end your RAG agent using reference-free RAG triad evaluators: Groundedness and Relevance evaluators; 2. For advanced search [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/posts\/1118","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/users\/186954"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/comments?post=1118"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/posts\/1118\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/media\/1563"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/media?parent=1118"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/categories?post=1118"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/foundry\/wp-json\/wp\/v2\/tags?post=1118"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}