{"id":16233,"date":"2025-06-05T00:00:00","date_gmt":"2025-06-05T07:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/ise\/?p=16233"},"modified":"2025-06-05T12:02:49","modified_gmt":"2025-06-05T19:02:49","slug":"hierarchical-waterfall-evaluation-query-classification-rag-llm","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/hierarchical-waterfall-evaluation-query-classification-rag-llm\/","title":{"rendered":"Hierarchical Waterfall Evaluation of Query Classification, Retrieval &amp; Generation in Multi-Agent LLM Systems"},"content":{"rendered":"<h2>Introduction<\/h2>\n<p>As organizations increasingly deploy multi-agent LLM systems to handle complex queries and workflows, the need for robust evaluation frameworks becomes crucial. Based on our recent work implementing hypothesis-driven experimentation for a large enterprise client, this post shares insights on building a hierarchical waterfall evaluation framework focused on query classification and retrieval-augmented generation (RAG) in multi-agent systems.<\/p>\n<p>This blog explores how we established structured evaluation pipelines, worked with domain experts, and implemented key metrics that went beyond simple accuracy scores to provide actionable insights for system improvement.<\/p>\n<h2>The Evaluation Challenge<\/h2>\n<p>Multi-agent systems, where different specialist AI agents handle different aspects of user queries, introduce unique evaluation challenges:<\/p>\n<ul>\n<li><strong>Routing Accuracy<\/strong>: If queries aren&#8217;t routed to the right agent(s), the entire system breaks down, potentially producing plausible but incorrect answers that mislead users<\/li>\n<li><strong>Multi-label Classification<\/strong>: Queries often require multiple agents to work together<\/li>\n<li><strong>Retrieval Quality<\/strong>: Agents need to find the right documents and content chunks from knowledge sources, making retrieval accuracy a critical 
factor in multi-agent systems just as it is in traditional RAG implementations<\/li>\n<li><strong>Answer Generation<\/strong>: Final responses must be accurate, relevant, and properly grounded<\/li>\n<\/ul>\n<p>Our approach focused on developing comprehensive evaluation pipelines that addressed each of these aspects. We prioritized providing transparent, actionable feedback rather than opaque performance scores. Because our current implementation primarily uses retrieval-based agents, retrieval quality is a key metric in our evaluation framework. However, the framework is extensible\u2014additional metrics, such as task completion rates and execution accuracy, can be incorporated for other types of agents.<\/p>\n<h2>Evaluation Framework Development<\/h2>\n<p>A key focus of our engagement was implementing hypothesis-driven experimentation, which required establishing a robust evaluation framework. This framework was built on high-quality evaluation data collected from multiple sources including subject matter experts (SMEs), existing system data, actual user queries, and supplementary synthetic data.<\/p>\n<p>We prioritized creating evaluation datasets that genuinely represented real-world use cases rather than merely generating evaluation metrics. This approach ensured our improvements directly impacted business outcomes and user experience.<\/p>\n<p>The data annotation process involved close collaboration with SMEs who provided critical annotations including:<\/p>\n<ul>\n<li>Query classification<\/li>\n<li>Appropriate agent selection<\/li>\n<li>Link\/document retrieval for knowledge indexes<\/li>\n<li>Chunk retrieval within documents<\/li>\n<li>Reference answers<\/li>\n<\/ul>\n<p>We treated evaluation data with the same rigor as production code. This included implementing versioning, comprehensive documentation, and detailed process tracking. 
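<\/p>\n<p>To illustrate, a lightweight way to version evaluation data is to record a content hash alongside provenance metadata whenever the dataset changes; the field names below are hypothetical:<\/p>\n<pre><code class=\"language-python\">import hashlib\r\nimport json\r\nfrom datetime import date\r\n\r\ndef version_record(dataset_text, annotator, notes):\r\n    # Hash the serialized dataset so any change yields a new version entry\r\n    digest = hashlib.sha256(dataset_text.encode('utf-8')).hexdigest()\r\n    return {\r\n        'sha256': digest[:12],\r\n        'date': date.today().isoformat(),\r\n        'annotator': annotator,\r\n        'notes': notes,\r\n    }\r\n\r\n# Serialize with sort_keys so the hash is stable across key ordering\r\ndata = json.dumps([{'id': '1', 'labels': ['billing', 'search']}], sort_keys=True)\r\nprint(version_record(data, 'sme-team', 'initial SME-labelled queries'))<\/code><\/pre>\n<p>Appending such records to a changelog gives every metrics report a dataset version it can be traced back to.<\/p>\n<p>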
This methodical approach prevents metric misinterpretation and ensures teams maintain context when analyzing results.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/06\/eval-framework.png\" alt=\"Multi-Agent Evaluation Framework\" \/><\/p>\n<p><em>Fig 1: Comprehensive multi-agent evaluation framework \u2013 Our hierarchical approach evaluates routing, retrieval, and generation components to provide a complete assessment of system performance.<\/em><\/p>\n<h3>Hierarchical Schema for Evaluation Data<\/h3>\n<p>To structure our evaluation data effectively, we developed a simple hierarchical schema that follows our sequential evaluation process:<\/p>\n<pre><code>EvaluationData\/\r\n\u251c\u2500\u2500 Queries\/\r\n\u2502   \u251c\u2500\u2500 all_queries.json               # All test queries\r\n\u2502   \u2514\u2500\u2500 query_metadata.json            # Domain, difficulty, query type metadata\r\n\u2502\r\n\u251c\u2500\u2500 AgentClassification\/\r\n\u2502   \u251c\u2500\u2500 ground_truth_labels.json       # Correct agent assignments for each query\r\n\u2502   \u251c\u2500\u2500 model_predictions.json         # System's agent selections\r\n\u2502   \u2514\u2500\u2500 classification_results.json    # Accuracy, precision, recall, F1 metrics\r\n\u2502\r\n\u251c\u2500\u2500 Retrieval\/                         # Only evaluated for correctly classified queries\r\n\u2502   \u251c\u2500\u2500 document_retrieval\/\r\n\u2502   \u2502   \u251c\u2500\u2500 ground_truth_links.json    # Expected documents\/links\r\n\u2502   \u2502   \u251c\u2500\u2500 retrieved_links.json       # Actually retrieved documents\r\n\u2502   \u2502   \u2514\u2500\u2500 document_metrics.json      # Precision@k, recall@k scores\r\n\u2502   \u2502\r\n\u2502   \u2514\u2500\u2500 chunk_retrieval\/\r\n\u2502       \u251c\u2500\u2500 ground_truth_chunks.json   # Expected content chunks\r\n\u2502       \u251c\u2500\u2500 retrieved_chunks.json     
 # Actually retrieved chunks\r\n\u2502       \u2514\u2500\u2500 chunk_metrics.json         # Rouge scores, precision, recall\r\n\u2502\r\n\u2514\u2500\u2500 Generation\/                        # Final answer evaluation\r\n    \u251c\u2500\u2500 reference_answers.json         # SME-provided reference answers\r\n    \u251c\u2500\u2500 system_responses.json          # Generated answers\r\n    \u2514\u2500\u2500 answer_metrics.json            # Groundedness, similarity, factuality scores<\/code><\/pre>\n<p>This sequential schema ensures we:<\/p>\n<ol>\n<li>First evaluate if queries are routed to the correct agent(s)<\/li>\n<li>Only for correctly routed queries, measure document and chunk retrieval accuracy<\/li>\n<li>Finally assess answer quality based on groundedness and alignment with references<\/li>\n<\/ol>\n<p>This approach prevents misattribution of errors and allows us to clearly identify which stage of the pipeline may need improvement.<\/p>\n<h2>Multi-Label Router Classification Evaluation Pipeline<\/h2>\n<p>User queries sent to multi-agent applications need to be routed to the relevant AI Agents and then the relevant agent(s) invoked to generate a response. This involves classifying the user queries into categories that align with each AI Agent or multiple agents.<\/p>\n<p>The router agent was responsible for deciding which AI Agents needed to be invoked based on the user query. Each agent is assigned a label, so multi-label classification refers to evaluating the multi-agent participation requirements for a user query.<\/p>\n<h3>Multi-Class vs. 
Multi-Label<\/h3>\n<ol>\n<li><strong>Multi-Class<\/strong>\n<ul>\n<li>Each data point (query) is assigned exactly one label\/class\/agent<\/li>\n<li>Example: A router picks only one agent (e.g., knowledge base, generic LLM, or web search) for each query, even if more than one agent might apply<\/li>\n<li>This is typically simpler to implement and interpret because each query has a single &#8220;best&#8221; choice<\/li>\n<\/ul>\n<\/li>\n<li><strong>Multi-Label<\/strong>\n<ul>\n<li>Each data point (query) can be assigned multiple labels\/classes<\/li>\n<li>Example: A query could be routed to both &#8220;knowledge base&#8221; and &#8220;web search&#8221; if appropriate<\/li>\n<li>More challenging because you have to handle overlaps between classes, and your metrics (precision, recall, F1) become more complex in a multi-label setting<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<p>Generally, multi-label classification is harder than multi-class because you must learn not just to pick the correct single label, but to identify all correct labels simultaneously. This requires a more robust approach, especially to ensure that the router captures how labels\/agents relate to each other.<\/p>\n<p>Our evaluation pipeline for multi-label classification serves as the critical entry point and primary accuracy bottleneck for the entire multi-agent system. If queries aren&#8217;t routed to the correct agents at this stage, even perfect downstream components will fail to produce accurate results.<\/p>\n<p>Rather than providing just another &#8220;black box&#8221; metric like F1 score, our pipeline generates detailed diagnostic error analysis reports that illuminate exactly where and how the system is failing, enabling targeted improvements. Specifically, the pipeline evaluates how effectively the multi-agent router classifies each query. 
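<\/p>\n<p>To make this concrete, the per-query comparison underlying such a report can be sketched in a few lines; the agent labels here are illustrative:<\/p>\n<pre><code class=\"language-python\">def query_scores(gt_labels, pred_labels):\r\n    gt, pred = set(gt_labels), set(pred_labels)\r\n    overlap = len(gt &amp; pred)\r\n    precision = overlap \/ len(pred) if pred else 0.0\r\n    recall = overlap \/ len(gt) if gt else 0.0\r\n    f1 = 2 * precision * recall \/ (precision + recall) if precision + recall else 0.0\r\n    return {'exact': gt == pred, 'precision': precision, 'recall': recall, 'f1': f1}\r\n\r\n# A query needing two agents where the router selected only one:\r\nprint(query_scores(['knowledge_base', 'web_search'], ['knowledge_base']))\r\n# exact: False, precision: 1.0, recall: 0.5 -- a partial match, not a complete miss<\/code><\/pre>\n<p>Distinguishing partial matches from complete misses in this way is what turns a flat F1 score into an actionable error report.<\/p>\n<p>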
It compares predicted agent assignments against expert-labeled ground truth data to identify both strengths and areas for improvement.<\/p>\n<h3>Error Analysis Framework<\/h3>\n<p>The error analysis report serves as the cornerstone of our development process, addressing critical questions:<\/p>\n<ul>\n<li>Which specific query types consistently trigger incorrect agent selection?<\/li>\n<li>What patterns exist in queries where the system performs well vs. poorly?<\/li>\n<li>Are there particular agent combinations the system struggles to identify?<\/li>\n<li>Do specific keywords or phrasings correlate with routing errors?<\/li>\n<li>Which errors would have the highest business impact if fixed?<\/li>\n<\/ul>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/06\/multi-label-error-analysis-report.png\" alt=\"Error Analysis Driven Development\" \/><\/p>\n<p><em>Fig 2: Error analysis driven development approach \u2013 Our evaluation pipeline focuses on detailed error analysis rather than single metrics, enabling iterative improvements through targeted insights.<\/em><\/p>\n<p>Our pipeline provides more than standard metrics (precision, recall, F1):<\/p>\n<ul>\n<li>Per-query analysis showing exactly which agent assignments were correct and incorrect<\/li>\n<li>Detailed examination of multi-label cases where some but not all agents were correctly assigned<\/li>\n<li>Pattern recognition across error categories to identify systematic failure modes<\/li>\n<li>Performance breakdowns by query type, complexity, and domain<\/li>\n<\/ul>\n<p>The analysis results include several key components:<\/p>\n<ol>\n<li><strong>Dataset Statistics<\/strong>: General statistics about the overall dataset<\/li>\n<li><strong>Missed Entries<\/strong>: Data queries with ground truth but no prediction entries<\/li>\n<li><strong>Filtered Data<\/strong>: Any data queries excluded from evaluation<\/li>\n<li><strong>Class 
Distribution<\/strong>: The distribution between different agent classes<\/li>\n<li><strong>Multi-Label Data<\/strong>: Analysis of queries requiring multiple agents<\/li>\n<li><strong>Incorrect Predictions<\/strong>: Breakdown of errors as partial matches or complete misses<\/li>\n<li><strong>Overall Metrics<\/strong>: Exact match accuracy, average precision, recall, and F1-scores<\/li>\n<li><strong>Per-Class Performance<\/strong>: Metrics broken down by agent type<\/li>\n<\/ol>\n<h3>Implementation and Example<\/h3>\n<p>We implemented this approach in a Python evaluation script that processes ground truth and prediction data to generate detailed reports. Here&#8217;s a simplified version of our implementation:<\/p>\n<pre><code class=\"language-python\">#!\/usr\/bin\/env python3\r\n\"\"\"\r\nGeneral Multi-Label Classification Evaluation with Detailed Analysis\r\n\r\nUsage:\r\n    python eval_multilabel_full.py \\\r\n      --gt ground_truth.json \\\r\n      --pred predictions.json \\\r\n      --outdir results\/ \\\r\n      --remove unknown outofscope\r\n\"\"\"\r\n\r\nimport json\r\nimport argparse\r\nfrom pathlib import Path\r\nfrom collections import Counter\r\nfrom sklearn.preprocessing import MultiLabelBinarizer\r\nfrom sklearn.metrics import classification_report\r\n\r\ndef load_json(path):\r\n    return json.loads(Path(path).read_text())\r\n\r\ndef norm(raw):\r\n    \"\"\"Normalize a raw label field into a list of lowercase labels.\"\"\"\r\n    if raw is None:\r\n        return []\r\n    if isinstance(raw, str):\r\n        raw = [x.strip() for x in raw.split(',')]\r\n    if isinstance(raw, (list, tuple)):\r\n        return [x.lower() for x in raw]\r\n    return [str(raw).lower()]\r\n\r\ndef align(gt, pred):\r\n    \"\"\"Align GT and predictions by ID, return common IDs, missing, extra, and maps.\"\"\"\r\n    gt_map = {d['id']: d for d in gt}\r\n    pred_map = {d['id']: d for d in pred}\r\n    common = sorted(set(gt_map) &amp; set(pred_map))\r\n    missing = 
sorted(set(gt_map) - set(pred_map))\r\n    extra   = sorted(set(pred_map) - set(gt_map))\r\n    return common, missing, extra, gt_map, pred_map\r\n\r\ndef filter_idxs(labels_list, remove):\r\n    \"\"\"Return indices of entries whose labels do not include any remove-categories.\"\"\"\r\n    keep = []\r\n    for i, labs in enumerate(labels_list):\r\n        if not any(r in labs for r in remove):\r\n            keep.append(i)\r\n    return keep\r\n\r\ndef analyze(gt_raw, pred_raw, remove, outdir):\r\n    # Align by ID\r\n    common, missing, extra, gt_map, pred_map = align(gt_raw, pred_raw)\r\n    outdir = Path(outdir)\r\n    outdir.mkdir(parents=True, exist_ok=True)\r\n\r\n    # Extract &amp; normalize labels\r\n    gt = [norm(gt_map[i].get('labels')) for i in common]\r\n    pr = [norm(pred_map[i].get('labels')) for i in common]\r\n    ids = list(common)\r\n\r\n    # Filter out unwanted categories\r\n    keep = filter_idxs(gt, remove)\r\n    gt = [gt[i] for i in keep]\r\n    pr = [pr[i] for i in keep]\r\n    ids = [ids[i] for i in keep]\r\n\r\n    # Build class distributions\r\n    def dist(lst):\r\n        cnt = Counter([l for row in lst for l in row])\r\n        total = sum(cnt.values())\r\n        return [\r\n            {'label': lab, 'count': cnt[lab], 'pct': cnt[lab]\/total*100 if total else 0}\r\n            for lab, _ in cnt.most_common()\r\n        ]\r\n\r\n    stats = {\r\n        'dataset': {\r\n            'gt_total':   len(gt_raw),\r\n            'pred_total': len(pred_raw),\r\n            'common':     len(common),\r\n            'missing':    missing,\r\n            'extra':      extra\r\n        },\r\n        'distribution': {\r\n            'gt': dist(gt),\r\n            'pred': dist(pr)\r\n        }\r\n    }\r\n\r\n    # Multi-label analysis\r\n    multi = [i for i, labs in enumerate(gt) if len(labs) &gt; 1]\r\n    ml_exact = sum(1 for i in multi if set(gt[i]) == set(pr[i]))\r\n\r\n    # Single\u2192multi cases\r\n    s2m = [i for i, labs in 
enumerate(gt) if len(labs) == 1 and len(pr[i]) &gt; 1]\r\n    s2m_corr = sum(1 for i in s2m if gt[i][0] in pr[i])\r\n\r\n    # Collect incorrect predictions\r\n    incorrect = []\r\n    for i, (g0, p0) in enumerate(zip(gt, pr)):\r\n        if set(g0) != set(p0):\r\n            incorrect.append({\r\n                'id': ids[i],\r\n                'gt': g0,\r\n                'pred': p0,\r\n                'partial': bool(set(g0) &amp; set(p0)),\r\n                'missed': [x for x in g0 if x not in p0],\r\n                'extra': [x for x in p0 if x not in g0],\r\n            })\r\n\r\n    # Compute multi-label metrics; include predicted-only labels so\r\n    # spurious agent selections still count as false positives\r\n    all_classes = sorted({lab for row in gt for lab in row} | {lab for row in pr for lab in row})\r\n    mlb = MultiLabelBinarizer(classes=all_classes)\r\n    y_true = mlb.fit_transform(gt)\r\n    y_pred = mlb.transform(pr)\r\n    report = classification_report(\r\n        y_true, y_pred,\r\n        target_names=mlb.classes_,\r\n        output_dict=True,\r\n        zero_division=0\r\n    )\r\n    exact_match_acc = sum(1 for g0, p0 in zip(gt, pr) if set(g0) == set(p0)) \/ len(gt) if gt else 0\r\n\r\n    # Save metrics.json\r\n    out_metrics = {\r\n        **stats,\r\n        'exact_match': exact_match_acc,\r\n        'report': report,\r\n        'multi_label': {\r\n            'total': len(multi),\r\n            'exact_match': ml_exact,\r\n            'accuracy': ml_exact \/ len(multi) if multi else None\r\n        },\r\n        'single_to_multi': {\r\n            'total': len(s2m),\r\n            'correct': s2m_corr,\r\n            'pct_correct': (s2m_corr \/ len(s2m) * 100) if s2m else None\r\n        },\r\n        'incorrect': incorrect\r\n    }\r\n    with open(outdir \/ 'metrics.json', 'w') as f:\r\n        json.dump(out_metrics, f, indent=2)\r\n\r\n    # Write human-readable report\r\n    rpt = outdir \/ 'analysis_report.txt'\r\n    with rpt.open('w') as f:\r\n        f.write(\"=== DATASET STATISTICS ===\\n\")\r\n        f.write(f\"GT entries: 
{stats['dataset']['gt_total']}\\n\")\r\n        f.write(f\"Pred entries: {stats['dataset']['pred_total']}\\n\")\r\n        f.write(f\"Common IDs: {stats['dataset']['common']}\\n\")\r\n        f.write(f\"Missing IDs: {len(missing)}\\n\")\r\n        f.write(f\"Extra IDs: {len(extra)}\\n\\n\")\r\n\r\n        f.write(\"=== CLASS DISTRIBUTIONS ===\\n\")\r\n        for side in ['gt', 'pred']:\r\n            f.write(f\"-- {side.upper()} --\\n\")\r\n            for d in stats['distribution'][side]:\r\n                f.write(f\"  {d['label']}: {d['count']} ({d['pct']:.1f}%)\\n\")\r\n            f.write(\"\\n\")\r\n\r\n        f.write(\"=== OVERALL METRICS ===\\n\")\r\n        f.write(f\"Exact-match accuracy: {exact_match_acc:.3f}\\n\")\r\n        f.write(f\"Weighted F1-score: {report['weighted avg']['f1-score']:.3f}\\n\\n\")\r\n\r\n        f.write(\"=== MULTI-LABEL ANALYSIS ===\\n\")\r\n        f.write(f\"Total multi-label: {len(multi)}\\n\")\r\n        f.write(f\"Exact matches: {ml_exact}\\n\\n\")\r\n\r\n        f.write(\"=== SINGLE\u2192MULTI INSPECTION ===\\n\")\r\n        f.write(f\"Total single\u2192multi: {len(s2m)}, correct includes: {s2m_corr}\\n\\n\")\r\n\r\n        f.write(\"=== INCORRECT PREDICTIONS ===\\n\")\r\n        for e in incorrect:\r\n            f.write(\r\n                f\"- ID {e['id']}: GT={e['gt']} PRED={e['pred']} \"\r\n                f\"partial={e['partial']} missed={e['missed']} extra={e['extra']}\\n\"\r\n            )\r\n\r\n    print(f\"Metrics and report saved to: {outdir}\")\r\n\r\nif __name__ == \"__main__\":\r\n    parser = argparse.ArgumentParser(description=__doc__)\r\n    parser.add_argument('--gt', required=True, help='Ground truth JSON file')\r\n    parser.add_argument('--pred', required=True, help='Predictions JSON file')\r\n    parser.add_argument('--outdir', default='results', help='Output directory')\r\n    parser.add_argument('--remove', nargs='*', default=[],\r\n                        help='Label categories to exclude from evaluation')\r\n    args = parser.parse_args()\r\n    analyze(load_json(args.gt), load_json(args.pred), args.remove, args.outdir)<\/code><\/pre>\n<h3>Example Usage<\/h3>\n<p>Below are example JSON files that demonstrate how to use the evaluation script:<\/p>\n<p><strong>ground_truth.json<\/strong><\/p>\n<pre><code class=\"language-json\">[\r\n  { \"id\": \"1\", \"labels\": [\"billing\", \"search\"] },\r\n  { \"id\": \"2\", \"labels\": [\"faq\"] },\r\n  { \"id\": \"3\", \"labels\": [\"unknown\"] },\r\n  { \"id\": \"4\", \"labels\": [\"faq\", \"billing\"] 
}\r\n]<\/code><\/pre>\n<p><strong>predictions.json<\/strong><\/p>\n<pre><code class=\"language-json\">[\r\n  { \"id\": \"1\", \"labels\": [\"search\", \"billing\"] },\r\n  { \"id\": \"2\", \"labels\": [\"faq\"] },\r\n  { \"id\": \"3\", \"labels\": [\"fileupload\"] },\r\n  { \"id\": \"4\", \"labels\": [\"faq\"] }\r\n]<\/code><\/pre>\n<p>Place these files alongside your script, then run:<\/p>\n<pre><code class=\"language-bash\">python eval_multilabel_full.py --gt ground_truth.json --pred predictions.json --outdir my_results<\/code><\/pre>\n<p>Your <code>my_results\/metrics.json<\/code> and <code>my_results\/analysis_report.txt<\/code> will contain the full diagnostics that enable teams to identify and address specific issues in the multi-agent routing system.<\/p>\n<p>The weighted average F1-score provides a good overall performance tracker, while per-class metrics help identify which query types need improvement. This approach enabled our teams to iteratively improve routing accuracy from the initial 70% range to the high 80% range by focusing on the most impactful error categories.<\/p>\n<h2>Retrieval and Generation Metrics<\/h2>\n<p>Once we&#8217;ve established that queries are correctly routed to the appropriate agents, we need to evaluate two critical downstream components: retrieval accuracy and answer generation quality. 
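<\/p>\n<p>Evaluating downstream stages only on correctly routed queries keeps each error attributable to the stage that caused it; a minimal sketch of this gating, with illustrative field names:<\/p>\n<pre><code class=\"language-python\">def gate_correctly_routed(queries):\r\n    # Retrieval and generation metrics are computed only on this subset,\r\n    # so routing failures are never misattributed to downstream components\r\n    return [q for q in queries if set(q['pred_agents']) == set(q['gt_agents'])]\r\n\r\nqueries = [\r\n    {'id': '1', 'gt_agents': ['faq'], 'pred_agents': ['faq']},\r\n    {'id': '2', 'gt_agents': ['billing', 'search'], 'pred_agents': ['billing']},\r\n]\r\nprint([q['id'] for q in gate_correctly_routed(queries)])  # ['1']<\/code><\/pre>\n<p>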
These evaluations only proceed for queries that were correctly classified in the routing stage.<\/p>\n<h3>Retrieval Evaluation<\/h3>\n<p>For agents that retrieve information before generating answers, we implemented two levels of retrieval evaluation:<\/p>\n<ol>\n<li><strong>Document-Level Retrieval Metrics<\/strong>:\n<ul>\n<li>Precision@k (P@k): Proportion of retrieved documents that are relevant<\/li>\n<li>Recall@k (R@k): Proportion of relevant documents that are retrieved<\/li>\n<li>F1@k: Harmonic mean of precision and recall at k documents<\/li>\n<\/ul>\n<p>We used these metrics to evaluate how effectively agents found the correct documents or knowledge sources\/links for each query. These evaluations were applied to both our web search agent and our internal knowledge retrieval agent that queried the enterprise knowledge index.<\/li>\n<li><strong>Chunk-Level Retrieval Metrics<\/strong>:\n<ul>\n<li>Rouge-L scores measuring overlap between retrieved text chunks and ground truth chunks<\/li>\n<li>Semantic similarity between retrieved chunks and reference chunks using embedding models<\/li>\n<li>Coverage assessment evaluating whether retrieved chunks contain all necessary information<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<p>This approach helped us identify whether retrieval errors were occurring at the document selection stage or during chunk extraction. 
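<\/p>\n<p>The document-level metrics above reduce to a few lines of set arithmetic; a minimal sketch assuming a ranked list of retrieved document IDs and a ground-truth relevance set:<\/p>\n<pre><code class=\"language-python\">def precision_recall_at_k(retrieved, relevant, k):\r\n    top_k = retrieved[:k]\r\n    hits = sum(1 for doc in top_k if doc in relevant)\r\n    precision = hits \/ k\r\n    recall = hits \/ len(relevant) if relevant else 0.0\r\n    f1 = 2 * precision * recall \/ (precision + recall) if precision + recall else 0.0\r\n    return precision, recall, f1\r\n\r\nretrieved = ['doc3', 'doc1', 'doc7', 'doc2']\r\nrelevant = {'doc1', 'doc2', 'doc5'}\r\n# One relevant document in the top 3: precision@3 = recall@3 = 1\/3\r\nprint(precision_recall_at_k(retrieved, relevant, 3))<\/code><\/pre>\n<p>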
This insight provided valuable feedback to improve our indexing and chunking approaches while creating the knowledge index.<\/p>\n<h3>Answer Generation Evaluation<\/h3>\n<p>For the final generated responses, we developed a multifaceted evaluation approach:<\/p>\n<ol>\n<li><strong>Reference-Based Metrics<\/strong>:\n<ul>\n<li>Lexical similarity using Rouge, BLEU, and GLEU scores to measure word and n-gram overlap<\/li>\n<li>Semantic similarity using embedding-based models (BERTScore, SentenceBERT) to capture meaning beyond exact matches<\/li>\n<li>Structure and style assessment comparing formatting, citations, and organizational elements<\/li>\n<\/ul>\n<\/li>\n<li><strong>Reference-Free Metrics<\/strong>:\n<ul>\n<li>Groundedness evaluation ensuring answers don&#8217;t contain hallucinated information<\/li>\n<li>Factual correctness assessment using frameworks like RAGAS to verify claims<\/li>\n<li>Answer relevancy scoring to determine if responses address the original query<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<p>These comprehensive metrics gave us a holistic view of answer quality and identified specific improvement areas without relying on single aggregate scores.<\/p>\n<h3>Implementation Details<\/h3>\n<p>For generation evaluation, we used a combination of:<\/p>\n<ul>\n<li>Custom metrics calculated directly on the response texts<\/li>\n<li>Off-the-shelf libraries like RAGAS for automated assessment<\/li>\n<li>Human evaluation for a subset of responses to calibrate automated metrics<\/li>\n<\/ul>\n<p>We found that factuality and groundedness were particularly important for enterprise applications, where incorrect information poses significant risks. Our evaluation framework prioritized these aspects alongside traditional metrics like relevance and completeness.<\/p>\n<p>For reference-based evaluations, we worked with SMEs to develop gold-standard reference answers. 
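<\/p>\n<p>As an example of a reference-based metric, Rouge-L scores a candidate answer against a reference via their longest common subsequence of tokens; a simplified sketch (our evaluations used off-the-shelf library implementations):<\/p>\n<pre><code class=\"language-python\">def lcs_len(a, b):\r\n    # Dynamic-programming longest common subsequence over token lists\r\n    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]\r\n    for i, x in enumerate(a):\r\n        for j, y in enumerate(b):\r\n            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])\r\n    return dp[len(a)][len(b)]\r\n\r\ndef rouge_l_f1(candidate, reference):\r\n    c, r = candidate.lower().split(), reference.lower().split()\r\n    lcs = lcs_len(c, r)\r\n    if lcs == 0:\r\n        return 0.0\r\n    p, rec = lcs \/ len(c), lcs \/ len(r)\r\n    return 2 * p * rec \/ (p + rec)\r\n\r\n# The shared subsequence 'the invoice is paid' drives the score\r\nprint(rouge_l_f1('the invoice is paid monthly', 'the invoice is paid every month'))<\/code><\/pre>\n<p>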
We also implemented embedding-based similarity scores to assess semantic alignment between system responses and reference answers.<\/p>\n<p>Additionally, we incorporated reference-free evaluators focusing on factuality and relevancy, providing multiple implementation options that the team could select based on specific use cases.<\/p>\n<h2>Evaluation Complexities<\/h2>\n<h3>Web Search Evaluation Challenges<\/h3>\n<p>When evaluating web search accuracy, we encountered a significant challenge: web links change over time, and equivalent or better content might be available at different URLs than those in our ground truth data. For web search agents, we shifted focus from exact URL matching to chunk-based retrieval evaluation and answer accuracy metrics.<\/p>\n<p>This approach acknowledges the dynamic nature of the web while still ensuring the agent retrieves high-quality, relevant information regardless of its specific source URL.<\/p>\n<h3>SME Bias and Inter-Annotator Agreement<\/h3>\n<p>Another critical insight involved potential bias in SME-prepared reference answers. To mitigate this, we proposed implementing:<\/p>\n<ul>\n<li>Inter-annotator agreement protocols requiring multiple SMEs to review each reference answer<\/li>\n<li>Multiple review cycles to ensure consensus<\/li>\n<li>Clear guidelines for what constitutes a good answer in each domain<\/li>\n<li>Clear annotation criteria and protocols for evaluation data, shared with SMEs<\/li>\n<\/ul>\n<h3>Domain-Specific Evaluation Design<\/h3>\n<p>We recognized that effective evaluation must be domain-centric, not AI-centric. This core principle guided our entire approach: the people who will use and benefit from the system should determine how it&#8217;s evaluated.<\/p>\n<p>In each domain, we consistently placed subject matter experts (SMEs) at the center of our evaluation frameworks. 
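<\/p>\n<p>The inter-annotator agreement protocols described above can also be monitored quantitatively; Cohen&#8217;s kappa, for example, corrects raw agreement for chance. A sketch with hypothetical SME judgements:<\/p>\n<pre><code class=\"language-python\">from collections import Counter\r\n\r\ndef cohens_kappa(rater_a, rater_b):\r\n    n = len(rater_a)\r\n    observed = sum(1 for x, y in zip(rater_a, rater_b) if x == y) \/ n\r\n    ca, cb = Counter(rater_a), Counter(rater_b)\r\n    # Chance agreement from each rater's marginal label distribution\r\n    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) \/ (n * n)\r\n    return 1.0 if expected == 1 else (observed - expected) \/ (1 - expected)\r\n\r\n# Hypothetical accept\/revise judgements from two SMEs on six reference answers\r\na = ['accept', 'accept', 'revise', 'accept', 'revise', 'accept']\r\nb = ['accept', 'revise', 'revise', 'accept', 'revise', 'accept']\r\nprint(round(cohens_kappa(a, b), 2))  # 0.67<\/code><\/pre>\n<p>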
Rather than imposing metrics defined by AI researchers, we collaborated with domain experts to establish criteria that reflected real-world requirements and usage contexts.<\/p>\n<p>This approach ensured that:<\/p>\n<ul>\n<li>Evaluation criteria aligned with actual business needs and domain-specific requirements<\/li>\n<li>Systems were measured on metrics that mattered to end-users<\/li>\n<li>Results were interpretable and actionable for stakeholders without AI expertise<\/li>\n<\/ul>\n<p>For every agent in our multi-agent system, we partnered with the respective domain SMEs to define appropriate success metrics. This centered the actual users&#8217; needs rather than emphasizing abstract AI metrics that might not translate to real-world value.<\/p>\n<h2>Conclusion<\/h2>\n<p>Developing effective evaluation frameworks for multi-agent LLM systems requires:<\/p>\n<ol>\n<li>A comprehensive approach that addresses routing, retrieval, and generation<\/li>\n<li>Collaboration with domain experts to establish relevant evaluation criteria<\/li>\n<li>Detailed error analysis that goes beyond simple accuracy metrics<\/li>\n<li>Recognition of domain-specific nuances and challenges<\/li>\n<li>Iterative improvement based on insights from evaluation reports<\/li>\n<\/ol>\n<p>By implementing these principles, we were able to create evaluation pipelines that provided actionable insights and drove significant improvements in our multi-agent system performance.<\/p>\n<p>The future of AI evaluation must continue moving toward domain-expert-driven approaches rather than AI-centric metrics. 
Only by centering the actual users and use cases can we ensure our systems deliver real-world value.<\/p>\n<h2>Acknowledgements<\/h2>\n<p>Special thanks to the ISE crew\u2014<a href=\"https:\/\/www.linkedin.com\/in\/juanburckhardt\/\">Juan<\/a>, <a href=\"https:\/\/www.linkedin.com\/in\/nejatyab\/\">Jarre<\/a>, <a href=\"https:\/\/www.linkedin.com\/in\/abdotalema\/\">Abdo<\/a>, <a href=\"https:\/\/www.linkedin.com\/in\/court-coates-942308173\/\">Court<\/a>, <a href=\"https:\/\/www.linkedin.com\/in\/paul-glavich-2a613b1\/\">Paul<\/a>, <a href=\"https:\/\/www.linkedin.com\/in\/nidhi-wadhwa\/\">Nidhi<\/a>, and <a href=\"https:\/\/www.linkedin.com\/in\/ankurbad\/\">Ankur<\/a>\u2014for their valuable contributions to this project.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post presents a hierarchical waterfall framework for evaluating query classification, retrieval, and generation in multi-agent LLM systems.<\/p>\n","protected":false},"author":169282,"featured_media":16234,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,19],"tags":[3400],"class_list":["post-16233","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cse","category-machine-learning","tag-ise"],"acf":[],"blog_post_summary":"<p>This post presents a hierarchical waterfall framework for evaluating query classification, retrieval, and generation in multi-agent LLM 
systems.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/16233","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/169282"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=16233"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/16233\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/16234"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=16233"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=16233"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=16233"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}