{"id":16344,"date":"2025-08-20T00:00:00","date_gmt":"2025-08-20T07:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/ise\/?p=16344"},"modified":"2025-08-20T07:33:11","modified_gmt":"2025-08-20T14:33:11","slug":"ground-truth-curation-for-ai-systems","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/ground-truth-curation-for-ai-systems\/","title":{"rendered":"Ground Truth Curation Process for AI Systems"},"content":{"rendered":"<h2>Introduction<\/h2>\n<p>Imagine you&#8217;re building a powerful new AI assistant to support real business tasks \u2014 like answering maintenance questions in a manufacturing plant or surfacing insights from historical service data. To ensure that this AI system produces accurate and useful responses, we need a reliable way to measure its performance. That starts with defining what a \u201ccorrect\u201d answer actually looks like.<\/p>\n<p>This is where the concept of <strong>ground truth<\/strong> comes in.<\/p>\n<p><strong>Ground truth<\/strong> refers to a set of accurate, verified answers that serve as the benchmark against which an AI system\u2019s outputs are evaluated. It\u2019s the gold standard \u2014 the data you use to test whether the system is behaving as expected. In practice, ground truths are carefully curated question-and-answer pairs that reflect what users <em>should<\/em> receive when they ask a particular question based on the system&#8217;s underlying data sources.<\/p>\n<p>For example, if a user asks, \u201cWhat are the most recent updates related to this item?\u201d, the ground truth would be a verified, accurate list of those updates pulled directly from the system of record. This response represents the correct answer the AI system is expected to return and serves as the benchmark against which its performance can be tested and evaluated.<\/p>\n<p>During a recent customer engagement, our team developed a structured approach to curate high-quality ground truths. 
This process was essential not just for evaluating the AI assistant\u2019s performance, but also for building confidence among end users and stakeholders.<\/p>\n<p>In this post, we\u2019ll walk through the key steps in our approach:<\/p>\n<ol>\n<li><strong>Collection of real user questions<\/strong> \u2013 to ensure relevance.<\/li>\n<li><strong>Contextualized data curation<\/strong> \u2013 to ground questions in verifiable, representative examples.<\/li>\n<li><strong>Subject matter expert (SME) validation<\/strong> \u2013 to ensure the answers reflect domain expertise and business reality.<\/li>\n<\/ol>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/08\/ground-truth-process.png\" alt=\"A graphic outlining three components of the process: questions, answers grounded in data, and validation of grounded data answer\" \/><\/p>\n<p>Let\u2019s dive into the details of how we created ground truths that truly reflect user intent and organizational knowledge.<\/p>\n<h2>1. Collecting User Questions<\/h2>\n<p>Once we\u2019ve established access to the underlying data sources our AI system will rely on, the first step in building reliable ground truths is to gather a meaningful set of real user questions. While it\u2019s possible to generate hypothetical prompts using the available data, these often miss the nuances, priorities, and terminology that actual users bring to the table. To develop ground truths that reflect real-world needs, we must engage directly with our target end users.<\/p>\n<p>To do this, we conducted a focused, interactive workshop with subject matter experts (SMEs) \u2014 the people who understand the business domain and will ultimately use or benefit from the AI system. 
These sessions are not only about collecting questions, but also about building shared understanding and trust.<\/p>\n<p>We began the workshop by clearly explaining the purpose of the project: to develop an AI assistant that can answer user questions accurately and consistently by drawing on existing organizational data. We emphasized how their expertise and participation were crucial to shaping an assistant that would be genuinely helpful and aligned with their day-to-day challenges.<\/p>\n<p>To help SMEs generate high-quality questions, we gave them a set of thought-starters in advance. These prompts were designed to spark ideas and encourage them to think in terms of real scenarios:<\/p>\n<ol>\n<li>What types of questions do you frequently ask in your daily work?<\/li>\n<li>What information is hard to find or requires digging through multiple systems?<\/li>\n<li>What would you ideally ask an AI assistant if it could understand your intent and return exactly what you need?<\/li>\n<li>Are there any follow-up questions you typically ask after receiving an initial answer?<\/li>\n<\/ol>\n<p>During the session, we used a collaborative <a href=\"https:\/\/app.mural.co\/\">Mural board<\/a> \u2014 a virtual whiteboard with pre-labeled sections for different categories of questions. (If you have a Mural account, you can view our template <a href=\"https:\/\/app.mural.co\/template\/c491d198-d624-48e2-9f09-3a8b8f984b93\/f2d23239-ad03-4f22-842d-5fb7575e45f7\">here<\/a>.) Participants added their questions using digital sticky notes, and the format encouraged conversation, cross-pollination of ideas, and deeper exploration of user needs. Importantly, <strong>all questions were considered valid and valuable, no matter how specific, broad, or exploratory<\/strong>.<\/p>\n<p>The workshop yielded dozens of user-submitted questions, many of which built on each other as SMEs discussed and refined their inputs in real time. 
In the second half of the session, we worked together to organize and tag the questions. We applied a color-coding system to indicate which data sources or systems would be required to answer each one. This step later enabled us to map questions to the appropriate context and data during the ground truth curation phase.<\/p>\n<p>This collaborative question collection process not only gave us a strong foundation of realistic prompts, but also fostered user buy-in and deepened our understanding of the customer and their business \u2014 ensuring that the questions our AI assistant was being tested against were directly relevant to the people it was designed to serve.<\/p>\n<h2>2. Contextualized Data Curation<\/h2>\n<p>Once we had a rich set of real user questions, the next step was to connect those questions to the actual data that could be used to answer them. This is what we call <strong>contextualized data curation<\/strong> \u2014 the process of identifying and extracting the relevant records or facts that represent a \u201ccorrect\u201d response to each user question <em>based on the available data<\/em>.<\/p>\n<p>This step is critical because it transforms abstract questions into concrete, testable pairs of input and expected output, which is the essence of a ground truth. It ensures that each ground truth is rooted not only in what users <em>want<\/em> to know, but also in what the underlying data can support.<\/p>\n<p>To streamline the effort, we grouped similar or related user questions into small sets. This allowed our team to work in parallel, with different individuals or subgroups focused on curating data for different clusters of questions. We also partnered with data scientists from the customer\u2019s organization, which helped accelerate progress and ensured alignment with internal data knowledge.<\/p>\n<p>We developed custom tooling to support this phase. 
These tools enabled curators to define the data queries needed to answer each question and associate those queries with the relevant context using a well-defined, repeatable method.<\/p>\n<p>For each user question, we followed a structured three-step process:<\/p>\n<ol>\n<li><strong>Identify relevant data sources<\/strong>\n<p>Determine which databases, data lakes, or structured files contain the information needed to answer the question.<\/p><\/li>\n<li><strong>Define filters and properties<\/strong>\n<p>For each relevant source, identify the fields and values that should be used to retrieve precise results. This might include timestamps, asset IDs, status fields, or other filters specific to the domain.<\/p><\/li>\n<li><strong>Write and test the database query<\/strong>\n<p>Construct a query that reliably returns the data needed for that specific question. These queries had to be complete and executable, returning interpretable results that could serve as the &#8220;correct&#8221; answer.<\/p><\/li>\n<\/ol>\n<p>To give our queries real-world grounding and test their robustness, we introduced <strong>execution contexts<\/strong> \u2014 specific scenarios or entities that the query should operate on. 
For example, a query might be applied to a particular customer account, project, or product instance, depending on the domain.<\/p>\n<p>We intentionally selected a mix of contexts to reflect different types of outcomes:<\/p>\n<ul>\n<li><strong>Typical case<\/strong>: A standard scenario where the query returns a reasonable, expected set of results.<\/li>\n<li><strong>Negative case<\/strong>: A valid scenario where no data should be returned, which is useful for testing how the system handles empty or null responses.<\/li>\n<li><strong>Edge or extreme case<\/strong>: A scenario that produces an unusually large or complex result \u2014 such as a case with high data volume or inconsistent formatting \u2014 which helps stress-test the system.<\/li>\n<\/ul>\n<p>By including this range of contexts, we ensured that the curated ground truths represented a realistic cross-section of how users might interact with the system \u2014 from everyday questions to more difficult or uncommon scenarios.<\/p>\n<p>This curated set of data-backed answers, anchored in well-defined contexts, became the foundation for the next step: automating the generation of structured ground truth files that could be validated and refined through SME review.<\/p>\n<h3>2.1 &#8211; Operations: Contextualized Data Curation Utility<\/h3>\n<p>To operationalize the data curation process, we developed a utility that automates the generation of ground truth records at scale. For each set of user questions, we created a structured <code>JSON<\/code> input file. This file contains:<\/p>\n<ul>\n<li>The original user query<\/li>\n<li>The corresponding database query logic<\/li>\n<li>A list of execution contexts (e.g. specific identifiers or filter values)<\/li>\n<\/ul>\n<p>This input is fed into our <strong>Create Ground Truths Utility<\/strong>, which systematically runs each database query across all defined contexts for every user question. 
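<\/p>\n<p>To make this concrete, here is a minimal Python sketch of that loop. The input schema, the field names, and the <code>run_query<\/code> stand-in are illustrative assumptions rather than the actual implementation:<\/p>

```python
import json

def run_query(query: str) -> list[dict]:
    """Stand-in for executing a query against the real data source."""
    return []  # the real utility would return matching data records here

def create_ground_truths(input_record: dict) -> list[dict]:
    """Run one user question's query across all of its execution contexts."""
    ground_truths = []
    for context in input_record["contexts"]:
        # Bind the context values (e.g. a specific item ID) into the query.
        query = input_record["query_template"].format(**context)
        records = run_query(query)
        ground_truths.append({
            "user_question": input_record["user_question"],
            "context": context,
            "records": records,
            # Simple automated tagging: flag empty results as a negative case.
            "tags": ["Negative Case"] if not records else [],
        })
    return ground_truths

# A hypothetical input record for a single user question.
input_record = {
    "user_question": "What are the most recent updates related to this item?",
    "query_template": "SELECT * FROM updates WHERE item_id = '{item_id}'",
    "contexts": [{"item_id": "A-100"}, {"item_id": "B-200"}],
}

# JSONL output: one ground truth entry per line, for downstream evaluation.
jsonl = "\n".join(json.dumps(e) for e in create_ground_truths(input_record))
```

<p>In practice, query execution would run against the customer&#8217;s actual databases, and the tagging rules would be richer than the single negative-case check shown here.<\/p>\n<p>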
The utility then captures the resulting data records and packages them into a structured output file.<\/p>\n<p>Each output <code>JSON<\/code> file maps:<\/p>\n<ul>\n<li>The user question<\/li>\n<li>The execution context<\/li>\n<li>The resulting data records<\/li>\n<\/ul>\n<p>These elements together form a <strong>ground truth entry<\/strong> \u2014 a verified, context-specific answer to a user question.<\/p>\n<p>To further enrich the dataset, the utility also applies <strong>automated tagging<\/strong> to each ground truth. For example:<\/p>\n<ul>\n<li>\u201cNegative Case\u201d for queries that return no records<\/li>\n<li>\u201cMultiple Data Sets\u201d for queries pulling from more than one source or table<\/li>\n<\/ul>\n<p>In addition to the primary <code>JSON<\/code> output, the utility generates <code>JSONL<\/code> (JSON Lines) versions of the data. This format is especially useful for machine learning workflows, as it\u2019s compatible with platforms like <a href=\"https:\/\/azure.microsoft.com\/en-us\/products\/machine-learning\">Azure Machine Learning<\/a> (AML), where it can be used for testing, experimentation, and model evaluation.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/08\/create-ground-truths-utility-scaled.png\" alt=\"A graphic showing the contents of a JSON input file resulting in the creation of a JSON output file after being processed by Create Ground Truths Utility\" \/><\/p>\n<p>This tooling helped ensure consistency and reproducibility while dramatically accelerating the process of building a robust, diverse set of ground truths.<\/p>\n<h2>3. Subject Matter Expert Validation<\/h2>\n<p>Once we&#8217;ve curated contextualized data for each user question, the next critical step is <strong>validation<\/strong> \u2014 ensuring that the curated answers truly reflect what a domain expert would consider correct. 
After all, even a well-structured query can produce misleading or incomplete results if the intent of the question was misunderstood or if the wrong data properties were used.<\/p>\n<p>To ensure the <strong>quality and credibility<\/strong> of our ground truths, we built a review loop centered on <strong>subject matter expert (SME) feedback<\/strong>. These experts are best positioned to assess whether the question was interpreted accurately and whether the returned data constitutes a valid and useful answer.<\/p>\n<p>To streamline this feedback process, we created a utility that programmatically extracts the latest curated ground truths and converts them into a human-readable review format. Specifically, it generates an <code>XLSX<\/code> (Excel) document containing worksheets for the latest batch of ground truth sets that need review.<\/p>\n<p>Each spreadsheet file includes:<\/p>\n<ul>\n<li><strong>One worksheet per set of user questions<\/strong><\/li>\n<li><strong>One row per ground truth entry<\/strong> that represents a unique combination of question and context<\/li>\n<li>Columns for:\n<ul>\n<li>The original user question<\/li>\n<li>The execution context (e.g. 
specific item, timeframe, or filter value)<\/li>\n<li>The resulting data records (as pulled from the database)<\/li>\n<li>The database query used<\/li>\n<li>A placeholder for <strong>SME feedback<\/strong><\/li>\n<li>Metadata such as tags and links to the original JSON files<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Here is a generic example of an Excel spreadsheet showing contextualized ground truths for a fictionalized manufacturing plant that uses SAP terms and values:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/08\/validation-spreadsheet.png\" alt=\"A screenshot of an Excel spreadsheet showing contextualized ground truths with SME feedback\" \/><\/p>\n<p>SMEs can review each entry, provide feedback on interpretation, flag any issues, and suggest adjustments. This feedback loop is intentionally <strong>iterative<\/strong>. Once feedback is collected, the ground truth input files can be updated, and the utility can regenerate revised versions of each spreadsheet for further review.<\/p>\n<p>Because the entire process \u2014 from query execution to spreadsheet generation \u2014 is automated, we can rapidly repeat the cycle of curation, validation, and refinement as needed.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/08\/ground-truth-validation-process-scaled.png\" alt=\"A graphic depicting a cycle of Curate Ground Truth Set, Review Spreadsheet and Provide Feedback, and Update Ground Truth Input File Based on Feedback\" \/><\/p>\n<p>This collaborative loop with SMEs ensures that the final ground truth dataset is not only technically sound, but also <strong>aligned with domain expertise and user expectations<\/strong> \u2014 a foundational step in delivering trusted AI systems.<\/p>\n<h2>Conclusion<\/h2>\n<p>High-quality ground truth data is the foundation of any reliable and trustworthy AI system. 
It ensures that models are not only trained on accurate examples but also evaluated against realistic, domain-relevant expectations.<\/p>\n<p>By following a structured and collaborative process that collects authentic user questions, grounds those questions in real data, and validates the results with subject matter experts, teams can build ground truth datasets that reflect the complexity and nuance of real-world use cases.<\/p>\n<p>This iterative approach doesn\u2019t just improve model performance. It strengthens alignment between technical teams and end users, ensures transparency in how AI systems are evaluated, and ultimately fosters trust in the solutions being delivered.<\/p>\n<p>Investing in the ground truth process is more than a technical necessity \u2014 it\u2019s a strategic step toward building AI systems that are truly useful, dependable, and aligned with business goals.<\/p>\n<h2>Acknowledgments<\/h2>\n<p>We are grateful for the data science insights provided by <a href=\"https:\/\/www.linkedin.com\/in\/amatullah-badshah\/\">Amatullah Badshah<\/a>, <a href=\"https:\/\/www.linkedin.com\/in\/farhanbaluch\/\">Farhan Baluch<\/a>, and <a href=\"https:\/\/www.linkedin.com\/in\/bostdiek\/\">Bryan Ostdiek<\/a> and the data expertise shared by <a href=\"https:\/\/www.linkedin.com\/in\/ejoseperales\/\">Jose Perales<\/a>. We also appreciate <a href=\"https:\/\/www.linkedin.com\/in\/bindu-msft-cse\/\">Bindu Chinnasamy<\/a>, <a href=\"https:\/\/www.linkedin.com\/in\/johnhauppa\/\">John Hauppa<\/a>, <a href=\"https:\/\/www.linkedin.com\/in\/cameron-taylor-a27078127\/\">Cameron Taylor<\/a>, and <a href=\"https:\/\/www.linkedin.com\/in\/kanishk-t-1723b2107\/\">Kanishk Tantia<\/a> for their valuable contributions to this project.<\/p>\n<p>The feature image was generated using Bing Image Creator. 
Terms can be found <a href=\"https:\/\/www.bing.com\/new\/termsofuse?FORM=GENTOS\">here<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Steps to Produce High Quality Ground Truth Pairs for AI Systems<\/p>\n","protected":false},"author":122005,"featured_media":16345,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,3451,19],"tags":[3611,3610],"class_list":["post-16344","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cse","category-ise","category-machine-learning","tag-ai-development","tag-ai-evaluation"],"acf":[],"blog_post_summary":"<p>Steps to Produce High Quality Ground Truth Pairs for AI Systems<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/16344","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/122005"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=16344"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/16344\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/16345"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=16344"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=16344"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=16344"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]
}}