{"id":69,"date":"2024-12-09T08:33:58","date_gmt":"2024-12-09T16:33:58","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/udm\/?p=69"},"modified":"2025-02-17T12:22:44","modified_gmt":"2025-02-17T20:22:44","slug":"unified-data-models-101","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/udm\/unified-data-models-101\/","title":{"rendered":"Why a Unified Data Model is Critical: Lessons from Building Microsoft&#8217;s Semantic Layer"},"content":{"rendered":"<h2>Introduction<\/h2>\n<hr \/>\n<p>Some years ago, we were wrestling with a persistent issue in our data stack. Every team had their own way of collecting and structuring data. What was a simple query for one team became a debugging nightmare for another. Discovering the right dataset felt like looking for a needle in a haystack, and standardizing definitions was almost impossible. These headaches slowed us down, hurt trust in our data, and left our AI models grappling with inconsistent input. When we started the effort to build a unified data model at Microsoft, we realized these problems weren\u2019t just ours\u2014they were universal. This blog shares how we approached these challenges and how a unified data model not only resolves them but unlocks new possibilities.<\/p>\n<h2>From Relational Databases to AI: The Evolution of Data Modeling<\/h2>\n<hr \/>\n<p><a href=\"http:\/\/devblogs.microsoft.com\/udm\/wp-content\/uploads\/sites\/84\/2024\/10\/Evolution-of-Data-Modeling-1.svg\"><img decoding=\"async\" src=\"http:\/\/devblogs.microsoft.com\/udm\/wp-content\/uploads\/sites\/84\/2024\/10\/Evolution-of-Data-Modeling-1.svg\" alt=\"Image showing the evolution of data modeling\" class=\"aligncenter\" \/><\/a><\/p>\n<h3>The Relational Roots<\/h3>\n<p>Back in the day, relational databases and SQL were the backbone of data modeling. By using star schemas and semantic relationships, businesses ensured that their data could be queried and analyzed efficiently. 
This structure was critical for consistency, but it also meant everyone needed to adhere to strict schemas\u2014a challenge in itself.<\/p>\n<h3>Big Data Chaos<\/h3>\n<p>Fast forward to the 2000s, when data collection exploded. NoSQL databases and MapReduce let organizations handle unstructured data, but they came at a cost: loss of consistency and clarity. I remember a project where data definitions varied so wildly between teams that consolidating reports took longer than building the product they were reporting on.<\/p>\n<h3>AI Raises the Stakes<\/h3>\n<p>With AI becoming mainstream, the value of data has skyrocketed. AI systems require vast amounts of high-quality data to function effectively. However, inconsistencies in data models and definitions across organizations can hinder AI performance. Unlike humans, AI systems can&#8217;t easily interpret or correct ambiguous data, making a unified data model not just beneficial but essential.<\/p>\n<h2>Why a Unified Data Model is Necessary<\/h2>\n<hr \/>\n<p>In large organizations, dozens or even hundreds of teams collect and use data independently, and the result is data silos. This siloed approach leads to inconsistencies in data definitions and usage. Without alignment, teams duplicate efforts, analysts struggle to trust insights, and AI models flounder. A unified data model ensures:<\/p>\n<ul>\n<li><strong>Consistency<\/strong>: Everyone uses the same definitions and data sources.<\/li>\n<li><strong>Discoverability<\/strong>: Data assets are easy to find and understand.<\/li>\n<li><strong>Efficiency<\/strong>: Teams avoid duplicated effort, and data processing is streamlined.<\/li>\n<li><strong>Trustworthiness<\/strong>: Data is reliable, which is crucial for decision-making and AI applications.<\/li>\n<\/ul>\n<p>But achieving this isn\u2019t just a technical challenge\u2014it\u2019s a cultural one. Teams need to move from siloed ownership to shared accountability. 
It\u2019s tough at first, but the payoff is exponential.<\/p>\n<h3>Critical Requirements for Success<\/h3>\n<p>To achieve this unified model, two critical requirements must be met:<\/p>\n<ol>\n<li>Alignment on Common Data Shapes: Establishing standard structures or &#8220;shapes&#8221; for data ensures that everyone interprets data in the same way.<\/li>\n<li>Consistent Metadata Collection: Detailed metadata helps users and AI systems find, interpret, and use data correctly.<\/li>\n<\/ol>\n<p>Implementing a unified data model often requires a cultural shift within the organization. Teams must move away from siloed practices and embrace shared standards. Although the shift is challenging at first, the benefits become evident as more teams adopt the model, creating a snowball effect that drives widespread acceptance.<\/p>\n<h2>Building Microsoft&#8217;s Semantic Layer<\/h2>\n<hr \/>\n<p>At Microsoft, the scale was daunting: hundreds of products, diverse teams, and sprawling datasets. We needed an approach that balanced flexibility with standardization. Here\u2019s how we did it:<\/p>\n<h3>Defining Common Data Shapes and Concepts<\/h3>\n<p><a href=\"http:\/\/devblogs.microsoft.com\/udm\/wp-content\/uploads\/sites\/84\/2024\/10\/Components-of-Sematic-Layer-2.svg\"><img decoding=\"async\" src=\"http:\/\/devblogs.microsoft.com\/udm\/wp-content\/uploads\/sites\/84\/2024\/10\/Components-of-Sematic-Layer-2.svg\" alt=\"Image showing the components of Semantic Layer\" class=\"aligncenter\"\/><\/a><\/p>\n<p>We focused on core components:<\/p>\n<ul>\n<li><strong>Entities<\/strong>: The main subjects of reports or analyses (e.g., users, devices, documents). 
They are uniquely identifiable and relatively static.<\/li>\n<li><strong>Profiles<\/strong>: Lists of entities with additional metadata, such as creation dates.<\/li>\n<li><strong>Profile Extensions<\/strong>: Additional attributes added to profiles, maintained separately for flexibility and control.<\/li>\n<li><strong>Attributes<\/strong>: Specific data points within profile extensions that describe entities (e.g., billing country, license type).<\/li>\n<li><strong>Outcomes<\/strong>: State changes or measures associated with entities, often time-stamped (e.g., a user making a purchase).<\/li>\n<li><strong>Dimensions<\/strong>: Standardized tables used for categorizing attributes and outcomes. <\/li>\n<\/ul>\n<p>By structuring data using these shapes, Microsoft enabled consistent data usage across teams and tools.<\/p>\n<h3>Facilitating Discovery and Use<\/h3>\n<p>Defining data shapes was only part of the solution. We also needed to make data easy to find and trust. This is why we invested in:<\/p>\n<ol>\n<li>Data Engineering Infrastructure: Building an orchestration system that mandates the collection of essential metadata for every data asset. This includes details about creation, refresh schedules, data lineage, and responsible contacts.<\/li>\n<li>Discovery and Governance Tools: Developing tools that allow users and AI systems to visualize and search for concepts within the semantic layer. This includes enforcing rich descriptions and maintaining a glossary of terms, acronyms, and synonyms.<\/li>\n<li>Structured Workspace Management: Creating production workspaces containing only approved assets from the semantic layer. Exploratory workspaces allow for experimentation but restrict publishing, ensuring consistency and preventing the proliferation of unvetted data definitions.<\/li>\n<\/ol>\n<h2>Data Processing Considerations<\/h2>\n<hr \/>\n<p>Microsoft also recognized the importance of efficient data processing before data reaches the semantic layer. 
We identified three key stages:<\/p>\n<ul>\n<li><strong>Events and Telemetry<\/strong>: Raw, unprocessed data captured at the most granular level. While valuable, this data is often too voluminous and unrefined for direct use in analytics or reporting. <\/li>\n<li><strong>Cleaned Data<\/strong>: Data that has undergone initial processing to clean, enrich, and standardize it. This stage often involves normalizing values and reducing volume without losing essential information. <\/li>\n<li><strong>Semantic Layer<\/strong>: The refined, high-value data assets ready for consumption in analytics, reporting, and AI applications. This layer incorporates all critical business definitions and ensures data is consistent and reusable. <\/li>\n<\/ul>\n<p>By structuring data processing in this way, we ensure that the semantic layer is both robust and efficient, serving as the single source of truth for data consumers.<\/p>\n<h2>Lessons Learned<\/h2>\n<p>Looking back, managing our own data stack often felt like patching a leaky ship. Teams were constantly reinventing wheels, and critical insights were lost in translation. By investing in a unified data model, we stopped firefighting and started innovating. Now, analysts and AI systems can trust the data they use. Engineers don\u2019t waste cycles reconciling definitions. And when we ask, \u201cWhat\u2019s our most valuable dataset?\u201d everyone knows where to look.<\/p>\n<h2>Conclusion<\/h2>\n<hr \/>\n<p>In an era where data is abundant but consistency is scarce, a unified data model is indispensable. Microsoft&#8217;s approach to building a semantic layer showcases how organizations can tackle the challenges of data inconsistency, especially when scaling AI initiatives.<\/p>\n<p>With that being said, building a unified data model isn\u2019t just about solving technical problems\u2014it\u2019s about empowering teams and amplifying the value of data. 
At Microsoft, this effort paid off by aligning teams, reducing duplication, and enabling better AI.<\/p>\n<p>For anyone struggling with discoverability, standardization, or trust in your data, I can\u2019t recommend this journey enough. Start small, win over key teams, and let the results speak for themselves. Before long, your organization won\u2019t just handle data\u2014it\u2019ll thrive on it.<\/p>\n<p>Stay tuned for upcoming posts where we\u2019ll take a closer look at the individual components of UDM. We\u2019ll also share real-world stories and case studies highlighting how UDM drives tangible benefits.<\/p>\n<p>Don\u2019t miss out\u2014subscribe to get notified, and feel free to start a discussion below in the comments section. Like and share this post on your favorite platforms to keep the conversation going!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Some years ago, we were wrestling with a persistent issue in our data stack. Every team had their own way of collecting and structuring data. What was a simple query for one team became a debugging nightmare for another. Discovering the right dataset felt like looking for a needle in a haystack, and standardizing [&hellip;]<\/p>\n","protected":false},"author":171116,"featured_media":90,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[2],"class_list":["post-69","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-udm","tag-unified-data-model"],"acf":[],"blog_post_summary":"<p>Introduction Some years ago, we were wrestling with a persistent issue in our data stack. Every team had their own way of collecting and structuring data. What was a simple query for one team became a debugging nightmare for another. 
Discovering the right dataset felt like looking for a needle in a haystack, and standardizing [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/udm\/wp-json\/wp\/v2\/posts\/69","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/udm\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/udm\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/udm\/wp-json\/wp\/v2\/users\/171116"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/udm\/wp-json\/wp\/v2\/comments?post=69"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/udm\/wp-json\/wp\/v2\/posts\/69\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/udm\/wp-json\/wp\/v2\/media\/90"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/udm\/wp-json\/wp\/v2\/media?parent=69"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/udm\/wp-json\/wp\/v2\/categories?post=69"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/udm\/wp-json\/wp\/v2\/tags?post=69"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}