{"id":16155,"date":"2025-04-03T00:00:00","date_gmt":"2025-04-03T07:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/ise\/?p=16155"},"modified":"2025-04-03T01:30:02","modified_gmt":"2025-04-03T08:30:02","slug":"data-validations-with-great-expectations-in-ms-fabric","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/data-validations-with-great-expectations-in-ms-fabric\/","title":{"rendered":"Data Validations with Great Expectations in MS Fabric"},"content":{"rendered":"<h2>Introduction<\/h2>\n<p>Data quality is a critical challenge for companies. Inaccurate, incomplete, or inconsistent data can lead to flawed decision-making, operational inefficiency, and loss of trust. Poor data quality affects every aspect of a business.<\/p>\n<p>Great Expectations (GX) is a robust framework for checking data quality and producing concrete, actionable results. Microsoft Fabric is an end-to-end data platform that integrates data engineering, data science, real-time analytics, and business intelligence into a unified environment. GX can be used effectively within Microsoft Fabric, providing a seamless solution for data validation and management. By validating datasets against defined expectations, companies can proactively identify and resolve issues, ensuring that their data is accurate, reliable, and fit for purpose.<\/p>\n<p>GX allows us to define test criteria programmatically using its open-source Python library, GX Core, and execute tests to ensure data quality. GX also offers GX Cloud, a fully managed cloud platform for data quality management.<\/p>\n<p>In this blog post, we will explore how to integrate GX Core within the Microsoft Fabric environment and use it to validate data programmatically.<\/p>\n<h2>Sample data and quality requirements<\/h2>\n<p>To illustrate the problem and solution, let&#8217;s define the scenario for this post. 
We are working with the following test dataset:<\/p>\n<table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Name<\/th>\n<th>Age<\/th>\n<th>City<\/th>\n<th>Score<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>1<\/td>\n<td>Alice<\/td>\n<td>25<\/td>\n<td>New York<\/td>\n<td>85.5<\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>Bob<\/td>\n<td>30<\/td>\n<td>Los Angeles<\/td>\n<td>90.2<\/td>\n<\/tr>\n<tr>\n<td>3<\/td>\n<td>Charlie<\/td>\n<td>35<\/td>\n<td>Chicago<\/td>\n<td>78.9<\/td>\n<\/tr>\n<tr>\n<td>4<\/td>\n<td>David<\/td>\n<td>40<\/td>\n<td>Houston<\/td>\n<td>88.3<\/td>\n<\/tr>\n<tr>\n<td>5<\/td>\n<td>Eva<\/td>\n<td>45<\/td>\n<td>Phoenix<\/td>\n<td>92.5<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>To validate the quality of this dataset, we define the following requirements:<\/p>\n<ol>\n<li>The <code>ID<\/code> column should have unique values.<\/li>\n<li>The <code>Score<\/code> column should contain values greater than 80 (this expectation is intentionally set to fail, to demonstrate how issues can be identified).<\/li>\n<\/ol>\n<p>We will validate these requirements programmatically in a Microsoft Fabric Notebook using Great Expectations (GX). By integrating GX within this environment, we will demonstrate how to define expectations, apply validations, and analyze the results effectively.<\/p>\n<h2>Setting up GX in MS Fabric<\/h2>\n<h3>Create an Environment<\/h3>\n<p>To use GX in MS Fabric, we need to add it to an <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-engineering\/create-and-use-environment\">Environment<\/a>. We can include the <code>great_expectations<\/code> library, pinned to the desired version, under the Public libraries tab. 
Once added, we must publish the updated environment for the changes to take effect.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/04\/set-library-in-environment.png\" alt=\"Set library in environment\" \/><\/p>\n<h3>Create a Notebook<\/h3>\n<p>We use a Notebook to validate the data programmatically with GX. We create a Notebook and select the Environment we set up for GX. The Notebook now runs on an environment that has GX installed.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/04\/set-environment-on-notebook.png\" alt=\"Set environment on notebook\" \/><\/p>\n<h2>Load GX components<\/h2>\n<h3>Load data<\/h3>\n<p>Let\u2019s begin by loading the data we want to validate into Microsoft Fabric. In this example, we upload the test dataset to the <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-engineering\/lakehouse-overview\">Lakehouse<\/a> and then read it into the <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-engineering\/how-to-use-notebook\">Notebook<\/a> for processing with PySpark.<\/p>\n<pre><code>df = spark.read.table(\"test_dataset\")<\/code><\/pre>\n<h3>Create GX Context<\/h3>\n<p>Next, we create the <strong>GX context<\/strong>. The GX context contains all the information GX needs to run within our workflow.<\/p>\n<pre><code>import great_expectations as gx\r\n\r\ncontext = gx.get_context()<\/code><\/pre>\n<h3>Create an expectation suite<\/h3>\n<p>In GX, the validation criteria we want to test are referred to as <strong>expectations<\/strong>. A collection of these expectations is called an <strong>expectation suite<\/strong>. The expectation suite contains all the expectations we intend to apply to the data. 
It can also be stored in the GX context for future reuse and easier maintenance.<\/p>\n<pre><code>expectation_suite_name = \"test_data_validation_suite\"\r\ncontext.suites.add(gx.ExpectationSuite(name=expectation_suite_name))\r\nsuite = context.suites.get(name=expectation_suite_name)<\/code><\/pre>\n<h3>Create Data source and Batch<\/h3>\n<p>We need to create a data source, a data asset, and a batch definition. These elements reference the target dataset in the GX context. The <strong>validation definition<\/strong> then brings together all the references required to run the validation, such as the target dataset (via the batch definition) and the expectation suite.<\/p>\n<pre><code>data_source = context.data_sources.add_spark(name=\"test_data_source\")\r\ndata_asset = data_source.add_dataframe_asset(name=\"test_data\")\r\n\r\nbatch_definition = data_asset.add_batch_definition_whole_dataframe(\"test_batch_definition\")\r\n\r\nvalidation_definition = gx.ValidationDefinition(\r\n    data = batch_definition,\r\n    suite = suite,\r\n    name = \"test_validation_definition\",\r\n)<\/code><\/pre>\n<h2>Run GX in Fabric<\/h2>\n<h3>Add new expectations<\/h3>\n<p>Now we are ready to start adding expectations. GX provides pre-defined expectations that we can easily leverage. 
We can explore the complete list of available expectations in the <a href=\"https:\/\/greatexpectations.io\/expectations\">GX Expectations Gallery<\/a> and find the ones that fit our requirements.<\/p>\n<p>As stated in the requirements above, we will define two expectations: the <code>ID<\/code> column should have unique values, and the <code>Score<\/code> column should contain values greater than 80.<\/p>\n<p>These expectations are defined and added to the expectation suite for validation.<\/p>\n<pre><code>unique_id = gx.expectations.ExpectColumnValuesToBeUnique(column=\"ID\")\r\nsuite.add_expectation(unique_id)\r\n\r\nscore_min = gx.expectations.ExpectColumnValuesToBeBetween(column=\"Score\", min_value=80)\r\nsuite.add_expectation(score_min)<\/code><\/pre>\n<h2>Run the validation<\/h2>\n<p>Let&#8217;s validate the data (stored in the PySpark DataFrame) using the validation definition. This applies the expectations from the expectation suite to the dataset, as specified in the validation definition, and generates a validation result.<\/p>\n<pre><code>results = validation_definition.run(batch_parameters = {\"dataframe\": df})<\/code><\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2025\/04\/validation-results.png\" alt=\"Validation results\" \/><\/p>\n<h2>Validation results<\/h2>\n<p>Let&#8217;s take a closer look at the details of the validation results from the previous step. This helps us understand how the validation works and how to analyze the results effectively.<\/p>\n<h3>Success<\/h3>\n<p>We can see from the <code>success<\/code> flag at the top of the result that the validation failed. This is expected, as we intentionally included a failing expectation in the suite.<\/p>\n<pre><code class=\"language-json\">{\r\n  \"success\": false,\r\n  ...\r\n}<\/code><\/pre>\n<h3>Details for each expectation<\/h3>\n<p>Detailed results are provided for each expectation. 
This allows us to see which expectations succeeded or failed individually. The breakdown also helps in understanding how each expectation was evaluated.<\/p>\n<pre><code class=\"language-json\">{\r\n    ...\r\n      \"results\": [\r\n    {\r\n      \"success\": true,\r\n      \"expectation_config\": {\r\n        \"type\": \"expect_column_values_to_be_unique\",\r\n        \"kwargs\": {\r\n          \"batch_id\": \"test_data_source-test_data\",\r\n          \"column\": \"ID\"\r\n        },\r\n        ...\r\n      },\r\n      \"result\": {\r\n        \"element_count\": 5,\r\n        \"unexpected_count\": 0,\r\n        \"unexpected_percent\": 0.0,\r\n        ...\r\n      },\r\n      ...\r\n    },\r\n    {\r\n      \"success\": false,\r\n      \"expectation_config\": {\r\n        \"type\": \"expect_column_values_to_be_between\",\r\n        \"kwargs\": {\r\n          \"batch_id\": \"test_data_source-test_data\",\r\n          \"column\": \"Score\",\r\n          \"min_value\": 80.0\r\n        },\r\n        ...\r\n      },\r\n      \"result\": {\r\n        \"element_count\": 5,\r\n        \"unexpected_count\": 1,\r\n        \"unexpected_percent\": 20.0,\r\n        \"partial_unexpected_list\": [\r\n          78.9000015258789\r\n        ],\r\n        ...\r\n      }\r\n      ...\r\n}<\/code><\/pre>\n<p>In this example, for <code>expect_column_values_to_be_unique<\/code>, all the data meets the expectation. We can confirm this by checking the <code>success<\/code> flag (<code>true<\/code>), or by looking at the <code>unexpected_count<\/code> and <code>unexpected_percent<\/code> (both <code>0<\/code>).<\/p>\n<p>However, for <code>expect_column_values_to_be_between<\/code>, the result indicates a failure. We can see this in the <code>success<\/code> flag (<code>false<\/code>) and an <code>unexpected_count<\/code> of 1, which corresponds to an <code>unexpected_percent<\/code> of 20%. 
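Rather than reading the JSON by eye, the serialized result can also be inspected programmatically, for example to feed a report or an alert. The sketch below walks a plain dictionary shaped like the sample output above (it mirrors the serialized structure shown here; it is not a GX API call):

```python
# A trimmed result dictionary shaped like the validation output above.
validation_result = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {"type": "expect_column_values_to_be_unique",
                                   "kwargs": {"column": "ID"}},
            "result": {"element_count": 5, "unexpected_count": 0,
                       "unexpected_percent": 0.0},
        },
        {
            "success": False,
            "expectation_config": {"type": "expect_column_values_to_be_between",
                                   "kwargs": {"column": "Score", "min_value": 80.0}},
            "result": {"element_count": 5, "unexpected_count": 1,
                       "unexpected_percent": 20.0,
                       "partial_unexpected_list": [78.9000015258789]},
        },
    ],
}

# Collect the failed expectations with their column and sample bad values.
failures = [
    (r["expectation_config"]["type"],
     r["expectation_config"]["kwargs"]["column"],
     r["result"].get("partial_unexpected_list", []))
    for r in validation_result["results"]
    if not r["success"]
]
for exp_type, column, bad_values in failures:
    print(f"{exp_type} failed on column {column}: sample bad values {bad_values}")
```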
Additionally, we can identify the specific values that caused the failure by checking the <code>partial_unexpected_list<\/code>.<\/p>\n<h2>Further notes<\/h2>\n<h3>Result format<\/h3>\n<p>It is also possible to get the primary key of the failed rows by setting the <code>result_format<\/code> option that GX provides. With the primary key, we can retrieve the full rows if needed. For more details, please refer to the <a href=\"https:\/\/docs.greatexpectations.io\/docs\/core\/trigger_actions_based_on_results\/choose_a_result_format\/\">GX Result Format<\/a> documentation.<\/p>\n<h3>Custom expectations<\/h3>\n<p>GX also provides options to <a href=\"https:\/\/docs.greatexpectations.io\/docs\/core\/customize_expectations\/\">create custom expectations<\/a>. We can apply conditions to rows in a Batch, or customize an expectation class. This is particularly useful when we need expectations tailored to our dataset that we can reuse across different validation scenarios.<\/p>\n<h3>Actions and Checkpoints<\/h3>\n<p>We can define <strong>actions<\/strong> to be executed based on the validation results. GX offers pre-defined actions we can choose from. For instance, we can update Data Docs with the validation results or send a notification to Slack.<\/p>\n<p>To enable <strong>actions<\/strong>, we need to create a <strong>checkpoint<\/strong>. With checkpoints, we can also run multiple validation definitions. For additional information, check out <a href=\"https:\/\/docs.greatexpectations.io\/docs\/core\/trigger_actions_based_on_results\/create_a_checkpoint_with_actions\">Create a checkpoint with actions<\/a>.<\/p>\n<h2>Conclusion<\/h2>\n<p>The integration of Great Expectations (GX) within Microsoft Fabric is a powerful way to validate data programmatically, and it offers a seamless path to ensuring data quality and consistency across our workflows. GX allows us to easily define, apply, and track expectations, while the validation results provide valuable insights. 
We can analyze failures in depth, create custom reports for our needs, and take corrective action as required. This capability enables us to maintain accurate and reliable data, significantly enhancing the efficiency of our data management processes within Fabric.<\/p>\n<h2>References<\/h2>\n<ul>\n<li><a href=\"https:\/\/greatexpectations.io\/\">Great Expectations<\/a><\/li>\n<li><a href=\"https:\/\/docs.greatexpectations.io\/docs\/core\/introduction\/\">GX Core Docs<\/a><\/li>\n<\/ul>\n<p><em>The feature image was generated using Copilot Visual Creator; see the <a href=\"https:\/\/www.bing.com\/new\/termsofuse?FORM=GENTOS\">terms of use<\/a>.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this blog post, we will explore how to integrate GX within the Microsoft Fabric environment and use it to validate data programmatically.<\/p>\n","protected":false},"author":180125,"featured_media":16158,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,3451],"tags":[3590,3591,3592],"class_list":["post-16155","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cse","category-ise","tag-data-engineering","tag-data-validation","tag-fabric"],"acf":[],"blog_post_summary":"<p>In this blog post, we will explore how to integrate GX within the Microsoft Fabric environment and use it to validate data 
programmatically.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/16155","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/180125"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=16155"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/16155\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/16158"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=16155"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=16155"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=16155"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}