Introduction
Data quality is a critical challenge for companies. Inaccurate, incomplete, or inconsistent data can lead to flawed decision-making, operational inefficiency, and loss of trust, and its effects reach every aspect of a business.
Great Expectations (GX) is a robust framework to check data quality and provide concrete, actionable results. Microsoft Fabric is an end-to-end data platform that integrates data engineering, data science, real-time analytics, and business intelligence into a unified environment. GX can be effectively utilized in the Microsoft Fabric environment, providing a seamless solution for data validation and management. By validating datasets against defined expectations, companies can proactively identify and resolve issues, ensuring that their data is accurate, reliable, and fit for purpose.
GX allows us to define test criteria programmatically using its open-source Python library, GX Core, and execute tests to ensure data quality. GX also offers GX Cloud, a fully managed cloud platform for data quality management.
In this blog post, we will explore how to integrate GX Core within the Microsoft Fabric environment and use it to validate data programmatically.
Sample data and quality requirements
To illustrate the problem and solution, let’s define the scenario for this post. We are working with the following test dataset:
| ID | Name | Age | City | Score |
|---|---|---|---|---|
| 1 | Alice | 25 | New York | 85.5 |
| 2 | Bob | 30 | Los Angeles | 90.2 |
| 3 | Charlie | 35 | Chicago | 78.9 |
| 4 | David | 40 | Houston | 88.3 |
| 5 | Eva | 45 | Phoenix | 92.5 |
To validate the quality of this dataset, we define the following requirements:

- The `ID` column should have unique values.
- The `Score` column should contain values greater than 80 (this expectation is intentionally set to fail to demonstrate how issues can be identified).
We will validate these requirements programmatically in a Microsoft Fabric Notebook using Great Expectations (GX). By integrating GX within this environment, we will demonstrate how to define expectations, apply validations, and analyze the results effectively.
Setting up GX in MS Fabric
Create an Environment
To use GX in MS Fabric, we need to add it to an Environment. We can include the `great_expectations` library with the desired version under the Public libraries > Public library tab. Once added, we must publish the updated environment to ensure the changes take effect.
Create a Notebook
We are using a Notebook to programmatically validate the data using GX. We create a Notebook and select the Environment we set up for GX. The Notebook now runs on an environment that has GX installed.
Load GX components
Load data
Let’s begin by loading the data we want to validate into Microsoft Fabric. In this example, we upload a test dataset to the Lakehouse and then read it into the Notebook for processing using PySpark.
```python
df = spark.read.table("test_dataset")
```
Create GX Context
Next, we need to create a GX context. The context holds all the configuration and metadata GX needs to run within our workflow.
```python
import great_expectations as gx

context = gx.get_context()
```
Create an expectation suite
In GX, the validation criteria we want to test are referred to as expectations. A collection of these expectations is called an expectation suite; it contains all the expectations we intend to apply to the data, and it can be stored in the GX context for future reuse and easier maintenance.
```python
expectation_suite_name = "test_data_validation_suite"
context.suites.add(gx.ExpectationSuite(name=expectation_suite_name))
suite = context.suites.get(name=expectation_suite_name)
```
Create Data source and Batch
We need to create a data source, a data asset, and a batch definition; these elements reference the target dataset in the GX context. The validation definition then ties together all the references required to run the validation, such as the target dataset (via the batch definition) and the expectation suite.
```python
data_source = context.data_sources.add_spark(name="test_data_source")
data_asset = data_source.add_dataframe_asset(name="test_data")
batch_definition = data_asset.add_batch_definition_whole_dataframe("test_batch_definition")

validation_definition = gx.ValidationDefinition(
    data=batch_definition,
    suite=suite,
    name="test_validation_definition",
)
```
Run GX in Fabric
Add new expectations
Now we are ready to start adding expectations. GX provides pre-defined expectations that we can easily leverage. We can explore the complete list of available expectations in the GX Expectations Gallery and find the ones that fit our requirements.
As mentioned in the requirements above, we will define two expectations: the `ID` column should have unique values, and the `Score` column should contain values greater than 80.

These expectations are defined and added to the expectation suite for validation.
```python
unique_id = gx.expectations.ExpectColumnValuesToBeUnique(column="ID")
suite.add_expectation(unique_id)

score_min = gx.expectations.ExpectColumnValuesToBeBetween(column="Score", min_value=80)
suite.add_expectation(score_min)
```
Run the validation
Let’s validate the data (stored in the PySpark DataFrame) using the validation definition. This applies the expectations from the expectation suite to the dataset, as defined in the validation definition, and generates a validation result.
```python
results = validation_definition.run(batch_parameters={"dataframe": df})
```
Validation results
Let’s take a closer look at the details of the validation results from the previous step. This will help us understand how the validation works and analyze the results effectively.
Success
The `success` flag at the top of the result shows that the validation failed. This is expected, as we intentionally included an expectation designed to fail.
```json
{
  "success": false,
  ...
}
```
Details for each expectation
Detailed results are provided for each expectation, allowing us to see which expectations succeeded or failed individually. This breakdown also helps in understanding how each expectation was evaluated.
```json
{
  ...
  "results": [
    {
      "success": true,
      "expectation_config": {
        "type": "expect_column_values_to_be_unique",
        "kwargs": {
          "batch_id": "test_data_source-test_data",
          "column": "ID"
        },
        ...
      },
      "result": {
        "element_count": 5,
        "unexpected_count": 0,
        "unexpected_percent": 0.0,
        ...
      },
      ...
    },
    {
      "success": false,
      "expectation_config": {
        "type": "expect_column_values_to_be_between",
        "kwargs": {
          "batch_id": "test_data_source-test_data",
          "column": "Score",
          "min_value": 80.0
        },
        ...
      },
      "result": {
        "element_count": 5,
        "unexpected_count": 1,
        "unexpected_percent": 20.0,
        "partial_unexpected_list": [
          78.9000015258789
        ],
        ...
      },
      ...
    }
  ]
}
```
In this example, for `expect_column_values_to_be_unique`, all the data meets the expectation. We can confirm this by checking the `success` flag (`true`), or by looking at `unexpected_count` and `unexpected_percent` (both 0).
However, for `expect_column_values_to_be_between`, the result indicates a failure. We can see this in the `success` flag (`false`) and an `unexpected_count` of 1, which corresponds to an `unexpected_percent` of 20%. Additionally, we can identify the specific values that caused the failure by checking the `partial_unexpected_list`.
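Parsing these details programmatically is often more convenient than reading the raw JSON. As a minimal sketch, assuming the validation result has been converted to a plain dictionary (the `failing_expectations` helper and the trimmed `sample` dict below are illustrative, not part of the GX API):

```python
# Hypothetical helper: list the failing expectations (and their unexpected
# percentages) from a GX validation result rendered as a plain dict.
def failing_expectations(result_dict):
    return [
        (r["expectation_config"]["type"], r["result"].get("unexpected_percent"))
        for r in result_dict.get("results", [])
        if not r["success"]
    ]

# Trimmed-down version of the validation result shown above
sample = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {"type": "expect_column_values_to_be_unique"},
            "result": {"unexpected_percent": 0.0},
        },
        {
            "success": False,
            "expectation_config": {"type": "expect_column_values_to_be_between"},
            "result": {"unexpected_percent": 20.0},
        },
    ],
}

print(failing_expectations(sample))  # [('expect_column_values_to_be_between', 20.0)]
```

A helper like this can feed dashboards or alerts without depending on the exact layout of the full result document.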
Further notes
Result format
It is also possible to get the primary key of the failed rows by setting the `result_format` option that GX provides. With the primary key, we can retrieve the entire failed rows if needed. For more details, please refer to the GX Result Format documentation.
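As a sketch of how such a configuration might look (the exact keys follow the GX Result Format documentation; using `ID` as the index column is an assumption based on our sample data):

```python
# Hypothetical result_format configuration: "COMPLETE" returns full unexpected
# value lists, and unexpected_index_column_names adds the key of failed rows.
result_format = {
    "result_format": "COMPLETE",
    "unexpected_index_column_names": ["ID"],  # assumes ID is the primary key
}

# The configuration would then be passed when running the validation, e.g.:
# results = validation_definition.run(
#     batch_parameters={"dataframe": df},
#     result_format=result_format,
# )

print(result_format["result_format"])  # COMPLETE
```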
Custom expectations
GX also provides options to create custom expectations. We can apply conditions to rows in a Batch, or customize an expectation class. This is particularly useful when we need tailored expectations for our dataset and want to reuse them across different validation scenarios.
Actions and Checkpoints
We can define actions to be executed based on the validation results. GX offers pre-defined actions we can choose from. For instance, we can update Data Docs with the validation results or send a notification to Slack.
To enable actions, we need to create a checkpoint. With checkpoints, we can also run multiple validation definitions. For additional information, check out Create a checkpoint with actions.
Conclusion
The integration of Great Expectations (GX) within Microsoft Fabric is a powerful solution for validating data programmatically, and offers a seamless way to ensure data quality and consistency across our workflow. GX allows us to easily define, apply, and track expectations, while the validation results provide valuable insights. We can analyze failures in depth, create custom reports for our needs, and take corrective action as required. This capability helps us maintain accurate and reliable data, significantly enhancing the efficiency of our data management process within Fabric.
References
The feature image was generated using Copilot Visual Creator. Terms can be found here.