Introduction
Data quality is a critical challenge for companies. Inaccurate, incomplete, or inconsistent data can lead to flawed decision-making, operational inefficiency, and loss of trust, and its effects reach every aspect of a business.
Great Expectations (GX) is a robust framework to check data quality and provide concrete, actionable results. Microsoft Fabric is an end-to-end data platform that integrates data engineering, data science, real-time analytics, and business intelligence into a unified environment. GX can be effectively utilized in the Microsoft Fabric environment, providing a seamless solution for data validation and management. By validating datasets against defined expectations, companies can proactively identify and resolve issues, ensuring that their data is accurate, reliable, and fit for purpose.
GX allows us to define test criteria programmatically using its open-source Python library, GX Core, and execute tests to ensure data quality. GX also offers GX Cloud, a fully managed cloud platform for data quality management.
In this blog post, we will explore how to integrate GX Core within the Microsoft Fabric environment and use it to validate data programmatically.
Sample data and quality requirements
To illustrate the problem and solution, let’s define the scenario for this post. We are working with the following test dataset:
| ID | Name | Age | City | Score |
|---|---|---|---|---|
| 1 | Alice | 25 | New York | 85.5 |
| 2 | Bob | 30 | Los Angeles | 90.2 |
| 3 | Charlie | 35 | Chicago | 78.9 |
| 4 | David | 40 | Houston | 88.3 |
| 5 | Eva | 45 | Phoenix | 92.5 |
To validate the quality of this dataset, we define the following requirements:

- The `ID` column should have unique values.
- The `Score` column should contain values greater than 80 (this expectation is intentionally set to fail to demonstrate how issues can be identified).
We will validate these requirements programmatically in a Microsoft Fabric Notebook using Great Expectations (GX). By integrating GX within this environment, we will demonstrate how to define expectations, apply validations, and analyze the results effectively.
Setting up GX in MS Fabric
Create an Environment
To use GX in MS Fabric, we need to add it to an Environment. We can include the `great_expectations` library with the desired version under the Public libraries > Public library tab. Once added, we must publish the updated environment to ensure the changes take effect.
Create a Notebook
We are using a Notebook to programmatically validate the data using GX. We create a Notebook and select the Environment we set up for GX. The Notebook now runs on an environment that has GX installed.
Load GX components
Load data
Let’s begin by loading the data we want to validate into Microsoft Fabric. In this example, we upload a test dataset to the Lakehouse and then read it into the Notebook for processing using PySpark.
```python
df = spark.read.table("test_dataset")
```
Create GX Context
Next, we need to create a GX context. The context holds all the configuration and metadata GX needs to run within our workflow.
```python
import great_expectations as gx

context = gx.get_context()
```
Create an expectation suite
In GX, the validation criteria we want to test are referred to as expectations. A collection of these expectations is called an expectation suite; it contains all the expectations we intend to apply to the data, and it can be stored in the GX context for future reuse and easier maintenance.
```python
expectation_suite_name = "test_data_validation_suite"
context.suites.add(gx.ExpectationSuite(name=expectation_suite_name))
suite = context.suites.get(name=expectation_suite_name)
```
Create Data source and Batch
We need to create a data source, a data asset, and a batch definition; these elements reference the target dataset in the GX context. The validation definition then ties together all the references required to run the validation, such as the target dataset (via the batch definition) and the expectation suite.
```python
data_source = context.data_sources.add_spark(name="test_data_source")
data_asset = data_source.add_dataframe_asset(name="test_data")
batch_definition = data_asset.add_batch_definition_whole_dataframe("test_batch_definition")

validation_definition = gx.ValidationDefinition(
    data=batch_definition,
    suite=suite,
    name="test_validation_definition",
)
```
Run GX in Fabric
Add new expectations
Now we are ready to start adding expectations. GX provides pre-defined expectations that we can easily leverage. We can explore the complete list of available expectations in the GX Expectations Gallery and find the ones that fit our requirements.
As mentioned in the requirements above, we will define two expectations: the `ID` column should have unique values, and the `Score` column should contain values greater than 80.

These expectations are defined and added to the expectation suite for validation.
```python
unique_id = gx.expectations.ExpectColumnValuesToBeUnique(column="ID")
suite.add_expectation(unique_id)

score_min = gx.expectations.ExpectColumnValuesToBeBetween(column="Score", min_value=80)
suite.add_expectation(score_min)
```
Run the validation
Let’s validate the data (stored in the PySpark DataFrame) using the validation definition. This applies the expectations from the expectation suite to the dataset, as defined in the validation definition, and generates a validation result.
```python
results = validation_definition.run(batch_parameters={"dataframe": df})
```
Validation results
Let’s take a closer look at the details of the validation results from the previous step. This will help us understand how the validation works and analyze the results effectively.
Success
The `success` flag at the top of the result shows that the validation failed. This is expected, as we intentionally included an expectation designed to fail.
```json
{
  "success": false,
  ...
}
```
Details for each expectation
Detailed results are provided for each expectation, allowing us to see which expectations succeeded or failed individually. This breakdown also helps in understanding how each expectation was evaluated.
```json
{
  ...
  "results": [
    {
      "success": true,
      "expectation_config": {
        "type": "expect_column_values_to_be_unique",
        "kwargs": {
          "batch_id": "test_data_source-test_data",
          "column": "ID"
        },
        ...
      },
      "result": {
        "element_count": 5,
        "unexpected_count": 0,
        "unexpected_percent": 0.0,
        ...
      },
      ...
    },
    {
      "success": false,
      "expectation_config": {
        "type": "expect_column_values_to_be_between",
        "kwargs": {
          "batch_id": "test_data_source-test_data",
          "column": "Score",
          "min_value": 80.0
        },
        ...
      },
      "result": {
        "element_count": 5,
        "unexpected_count": 1,
        "unexpected_percent": 20.0,
        "partial_unexpected_list": [
          78.9000015258789
        ],
        ...
      },
      ...
    }
  ]
}
```
In this example, for `expect_column_values_to_be_unique`, all the data meets the expectation. We can confirm this by checking the `success` flag (`true`), or by looking at `unexpected_count` and `unexpected_percent` (both 0).
However, for `expect_column_values_to_be_between`, the result indicates a failure. We can see this in the `success` flag (`false`) and an `unexpected_count` of 1, which corresponds to an `unexpected_percent` of 20%. Additionally, we can identify the specific values that caused the failure by checking the `partial_unexpected_list`.
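Parsing these details programmatically is often more convenient than reading the raw JSON. As a minimal sketch, assuming the validation result has been converted to a plain dictionary (the `failing_expectations` helper and the trimmed `sample` dict below are illustrative, not part of the GX API):

```python
# Hypothetical helper: list the failing expectations (and their unexpected
# percentages) from a GX validation result rendered as a plain dict.
def failing_expectations(result_dict):
    return [
        (r["expectation_config"]["type"], r["result"].get("unexpected_percent"))
        for r in result_dict.get("results", [])
        if not r["success"]
    ]

# Trimmed-down version of the validation result shown above
sample = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {"type": "expect_column_values_to_be_unique"},
            "result": {"unexpected_percent": 0.0},
        },
        {
            "success": False,
            "expectation_config": {"type": "expect_column_values_to_be_between"},
            "result": {"unexpected_percent": 20.0},
        },
    ],
}

print(failing_expectations(sample))  # [('expect_column_values_to_be_between', 20.0)]
```

A helper like this can feed dashboards or alerts without depending on the exact layout of the full result document.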
Further notes
Result format
It is also possible to get the primary key of the failed rows by setting the `result_format` option that GX provides. With the primary key, we can retrieve the entire failed rows if needed. For more details, please refer to the GX Result Format documentation.
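As a sketch of how such a configuration might look (the exact keys follow the GX Result Format documentation; using `ID` as the index column is an assumption based on our sample data):

```python
# Hypothetical result_format configuration: "COMPLETE" returns full unexpected
# value lists, and unexpected_index_column_names adds the key of failed rows.
result_format = {
    "result_format": "COMPLETE",
    "unexpected_index_column_names": ["ID"],  # assumes ID is the primary key
}

# The configuration would then be passed when running the validation, e.g.:
# results = validation_definition.run(
#     batch_parameters={"dataframe": df},
#     result_format=result_format,
# )

print(result_format["result_format"])  # COMPLETE
```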
Custom expectations
GX also provides options to create custom expectations. We can apply conditions to rows in a Batch, or customize an expectation class. This is particularly useful when we need tailored expectations for our dataset and want to reuse them across different validation scenarios.
Actions and Checkpoints
We can define actions to be executed based on the validation results. GX offers pre-defined actions we can choose from. For instance, we can update Data Docs with the validation results or send a notification to Slack.
To enable actions, we need to create a checkpoint. With checkpoints, we can also run multiple validation definitions. For additional information, check out Create a checkpoint with actions.
Conclusion
The integration of Great Expectations (GX) within Microsoft Fabric is a powerful solution for validating data programmatically, and offers a seamless way to ensure data quality and consistency across our workflow. GX allows us to easily define, apply, and track expectations, while the validation results provide valuable insights. We can analyze failures in depth, create custom reports for our needs, and take corrective action as required. This capability helps us maintain accurate and reliable data, significantly enhancing the efficiency of our data management process within Fabric.
References
The feature image was generated using Copilot Visual Creator. Terms can be found here.