{"id":13245,"date":"2020-10-29T11:59:23","date_gmt":"2020-10-29T18:59:23","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cse\/?p=13245"},"modified":"2021-01-12T18:22:43","modified_gmt":"2021-01-13T02:22:43","slug":"building-a-clinical-data-drift-monitoring-system-with-azure-devops-azure-databricks-and-mlflow","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/building-a-clinical-data-drift-monitoring-system-with-azure-devops-azure-databricks-and-mlflow\/","title":{"rendered":"Building A Clinical Data Drift Monitoring System With Azure DevOps, Azure Databricks, And MLflow"},"content":{"rendered":"<p>Hospitals around the world regularly work towards improving the health of their patients as well as ensuring there are enough resources available for patients awaiting care. During these unprecedented times with the COVID-19 pandemic, Intensive Care Units are having to make difficult decisions at a greater frequency to optimize patient health outcomes.<\/p>\n<p>The continuous collection of biometric and clinical data throughout a patient\u2019s stay enables medical professionals to take a data-informed, holistic approach to clinical decision making.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/splash_smaller.png\"><img decoding=\"async\" class=\"wp-image-13265 size-full\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/splash_smaller.png\" alt=\"Image splash smaller\" width=\"1794\" height=\"756\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/splash_smaller.png 1794w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/splash_smaller-300x126.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/splash_smaller-1024x432.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/splash_smaller-768x324.png 
768w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/splash_smaller-1536x647.png 1536w\" sizes=\"(max-width: 1794px) 100vw, 1794px\" \/><\/a><\/p>\n<p>In some cases, a Machine Learning model may be used to provide insight given the copious amount of data coming in from various monitors and clinical tests per patient. The Philips Healthcare Informatics (HI) \u00a0team uses such data to build models predicting outcomes such as likelihood of patient mortality, necessary length of ventilation, and necessary length of stay. In the case of the recent collaboration between us in Microsoft Commercial Software Engineering (CSE) and the Philips HI team, we focused on developing an <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/machine-learning\/concept-model-management-and-deployment#what-is-mlops\">MLOps<\/a>\u00a0solution to bring the Philips benchmark mortality model to production. This benchmark mortality model predicts the risk of patient mortality on a quarterly basis as an evaluation metric for individual ICU performance.<\/p>\n<p>Since the model uses such a large amount of information from various sources, it is imperative that the quality of the incoming data be monitored to catch any changes that may affect model performance. 
Manually investigating unexpected changes and tracking down the cause of data problems takes valuable time away from the Philips data science team\u2019s work on the mortality and other models.<\/p>\n<p>In this blog post, we cover our approach to establishing a <a href=\"#what-is-data-drift\">data drift<\/a> monitoring process for multifaceted clinical data in the Philips eICU network, including example code.<\/p>\n<p>&nbsp;<\/p>\n<h2>Challenges and Objectives<\/h2>\n<p>The aim of this collaboration was to integrate MLOps into the Philips team\u2019s workflow to improve their experience moving code from development to production, to enable scalability, and to increase the overall efficiency of their system. MLOps, also known as DevOps for Machine Learning, is a set of practices that enable automation of aspects of the Machine Learning lifecycle and help ensure quality in production (see the <a href=\"#resources\">Resources section<\/a> at the end of this post). 
Various workstreams of the solution focused on components of MLOps integration, including monitoring model performance and fairness.<\/p>\n<p>One of our key objectives was to develop a data drift monitoring process and integrate it into production such that potential changes in model performance could be caught before re-running the time-intensive and computationally expensive model training pipeline, which is run quarterly to generate a performance report for each ICU or acute unit monitored by an enterprise eICU program.<\/p>\n<p>Regarding data drift monitoring, we aimed to:<\/p>\n<ul>\n<li>Create distinct pipelines for input data monitoring and model training such that data drift monitoring could be performed more frequently and separately from model training.<\/li>\n<li>Perform both schema validation and distribution drift monitoring for numerical and categorical features to bring attention to noteworthy changes in data.<\/li>\n<li>Ensure data drift monitoring results are easily interpretable and provide useful insight on changes in the data.<\/li>\n<li>Structure a scalable and secure solution such that the framework established can accommodate additional models and datasets in the near future.<\/li>\n<\/ul>\n<p><figure id=\"attachment_13249\" aria-labelledby=\"figcaption_attachment_13249\" class=\"wp-caption aligncenter\" ><a href=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_01.png\"><img decoding=\"async\" class=\"wp-image-13249 size-large\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_01-1024x586.png\" alt=\"Diagram showing flow from data to model prediction\" width=\"640\" height=\"366\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_01-1024x586.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_01-300x172.png 300w, 
https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_01-768x440.png 768w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_01-1536x880.png 1536w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_01.png 1900w\" sizes=\"(max-width: 640px) 100vw, 640px\" \/><\/a><figcaption id=\"figcaption_attachment_13249\" class=\"wp-caption-text\"><strong>Figure 1<\/strong>. Data drift monitoring is performed on the data fed into the mortality prediction model.<\/figcaption><\/figure><\/p>\n<p>&nbsp;<\/p>\n<h2>What is Data Drift?<\/h2>\n<p>Drift, in the context of this project, involves shifts or changes in the format and values of data being fed as input into the mortality model. In general, data drift detection can be used to alert data scientists and engineers to changes in the data and can also be used to automatically trigger model retraining. In this project, we perform data drift monitoring to catch potential issues before running the time intensive model retraining. 
We separate data drift into two streams, schema validation and distribution drift monitoring.<\/p>\n<p><figure id=\"attachment_13250\" aria-labelledby=\"figcaption_attachment_13250\" class=\"wp-caption aligncenter\" ><a href=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_02.png\"><img decoding=\"async\" class=\"wp-image-13250 size-large\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_02-1024x610.png\" alt=\"Diagram showing how monitoring relates to the data\" width=\"640\" height=\"381\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_02-1024x610.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_02-300x179.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_02-768x458.png 768w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_02-1536x915.png 1536w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_02.png 2034w\" sizes=\"(max-width: 640px) 100vw, 640px\" \/><\/a><figcaption id=\"figcaption_attachment_13250\" class=\"wp-caption-text\"><strong>Figure 2<\/strong>. Various feature values (e.g., average heart rate, average pH, <a href=\"https:\/\/www.glasgowcomascale.org\/what-is-gcs\/\">Glasgow Coma Score<\/a>) are stored in the multi-health system database per patient. Each of these features undergo schema validation and distribution drift monitoring as part of the data drift monitoring process. Results are written back into tables designed to store data drift monitoring results in the database.<\/figcaption><\/figure><\/p>\n<p>Schema drift involves changes in the format or schema of the incoming data. For example, let\u2019s consider the case of an ICU using a new machine for recording blood pressure. 
This new machine outputs diastolic and systolic blood pressure as two strings (e.g., [\u201c120 S\u201d, \u201c80 D\u201d]) instead of as two integers like with the previous machine (e.g., [120, 80]). This unexpected change in format could lead to an error when attempting to retrain the model. Implementing automatic schema validation allows the team to quickly catch when breaking changes are introduced into the model training dataset.<\/p>\n<p><figure id=\"attachment_13251\" aria-labelledby=\"figcaption_attachment_13251\" class=\"wp-caption aligncenter\" ><a href=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_03.png\"><img decoding=\"async\" class=\"wp-image-13251 size-large\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_03-1024x704.png\" alt=\"Demonstrating schema drift with two blood pressure monitors\" width=\"640\" height=\"440\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_03-1024x704.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_03-300x206.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_03-768x528.png 768w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_03.png 1464w\" sizes=\"(max-width: 640px) 100vw, 640px\" \/><\/a><figcaption id=\"figcaption_attachment_13251\" class=\"wp-caption-text\"><strong>Figure 3<\/strong>. Medical devices measuring the same type of data (e.g., blood pressure) may output the data in different formats.<\/figcaption><\/figure><\/p>\n<p>Distribution drift, also known as virtual concept drift or covariate shift, involves change in the overall distribution of data within each feature over time. 
For example, let\u2019s consider the case of one ICU\u2019s blood pressure monitors malfunctioning or being recalibrated, leading them to consistently report diastolic blood pressure 25 points higher and systolic blood pressure 18 points lower. This change would be important to catch: it may not cause the model to fail, but it may negatively impact the performance of the model for patients in this ICU.<\/p>\n<p>Distribution drift is calculated as the difference between a baseline and a target distribution. For any given feature (e.g., diastolic blood pressure), the baseline distribution is a set of values for that feature from a historical time window (e.g., from January 1, 2018 to December 31, 2018) against which the target distribution will be compared. Likewise, the target distribution is a set of values for the given feature from a more recent time window (e.g., from January 1, 2019 to December 31, 2019). Figure 4 illustrates example baseline and target distributions for diastolic and systolic blood pressure.<\/p>\n<p>The difference in distributions for each feature is measured and evaluated for statistical significance using <a href=\"https:\/\/www.itl.nist.gov\/div898\/handbook\/eda\/section3\/eda35g.htm\">Kolmogorov-Smirnov tests<\/a>, as available through the Python <a href=\"https:\/\/docs.seldon.io\/projects\/alibi-detect\/en\/latest\/methods\/ksdrift.html\">alibi-detect library<\/a>. 
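To make the mechanics concrete, here is a minimal sketch of the per-feature comparison using SciPy's two-sample K-S implementation (alibi-detect's KSDrift runs this same test feature by feature); the function name, sample sizes, and threshold below are illustrative, not taken from the Philips solution.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(baseline, target, p_val=0.05):
    """Two-sample Kolmogorov-Smirnov test for a single feature.

    Flags drift when the p-value falls below the chosen threshold.
    """
    stat, p = ks_2samp(baseline, target)
    return {"statistic": float(stat), "p_value": float(p), "drift": bool(p < p_val)}

rng = np.random.default_rng(0)
# Baseline period: systolic blood pressure centred near 120 mmHg
baseline = rng.normal(120, 10, 1000)
# Target period: readings shifted upward, as in the Figure 4 example
target = rng.normal(138, 10, 1000)

result = detect_feature_drift(baseline, target)
```

With a shift this large relative to the spread, the test flags the feature as drifted with a vanishingly small p-value.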
If the difference between the baseline and target distributions is statistically significant, the feature is flagged as exhibiting drift.<\/p>\n<p><figure id=\"attachment_13252\" aria-labelledby=\"figcaption_attachment_13252\" class=\"wp-caption aligncenter\" ><a href=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_04.png\"><img decoding=\"async\" class=\"wp-image-13252 size-large\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_04-1024x454.png\" alt=\"Demonstrating distribution drift in blood pressure\" width=\"640\" height=\"284\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_04-1024x454.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_04-300x133.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_04-768x340.png 768w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_04-1536x681.png 1536w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_04.png 1932w\" sizes=\"(max-width: 640px) 100vw, 640px\" \/><\/a><figcaption id=\"figcaption_attachment_13252\" class=\"wp-caption-text\"><strong>Figure 4<\/strong>. Histograms showing example distributions of systolic and diastolic blood pressure over a baseline and target period. Here, we see the distribution of systolic blood pressure is shifted to the right in the target period, exhibiting greater values more frequently than in the baseline period. Similarly, the distribution for diastolic blood pressure is shifted to the left in the target period relative to the baseline.<\/figcaption><\/figure><\/p>\n<p>It is important to note that while some feature drift is seasonal or expected, data scientists should be aware of changes in the data that could affect model performance. 
By staying abreast of these changes, data scientists can work with hospital staff early on to address any issues before the ICU performance reports are created.<\/p>\n<p>While the examples above might be manageable if blood pressure were the only feature used in the model, the model in fact takes in data for many features across many hospitals and health systems. It would be impractical to manually inspect all the data to understand why the model is failing to retrain or why the model is performing worse after retraining.<\/p>\n<p>&nbsp;<\/p>\n<h2>Solution<\/h2>\n<p>Our solution needed to be scalable, repeatable, and secure. As a result, we built it on Azure Databricks, the open source library MLflow, and Azure DevOps.<\/p>\n<p>For the data drift monitoring component of the project solution, we developed Python scripts that were submitted as Azure Databricks jobs through the MLflow experiment framework, using an Azure DevOps pipeline. Example code for the data drift monitoring portion of the solution is available in the <a href=\"https:\/\/github.com\/niwilso\/data-drift-monitor\">Clinical Data Drift Monitoring GitHub repository<\/a>.<\/p>\n<p>Table 1 details the tools used for building the data drift monitor portion of the solution.<\/p>\n<table style=\"border-collapse: collapse; width: 100%;\">\n<tbody>\n<tr>\n<td style=\"width: 10.9649%;\" width=\"55\"><strong>Tool Used<\/strong><\/td>\n<td style=\"width: 54.8246%;\" width=\"392\"><strong>Reason<\/strong><\/td>\n<td style=\"width: 34.1228%;\" width=\"177\"><strong>Resources<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 10.9649%;\" width=\"55\">Azure Databricks<\/td>\n<td style=\"width: 54.8246%;\" width=\"392\">Great computational power for model training and allows for scalability.<\/td>\n<td style=\"width: 34.1228%;\" width=\"177\"><a href=\"https:\/\/azure.microsoft.com\/en-us\/free\/databricks\/\">Azure Databricks<\/a>, <a 
href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/databricks\/\">Azure Databricks documentation<\/a><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 10.9649%;\" width=\"55\">SQL Server<\/td>\n<td style=\"width: 54.8246%;\" width=\"392\">The healthcare data was already being stored in a SQL server database. No need to move the data.<\/td>\n<td style=\"width: 34.1228%;\" width=\"177\"><a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/databricks\/data\/data-sources\/sql-databases\">Accessing SQL databases on Databricks using JDBC<\/a><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 10.9649%;\" width=\"55\">Alibi-detect<\/td>\n<td style=\"width: 54.8246%;\" width=\"392\">Established Python package with data drift detection calculation capabilities.<\/td>\n<td style=\"width: 34.1228%;\" width=\"177\"><a href=\"https:\/\/github.com\/SeldonIO\/alibi-detect\">Alibi-detect GitHub repository<\/a><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 10.9649%;\" width=\"55\">MLflow<\/td>\n<td style=\"width: 54.8246%;\" width=\"392\">Established open source framework for tracking model parameters and artifacts.<\/td>\n<td style=\"width: 34.1228%;\" width=\"177\"><a href=\"https:\/\/mlflow.org\/\">MLflow overview<\/a><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 10.9649%;\" width=\"55\">Azure DevOps<\/td>\n<td style=\"width: 54.8246%;\" width=\"392\">All-inclusive service for managing code and pipelines for the full DevOps lifecycle.<\/td>\n<td style=\"width: 34.1228%;\" width=\"177\"><a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/devops\/user-guide\/what-is-azure-devops?view=azure-devops\">What is Azure DevOps?<\/a>, <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/devops\/?view=azure-devops\">Azure DevOps documentation<\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><strong><em>Table 1<\/em><\/strong><em>. Tools used were selected to accommodate the whole solution, including data drift monitoring. 
An important criterion in our tool selection was integration with the solution as a whole.<\/em><\/p>\n<p>With our MLOps approach, the data drift monitor code is continuously integrated into the solution and does not exist as isolated code. In this post, we will first cover the general structure of the MLOps code and then move on to the drift monitoring code itself.<\/p>\n<p>Note that for this blog post and in the example code, we use <a href=\"https:\/\/github.com\/nytimes\/covid-19-data\"><em>The New York Times<\/em> open source COVID-19 cases by county dataset<\/a> to demonstrate data drift monitoring instead of sensitive clinical data.<\/p>\n<h2>General Workflow<\/h2>\n<p>Before running the data drift monitoring code, we needed to set up the connection to the Azure Databricks workspace where all computation would take place (Figure 5). For guidance on how to create a shared resource group connected to an Azure Databricks workspace, see the <a href=\"https:\/\/github.com\/niwilso\/data-drift-monitor\/blob\/master\/docs\/README.md\">getting started README in this blog post\u2019s repository<\/a>. 
For guidance on creating an Azure Databricks workspace, see the <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/databricks\/\">Azure Databricks documentation<\/a>.<\/p>\n<p><figure id=\"attachment_13253\" aria-labelledby=\"figcaption_attachment_13253\" class=\"wp-caption aligncenter\" ><a href=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_05.png\"><img decoding=\"async\" class=\"wp-image-13253 size-large\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_05-1024x778.png\" alt=\"Setting up a variable group\" width=\"640\" height=\"486\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_05-1024x778.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_05-300x228.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_05-768x584.png 768w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_05.png 1316w\" sizes=\"(max-width: 640px) 100vw, 640px\" \/><\/a><figcaption id=\"figcaption_attachment_13253\" class=\"wp-caption-text\"><strong>Figure 5<\/strong>. The Azure Databricks workspace can be connected to a variable group to allow access to all pipelines in the Azure DevOps instance. More detailed instructions in the following <a href=\"https:\/\/github.com\/niwilso\/data-drift-monitor\/blob\/master\/docs\/README.md\">README<\/a>.<\/figcaption><\/figure><\/p>\n<p>After creating the shared resource group connected to our Azure Databricks workspace, we needed to create a new pipeline in Azure DevOps that references the data drift monitoring code. 
In our <a href=\"https:\/\/github.com\/niwilso\/data-drift-monitor\/blob\/master\/.azure_pipelines\/data_drift.yml\">data_drift.yml pipeline file<\/a>, we specify where the code is located for schema validation and for distribution drift as two separate tasks.<\/p>\n<pre class=\"prettyprint\">  - task: Bash@3\r\n    displayName: Execute Data Drift Project (schema validation)\r\n    inputs:\r\n      targetType: \"inline\"\r\n      script: |\r\n        python scripts\/submit_job.py \\\r\n          --projectEntryPoint validation \\\r\n          --projectPath projects\/$(PROJECT_NAME)\/ \\\r\n          --projectExperimentFolder $(MODEL_WORKSPACE_DIR)\/data_drift\r\n    env:\r\n      MLFLOW_TRACKING_URI: databricks\r\n      MODEL_NAME: \"$(PROJECT_NAME)datadrift\"\r\n      MODEL_ID: $(MODEL_ID)\r\n      DATA_PATH: $(DATA_PATH)\r\n      FEATURES: $(FEATURES)\r\n      DATETIME_COL: $(DATETIME_COL)\r\n      GROUP_COL: $(GROUP_COL)\r\n      BASELINE_START: $(BASELINE_START)\r\n      BASELINE_END: $(BASELINE_END)\r\n      TARGET_START: $(TARGET_START)\r\n      TARGET_END: $(TARGET_END)\r\n      P_VAL: $(P_VAL)\r\n      OUT_FILE_NAME: $(OUT_FILE_NAME)<\/pre>\n<p>During pipeline creation, we specify pipeline variables that serve as parameters for the various drift-related Python scripts (Table 2); these variables can also be seen in the code snippet above. 
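For context, the core of such a submission script can be sketched with MLflow's Projects API, which supports submitting a project as a Databricks job via `backend="databricks"`. Everything below (the helper name, the defaults, the cluster file and experiment paths) is illustrative rather than the repository's actual submit_job.py.

```python
import os

def build_drift_run_config(entry_point, project_path, experiment_folder):
    """Assemble the arguments a submit_job.py-style script could pass to
    mlflow.projects.run(..., backend="databricks").

    The environment variable names mirror the pipeline variables in
    data_drift.yml; the helper itself and its defaults are illustrative.
    """
    parameters = {
        "data_path": os.environ.get("DATA_PATH", ""),
        "features": os.environ.get("FEATURES", "fips,cases,deaths"),
        "datetime_col": os.environ.get("DATETIME_COL", "date"),
        "group_col": os.environ.get("GROUP_COL", "state"),
        "baseline_start": os.environ.get("BASELINE_START", "2020-01-21"),
        "baseline_end": os.environ.get("BASELINE_END", "2020-05-31"),
        "target_start": os.environ.get("TARGET_START", ""),
        "target_end": os.environ.get("TARGET_END", ""),
        "p_val": float(os.environ.get("P_VAL", "0.05")),
        "out_file_name": os.environ.get("OUT_FILE_NAME", "results.json"),
    }
    return {
        "uri": project_path,
        "entry_point": entry_point,        # "validation" or "distribution"
        "parameters": parameters,
        "backend": "databricks",           # run as a Databricks job
        "backend_config": "cluster.json",  # rendered from cluster.json.j2
        "experiment_name": experiment_folder,
    }

config = build_drift_run_config(
    "validation", "projects/data_drift/", "/Shared/data_drift"
)
# mlflow.projects.run(**config) would then submit the MLflow project
```

Keeping the actual `mlflow.projects.run` call at the end makes the same helper reusable for both the validation and distribution entry points.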
The default values in the table coincide with the open source <em>The New York Times<\/em> COVID-19 cases by county dataset we use in the example code.<\/p>\n<table>\n<thead>\n<tr>\n<td><strong>Variable Name<\/strong><\/td>\n<td><strong>Default Value<\/strong><\/td>\n<td><strong>Description<\/strong><\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>BASELINE_END<\/td>\n<td>2020-05-31<\/td>\n<td>End date of the baseline period in YYYY-MM-DD format.<\/td>\n<\/tr>\n<tr>\n<td>BASELINE_START<\/td>\n<td>2020-01-21<\/td>\n<td>Start date of the baseline period in YYYY-MM-DD format.<\/td>\n<\/tr>\n<tr>\n<td>DATA_PATH<\/td>\n<td>https:\/\/raw.githubusercontent.com\/nytimes\/covid-19-data\/master\/us-counties.csv<\/td>\n<td>Location of data (either local path or URL).<\/td>\n<\/tr>\n<tr>\n<td>DATETIME_COL<\/td>\n<td>date<\/td>\n<td>Name of column containing datetime information.<\/td>\n<\/tr>\n<tr>\n<td>FEATURES<\/td>\n<td>fips,cases,deaths<\/td>\n<td>List of features to perform schema validation for, separated by commas with no spaces.<\/td>\n<\/tr>\n<tr>\n<td>GROUP_COL<\/td>\n<td>state<\/td>\n<td>Name of column to group results by.<\/td>\n<\/tr>\n<tr>\n<td>MODEL_ID<\/td>\n<td>1<\/td>\n<td>Appropriate model ID number associated with the data we are performing drift monitoring for (see mon.vrefModel).<\/td>\n<\/tr>\n<tr>\n<td>OUT_FILE_NAME<\/td>\n<td>results.json<\/td>\n<td>Name of .json file storing results.<\/td>\n<\/tr>\n<tr>\n<td>P_VAL<\/td>\n<td>0.05<\/td>\n<td>Threshold value for p-values in distribution drift monitoring. Values below the threshold will be labelled as significant.<\/td>\n<\/tr>\n<tr>\n<td>TARGET_END<\/td>\n<td>2020-08-27<\/td>\n<td>End date of the target period in YYYY-MM-DD format.<\/td>\n<\/tr>\n<tr>\n<td>TARGET_START<\/td>\n<td>2020-08-01<\/td>\n<td>Start date of the target period in YYYY-MM-DD format.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><strong><em>Table 2<\/em><\/strong><em>. 
The data drift monitoring pipeline allows the user to set parameters (variable values) that are appropriate for any particular given run. These values are used by the data drift monitoring Python scripts.<\/em><\/p>\n<p>Each variable was set such that whoever triggers the pipeline can override the default values with values more appropriate for the specific run instance (e.g., changing the target start and end dates) (Figure 6). For further guidance on creating this pipeline, see <a href=\"https:\/\/github.com\/niwilso\/data-drift-monitor\/blob\/master\/docs\/mlops_example_data_drift_project.md\">mlops_example_data_drift_project.md<\/a>\u00a0on this blog post repository.<\/p>\n<p><figure id=\"attachment_13254\" aria-labelledby=\"figcaption_attachment_13254\" class=\"wp-caption aligncenter\" ><a href=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_06.png\"><img decoding=\"async\" class=\"wp-image-13254 size-large\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_06-1024x475.png\" alt=\"Updating pipeline variable value\" width=\"640\" height=\"297\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_06-1024x475.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_06-300x139.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_06-768x356.png 768w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_06-1536x712.png 1536w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_06-2048x949.png 2048w\" sizes=\"(max-width: 640px) 100vw, 640px\" \/><\/a><figcaption id=\"figcaption_attachment_13254\" class=\"wp-caption-text\"><strong>Figure 6<\/strong>. 
Default values in the pipeline can and should be overwritten as needed for individual runs.<\/figcaption><\/figure><\/p>\n<p>With the Azure Databricks workspace and the pipeline set up, let\u2019s look at the code our pipeline references.<\/p>\n<pre class=\"prettyprint\">DriftCode\r\n-  common (dir)\r\n-  distribution (dir)\r\n     - parameters.json.j2\r\n     - [distribution drift monitoring script]\r\n-  validation (dir)\r\n     - parameters.json.j2\r\n     - [schema validation scripts]\r\n-  cluster.json.j2\r\n-  MLProject\r\n-  project_env.yaml<\/pre>\n<p>With the MLflow framework, the environment, parameters, and script calls are all referenced in the MLProject file. We organize code specific to schema validation into the \u201cvalidation\u201d folder and code specific to distribution drift to the \u201cdistribution\u201d folder, which we will discuss later in this post.<\/p>\n<p>In order to use Databricks for computation, we define our cluster, which our MLflow project will be submitted to as a Databricks job.<\/p>\n<pre class=\"prettyprint\">{\r\n  \"spark_version\": \"7.0.x-scala2.12\",\r\n  \"num_workers\": 1,\r\n  \"node_type_id\": \"Standard_DS3_v2\",\r\n  \"spark_env_vars\": {\r\n    \"MODEL_NAME\": \"{{MODEL_NAME}}\"\r\n    {% if AZURE_STORAGE_ACCESS_KEY is defined and AZURE_STORAGE_ACCESS_KEY|length %}\r\n      ,\r\n      \"AZURE_STORAGE_ACCESS_KEY\": \"{{AZURE_STORAGE_ACCESS_KEY}}\"\r\n    {% endif %}\r\n  }\r\n}<\/pre>\n<p>Because the data drift monitoring code requires specific dependencies that other workstreams in the overall solution may not need, we specify an Anaconda environment for all the Python code to run on.<\/p>\n<pre class=\"prettyprint\">---\r\nname: drift\r\nchannels:\r\n  - defaults\r\n  - anaconda\r\n  - conda-forge\r\ndependencies:\r\n  - python=3.7\r\n  - pip:\r\n      - environs==8.0.0\r\n      - alibi-detect==0.4.1\r\n      - mlflow==1.7.0\r\n      - tensorflow==2.3.0\r\n      - cloudpickle==1.3.0\r\n<\/pre>\n<h2>The Data Drift 
Monitoring Code<\/h2>\n<p>The first step to detecting either changes in schema or distribution is loading the data. In the project with Philips, we connected to a SQL server to access the data using a combination of <a href=\"http:\/\/spark.apache.org\/docs\/latest\/api\/python\/pyspark.sql.html#pyspark.sql.SparkSession\">PySpark<\/a> and <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/databricks\/data\/data-sources\/sql-databases\">JDBC<\/a>.<\/p>\n<pre class=\"prettyprint\">import os\r\nimport numpy as np\r\nfrom pyspark.sql import SparkSession\r\nfrom environs import Env\r\n\r\nspark: SparkSession = SparkSession.builder.getOrCreate()\r\n<\/pre>\n<pre class=\"prettyprint\">def get_sql_connection_string(port=1433, database=\"\", username=\"\"):\r\n    \"\"\" Form the SQL Server Connection String\r\n\r\n    Returns:\r\n        connection_url (str): connection to sql server using jdbc.\r\n    \"\"\"\r\n    env = Env()\r\n    env.read_env()\r\n    server = os.environ[\"SQL_SERVER_VM\"]\r\n    password = os.environ[\"SERVICE_ACCOUNT_PASSWORD\"]\r\n\r\n    connection_url = \"jdbc:sqlserver:\/\/{0}:{1};database={2};user={3};password={4}\".format(\r\n        server, port, database, username, password\r\n    )\r\n\r\n    return connection_url\r\n\r\n\r\ndef submit_sql_query(query):\r\n    \"\"\" Push down a SQL Query to SQL Server for computation, returning a table\r\n\r\n    Inputs:\r\n        query (str): Either a SQL query string, with table alias, or table name as a string.\r\n\r\n    Returns:\r\n        Spark DataFrame of the requested data\r\n    \"\"\"\r\n    connection_url = get_sql_connection_string()\r\n    return spark.read.jdbc(url=connection_url, table=query)<\/pre>\n<p>For simplicity, in this example we do not connect to a SQL server but instead load our data from a local file or URL into a Pandas data frame. 
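For instance, a small helper can load the data and slice it into the baseline and target windows by date. The helper name and the inline stand-in rows below are illustrative; the real pipeline reads the CSV at DATA_PATH with pandas.

```python
import pandas as pd

def load_window(df, datetime_col, start, end):
    """Return rows whose datetime_col falls within [start, end] inclusive."""
    dates = pd.to_datetime(df[datetime_col])
    mask = (dates >= pd.Timestamp(start)) & (dates <= pd.Timestamp(end))
    return df.loc[mask]

# A few made-up rows standing in for the NYT us-counties.csv download;
# in the pipeline, pd.read_csv(DATA_PATH) loads the real file.
df = pd.DataFrame(
    {
        "date": ["2020-01-21", "2020-03-15", "2020-08-10"],
        "state": ["Washington", "New York", "Texas"],
        "fips": [53061, 36061, 48201],
        "cases": [1, 269, 75000],
        "deaths": [0, 0, 1500],
    }
)

baseline = load_window(df, "date", "2020-01-21", "2020-05-31")
target = load_window(df, "date", "2020-08-01", "2020-08-27")
```

The same window boundaries are what the BASELINE_* and TARGET_* pipeline variables feed into the drift scripts.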
Here, we explore the open source <a href=\"https:\/\/github.com\/nytimes\/covid-19-data\"><em>The New York Times<\/em> COVID-19 dataset<\/a>\u00a0which includes <a href=\"https:\/\/www.census.gov\/quickfacts\/fact\/note\/US\/fips\">FIPS codes (fips)<\/a>, cases, and deaths by county in the United States of America over time.<\/p>\n<p><figure id=\"attachment_13255\" aria-labelledby=\"figcaption_attachment_13255\" class=\"wp-caption aligncenter\" ><a href=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_07.png\"><img decoding=\"async\" class=\"wp-image-13255 size-full\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_07.png\" alt=\"Table showing sample data from the New York Times COVID-19 dataset\" width=\"794\" height=\"602\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_07.png 794w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_07-300x227.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_07-768x582.png 768w\" sizes=\"(max-width: 794px) 100vw, 794px\" \/><\/a><figcaption id=\"figcaption_attachment_13255\" class=\"wp-caption-text\"><strong>Figure 7<\/strong>. A small sample of the data available in <em>The New York Times<\/em> COVID-19 dataset. FIPS codes, cases, and deaths are reported daily per county.<\/figcaption><\/figure><\/p>\n<p>Although not as complex as the data used to estimate optimal ICU stay length or mortality for the Philips model, this simple dataset allows us to explore data drift monitoring with minimal data wrangling and processing.<\/p>\n<h3>Schema Validation<\/h3>\n<p>Part of the drift monitoring pipeline involves checking if the schema of the data is as expected. 
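As a compact illustration of the idea, a schema check for a numeric feature can be expressed as a vectorized pandas helper; the function name and inline series here are ours, while the repository instead iterates over the unique values per group.

```python
import pandas as pd

def validate_cases_schema(series):
    """Check the assumptions used for 'cases': integer dtype, non-negative.

    Returns "valid" or a reason string, mirroring the status values
    written out by the validation scripts.
    """
    if not pd.api.types.is_integer_dtype(series):
        return "invalid: value not an int"
    if (series < 0).any():
        return "invalid: value must be non-negative"
    return "valid"

good = pd.Series([0, 5, 12], dtype="int64")
bad_type = pd.Series([1.0, 2.5])              # floats instead of integers
bad_sign = pd.Series([3, -1], dtype="int64")  # negative case count
```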
In our solution, this schema validation is written as a separate script for each feature (i.e., fips, cases, deaths), followed by an assertion script that causes the pipeline to fail if any feature presents an invalid schema.<\/p>\n<p>The following example illustrates custom schema validation for a particular feature.<\/p>\n<pre class=\"prettyprint\">\"\"\"\r\nAssumptions for cases:\r\n    - Values are integers\r\n    - Values are non-negative\r\n\"\"\"\r\nfor group_value in group_values:\r\n    # ---------------------------------------------------\r\n    # Get unique values in the column (feature) of interest\r\n    # ---------------------------------------------------\r\n    feature_values = get_unique_vals(df, feature, group_col, group_value)\r\n\r\n    # Initialize variable to keep track of schema validity\r\n    status = \"valid\"\r\n\r\n    # Validate feature schema\r\n    for val in feature_values:\r\n\r\n        # Only keep checking while the status is still valid\r\n        # This prevents a later value from overwriting an invalid status\r\n        if status == \"valid\":\r\n\r\n            # Check if the value is an integer\r\n            if type(val) in [int, np.int64]:\r\n                if val &lt; 0:\r\n                    status = \"invalid: value must be non-negative\"\r\n                else:\r\n                    status = \"valid\"\r\n            else:\r\n                status = \"invalid: value not an int\"\r\n\r\n        # Update dictionary\r\n        output_dict[\"schema_validation\"][group_col][group_value].update(\r\n            {feature: {\"status\": status, \"n_vals\": len(feature_values)}}\r\n        )<\/pre>\n<p>Note that the specific logic will vary depending on the particular assumptions for the given feature.<\/p>\n<p>With each feature schema check, validation results are written to a .json file, which the assertion script reads back in as a dictionary.<\/p>\n<pre class=\"prettyprint\">def search_dict_for_invalid(group_col, group_values, 
features, results, invalids):\r\n    \"\"\" Search dictionary for features with invalid schema\r\n\r\n    Inputs:\r\n        group_col (str): Name of column to group results by.\r\n        group_values (list of str): Names of specific groups in group_col.\r\n        features (list of str): List of features that will be monitored.\r\n        results (dict): Dictionary containing schema validation results and metadata.\r\n        invalids (list): List of strings containing information about which features are invalid.\r\n\r\n    Returns:\r\n        invalids (list): List of strings containing information about which features are invalid.\r\n    \"\"\"\r\n    for group_value in group_values:\r\n        for feature in features:\r\n            status = results[\"schema_validation\"][group_col][group_value][feature][\r\n                \"status\"\r\n            ]\r\n            if status.lower() != \"valid\":\r\n                invalids.append(\r\n                    \"{0}: {1}, {2} invalid\".format(group_col, group_value, feature)\r\n                )\r\n\r\n    return invalids<\/pre>\n<p>If the schema for any of the features is determined \u201cinvalid\u201d, the assertion call in the assertion script will throw an error.<\/p>\n<h3>Distribution Drift<\/h3>\n<p>The other half of the drift monitoring pipeline calculates distribution drift within each of the features over the user-specified baseline and target periods. This drift is calculated using Kolmogorov-Smirnov (K-S) tests implemented through the alibi-detect Python library. We chose this implementation because it corrects for multiple comparisons, which matters when running statistical tests across multiple features of interest (i.e., fips, cases, deaths). 
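alibi-detect handles this correction internally; as an illustrative sketch with plain SciPy (not the alibi-detect API, and with synthetic data and an arbitrary shift), a Bonferroni-corrected per-feature K-S check might look like:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
features = ["fips", "cases", "deaths"]

baseline = rng.normal(0.0, 1.0, size=(500, 3))  # one column per feature
target = baseline.copy()
target[:, 2] += 3.0  # inject drift into the third feature only

# One two-sample K-S test per feature ...
p_vals = [ks_2samp(baseline[:, i], target[:, i]).pvalue for i in range(len(features))]

# ... compared against a Bonferroni-corrected threshold (alpha / n_features)
alpha = 0.05
drifted = [f for f, p in zip(features, p_vals) if p < alpha / len(features)]
print(drifted)  # ['deaths']
```

Without the correction, running three tests at alpha = 0.05 would inflate the chance of at least one false drift flag well above 5%.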
In addition, drift detection through alibi-detect allows for seamless handling of categorical data and has the potential to <a href=\"https:\/\/docs.seldon.io\/projects\/alibi-detect\/en\/latest\/examples\/cd_ks_cifar10.html\">predict malicious drift through adversarial drift detection<\/a>.<\/p>\n<pre class=\"prettyprint\">from alibi_detect.cd import KSDrift\r\nimport pandas as pd\r\nimport numpy as np\r\nimport datetime\r\nimport argparse\r\nimport decimal\r\nimport mlflow\r\nimport os\r\nimport sys\r\nfrom environs import Env\r\n...\r\n# ---------------------------------------------------\r\n# Drift detection\r\n# ---------------------------------------------------\r\nX_baseline = df_baseline[features].dropna().to_numpy()\r\nX_target = df_target[features].dropna().to_numpy()\r\n\r\nif X_target.size == 0:\r\n    return output_df\r\n\r\n# Initialize drift monitor using Kolmogorov-Smirnov test\r\n# https:\/\/docs.seldon.io\/projects\/alibi-detect\/en\/latest\/methods\/ksdrift.html\r\ncd = KSDrift(p_val=p_val, X_ref=X_baseline, alternative=\"two-sided\")\r\n\r\n# Get ranked list of features by drift (ranked by p-value)\r\npreds_h0 = cd.predict(X_target, return_p_val=True)\r\ndrift_by_feature = rank_feature_drift(preds_h0, features)<\/pre>\n<p>With this, we set the p-value threshold to the value specified for the pipeline run (0.05 by default). 
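The labeling step can be sketched against a stand-in predictions payload; the dictionary shape below mimics the kind of output a detector's predict call returns, and the numbers are illustrative:

```python
features = ["fips", "cases", "deaths"]

# Stand-in for the drift detector's predict() payload: per-feature p-values
# plus a threshold already corrected for the number of features (0.05 / 3 here).
preds = {"data": {"p_val": [0.80, 0.001, 0.20], "threshold": 0.05 / 3}}

# Label each feature's drift as statistically significant or not
is_significant = {
    feature: p < preds["data"]["threshold"]
    for feature, p in zip(features, preds["data"]["p_val"])
}
print(is_significant)  # {'fips': False, 'cases': True, 'deaths': False}
```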
We can then automatically label distribution drift for specific features as statistically significant or not in our results.<\/p>\n<p>Note that in this example with <em>The New York Times<\/em> COVID-19 by-county dataset we have no categorical variables, but the alibi-detect implementation of the K-S test would let us run drift detection on them without any additional preprocessing.<\/p>\n<h2>Accessing Monitoring Results<\/h2>\n<p>To access the results of the schema validation and distribution drift monitoring, we inspect the files written by the respective tasks.<\/p>\n<p>The status of the schema validation and distribution drift MLflow experiments (submitted as Databricks jobs) can be viewed via the links provided for each task in the Azure DevOps pipeline run (Figure 8).<\/p>\n<p><figure id=\"attachment_13256\" aria-labelledby=\"figcaption_attachment_13256\" class=\"wp-caption aligncenter\" ><a href=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_08.png\"><img decoding=\"async\" class=\"wp-image-13256 size-large\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_08-1024x510.png\" alt=\"Accessing the Databricks job from Azure pipeline run log\" width=\"640\" height=\"319\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_08-1024x510.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_08-300x149.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_08-768x382.png 768w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_08-1536x765.png 1536w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_08.png 1948w\" sizes=\"(max-width: 640px) 100vw, 640px\" \/><\/a><figcaption id=\"figcaption_attachment_13256\" 
class=\"wp-caption-text\"><strong>Figure 8<\/strong>. A link to the Azure Databricks run job status is provided in the output of the data drift monitoring steps defined by the data drift pipeline file.<\/figcaption><\/figure><\/p>\n<p>We can set the artifacts to be written either to Azure blob storage or directly to the Databricks file system (dbfs). In this example, we write directly to dbfs for easy access through the job summary in the Databricks workspace.<\/p>\n<p><figure id=\"attachment_13257\" aria-labelledby=\"figcaption_attachment_13257\" class=\"wp-caption aligncenter\" ><a href=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_09.png\"><img decoding=\"async\" class=\"wp-image-13257 size-large\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_09-1024x719.png\" alt=\"Previewing pipeline artifacts in Databricks job\" width=\"640\" height=\"449\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_09-1024x719.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_09-300x211.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_09-768x539.png 768w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_09.png 1472w\" sizes=\"(max-width: 640px) 100vw, 640px\" \/><\/a><figcaption id=\"figcaption_attachment_13257\" class=\"wp-caption-text\"><strong>Figure 9<\/strong>. By default, data drift monitoring results are stored as artifacts in each of the Databricks jobs when using the example code. 
Schema validation results and distribution drift results are stored separately in their respective jobs since they are designated as two separate entry points in the MLflow experiment and as two tasks in the pipeline.<\/figcaption><\/figure><\/p>\n<p>However, if you would like to instead write the files to Azure blob storage, you can uncomment\/comment the appropriate lines in <a href=\"https:\/\/github.com\/niwilso\/data-drift-monitor\/blob\/master\/.azure_pipelines\/data_drift.yml\">data_drift.yml<\/a>\u00a0to automatically route artifact uploading to blob.<\/p>\n<pre class=\"prettyprint\"># Keep the below line commented out if using dbfs, otherwise uncomment if using blob storage instead\r\n# - group: mlops-vg-storage\r\n...\r\n# Optional to write artifacts to blob storage\r\n# Comment out if using dbfs instead of blob storage\r\n# AZURE_STORAGE_ACCESS_KEY: $(AZURE_STORAGE_ACCESS_KEY)\r\n# AZURE_STORAGE_ACCOUNT_NAME: $(AZURE_STORAGE_ACCOUNT_NAME)\r\n# AZURE_STORAGE_CONTAINER_NAME: $(AZURE_STORAGE_CONTAINER_NAME)<\/pre>\n<p>Looking into the artifacts, we see a .json and a .csv file for schema validation and a single .csv file for the distribution drift results.<\/p>\n<p>The .json and .csv for schema validation contain the same information but are formatted slightly differently. Because the results are tabular, we output a .csv for both schema validation and distribution drift, which makes it easy to write them to SQL tables for the Philips project. 
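Writing both formats from one nested results dictionary can be sketched as follows; the field names and dictionary layout here are illustrative rather than the exact repository code:

```python
import json
import os
import tempfile

import pandas as pd

# Illustrative nested schema-validation results, keyed by group then feature.
results = {
    "schema_validation": {
        "state": {
            "Washington": {"cases": {"status": "valid", "n_vals": 128}},
            "Nebraska": {"cases": {"status": "invalid: value not an int", "n_vals": 93}},
        }
    }
}

out_dir = tempfile.mkdtemp()

# The .json keeps the nested shape for easy visual inspection.
with open(os.path.join(out_dir, "schema_validation.json"), "w") as f:
    json.dump(results, f, indent=2)

# The .csv flattens to one row per (group_value, feature) for SQL-friendly loading.
rows = [
    {"group_col": "state", "group_value": group, "feature": feature, **info}
    for group, feats in results["schema_validation"]["state"].items()
    for feature, info in feats.items()
]
pd.DataFrame(rows).to_csv(os.path.join(out_dir, "schema_validation.csv"), index=False)
print(len(rows))  # 2
```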
The .json file is a bit easier to visually parse for anyone interested in looking directly at the results without writing specific queries.<\/p>\n<p><figure id=\"attachment_13258\" aria-labelledby=\"figcaption_attachment_13258\" class=\"wp-caption aligncenter\" ><a href=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_10.png\"><img decoding=\"async\" class=\"wp-image-13258 size-large\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_10-546x1024.png\" alt=\"Example JSON output of schema validation\" width=\"546\" height=\"1024\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_10-546x1024.png 546w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_10-160x300.png 160w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_10.png 692w\" sizes=\"(max-width: 546px) 100vw, 546px\" \/><\/a><figcaption id=\"figcaption_attachment_13258\" class=\"wp-caption-text\"><strong>Figure 10<\/strong>. Example output of the schema validation portion of the data drift monitoring pipeline. 
In the JSON file, results are organized by state, where n_vals represents the total number of unique data points evaluated for each feature.<\/figcaption><\/figure><\/p>\n<p>For distribution drift, results are organized into multiple columns which also allow for insight into categorical variable distribution changes (Table 3).<\/p>\n<table width=\"715\">\n<thead>\n<tr>\n<td><strong>Column Name<\/strong><\/td>\n<td><strong>Type<\/strong><\/td>\n<td><strong>Description<\/strong><\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>group_col<\/td>\n<td>string<\/td>\n<td>Name of column used to group results (e.g., &#8220;state&#8221;).<\/td>\n<\/tr>\n<tr>\n<td>group_value<\/td>\n<td>string<\/td>\n<td>Value in group_col (e.g., &#8220;Nebraska&#8221;).<\/td>\n<\/tr>\n<tr>\n<td>feature<\/td>\n<td>string<\/td>\n<td>Name of feature that drift detection is being run on (e.g., &#8220;cases&#8221;).<\/td>\n<\/tr>\n<tr>\n<td>pValue<\/td>\n<td>float<\/td>\n<td>Threshold set for determining significance for Kolmogorov-Smirnov test on a given feature.<\/td>\n<\/tr>\n<tr>\n<td>isSignificantDrift<\/td>\n<td>boolean<\/td>\n<td>True or False on whether drift detection on a feature results in a p-value below the pValue threshold.<\/td>\n<\/tr>\n<tr>\n<td>baselineSamples<\/td>\n<td>integer<\/td>\n<td>The number of samples present in the baseline.<\/td>\n<\/tr>\n<tr>\n<td>baselineNullValues<\/td>\n<td>integer<\/td>\n<td>The number of null values in the baseline for this specific feature.<\/td>\n<\/tr>\n<tr>\n<td>baselineRemoved<\/td>\n<td>integer<\/td>\n<td>The number of rows removed in the baseline, based on presence of null in all features.<\/td>\n<\/tr>\n<tr>\n<td>baselineValues<\/td>\n<td>string<\/td>\n<td>If the feature is categorical, a list of all values present in the baseline (e.g., [yes, no, maybe])<\/td>\n<\/tr>\n<tr>\n<td>baselineValueCounts<\/td>\n<td>string<\/td>\n<td>If the feature is categorical, a list of counts for all values present in the baseline (e.g., [60, 30, 
10])<\/td>\n<\/tr>\n<tr>\n<td>baselineValuePercentages<\/td>\n<td>string<\/td>\n<td>If the feature is categorical, a list of proportions for all values present in the baseline (e.g., [0.6, 0.3, 0.1])<\/td>\n<\/tr>\n<tr>\n<td>targetNullValues<\/td>\n<td>integer<\/td>\n<td>The number of null values in the target for this specific feature.<\/td>\n<\/tr>\n<tr>\n<td>targetRemoved<\/td>\n<td>integer<\/td>\n<td>The number of rows removed in the target, based on presence of null in all features.<\/td>\n<\/tr>\n<tr>\n<td>targetValues<\/td>\n<td>string<\/td>\n<td>If the feature is categorical, a list of all values present in the target (e.g., [yes, no, maybe])<\/td>\n<\/tr>\n<tr>\n<td>targetValueCounts<\/td>\n<td>string<\/td>\n<td>If the feature is categorical, a list of counts for all values present in the target (e.g., [60, 30, 10])<\/td>\n<\/tr>\n<tr>\n<td>targetValuePercentages<\/td>\n<td>string<\/td>\n<td>If the feature is categorical, a list of proportions for all values present in the target (e.g., [0.6, 0.3, 0.1])<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><strong><em>Table 3<\/em><\/strong><em>. Distribution drift monitoring results are stored in a table where each row contains the results for a particular group\u2019s feature. 
In the case of The <\/em><em>New York Times COVID-19 dataset, a state or county can be set as the \u201cgroup\u201d and fips, cases, or deaths are the possible features.<\/em><\/p>\n<p>The columns baselineValues, baselineValueCounts, baselineValuePercentages, targetValues, targetValueCounts, and targetValuePercentages are all empty in this example as they are meant to contain data for categorical variables (Figure 11).<\/p>\n<p><figure id=\"attachment_13259\" aria-labelledby=\"figcaption_attachment_13259\" class=\"wp-caption aligncenter\" ><a href=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_11.png\"><img decoding=\"async\" class=\"wp-image-13259 size-large\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_11-1024x449.png\" alt=\"Example results of distribution drift monitoring\" width=\"640\" height=\"281\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_11-1024x449.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_11-300x132.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_11-768x337.png 768w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_11-1536x674.png 1536w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/10\/Figure_11.png 1902w\" sizes=\"(max-width: 640px) 100vw, 640px\" \/><\/a><figcaption id=\"figcaption_attachment_13259\" class=\"wp-caption-text\"><strong>Figure 11<\/strong>. 
Example results for distribution drift monitoring of <em>The New York Times<\/em> COVID-19 by county dataset when comparing a baseline period of January 21, 2020 \u2013 May 31, 2020 to a target period of August 1, 2020 \u2013 August 27, 2020.<\/figcaption><\/figure><\/p>\n<h2>Conclusion<\/h2>\n<p>Data drift monitoring is a key part of model maintenance that allows data scientists to identify changes in the source data that may be detrimental to model performance before retraining the model.<\/p>\n<p>In the context of the Philips Healthcare Informatics (HI) \/ Microsoft collaboration, the implementation of data drift monitoring into their MLOps allows the team to discover potential issues and contact the data source (e.g., a specific ICU) to address them before retraining the mortality model for the quarterly benchmark report. This saves the Philips team time by avoiding computationally intensive, time-consuming model retraining on problematic data, and it helps maintain model performance quality.<\/p>\n<p>By using the MLflow experiment framework to submit our Python scripts as jobs to Azure Databricks through Azure DevOps pipelines, we were able to integrate our data drift monitoring code as part of the broader MLOps solution.<\/p>\n<p>We hope this implementation and the provided example code empower others to begin integrating data drift monitoring into their MLOps solutions.<\/p>\n<h2>Acknowledgements<\/h2>\n<p>This work was a team effort and I would like to thank both the Microsoft team (MSFT) and the Philips Healthcare Informatics team for a great collaborative experience.<\/p>\n<p>I personally want to thank all of the following individuals for being great teammates and for their amazing work (listed in alphabetical order by last name): Omar Badawi (Philips), Denis Cepun (MSFT), Donna Decker (Philips), Jit Ghosh (MSFT), Brian Gottfried (Philips), Xingang Liu (Philips), Colin McKenna (Philips), Margaret Meehan (MSFT), Maysam 
Mokarian (MSFT), Federica Nocera (MSFT), Russell Rayner (Philips), Brian Reed (Philips), Samantha Rouphael (MSFT), Patty Ryan (MSFT), Tempest van Schaik (MSFT), Ashley Vernon (Philips), Galiya Warrier (MSFT), Nile Wilson (MSFT), Clemens Wolff (MSFT), and Yutong Yang (MSFT).<\/p>\n<h2>Resources<\/h2>\n<ul>\n<li><a href=\"https:\/\/github.com\/niwilso\/data-drift-monitor\">Clinical Data Drift Monitoring example code repository<\/a><\/li>\n<li><a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/machine-learning\/concept-model-management-and-deployment#what-is-mlops\">What is MLOps?<\/a><\/li>\n<li><a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/databricks\/scenarios\/what-is-azure-databricks\">What is Azure Databricks?<\/a><\/li>\n<li><a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/machine-learning\/how-to-monitor-datasets\">Azure Machine Learning Data Drift Monitor<\/a> (Note, this tool was still in development as we were creating our solution)<\/li>\n<li><a href=\"https:\/\/medium.com\/data-from-the-trenches\/a-primer-on-data-drift-18789ef252a6\">A Primer on Data Drift<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Hospitals around the world regularly work towards improving the health of their patients as well as ensuring there are enough resources available for patients awaiting care. During these unprecedented times with the COVID-19 pandemic, Intensive Care Units are having to make difficult decisions at a greater frequency to optimize patient health outcomes. 
The continuous collection [&hellip;]<\/p>\n","protected":false},"author":43597,"featured_media":13265,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,16,19],"tags":[60,3294,144,151,3295,239,3293],"class_list":["post-13245","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cse","category-devops","category-machine-learning","tag-azure","tag-data-drift","tag-databricks","tag-devops","tag-drift-monitoring","tag-machine-learning-ml","tag-mlops"],"acf":[],"blog_post_summary":"<p>Hospitals around the world regularly work towards improving the health of their patients as well as ensuring there are enough resources available for patients awaiting care. During these unprecedented times with the COVID-19 pandemic, Intensive Care Units are having to make difficult decisions at a greater frequency to optimize patient health outcomes. The continuous collection 
[&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/13245","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/43597"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=13245"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/13245\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/13265"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=13245"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=13245"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=13245"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}