{"id":2179,"date":"2015-07-21T16:34:28","date_gmt":"2015-07-21T23:34:28","guid":{"rendered":"https:\/\/www.microsoft.com\/reallifecode\/index.php\/2015\/07\/21\/prediction-of-diabetes-hypoglycemic-events\/"},"modified":"2020-03-19T09:40:08","modified_gmt":"2020-03-19T16:40:08","slug":"prediction-of-diabetes-hypoglycemic-events","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/prediction-of-diabetes-hypoglycemic-events\/","title":{"rendered":"Prediction of diabetes hypoglycemic events"},"content":{"rendered":"<p>Our customer develops connected blood glucose meters to provide innovative diabetes solutions to its patients. Using these meters, they\u2019re able to store and archive patients\u2019 data for further analysis. Furthermore, their connected device allows them to provide real-time analysis of the measured glucose values and make predictions about an upcoming hypoglycemic (hypo) event. Considering the traumatic experience of a diabetes hypo and its associated cost, being able to alert patients about a possible upcoming hypo can be of tremendous value to them.<\/p>\n<p>This case study describes the approach we took to create a Microsoft Azure Machine Learning (MAML) model which predicts diabetes hypos based on blood glucose measurements only.<\/p>\n<h2 id=\"overview-of-the-solution\">Overview of the Solution<\/h2>\n<p>Figure 1 shows an overview of our approach:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/2015-07-21-Prediction-of-diabetes-hypoglycemic-events_images-image001.png\" alt=\"Architecture\" \/><\/p>\n<p>Figure 1: overview of the approach<\/p>\n<ul>\n<li>We extracted the relevant information from the operational system and stored it as a comma-separated values (csv) file.<\/li>\n<li>We used Python to transform the data and create the dataset containing the needed features and labels.<\/li>\n<li>After uploading the dataset to MAML, we started to 
build experiments and evaluate their results.<\/li>\n<\/ul>\n<p>Development comprised multiple iterations; in each iteration, we refined the Python script for feature and label creation, then rebuilt and re-evaluated the MAML model.<\/p>\n<p>Once the model was sufficiently accurate, we published it as a Web Service, which was then integrated into the real-time data pipeline.<\/p>\n<h2 id=\"implementation\">Implementation<\/h2>\n<p>The first step was to extract the historical measurements and store them in a csv file. This file contains the following four columns:<\/p>\n<div class=\"highlighter-rouge\">\n<pre class=\"highlight\"><code>anonymized patient id, diabetes type, timestamp, glucose value\r\n<\/code><\/pre>\n<\/div>\n<p>Our initial idea was to predict a hypo event based on previous measurements. To do so, we translated the data into a time series of measurements and used this to train our machine learning algorithm to predict the glucose value for a specific hour. Not surprisingly, this approach didn\u2019t yield useful results, mainly because we were lacking critical data: to predict the actual glucose value for a specific hour\/timeslot, we would require additional details only available from other data, such as information about insulin injections and food\/drink intake.<\/p>\n<p>While analyzing the data, we also realized that it is better to separate the different diabetes types: while we have an average of 2-3 daily measurements for patients of diabetes type 2, many patients of diabetes type 1 measure their glucose value more frequently, which led to a dataset where type 1 measurements made up 90% of the data.<\/p>\n<p>We also decided to do a binary classification to predict whether a hypo event might occur within the next 24 hours (instead of the initial approach of using linear regression to predict the glucose value for a specific hour). 
Such a prediction is especially useful to patients of diabetes type 2, who measure their glucose value only a few times a day, so knowing they\u2019re at risk might help them manage their blood glucose values more closely over the following 24 hours.<\/p>\n<p>We translated the raw, extracted csv file into a new dataset, formatted to make predictions based on a time series of historical data. In our case, the final dataset contained the following information:<\/p>\n<ul>\n<li>patient id, diabetes type, measured glucose value<\/li>\n<li>sequence number of the measurement,\nstarting at 1 for each patient; this is used to split the dataset into a training and validation set without \u201cdestroying\u201d the time series (e.g. use 1-2000 for training and 2001-3000 for validation)<\/li>\n<li>the label: did a hypo occur within the next 24 hours? (a hypo is defined as a glucose value &lt; 4.0 mmol\/L)<\/li>\n<li>the high, low, and average glucose values across the last 3 measurements<\/li>\n<li>the time between the current and the third-last measurement in minutes<\/li>\n<li>the high, low, and average glucose value each day for the last 7 days<\/li>\n<li>the number of measurements within the last 7 days<\/li>\n<\/ul>\n<p>A reusable Python script for creating such time series datasets has been published to <a href=\"http:\/\/github.com\">GitHub<\/a>:<\/p>\n<p><a href=\"https:\/\/github.com\/cloudbeatsch\/CreateTimeSeriesData\">https:\/\/github.com\/cloudbeatsch\/CreateTimeSeriesData<\/a><\/p>\n<p>The CreateTimeSeriesData.py script provides the following inputs to control the generation of the time series dataset:<\/p>\n<div class=\"highlighter-rouge\">\n<pre class=\"highlight\"><code>usage: CreateTimeSeriesData.py [-h] [-i ID] [-t TIMESTAMP] [-v VALUE]\r\n                               [-a [ADDITIONAL_COLS [ADDITIONAL_COLS ...]]]\r\n                               inputCSV outputCSV threshold\r\n                               trigger_window_size datapoints slots slot_size\r\n\r\npositional 
arguments:\r\n  inputCSV              path to input csv file\r\n  outputCSV             path to output csv file\r\n  threshold             threshold of positive event\r\n  trigger_window_size   trigger window size in seconds\r\n  datapoints            nr of latest datapoints\r\n  slots                 nr of slots in time series\r\n  slot_size             slot size in seconds\r\n<\/code><\/pre>\n<\/div>\n<p>Figure 2 visualizes the core concepts of the script: the trigger window, the last data points, and the slots:<\/p>\n<ul>\n<li><strong>trigger_window_size<\/strong> defines how many seconds we search forward to find values which are below the <strong>threshold<\/strong> value. If we find a value that is below the threshold, we set the value of the output column <em>IsTriggered<\/em> to 1; in all other cases we set it to 0. Note: <em>IsTriggered<\/em> will become our label for training the machine learning algorithm.<\/li>\n<li><strong>datapoints<\/strong> defines how many of the latest data points will be added to the output dataset.<\/li>\n<li><strong>slots<\/strong> defines the number of slots we aggregate and add to the output dataset. 
The length of a slot is defined by the <strong>slot_size<\/strong> in seconds.<\/li>\n<\/ul>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/2015-07-21-Prediction-of-diabetes-hypoglycemic-events_images-image002.png\" alt=\"Architecture\" \/><\/p>\n<p>Figure 2: the concepts of trigger window, last data points and slots<\/p>\n<p>To create our required format, we ran the script using the following arguments:<\/p>\n<ul>\n<li>input.csv and output.csv filenames<\/li>\n<li>a threshold of 4 with a trigger window size of 24 hours (86400 seconds)<\/li>\n<li>adding the 3 last data points<\/li>\n<li>adding 7 slots of 24 hours each (86400 seconds)<\/li>\n<li>using the column called ID as the entity key (patient id)<\/li>\n<li>using the column called ValueMmol as the measurement value<\/li>\n<li>adding one additional column (DiabetesTypeValue) from the input dataset to the output dataset<\/li>\n<\/ul>\n<div class=\"highlighter-rouge\">\n<pre class=\"highlight\"><code>python CreateTimeSeriesData.py\r\n    input.csv output.csv 4 86400 3 7 86400 --id=ID --value=ValueMmol\r\n    -a DiabetesTypeValue\r\n<\/code><\/pre>\n<\/div>\n<p>We uploaded the created dataset to MAML and created the experiment shown in Figure 3.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/2015-07-21-Prediction-of-diabetes-hypoglycemic-events_images-image003.jpg\" alt=\"Architecture\" \/><\/p>\n<p>Figure 3: Experiment predicting hypo using MAML<\/p>\n<p>The actual experiment contains two models: one for diabetes type 1 and one for diabetes type 2. We split the patients into a set for training and another set for model evaluation. This guarantees that we evaluate the model against data it hasn\u2019t seen before (in this case, patients who were not part of the training dataset). We also use a parameter sweep to find the best model parameters. 
To do so, we split the training dataset into a sweep training and a sweep validation set, using the sequence numbers. This ensures that the algorithm can learn about sequence patterns in the data. Of the algorithms we evaluated, the Two-Class Boosted Decision Tree yielded the best results.<\/p>\n<p>We\u2019re able to correctly predict ~35% of all hypos with only ~3% of our hypo predictions being a \u201cfalse alarm\u201d (see Figure 4). This makes it a useful tool to help the patients avoid more than 1\/3 of hypos.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/2015-07-21-Prediction-of-diabetes-hypoglycemic-events_images-image004.jpg\" alt=\"Architecture\" \/><\/p>\n<p>Figure 4: Model evaluation using MAML<\/p>\n<p>While 35% isn\u2019t an amazing result yet, it\u2019s a great start to helping people with diabetes manage their care.<\/p>\n<p>Because the model is not tied to a specific patient, its benefit can be made available to existing and new patients, without going through a lengthy learning phase.<\/p>\n<h2 id=\"challenges\">Challenges<\/h2>\n<p>Given the data at hand, it took some time to understand the type of questions we are able to answer:<\/p>\n<p><em>\u201cWhat is the chance that a hypo occurs within the next 24 hours?\u201d<\/em>\nversus\n<em>\u201cWhat will the glucose value be in 3 hours?\u201d<\/em><\/p>\n<p>While creating the time series data, it was crucial not to leak any information about the label into its features. For instance, we had one dataset that yielded great results, which turned out to be too good to be true: the creation of the dataset had an error, and we had leaked the hypo information into one of the features.<\/p>\n<p>We required many cycles to arrive at a dataset which had the right features and yielded the results we wanted. 
The described Python script made it quite straightforward to experiment with different datasets and algorithms.<\/p>\n<h2 id=\"opportunities-for-reuse\">Opportunities for Reuse<\/h2>\n<p>The described approach of transforming data into time series for machine learning is widely applicable. The published Python script can be used and adapted to translate csv files containing series of events into a dataset that can be effectively used within MAML. Depending on the quality of data and the available time windows, such data can be used for regression and\/or for classification tasks.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Creating a Microsoft Azure Machine Learning (MAML) model which predicts diabetics&#8217; hypoglycemic events based on blood glucose measurements alone.<\/p>\n","protected":false},"author":21354,"featured_media":12709,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[11],"tags":[83,152,205],"class_list":["post-2179","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-big-data","tag-azure-maml","tag-diabetes","tag-hypoglycemia"],"acf":[],"blog_post_summary":"<p>Creating a Microsoft Azure Machine Learning (MAML) model which predicts diabetics&#8217; hypoglycemic events based on blood glucose measurements 
alone.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/2179","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/21354"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=2179"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/2179\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/12709"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=2179"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=2179"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=2179"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}