{"id":2123,"date":"2016-12-14T16:00:00","date_gmt":"2016-12-14T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/reallifecode\/index.php\/2016\/12\/14\/regulating-sensor-error-in-wastewater-management-systems-with-machine-learning\/"},"modified":"2020-03-15T06:23:46","modified_gmt":"2020-03-15T13:23:46","slug":"regulating-sensor-error-in-wastewater-management-systems-with-machine-learning","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/regulating-sensor-error-in-wastewater-management-systems-with-machine-learning\/","title":{"rendered":"Regulating Sensor Error in Wastewater Management Systems with Machine Learning"},"content":{"rendered":"<h2 id=\"overview\">Overview<\/h2>\n<p>This code story outlines a novel method for differentiating between anomalies and expected outliers, using the Microsoft <a href=\"https:\/\/docs.microsoft.com\/en-gb\/azure\/machine-learning\/machine-learning-apps-anomaly-detection\">Anomaly Detection API<\/a> and binary classification to assist with time series filtering.<\/p>\n<h2 id=\"background\">Background<\/h2>\n<p><a href=\"http:\/\/www.carlsolutions.com\/\">Carl Data Solutions<\/a> provides a suite of software tools called <a href=\"http:\/\/www.flowworks.com\/\">Flow Works<\/a>, which municipalities use to help manage their wastewater infrastructure. These tools pull data from various sensor channels that measure variables such as water flow, velocity, and depth.<\/p>\n<p>These sensors sometimes malfunction or behave unexpectedly, causing skewed readings. Because forecasting models are built on top of these sensor readings, skewed data can reduce their accuracy. 
Currently, to account for these irregularities, municipalities that use Carl Data\u2019s Flow Works tools hire consultants to manually sift through all the sensor data and modify values believed to be caused by sensor error.<\/p>\n<p><strong>Sample Daily Sensor Readings with Tagged Anomalies.<\/strong>\n <img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2016\/12\/dailypattern.png\" alt=\"Image dailypattern\" width=\"1144\" height=\"511\" class=\"aligncenter size-full wp-image-11068\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2016\/12\/dailypattern.png 1144w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2016\/12\/dailypattern-300x134.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2016\/12\/dailypattern-1024x457.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2016\/12\/dailypattern-768x343.png 768w\" sizes=\"(max-width: 1144px) 100vw, 1144px\" \/><\/p>\n<p>Because this manual review is costly and time-consuming, Carl Data was interested in building an anomaly detection model to automate the identification of these errors.<\/p>\n<h2 id=\"the-problem\">The Problem<\/h2>\n<p>Since sensor errors occur infrequently, the training set compiled from Carl Data\u2019s curated logs contained many more \u201cclean\u201d values than \u201cdirty\u201d ones. When training samples are distributed this unevenly, traditional binary classifiers sometimes struggle to identify sensor errors because they are overwhelmed by examples from the majority (\u201cclean\u201d) class. 
In such cases, unsupervised time series outlier detection algorithms are often applied to locate irregular flow.<\/p>\n<p>However, unsupervised outlier detection frameworks such as Twitter\u2019s <a href=\"https:\/\/github.com\/twitter\/AnomalyDetection\">Anomaly Detection Package<\/a> or Microsoft\u2019s <a href=\"https:\/\/docs.microsoft.com\/en-gb\/azure\/machine-learning\/machine-learning-apps-anomaly-detection\">Anomaly Detection API<\/a>, while great for detecting irregularities in flow, cannot differentiate between expected irregular behavior, such as a peak caused by a flood, and sensor error. Additionally, these APIs only make batch classifications, which are slow for real-time detection purposes.<\/p>\n<h2 id=\"the-engagement\">The Engagement<\/h2>\n<p>Microsoft partnered with Carl Data to investigate how to build an anomaly detection model that could differentiate between these kinds of irregularities, and to put the model into production using Event Hubs and Power BI.<\/p>\n<h3 id=\"anomaly-detection-ml-methodology\">Anomaly Detection ML Methodology<\/h3>\n<p><strong>Model #1: Outlier Detection (Unsupervised)<\/strong><\/p>\n<ol>\n<li>Read in raw historical data from the velocity sensor channel.<\/li>\n<li>Read in the tagged anomalies from the curated velocity sensor channel data.<\/li>\n<li>Send the raw data to the Microsoft Anomaly Detection API to tag outliers.<\/li>\n<li>Score the outlier model by comparing the Anomaly Detection API results against the manually tagged anomalies.<\/li>\n<\/ol>\n<p><strong>Model #2: Binary Classifier (Supervised)<\/strong><\/p>\n<ol>\n<li>Read in raw historical data from the velocity sensor channel.<\/li>\n<li>Read in and merge the tagged anomalies from the curated velocity sensor channel data.<\/li>\n<li>For each timestamp, create a historical window of the previous four velocity channel readings.<\/li>\n<li>Create a train and test set from a random split on the historical windows.<\/li>\n<li>Train a random forest classifier on the train 
data.<\/li>\n<li>Benchmark the random forest on the test data.<\/li>\n<\/ol>\n<p><strong>Model #3: Hybrid Classifier (Differentiate Between Anomalies and Outliers)<\/strong><\/p>\n<ol>\n<li>Read in raw historical data from the velocity sensor channel.<\/li>\n<li>Read in the tagged anomalies from the curated velocity sensor channel data.<\/li>\n<li>Send the raw data to the Microsoft Anomaly Detection API to tag outliers.<\/li>\n<li>For each timestamp, create a historical window of the previous four velocity channel readings, using only the values marked as outliers.<\/li>\n<li>Create a train and test set from a random split on the historical windows.<\/li>\n<li>Train a random forest classifier on the train data.<\/li>\n<li>Benchmark the random forest on the test data.<\/li>\n<li>Benchmark the random forest on the entire velocity time series excluding the training set.<\/li>\n<\/ol>\n<h3 id=\"integration-methodology\">Integration Methodology<\/h3>\n<p> <img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2016\/12\/integrationarch.jpg\" alt=\"Image integrationarch\" width=\"887\" height=\"151\" class=\"aligncenter size-full wp-image-11070\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2016\/12\/integrationarch.jpg 887w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2016\/12\/integrationarch-300x51.jpg 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2016\/12\/integrationarch-768x131.jpg 768w\" sizes=\"(max-width: 887px) 100vw, 887px\" \/><\/p>\n<ol>\n<li>Push the channel data to the anomaly detection <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/event-hubs\/\">Event Hub<\/a> with window size n.<\/li>\n<li>On each new event, use the model built in the previous section to tag whether it is an anomaly.<\/li>\n<li>Push tagged channel data to the visualization <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/event-hubs\/\">Event 
Hub<\/a>.<\/li>\n<li>Use <a href=\"https:\/\/blogs.msdn.microsoft.com\/kaevans\/2015\/02\/26\/using-stream-analytics-with-event-hubs\/\">Stream Analytics<\/a> to ingest the visualization Event Hub.<\/li>\n<li>Import <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/stream-analytics\/stream-analytics-power-bi-dashboard\">Stream Analytics to Power BI<\/a> to visualize tagged anomalies.<\/li>\n<\/ol>\n<h3 id=\"results\">Results<\/h3>\n<p>During this engagement, we successfully built three models using a combination of the Microsoft Anomaly Detection API and an ensemble of random forests and logistic regression to identify sensor error.<\/p>\n<p>Though the Anomaly Detection API helped identify outliers for anomaly classification, in Carl Data\u2019s dataset anomalies and regular flow were linearly separable enough that a random forest binary classifier alone performed just as well as the hybrid approach that combined it with the Anomaly Detection API.<\/p>\n<p>Analysts sometimes also tag a couple of values around an anomalous sensor spike as anomalies, and the Anomaly Detection API has trouble tagging these neighboring values. 
However, in cases where sensor error is better represented in the data and anomalies are not as linearly separable, the hybrid method can yield more generalizable results than a binary classifier alone.<\/p>\n<p>The model we chose performed as follows, with high precision (99%) and recall (100%):<\/p>\n<p><strong>Model benchmarks produced with scikit-learn<\/strong> <img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2016\/12\/benchmarks.jpg\" alt=\"Image benchmarks\" width=\"505\" height=\"127\" class=\"aligncenter size-full wp-image-11067\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2016\/12\/benchmarks.jpg 505w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2016\/12\/benchmarks-300x75.jpg 300w\" sizes=\"(max-width: 505px) 100vw, 505px\" \/><\/p>\n<h2 id=\"code\">Code<\/h2>\n<p>You can find the notebook and code for implementing this methodology <a href=\"https:\/\/github.com\/CatalystCode\/Channel-Sensor-Error-Detection\">on GitHub<\/a>.<\/p>\n<h2 id=\"opportunities-for-reuse\">Opportunities for Reuse<\/h2>\n<p>The methodology in this code story highlights how time series machine learning can be applied in an underrepresented domain such as wastewater management.<\/p>\n<p>Additionally, as the field of IoT matures from data aggregation to predictive intelligence, it becomes increasingly critical to differentiate between anomalies caused by sensor error and expected outliers. 
As a result, the approach outlined in this code story will be helpful for such use cases.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A tutorial for building a real-time sensor anomaly detector for use in municipal wastewater treatment systems.<\/p>\n","protected":false},"author":21353,"featured_media":11069,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[19],"tags":[43,101,114,180,216,248,321,354],"class_list":["post-2123","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-anomaly-detection","tag-binary-classification","tag-carl-data-solutions","tag-flow-works","tag-iot","tag-microsoft-anomaly-detection-api","tag-sensors","tag-time-series-filtering"],"acf":[],"blog_post_summary":"<p>A tutorial for building a real-time sensor anomaly detector for use in municipal wastewater treatment systems.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/2123","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/21353"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=2123"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/2123\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/11069"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=2123"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=21
23"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=2123"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}