{"id":8626,"date":"2018-07-05T11:59:30","date_gmt":"2018-07-05T18:59:30","guid":{"rendered":"https:\/\/www.microsoft.com\/developerblog\/?p=8626"},"modified":"2020-03-20T07:29:07","modified_gmt":"2020-03-20T14:29:07","slug":"satellite-images-segmentation-sustainable-farming","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/satellite-images-segmentation-sustainable-farming\/","title":{"rendered":"Satellite Images Segmentation and Sustainable Farming"},"content":{"rendered":"<h2>Introduction<\/h2>\n<p>Sustainability in agriculture is crucial to safeguard natural resources and ensure a healthy planet for future generations. To assist farmers, ranchers, and forest landowners in the adoption and implementation of sustainable farming practices, organizations like the\u00a0<a href=\"https:\/\/www.nrcs.usda.gov\/wps\/portal\/nrcs\/site\/national\/home\">NRCS<\/a> (Natural Resources Conservation Services) provide\u00a0technical and financial assistance, as well as conservation planning\u00a0for landowners making conservation improvements to their land.<\/p>\n<p>Central to efforts in sustainable farming is the process of map labeling. This process entails reviewing satellite images to determine how farmers are implementing sustainable practices. The task involves recognizing and marking visible evidence of practices such as the presence of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Filter_strip\">filter strips<\/a> and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Riparian_buffer\">riparian buffers<\/a> \u2013 i.e., vegetated tracts and strips of land utilized to protect water sources. Creating and maintaining a comprehensive map of sustainability practices enables experts to monitor conservation efforts over time, while also helping to identify areas that need special attention and follow-up.<\/p>\n<p>However, such\u00a0map labeling today still requires a manual and tedious task that teams in <a href=\"https:\/\/www.nrcs.usda.gov\/wps\/portal\/nrcs\/site\/national\/home\">NRCS<\/a>\u00a0have to tackle daily analyzing\u00a0a complex array of geospatial data sources. To perform appropriate labeling of filter strips and riparian buffers, for example, conservation specialists must closely examine images of fields and water sources, determining whether the width and border appear\u00a0consistent with markings of such vegetated land conservation techniques.<\/p>\n<p>To help in this effort, Microsoft partnered with Land O\u2019Lakes SUSTAIN, which collaborates with farmers to help them improve sustainability outcomes using the latest best practices, including those recommended by NRCS. 
\u00a0Together, we explored ways of automating these map labeling tasks.\u00a0In particular, we focused our efforts on labeling\u00a0<a href=\"http:\/\/www.mda.state.mn.us\/protecting\/conservation\/practices\/waterway.aspx\">waterways<\/a>,\u00a0<a href=\"http:\/\/www.mda.state.mn.us\/protecting\/conservation\/practices\/terrace.aspx\">terraces<\/a>,\u00a0<a href=\"http:\/\/www.mda.state.mn.us\/protecting\/conservation\/practices\/wscob.aspx\">water and sediment control basins<\/a>, and <a href=\"http:\/\/www.mda.state.mn.us\/protecting\/conservation\/practices\/fieldborder.aspx\">field borders<\/a>.<\/p>\n<p>Our aim was to find a solution using two machine learning approaches: a\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1505.04597\">Unet<\/a>-based segmentation model and a\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1703.06870\">Mask RCNN<\/a>-based instance segmentation model.\u00a0 In this blog post we&#8217;ll provide details on how we prepared the data, trained these models, and compared their performance.\u00a0 Our findings, we hope, will improve efficiency for all conservation specialists engaged in map-labeling techniques using satellite imagery analysis. The corresponding code can be found in this <a href=\"https:\/\/github.com\/olgaliak\/segmentation-unet-maskrcnn\">GitHub repo<\/a>.<\/p>\n<h2>Data<\/h2>\n<p>We used GeoSys satellite imagery for the following 4 Iowa counties: Tama, Benton, Iowa, and Poweshiek.\n<img decoding=\"async\" class=\"alignnone wp-image-8631\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/counties-1024x601.jpg\" alt=\"\" width=\"800\" height=\"469\" \/><\/p>\n<p>To get the training dataset, the aerial imagery was labeled manually using a desktop ArcGIS tool. The images were then split into tiles of 224&#215;224 pixels. Our goal was for each class to have at least 1000 corresponding tiles.\nData for 7 sustainable practices was prepared (see the description below). 
For the development of the proof of concept (POC) machine learning model, we focused on the 4 classes that have the most labeled data:<\/p>\n<ul style=\"margin-left: .375in; direction: ltr; margin-top: 0in; margin-bottom: 0in;\" type=\"disc\">\n<li style=\"margin-top: 0; margin-bottom: 0; vertical-align: middle;\"><span style=\"font-family: Calibri; font-size: 11.0pt;\">Grassed waterways (5.7K manually labeled tiles)<\/span><\/li>\n<li style=\"margin-top: 0; margin-bottom: 0; vertical-align: middle;\"><span style=\"font-family: Calibri; font-size: 11.0pt;\">Terraces (2.7K tiles)<\/span><\/li>\n<li style=\"margin-top: 0; margin-bottom: 0; vertical-align: middle;\"><span style=\"font-family: Calibri; font-size: 11.0pt;\">Water and Sediment Control Basins or WSBs (1K tiles) <\/span><\/li>\n<li style=\"margin-top: 0; margin-bottom: 0; vertical-align: middle;\"><span style=\"font-family: Calibri; font-size: 11.0pt;\">Field Borders (1K tiles).<\/span><\/li>\n<\/ul>\n<p><img decoding=\"async\" class=\"wp-image-8633 alignnone\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/7_practices.png\" alt=\"\" width=\"800\" height=\"461\" \/>\nThe training data consisted of:<\/p>\n<ul>\n<li>image tiles of 224&#215;224 pixels<\/li>\n<li>corresponding labels (masks) outlining the regions of interest.<\/li>\n<\/ul>\n<p>The goal was to train a model able to detect the outlines of these land use practices and classify them correctly.\u00a0 For example, in the image below we wanted to detect waterways and contour buffer strips:<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-8634\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/sample_tile.png\" alt=\"\" width=\"500\" height=\"163\" \/>\nHere is a sample <a href=\"https:\/\/github.com\/olgaliak\/segmentation-unet-maskrcnn\/tree\/master\/data\">small dataset<\/a>: it has 10 labeled images per class and gives a sense of the data we were using.<\/p>\n<h2 id=\"dataprep\">Data preparation<\/h2>\n<p>We augmented the dataset by flipping the images and rotating them by 90 degrees. For training we used 70% of the data; the remaining 30% was saved for model evaluation.<\/p>
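\n<p>As an illustration, below is a minimal numpy sketch of this augmentation and split. It is not the exact pipeline from our repo; the in-memory tile\/mask arrays and the fixed random seed are assumptions made for the example.<\/p>\n<pre>
import numpy as np

def augment(tile, mask):
    # Each (tile, mask) pair yields flipped and 90-degree-rotated
    # variants; the mask always gets the same transform as the tile.
    return [
        (tile, mask),
        (np.fliplr(tile), np.fliplr(mask)),
        (np.flipud(tile), np.flipud(mask)),
        (np.rot90(tile), np.rot90(mask)),
    ]

def split_train_eval(tiles, masks, train_fraction=0.7, seed=42):
    # Shuffle once, keep 70% for training, save 30% for evaluation.
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(tiles))
    cut = int(train_fraction * len(tiles))
    train, test = idx[:cut], idx[cut:]
    return (tiles[train], masks[train]), (tiles[test], masks[test])
<\/pre>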
\n<p>The\u00a0diagram\u00a0below shows the overall data distribution across classes:<\/p>\n<p><img decoding=\"async\" class=\"alignnone size-full wp-image-8663\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/5af37553ae2cf_unballanced.png\" alt=\"\" width=\"387\" height=\"248\" \/><\/p>\n<p>After manually examining the tiles, we noticed that even for a human it&#8217;s not always possible to distinguish the land use types just by looking at an aerial image.<\/p>\n<p>The example below highlights the challenge.\u00a0 On the left we have input tiles, followed by masks for the sustainability classes that a human expert identified in each image. Each sustainability class has its own color coding (shades of green for classes 1 and 2 versus blue masks for classes 3 and 4).\u00a0 If we look at the first example, it appears that the top half of the image depicts a single sustainability practice; in reality, however, those are two different practices. There is a similar challenge with the second example: the central green arch belongs to class 1, but is easily confused with the lower arch, which belongs to class 4.<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-8640\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/confusuing-classes.png\" alt=\"\" width=\"800\" height=\"364\" \/><\/p>\n<p><a href=\"https:\/\/www.nrcs.usda.gov\/wps\/portal\/nrcs\/site\/national\/home\">NRCS<\/a> experts rely heavily on <a href=\"https:\/\/en.wikipedia.org\/wiki\/Digital_elevation_model\">DEM<\/a> (hill shade) data when analyzing sustainability practices. For example, a terrace is a combination of a ridge and a channel, and contour buffer strips go around the hill slope.<\/p>\n<p>So, we added hill shade data to the dataset and applied the same data augmentation techniques to it as well.<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-8641\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/with_hill.png\" alt=\"\" width=\"400\" height=\"240\" \/><\/p>\n<h2>Model training infrastructure<\/h2>\n<p>We used\u00a0<a href=\"https:\/\/keras.io\/\">Keras<\/a>\u00a0with a\u00a0<a href=\"https:\/\/www.tensorflow.org\/\">Tensorflow<\/a> backend to train and evaluate models.<\/p>\n<p>When training deep learning models it&#8217;s convenient to use hardware with GPUs.\u00a0 Provisioning an on-demand <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/machine-learning\/data-science-virtual-machine\/deep-learning-dsvm-overview\">Azure Deep Learning Virtual Machine<\/a>\u00a0or <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/virtual-machines\/windows\/sizes-gpu\">Azure N-series Virtual Machines<\/a> proved to be very useful.<\/p>\n<h2>Method<\/h2>\n<p>To evaluate the\u00a0feasibility of identifying and classifying sustainable farming practices we took the 2 most promising approaches:<\/p>\n<ul style=\"margin-left: .375in; direction: ltr; margin-top: 0in; margin-bottom: 0in;\" type=\"disc\">\n<li style=\"margin-top: 0; margin-bottom: 0; vertical-align: middle;\"><a href=\"https:\/\/arxiv.org\/abs\/1505.04597\"><span style=\"font-family: Calibri; font-size: 11.0pt;\">Unet<\/span><\/a><span style=\"font-family: Calibri; font-size: 11.0pt;\">-based segmentation model<\/span><\/li>\n<li style=\"margin-top: 0; margin-bottom: 0; vertical-align: middle;\"><a href=\"https:\/\/arxiv.org\/abs\/1703.06870\"><span style=\"font-family: Calibri; font-size: 11.0pt;\">Mask RCNN<\/span><\/a><span style=\"font-family: Calibri; font-size: 11.0pt;\">-based instance segmentation model<\/span><\/li>\n<\/ul>\n<h3>Introduction to Unet<\/h3>\n<p><a href=\"https:\/\/arxiv.org\/abs\/1505.04597\">U-Net<\/a> is designed like an\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Autoencoder\">auto-encoder<\/a>. 
It has an encoding path (\u201ccontracting\u201d) paired with a decoding path (\u201cexpanding\u201d), which gives it the \u201cU\u201d shape.<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-8646\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/Unet.png\" alt=\"\" width=\"600\" height=\"415\" \/><\/p>\n<p>However, in contrast to the autoencoder, U-Net predicts a pixelwise segmentation map of the input image rather than classifying the input image as a whole. For each pixel in the original image, it asks: \u201cTo which class does this pixel belong?\u201d U-Net passes the feature maps from each level of the contracting path over to the analogous level in the expanding path.\u00a0 These skip connections are similar to the residual connections in a <a href=\"https:\/\/arxiv.org\/abs\/1512.03385\">ResNet<\/a>-type model and allow the classifier to consider features at various scales and complexities when making its decision.<\/p>
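\n<p>To make the structure concrete, here is a deliberately tiny Keras sketch of the idea, two levels deep with one convolution per level. The model we actually trained is larger (see <a href=\"https:\/\/github.com\/olgaliak\/segmentation-unet-maskrcnn\/blob\/master\/unet\/model.py\">model.py<\/a>); the layer sizes here are illustrative only.<\/p>\n<pre>
from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D, concatenate

def tiny_unet(input_shape=(224, 224, 3), n_classes=2):
    inputs = Input(input_shape)
    # Contracting path: convolutions followed by downsampling.
    c1 = Conv2D(32, 3, activation='relu', padding='same')(inputs)
    p1 = MaxPooling2D(2)(c1)
    c2 = Conv2D(64, 3, activation='relu', padding='same')(p1)
    p2 = MaxPooling2D(2)(c2)
    # Bottleneck at the bottom of the U.
    b = Conv2D(128, 3, activation='relu', padding='same')(p2)
    # Expanding path: upsample, then concatenate the matching encoder features.
    u2 = concatenate([UpSampling2D(2)(b), c2])   # skip connection
    c3 = Conv2D(64, 3, activation='relu', padding='same')(u2)
    u1 = concatenate([UpSampling2D(2)(c3), c1])  # skip connection
    c4 = Conv2D(32, 3, activation='relu', padding='same')(u1)
    # Pixelwise output: one sigmoid channel per class.
    outputs = Conv2D(n_classes, 1, activation='sigmoid')(c4)
    return Model(inputs, outputs)
<\/pre>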
\n<h3>Introduction to Mask RCNN<\/h3>\n<p><a href=\"https:\/\/arxiv.org\/abs\/1703.06870\">Mask RCNN<\/a> (Mask Region-based CNN) is an extension of <a href=\"https:\/\/arxiv.org\/abs\/1506.01497\">Faster R-CNN<\/a> that adds a branch for predicting an object mask in parallel with the existing branch for object detection. This <a href=\"https:\/\/blog.athelas.com\/a-brief-history-of-cnns-in-image-segmentation-from-r-cnn-to-mask-r-cnn-34ea83205de4\">blog post<\/a>\u00a0by Dhruv Parthasarathy contains a nice overview of the evolution of image segmentation\u00a0approaches, while\u00a0<a href=\"https:\/\/engineering.matterport.com\/splash-of-color-instance-segmentation-with-mask-r-cnn-and-tensorflow-7c761e238b46\">this blog<\/a>\u00a0by Waleed Abdulla explains Mask RCNN\u00a0well.<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-8647\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/MaskRcnn-e1525901793502.png\" alt=\"\" width=\"500\" height=\"225\" \/><\/p>\n<h3>Metrics and loss functions<\/h3>\n<p>Our primary metrics for model evaluation were the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Jaccard_index\">Jaccard Index<\/a> and the <a href=\"https:\/\/en.wikipedia.org\/wiki\/S%C3%B8rensen%E2%80%93Dice_coefficient\">Dice Similarity Coefficient<\/a>. Both measure how close a predicted mask is to the manually marked one, ranging from 0 (no overlap) to 1 (complete congruence).<\/p>\n<p>The Jaccard Index is the more intuitive of the two: the ratio of the intersection to the union:<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-8649\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/5af36b9f205da_Jaccard.png\" alt=\"\" width=\"400\" height=\"75\" \/><\/p>\n<p>The Dice Coefficient is another popular metric; it is numerically less sensitive to mismatch when there is a reasonably strong overlap:<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-8650\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/5af36ba0642d4_Dice.png\" alt=\"\" width=\"202\" height=\"75\" \/><\/p>\n<p>Regarding loss functions, we started out using classical <a href=\"https:\/\/en.wikipedia.org\/wiki\/Cross_entropy#Cross-entropy_error_function_and_logistic_regression\">Binary Cross Entropy<\/a> (BCE), which is available as a prebuilt loss function in <a href=\"https:\/\/keras.io\/losses\/\">Keras<\/a>.<\/p>\n<p>Inspired by this <a href=\"https:\/\/github.com\/killthekitten\/kaggle-carvana-2017\">repo<\/a> related to <a href=\"https:\/\/www.kaggle.com\/c\/carvana-image-masking-challenge\">Kaggle&#8217;s Carvana challenge<\/a>, we explored incorporating the Dice Similarity Coefficient into the loss function:\n<img decoding=\"async\" class=\"alignnone wp-image-8651\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/5af36ba1a95c7_BCE_loss.png\" alt=\"\" width=\"314\" height=\"75\" \/><\/p>
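\n<p>In Keras these can be written as small backend functions. The sketch below is one common formulation rather than a copy of our exact code (which lives in the repo); the smoothing constant that guards against division by zero, and the particular way of combining BCE with the Dice term, are assumptions of this example.<\/p>\n<pre>
from keras import backend as K

def dice_coef(y_true, y_pred, smooth=1.0):
    # Dice = 2 * |A intersect B| / (|A| + |B|), on flattened masks.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def jaccard_index(y_true, y_pred, smooth=1.0):
    # Jaccard = |A intersect B| / |A union B|.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    union = K.sum(y_true_f) + K.sum(y_pred_f) - intersection
    return (intersection + smooth) / (union + smooth)

def bce_dice_loss(y_true, y_pred):
    # BCE plus a Dice penalty: a low loss requires both per-pixel
    # accuracy and good overall overlap with the labeled mask.
    return K.mean(K.binary_crossentropy(y_true, y_pred)) + (1.0 - dice_coef(y_true, y_pred))
<\/pre>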
src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/5af372bf35767_input_poc.jpg\" alt=\"\" width=\"224\" height=\"224\" \/><\/p>\n<p>Below are the prediction results for a simple 2-class model trained from scratch on just a few hundred tiles:<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-8657\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/5af372c083674_poc__lol_res.png\" alt=\"\" width=\"500\" height=\"451\" \/><\/p>\n<p>Here we can see that the label is not perfect in the image: there is a field border on the right and something that looks very similar to a field border at the bottom (though the bottom instance is not labeled). The model detects both right and bottom borders.<\/p>\n<h3>Using hill shade data<\/h3>\n<p>As mentioned earlier, DEM info is very handy when detecting sustainable images. We converted hill shade data to a grayscale image and added this info as an additional 4th input channel.<\/p>\n<h3 id=\"unetpreinit\">Pre-initializing weights for Unet<\/h3>\n<p>In the above example we&#8217;re training Unet &#8220;from scratch&#8221; on our data.\u00a0 However smart weights initialization usually saves training time and positively affects the results (see <a href=\"https:\/\/arxiv.org\/abs\/1801.05746\">TernausNet<\/a> for more details on how\u00a0U-Net type architecture can be improved by the use of the pre-trained encoder).<\/p>\n<p>We trained Unet on 4 channel input: 3 channels were used for RGB input and the 4th channel was used for hill shade data.\u00a0 Weights for the first 3 channels are initialized from VGG 16 model pre-trained on the\u00a0<a href=\"http:\/\/www.image-net.org\/\">ImageNet<\/a> dataset. We were leaving the 4th channel initialized with zeroes &#8212; further improvements might include experimentation with various <a href=\"https:\/\/arxiv.org\/abs\/1704.08863\">weights initialization techniques<\/a>. See <a href=\"https:\/\/github.com\/olgaliak\/segmentation-unet-maskrcnn\/blob\/master\/unet\/model.py\">model.py<\/a> for more details.<\/p>\n<p>Below are visual comparisons of the results: manual label (left), training Unet from scratch (middle), training Unet with leveraging VGG16 weights pre-trained on <a href=\"http:\/\/www.image-net.org\">ImageNet<\/a>\u00a0(right).<\/p>\n<p>Input raw image:<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-8659\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/lol_raw_ww_terr.png\" alt=\"\" width=\"224\" height=\"221\" \/><\/p>\n<p>Results: as expected, the Unet model that uses pre-trained VGG16 can learn much faster.<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-8658\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/lol_pretrained.png\" alt=\"\" width=\"800\" height=\"419\" \/><\/p>\n<h3>Training with Mask-RCNN<\/h3>\n<p>In general, a significant number of labeled images are required to train a deep learning model from scratch.\u00a0 We experimented with training a MaskRCNN model from scratch and the results were not promising at all after 48 hours of training (1 Titan Xp GPU).<\/p>\n<p>To overcome the challenge and save training time we again used\u00a0transfer learning. 
\n<p>Below is a visual comparison of the results: the manual label (left), Unet trained from scratch (middle), and Unet trained leveraging VGG16 weights pre-trained on <a href=\"http:\/\/www.image-net.org\">ImageNet<\/a>\u00a0(right).<\/p>\n<p>Input raw image:<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-8659\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/lol_raw_ww_terr.png\" alt=\"\" width=\"224\" height=\"221\" \/><\/p>\n<p>Results: as expected, the Unet model that uses the pre-trained VGG16 weights learns much faster.<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-8658\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/lol_pretrained.png\" alt=\"\" width=\"800\" height=\"419\" \/><\/p>\n<h3>Training with Mask-RCNN<\/h3>\n<p>In general, a significant number of labeled images is required to train a deep learning model from scratch.\u00a0 We experimented with training a MaskRCNN model from scratch, and the results were not promising at all after 48 hours of training (1 Titan Xp GPU).<\/p>\n<p>To overcome this challenge and save training time we again used\u00a0transfer learning. We re-used the\u00a0<a href=\"https:\/\/github.com\/matterport\/Mask_RCNN\/releases\">Mask RCNN model<\/a> pre-trained on the\u00a0<a href=\"http:\/\/cocodataset.org\/#home\">COCO dataset<\/a>, then fine-tuned it on our dataset of aerial images.<\/p>\n<p>As we discussed in the\u00a0<a href=\"#dataprep\">Data Preparation<\/a>\u00a0section, hill shade data is very useful for detecting some of the classes (terraces, for example). So, ideally, we wanted a means of providing training input in at least 4 channels: 3 channels for the RGB aerial photos and 1 more for hill shade data. The issue we encountered was that the current <a href=\"https:\/\/github.com\/matterport\/Mask_RCNN\/releases\">pre-trained models<\/a> work well only with 3-channel input. Thus we merged the RGB and hill shade tiles into a combined 3-channel tile and used the latter for training. We noticed that applying\u00a0<a href=\"https:\/\/www.pyimagesearch.com\/2015\/10\/05\/opencv-gamma-correction\/\">gamma correction<\/a>\u00a0to the merged images improves the results: the gamma-corrected image is less &#8220;bleached out,&#8221; while vegetation and topographical features are more prominent.<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-8668\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/merging_hillshade.png\" alt=\"\" width=\"800\" height=\"249\" \/><\/p>
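\n<p>A merge like this takes only a few lines with OpenCV. The sketch below is one plausible recipe rather than our exact preprocessing code; the blend weight and gamma value are assumptions chosen for illustration.<\/p>\n<pre>
import cv2
import numpy as np

def merge_rgb_hillshade(rgb, hillshade, alpha=0.6, gamma=1.5):
    # Blend the aerial tile with the grayscale hill shade tile,
    # then gamma-correct so the result is not bleached out.
    hs_bgr = cv2.cvtColor(hillshade, cv2.COLOR_GRAY2BGR)
    merged = cv2.addWeighted(rgb, alpha, hs_bgr, 1.0 - alpha, 0)
    # Gamma correction via a lookup table (cf. the pyimagesearch article above).
    table = np.array([((i / 255.0) ** (1.0 / gamma)) * 255
                      for i in range(256)]).astype('uint8')
    return cv2.LUT(merged, table)
<\/pre>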
\n<p>Below we demonstrate the Mask RCNN model&#8217;s prediction results and how they vary depending on whether the model had access to hill shade data and on the loss function: the original input is leftmost, followed by the prediction result for the model trained only on aerial data. Next are the prediction results for the model trained on a combination of aerial and hill shade data. Finally, we show results for a model trained on a combination of aerial and hill shade data using the enhanced loss function (which takes the Dice Coefficient into consideration).<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-8670\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/maskRcnn_hilshade_loss.png\" alt=\"\" width=\"900\" height=\"345\" \/>\nAs identification of terraces relies mainly on hill shade information, mask prediction tended to work better when the model was trained on data that also carried information about the area&#8217;s topography. Incorporating the Dice Coefficient seemed to add positive improvements as well.<\/p>\n<p>In the next diagram we show how the Mask RCNN model&#8217;s predictions evolved as the model trained for a longer time (more epochs). For demonstration we&#8217;re using the same cherry-picked example as in the Unet section of this blog (see <a href=\"#unetpreinit\">Pre-initializing weights for Unet<\/a>). In this example we&#8217;re using a model trained on aerial and hill shade data and a loss function that uses the Dice Coefficient:<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-8672\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/maskrcnn_epochs.png\" alt=\"\" width=\"900\" height=\"377\" \/><\/p>\n<p>It is worth noting how distinct &#8220;instances&#8221; of waterways merge as training progresses: at the beginning (epoch 50) most of the waterway predictions are separate pieces, and the center part of the waterway is absent from the prediction. However, beginning at epoch 200, the predicted mask covers more and more surface and gets much closer to the manual label.<\/p>\n<p>As a side note, Mask RCNN was not able to detect the terraces in this example, while the\u00a0Unet model did find them.<\/p>\n<h2>Results<\/h2>\n<p>Both the Mask RCNN and the Unet models did a fairly good job of learning how to detect waterways \u2013 no surprise, as this class has the largest amount of labeled data. The average Dice Coefficient (on the test set of around 3000 examples) for the Mask RCNN and Unet models on waterways was 0.6515 and 0.5676, respectively.<\/p>\n<p>The histogram below shows the distribution of Dice Coefficient values for waterways across the test set for Mask RCNN and Unet:<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-8753\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/result1.png\" alt=\"\" width=\"872\" height=\"339\" \/><\/p>
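\n<p>For reference, per-image scores like the ones in this histogram can be computed with a few lines of numpy. In the sketch below, <code>model<\/code>, <code>test_images<\/code>, <code>test_masks<\/code> and <code>class_idx<\/code> are assumed placeholders for a trained Keras model and a held-out test set; the binarization threshold is also an assumption.<\/p>\n<pre>
import numpy as np

def dice_score(pred, truth, threshold=0.5, eps=1e-7):
    # Binarize the predicted mask, then compute 2 * |A intersect B| / (|A| + |B|).
    p = np.greater(pred, threshold).astype(np.float32)
    t = np.greater(truth, 0.5).astype(np.float32)
    return (2.0 * (p * t).sum() + eps) / (p.sum() + t.sum() + eps)

# One Dice value per test tile; the mean summarizes a model, and
# np.histogram(scores, bins=20, range=(0, 1)) yields the distribution.
scores = [dice_score(model.predict(x[None])[0, ..., class_idx], y[..., class_idx])
          for x, y in zip(test_images, test_masks)]
print('mean Dice:', np.mean(scores))
<\/pre>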
\n<p>Our tests showed that the mean Dice Coefficient across all classes is a bit higher for the Mask RCNN model. Incorporating the Dice Coefficient into the loss had a positive impact on the Mask RCNN model&#8217;s performance. For the Unet models, it improved performance in detecting waterways, but made no significant difference for the other classes.<\/p>\n<p>We used the mean Dice Coefficient to select the best Mask RCNN model. It was trained on a combination of aerial and hill shade data, using the enhanced loss function. The images below show a visual comparison of the Mask RCNN and Unet model predictions on a cherry-picked example. As we can see, both\u00a0models performed decently in detecting waterways. Although the Dice value for waterways is not very large (0.42), the model is definitely on the right track. The Mask RCNN detection of the field borders almost covers the manually labeled mask, which is very impressive; the Unet model, however, only starts picking up the field border.<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-8757\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/LOL-result2.png\" alt=\"\" width=\"900\" height=\"335\" \/><\/p>\n<p>Below is another example, demonstrating the results of terrace detection. In this example, Mask RCNN does not detect the terraces while Unet does (presumably making good use of the hill shade data). Although the Dice Coefficient value for terraces is not very large (0.41), the prediction already captures the main shape of the terraces and is thus useful for practical applications.<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-8758\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/LOL-result3.png\" alt=\"\" width=\"900\" height=\"334\" \/><\/p>\n<p>Incorporating additional channels of information, such as hill shade data or multi-band satellite imagery, is definitely a promising approach. Doing so with Unet seems to be more straightforward than with Mask RCNN.<\/p>\n<h2>Conclusions and Discussion<\/h2>\n<p>We saw really promising results in getting AI to help detect sustainable farming practices. Future work may include enhancing the dataset and making it more balanced, as well as adding more channels of information (<a href=\"https:\/\/en.wikipedia.org\/wiki\/Multispectral_image\">multispectral<\/a> and\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Hyperspectral_imaging\">hyperspectral<\/a> images) and exploring the performance of specialized models that detect only 1-2 classes.<\/p>\n<p>In addition to this work&#8217;s potential applications for sustainable farming, similar work could be utilized in detecting solar panels or green-garden roofs in smart eco-friendly cities.<\/p>\n<h2>References<\/h2>\n<ol>\n<li>Github <a href=\"https:\/\/github.com\/olgaliak\/segmentation-unet-maskrcnn\">repo<\/a>.<\/li>\n<li>&#8220;Mask RCNN&#8221; <a href=\"https:\/\/arxiv.org\/abs\/1703.06870\">paper<\/a>.<\/li>\n<li>&#8220;U-Net: Convolutional Networks for Biomedical Image Segmentation&#8221; <a href=\"https:\/\/arxiv.org\/abs\/1505.04597\">paper<\/a>.<\/li>\n<li>Mask RCNN Keras implementation github <a href=\"https:\/\/github.com\/matterport\/Mask_RCNN\">repo<\/a>.<\/li>\n<li>&#8220;TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation&#8221; <a href=\"https:\/\/arxiv.org\/abs\/1801.05746\">paper<\/a>.<\/li>\n<li><a href=\"https:\/\/www.kaggle.com\/c\/carvana-image-masking-challenge\/\">Carvana<\/a> Image Masking Kaggle challenge 4th place winners&#8217; <a href=\"https:\/\/github.com\/killthekitten\/kaggle-carvana-2017\">repo<\/a>.<\/li>\n<\/ol>\n<p>Featured photo by\u00a0<a href=\"https:\/\/unsplash.com\/photos\/tI_Odb7ZU6M?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Sveta Fedarava<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Unsplash<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Can Machine Learning help with detecting sustainable farming practices? In this blog post, inspired by our collaboration with Land O&#8217;Lakes, we share the lessons we learned in the image segmentation space.<\/p>\n","protected":false},"author":21373,"featured_media":13041,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[19],"tags":[177],"class_list":["post-8626","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-featured"],"acf":[],"blog_post_summary":"<p>Can Machine Learning help with detecting sustainable farming practices? 
In this blog post, inspired by our collaboration with Land O&#8217;Lakes, we share the lessons we learned in the image segmentation space.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/8626","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/21373"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=8626"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/8626\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/13041"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=8626"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=8626"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=8626"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}