{"id":2775,"date":"2017-04-10T18:22:40","date_gmt":"2017-04-11T01:22:40","guid":{"rendered":"https:\/\/www.microsoft.com\/reallifecode\/?p=2775"},"modified":"2020-03-15T05:44:35","modified_gmt":"2020-03-15T12:44:35","slug":"object-detection-using-cntk","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/object-detection-using-cntk\/","title":{"rendered":"Object Detection Using Microsoft CNTK"},"content":{"rendered":"<p>We recently collaborated with <a href=\"http:\/\/www.insoundz.com\/\">InSoundz<\/a>, an audio-tracking startup, to build an object detection system using Microsoft\u2019s open source deep learning framework, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/product\/cognitive-toolkit\/\">Computational Network Toolkit (CNTK)<\/a>.<\/p>\n<h2 id=\"the-problem\">The Problem<\/h2>\n<p>InSoundz captures and models 3D audio of live sports events to enhance live video feeds of these events for fans.\u00a0In order to enable automatic discovery of interesting scenarios that would be relevant to their solution, InSoundz wanted to integrate\u00a0object detection capabilities into their system.<\/p>\n<p>Any solution needed to be as flexible as possible, and also had to support adding new object types and creating detectors for various data types with ease. Since the object detection component evaluates images from a live camera feed, the detection also had to be fast, with near real-time performance.<\/p>\n<h3 id=\"object-detection-vs-object-recognition\">Object Detection vs. Object Recognition<\/h3>\n<p>Often when people talk about \u201cobject detection,\u201d they actually mean a combination of <strong>object detection<\/strong> (e.g. <em>where is the cat\/dog in this image?<\/em>) and <strong>object recognition<\/strong> (e.g. <em>is this a cat or a dog?<\/em>). 
That is, they mean that the algorithm should solve a combination of two problems: first detecting <em>where<\/em> there is an object in a given image, and then recognizing <em>what<\/em> that object is. In this post, we will use this more common definition.<\/p>\n<p>In practice, the task of finding where an object is translates to finding a small bounding box that surrounds the object. While the tasks of recognition and object detection are both well-studied in the domain of computer vision, up until recently they were mainly solved using \u201cclassic\u201d approaches. These methods utilized local image features like <a href=\"https:\/\/en.wikipedia.org\/wiki\/Scale-invariant_feature_transform\">Scale Invariant Feature Transform (SIFT)<\/a> and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Histogram_of_oriented_gradients\">Histogram of Oriented Gradients (HOG)<\/a>.<\/p>\n<p>Classic algorithms for <a href=\"https:\/\/en.wikipedia.org\/wiki\/Pedestrian_detection\">pedestrian detection<\/a>, for example, scan different regions in an image using a grid-like approach. They use the HOG features of each region, together with a pre-trained linear classifier like a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Support_vector_machine\">support vector machine (SVM)<\/a>, to decide which regions contain a pedestrian.<\/p>\n<p>In recent years, however, deep learning methods like <a href=\"https:\/\/en.wikipedia.org\/wiki\/Convolutional_neural_network\">convolutional neural networks (CNNs)<\/a>\nhave become the prominent tool for object recognition tasks. Due to their outstanding performance in comparison to the classical methods, deep learning methods have also become a popular tool for object detection tasks.<\/p>\n<h2 id=\"the-solution\">The Solution<\/h2>\n<p>One such deep learning-based method for object detection is the <a href=\"https:\/\/arxiv.org\/abs\/1504.08083\">Fast-RCNN algorithm<\/a>. 
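<\/p>\n<p>Before going into the deep learning solution, the HOG idea sketched earlier can be made concrete. The snippet below is a toy, numpy-only illustration of a magnitude-weighted orientation histogram for a single grayscale patch; it is not the full HOG descriptor (which adds cell grids and block normalization), and the function name is ours:<\/p>

```python
import numpy as np

def orientation_histogram(patch, bins=9):
    """Toy HOG-style feature: histogram of gradient orientations over a
    grayscale patch, weighted by gradient magnitude (illustration only)."""
    gy, gx = np.gradient(patch.astype(float))  # row and column gradients
    magnitude = np.hypot(gx, gy)
    # unsigned orientations in [0, 180), as in the classic HOG descriptor
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(orientation, bins=bins, range=(0.0, 180.0),
                           weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist

# a horizontal intensity ramp: all gradient energy points along the x axis,
# so the entire histogram mass falls into the first orientation bin
patch = np.tile(np.arange(8.0), (8, 1))
print(orientation_histogram(patch))
```

<p>A real pipeline computes such histograms over a dense grid of cells, normalizes them in overlapping blocks, and feeds the concatenated vector to the linear classifier.<\/p>\n<p>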
<a href=\"http:\/\/www.rossgirshick.info\/\">Ross Girschik<\/a>\u00a0developed the method during his time in <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/\">MSR<\/a> and it is considered to be one of the top methods in the field of object detection and recognition. In our solution, we utilized the <a href=\"https:\/\/github.com\/Microsoft\/CNTK\/wiki\/Object-Detection-using-Fast-R-CNN\">CNTK implementation of the Fast-RCNN algorithm<\/a>.<\/p>\n<p>The Fast-RCNN method for object detection provides the following possible capabilities:<\/p>\n<ul>\n<li>Train a model on arbitrary classes of objects, utilizing a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Inductive_transfer\">transfer learning<\/a> approach. This method takes advantage of existing object recognition solutions to train the neural network.<\/li>\n<li>Evaluate large numbers of proposed regions simultaneously and detect objects in those regions.<\/li>\n<\/ul>\n<p>A schema describing the Fast-RCNN algorithm is illustrated in the image below:<\/p>\n<p> <img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2017\/04\/fast_rcnn_architecture.jpg\" alt=\"Image fast rcnn architecture\" width=\"1200\" height=\"500\" class=\"aligncenter size-full wp-image-11031\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2017\/04\/fast_rcnn_architecture.jpg 1200w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2017\/04\/fast_rcnn_architecture-300x125.jpg 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2017\/04\/fast_rcnn_architecture-1024x427.jpg 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2017\/04\/fast_rcnn_architecture-768x320.jpg 768w\" sizes=\"(max-width: 1200px) 100vw, 1200px\" \/><\/p>\n<p>Next, we describe the two steps in building our solution: how we <strong>train the neural network model<\/strong>, and how we <strong>use the model for object 
detection<\/strong>.<\/p>\n<h3 id=\"training-the-model\">Training the model<\/h3>\n<p>For the technical part of training a Fast-RCNN model with CNTK, please refer to <a href=\"https:\/\/github.com\/Microsoft\/CNTK\/wiki\/Object-Detection-using-Fast-R-CNN\">this tutorial<\/a>, which will walk you through setting up your model.<\/p>\n<p>The input of the training procedure is a dataset of images, where each image has a list of bounding boxes associated with it, and each bounding box has an associated class.<\/p>\n<p>For InSoundz, the input data was composed of several video files with no bounding box labels.<\/p>\n<p>In order to allow InSoundz to prepare the training data with ease, we developed a <strong><a href=\"https:\/\/github.com\/CatalystCode\/CNTK-Object-Detection-Video-Tagging-Tool\">video tagging tool<\/a><\/strong>. Our tool supports the ability to export the tagged images to the CNTK training format, as well as the ability to assess the performance of existing models. For more information about the tool, please refer to <a href=\"\/developerblog\/2017\/04\/10\/end-end-object-detection-box\/\">the related Real Life Code Story<\/a>.<\/p>\n<p>While the training process is rather straightforward, there are still a few things to consider when tuning the parameters of the model:<\/p>\n<h4 id=\"number-of-region-of-interest-rois\">Number of Regions of Interest (ROIs)<\/h4>\n<p>This setting defines the number of regions that will be considered as potential locations of objects in the input image.<\/p>\n<p>While the default setting in the CNTK implementation defines 2000 ROIs for an image, this number can be tuned. A higher number means the model considers more regions, which can increase the accuracy of detection. A larger number of regions, however, will result in longer training (and testing) times for the model. 
Decreasing the number of ROIs can help achieve shorter training and testing times, but can cause an overall decrease in the accuracy of the model.<\/p>\n<p>In our experiments, we found that setting this value to 1500 ROIs led to reasonable detection results.<\/p>\n<h4 id=\"image-size\">Image Size<\/h4>\n<p>Since convolutional neural networks have a fixed input size, each image should be resized to a pre-defined size that the network expects. Similar to the number of ROIs, larger images can result in higher accuracy of detection but longer training and testing times. While using smaller images can result in shorter training and testing times, it can lower detection accuracy. In our tests, we used an image size of 800&#215;800.<\/p>\n<p>All of the parameters of the model can be easily set by changing the appropriate values in the file <a href=\"https:\/\/github.com\/Microsoft\/CNTK\/blob\/master\/Examples\/Image\/Detection\/FastRCNN\/PARAMETERS.py#L18-L20\">PARAMETERS.py<\/a>.<\/p>\n<h3 id=\"object-detection-with-the-cntk-model\">Object Detection with the CNTK Model<\/h3>\n<p>While the CNTK training procedure also contains a built-in evaluation procedure for a given test set, the user of the model will most likely want to use the model to perform object detection on new images that aren\u2019t part of the training or test set.<\/p>\n<p>To use the model on a single image, the following operations should be performed:<\/p>\n<ol>\n<li>Resize the image to the expected input size. The resizing should be performed according to one of the image\u2019s axes to preserve the aspect ratio. Then, the resized image should be padded to fit the neural network input size.<\/li>\n<li>Calculate the candidate ROIs for detection. 
While a grid-like method can produce identical ROIs for different images, using other methods like <a href=\"http:\/\/koen.me\/research\/pub\/uijlings-ijcv2013-draft.pdf\">Selective Search<\/a> will most likely result in different ROIs for different images.<\/li>\n<li>Run the resized image, together with the candidate ROIs, through the trained neural network. As a result, each ROI is assigned a predicted class (or object type), with a special \u201cbackground\u201d class when no object is recognized.<\/li>\n<li>Finally, run the ROIs that were identified as containing an object through the <a href=\"http:\/\/www.pyimagesearch.com\/2014\/11\/17\/non-maximum-suppression-object-detection-python\/\">Non-Maximum-Suppression algorithm<\/a>, so overlapping regions are unified to produce the final bounding boxes for detected objects.<\/li>\n<\/ol>\n<p>A detailed walkthrough of the above pipeline is available in <a href=\"https:\/\/github.com\/nadavbar\/cntk-fastrcnn\/blob\/master\/CNTK_FastRCNN_Eval.ipynb\">a Python notebook<\/a>.<\/p>\n<p>In addition, <a href=\"https:\/\/github.com\/CatalystCode\/CNTK-FastRCNNDetector\">a full Python implementation<\/a> is available on GitHub. 
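<\/p>\n<p>To illustrate the resize-and-pad step (step 1 above), the snippet below computes the scale and padding for a square network input. It is a sketch with a hypothetical helper name, not code from the linked implementation:<\/p>

```python
def resize_and_pad_geometry(height, width, target):
    """Compute how an image is scaled along its longer axis to preserve the
    aspect ratio, and how much padding fills the square network input."""
    scale = float(target) / max(height, width)
    new_h = int(round(height * scale))
    new_w = int(round(width * scale))
    pad_h = target - new_h
    pad_w = target - new_w
    # split the padding evenly between top/bottom and left/right
    padding = (pad_h // 2, pad_h - pad_h // 2, pad_w // 2, pad_w - pad_w // 2)
    return scale, (new_h, new_w), padding

# a 600x1200 frame prepared for an 800x800 network input:
# scaled by 2/3 to 400x800, then padded with 200 pixels above and below
scale, size, padding = resize_and_pad_geometry(600, 1200, 800)
print(scale, size, padding)
```

<p>The same scale factor is needed later to map the predicted bounding boxes back to the coordinates of the original frame.<\/p>\n<p>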
The code in this repository exposes an <strong>FRCNNDetector<\/strong> class that can be used to load a trained Fast-RCNN CNTK model and evaluate it on images.<\/p>\n<p>The following code sample demonstrates how the FRCNNDetector object encapsulates the steps described above into a single call of the <strong>detect<\/strong> method.<\/p>\n<pre class=\"lang:python decode:true\" title=\"CNTKFastRCNN Detector Sample\">import cv2\r\nfrom os import path\r\nfrom frcnn_detector import FRCNNDetector\r\n\r\ncntk_scripts_path = r'C:\\local\\cntk\\Examples\\Image\\Detection\\FastRCNN'\r\nmodel_file_path = path.join(cntk_scripts_path, r'proc\/grocery_2000\/cntkFiles\/Output\/Fast-RCNN.model')\r\n\r\n# initialize the detector and load the model\r\ndetector = FRCNNDetector(model_file_path, cntk_scripts_path=cntk_scripts_path)\r\n\r\n# load a test image and run detection on it\r\nimg = cv2.imread(path.join(cntk_scripts_path, r'..\/..\/DataSets\/Grocery\/testImages\/WIN_20160803_11_28_42_Pro.jpg'))\r\nrects, labels = detector.detect(img)\r\n\r\n# print detections\r\nfor rect, label in zip(rects, labels):\r\n    print(\"Bounding box: %s, label %s\" % (rect, label))<\/pre>\n<p>In the code sample shown above, an instance of the FRCNNDetector class is created and the model is called for detection on a single image. The resulting bounding boxes and their corresponding labels are then printed to the screen.<\/p>\n<p>Note that the only parameters required to instantiate the FRCNNDetector class are the location of the model file and the location of the CNTK Fast-RCNN scripts.<\/p>\n<p>The ROI calculation step uses a caching mechanism when using a grid method to calculate ROIs, which allows for even shorter image evaluation times. 
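<\/p>\n<p>For completeness, the non-maximum suppression pass (step 4 of the pipeline above) can be sketched with a minimal greedy implementation. This is an illustrative version, not the one used by the detector:<\/p>

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedily keep the highest-scoring box, then drop every remaining
    box that overlaps it by more than the threshold; repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# two heavily overlapping detections collapse into one; the distant box survives
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]
```

<p>Production implementations typically vectorize this loop with numpy, but the greedy logic is the same.<\/p>\n<p>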
If the user of the FRCNNDetector object chooses to disable the calculation of ROIs using selective search (and only uses grids), the evaluation times become much shorter since the grid is only calculated once and then the ROIs are re-used.<\/p>\n<p>In addition to the Python implementation mentioned above, we have also released a Node.js wrapper that exposes the Fast-RCNN detection capabilities for Node.js and Electron developers. For more info, please visit the <a href=\"https:\/\/github.com\/nadavbar\/node-cntk-fastrcnn\">node-cntk-fastrcnn code repository<\/a>.<\/p>\n<h2 id=\"opportunities-for-reuse\">Opportunities for Reuse<\/h2>\n<p>In this case study, we described how we built an object detection model using the CNTK implementation of the Fast-RCNN algorithm. As demonstrated above, the algorithm is generic and can be easily trained on different datasets and various classes of objects.<\/p>\n<p>We hope that this write-up, as well as the accompanying code, can benefit other developers looking to build their own object detection pipelines.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Creating an object detection model  using Microsoft&#8217;s open source deep learning framework CNTK and its implementation of Fast-RCNN. <\/p>\n","protected":false},"author":21372,"featured_media":11032,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[19],"tags":[123,127,147,175,239,279],"class_list":["post-2775","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-cntk","tag-computer-vision","tag-deep-learning","tag-fast-rcnn","tag-machine-learning-ml","tag-object-detection"],"acf":[],"blog_post_summary":"<p>Creating an object detection model  using Microsoft&#8217;s open source deep learning framework CNTK and its implementation of Fast-RCNN. 
<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/2775","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/21372"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=2775"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/2775\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/11032"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=2775"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=2775"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=2775"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}