{"id":2553,"date":"2017-04-10T18:17:26","date_gmt":"2017-04-10T18:17:26","guid":{"rendered":"https:\/\/www.microsoft.com\/reallifecode\/?p=2553"},"modified":"2020-03-25T17:23:15","modified_gmt":"2020-03-26T00:23:15","slug":"end-end-object-detection-box","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/end-end-object-detection-box\/","title":{"rendered":"End to End Object Detection in a Box"},"content":{"rendered":"<p class=\"posttitle\">This code story outlines a new end-to-end video tagging tool, built on top of the <a href=\"https:\/\/github.com\/Microsoft\/CNTK\">Microsoft Cognitive Toolkit (CNTK)<\/a>, that enables developers to more easily create, review, and iterate on their own object detection models.<\/p>\n<div class=\"postbody\">\n<h2 id=\"background\">Background<\/h2>\n<p>We recently worked with\u00a0<a href=\"http:\/\/www.insoundz.com\/\">Insoundz<\/a>, an Israeli startup that captures sound at live sports events using audio-tracking technology. In order to map this captured audio, they needed a way to identify areas of interest in live video feeds. For this purpose, we worked with them\u00a0<a href=\"\/developerblog\/2017\/04\/10\/object-detection-using-cntk\/\">to build an object detection model with CNTK<\/a>.<\/p>\n<p>Object detection models require a large quantity of tagged image data to work in production. 
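As a hedged illustration of what one frame of such tagged data might look like, here is a minimal sketch; the field names and values are hypothetical, not the tool's actual export schema:

```javascript
// One training sample: a frame image plus labeled bounding boxes.
// Field names and values are illustrative only, not the tool's real schema.
const frame = {
  image: 'frame_0042.jpg',
  boxes: [
    // [x1, y1, x2, y2] in pixels, plus the object label
    { rect: [120, 80, 260, 310], label: 'player' },
    { rect: [400, 95, 455, 150], label: 'ball' },
  ],
};

// e.g. compute the pixel area of each tagged region
const areas = frame.boxes.map(({ rect: [x1, y1, x2, y2] }) => (x2 - x1) * (y2 - y1));
console.log(areas); // [32200, 3025]
```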
The training data for an object detection model consists of a set of images, where each image is associated with a group of bounding boxes surrounding the objects in the image, and each bounding box is assigned a label that describes the object.<\/p>\n<p><figure class=\"wp-caption alignnone\" ><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2017\/04\/boundingboxexplained.jpg\" alt=\"Image boundingboxexplained\" width=\"673\" height=\"463\" class=\"aligncenter size-full wp-image-11034\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2017\/04\/boundingboxexplained.jpg 673w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2017\/04\/boundingboxexplained-300x206.jpg 300w\" sizes=\"(max-width: 673px) 100vw, 673px\" \/><figcaption class=\"wp-caption-text\">Tagged Image Data<\/figcaption><\/figure><\/p>\n<div class=\"postbody\">\n<p>Having quality training data is one of the most important aspects of the model building process. However, in most existing object detection training pipelines, image datasets are compiled and expanded upon independently of training. As a result, models can only be optimized by fine-tuning hyperparameters and by hoping that requests for additional training data cover the edge cases of the existing model.<\/p>\n<\/div>\n<div class=\"postbody\">\n<p>Since Insoundz\u2019s capture solution supports a wide set of event scenarios (each with unique areas of interest), they needed a scalable way to aggregate data as they train, validate, and iterate new object detection models.<\/p>\n<h2 id=\"the-solution\">The Solution<\/h2>\n<p>To enable Insoundz\u00a0to generate large quantities of high-quality, tagged image data quickly, we developed a semi-automated, cross-platform <a href=\"https:\/\/electron.atom.io\/\">Electron<\/a>\u00a0application. 
This video tagging application exports data directly to CNTK format and allows users to run and validate a trained model on new videos to generate stronger models.\u00a0The application supports Windows, macOS, and Linux.<\/p>\n<p><figure class=\"wp-caption alignnone\" ><img decoding=\"async\" class=\"size-full\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2017\/04\/detectioninabox.jpg\" alt=\"tag video, export tags to CNTK, train model, run model on new video, validate model suggestions &amp; fix errors\" width=\"1168\" height=\"430\" \/><figcaption class=\"wp-caption-text\"><strong>Object Detection in a Box Architecture<\/strong><\/figcaption><\/figure><\/p>\n<p>One major advantage of this object detection training architecture over other publicly available architectures is the <em>validation feedback loop<\/em>. The feedback loop improves model performance with each iteration while decreasing the number of training samples needed.<\/p>\n<h3 id=\"how-the-tool-works\">How the tool works<\/h3>\n<p>The Visual Object Tagging Tool supports the following <strong>features<\/strong>:<\/p>\n<ul>\n<li><strong>Semi-automated video tagging<\/strong>: computer-assisted tagging and tracking of objects in videos using the <a href=\"http:\/\/opencv.jp\/opencv-1.0.0_org\/docs\/papers\/camshift.pdf\">CAMSHIFT tracking algorithm<\/a><\/li>\n<li><strong>Export to CNTK<\/strong>: export tags and assets to CNTK format for training a CNTK object detection model<\/li>\n<li><strong>Model validation<\/strong>: run and validate a trained CNTK object detection model on new videos to generate stronger models<\/li>\n<\/ul>\n<p>The tagging control we used in our tool was built on the <a href=\"https:\/\/github.com\/CatalystCode\/video-tagging\">video tagging component<\/a> with some critical improvements, including the ability to:<\/p>\n<ul>\n<li>Resize the control to maximize screen real estate<\/li>\n<li>Create rectangular bounding regions<\/li>\n<li>Resize 
regions<\/li>\n<li>Reposition regions<\/li>\n<li>Monitor visited frames<\/li>\n<li>Track regions between frames in a continuous scene<\/li>\n<\/ul>\n<div class=\"mceTemp\"><\/div>\n<p><img decoding=\"async\" class=\"size-full\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2017\/04\/5_Export.jpg\" width=\"509\" height=\"474\" \/><\/p>\n<h3>Tagging a Video<\/h3>\n<ul>\n<li>Click and drag a bounding box around the desired area<\/li>\n<li>Move or resize the region until it fits the object<\/li>\n<li>Click on a region and select the desired tag from the labeling toolbar at the bottom of the tagging control\n<ul>\n<li>Selected regions will appear in red\u00a0<img decoding=\"async\" src=\"https:\/\/placehold.it\/15\/f03c15\/000000?text=+\" alt=\"red\" \/> and unselected regions will appear in\u00a0blue <img decoding=\"async\" src=\"https:\/\/placehold.it\/15\/1589F0\/000000?text=+\" alt=\"#1589F0\" \/><\/li>\n<li>Click the <img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2017\/04\/cleartags.png\" \/> button to clear all tags on a given frame<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Navigation<\/h3>\n<ul>\n<li>Users can navigate between video frames by using:\n<ul>\n<li>the <img decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2017\/04\/prev-next.png\" width=\"43\" height=\"21\" \/>\u00a0buttons<\/li>\n<li>the left\/right arrow keys<\/li>\n<li>the video skip bar<\/li>\n<\/ul>\n<\/li>\n<li>Tags are auto-saved each time a frame is changed<\/li>\n<\/ul>\n<h3>Tracking<\/h3>\n<p>Tracking support reduces the need to redraw and re-tag regions on every frame in a\u00a0scene. 
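Under the hood, CAMSHIFT repeatedly re-centers a search window on the centroid of a back-projected probability map (the classic mean-shift step) and then adapts the window's size and orientation. As a toy, dependency-free sketch of just the mean-shift step it builds on (not the tool's actual implementation, which uses the full CAMSHIFT algorithm):

```javascript
// Toy mean-shift step, the core idea CAMSHIFT extends with adaptive
// window sizing. Not the tool's actual tracking code.
function meanShift(prob, window, iters = 10) {
  let [x, y, w, h] = window;
  const rows = prob.length, cols = prob[0].length;
  for (let i = 0; i < iters; i++) {
    // Centroid of probability mass inside the current window
    let total = 0, sx = 0, sy = 0;
    for (let dy = 0; dy < h; dy++) {
      for (let dx = 0; dx < w; dx++) {
        const p = prob[y + dy][x + dx];
        total += p; sx += dx * p; sy += dy * p;
      }
    }
    if (total === 0) break; // no mass under the window: give up
    // Re-center the window on the centroid, clamped to the image bounds
    const nx = Math.min(Math.max(x + Math.floor(sx / total) - Math.floor(w / 2), 0), cols - w);
    const ny = Math.min(Math.max(y + Math.floor(sy / total) - Math.floor(h / 2), 0), rows - h);
    if (nx === x && ny === y) break; // converged
    x = nx; y = ny;
  }
  return [x, y, w, h];
}

// Synthetic 'back projection': a bright blob at rows 65-74, cols 55-64.
const prob = Array.from({ length: 100 }, (_, r) =>
  Array.from({ length: 100 }, (_, c) => (r >= 65 && r < 75 && c >= 55 && c < 65 ? 1 : 0)));
// A window that partially overlaps the blob drifts onto it.
console.log(meanShift(prob, [45, 55, 20, 20]));
```

In a real tracker the probability map is the back projection of the tagged region's color histogram, recomputed for each new frame.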
For example, in our\u00a0<a href=\"https:\/\/github.com\/CatalystCode\/CNTK-Video-Tagging-Tool\/tree\/master\/src\/videos\">test video<\/a>, we were able to track the location of a hat in over 600 consecutive frames with only one manual correction.<\/p>\n<p>New regions are tracked by default until the scene changes. Since the <a href=\"http:\/\/opencv.jp\/opencv-1.0.0_org\/docs\/papers\/camshift.pdf\">CAMSHIFT algorithm<\/a> has some known limitations, users can disable tracking for certain sets of frames. To toggle tracking <em>on<\/em> and <em>off<\/em>, use the file menu setting or the keyboard shortcut Ctrl\/Cmd + T.<\/p>\n<h3>CNTK Integration<\/h3>\n<ul>\n<li>Export tags for a video tagging job to CNTK format using Ctrl\/Cmd + E<\/li>\n<li>Apply the trained model to a new video for review using Ctrl\/Cmd + R<\/li>\n<\/ul>\n<h2 id=\"code\">Code<\/h2>\n<p>You can download the tool or evaluate the code for your own use <a href=\"https:\/\/github.com\/CatalystCode\/CNTK-Video-Tagging-Tool\/releases\">on GitHub<\/a>.<\/p>\n<h2 id=\"opportunities-for-reuse\">Opportunities for Reuse<\/h2>\n<p>The tool outlined in this code story is adaptable to any object detection\/recognition scenario.<\/p>\n<p>Since we wrote the CNTK video tagging tool in JavaScript and used independent components for tracking and CNTK integration, the work here is cross-platform and can be abstracted and embedded into a web application or an existing workflow.<\/p>\n<h2 id=\"future-plans\">Future Plans<\/h2>\n<p>In the future, we plan to provide support for tagging image directories, as well as additional project management support for handling multiple tagging jobs in parallel. Furthermore, we hope to investigate additional tracking algorithms to automate the tagging process even further. 
In the next version, we will include configurable interfaces so that interested contributors can integrate their own object detection frameworks and tracking algorithms into the tool.<\/p>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Building a video tagging tool on top of CNTK to enable developers to create, review and iterate object detection models. <\/p>\n","protected":false},"author":21353,"featured_media":11035,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[19],"tags":[123,127,279,358,375],"class_list":["post-2553","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-cntk","tag-computer-vision","tag-object-detection","tag-tracking","tag-video-tagging"],"acf":[],"blog_post_summary":"<p>Building a video tagging tool on top of CNTK to enable developers to create, review and iterate object detection models. 
<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/2553","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/21353"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=2553"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/2553\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/11035"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=2553"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=2553"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=2553"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}