{"id":2184,"date":"2015-07-21T16:34:28","date_gmt":"2015-07-21T16:34:28","guid":{"rendered":"https:\/\/www.microsoft.com\/reallifecode\/index.php\/2015\/07\/21\/communicating-with-mans-best-friend-part-i-dog-tracking\/"},"modified":"2020-03-15T12:46:49","modified_gmt":"2020-03-15T19:46:49","slug":"communicating-with-mans-best-friend-part-i-dog-tracking","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/communicating-with-mans-best-friend-part-i-dog-tracking\/","title":{"rendered":"Communicating with Man&#8217;s Best Friend, Part I &#8211; Dog Tracking"},"content":{"rendered":"<p>With the arrival of commodity depth-capable cameras, specifically, the Microsoft Kinect, as well as high-performance machine learning algorithms, entirely new capabilities are made possible. Working with academic experts and others, we are attempting to track the movements, body language, and vocalizations of dogs.<\/p>\n<p>The overall goal of this project is to decode the communications of dogs. This TED Case Study covers one aspect of this project, specifically, progress in visually tracking dogs. Future papers will cover feature detection and analysis (e.g., ear position, mouth expressions, tail decoding); audio analysis (barks); and development and deployment an application. Eventually, the objective is to analyze all of these factors <em>in toto<\/em> and be able to infer (for example) if ears are up, mouth is open, tail is up, dog is silent: <em>I\u2019m alert, I\u2019m paying attention, I\u2019m not feeling aggressive or threatened.<\/em><\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2015\/07\/image001-1.png\" alt=\"Image image001\" width=\"400\" height=\"254\" class=\"aligncenter size-full wp-image-11216\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2015\/07\/image001-1.png 400w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2015\/07\/image001-1-300x191.png 300w\" sizes=\"(max-width: 400px) 100vw, 400px\" \/><\/p>\n<p>Figure 1. Conceptual Goal of Project<\/p>\n<h2 id=\"overview-of-the-solution\">Overview of the Solution<\/h2>\n<p>Project \u201cDolittle\u201d initially leveraged the open source computer vision library OpenCV (<a href=\"http:\/\/www.opencv.org\">http:\/\/www.opencv.org<\/a>). OpenCV supports image detection, object recognition, various machine learning algorithms, classifiers, and video analysis, among other capabilities. Initial work focused on training OpenCV\u2019s Haar cascades to recognize one dog, in this case, a Smooth-Haired Collie named \u201cMici\u201d.<\/p>\n<p> <img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2015\/07\/image002-1.jpg\" alt=\"Image image002\" width=\"260\" height=\"400\" class=\"aligncenter size-full wp-image-11217\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2015\/07\/image002-1.jpg 260w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2015\/07\/image002-1-195x300.jpg 195w\" sizes=\"(max-width: 260px) 100vw, 260px\" \/><\/p>\n<p>Figure 2. Mici<\/p>\n<p>OpenCV uses a relatively typical training methodology. It receives a large number of annotated pictures as input. Here the developer took roughly 400 pictures of Mici and hand-annotated them; the training process required approximately a week of processing time. 
This approach was deemed inappropriate, partly because of the training time and partly because such training would be required on a per-dog basis. While the training could likely have been accelerated with Azure-scale compute, we believed this approach would ultimately not be flexible enough to support the wide variation in dog shape and dog movement.

Nevertheless, object detection – especially of moving objects such as animals – poses a formidable problem. The Kinect receives (through its IR camera) a depth stream in addition to the 30fps HD RGB color data.[1] However, these streams constitute simply a point cloud that requires sophisticated analysis in order to recognize a shape in real time. The current Kinect retail product leverages a massive amount of machine learning to perform human skeletal tracking.

Various other approaches were considered, including Berkeley’s image-processing machine learning library [Caffe](http://caffe.berkeleyvision.org/), internal Kinect code, and others.

Eventually, the team collaborated with a group in Microsoft Research (MSR) that had built software for very high-resolution, real-time hand tracking. Unlike other approaches, MSR used *both* machine learning *and* model fitting in real time to identify and track hands with extremely high fidelity. Supplementing machine learning with the ability to match a 3D mesh against the Kinect video stream enabled considerably more accuracy, as shown in the video located [here](http://www.youtube.com/watch?v=A-xXrMpOHyc). See as well the hand tracking paper presented at SIGCHI 2015 [here](http://research.microsoft.com/pubs/238453/pn362-sharp.pdf).

![Hand tracking video](https://devblogs.microsoft.com/cse/wp-content/uploads/sites/55/2015/07/image003-1.jpg)

Figure 3. Hand Tracking Video

However, dog tracking (and, by extension, tracking of similar large objects that move) differs from hand tracking in several ways:

- For our purposes, extreme real-time tracking (that is, at 30 frames per second) is not really required, as the goal is to detect the dog’s expressions, which do not change that fast.
- Background removal is a significant issue that the hand tracking demo did not have to address.
- There is substantial variation in dog shapes because of breeds: small or large, with long or short snouts, with or without tails, in many colors, and so on.

A multi-stage pipeline is used to recognize an object. The process matches predefined depth-aware “poses” (approximately 100,000 of them)[2] to what the Kinect sees. To do this matching, a decision jungle ML algorithm[3] detects a set of candidate poses, and then a particle swarm optimization algorithm performs “model fitting”: the observed image is matched against the prebuilt poses and the best fit is selected.
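The decision jungle and the depth renderer are well beyond a blog snippet, but the particle swarm step can be sketched in the abstract. In this deliberately simplified toy, `energy` is a stand-in for the real objective, which would score a rendered pose of the 3D model against the observed depth image; all constants and dimensions are illustrative assumptions.

```python
import numpy as np

def fit_pose(energy, dim=10, n_particles=64, iters=100):
    """Minimize energy(pose) with a basic particle swarm.

    energy: callable scoring a candidate pose vector (lower is
    better); in the real pipeline this would compare a rendered
    model pose against observed depth data.
    """
    rng = np.random.default_rng(0)
    pos = rng.uniform(-1, 1, (n_particles, dim))   # candidate poses
    vel = np.zeros_like(pos)
    best_p = pos.copy()                            # per-particle bests
    best_e = np.array([energy(p) for p in pos])
    g = best_p[best_e.argmin()]                    # global best pose

    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, 1))
        # Standard PSO update: inertia plus pulls toward the
        # personal and global bests.
        vel = 0.7 * vel + 1.5 * r1 * (best_p - pos) + 1.5 * r2 * (g - pos)
        pos += vel
        e = np.array([energy(p) for p in pos])
        improved = e < best_e
        best_p[improved], best_e[improved] = pos[improved], e[improved]
        g = best_p[best_e.argmin()]
    return g

# Toy usage: recover a known 10-dimensional "pose" vector.
target = np.linspace(-0.5, 0.5, 10)
print(fit_pose(lambda p: float(np.sum((p - target) ** 2))))
```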
Figure 4. Model Fitting Visualization (image missing from the original post)

To create the poses, a “rigged” (articulated) Blender 3D model, such as the hand model below, is used:

![Rigged hand model](https://devblogs.microsoft.com/cse/wp-content/uploads/sites/55/2015/07/image005-2.jpg)

Figure 5. Rigged Hand Model in Blender

(“Rigged” means the 3D mesh includes “bones”, so the model can be articulated.) The model is rotated in 3D space and articulated to generate the thousands of poses used to model-fit against the observed Kinect data; these rendered poses can also be used to test the recognition.

An example of a dog model (a Border Collie) in Blender is shown below:

![Border Collie model](https://devblogs.microsoft.com/cse/wp-content/uploads/sites/55/2015/07/image006-1.jpg)

Figure 6. Border Collie in Blender

These models serve as the basis for building the poses that the ML algorithms use to track the dog in question.

Initially, the goal of this project was modest: to track one or two specific dogs (Ilkka’s dog Mici and Barry’s dog Joe). Later the project will recognize and tune for different sizes and breeds, using machine learning and likely leveraging training data sets such as the Stanford Dogs Dataset (an annotated library of some 22,000 dog images – http://vision.stanford.edu/aditya86/ImageNetDogs/), potentially with techniques such as model deformation to match different breeds.

We have been able to track Ilkka’s dog Mici in real time, with some limitations; see the TED Case Study entitled “Background and Floor Removal from Depth Camera Data” for a discussion of one of the thorniest issues.
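That case study covers the details, but the core idea can be suggested in a few lines: treat everything outside a depth band as background and mask it out. The band limits, frame dimensions, and the implicit assumption of an uncluttered scene are all illustrative; the real problem, including finding and removing an arbitrary floor plane, is considerably harder.

```python
import numpy as np

def remove_background(depth_mm, near=500, far=2500):
    """Keep only pixels inside a depth band; zero out the rest.

    depth_mm: 2D array of Kinect depth values in millimeters
    (0 means no reading). near/far are hypothetical bounds chosen
    to bracket the dog's distance from the camera.
    """
    mask = (depth_mm > near) & (depth_mm < far)
    return np.where(mask, depth_mm, 0), mask

# Toy usage with a synthetic 424x512 frame (the Kinect v2 depth
# resolution): a distant wall plus a blob at "dog" depth.
frame = np.full((424, 512), 4000, dtype=np.uint16)
frame[200:350, 180:330] = 1500
fg, mask = remove_background(frame)
print(mask.sum(), "foreground pixels kept")
```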
The next step was to improve the tracking and then move on to identifying features on the dog – ear position, tail wag rate, and so forth – in order to infer the dog’s state of mind. In addition, work is under way to use machine learning algorithms to decipher dog vocalizations; these topics and others will be discussed in future case studies.

## Code Artifacts

OpenCV is at http://www.opencv.org.

The Kinect SDK is at http://www.microsoft.com/en-us/kinectforwindows/.

## Opportunities for Reuse

We believe this project has numerous possibilities for reuse. The most significant accomplishment of the project to date is to demonstrate that real-time reconstruction at far higher fidelity than 3D cameras provide out of the box is possible. Further, since the model fitting uses 3D models that can be tagged, feature extraction (e.g., ear position) becomes possible, and this capability leads to a number of important scenarios.

It should be possible, using the code base, to extend the recognition to other animals such as cats (actually a much easier problem, given that physical variation among cat breeds is substantially smaller than among dog breeds) and possibly horses.

The resolution of the tracking enables scenarios previously difficult or impossible for Kinect. For example, 24-hour monitoring of premature human babies is, with work, feasible (the out-of-the-box Kinect has certain minimum size limitations); similar monitoring of full-term babies (for terrified new parents) could also be done, and there are a number of comparable scenarios. Finally, such technology could be used for the benefit of Alzheimer’s patients and other physically challenged individuals.

---

[1] For more specifics on the Kinect, see [here](http://channel9.msdn.com/coding4fun/kinect/Kinect-1-vs-Kinect-2-a-side-by-side-reference); for the Kinect SDK, see [here](http://www.microsoft.com/en-us/kinectforwindows/).

[2] Initial set; one goal of the project is to see whether we can match with substantially fewer prebuilt poses.

[3] For more on decision jungles, see [here](http://research.microsoft.com/pubs/205439/DecisionJunglesNIPS2013.pdf). To quote: “Unlike conventional decision trees that only allow one path to every node, a DAG [Directed Acyclic Graph] in a decision jungle allows multiple paths from the root to each leaf.”