{"id":13826,"date":"2021-09-20T08:59:12","date_gmt":"2021-09-20T15:59:12","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cse\/?p=13826"},"modified":"2023-06-19T10:50:47","modified_gmt":"2023-06-19T17:50:47","slug":"building-an-action-detection-scoring-pipeline-for-digital-dailies","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/building-an-action-detection-scoring-pipeline-for-digital-dailies\/","title":{"rendered":"Building an Action Detection Scoring Pipeline for Digital Dailies"},"content":{"rendered":"<h2><span style=\"font-size: 24pt;\">Introduction<\/span><\/h2>\n<p>In media companies, like WarnerMedia, footage filmed for the entire day include \u2018takes\u2019 of various scenes or footage types as well as footage before and after each take. This footage filmed each day is known as \u2018digital dailies\u2019.<\/p>\n<p>Archival of digital dailies is a manual and time-consuming process. When digital dailies are produced all the data is permanently archived, however there is a portion of that content that should either be long-term archived or completely discarded. The problem here is two-pronged: 1. It\u2019s an expensive manual process to identify which portions of content can be archived and\/or discarded and 2. There is a cost associated with storing unnecessary content, especially when we are dealing with terabytes and petabytes of data.<\/p>\n<p>This project is a co-engineering collaboration between WarnerMedia and Microsoft\u2019s Commercial Software Engineering for identifying action and cut sequences within media for archival purposes.<\/p>\n<h2><span style=\"font-size: 24pt;\">Goals<\/span><\/h2>\n<p>Our primary goal was to use Machine Learning to identify archival content from WarnerMedia digital dailies, that could be either long-term archived or discarded, to reduce storage costs.<\/p>\n<h2><span style=\"font-size: 24pt;\">Exploring the Possibilities<\/span><\/h2>\n<p>We focused on identifying \u201caction\u201d and \u201ccut\u201d using visual and audio cues in the footage as a means of demarcating the content to identify portions for long-term storage and portions that can be discarded.<\/p>\n<p>When an \u201caction\u201d event has been identified, and up to a \u201ccut\u201d event, that\u2019s content that should be kept. 
<p>To accurately detect \u201caction\u201d and \u201ccut\u201d, the Data Science team iterated through many different services and experiments.<\/p>\n<h4>Visual Exploration:<\/h4>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2021\/09\/clapperboard_detection.png\" \/><\/p>\n<p>The goal of the visual exploration was to identify clapperboards within frames of video and extract text from those clapperboards. The exploration consisted of:<\/p>\n<ol>\n<li>Automatic metadata and insights extraction using <a href=\"https:\/\/azure.microsoft.com\/en-us\/products\/video-analyzer\/\">Azure Video Analyzer for Media<\/a> (formerly known as Video Indexer).<\/li>\n<li>Selecting and generating a representative dataset of frames for model training.<\/li>\n<li>Augmenting images within the dataset using crop, rotate, cutout, blur, flip, and other techniques.<\/li>\n<li>Building a semi-supervised labeling tool to automatically group similar labels using density-based spatial clustering.<\/li>\n<li>Training <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/custom-vision-service\/#overview\">Azure Cognitive Services Custom Vision<\/a> models to detect clapperboard images against various backdrops.<\/li>\n<li>Detecting clapperboards with the trained <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/custom-vision-service\/#overview\">Azure Cognitive Services Custom Vision<\/a> models.<\/li>\n<li>Training a custom model in <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/form-recognizer\/\">Azure Form Recognizer<\/a> to detect and extract fields such as \u201croll\u201d, \u201cscene\u201d, \u201ctake\u201d, etc. in detected clapperboard images.<\/li>\n<li>Using the <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/computer-vision\/\">Azure Cognitive Services Computer Vision Read API<\/a> (OCR) to choose the \u201cbest\u201d clapperboard frame (the clapperboard image with the most detectable text) for each action event.<\/li>\n<li>Using the trained custom <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/form-recognizer\/\">Azure Form Recognizer<\/a> model to extract key-value pairs (such as \u201cscene\u201d, \u201croll\u201d, \u201ctake\u201d, etc.) from detected clapperboard images.<\/li>\n<\/ol>\n
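<p>As a minimal sketch of the clapperboard detection step (item 6 in the list above), the following scores sampled frames against a published Custom Vision object detection model. It is illustrative only: the endpoint, key, project ID, published model name, tag name, and frame paths are hypothetical placeholders, not the project\u2019s actual values.<\/p>\n<pre><code># Minimal sketch: score sampled frames against a published Custom Vision
# object detection model and keep frames that likely contain a clapperboard.
# The endpoint, key, project ID, model name, tag name, and paths are placeholders.
from azure.cognitiveservices.vision.customvision.prediction import CustomVisionPredictionClient
from msrest.authentication import ApiKeyCredentials

ENDPOINT = 'https://YOUR-RESOURCE.cognitiveservices.azure.com/'
credentials = ApiKeyCredentials(in_headers={'Prediction-key': 'YOUR-PREDICTION-KEY'})
predictor = CustomVisionPredictionClient(ENDPOINT, credentials)

PROJECT_ID = 'YOUR-PROJECT-ID'
MODEL_NAME = 'clapperboard-detector'  # published iteration name (hypothetical)

def clapperboard_score(frame_path):
    '''Return the highest clapperboard probability found in a single frame.'''
    with open(frame_path, 'rb') as image:
        result = predictor.detect_image(PROJECT_ID, MODEL_NAME, image.read())
    scores = [p.probability for p in result.predictions if p.tag_name == 'clapperboard']
    return max(scores, default=0.0)

# Frames would normally be sampled from the video (e.g., with ffmpeg); paths are hypothetical.
detected = []
for frame in ['frames/frame_0120.png', 'frames/frame_0121.png', 'frames/frame_0122.png']:
    score = clapperboard_score(frame)
    if score >= 0.5:  # confidence threshold (tunable)
        detected.append((frame, score))

# The 'best' clapperboard frame per action event (most readable text) would then be
# chosen with the Computer Vision Read API, and its fields extracted with the custom
# Form Recognizer model, as described in steps 8 and 9 above.
print(detected)
<\/code><\/pre>\n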
<h4>Audio Exploration:<\/h4>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2021\/09\/cut_utterance2.png\" \/><\/p>\n<p>Here we had to accurately detect the spoken word \u201ccut\u201d. The exploration consisted of:<\/p>\n<ul>\n<li>Utilizing the Speech to Text Cognitive Service<\/li>\n<li>Experimenting with search techniques<\/li>\n<li>Semi-supervised, <a href=\"https:\/\/arxiv.org\/ftp\/arxiv\/papers\/2001\/2001.07685.pdf\">FixMatch<\/a>-based audio labeling<\/li>\n<li>Building a custom audio model with a pre-trained Convolutional Neural Network<\/li>\n<li>Using the open-source <a href=\"https:\/\/www.audacityteam.org\/about\/\">Audacity\u00ae<\/a> editor for creating audio labels<\/li>\n<li>Preprocessing audio using dynamic range compression<\/li>\n<li>Utilizing the Custom Speech to Text service, fine-tuned with custom data<\/li>\n<\/ul>\n<h2><span style=\"font-size: 24pt;\">Overall Design<\/span><\/h2>\n<p>After the initial exploration phase of the project, the team decided that the best approach was a combination of utterance and clapperboard detection. We decided to proceed with the following services for the final solution:<\/p>\n<ul>\n<li><a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/custom-vision-service\/#features\">Custom Vision<\/a><\/li>\n<li><a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/computer-vision\/\">Computer Vision \/ OCR<\/a><\/li>\n<li><a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/speech-to-text\/\">Custom Speech \/ Speech to Text<\/a><\/li>\n<li><a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/form-recognizer\/\">Form Recognizer<\/a><\/li>\n<\/ul>\n<p>Here is a look at the final scoring pipeline process.<\/p>\n<p><img decoding=\"async\" width=\"1699\" height=\"1847\" class=\"wp-image-13848\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2021\/09\/word-image-10.png\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/09\/word-image-10.png 1699w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/09\/word-image-10-276x300.png 276w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/09\/word-image-10-942x1024.png 942w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/09\/word-image-10-768x835.png 768w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/09\/word-image-10-1413x1536.png 1413w\" sizes=\"(max-width: 1699px) 100vw, 1699px\" \/><\/p>\n<p>In the scoring pipeline there are three workstreams: audio, vision, and metadata extraction. These workstreams extract the audio and visual cues from the raw video and compile them into a final report that identifies \u201caction\u201d and \u201ccut\u201d at the various timestamps within the video.<\/p>\n<p>To tie the above services together, we used <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/machine-learning\/\"><strong>Azure Machine Learning<\/strong><\/a> to run training and scoring pipelines for the various Cognitive Services.<\/p>\n<p>We created an architecture with training pipelines for MLOps (Machine Learning Operations) to efficiently maintain, train, and deploy models. In this architecture, the pipelines would first run against simplified datasets for pull requests (PRs); then, after merging, the full training pipelines would run and train the various models. From this point, models could be promoted to production to be used as the primary scoring model.<\/p>\n
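<p>As a rough sketch of what such a training pipeline might look like with the Azure Machine Learning SDK (v1), the example below defines and submits a two-step pipeline. The workspace configuration, compute target, scripts, experiment name, and the idea of a <code>--subset<\/code> flag for PR runs are illustrative assumptions rather than the project\u2019s actual code.<\/p>\n<pre><code># Minimal sketch of defining and submitting a training pipeline with the
# Azure Machine Learning SDK (v1). Workspace config, compute target name,
# scripts, and the --subset flag are hypothetical placeholders.
from azureml.core import Workspace, Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()                 # reads config.json for the workspace
compute = ws.compute_targets['gpu-cluster']  # existing AML compute cluster (hypothetical name)

prepare_step = PythonScriptStep(
    name='prepare_data',
    script_name='prepare_data.py',
    source_directory='src',
    compute_target=compute,
    # a PR build could pass '--subset true' to train against the simplified dataset
    arguments=['--subset', 'false'],
)

train_step = PythonScriptStep(
    name='train_clapperboard_model',
    script_name='train.py',
    source_directory='src',
    compute_target=compute,
)
train_step.run_after(prepare_step)

pipeline = Pipeline(workspace=ws, steps=[prepare_step, train_step])
run = Experiment(ws, 'action-cut-training').submit(pipeline)
run.wait_for_completion(show_output=True)
<\/code><\/pre>\n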
<p>At a high level, the infrastructure looks like this:<\/p>\n<p><img decoding=\"async\" width=\"862\" height=\"706\" class=\"wp-image-13849\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2021\/09\/graphical-user-interface-application-description-3.png\" alt=\"Graphical user interface, application Description automatically generated\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/09\/graphical-user-interface-application-description-3.png 862w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/09\/graphical-user-interface-application-description-3-300x246.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/09\/graphical-user-interface-application-description-3-768x629.png 768w\" sizes=\"(max-width: 862px) 100vw, 862px\" \/><\/p>\n<h2><span style=\"font-size: 24pt;\">Results<\/span><\/h2>\n<p>The results turned out to be excellent, with <strong>precision<\/strong> and <strong>recall<\/strong> both at 96%. Here, <strong>precision<\/strong> reflects how little discardable footage was mistakenly labeled for archival, and <strong>recall<\/strong> reflects whether we archived all of the footage of interest.<\/p>\n<p>For content with audio, we were able to discard around 45% of the content, which eliminates storage costs for that content entirely, and we marked around 55% of the content for long-term archival, which lowers costs by moving it to cheaper storage tiers. For long-term archival, Azure provides both \u2018cool\u2019 and \u2018archive\u2019 storage at reduced cost, instead of storing everything in the \u2018hot\u2019 tier.<\/p>\n<p><img decoding=\"async\" width=\"1525\" height=\"706\" class=\"wp-image-13850\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2021\/09\/word-image-11.png\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/09\/word-image-11.png 1525w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/09\/word-image-11-300x139.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/09\/word-image-11-1024x474.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/09\/word-image-11-768x356.png 768w\" sizes=\"(max-width: 1525px) 100vw, 1525px\" \/><\/p>\n<p>Note that for video with no audio, 100% of the content is marked for archival. This is because audio is required to detect \u201ccut\u201d utterances denoting the end of takes, which is necessary for identifying segments of footage that can be discarded. Without this additional audio component, the confidence to completely discard a portion of the video content would decrease.<\/p>\n
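<p>To make the audio dependency concrete, here is a rough sketch of how spoken \u201ccut\u201d cues could be located in an extracted audio track with the Azure Speech SDK. This is not the production code: the key, region, file name, and use of the baseline model (rather than the fine-tuned Custom Speech model) are illustrative assumptions, and utterance-level offsets are only an approximation of word-level timing.<\/p>\n<pre><code># Rough sketch: locate spoken 'cut' cues in an audio track with the Azure Speech SDK.
# The subscription key, region, and file name are placeholders; a Custom Speech
# deployment could be used instead by setting speech_config.endpoint_id.
import threading
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription='YOUR-SPEECH-KEY', region='YOUR-REGION')
audio_config = speechsdk.audio.AudioConfig(filename='daily_audio_track.wav')
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

cut_offsets = []
done = threading.Event()

def on_recognized(evt):
    # evt.result.offset is measured in 100-nanosecond ticks from the start of the audio.
    if 'cut' in evt.result.text.lower():
        cut_offsets.append(evt.result.offset / 10_000_000)  # convert ticks to seconds

recognizer.recognized.connect(on_recognized)
recognizer.session_stopped.connect(lambda evt: done.set())
recognizer.canceled.connect(lambda evt: done.set())

recognizer.start_continuous_recognition()
done.wait()
recognizer.stop_continuous_recognition()

print('approximate cut utterance times (seconds):', cut_offsets)
<\/code><\/pre>\n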
<h2><span style=\"font-size: 24pt;\">Summary<\/span><\/h2>\n<p>This project demonstrated an effective, iterative process for exploring Machine Learning techniques, in which the Data Scientists progressively increased precision and recall through a combination of visual and audio workstreams. Along with reaching a <strong>precision<\/strong> and <strong>recall<\/strong> of 96%, the software engineers on the team created infrastructure as code and DevOps\/MLOps code and configuration that allowed for both quick deployment and exploration on the ML side and effective promotion of the trained models to production.<\/p>\n<p>For a deeper dive into the work done by the Data Scientists, please see <a href=\"https:\/\/devblogs.microsoft.com\/cse\/2021\/09\/27\/archiving-footage-deep-dive\/\">Detecting \u201cAction\u201d and \u201cCut\u201d in Archival Footage Using a Multi-model Computer Vision and Audio Approach with Azure Cognitive Services<\/a>.<\/p>\n<h2><span style=\"font-size: 24pt;\">Acknowledgements<\/span><\/h2>\n<p>Special thanks to the engineers and contributors from Deltatre and WarnerMedia, especially Michael Green and David Sugg.<\/p>\n<p>Also, a big thank you to all contributors to our solution from the Microsoft Team (listed alphabetically by last name): Sergii Baidachnyi, Andy Beach, Kristen DeVore, Daniel Fatade, Geisa Faustino, Moumita Ghosh, Vlad Kolesnikov, Bryan Leighton, Vito Flavio Lorusso, Samuel Mendenhall, Simon Powell, Patty Ryan, Sean Takafuji, Yana Valieva, Nile Wilson<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Media companies capture footage filmed for the entire day in what&#8217;s known as \u2018digital dailies\u2019. When talking about terabytes and petabytes of content, storage costs can be a factor. Let&#8217;s explore Machine Learning approaches to identify which content can be archived or discarded, which will save on those storage costs.<\/p>\n","protected":false},"author":70702,"featured_media":13883,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[14,1,19],"tags":[60,127,3315,238,239,3314,250,3313],"class_list":["post-13826","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cognitive-services","category-cse","category-machine-learning","tag-azure","tag-computer-vision","tag-digital-dailies","tag-machine-learning","tag-machine-learning-ml","tag-media-communications","tag-microsoft-cognitive-services","tag-warnermedia"],"acf":[],"blog_post_summary":"<p>Media companies capture footage filmed for the entire day in what&#8217;s known as \u2018digital dailies\u2019. When talking about terabytes and petabytes of content, storage costs can be a factor. 
Let&#8217;s explore Machine Learning approaches to identify which content can be archived or discarded, which will save on those storage costs.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/13826","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/70702"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=13826"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/13826\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/13883"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=13826"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=13826"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=13826"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}