{"id":343,"date":"2022-03-29T09:10:26","date_gmt":"2022-03-29T16:10:26","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/?p=343"},"modified":"2022-03-29T09:10:26","modified_gmt":"2022-03-29T16:10:26","slug":"the-pursuit-of-an-autonomic-scale-and-efficiency-system-for-microsoft-365-making-it-as-easy-as-breathing","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/the-pursuit-of-an-autonomic-scale-and-efficiency-system-for-microsoft-365-making-it-as-easy-as-breathing\/","title":{"rendered":"The pursuit of an autonomic scale and efficiency system for Microsoft 365: Making it as easy as breathing"},"content":{"rendered":"<p>Engineers face a daunting amount of complexity &#8212; customer requirements, architecture, monitoring, compliance, security, tools, scale, testing, cost, processes, bugs, incidents, and much more. In this post, I dip into the scale and efficiency investments we&#8217;ve made to empower innovation for the 1000+ services across the Microsoft 365 Cloud.<\/p>\n<p>The North Star of the experiences for scale and efficiency takes a page from the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Autonomic_nervous_system\">autonomic nervous system<\/a> as inspiration. This part of the nervous system is responsible for control of the bodily functions not consciously directed, such as breathing, the heartbeat, and digestive processes. Could you imagine having to constantly think about making your heart beat?<\/p>\n<p>This model exemplifies the overall pursuit of many of the investments in the scale and efficiency fabric developed to support Microsoft 365 cloud services. The aspirational desire is for engineers to be at the center of experiences that naturally integrate into the way <em>they<\/em> do work. 
While these experiences have not yet achieved the ease of breathing without thinking, they are continuously evolving to make scale and efficiency more natural, in pursuit of a self-optimizing cloud.<\/p>\n<h3><strong>Writing efficient code<\/strong><\/h3>\n<p>Traditionally, engineers writing code are disconnected from the runtime context of the production environments to which their code will deploy, and from the implications of their code for those environments. The process of understanding a production environment (especially from the perspective of performance) is something the engineer must remember to do manually.<\/p>\n<p>To bridge this gap, M365 Core developed the Cloud Profiling and Reporting Pipeline and a set of experiences that connect the development environment with the performance context of the code base\u2019s associated production environment. Through automated profiling and data collection of performance behavior, we can now derive the context with which to inform the engineer about the impact of their code, as they write it. These experiences identify bad patterns directly in line with the engineer\u2019s code development environment and, at the same time, present relative cost in terms that are relatable and meaningful.<\/p>\n<p>The Cloud Profiling and Reporting Pipeline captures a vast spectrum of data from the cloud, including CPU, memory allocations, rooted memory, redundant <em>duplicated<\/em> instances of memory, exceptions, file and path IO, garbage collection stack pause times by generation and reason, block time latency stacks, and many others. 
Integrated experiences then surface relevant data within an engineer\u2019s IDE like so:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image1.gif\"><img decoding=\"async\" class=\"alignnone wp-image-345 size-full\" src=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image1.gif\" alt=\"A short video clip showing the M365 Visual Studio code lens experience for understanding the production cost trend of a method.\" width=\"1997\" height=\"1270\" \/><\/a><\/p>\n<p>The integrated and interactive experience takes this even further by allowing the engineer to explore the current runtime cloud costs leading up to the current method of interest and the calls it makes. What is striking here is that the experience bundles trillions of samples collected from the M365 cloud through the Cloud Profiling and Reporting Pipeline and paints them onto a highly accessible canvas for the engineer.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-2.gif\"><img decoding=\"async\" class=\"alignnone wp-image-346 size-full\" src=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-2.gif\" alt=\"A short video clip showing the M365 Visual Studio code peek experience for understanding where expensive calls to that method originate.\" width=\"1605\" height=\"1100\" \/><\/a><\/p>\n<p>This is complemented by a suite of Roslyn Analyzers that integrate best-practice coding patterns within the IDE. An extensive library of static analysis rules works tirelessly to help engineers generate optimal code. 
These have been crafted to codify the knowledge of performance experts, democratizing efficient patterns gleaned from extensive cloud analysis.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-3-1.png\"><img decoding=\"async\" class=\"alignnone wp-image-348 size-full\" src=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-3-1.png\" alt=\"A screenshot of the Roslyn rule and associate code fix within a hot-path method.\" width=\"1430\" height=\"369\" srcset=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-3-1.png 1430w, https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-3-1-300x77.png 300w, https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-3-1-1024x264.png 1024w, https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-3-1-768x198.png 768w\" sizes=\"(max-width: 1430px) 100vw, 1430px\" \/><\/a><\/p>\n<p>The experiences carry forward into the code review process for engineers. 
Helpful efficiency bots automatically comment on code reviews when potential efficiency opportunities are detected.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-4.png\"><img decoding=\"async\" class=\"alignnone wp-image-349 size-full\" src=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-4.png\" alt=\"A screenshot of a DevOps pull request comment left by the M365 CPR bot, explaining a performance optimization opportunity.\" width=\"1248\" height=\"1075\" srcset=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-4.png 1248w, https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-4-300x258.png 300w, https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-4-1024x882.png 1024w, https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-4-768x662.png 768w\" sizes=\"(max-width: 1248px) 100vw, 1248px\" \/><\/a><\/p>\n<p>We are working on leveraging process state to be able to make even more exact suggestions on code improvements. For the example above, the average list size can be learned from production data, enabling specific (and optimal) recommendations.<\/p>\n<h3><strong>Zoom and enhance: Code-level production anomaly detection of resource changes<\/strong><\/h3>\n<p>With thousands of code changes committed every week, another key innovation area is seamless detection, isolation, and root causing of efficiency changes as they occur&#8211;anywhere across 1000+ services in Microsoft 365. Given the immense scale of Microsoft 365 cloud services, the impact of any one change can be quite large. 
Autonomous detection of efficiency changes is delivered by applying robust anomaly detection algorithms to the extensive Cloud Profiling and Reporting Pipeline dataset. The result is continuous orchestrated engagement and collaboration across hundreds of teams to resolve issues, with a material impact on Microsoft\u2019s ability to continuously innovate while delivering fiscally responsible and sustainable cloud services. Heatmaps of the Cloud Profiling and Reporting Pipeline data pinpoint efficiency changes all the way down to the code level.<\/p>\n<p><em>CPU frame-level heatmap of Cloud Profiling and Reporting Pipeline data from an analysis experience for a detected CPU anomaly:<\/em><\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-5.png\"><img decoding=\"async\" class=\"aligncenter wp-image-350 size-full\" src=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-5.png\" alt=\"A screenshot of a heatmap showing the flare ups in method performance that were automatically detected.\" width=\"639\" height=\"285\" srcset=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-5.png 639w, https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-5-300x134.png 300w\" sizes=\"(max-width: 639px) 100vw, 639px\" \/><\/a><\/p>\n<p>Alerting on the anomalies flows as a connected experience from an event-driven incident management system into customized data analysis experiences providing scoping, issue tracking, automated analysis insights, and frequency distributions. These are designed to further accelerate root causing efficiency changes. 
Below is a small example of the automated insights surfaced on these data analysis canvases:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-6.png\"><img decoding=\"async\" class=\"wp-image-351 size-full aligncenter\" src=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-6.png\" alt=\"A chart showing the number of times various processes have breached performance threshold.\" width=\"254\" height=\"184\" \/><\/a><a href=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/Analyzer-Table.png\"><img decoding=\"async\" class=\"wp-image-369 aligncenter\" src=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/Analyzer-Table.png\" alt=\"A table showing the pipeline's ability to catalog the top processes and their associated top call stacks.\" width=\"352\" height=\"182\" srcset=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/Analyzer-Table.png 518w, https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/Analyzer-Table-300x155.png 300w\" sizes=\"(max-width: 352px) 100vw, 352px\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<h3><strong>Flexible and extensible<\/strong><\/h3>\n<p>With the flexible and extensible nature of the Cloud Profiling and Reporting Pipeline, there are many streams of data that are collected and available for analysis. Not all of them can be intuitively streamlined into the developer experiences mentioned above. 
To explore these streams in detail, a dedicated data viewer allows the engineer (and the performance expert) to dive deep into all the data collected.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-8.png\"><img decoding=\"async\" class=\"aligncenter wp-image-354 size-full\" src=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-8.png\" alt=\"A screenshot of the M365 profiling data Viewer displaying call stacks as a flame graph.\" width=\"468\" height=\"180\" srcset=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-8.png 468w, https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-8-300x115.png 300w\" sizes=\"(max-width: 468px) 100vw, 468px\" \/><\/a><\/p>\n<p>Not only is there a multitude of visualizations to help make sense of the significance of the data, but the engineer can also perform distributed authoring of issues directly from the viewer; those issues are then visible to all other engineers using the viewer. 
This is geared towards getting engineers to acknowledge and take ownership of performance issues affecting Microsoft 365 cloud services while simultaneously sharing known issues with the broader engineering community.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-9.png\"><img decoding=\"async\" class=\"aligncenter wp-image-355 size-full\" src=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-9.png\" alt=\"A screenshot showing the expensive nature of contention and how it is displayed in the Viewer.\" width=\"624\" height=\"77\" srcset=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-9.png 624w, https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-9-300x37.png 300w\" sizes=\"(max-width: 624px) 100vw, 624px\" \/><\/a><a href=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-10.png\"><img decoding=\"async\" class=\"aligncenter wp-image-356 \" src=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-10.png\" alt=\"A screenshot showcasing the ability to tag and associate work items with specific methods in the Viewer.\" width=\"592\" height=\"73\" srcset=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-10.png 468w, https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-10-300x37.png 300w\" sizes=\"(max-width: 592px) 100vw, 592px\" \/><\/a><\/p>\n<p>Trending data over time is also an important scenario, providing much-needed context for the significance of an issue.<\/p>\n<p><a 
href=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-11.png\"><img decoding=\"async\" class=\"aligncenter wp-image-357 size-full\" src=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-11.png\" alt=\"A line chart showing the Viewer's capability for trending profiled method performance over time.\" width=\"620\" height=\"236\" srcset=\"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-11.png 620w, https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-content\/uploads\/sites\/72\/2022\/03\/M365-image-11-300x114.png 300w\" sizes=\"(max-width: 620px) 100vw, 620px\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<h3><strong>To infinity and beyond: The self-optimizing cloud<\/strong><\/h3>\n<p>Forward-looking investments continue the journey towards making these experiences an unconscious part of bringing great new features efficiently to the cloud. This is an aspirational pursuit of a fully autonomous self-optimizing cloud that will one day be capable of crowdsourcing new emergent efficient patterns from all engineers while virtuously changing existing code to adopt those patterns, without anyone having to think about the process. As machine learning continues to make awe-inspiring progress in Code AI with innovative solutions like <a href=\"https:\/\/copilot.github.com\/\">GitHub Copilot<\/a> and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/perflens-a-data-driven-performance-bug-detection-and-fix-platform\/\">PerfLens: A Data-Driven Performance Bug Detection and Fix Platform &#8211; Microsoft Research<\/a>, this world of science fiction becomes increasingly accessible. 
Several tractable investments toward this goal are already underway for the next chapter of this journey.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Through automated profiling and data collection of performance behavior, Microsoft\u2019s M365 Core team can derive the context with which to inform the engineer about the impact of their code, as they write it. Randy Lehner likens it to the autonomic nervous system in this post on their Cloud Profiling and Reporting Pipeline.<\/p>\n","protected":false},"author":85501,"featured_media":374,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[24,25,23],"class_list":["post-343","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-engineering-at-microsoft","tag-efficiency","tag-m365","tag-scale"],"acf":[],"blog_post_summary":"<p>Through automated profiling and data collection of performance behavior, Microsoft\u2019s M365 Core team can derive the context with which to inform the engineer about the impact of their code, as they write it. 
Randy Lehner likens it to the autonomic nervous system in this post on their Cloud Profiling and Reporting Pipeline.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-json\/wp\/v2\/posts\/343","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-json\/wp\/v2\/users\/85501"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-json\/wp\/v2\/comments?post=343"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-json\/wp\/v2\/posts\/343\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-json\/wp\/v2\/media\/374"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-json\/wp\/v2\/media?parent=343"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-json\/wp\/v2\/categories?post=343"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/engineering-at-microsoft\/wp-json\/wp\/v2\/tags?post=343"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}