A Heavy Lift: Bringing Kestrel + YARP to Azure App Services

Byron Tardif

NOTE: This blog was originally posted to the App Service Team blog

In this post, we get a behind-the-scenes look at the engineering work required to change a critical platform component with code paths that are exercised billions of times a day while minimizing service disruptions and maintaining the SLA for our customers. We provide a brief introduction to help cover the basics, go over motivations for doing this work, explain some of the more interesting challenges, issues, and bugs encountered along the way, and close with the results and the new customer scenarios enabled.

The challenge was huge, but we’re excited about the benefits this brings to Azure App Services and our customers:

  • Almost 80% improvement in throughput in performance tests designed to isolate the benefits.
  • Greener Azure data centers from significantly decreased per-request CPU usage.
  • Support for modern protocols like HTTP/3.
  • Support for new customer scenarios such as gRPC applications, per-host cipher suite configuration, custom error pages, and more.

Introduction

In 2021, a group of engineers across multiple teams, including .NET and Azure, got together to transition the App Service Frontend fleet to Kestrel + YARP. As we celebrate the completion of this major lift and collaboration, we decided to write down the journey and describe some of the challenges of completing such a change to a live service, the wins we achieved, and the future work enabled by this transition. We hope you enjoy it.

Azure App Service in a nutshell

Azure App Service recently celebrated its 10-year anniversary (we launched it on June 7th, 2012). We are grateful and humbled by our customers who have helped us grow into a big service (affectionately called an XXL service in Azure internally, a designation shared with only 3 other services). Here are some numbers that provide a glimpse into our scale:

  • 160B+ daily HTTP requests served by applications
  • 14M+ host names
  • 1.5K+ multi-tenant scale units and an additional 10K+ dedicated scale units (App Service Environments aka App Service Isolated SKU)

One of the key architectural pieces of this system is our FrontEndRole. The FrontEndRole's main purposes are:

  • Receiving traffic on HTTP/HTTPS from public virtual IP addresses associated with a scale unit
  • Terminating SSL if required
  • Determining which set of VMs are the origin-servers for the application (called Workers) and then routing to them
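
To make the routing responsibility concrete, here is a minimal Python sketch of a lookup from host name to Workers. It is only an illustration: the `FrontEnd` class, the in-memory routing table, the example addresses, and the round-robin choice are all assumptions, and none of them describe how the real FrontEndRole selects a Worker.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class FrontEnd:
    """Toy model of the routing step: host name -> set of Workers."""

    # Hypothetical routing table keyed by lowercase host name.
    workers_by_host: Dict[str, List[str]]
    _cursors: Dict[str, Iterator[str]] = field(
        default_factory=dict, init=False, repr=False
    )

    def pick_worker(self, host: str) -> Optional[str]:
        """Return the next Worker for this host, or None if the host is unknown."""
        key = host.lower()  # host names are case-insensitive
        workers = self.workers_by_host.get(key)
        if not workers:
            return None  # a real front end would answer with an error response
        # Assumption: plain round-robin across the app's Workers.
        cursor = self._cursors.setdefault(key, itertools.cycle(workers))
        return next(cursor)


front_end = FrontEnd({"contoso.example": ["10.0.0.1", "10.0.0.2"]})
print(front_end.pick_worker("Contoso.Example"))
print(front_end.pick_worker("contoso.example"))
```

Successive calls for the same host alternate between its Workers, and an unknown host yields `None`.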

[Figure: FrontEndRole diagram]

App Service was originally built as a Cloud Service and this role is just called FrontEndRole. With our transition to VM Scale Sets, the FrontEndRole is a separate scale set which is part of each scale unit.

The original App Service FrontEndRole, which runs on Windows Server, consisted of:

Kestrel and YARP in a nutshell

The first release of .NET Core introduced the Kestrel webserver: an open-source, cross-platform, and fast webserver implementation built using modern .NET. Performance is a key focus for the .NET team, and with each .NET release, Kestrel has gotten ever faster and more full-featured. As an example, recent changes made to Kestrel include:

  • Significant scalability improvements on many-core machines
  • Significant HTTP/2 performance enhancements when running with many concurrent streams
  • Support for new standards like HTTP/3

YARP (“Yet Another Reverse Proxy”) is a reverse proxy toolkit that enables building fast proxy servers using infrastructure from ASP.NET and .NET, focusing on easy customization. It is developed in the open at https://github.com/microsoft/reverse-proxy. YARP’s toolkit/extensibility model made it easy for us to incorporate our routing and TLS handling with its request forwarding capabilities. YARP includes support for modern protocols like HTTP/2 & HTTP/3, which App Service customers can now expose. In addition, being based on the fast-evolving .NET platform means that every release, Kestrel and YARP benefit from improvements up and down the .NET stack, including everything from networking libraries all the way down to JIT compiler improvements that improve the quality of generated code. For a sampling of the types of improvements that went into just the .NET 6 release in 2021, see Performance Improvements in .NET 6.

Betting on Kestrel + YARP for App Service: Why?

The previous FrontEndRole architecture of App Service built on IIS/HTTP.SYS has served us well, but the promise of a modern HTTP stack in Kestrel + YARP could deliver new benefits to all App Service customers. Specifically:

  • Performance improvements, including significantly decreased per-request CPU cost and per-connection memory cost.
  • More flexible extensibility points into SSL termination path, allowing for easier dynamic SNI host selection.
  • Enable new customer scenarios like support for gRPC, per-host cipher suite configuration, custom error pages, and more.

With all that context and motivation, the goal of the V-Team was clear:

“Transition the 200K+ dedicated cores running FrontEndRole to use Kestrel + YARP (and thus move away from IIS/HTTP.SYS/ARR)”

Challenge: Server Framework Diversity

App Service is not the first Microsoft service to transition to Kestrel and YARP. Microsoft has already documented the journeys of Bing, Azure Active Directory (AAD), and Dynamics 365 to .NET; these efforts have proven out the stability and performance of .NET for critical service workloads.

The unique challenge that App Service adds to the mix is the diversity of server implementations. The previously mentioned Microsoft services are written by server engineers working for … Microsoft. This is definitely not the case for App Service, which enables customers to bring their own server frameworks and write their own applications with varying levels of standards compliance. Hosting customer applications brings a unique set of challenges described below.

Challenge: Platform versus Organic Health

Because App Service enables customers to write their own applications, the concept of “service health” is a nuanced discussion. App Service measures the health of the platform; we ensure that customers have a running VM which can connect to storage and can execute a simple canary request. But App Service cannot easily measure the organic health (HTTP request success rate) since we do not control the application. As a result, we primarily focused on platform health as our main metric.

For our transition to Kestrel + YARP, we needed to broaden our measurement to include organic health. Rather than looking for an absolute bar (say >99.99% success), we needed to compare “before Kestrel + YARP” and “after Kestrel + YARP” organic success and look for anomalies that would point out potential problems.
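
The difference between an absolute bar and a before/after comparison can be sketched in a few lines of Python. This is a simplified illustration rather than the monitoring we actually ran: the `Window` type, the `is_regression` helper, and the 1% `max_drop` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts for one application over one observation window."""

    total: int
    succeeded: int

    @property
    def success_rate(self) -> float:
        # Treat an idle window as healthy to avoid dividing by zero.
        return self.succeeded / self.total if self.total else 1.0


def is_regression(before: Window, after: Window, max_drop: float = 0.01) -> bool:
    """Flag an app whose success rate fell by more than max_drop after the switch."""
    return before.success_rate - after.success_rate > max_drop


# An app that already fails 10% of its requests is not flagged if nothing changed...
print(is_regression(Window(1000, 900), Window(1000, 898)))
# ...but an app that was healthy before and degraded afterwards is.
print(is_regression(Window(1000, 999), Window(1000, 950)))
```

The point of comparing windows is that an application with a low but stable success rate is the customer's business, while a change in success rate that coincides with the switch is a signal worth investigating.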

Challenge: Quick Rollback in Production

With a broadened approach to assessing organic health anomalies caused by a diverse set of applications/frameworks on our platform, we required fast mechanisms to undo our Kestrel + YARP transition on a per scale-unit basis; in other words, we needed to be able to “break glass” quickly when we encountered problems and return to using IIS/HTTP.SYS.
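
As a rough illustration of what "per scale-unit" rollback means, the Python sketch below models the stack choice as independent state for each scale unit. The `RolloutState` class, its method names, and the scale unit names are hypothetical; the actual mechanism is not described in this post.

```python
from enum import Enum
from typing import Dict, Iterable


class Stack(Enum):
    IIS = "IIS/HTTP.SYS"
    KESTREL_YARP = "Kestrel + YARP"


class RolloutState:
    """Tracks which HTTP stack each scale unit should run (hypothetical model)."""

    def __init__(self, scale_units: Iterable[str]) -> None:
        # Every scale unit starts on the existing stack.
        self._stack: Dict[str, Stack] = {unit: Stack.IIS for unit in scale_units}

    def enable_kestrel(self, unit: str) -> None:
        self._stack[unit] = Stack.KESTREL_YARP

    def break_glass(self, unit: str) -> None:
        """Return one scale unit to IIS/HTTP.SYS without touching the others."""
        self._stack[unit] = Stack.IIS

    def stack_for(self, unit: str) -> Stack:
        return self._stack[unit]


state = RolloutState(["scale-unit-1", "scale-unit-2"])
state.enable_kestrel("scale-unit-1")
state.enable_kestrel("scale-unit-2")
state.break_glass("scale-unit-1")  # only this unit goes back to IIS/HTTP.SYS
print(state.stack_for("scale-unit-1").value, "|", state.stack_for("scale-unit-2").value)
```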

The Journey: 100% FrontEndRoles using Kestrel/YARP

With all the context and challenges described, here is what the journey looked like in a picture.

[Figure: FrontEndRole Migration]

As you can see, this journey took a lot of time: six months passed between the first Kestrel/YARP deployment and 100%.

The Bugs Encountered

We encountered multiple bugs on our journey to Kestrel + YARP. Apart from bugs in our business logic, one of the interesting classes of issue we encountered was the treatment of edge cases in the HTTP specification. A diversity of clients hit our FrontEndRole, so we need to be generous in accepting behavior that may not exactly follow the letter and intent of the HTTP spec.

A simple example of one of these cases is when a request has leading newline characters (CR and/or LF). Strictly speaking, this isn’t allowed, but it turns out that there are some clients that send requests that start like:

\rGET / HTTP/1.1\r\n
...

This is a case that IIS (and some other servers) allow, but because Kestrel historically has taken a fairly strict stance, we saw its parser rejecting requests like this with a BadHttpRequestException. Working closely with the ASP.NET Core team, we were able to make Kestrel a bit more generous in what it accepts (the example above now works in Kestrel in .NET 6.0.5 and newer releases).
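
The strict and lenient behaviors can be contrasted with a small Python sketch. This is not Kestrel's parser: `parse_request_line`, its `tolerate_leading_newlines` flag, and the simplified request-line pattern (uppercase methods only) are assumptions made for illustration.

```python
import re
from typing import Tuple

# Simplified request-line grammar: METHOD SP target SP HTTP/x.y
_REQUEST_LINE = re.compile(rb"([A-Z]+) (\S+) (HTTP/\d\.\d)")


def parse_request_line(
    raw: bytes, tolerate_leading_newlines: bool = True
) -> Tuple[bytes, bytes, bytes]:
    """Parse the request line at the start of a raw HTTP/1.x request."""
    if tolerate_leading_newlines:
        # Lenient mode: skip any CR/LF bytes that precede the request line.
        raw = raw.lstrip(b"\r\n")
    line, _sep, _rest = raw.partition(b"\r\n")
    match = _REQUEST_LINE.fullmatch(line)
    if match is None:
        raise ValueError(f"malformed request line: {line!r}")
    method, target, version = match.groups()
    return method, target, version


request = b"\rGET / HTTP/1.1\r\nHost: example.test\r\n\r\n"
print(parse_request_line(request))  # lenient: the stray CR is ignored
try:
    parse_request_line(request, tolerate_leading_newlines=False)
except ValueError as error:
    print("strict parser rejected:", error)
```

In strict mode the stray CR becomes part of the method token and the line is rejected, which is analogous to the failure we saw; lenient mode discards it before parsing.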

Some other interesting issues uncovered can be found here, here and here.

As a result of investigating and addressing this class of issues, we’ve made Kestrel a more capable server without compromising the core principle of security.

The Payoff: Performance and New Features, Now and in the Future

Now that we have moved our FrontEndRoles to Kestrel + YARP we have realized multiple benefits in production.

Performance tests designed to isolate the benefits of our FrontEndRole change showed an almost 80% improvement in throughput (tested using a simple 1K helloworld response from a single dedicated worker in a test environment). App Service over-provisions FrontEndRole instances, so the realized benefit across our aggregate fleet is a large decrease in CPU% which provides more CPU headroom for the fleet. We are still in the early days of monitoring the fleet post-move; we may eventually be able to decrease our cores assigned to this role to reduce operating costs and data center energy usage. More investigation to follow.

With our move to Kestrel + YARP on our FrontEndRoles, we were also able to move our Linux worker VMs to use Kestrel+YARP. This change allows us to replace nginx, commonize the codebase, and light up gRPC for our App Service Linux SKUs. gRPC support has been a popular feature request from Azure App Services users and we’re excited to add this capability.

With this platform work complete we are now working on enabling two of the most frequently requested features in App Service; more news coming soon as we complete these improvements:

  • Ability to configure custom error pages for requests that terminate on the front end (Specifically: HTTP 503, HTTP 502 and HTTP 403).
  • Ability to specify the TLS cipher suites allowed for a given application. Today customers can only configure allowed cipher suites on our Isolated SKUs.

A Great Partnership

Once you have a live multi-tenant service running with millions of VMs globally, you learn to be very careful with how and when you advance it. That said, the innovations with Kestrel + YARP being developed by the .NET core team were just too valuable to pass up. At the same time, the .NET team would tell you the experience of supporting this migration was a whole new challenge for that team as they experienced the breadth and diversity of App Service scenarios. This was a great journey for both teams and we landed it. Now that we have this new platform in place, we look forward to continued innovation between our teams.

13 comments


  • The Bitland Prince 0

    Thanks Byron for the post. Quite interesting to see Microsoft still does some dogfooding 😏

    Yet you forgot to mention one thing: which is the host operating system for the new Frontend role built upon Kestrel + YARP? I hope that still is Windows… 🤔🤔

    Another very interesting news is that you replaced Nginx with Kestrel. Awesome.

    • David Fowler (Microsoft employee) 0

      It is still running on Windows, yes.

      • The Bitland Prince 0

        Thank you, David. That’s what I hoped.

    • Byron Tardif (Microsoft employee) 0

      We are using Windows for the App Service Front Ends, but I think that is beside the point. The more important thing in my opinion is that if you are writing a .NET app you can rely on the same battle tested components that we use to run global scale service and the same code can run anywhere from a RaspberryPI to a Mac Laptop to a Windows Server or a planet scale PaaS offering.

      • The Bitland Prince 0

        Byron,

        I definitely agree. From my LinkedIn post about your post:

        The juicy news here is not only that Kestrel + YARP is being used at massive scale by MS itself, quite a proof, but also that Azure is replacing Nginx with Kestrel + YARP for Linux worker VMs. Which is indeed interesting.

        However, my question had a goal. If you at Microsoft are gradually but steadily replacing Windows Server with Linux, WS IS the technology we shouldn’t invest into anymore. MS is sending strong signals about that so I personally use any chance to better understand the trend. And I’m telling you because WS is our choice.

        Thanks for confirming the OS you are using. Great post. Thanks.

        • Evan Basalik 0

          There are places like WAF v2 (nginx under the covers) where we switched to Linux because it made more sense to build off something that is super common across the industry vs. building from scratch. On the other hand our deep experience with Windows virtualization means that all of our physical hosts underneath Azure VMs are Windows, as are the VMs underneath our various flavors of SQL Server. Then, we have solutions like Arc or AAD client libraries which are multi-platform by definition and SQL Server itself is fully supported when running on Linux or in a container.

          The reality is that these days we use whichever one makes more sense. We truly see them both as vital to our business. I was just in a meeting yesterday where much of the Azure Engineering leadership team was reviewing an outage that impacted purely a Linux-only extension.

          You should pick whichever OS makes the most sense for your business and we’ll be happy to meet you wherever you are.

          • The Bitland Prince 0

            Evan, I understand this explanation under the service provider point of view and as a service provider, we tell our customers the same thing: we provide to you anything, use what you want.

            However, when you’re building a solution or developing a solution it is not that you would use tens of Web servers or tens of other components that do the same thing. You usually pick a (very) short list of them, possibly one of a kind. So for example our infrastructure is based on Windows virtualization and it would be way better if we hadn’t to put anything else into because we need something that such platform is not offering. On top of that, and on top of your OS of choice, you develop your custom integrations, software, services, whatever. Because, as you said, it depends on the experience that you built during the years and it’s not that we want to use IIS and Apache and Nginx and LiteSpeed at the same time just because they are available. If possible, we would use IIS for anything for the very same reason why Byron said that using Kestrel + YARP instead of Nginx would help you guys use a single code base for everything. That’s the point.

            So what I’m trying to understand is if Windows (Server) is still a relevant project in Microsoft or not, if it’s actively evolved or it is just a side project that gets some new feature sometimes. When someone in Microsoft tells me that Windows Server is still relevant, I trust him but then I see that instead of evolving and adding new features to the OS, you usually replace it even at the core of your infrastructure. Which is the opposite of evolving the product. When you say that instead of adding capabilities that would make WS work for WAF services, you simply replaced it with Linux, that’s the opposite of evolving the product and it is worrisome.

            So I have to take strategic decisions about which platform to use in our company, there’s a difference if I feel (and note) that Windows is actively replaced in Microsoft or if Windows is actively deployed AND it gets the new features it needs. Because, as you know, a platform that doesn’t evolve is a dead duck and I wouldn’t build my solution on top of it just to get struck with an old, non-modernized platform at a later time. So if we note that MS is actively replacing WS with Linux and it switches components as soon as possible, there’s no point to start new projects with WS. That’s the thing I’m actively trying to understand and why I’m monitoring most of your decisions about it.

            So I’m positively impressed when you write that you’re using Kestrel + YARP and that you’re going to replace Nginx with that. I feel that we can build upon Kestrel. But if you say, instead of adding to our server OS the feature it needed to address a specific issue we decided to replace it, it is worrisome and it feels like Windows Server is a side-project that could be left behind at any moment and we should probably not be building anything upon it. Ultimately, this is not a light choice for us or for anyone, I guess.

            I’m fascinated how Microsoft people sometimes talk about Windows Server as if it was someone else’s product. If I go to a Ferrari cars dealer, but even to an Audi one, the dealer won’t tell me to go to buy someone else’s car if their car is missing some nice feature, he will tell me to wait for the next model that will have such features. 😉

            Sorry for the long post: delete if it is not appropriate to have it here. I’m relieved to know that Azure is hosting VMs using Windows Virtualization, makes us confident that we can trust WS at least for that and that we can keep building our core infrastructure upon that.

            Do you maybe have any kind of roadmap for WS or at least a list of new features for the previews? We couldn’t find any documentation/changelog for previews. It would be great to know what the new features will be in order to be able to take decisions about them.

            Thanks.

  • Andrew W 0

    Wow, great work! That’s a daunting task, making sure all the existing sites still work as expected! You listed three interesting issues, how many issues in total did your organic health checks find? And did you see much improvement in the memory footprint vs IIS?

  • Onur Gümüş 0

    Does that mean we will have HTTP 3 support soon?

    • Byron Tardif (Microsoft employee) 0

      It’s unblocked, can’t comment on timeline at the moment but this was the first step towards unblocking that scenario.

  • Ngo Sinh 0

    What do I need to do once the migration is done?
    For example, does web.config still work?

    • Byron Tardif (Microsoft employee) 0

      There is no action required. This change is not done on the webserver where your app is hosted, but rather on common infrastructure that acts as a load balancer and should be 100% transparent to you.
