Azure Active Directory’s gateway is on .NET 6.0!

Avanindra

Azure Active Directory’s gateway service is a reverse proxy that fronts hundreds of services that make up Azure Active Directory (Azure AD). If you’ve used services such as office.com, outlook.com, portal.azure.com or xbox.live.com, then you’ve used Azure AD’s gateway. The gateway provides features such as TLS termination, automatic failovers/retries, geo-proximity routing, throttling, and tarpitting to services in Azure AD. The gateway is present in 54 Azure datacenters worldwide and serves ~185 billion requests each day. Until recently, Azure AD’s gateway was running on .NET 5.0. As of September 2021, it’s running on .NET 6.0.

Efficiency gains by moving to .NET 6.0

The image below shows that application CPU utilization dropped by 33% for the same traffic volume after moving to .NET 6.0 on our production fleet.

[Image: application CPU utilization before and after moving to .NET 6.0]

This means that our application efficiency went up by roughly 50%. Application efficiency is one of the key metrics we use to measure performance and is defined as:

Application efficiency = (Requests per second) / (CPU utilization of application)
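
As a quick check on how a 33% drop in CPU utilization becomes a roughly 50% efficiency gain, apply the definition above with the traffic volume held constant. The symbols are introduced here only for illustration: R is requests per second, C is the application's CPU utilization on .NET 5.0, and E is application efficiency.

```latex
\[
\frac{E_{\text{.NET 6.0}}}{E_{\text{.NET 5.0}}}
  = \frac{R / \bigl((1 - 0.33)\,C\bigr)}{R / C}
  = \frac{1}{0.67}
  \approx 1.49
\]
```

In other words, serving the same traffic on about two-thirds of the CPU corresponds to an efficiency gain of about 49%, which rounds to the 50% quoted above.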

Changes made in .NET 6.0 upgrade

Along with the .NET 6.0 upgrade, we made two major changes:

  1. Migrated from IIS to HTTP.sys server. This was made possible by new features in .NET 6.0.
  2. Enabled dynamic PGO (profile-guided optimization). This is a new feature of .NET 6.0.

The following sections will describe each of those changes in more detail.

Migrating from IIS to HTTP.sys server

There are 3 server options to pick from in ASP.NET Core:

  • Kestrel
  • HTTP.sys server
  • IIS

A previous blog post describes why Azure AD gateway chose IIS as the server to run on during our .NET Framework 4.6.2 to .NET Core 3.1 migration. During the .NET 6.0 upgrade, we migrated from IIS to HTTP.sys server. Kestrel was not chosen due to the lack of certain TLS features our service depends on (support is expected by June 2022 in Windows Server 2022).

By migrating from IIS to HTTP.sys server, Azure AD gateway saw the following benefits:

  • A 27% increase in application efficiency.
  • Deterministic queuing model: HTTP.sys server runs on a single-queue system, whereas IIS has an internal queue on top of the HTTP.sys queue. The double-queue system in IIS results in unique performance problems, especially in high-concurrency situations. (Issues in IIS can potentially be offset by tweaking Windows registry keys such as HKLM:SYSTEM\CurrentControlSet\Services\W3SVC\Performance\ReceiveRequestsPending.) By removing IIS and moving to a single-queue system on HTTP.sys, we got a deterministic model, and the queuing issues caused by rate mismatches between the two queues disappeared.
  • Improved deployment and autoscale experience: The move away from IIS simplifies deployment since we no longer need to install/configure IIS and ANCM before starting the website. Additionally, TLS configuration is easier and more resilient as it needs to be specified at just one layer (HTTP.sys) instead of two as it had been with IIS.

The following are some of the changes that were made while moving from IIS to HTTP.sys server:

  • TLS renegotiation: Renegotiation provides the ability to do optional client certificate negotiation based on HTTP constructs such as request path.

Example: On IIS, during the initial TLS handshake with the client, the server can be configured to not request a client certificate. However, if the path of the request contains, say “foo”, IIS triggers a TLS renegotiation and requests a client certificate.

The following web.config configuration enables path-based TLS renegotiation on IIS:

  <location path="foo">
      <system.webServer>
        <security>
          <access sslFlags="Ssl, SslNegotiateCert, SslRequireCert"/>
        </security>
      </system.webServer>
  </location>

In HTTP.sys server hosting (.NET 6.0 and up), the above configuration is expressed in code by calling GetClientCertificateAsync() as below.

  // Requires: using System.Security.Cryptography.X509Certificates; using System.Threading;
  // The default renegotiation timeout in HTTP.sys is 120 seconds.
  const int RenegotiateTimeOutInMilliseconds = 120000;
  X509Certificate2 cert = null;
  // PathString values must start with '/'.
  if (httpContext.Request.Path.StartsWithSegments("/foo"))
  {
    if (httpContext.Connection.ClientCertificate == null)
    {
      // Cancel the renegotiation if it exceeds the timeout.
      using (var cts = new CancellationTokenSource(RenegotiateTimeOutInMilliseconds))
      {
        cert = await httpContext.Connection.GetClientCertificateAsync(cts.Token);
      }
    }
  }

For GetClientCertificateAsync() to trigger a renegotiation, the following setting must be set in HttpSysOptions (the enum member really is spelled AllowRenegotation in the API):

  options.ClientCertificateMethod = ClientCertificateMethod.AllowRenegotation;

  • Mapping IIS Server variables:

    • On IIS, TLS information such as CRYPT_PROTOCOL, CRYPT_CIPHER_ALG_ID, CRYPT_KEYEXCHANGE_ALG_ID and CRYPT_HASH_ALG_ID is obtained from IIS server variables.
    • On HTTP.sys server, equivalent information is exposed via ITlsHandshakeFeature’s Protocol, CipherAlgorithm, KeyExchangeAlgorithm and HashAlgorithm respectively.
  • Ability to interpret non-ASCII headers:

The gateway receives millions of headers each day with non-ASCII characters in them and the ability to interpret non-ASCII headers is important. Kestrel and IIS already have this ability, and in .NET 6.0, Latin1 request header encoding was added for HTTP.sys as well. It can be enabled using HttpSysOptions as shown below.

  options.UseLatin1RequestHeaders = true;

  • Observability:

In addition to .NET telemetry, the health of a service can be monitored by plugging into a wealth of telemetry exposed by HTTP.sys such as:

Http Service Request Queues\ArrivalRate
Http Service Request Queues\RejectedRequests
Http Service Request Queues\CurrentQueueSize
Http Service Request Queues\MaxQueueItemAge
Http Service Url Groups\ConnectionAttempts
Http Service Url Groups\CurrentConnections
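
To show where the HTTP.sys settings and the TLS handshake feature from the list above fit together, here is a minimal sketch. It assumes a .NET 6.0 minimal-hosting Program.cs running on Windows. The log message format and the "/" endpoint are hypothetical, and this is not the gateway's actual startup code.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Connections.Features;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Server.HttpSys;
using Microsoft.Extensions.Logging;

var builder = WebApplication.CreateBuilder(args);

// HTTP.sys is Windows-only; this replaces the default Kestrel server.
builder.WebHost.UseHttpSys(options =>
{
    // Lets GetClientCertificateAsync() trigger a TLS renegotiation.
    options.ClientCertificateMethod = ClientCertificateMethod.AllowRenegotation;
    // Interpret request header values using Latin1 encoding.
    options.UseLatin1RequestHeaders = true;
});

var app = builder.Build();

// Hypothetical middleware: log the TLS details that IIS exposed as server variables.
app.Use(async (context, next) =>
{
    var tls = context.Features.Get<ITlsHandshakeFeature>();
    if (tls != null) // null when the feature is unavailable, e.g. plain HTTP
    {
        app.Logger.LogInformation(
            "TLS {Protocol} cipher={Cipher} keyExchange={KeyExchange} hash={Hash}",
            tls.Protocol, tls.CipherAlgorithm, tls.KeyExchangeAlgorithm, tls.HashAlgorithm);
    }
    await next(context);
});

app.MapGet("/", () => "ok"); // placeholder endpoint for the sketch
app.Run();
```

Keeping both options in the single UseHttpSys callback reflects the point made earlier: with HTTP.sys hosting, server and TLS behavior is configured at one layer.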

Enabling Dynamic PGO (profile-guided optimization)

Dynamic PGO is one of the most exciting features of .NET 6.0! PGO can benefit .NET 6.0 applications by maximizing steady-state performance.

Dynamic PGO is an opt-in feature in .NET 6.0. There are 3 environment variables you need to set to enable dynamic PGO:

  • set DOTNET_TieredPGO=1. This setting leverages the initial Tier0 compilation of methods to observe method behavior. When methods are rejitted at Tier1, the information gathered from the Tier0 executions is used to optimize the Tier1 code. Enabling this switch increased our application efficiency by 8.18% compared to plain .NET 6.0.
  • set DOTNET_TC_QuickJitForLoops=1. This setting enables tiering for methods that contain loops. Enabling this switch (in conjunction with the switch above) increased our application efficiency by 10.2% compared to plain .NET 6.0.
  • set DOTNET_ReadyToRun=0. The core libraries that ship with .NET come with ReadyToRun enabled by default. ReadyToRun allows for faster startup because there is less to JIT compile, but this also means code in ReadyToRun images doesn’t go through the Tier0 profiling process which enables dynamic PGO. By disabling ReadyToRun, the .NET libraries also participate in the dynamic PGO process. Setting this switch (in conjunction with the two above) increased our application efficiency by 13.23% compared to plain .NET 6.0.
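
Because these switches are read when the runtime starts, a missing or mistyped variable fails silently. One way to catch that is a startup check like the sketch below. The class and method names (DynamicPgoSettings, FindMismatches) are hypothetical. Only the three variable names and values come from the list above.

```csharp
using System;
using System.Collections.Generic;

public static class DynamicPgoSettings
{
    // Expected values for the three switches listed above.
    public static readonly IReadOnlyDictionary<string, string> Expected =
        new Dictionary<string, string>
        {
            ["DOTNET_TieredPGO"] = "1",
            ["DOTNET_TC_QuickJitForLoops"] = "1",
            ["DOTNET_ReadyToRun"] = "0",
        };

    // Returns one message per variable that is unset or has an unexpected value.
    // The lookup is injected so the check can be exercised without touching
    // the real process environment.
    public static IReadOnlyList<string> FindMismatches(Func<string, string> getVariable)
    {
        var mismatches = new List<string>();
        foreach (var pair in Expected)
        {
            string actual = getVariable(pair.Key);
            if (!string.Equals(actual, pair.Value, StringComparison.Ordinal))
            {
                string shown = actual ?? "<unset>";
                mismatches.Add(pair.Key + ": expected '" + pair.Value + "', found '" + shown + "'");
            }
        }
        return mismatches;
    }
}
```

At startup, a service could call DynamicPgoSettings.FindMismatches(Environment.GetEnvironmentVariable) and log any messages it returns. The check only reports what the process sees. It cannot enable dynamic PGO after the runtime has started.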

Learnings

  • There were a few SocketsHttpHandler changes in .NET 6.0 that surfaced as issues in our service. We worked with the .NET team to identify workarounds and improvements.

    • New connection attempts that fail can impact multiple HTTP requests in .NET 6.0, whereas a failed connection attempt would only impact a single HTTP request in .NET 5.0.

      • Workaround: Setting a ConnectTimeout slightly lower than the HTTP request timeout ensures .NET 5.0 behavior is maintained. Alternatively, disposing the underlying handler on a failure also ensures only a single request is impacted by a connect timeout. This can be expensive depending on the size of the connection pool, so be sure to measure for your scenario.
    • Requests that fail due to RST packets are no longer automatically retried in .NET 6.0. This results in an elevated rate of “An existing connection was forcibly closed by the remote host” exceptions bubbling up to the application from HttpClient.

      • Workaround: The application can add retries on top of HttpClient for idempotent requests. Additionally, if the RST packets are due to idle timeouts, setting PooledConnectionIdleTimeout lower than the idle timeout of the server will help eliminate RST packets due to idle connections.
  • HttpContext.RequestAborted.IsCancellationRequested had inconsistent behavior on HTTP.sys compared to other servers and has been fixed in .NET 6.0.
  • Client side disconnects were noisy on HTTP.sys server and there was a race condition that was triggered while trying to set StatusCode on a disconnected request. Both have been fixed in .NET 6.0.
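
The two SocketsHttpHandler workarounds above can be sketched as follows. This is a minimal illustration under stated assumptions, not the gateway's implementation. The GatewayHttp class, its method names, and the one-second and five-second margins are hypothetical. The sketch also assumes that connection-level failures such as resets surface as HttpRequestException when the request is sent.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class GatewayHttp
{
    // Hypothetical safety margins; tune them for your own service.
    private static readonly TimeSpan ConnectMargin = TimeSpan.FromSeconds(1);
    private static readonly TimeSpan IdleMargin = TimeSpan.FromSeconds(5);

    public static SocketsHttpHandler CreateHandler(TimeSpan requestTimeout, TimeSpan serverIdleTimeout)
    {
        if (requestTimeout <= ConnectMargin)
            throw new ArgumentOutOfRangeException(nameof(requestTimeout));
        if (serverIdleTimeout <= IdleMargin)
            throw new ArgumentOutOfRangeException(nameof(serverIdleTimeout));

        return new SocketsHttpHandler
        {
            // Give up on a connection attempt slightly before the request times out.
            ConnectTimeout = requestTimeout - ConnectMargin,
            // Retire idle pooled connections before the server's idle timeout expires.
            PooledConnectionIdleTimeout = serverIdleTimeout - IdleMargin,
        };
    }

    public static bool IsIdempotent(HttpMethod method) =>
        method == HttpMethod.Get || method == HttpMethod.Head ||
        method == HttpMethod.Put || method == HttpMethod.Delete ||
        method == HttpMethod.Options || method == HttpMethod.Trace;

    public static async Task<HttpResponseMessage> SendWithRetryAsync(
        HttpClient client, Func<HttpRequestMessage> createRequest,
        int maxAttempts, CancellationToken cancellationToken = default)
    {
        for (int attempt = 1; ; attempt++)
        {
            // A request message cannot be sent twice, so build a fresh one per attempt.
            HttpRequestMessage request = createRequest();
            try
            {
                return await client.SendAsync(request, cancellationToken);
            }
            catch (HttpRequestException) when (attempt < maxAttempts && IsIdempotent(request.Method))
            {
                // Connection-level failure on an idempotent request: retry.
                request.Dispose();
            }
        }
    }
}
```

A caller would create the client with something like new HttpClient(GatewayHttp.CreateHandler(TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(120))) and set HttpClient.Timeout to the same 30 seconds. Request timeouts surface as TaskCanceledException, so this sketch does not retry them, and non-idempotent methods such as POST are never retried.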

Summary

Every new release of .NET has tremendous performance improvements and there is a huge upside to migrating to the latest version of .NET. For Azure AD gateway, we look forward to trying out newer APIs specific to .NET 6.0 for even bigger wins and further enhancements in .NET 7.0.

8 comments

  • Sean Decker

    Can you point to documentation on HKLM:SYSTEM\CurrentControlSet\Services\W3SVC\Performance\ReceiveRequestPending? We’ve done a lot of work trying to tweak IIS performance configuration and have never come across this setting before. We’ve struggled to find some mysterious queue that blocks some requests under high load. In fact, just now, searching on both Bing and Google, the only result for that key is this blog post.

    • Avanindra Paruchuri (Microsoft employee)

      Sean, I don’t think there is public documentation for this key; it is an undocumented IIS registry key. For reference, Azure App Service and certain other Microsoft services set this key to 1000. You will want to experiment with different values for your scenario.

  • Günther Foidl

    Thanks for sharing these insights. Really impressive, and very interesting to read about the rationale for why certain technologies were chosen.

  • John King

    How many Azure services are running on .NET? How many Azure services have switched to Java/Go/Rust? It seems Azure AD is the only service running on .NET 5/6, and tons of services are running on .NET Framework, and things like Spring Cloud and Spring Boot are running on Java, and Kubernetes and Docker services are running on Go.

    • Fabien Geraud

      Tons of services on Azure also use Python, for machine learning or notebooks.

      Spring Cloud and Spring Boot are running on Java

      Well, yes. Python runs on Python, Java runs on Java, Go runs on Go… I mean, an Azure service is more than just the front part. You have to manage a lot of physical resources, and create a lot of networks and administer them… A lot of technologies are used. For instance, they also use Orleans. I don’t know how or for which services. But the point is that Orleans can run Python scripts. So your service could be Python and still use .NET behind the scenes.

      Also, YARP (a reverse proxy on .NET) was made, as I understand it, for the Azure teams. I don’t know whether the reverse proxy for AD uses YARP; since it’s recent, I don’t think so. But this shows new usage of .NET at Microsoft.
      You should also watch this part 33:51 – 35:46 : https://youtu.be/oPyTZ-HGdn4?t=2031
      And also the beginning 12:00 – 15:06 : https://youtu.be/oPyTZ-HGdn4?t=720

      Bing also uses .NET; it used to be possible to see the build in use at https://www.bing.com/version. Last year they were using .NET 5 and used to follow the previews.

  • Olga Klimova

    Interesting. Why is the pattern of CPU utilization in the first image so stable? User logins cannot be that simultaneous in time.

    • Wf F

      It looks like an hourly or X-minute aggregate, probably either an average or a percentile.
      In that case, looking smooth/stable shouldn’t come as a surprise.