Azure Active Directory’s gateway service is a reverse proxy that fronts hundreds of services that make up Azure Active Directory (Azure AD). If you’ve used services such as office.com, outlook.com, portal.azure.com or xbox.live.com, then you’ve used Azure AD’s gateway. The gateway provides features such as TLS termination, automatic failovers/retries, geo-proximity routing, throttling, and tarpitting to services in Azure AD. The gateway is present in 54 Azure datacenters worldwide and serves ~185 billion requests each day. Up until recently, Azure AD’s gateway was running on .NET 5.0. As of September 2021, it’s running on .NET 6.0.
Efficiency gains by moving to .NET 6.0
The image below shows that application CPU utilization dropped by 33% for the same traffic volume after moving to .NET 6.0 on our production fleet.
That 33% reduction meant our application efficiency went up by 50%. Application efficiency is one of the key metrics we use to measure performance and is defined as:
Application efficiency = (Requests per second) / (CPU utilization of application)
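For example, at a fixed request rate, dropping CPU utilization to 67% of its previous level yields 1 / 0.67 ≈ 1.5 times the previous application efficiency, which is the ~50% gain noted above.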
Changes made in .NET 6.0 upgrade
Along with the .NET 6.0 upgrade, we made two major changes:
- Migrated from IIS to HTTP.sys server. This was made possible by new features in .NET 6.0.
- Enabled dynamic PGO (profile-guided optimization). This is a new feature of .NET 6.0.
The following sections will describe each of those changes in more detail.
Migrating from IIS to HTTP.sys server
There are 3 server options to pick from in ASP.NET Core:
- Kestrel
- HTTP.sys server
- IIS
A previous blog post describes why Azure AD gateway chose IIS as the server to run on during our .NET Framework 4.6.2 to .NET Core 3.1 migration. During the .NET 6.0 upgrade, we migrated from IIS to HTTP.sys server. Kestrel was not chosen due to the lack of certain TLS features our service depends on (support is expected by June 2022 in Windows Server 2022).
By migrating from IIS to HTTP.sys server, Azure AD gateway saw the following benefits:
- A 27% increase in application efficiency.
- Deterministic queuing model: HTTP.sys server runs on a single-queue system, whereas IIS has an internal queue on top of the HTTP.sys queue. The double-queue system in IIS results in unique performance problems (especially in high-concurrency situations, although issues in IIS can potentially be offset by tweaking Windows registry keys such as HKLM:SYSTEM\CurrentControlSet\Services\W3SVC\Performance\ReceiveRequestsPending). By removing IIS and moving to a single-queue system on HTTP.sys, queuing issues that arose due to rate mismatches in the double-queue system disappeared as we moved to a deterministic model.
- Improved deployment and autoscale experience: The move away from IIS simplifies deployment since we no longer need to install/configure IIS and ANCM before starting the website. Additionally, TLS configuration is easier and more resilient as it needs to be specified at just one layer (HTTP.sys) instead of two as it had been with IIS.
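For reference, here is a minimal sketch of how an ASP.NET Core app opts into HTTP.sys server hosting; the URL prefix and request queue limit below are illustrative placeholders, not our gateway’s actual configuration, and the HttpSysOptions callback is also where the settings shown later in this post are applied.
using Microsoft.AspNetCore.Server.HttpSys;

var builder = WebApplication.CreateBuilder(args);

// Host on HTTP.sys instead of IIS/Kestrel (Windows only).
builder.WebHost.UseHttpSys(options =>
{
    // Illustrative values; tune for your own service.
    options.RequestQueueLimit = 1000;
    options.UrlPrefixes.Add("https://+:443/");
});

var app = builder.Build();
app.MapGet("/", () => "Hello from HTTP.sys server");
app.Run();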
The following showcases some of the changes that were made while moving from IIS to HTTP.sys server:
- TLS renegotiation: Renegotiation provides the ability to do optional client certificate negotiation based on HTTP constructs such as request path.
Example: On IIS, during the initial TLS handshake with the client, the server can be configured to not request a client certificate. However, if the path of the request contains, say “foo”, IIS triggers a TLS renegotiation and requests a client certificate.
The following web.config configuration is how path-based TLS renegotiation is enabled on IIS:
<location path="foo">
  <system.webServer>
    <security>
      <access sslFlags="Ssl, SslNegotiateCert, SslRequireCert"/>
    </security>
  </system.webServer>
</location>
In HTTP.sys server hosting (.NET 6.0 and up), the above configuration is expressed in code by calling GetClientCertificateAsync() as below.
// Default renegotiate timeout in HTTP.sys is 120 seconds.
const int RenegotiateTimeOutInMilliseconds = 120000;

X509Certificate2 cert = null;
if (httpContext.Request.Path.StartsWithSegments("foo"))
{
    // Only renegotiate if the initial handshake did not already provide a client certificate.
    if (httpContext.Connection.ClientCertificate == null)
    {
        using (var ct = new CancellationTokenSource(RenegotiateTimeOutInMilliseconds))
        {
            cert = await httpContext.Connection.GetClientCertificateAsync(ct.Token);
        }
    }
}
In order for GetClientCertificateAsync() to trigger a renegotiation, the following setting must be set in HttpSysOptions:
options.ClientCertificateMethod = ClientCertificateMethod.AllowRenegotation;
- Mapping IIS Server variables: On IIS, TLS information such as CRYPT_PROTOCOL, CRYPT_CIPHER_ALG_ID, CRYPT_KEYEXCHANGE_ALG_ID and CRYPT_HASH_ALG_ID is obtained from IIS server variables and can be leveraged as shown here. On HTTP.sys server, the equivalent information is exposed via ITlsHandshakeFeature’s Protocol, CipherAlgorithm, KeyExchangeAlgorithm and HashAlgorithm respectively, as sketched below.
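For illustration, here is a minimal middleware sketch (assuming an app built as in the earlier hosting sketch) that reads those values:
using Microsoft.AspNetCore.Connections.Features;

app.Use(async (httpContext, next) =>
{
    var tlsFeature = httpContext.Features.Get<ITlsHandshakeFeature>();
    if (tlsFeature != null)
    {
        var protocol = tlsFeature.Protocol;                // equivalent of CRYPT_PROTOCOL
        var cipher = tlsFeature.CipherAlgorithm;           // equivalent of CRYPT_CIPHER_ALG_ID
        var keyExchange = tlsFeature.KeyExchangeAlgorithm; // equivalent of CRYPT_KEYEXCHANGE_ALG_ID
        var hash = tlsFeature.HashAlgorithm;               // equivalent of CRYPT_HASH_ALG_ID
    }
    await next();
});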
- Ability to interpret non-ASCII headers: The gateway receives millions of headers each day with non-ASCII characters in them, and the ability to interpret these headers is important. Kestrel and IIS already have this ability, and in .NET 6.0, Latin1 request header encoding was added for HTTP.sys server as well. It can be enabled using HttpSysOptions as shown below.
options.UseLatin1RequestHeaders = true;
- Observability: In addition to .NET telemetry, the health of a service can be monitored by plugging into the wealth of telemetry exposed by HTTP.sys, such as:
  - Http Service Request Queues\ArrivalRate
  - Http Service Request Queues\RejectedRequests
  - Http Service Request Queues\CurrentQueueSize
  - Http Service Request Queues\MaxQueueItemAge
  - Http Service Url Groups\ConnectionAttempts
  - Http Service Url Groups\CurrentConnections
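As a rough sketch (not our actual monitoring pipeline), these are standard Windows performance counters and can be sampled with the System.Diagnostics.PerformanceCounter package; the instance name below is a placeholder.
using System;
using System.Diagnostics;

// Sample one of the HTTP.sys request queue counters listed above (Windows only).
// "MyRequestQueue" is a placeholder; real instance names can be enumerated via
// new PerformanceCounterCategory("Http Service Request Queues").GetInstanceNames().
var queueSize = new PerformanceCounter(
    "Http Service Request Queues", "CurrentQueueSize", "MyRequestQueue", readOnly: true);

Console.WriteLine($"HTTP.sys current queue size: {queueSize.NextValue()}");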
Enabling Dynamic PGO (profile-guided optimization)
Dynamic PGO is one of the most exciting features of .NET 6.0! PGO can benefit .NET 6.0 applications by maximizing steady-state performance.
Dynamic PGO is an opt-in feature in .NET 6.0. There are 3 environment variables you need to set to enable dynamic PGO:
- set DOTNET_TieredPGO=1. This setting leverages the initial Tier0 compilation of methods to observe method behavior. When methods are rejitted at Tier1, the information gathered from the Tier0 executions is used to optimize the Tier1 code. Enabling this switch increased our application efficiency by 8.18% compared to plain .NET 6.0.
- set DOTNET_TC_QuickJitForLoops=1. This setting enables tiering for methods that contain loops. Enabling this switch (in conjunction with the above switch) increased our application efficiency by 10.2% compared to plain .NET 6.0.
- set DOTNET_ReadyToRun=0. The core libraries that ship with .NET come with ReadyToRun enabled by default. ReadyToRun allows for faster startup because there is less to JIT compile, but this also means code in ReadyToRun images doesn’t go through the Tier0 profiling process which enables dynamic PGO. By disabling ReadyToRun, the .NET libraries also participate in the dynamic PGO process. Setting this switch (in conjunction with the two above) increased our application efficiency by 13.23% compared to plain .NET 6.0.
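Since these are plain process environment variables, one simple sanity check (a hypothetical example, not something the gateway requires) is to log them at startup and confirm what a deployment actually picked up:
using System;

// Log the dynamic PGO related switches at startup; names match the settings above.
foreach (var name in new[] { "DOTNET_TieredPGO", "DOTNET_TC_QuickJitForLoops", "DOTNET_ReadyToRun" })
{
    Console.WriteLine($"{name} = {Environment.GetEnvironmentVariable(name) ?? "(not set)"}");
}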
Learnings
- There were a few SocketsHttpHandler changes in .NET 6.0 that surfaced as issues in our service. We worked with the .NET team to identify workarounds and improvements.
  - New connection attempts that fail can impact multiple HTTP requests in .NET 6.0, whereas a failed connection attempt would only impact a single HTTP request in .NET 5.0.
    - Workaround: Setting a ConnectTimeout slightly lower than the HTTP request timeout ensures the .NET 5.0 behavior is maintained (see the sketch after this list). Alternatively, disposing the underlying handler on a failure also ensures only a single request is impacted by a connect timeout (although this can be expensive depending on the size of the connection pool, so be sure to measure for your scenario).
  - Requests that fail due to RST packets are no longer automatically retried in .NET 6.0, which results in an elevated rate of “An existing connection was forcibly closed by the remote host” exceptions bubbling up to the application from HttpClient.
    - Workaround: The application can add retries on top of HttpClient for idempotent requests. Additionally, if the RST packets are due to idle timeouts, setting PooledConnectionIdleTimeout lower than the server’s idle timeout helps eliminate RSTs on idle connections (see the sketch after this list).
- HttpContext.RequestAborted.IsCancellationRequested had inconsistent behavior on HTTP.sys server compared to other servers; this has been fixed in .NET 6.0.
- Client-side disconnects were noisy on HTTP.sys server, and there was a race condition triggered while trying to set StatusCode on a disconnected request. Both have been fixed in .NET 6.0.
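The two SocketsHttpHandler workarounds above can be sketched roughly as follows; the timeout values are illustrative assumptions, not our gateway’s settings.
using System;
using System.Net.Http;

var handler = new SocketsHttpHandler
{
    // Keep connection establishment well below the overall request timeout so a failed
    // connect attempt surfaces quickly instead of consuming the whole request timeout.
    ConnectTimeout = TimeSpan.FromSeconds(5),

    // Retire pooled connections before the server's idle timeout (assumed 60s here)
    // to avoid RSTs on connections the server has already closed.
    PooledConnectionIdleTimeout = TimeSpan.FromSeconds(50)
};

var client = new HttpClient(handler)
{
    // Overall per-request timeout, kept above ConnectTimeout.
    Timeout = TimeSpan.FromSeconds(10)
};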
Summary
Every new release of .NET brings tremendous performance improvements, and there is a huge upside to migrating to the latest version of .NET. For Azure AD gateway, we look forward to trying out newer APIs specific to .NET 6.0 for even bigger wins, and to further enhancements in .NET 7.0.
How is this being run on the backend? Do you have Windows VMs? Windows VMSS? Windows containers?
Interesting. Why is the CPU utilization pattern in the first image so stable? User logins can’t be that evenly distributed over time.
It looks like an hourly or X-minute aggregate, probably either an average or a percentile.
In that case, looking smooth/stable shouldn’t come as a surprise.
How many Azure services run on .NET? How many Azure services switched to Java/Go/Rust? It seems Azure AD is the only service running on .NET 5/6, and tons of services are still running on .NET Framework; things like Spring Cloud and Spring Boot run on Java, and Kubernetes and Docker services run on Go.
Tons of services also use Python on Azure for machine learning or notebooks.
spring cloud spring boot are running on java
Well, yes. Python runs on Python, Java runs on Java, Go runs on Go... I mean, one Azure service is more than just the front end. You have to manage a lot of physical resources, create a lot of networks and administer them... A lot of technologies are used. For instance, they also use Orleans. I...
Thanks for sharing these insights. Really impressive, and very interesting to read about the rationale for why certain technologies were chosen.
Can you point to documentation on HKLM:SYSTEM\CurrentControlSet\Services\W3SVC\Performance\ReceiveRequestPending? We’ve done a lot of work trying to tweak IIS performance configuration and have never come across this setting before. We’ve struggled to find a mysterious queue that blocks some requests under high load. In fact, just now in searching on both Bing and Google, the only result for that key is this blog post.
Sean, I don’t think there is public documentation for this key; it is an undocumented IIS registry key. For reference, Azure App Service and certain other Microsoft services set this key to 1000. You will want to experiment with different values for your scenario.