Building and Testing Resilience in .NET Azure Functions

Maitreyi Nair

A Hypothesis-Driven Approach to Building and Testing Resilience in .NET Azure Functions

Cloud-native architectures in Azure often bring together many services and dependencies – applications can read from and write to data stores, improve performance via external caches, and rely on message and event services to process data, among many other potential configurations of cloud components. While we’d hope that every component of our tech stack will work perfectly over the lifetime of our product, this ignores the realities of software development, particularly in the realm of cloud computing: network connections flicker, infrastructure experiences a temporary blip, or load on shared public cloud resources leads to throttled requests. Even in the face of these challenges, we, as developers, still have a responsibility to our end users to make our application as available and reliable as we can.

It might be impossible to foresee every problem that will arise, but we can follow the spirit of scientific inquiry to develop and test hypotheses about how our system might behave when different components fail. By identifying potential points of failure in our system early on, we can take steps to mitigate problems that might arise and make our solutions more resilient overall.

In this post, we’ll present an example of this journey through code snippets from an Azure Functions-based data processing pipeline created as an artifact from our team’s work with customers. The Functions relied heavily on Azure Digital Twins (ADT) and Azure Blob Storage as part of their workflows, and encountering transient errors while interfacing with these dependencies, combined with our team’s prioritization of engineering fundamentals, inspired a deeper dive into strategies to build and test resilience in our .NET code. By adopting this resilience-first mindset in our project, we were able to more carefully consider how error handling behavior should vary across different failure scenarios and exception types. We also needed to consider how to design a thorough defense-in-depth strategy that integrates resilience options, both those built into the Azure SDKs and those implemented via external libraries, and, ultimately, how to enable our customer to more effectively and reliably support and extend this application in production.

You’ll be able to follow along as we go over some of the main concepts of resilience engineering and how they can be implemented. In particular, we’ll discuss ways in which you can leverage the well-known Polly library in an Azure Function to implement patterns for resilience and transient fault-handling. We will also discuss using these in conjunction with resilience options offered in some of the native Azure SDKs. We’ll conclude by reviewing the use of Polly’s companion library, Simmy, to perform hypothesis-driven chaos testing against any external dependencies used in an application.

Implementing resilience with Polly

Polly is a .NET library for resilience and transient fault-handling. It implements several design patterns for this – including Retry, Circuit-breaker, Timeout, and Rate-limiting – through the use of composable policies. These policies are highly configurable based on your particular business and error-handling requirements. You can also combine multiple policies for a multi-layered approach to resilience. You can check out this page in their documentation for a more detailed discussion of when and why you should use each policy in a comprehensive resilience engineering strategy.
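
To make that concrete, here is a minimal sketch (not taken from our sample) of how two Polly policies might be defined and composed using the v7 fluent API; the exception type and thresholds are illustrative placeholders:

using System;
using System.Net.Http;
using Polly;

// Retry up to 3 times with exponential backoff on transient HTTP failures.
IAsyncPolicy retryPolicy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

// Stop calling the dependency for 30 seconds after 5 consecutive failures.
IAsyncPolicy circuitBreakerPolicy = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));

// Compose the two: retry (outer) wraps the circuit breaker (inner), so every
// individual attempt counts toward the breaker's failure threshold.
IAsyncPolicy resiliencePolicy = Policy.WrapAsync(retryPolicy, circuitBreakerPolicy);

// Usage: await resiliencePolicy.ExecuteAsync(() => httpClient.GetAsync(requestUri));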

Built-in retry policies for Azure SDKs

Note that the services that your applications consume may have some resilience engineering built in by default. For instance, many Azure SDKs and services include configurable retry mechanisms, though it’s also worth checking the individual product documentation for each service (e.g. Azure Digital Twins (ADT)) in case this behavior isn’t covered in the general retry guidance. These retry policies can be tuned, via configuration variables, to ensure that the behavior is best suited to the unique requirements and best practices for each service.
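
For example, Azure SDK clients built on Azure.Core expose these settings through a client options object. The snippet below is a hedged sketch (not from our sample) of how the Digital Twins client’s built-in retries might be tuned; the endpoint and values are placeholders:

using System;
using Azure.Core;
using Azure.DigitalTwins.Core;
using Azure.Identity;

var options = new DigitalTwinsClientOptions();

// Tune the SDK's built-in retry pipeline rather than re-implementing retries ourselves.
options.Retry.Mode = RetryMode.Exponential;        // exponential backoff between attempts
options.Retry.MaxRetries = 5;                      // give up after 5 retries
options.Retry.Delay = TimeSpan.FromSeconds(1);     // initial backoff delay
options.Retry.MaxDelay = TimeSpan.FromSeconds(30); // cap on the backoff delay

var client = new DigitalTwinsClient(
    new Uri("https://<your-adt-instance>.api.<region>.digitaltwins.azure.net"),
    new DefaultAzureCredential(),
    options);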

This means that when the built-in Azure SDK client retry strategy is appropriate for your application scenario, you probably won’t need to implement retries using Polly. Polly retries are still useful when you’re calling another SDK or API that doesn’t offer that guarantee, or when you need to layer additional retry behavior across multiple parts of the application.

As with most design decisions, you should think about this carefully based on your business requirements, the services you’re using, your particular cloud architecture, and the expected usage patterns of your system. The Azure Architecture Center has a great article on transient fault handling that dives a little deeper on these considerations, common pitfalls, and best practices – we highly encourage you to check it out for more guidance on this topic!

Using the Circuit Breaker pattern

Retries are a very useful tool in handling transient faults. However, in cases where a dependency is seriously struggling, and it’s evident that the issue is likely not just a minor blip, a large number of high-frequency requests or an over-enthusiastic retry policy (or both) can do more harm than good, potentially causing your users to wait even longer to get what they need from your application. The Circuit Breaker pattern, used in conjunction with a well-crafted retry policy, is well-suited for these scenarios to regulate the flow of traffic and avoid overwhelming downstream servers with excessive retries.

In the section below, we’ll walk through setting up a circuit breaker policy using the Polly implementation of this pattern. Circuit Breaker is the only Polly policy implemented and discussed in this sample, but you can read more about the other options for implementing resilience and transient fault-handling in their wiki. The patterns presented for registering policies via dependency injection and parameterization to enable easy configuration are applicable across these policy options.

Policy registration via dependency injection

For the rest of this blog post, we’ll assume some basic familiarity with .NET Azure Functions. However, if you’re less familiar with this topic, don’t worry – the Azure documentation is a good place to get basic knowledge and get started developing .NET Azure Functions. Dependency injection in Azure Functions is key to building loosely-coupled, easily configurable applications that follow the principle of inversion of control (IoC). We can inject a Polly PolicyRegistry centrally on Function startup so that later, when we want to apply one of its policies elsewhere in the project, it can be extracted from this registry; this pattern promotes the separation of policy definition and usage.
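
For readers newer to this pattern, a bare-bones Startup class for the in-process Functions model looks roughly like the sketch below. The hard-coded values are purely illustrative (our sample reads them from configuration, as shown later in this post), and the AddPollyPolicies extension method is the one we define next:

using Microsoft.Azure.Functions.Extensions.DependencyInjection;
using Microsoft.Extensions.DependencyInjection;

[assembly: FunctionsStartup(typeof(MyFunctionApp.Startup))]

namespace MyFunctionApp
{
    public class Startup : FunctionsStartup
    {
        public override void Configure(IFunctionsHostBuilder builder)
        {
            // Anything registered here (including a Polly PolicyRegistry) can be
            // constructor-injected into Function classes and their dependencies.
            builder.Services.AddPollyPolicies(
                circuitBreakerAllowedExceptionCount: 5,
                circuitBreakerWaitTimeSeconds: 30);
        }
    }
}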

In our sample, we define the methods for injecting policies in PolicyExtensions.cs:

public static class PolicyExtensions
{
    public const string AdtPolicyName = "adtPolicy";

    /// <summary>
    /// Adds Polly Policies to service collection
    /// </summary>
    /// <param name="services">Service collection to add Polly policies to</param>
    /// <param name="circuitBreakerAllowedExceptionCount">Allowed exceptions in timeframe</param>
    /// <param name="circuitBreakerWaitTimeSeconds">Duration of Break</param>
    /// <returns>void</returns>
    public static void AddPollyPolicies(this IServiceCollection services, int circuitBreakerAllowedExceptionCount, int circuitBreakerWaitTimeSeconds)
    {
        if (services is null)
        {
            throw new ArgumentNullException(nameof(services));
        }

        services
            .AddPolicyRegistry()
            .AddAdtPolicies(circuitBreakerAllowedExceptionCount, circuitBreakerWaitTimeSeconds);
    }

    /// <summary>
    /// Adds a Circuit Breaker Policy for ADT requests to the policy registry.
    /// </summary>
    /// <param name="policyRegistry">Policy Registry to add the Circuit Breaker Policy to</param>
    /// <param name="circuitBreakerAllowedExceptionCount">Number of consecutive exceptions allowed before the circuit opens</param>
    /// <param name="circuitBreakerWaitTimeSeconds">Duration of the break, in seconds</param>
    /// <returns>The policy registry, to allow call chaining</returns>
    private static IPolicyRegistry<string> AddAdtPolicies(this IPolicyRegistry<string> policyRegistry, int circuitBreakerAllowedExceptionCount, int circuitBreakerWaitTimeSeconds)
    {
        if (policyRegistry is null)
        {
            throw new ArgumentNullException(nameof(policyRegistry));
        }

        // Open circuit based on specified exception types being thrown
        var adtPolicy = GetCircuitBreakerPolicy(circuitBreakerAllowedExceptionCount, circuitBreakerWaitTimeSeconds);

        policyRegistry
            .Add(AdtPolicyName, adtPolicy);

        return policyRegistry;
    }

    private static IAsyncPolicy GetCircuitBreakerPolicy(int allowedExceptionCount, int waitTimeSeconds)
    {
        // Open circuit based on potentially transient HTTP error exceptions being thrown
        var policy = Policy
            .Handle<RequestFailedException>(ex =>
                ex.Status >= 500 ||
                ex.Status == (int)HttpStatusCode.RequestTimeout ||
                ex.Status == (int)HttpStatusCode.TooManyRequests);

        var circuitBreakerPolicy = policy.CircuitBreakerAsync(allowedExceptionCount, TimeSpan.FromSeconds(waitTimeSeconds));

        return circuitBreakerPolicy;
    }
}

There are a few things to call out here:

  • The methods generating policies are grouped by the service they are targeting (here, we showed the ADT-related policies). This was an intentional design choice, made not only as a logical grouping for readability, but also to make it easier to centralize all policy-related setup in this class, so that the consuming SDK clients will only need to extract a single policy. We will see later in this document how this can be extended with PolicyWrap to flexibly combine multiple policies into one custom policy when performing chaos testing with Simmy.
  • Note that our Circuit Breaker policy explicitly handles only RequestFailedExceptions, which is the exception type thrown by the ADT .NET SDK on service errors. We inspect the status code of the exception further so that the policy only reacts to errors that potentially indicate a transient HTTP failure (e.g. a >= 500 status code for an Internal Server Error, request timeouts, or rate limits being exceeded). More examples of pattern matching in policy construction to enable the configuration best suited to your use case can be found in the Polly Circuit Breaker quickstart.
  • We pass the allowed exception count and cool-off period of the circuit breaker through to these methods as variables. This parameterization allows us to configure these via the external app settings for easy tuning.

We can call this extension method in our main Startup.cs class of the Functions project:

var config = this.GetConfiguration(builder);

// Add Polly
var settings = new StreamingDataFlowSettings(config);

builder.Services.AddPollyPolicies(settings.CircuitBreakerAllowedExceptionCount, settings.CircuitBreakerWaitTimeSec);

Here, StreamingDataFlowSettings is a POCO class meant to map to the local or deployed settings for the function app, through which we can set the configuration for all components of the application.
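
As a hedged sketch (the actual class in the sample has more properties, and the defaults here are purely illustrative), such a settings class might look like this:

using Microsoft.Extensions.Configuration;

public class StreamingDataFlowSettings
{
    public StreamingDataFlowSettings(IConfiguration config)
    {
        // Bind the circuit breaker tuning knobs from app settings, falling back to defaults.
        this.CircuitBreakerAllowedExceptionCount = config.GetValue("CircuitBreakerAllowedExceptionCount", 5);
        this.CircuitBreakerWaitTimeSec = config.GetValue("CircuitBreakerWaitTimeSec", 30);
    }

    public int CircuitBreakerAllowedExceptionCount { get; }

    public int CircuitBreakerWaitTimeSec { get; }
}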

Implementation in SDK clients

As we discussed above, to use the policies registered in our startup class, we can extract what we need to use by name in the constructor of our consuming client classes (here, in AdtClient.cs):

private readonly IAsyncPolicy policy;

...

  // In the AdtClient() constructor
  this.policy = policyRegistry.Get<IAsyncPolicy>(PolicyExtensions.AdtPolicyName);

We can then use this policy to execute the client requests provided by the ADT .NET SDK. As an example:

public async Task UpdateTwinAsync<T>(string twinId, T twin)
{
    try
    {
        // Execute the SDK call through the circuit breaker policy pulled from the registry
        await this.policy.ExecuteAsync(() =>
            this.client.UpdateDigitalTwinAsync<T>(twinId, twin));
    }
    catch (RequestFailedException e)
    {
        this.logger.FailedToCreateTwin(...);
    }
}

All requests wrapped in this way will be subject to the circuit breaker policy we defined in the PolicyExtensions class, just like we wanted.
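
One behavior worth calling out: once the circuit opens, Polly short-circuits subsequent executions by throwing a BrokenCircuitException (from the Polly.CircuitBreaker namespace) instead of invoking the SDK at all. Depending on your requirements, you may want to handle that case separately from the underlying service errors. Here’s a hedged sketch of how the method above could be extended; we assume a standard ILogger here, whereas the sample uses its own logging extensions:

public async Task UpdateTwinAsync<T>(string twinId, T twin)
{
    try
    {
        await this.policy.ExecuteAsync(() =>
            this.client.UpdateDigitalTwinAsync<T>(twinId, twin));
    }
    catch (BrokenCircuitException)
    {
        // The circuit is open: fail fast without calling ADT, and let any upstream
        // retry or dead-lettering mechanism take over.
        this.logger.LogWarning("ADT circuit is open; skipping update for twin {TwinId}", twinId);
        throw;
    }
    catch (RequestFailedException e)
    {
        // A service error surfaced by the SDK (and counted toward the circuit breaker threshold).
        this.logger.LogError(e, "Failed to update twin {TwinId}", twinId);
    }
}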

Testing resilience with Simmy

So far in this post, we’ve seen how to implement and configure Polly policies to improve resilience in our .NET Azure Functions applications. This is great! A well-considered resilience engineering strategy can give us tremendous peace of mind to know that transient errors are being handled gracefully. However, it’s also best to ensure that this fault handling is correctly implemented and configured; while transient dependency failures are an inevitable part of software, it isn’t easy to replicate the conditions that lead to them, like network failures or host throttling, in a repeatable and systematic way.

When using Polly, Simmy is a natural next step to achieve these goals. Simmy is a library for performing chaos engineering through configurable fault injection in a policy-centric way. It is built directly on top of, and integrates with, Polly. As a bonus, this also means that we can reuse much of our existing code structure for policy registration and request execution.

Building a hypothesis

Resilience engineering and structuring chaos tests (really, most testing in software development) first require you to understand the expected behavior of your application at a steady state. From there, you need to understand where and how it can fail, and then craft your hypothesis about what will happen in each case, following an “if-then” format (i.e. if X occurs at a rate of Y%, then Z will happen).

As a simple example, let’s look at the case of our application attempting to update twins in Azure Digital Twins. We know that if ADT encounters an error on receiving an update request, it will return a RequestFailedException, which will be logged by our application. We also know that we have configured our circuit breaker policy to specifically handle errors with status code 408 (Request Timeout), 429 (Too Many Requests), and anything >= 500 (server errors), so the circuit breaker will be triggered if we exceed a certain number (allowedExceptionCount) of consecutive errors matching these criteria – for the sake of this example, let’s say that allowedExceptionCount is 5. We can form two hypotheses based on this information:

  • If the update request to ADT returns a RequestFailedException with status code 500 randomly 5% of the time (simulating some kind of temporary blip), then we expect to see these exceptions occur occasionally and be logged, but it would be rare to see the circuit breaker trigger, since exceptions need to be consecutive for the circuit to open.
  • If the update request to ADT returns a RequestFailedException with status code 500, 100% of the time (simulating an extended issue or outage), then we expect to see the exception logged 5 times, after which the circuit breaker will trip and block further requests made via the .NET SDK.

From here, provided that you have a configurable framework for performing these tests, it’s easy enough to vary the parameters of the faults you are injecting and run your series of experiments based on this list. We will see how this can be implemented in the following section.

Putting into practice

Remember, from what we’ve implemented so far, we already have a framework for injecting a PolicyRegistry into our application and using its policies to execute client SDK requests. Per our established pattern, simulating the exception cases detailed in the hypotheses above requires only a minimal extension: the new chaos policies themselves, plus a few new configuration variables:

private static IPolicyRegistry<string> AddAdtPolicies(
  this IPolicyRegistry<string> policyRegistry,
  bool usingCircuitBreaker,
  double simmyInjectionRate,
  string chaosDependencyTestingKey,
  int circuitBreakerAllowedExceptionCount,
  int circuitBreakerWaitTimeSeconds)
{
    if (policyRegistry is null)
    {
        throw new ArgumentNullException(nameof(policyRegistry));
    }

    // By default, use a no-op policy
    IAsyncPolicy adtPolicy = Policy.NoOpAsync();
    List<IAsyncPolicy> allPolicies = new List<IAsyncPolicy> { adtPolicy };

    // Add additional policies per config
    if (usingCircuitBreaker)
    {
        var circuitBreakerPolicy = GetCircuitBreakerPolicy(circuitBreakerAllowedExceptionCount, circuitBreakerWaitTimeSeconds);
        allPolicies.Add(circuitBreakerPolicy);
    }

    // Note that here, "Adt" is used as a key to toggle chaos testing for the ADT SDK client
    // This can be extended to other dependency types to isolate chaos testing for each
    // dependency via configurable app settings.
    if (string.Equals(chaosDependencyTestingKey, "Adt"))
    {
        var adtFaultPolicy = GetRequestFailedExceptionFaultPolicy(simmyInjectionRate, "Adt");
        allPolicies.Add(adtFaultPolicy);
    }

    // If we ended up adding more policies, combine them - otherwise, just return no-op
    if (allPolicies.Count > 1)
    {
        adtPolicy = Policy.WrapAsync(allPolicies.ToArray());
    }

    policyRegistry
        .Add(AdtPolicyName, adtPolicy);

    return policyRegistry;
}

private static IAsyncPolicy GetRequestFailedExceptionFaultPolicy(double injectionRate, string serviceName)
{
    // Causes the policy to throw a RequestFailedException with probability {injectionRate} (a value between 0 and 1) when enabled
    var fault = new RequestFailedException(500, $"Simmy: {serviceName} Internal Server Error");

    var chaosExceptionPolicy = MonkeyPolicy.InjectExceptionAsync(with =>
      with.Fault(fault)
        .InjectionRate(injectionRate)
        .Enabled());

    return chaosExceptionPolicy;
}

The main takeaways from this sample:

  • We’ve added parameters (that can be stored in the function app settings) for the injection rate of the Simmy exception policies as well as a key to toggle chaos testing for this dependency (in this case, "Adt") – if this key is unset, no chaos-related policies are injected. This is crucial for containing the blast radius of the chaos testing; it’s important that testing of this nature has minimal impact on your end users in a production environment. This approach can also be extended if you wish to run chaos testing on multiple dependencies – by varying the key, tests for each dependency can be isolated, leading to cleaner tests.
  • This version also includes a toggle for the circuit breaker policy, mostly to illustrate the use of no-op policies but also for maintaining expected functionality in cases where we would want to use the same ADT SDK client but with different behavior for error handling and resilience. No-op policies conform to the same expected format of a policy, but allow you to execute the underlying code without intervention, which is useful for maintaining data contracts across a project. They can also be useful in unit tests where you want to create stubs for policy behavior without affecting functionality.
  • Multiple policies can be wrapped into one with the Policy.WrapAsync() method (or Policy.Wrap() for synchronous policies). This is used here to combine Polly and Simmy policies, and it can equally combine multiple Polly resilience strategies for a defense-in-depth approach.

Now that we’ve implemented this chaos testing framework in our .NET code, we can vary the configuration of the Simmy fault injection to match each of the hypotheses to test. Once that’s set, we can run our application normally with some simulated input (generated manually, via simple console app input generators, or with automated load testing tools) and observe the resulting behavior via the application logs or monitoring tools. Note that Simmy exceptions are injected in place of the SDK call wrapped by the policy, which means they represent errors thrown by the SDK after its internal retry policies have been exhausted. If we could instead inject faults at a layer where the SDK’s own retries could still act on them, we would expect to see fewer exceptions logged.
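
If you’d like to validate the policy wiring before running a full end-to-end experiment, the same policies can also be exercised in isolation. Below is a hedged, xUnit-style sketch (not part of the sample) that checks hypothesis 2 against an in-memory policy wrap: with a 100% injection rate, the breaker should open after the allowed number of consecutive faults and then short-circuit further calls. Hypothesis 1, with its 5% injection rate, is probabilistic by design and is better observed through the end-to-end run described above.

using System;
using System.Threading.Tasks;
using Azure;
using Polly;
using Polly.CircuitBreaker;
using Polly.Contrib.Simmy;
using Polly.Contrib.Simmy.Outcomes;
using Xunit;

public class CircuitBreakerChaosTests
{
    [Fact]
    public async Task CircuitOpensAfterConsecutiveInjectedFaults()
    {
        const int allowedExceptionCount = 5;

        var circuitBreaker = Policy
            .Handle<RequestFailedException>()
            .CircuitBreakerAsync(allowedExceptionCount, TimeSpan.FromSeconds(30));

        // Inject a 500 on every call, simulating an extended ADT outage (hypothesis 2).
        var chaos = MonkeyPolicy.InjectExceptionAsync(with =>
            with.Fault(new RequestFailedException(500, "Simmy: simulated outage"))
                .InjectionRate(1.0)
                .Enabled());

        var policy = Policy.WrapAsync(circuitBreaker, chaos);

        // The first {allowedExceptionCount} executions surface the injected fault...
        for (var i = 0; i < allowedExceptionCount; i++)
        {
            await Assert.ThrowsAsync<RequestFailedException>(
                () => policy.ExecuteAsync(() => Task.CompletedTask));
        }

        // ...after which the circuit is open and further executions are short-circuited.
        await Assert.ThrowsAsync<BrokenCircuitException>(
            () => policy.ExecuteAsync(() => Task.CompletedTask));
    }
}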

The Azure ADT .NET SDK was just one of the external dependencies we used as part of our customer project, and we’ve highlighted its usage here as a representative sample of this approach to chaos testing. In our project, we implemented a version of this for each external SDK client we used, fine-tuning the resilience policies and fault injections based on the usage and error handling requirements of each one. For us, this testing revealed opportunities for improvement: it led us to add more fine-grained error handling to our circuit breaker logic so that transient HTTP errors and temporary rate-limit throttling are treated differently from errors related to input validation for ADT requests (the final result of which is shown in this sample), and it informed a more thoughtful custom error handling policy for a cache client that had different built-in retry configurations for its initial connection and for its cache operations. Most importantly, this careful observation and documentation across the failure scenarios we identified enabled us to test and validate our assumptions about how our application behaved in the face of unexpected failures. By using these results to identify bugs and make improvements where needed across our system, we could ensure that we were leaving our customer well-equipped to handle errors gracefully and provide a reliable system to their end users.

While this is by no means an exhaustive example, we hope that you can extend it, in conjunction with the Polly and Simmy documentation and Azure resources, to develop a strategy for resilience and chaos testing that’s best suited for your cloud application.

Conclusion

Resilience is a core tenet of building reliable cloud platforms, and in addition to Polly and Simmy for .NET, there are many tools available for incorporating and testing resilience engineering in your cloud applications. It’s always important to keep engineering fundamentals in mind when designing and implementing systems; rigorous testing of our software is critical not only for ensuring that our system meets its expected behavior, but also for fostering a mindset of understanding all potential inputs and failure points of the system. Being methodical and taking the time to systematically develop testing hypotheses helps to validate assumptions, promote code quality, and uncover the areas where your application can become more resilient overall.

To see the code presented in this sample in context of the full data processing workflow, check out the AAS Digital Factory repo in Azure-Samples. We hope that you’ve found this to be a useful overview of some of the key concepts of resilience engineering and chaos testing and how Polly and Simmy can be used to achieve these in a real-world application. This is obviously just one way of building and testing resilience in an Azure .NET application – we welcome any thoughts or feedback on ways you might have achieved this in your own cloud applications.
