{"id":50406,"date":"2024-02-09T10:05:00","date_gmt":"2024-02-09T18:05:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/dotnet\/?p=50406"},"modified":"2024-12-13T14:02:35","modified_gmt":"2024-12-13T22:02:35","slug":"resilience-and-chaos-engineering","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/dotnet\/resilience-and-chaos-engineering\/","title":{"rendered":"Resilience and chaos engineering"},"content":{"rendered":"<p>This article introduces the concept of resilience and chaos engineering in .NET applications using the Polly library, highlighting new features that enable chaos engineering. It provides a practical guide on integrating chaos strategies within HTTP clients and showcases how to configure resilience pipelines for improved fault tolerance.<\/p>\n<h2>TL;DR<\/h2>\n<p>As of version 8.3.0, the <a href=\"https:\/\/www.pollydocs.org\">Polly library<\/a> now supports <a href=\"https:\/\/www.pollydocs.org\/chaos\">chaos engineering<\/a>. This update allows you to use the following chaos strategies:<\/p>\n<ul>\n<li><strong>Fault<\/strong>: Introduces faults (exceptions) into your system.<\/li>\n<li><strong>Outcome<\/strong>: Injects fake outcomes (results or exceptions) in your system.<\/li>\n<li><strong>Latency<\/strong>: Adds delay to executions before calls are executed.<\/li>\n<li><strong>Behavior<\/strong>: Enables the injection of <em>any<\/em> additional behavior before a call is made.<\/li>\n<\/ul>\n<blockquote>\n<p>Chaos engineering for .NET was initially introduced in the <a href=\"https:\/\/github.com\/Polly-Contrib\/Simmy\">Simmy library<\/a>. 
In version 8 of Polly, we collaborated with the creator of Simmy to integrate the Simmy library directly into Polly.<\/p>\n<\/blockquote>\n<p>To use the new chaos strategies in an HTTP client, add the following packages to your C# project:<\/p>\n<pre><code class=\"language-xml\">&lt;ItemGroup&gt;\n  &lt;PackageReference Include=\"Microsoft.Extensions.Http.Resilience\" \/&gt;\n  &lt;!-- This is required until Microsoft.Extensions.Http.Resilience updates the version of Polly it depends on. --&gt;\n  &lt;PackageReference Include=\"Polly.Extensions\" \/&gt;\n&lt;\/ItemGroup&gt;<\/code><\/pre>\n<p>You can now use the new chaos strategies when setting up the resilience pipeline:<\/p>\n<pre><code class=\"language-csharp\">services\n    .AddHttpClient(\"my-client\")\n    .AddResilienceHandler(\"my-pipeline\", (ResiliencePipelineBuilder&lt;HttpResponseMessage&gt; builder) =&gt; \n    {\n        \/\/ Start with configuring standard resilience strategies\n        builder\n            .AddConcurrencyLimiter(10, 100)\n            .AddRetry(new RetryStrategyOptions&lt;HttpResponseMessage&gt; { \/* configuration options *\/ })\n            .AddCircuitBreaker(new CircuitBreakerStrategyOptions&lt;HttpResponseMessage&gt; { \/* configuration options *\/ })\n            .AddTimeout(TimeSpan.FromSeconds(5));\n\n        \/\/ Next, configure chaos strategies to introduce controlled disruptions.\n        \/\/ Place these after the standard resilience strategies.\n\n        \/\/ Inject chaos into 2% of invocations\n        const double InjectionRate = 0.02;\n\n        builder\n            .AddChaosLatency(InjectionRate, TimeSpan.FromMinutes(1)) \/\/ Introduce a delay as chaos latency\n            .AddChaosFault(InjectionRate, () =&gt; new InvalidOperationException(\"Chaos strategy injection!\")) \/\/ Introduce a fault as chaos\n            .AddChaosOutcome(InjectionRate, () =&gt; new HttpResponseMessage(System.Net.HttpStatusCode.InternalServerError)) \/\/ Simulate an outcome as chaos\n      
      .AddChaosBehavior(0.001, cancellationToken =&gt; RestartRedisAsync(cancellationToken)); \/\/ Introduce a specific behavior as chaos            \n    });<\/code><\/pre>\n<p>In the example above:<\/p>\n<ul>\n<li>The <a href=\"https:\/\/learn.microsoft.com\/dotnet\/core\/extensions\/httpclient-factory\"><code>IHttpClientFactory<\/code><\/a> pattern is used to register the <code>my-client<\/code> HTTP client.<\/li>\n<li>The <code>AddResilienceHandler<\/code> extension method is used to set up Polly&#8217;s resilience pipeline. For more information on resilience with the Polly library, refer to <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/building-resilient-cloud-services-with-dotnet-8\/\">Building resilient cloud services with .NET 8<\/a>.<\/li>\n<li>Both resilience and chaos strategies are configured by calling extension methods on the <code>builder<\/code> instance.<\/li>\n<\/ul>\n<h2>About chaos engineering<\/h2>\n<p>Chaos engineering is a practice that involves testing a system by introducing disturbances or unexpected conditions. 
The goal is to gain confidence in the system&#8217;s ability to remain stable and reliable under challenging circumstances in a live production environment.<\/p>\n<p>There are numerous resources available for those interested in exploring chaos engineering further:<\/p>\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Chaos_engineering\">Chaos engineering on Wikipedia<\/a>: This link provides an overview of chaos engineering, including its basic concepts, history, and associated tools.<\/li>\n<li><a href=\"https:\/\/www.gremlin.com\/community\/tutorials\/chaos-engineering-the-history-principles-and-practice\/\">Chaos engineering, the history, principles, and practices<\/a>: An in-depth article by <a href=\"https:\/\/www.gremlin.com\/chaos-engineering\/\">Gremlin<\/a>, a platform specializing in chaos engineering, covering the topic&#8217;s history, key principles, and practical applications.<\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/azure\/chaos-studio\/chaos-studio-chaos-engineering-overview\">Understanding chaos engineering and resilience<\/a>: A beginner&#8217;s guide to chaos engineering within the framework of <a href=\"https:\/\/learn.microsoft.com\/azure\/chaos-studio\/chaos-studio-overview\">Azure Chaos Studio<\/a>, a service designed to evaluate and enhance the resilience of cloud applications and services through chaos engineering.<\/li>\n<\/ul>\n<p>However, this blog post will not delve into the intricacies of chaos engineering. Instead, it highlights how to use the Polly library to inject chaos into our systems practically. We will focus on <em>in-process<\/em> chaos injection, meaning we introduce chaos directly into your process. We won&#8217;t cover other external methods, such as restarting virtual machines, simulating high CPU usage, or creating low-memory conditions, in this article.<\/p>\n<h2>Scenario<\/h2>\n<p>We&#8217;re building a simple web service that provides a list of TODOs. 
This service gets the TODOs by talking to the <code>https:\/\/jsonplaceholder.typicode.com\/todos<\/code> endpoint using an HTTP client. To test how well our service can handle problems, we will introduce chaos into the HTTP communication. Then, we&#8217;ll use resilience strategies to mitigate these issues, making sure that the service remains reliable for users.<\/p>\n<p>You&#8217;ll learn how to use the Polly API to control the amount of chaos injected. This technique allows us to apply chaos selectively, such as only in certain environments (like Development or Production) or to specific users. This way, we can ensure stability while still testing our service&#8217;s robustness.<\/p>\n<h2>Concepts<\/h2>\n<p>The Polly library&#8217;s chaos strategies have several key properties in common:<\/p>\n<table>\n<thead>\n<tr>\n<th>Property<\/th>\n<th>Default Value<\/th>\n<th>Description<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>InjectionRate<\/code><\/td>\n<td>0.001<\/td>\n<td>This is a decimal value between 0 and 1. It represents the chance that chaos will be introduced. For example, a rate of 0.2 means there&#8217;s a 20% chance for chaos on each call; 0.01 means a 1% chance; and 1 means chaos will occur on every call.<\/td>\n<\/tr>\n<tr>\n<td><code>InjectionRateGenerator<\/code><\/td>\n<td><code>null<\/code><\/td>\n<td>This function decides the rate of chaos for each specific instance, with values ranging from 0 to 1.<\/td>\n<\/tr>\n<tr>\n<td><code>Enabled<\/code><\/td>\n<td><code>true<\/code><\/td>\n<td>This indicates whether chaos injection is currently active.<\/td>\n<\/tr>\n<tr>\n<td><code>EnabledGenerator<\/code><\/td>\n<td><code>null<\/code><\/td>\n<td>This function determines if the chaos strategy should be activated for each specific instance.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>By adjusting the <code>InjectionRate<\/code>, you can control the amount of chaos injected into the system. 
The <code>EnabledGenerator<\/code> lets you dynamically enable or disable the chaos injection. This means you can turn the chaos on or off under specific conditions, offering flexibility in testing and resilience planning.<\/p>\n<h2>Practical example<\/h2>\n<p>In the next sections, we&#8217;ll walk through the development of a simple web service. This process includes introducing chaos into the system, then exploring how you can dynamically manage the level of chaos introduced. Lastly, we&#8217;ll demonstrate how to use resilience strategies to mitigate the chaos.<\/p>\n<h3>Creating the Project<\/h3>\n<p>To start, we&#8217;ll create a new project using the console. Follow these steps:<\/p>\n<p><strong>Create a new project<\/strong>: Open a new console window and run the following command that creates a new web project named <em>Chaos<\/em>:<\/p>\n<pre><code class=\"language-bash\">dotnet new web -o Chaos<\/code><\/pre>\n<p>This command creates a new directory called <code>Chaos<\/code> with a basic web project setup.<\/p>\n<p><strong>Modify <code>Program.cs<\/code> File<\/strong>: Next, open the <code>Program.cs<\/code> file in your newly created project and replace its contents with the following code:<\/p>\n<pre><code class=\"language-csharp\">var builder = WebApplication.CreateBuilder(args);\nvar services = builder.Services;\n\nvar httpClientBuilder = builder.Services.AddHttpClient&lt;TodosClient&gt;(client =&gt; client.BaseAddress = new Uri(\"https:\/\/jsonplaceholder.typicode.com\"));\n\nvar app = builder.Build();\napp.MapGet(\"\/\", (TodosClient client, CancellationToken cancellationToken) =&gt; client.GetTodosAsync(cancellationToken));\napp.Run();<\/code><\/pre>\n<p>The <code>TodosClient<\/code> and <code>TodoModel<\/code> types are defined as:<\/p>\n<pre><code class=\"language-csharp\">\/\/ Required for JsonPropertyName; using directives go at the top of the file.\nusing System.Text.Json.Serialization;\n\npublic class TodosClient(HttpClient client)\n{\n    public async Task&lt;IEnumerable&lt;TodoModel&gt;&gt; GetTodosAsync(CancellationToken cancellationToken)\n    {\n        return await 
client.GetFromJsonAsync&lt;IEnumerable&lt;TodoModel&gt;&gt;(\"\/todos\", cancellationToken) ?? [];\n    }\n}\n\npublic record TodoModel(\n    [property: JsonPropertyName(\"id\")] int Id, \n    [property: JsonPropertyName(\"title\")] string Title);<\/code><\/pre>\n<p>The code above does the following:<\/p>\n<ul>\n<li>Utilizes the <code>IHttpClientFactory<\/code> to configure a typed <code>TodosClient<\/code> that targets a specified endpoint.<\/li>\n<li>Injects the <code>TodosClient<\/code> into the request handler to fetch a list of todos from the remote endpoint.<\/li>\n<\/ul>\n<p><strong>Run the application<\/strong>: After setting up your project and the <code>TodosClient<\/code>, run the application. You should be able to retrieve and display a list of todos by accessing the root endpoint.<\/p>\n<h3>Injecting chaos<\/h3>\n<p>In this section, we&#8217;ll introduce chaos to our HTTP client to observe its impact on our web service.<\/p>\n<p><strong>Adding resilience libraries<\/strong>: First, update your project file to include necessary dependencies for resilience and chaos handling:<\/p>\n<pre><code class=\"language-xml\">&lt;ItemGroup&gt;\n    &lt;PackageReference Include=\"Microsoft.Extensions.Http.Resilience\" Version=\"8.0.0\" \/&gt;\n    &lt;PackageReference Include=\"Polly.Core\" Version=\"8.3.0\" \/&gt;\n&lt;\/ItemGroup&gt;<\/code><\/pre>\n<blockquote>\n<p><strong>Note<\/strong>: We&#8217;re including <code>Polly.Core<\/code> directly, even though <code>Microsoft.Extensions.Http.Resilience<\/code> already references it. This ensures we use the latest version of Polly that includes chaos strategies. 
Once <code>Microsoft.Extensions.Http.Resilience<\/code> is updated to incorporate the latest <code>Polly.Core<\/code>, this direct reference will no longer be necessary.<\/p>\n<\/blockquote>\n<p><strong>Injecting chaos into the HTTP Client<\/strong>: Next, enhance the HTTP client setup in your code to use the <code>AddResilienceHandler<\/code> for integrating chaos strategies:<\/p>\n<pre><code class=\"language-csharp\">var httpClientBuilder = builder.Services.AddHttpClient&lt;TodosClient&gt;(client =&gt; client.BaseAddress = new Uri(\"https:\/\/jsonplaceholder.typicode.com\"));\n\n\/\/ New code below\nhttpClientBuilder.AddResilienceHandler(\"chaos\", (ResiliencePipelineBuilder&lt;HttpResponseMessage&gt; builder) =&gt; \n{\n    \/\/ Set the chaos injection rate to 5%\n    const double InjectionRate = 0.05;\n\n    builder\n        .AddChaosLatency(InjectionRate, TimeSpan.FromSeconds(5)) \/\/ Add latency to simulate network delays\n        .AddChaosFault(InjectionRate, () =&gt; new InvalidOperationException(\"Chaos strategy injection!\")) \/\/ Inject faults to simulate system errors\n        .AddChaosOutcome(InjectionRate, () =&gt; new HttpResponseMessage(System.Net.HttpStatusCode.InternalServerError)); \/\/ Simulate server errors\n});<\/code><\/pre>\n<p>This change accomplishes the following:<\/p>\n<ul>\n<li>Applies the <code>AddResilienceHandler<\/code> extension method to introduce a <code>chaos<\/code> resilience pipeline to our HTTP client.<\/li>\n<li>Utilizes various Polly extension methods within the callback to integrate different types of chaos (latency, faults, and error outcomes) into our HTTP calls.<\/li>\n<\/ul>\n<p><strong>Running the application<\/strong>: With the chaos strategies in place, running the application and attempting to retrieve TODOs will now result in random issues. 
These issues, while artificially induced at a 5% rate, mimic real-world scenarios where dependencies may be unstable, leading to occasional disruptions.<\/p>\n<h3>Dynamically injecting chaos<\/h3>\n<p>In the previous section, we introduced chaos into our HTTP client, but with limited control over the chaos injection&#8217;s timing and intensity. The Polly library, however, offers powerful APIs that allow for precise control over when and how chaos is introduced. These capabilities enable several scenarios:<\/p>\n<ul>\n<li><strong>Environment-specific chaos<\/strong>: Inject chaos only in certain environments, such as Testing or Production, to assess resilience without affecting all users.<\/li>\n<li><strong>User or tenant chaos<\/strong>: Introduce chaos specifically for certain users or tenants, which can be useful for testing resilience in multi-tenant applications.<\/li>\n<li><strong>Dynamic chaos intensity<\/strong>: Adjust the amount of chaos injected based on the environment or specific users, allowing for more refined testing.<\/li>\n<li><strong>Selective request chaos<\/strong>: Choose specific requests or APIs for chaos injection, enabling focused testing on particular system parts.<\/li>\n<\/ul>\n<p>To implement these scenarios, we can use the <code>EnabledGenerator<\/code> and <code>InjectionRateGenerator<\/code> properties available across all Polly chaos strategies. Let&#8217;s explore how to apply these in practice.<\/p>\n<p><strong>Create <code>IChaosManager<\/code> abstraction<\/strong>: First, we&#8217;ll create an <code>IChaosManager<\/code> interface to encapsulate our chaos injection logic. 
This interface might look like this:<\/p>\n<pre><code class=\"language-csharp\">public interface IChaosManager\n{\n    ValueTask&lt;bool&gt; IsChaosEnabledAsync(ResilienceContext context);\n\n    ValueTask&lt;double&gt; GetInjectionRateAsync(ResilienceContext context);\n}<\/code><\/pre>\n<p>This interface allows us to dynamically determine if chaos should be enabled and at what rate, with the flexibility to make these decisions asynchronously, such as by fetching configuration settings from a remote source.<\/p>\n<p><strong>Incorporating the <code>IChaosManager<\/code> interface<\/strong>: We&#8217;re going to update our approach for defining chaos strategies by utilizing the <code>IChaosManager<\/code>. This will allow us to make dynamic decisions about when to inject chaos.<\/p>\n<pre><code class=\"language-csharp\">\/\/ Updated code below\nhttpClientBuilder.AddResilienceHandler(\"chaos\", (builder, context) =&gt; \n{\n    \/\/ Get IChaosManager from dependency injection\n    var chaosManager = context.ServiceProvider.GetRequiredService&lt;IChaosManager&gt;();\n\n    builder\n        .AddChaosLatency(new ChaosLatencyStrategyOptions\n        {\n            EnabledGenerator = args =&gt; chaosManager.IsChaosEnabledAsync(args.Context),\n            InjectionRateGenerator = args =&gt; chaosManager.GetInjectionRateAsync(args.Context),\n            Latency = TimeSpan.FromSeconds(5)\n        })\n        .AddChaosFault(new ChaosFaultStrategyOptions\n        {\n            EnabledGenerator = args =&gt; chaosManager.IsChaosEnabledAsync(args.Context),\n            InjectionRateGenerator = args =&gt; chaosManager.GetInjectionRateAsync(args.Context),\n            FaultGenerator = new FaultGenerator().AddException(() =&gt; new InvalidOperationException(\"Chaos strategy injection!\"))\n        })\n        .AddChaosOutcome(new ChaosOutcomeStrategyOptions&lt;HttpResponseMessage&gt;\n        {\n            EnabledGenerator = args =&gt; 
chaosManager.IsChaosEnabledAsync(args.Context),\n            InjectionRateGenerator = args =&gt; chaosManager.GetInjectionRateAsync(args.Context),\n            OutcomeGenerator = new OutcomeGenerator&lt;HttpResponseMessage&gt;().AddResult(() =&gt; new HttpResponseMessage(System.Net.HttpStatusCode.InternalServerError))\n        });\n});<\/code><\/pre>\n<p>Let&#8217;s break down the changes:<\/p>\n<ul>\n<li>We&#8217;re using a different <code>AddResilienceHandler<\/code> overload whose callback also receives a <code>context<\/code>. This <code>context<\/code> gives us access to <code>IServiceProvider<\/code>, allowing us to retrieve additional services needed for our chaos setup.<\/li>\n<li><code>IChaosManager<\/code> is obtained and utilized for setting up chaos strategies.<\/li>\n<li>Rather than using simple chaos methods, we&#8217;re opting for options-based extensions. This grants complete control over the chaos functionalities, including the ability to specify conditions for when chaos should be enabled and how frequently. This includes access to both the <code>EnabledGenerator<\/code> and <code>InjectionRateGenerator<\/code> properties.<\/li>\n<li>The introduction of <code>FaultGenerator<\/code> and <code>OutcomeGenerator&lt;HttpResponseMessage&gt;<\/code> enables us to define specific faults (errors) and outcomes that chaos strategies will inject. These APIs also allow the creation of a variety of faults and the assignment of different probabilities to each, influencing how likely each fault is to occur. 
For further details, refer to <a href=\"https:\/\/www.pollydocs.org\/chaos\/outcome#generating-outcomes\">PollyDocs: generating outcomes<\/a> and <a href=\"https:\/\/www.pollydocs.org\/chaos\/fault#generating-faults\">PollyDocs: generating faults<\/a>.<\/li>\n<\/ul>\n<h3>Implementing the <code>IChaosManager<\/code><\/h3>\n<p>Before we dive into the implementation of the chaos manager, let&#8217;s outline how it behaves during chaos injection:<\/p>\n<ul>\n<li>In the Development environment, which stands in for a testing environment here, chaos is always active, with an injection rate of <em>5%<\/em>.<\/li>\n<li>In the Production environment, chaos is enabled exclusively for the test user, with an injection rate of <em>3%<\/em>.<\/li>\n<\/ul>\n<blockquote>\n<p><strong>Note:<\/strong> For this example, we&#8217;ll simplify user identification. We&#8217;ll identify users based on the presence of the <code>user=&lt;user-name&gt;<\/code> query string, without delving into the complexities of user identity.<\/p>\n<\/blockquote>\n<p><strong><code>ChaosManager<\/code> implementation<\/strong>: To meet the specified requirements, here&#8217;s how we implement the <code>ChaosManager<\/code>:<\/p>\n<pre><code class=\"language-csharp\">internal class ChaosManager(IWebHostEnvironment environment, IHttpContextAccessor contextAccessor) : IChaosManager\n{\n    private const string UserQueryParam = \"user\";\n\n    private const string TestUser = \"test\";\n\n    public ValueTask&lt;bool&gt; IsChaosEnabledAsync(ResilienceContext context)\n    {\n        if (environment.IsDevelopment())\n        {\n            return ValueTask.FromResult(true);\n        }\n\n        \/\/ This condition is demonstrative and not recommended to use in real apps.\n        if (environment.IsProduction() &amp;&amp;\n            contextAccessor.HttpContext is {} httpContext &amp;&amp; \n            httpContext.Request.Query.TryGetValue(UserQueryParam, out var values) &amp;&amp;\n            values == TestUser)\n        {\n            \/\/ Enable chaos for 'test' 
user even in production \n            return ValueTask.FromResult(true);\n        }\n\n        return ValueTask.FromResult(false);\n    }\n\n    public ValueTask&lt;double&gt; GetInjectionRateAsync(ResilienceContext context)\n    {\n        if (environment.IsDevelopment())\n        {\n            return ValueTask.FromResult(0.05);\n        }\n\n        if (environment.IsProduction())\n        {\n            return ValueTask.FromResult(0.03);\n        }\n\n        return ValueTask.FromResult(0.0);\n    }\n}<\/code><\/pre>\n<p><strong>Integrating <code>IChaosManager<\/code> with <code>IServiceCollection<\/code><\/strong>: Update <strong>Program.cs<\/strong> to include <code>IChaosManager<\/code> in the Dependency Injection (DI) container.<\/p>\n<pre><code class=\"language-csharp\">var builder = WebApplication.CreateBuilder(args);\nvar services = builder.Services;\n\nservices.TryAddSingleton&lt;IChaosManager, ChaosManager&gt;(); \/\/ &lt;-- Add this line\nservices.AddHttpContextAccessor(); \/\/ &lt;-- Add this line<\/code><\/pre>\n<p><strong>Run the application<\/strong>: Start the application to see how it behaves. Experiment by <a href=\"https:\/\/learn.microsoft.com\/aspnet\/core\/fundamentals\/environments\">changing the environment setting<\/a> to observe the differences in how the application responds to chaos in development versus production settings.<\/p>\n<h3>Use resilience strategies to fix the chaos<\/h3>\n<p>In this section we will use resilience strategies to mitigate the chaos. 
First, let&#8217;s recap the types of chaos being injected:<\/p>\n<ul>\n<li>Latency of 5 seconds.<\/li>\n<li>Injection of an <code>InvalidOperationException<\/code>.<\/li>\n<li>Injection of an <code>HttpResponseMessage<\/code> with <code>HttpStatusCode.InternalServerError<\/code>.<\/li>\n<\/ul>\n<p>We can use the following resilience strategies to mitigate the chaos.<\/p>\n<ul>\n<li><a href=\"https:\/\/www.pollydocs.org\/strategies\/timeout\">Timeout Strategy<\/a>: Cancels overly long requests.<\/li>\n<li><a href=\"https:\/\/www.pollydocs.org\/strategies\/retry\">Retry Strategy<\/a>: Retries exceptions and invalid responses.<\/li>\n<li><a href=\"https:\/\/www.pollydocs.org\/strategies\/circuit-breaker\">Circuit Breaker Strategy<\/a>: If the failure rate exceeds a certain threshold, it&#8217;s better to stop all communications temporarily. This gives the dependency a chance to recover while conserving resources of our service.<\/li>\n<\/ul>\n<blockquote>\n<p>For more information on various resilience strategies offered by the Polly library, please visit <a href=\"https:\/\/www.pollydocs.org\/strategies\">PollyDocs: Resilience strategies<\/a>.<\/p>\n<\/blockquote>\n<p>To add resilience to the HTTP client, you have two options:<\/p>\n<ul>\n<li><a href=\"https:\/\/learn.microsoft.com\/dotnet\/core\/resilience\/http-resilience?tabs=dotnet-cli#add-custom-resilience-handlers\"><strong>Custom Resilience Handler<\/strong><\/a>: This option gives you complete control over the resilience strategies added to the client.<\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/dotnet\/core\/resilience\/http-resilience?tabs=dotnet-cli#add-standard-resilience-handler\"><strong>Standard Resilience Handler<\/strong><\/a>: A pre-defined standard resilience handler designed to meet the needs of most situations encountered in real applications, and used across many production applications.<\/li>\n<\/ul>\n<p>For our purposes, the chaos caused by the chaos strategies mentioned 
earlier can be effectively managed using the standard resilience handler. Generally, it&#8217;s advisable to use the standard handler unless you find it doesn&#8217;t meet your specific needs. Here&#8217;s how you configure the HTTP client with standard resilience:<\/p>\n<pre><code class=\"language-csharp\">var httpClientBuilder = builder.Services.AddHttpClient&lt;TodosClient&gt;(client =&gt; client.BaseAddress = new Uri(\"https:\/\/jsonplaceholder.typicode.com\"));\n\n\/\/ Add and configure the standard resilience above the chaos handler\nhttpClientBuilder\n    .AddStandardResilienceHandler()\n    .Configure(options =&gt; \n    {\n        \/\/ Update attempt timeout to 1 second\n        options.AttemptTimeout.Timeout = TimeSpan.FromSeconds(1);\n\n        \/\/ Update circuit breaker to handle transient errors and InvalidOperationException\n        options.CircuitBreaker.ShouldHandle = args =&gt; args.Outcome switch\n        {\n            {} outcome when HttpClientResiliencePredicates.IsTransient(outcome) =&gt; PredicateResult.True(),\n            { Exception: InvalidOperationException } =&gt; PredicateResult.True(),\n            _ =&gt; PredicateResult.False()\n        };\n\n        \/\/ Update retry strategy to handle transient errors and InvalidOperationException\n        options.Retry.ShouldHandle = args =&gt; args.Outcome switch\n        {\n            {} outcome when HttpClientResiliencePredicates.IsTransient(outcome) =&gt; PredicateResult.True(),\n            { Exception: InvalidOperationException } =&gt; PredicateResult.True(),\n            _ =&gt; PredicateResult.False()\n        };\n    });\n\nhttpClientBuilder.AddResilienceHandler(\"chaos\", (builder, context) =&gt; { \/* Chaos configuration omitted for clarity *\/ });<\/code><\/pre>\n<p>The previous example:<\/p>\n<ul>\n<li>Adds a standard resilience handler using <code>AddStandardResilienceHandler<\/code>. 
It&#8217;s crucial to place the standard handler before the chaos handler. Handlers run in registration order, so the standard handler wraps the chaos handler and can time out or retry the failures it injects.<\/li>\n<li>Uses the <code>Configure<\/code> extension method to modify the resilience strategy configuration.<\/li>\n<li>Sets the attempt timeout to 1 second. Timeout duration may differ for each dependency. In our case, endpoint calls are quick. If they exceed 1 second, it suggests an issue, and it&#8217;s advisable to cancel and retry.<\/li>\n<li>Updates the <code>ShouldHandle<\/code> predicate for both retry and circuit breaker strategies. Here, we employ switch expressions for error handling. The <code>HttpClientResiliencePredicates.IsTransient<\/code> function is utilized to retry on typical transient errors, like HTTP status codes of 500 and above or <code>HttpRequestException<\/code>. We also need to handle <code>InvalidOperationException<\/code>, as it&#8217;s not covered by the <code>HttpClientResiliencePredicates.IsTransient<\/code> function.<\/li>\n<\/ul>\n<p><strong>Running the application<\/strong>: With resilience integrated into the HTTP pipeline, try to run the application and observe its behavior. Errors should now be rare, because the standard resilience handler counters most of the injected chaos.<\/p>\n<h2>Telemetry<\/h2>\n<p>Polly offers <a href=\"https:\/\/www.pollydocs.org\/advanced\/telemetry\">comprehensive telemetry<\/a> by default, providing robust tools to track the use of resilience strategies. This telemetry includes both logs and metrics.<\/p>\n<h3>Logs<\/h3>\n<p>Start the application and pay attention to the log events generated by Polly. Look for &#8220;chaos events&#8221; such as <code>Chaos.OnLatency<\/code> or <code>Chaos.OnOutcome<\/code> marked with an <code>Information<\/code> severity level:<\/p>\n<pre><code class=\"language-bash\">info: Polly[0]\n      Resilience event occurred. 
EventName: 'Chaos.OnOutcome', Source: 'TodosClient-chaos\/\/Chaos.Outcome', Operation Key: '', Result: ''\ninfo: Polly[0]\n      Resilience event occurred. EventName: 'Chaos.OnLatency', Source: 'TodosClient-chaos\/\/Chaos.Latency', Operation Key: '', Result: ''      <\/code><\/pre>\n<p>Resilience events like <code>OnRetry<\/code> or <code>OnTimeout<\/code>, triggered by resilience strategies, are logged with higher severity levels, such as <code>Warning<\/code> or <code>Error<\/code>. These indicate unusual activity in the system:<\/p>\n<pre><code class=\"language-bash\">fail: Polly[0]\n      Resilience event occurred. EventName: 'OnTimeout', Source: 'TodosClient-standard\/\/Standard-AttemptTimeout', Operation Key: '', Result: ''\nwarn: Polly[0]\n      Resilience event occurred. EventName: 'OnRetry', Source: 'TodosClient-standard\/\/Standard-Retry', Operation Key: '', Result: '500'      <\/code><\/pre>\n<h3>Metrics<\/h3>\n<p>Polly emits the following instruments under the <strong>Polly<\/strong> meter name. 
These help you monitor your application&#8217;s health:<\/p>\n<ul>\n<li><code>resilience.polly.strategy.events<\/code>: Triggered when a resilience event occurs.<\/li>\n<li><code>resilience.polly.strategy.attempt.duration<\/code>: Measures how long execution attempts take, relevant for <code>Retry<\/code> and <code>Hedging<\/code> strategies.<\/li>\n<li><code>resilience.polly.pipeline.duration<\/code>: Tracks the total time taken by resilience pipelines.<\/li>\n<\/ul>\n<p>For more information on Polly&#8217;s metrics, visit <a href=\"https:\/\/www.pollydocs.org\/advanced\/telemetry.html#metrics\">Polly Docs: Metrics<\/a>.<\/p>\n<p>To see these metrics in action, let&#8217;s use the <a href=\"https:\/\/learn.microsoft.com\/dotnet\/core\/diagnostics\/metrics-collection#view-metrics-with-dotnet-counters\"><code>dotnet counters<\/code><\/a> tool:<\/p>\n<ol>\n<li>Start the application using the <code>dotnet run<\/code> command.<\/li>\n<li>Open a new terminal window and execute the following command to begin tracking the <code>Polly<\/code> metrics for our <code>Chaos<\/code> app:<\/li>\n<\/ol>\n<pre><code class=\"language-bash\">dotnet counters monitor -n Chaos Polly<\/code><\/pre>\n<p>Finally, access the application&#8217;s root endpoint several times until resilience events start appearing in the console. 
In the terminal where <code>dotnet counters<\/code> is active you should see the following output:<\/p>\n<pre><code class=\"language-bash\">[Polly]\n    resilience.polly.pipeline.duration (ms)\n        error.type=200,event.name=PipelineExecuted,event.severit         130.25\n        error.type=200,event.name=PipelineExecuted,event.severit         130.25\n        error.type=200,event.name=PipelineExecuted,event.severit         130.25\n        error.type=200,event.name=PipelineExecuted,event.severit         133.75\n        error.type=200,event.name=PipelineExecuted,event.severit         133.75\n        error.type=200,event.name=PipelineExecuted,event.severit         133.75\n        error.type=500,event.name=PipelineExecuted,event.severit           2.363\n        error.type=500,event.name=PipelineExecuted,event.severit           2.363\n        error.type=500,event.name=PipelineExecuted,event.severit           2.363\n        event.name=PipelineExecuted,event.severity=Information,e         752\n        event.name=PipelineExecuted,event.severity=Information,e         752\n        event.name=PipelineExecuted,event.severity=Information,e         752\n    resilience.polly.strategy.attempt.duration (ms)\n        attempt.handled=False,attempt.number=0,error.type=200,ev         130.5\n        attempt.handled=False,attempt.number=0,error.type=200,ev         130.5\n        attempt.handled=False,attempt.number=0,error.type=200,ev         130.5\n        attempt.handled=False,attempt.number=1,error.type=200,ev          98.125\n        attempt.handled=False,attempt.number=1,error.type=200,ev          98.125\n        attempt.handled=False,attempt.number=1,error.type=200,ev          98.125\n        attempt.handled=True,attempt.number=0,error.type=500,eve           2.422\n        attempt.handled=True,attempt.number=0,error.type=500,eve           2.422\n        attempt.handled=True,attempt.number=0,error.type=500,eve           2.422\n    resilience.polly.strategy.events (Count \/ 1 
sec)\n        error.type=500,event.name=OnRetry,event.severity=Warning           0\n        event.name=Chaos.OnOutcome,event.severity=Information,pi           0\n        event.name=dummy,event.severity=Error,pipeline.instance=           0\n        event.name=PipelineExecuting,event.severity=Debug,pipeli           0\n        event.name=PipelineExecuting,event.severity=Debug,pipeli           0<\/code><\/pre>\n<p>In practical applications, it&#8217;s beneficial to use Polly metrics to build dashboards and alerts that track your application&#8217;s resilience. The article on <a href=\"https:\/\/learn.microsoft.com\/dotnet\/core\/diagnostics\/observability-with-otel\">.NET observability with OpenTelemetry<\/a> provides a solid starting point for this process.<\/p>\n<h2>Summary<\/h2>\n<p>This article explores the chaos engineering features available in Polly version <code>8.3.0<\/code> and later. It walks you through developing a web application that communicates with a remote dependency. By using Polly&#8217;s chaos engineering capabilities, we can introduce controlled chaos into HTTP client communications and then implement resilience strategies to counteract it. Applying this approach to production applications enables you to proactively address issues affecting your application&#8217;s resilience. 
Instead of waiting for unforeseen problems, use chaos engineering to simulate and prepare for them.<\/p>\n<p>You can check out the full example in the <a href=\"https:\/\/github.com\/App-vNext\/Polly\/tree\/main\/samples\/Chaos\"><code>Polly\/Samples\/Chaos<\/code><\/a> folder on GitHub.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Chaos engineering with HTTP clients and Polly library<\/p>\n","protected":false},"author":133391,"featured_media":50407,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[685,7699,7509,756,7591],"tags":[4,46,7794,7795,7539,7676,80,7770,7772,7769,7771],"class_list":["post-50406","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-dotnet","category-dotnet-fundamentals","category-aspnetcore","category-csharp","category-networking","tag-net","tag-c","tag-chaos","tag-chaos-engineering","tag-cloud","tag-http","tag-httpclient","tag-ihttpclientfactory","tag-polly","tag-resilience","tag-scale"],"acf":[],"blog_post_summary":"<p>Chaos engineering with HTTP clients and Polly 
library<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/50406","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/users\/133391"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/comments?post=50406"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/50406\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media\/50407"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media?parent=50406"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/categories?post=50406"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/tags?post=50406"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}