Introducing Configurable Retry Logic in Microsoft.Data.SqlClient v3.0.0-Preview1






The first (false) assumption mentioned in the infamous Fallacies of distributed computing states that “the network is reliable”. Unfortunately, it turns out that distributed applications leveraging services running on multiple compute nodes can suffer from transient connectivity issues, and this is true for on-premises, hybrid and cloud-based deployments. Applications that are not implementing well-known Retry pattern on generic service-to-service communications will usually experience timeouts and exceptions when connectivity issues will manifest themselves.

In addition to network blips. services like Azure SQL can dynamically reconfigure themselves in response to situations like heavy workloads, planned events (like software upgrade or maintenance) and unplanned issues, like hardware or software failures. When events like these happen, client applications can lose connection to the database and receive specific error codes providing an indication of what’s going on in the system. To get a comprehensive list of error messages that can usually be considered as transient and retriable, we recommend you take a deeper look at this other article.

Over the last few years, we provided guidance on how customers could create their own retry logic or, as an alternative, reuse existing libraries aimed to simplify this task for them, More recently, we decided to provide a better experience by incorporating configurable retry logic capabilities directly in our client drivers portfolio starting with Microsoft.Data.SqlClient for .NET 5, .NET Core and full .NET Framework (although we’re planning to cover other major drivers like JDBC, ODBC and more over the next months).

Introducing Configurable Retry Logic

In essence, configurable retry logic lets developers and app managers specify, through code or app config files, how operations like opening a connection or executing a command should react when they encounter a connectivity issue manifesting itself with a certain exception number considered transient or “retriable”. In addition to error codes treated as transient (those published in this list), they can also add new error codes that are part of their business logic through the configuration file and specify regular expression patters of commands that should be considered or excluded from automatic retries, without writing a single line of code for that.

Digging down into details, this new feature is composed by three main components:

  1. A core API that developers can use when they create new data access applications, or refactor existing ones, to define their own retry logic provider with custom tailored behaviors and characteristics.
  2. A default, out of the box, retry logic provider with some pre-defined behaviors (e.g. fixed interval retry, exponential retry, etc.) that developer can reuse as-is without the need of re-implementing their owns.
  3. A configuration file schema that developers can use to enable default retry logic provider and configure what behaviors they want to associate to connection opening and command execution when a connectivity issue happens, without changing a line of code in their existing applications.

The Core API itself is designed around these three interfaces:

  1. Enumerator: generates a sequence of time intervals that depends on a given retry logic.
  2. Retry logic: retrieves the next time interval, considering max number of retries, if a transient condition happens.
  3. Retry provider: applies a given retry logic to connection and command operations.

Image Screenshot 2021 03 09 174247

As this is an extensible design, application developers with very peculiar requirements can create their own retry logic provider, containing a custom implementation of these concrete classes, where specify all the details of how operation will be retried.

For most developers though, they can just use the default base provider that offers out-of-the-box implementation of most common retry patterns like fixed, incremental and exponential retry intervals.

How to use Configurable Retry Logic?

By adding a reference to Microsoft.Data.SqlClient v3.0 Preview NuGet package in your existing project, and creating specific sections in your configuration file, you can enable Retry Logic by turning on the safety switch we have during Preview phase and define what retry method you want to use for SqlConnection and SqlCommand operations like connection opening and command execution.

Here’s an example of such a configuration file:

<?xml version="1.0" encoding="utf-8" ?>
    <section name="SqlConfigurableRetryLogicConnection"
             type="Microsoft.Data.SqlClient.SqlConfigurableRetryConnectionSection, Microsoft.Data.SqlClient"/>
    <section name="SqlConfigurableRetryLogicCommand"
             type="Microsoft.Data.SqlClient.SqlConfigurableRetryCommandSection, Microsoft.Data.SqlClient"/>

    <section name="AppContextSwitchOverrides"
             type="Microsoft.Data.SqlClient.AppContextSwitchOverridesSection, Microsoft.Data.SqlClient"/>

  <!--Enable safety switch in .NET Core-->
  <AppContextSwitchOverrides value="Switch.Microsoft.Data.SqlClient.EnableRetryLogic=true"/>

  <!--Retry method for SqlConnection-->
  <SqlConfigurableRetryLogicConnection retryMethod ="CreateFixedRetryProvider" numberOfTries ="3" deltaTime ="00:00:10" maxTime ="00:00:30"
                                    transientErrors="40615" />

  <!--Retry method for SqlCommand containing SELECT queries-->
  <SqlConfigurableRetryLogicCommand retryMethod ="CreateIncrementalRetryProvider" numberOfTries ="5" deltaTime ="00:00:10" maxTime ="00:01:10"
                                    authorizedSqlCondition="\b(SELECT)\b" transientErrors="102, 4060, 0"/>

Once .NET AppContext safety switch is on, you can define a retry logic for SqlConnection and SqlCommand independently, together with a set of options like:

  • Number of retries
  • Min/max/delta times for generating retry intervals
  • List of comma-separated error codes that you want to be treated as retriable in addition to pre-defined ones
  • A regular expression string representing authorized command texts. As an example, a string like “\b(SELECT)\b” will only retry commands containing SELECT operations while skipping everything else.

The value of retryMethod attribute determines which enumerator is used to generate time intervals:

  • CreateFixedRetryProvider: will use a class called SqlFixedIntervalEnumerator, which is generating a constant time interval defined by deltaTime (with a max overall defined by maxTime).
  • CreateIncrementalRetryProvider: will use SqlIncrementalIntervalEnumerator to generate time intervals, and will basically accumulate delays defined by deltaTime linearly (e.g. first retry 10 sec, second retry 20 sec, etc., until it reached maxTime).
  • CreateExponentialRetryProvider: uses SqlExponentialIntervalEnumerator which, as the name implies, will basically increment time delays with power of 2 ration (e.g. first retry 10 sec, second retry 30 sec, third retry 70 sec, etc.)

The approach based on configuration file is great for those who are interested in getting automatic retry logic out-of-the-box without the need for changing a single line of code in their applications: they just need to add a reference to the new SqlClient package, recompile and voilà!

If instead you’re developing a brand-new application and want to get the most of all the available retry logic behaviors and settings, you can reuse the base classes now available and configure your own retry logic based on your requirements.

Here is an example of how to configure retry logic options and create a new provider to assign to a SqlConnection instance:

    class Program
        static void Main(string[] args)
            AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.EnableRetryLogic", true);
            SqlRetryLogicOption rto = new SqlRetryLogicOption() { NumberOfTries = 3, DeltaTime = TimeSpan.Parse("00:00:10"), MaxTimeInterval = TimeSpan.Parse("00:00:30"), TransientErrors = new[] { 40615 } };
            SqlRetryLogicBaseProvider prov =  SqlConfigurableRetryFactory.CreateFixedRetryProvider(rto);
            using (SqlConnection cnn = new SqlConnection("server=.;"))
                cnn.RetryLogicProvider = prov;
                cnn.RetryLogicProvider.Retrying += new EventHandler<SqlRetryingEventArgs>(OnRetrying);


        protected static void OnRetrying(object sender, SqlRetryingEventArgs args)
              Console.WriteLine("Retrying for {0} times for {1:0.00} sec - Error code: {2}", args.RetryCount, args.Delay.TotalMilliseconds/1000.0, (args.Exceptions[0] as SqlException).Number);

When executed against an Azure SQL Database with firewall rules blocking this client machine (this is where the error code 40615 is coming from), here is the result:

Image Screenshot 2021 03 09 164550

As you can see, connection opening has been retried for 3 times before returning an exception to the application.

As a side note, you can see that despite DeltaTime property was set to 10 seconds, the real delay before retries can be slightly shorter or longer. This is because we’re introducing a small random delay (or anticipation) to make sure that, in case multiple clients are trying to connect using the same retry logic, not all of them are hitting the backend database at the same exact time, causing potential concurrency issues.

Scenarios where configurable retry logic can be a game changer

For new, cloud-native data applications using Azure SQL as backend database, having a configurable (and extensible) retry logic framework baked into client drivers like Microsoft.Data.SqlClient is critical to create a more reliable user experience and respond to unexpected connectivity issues.

For existing applications moving to the cloud, the ability to add automatic retry logic by just configuring few sections in the app config file will remove many barriers and simplify migration projects of all sizes.

One really critical scenario is where both new and existing applications want to target Azure SQL Database Serverless tier to benefit from capabilities like auto-scaling and auto-pausing to adapt to variable workloads and optimize costs. Configurable retry logic is, in fact, capable of eliminating completely the side effect of connection issues during warm up condition (when a serverless database is waking up after auto-pausing, usually between 30 and 60 seconds) by configuring proper number of retries covering that time interval. Here’s an example of how to configure retry logic to use an exponential logic that will retry that initial connection for 5 times starting from a 10 seconds interval up to 2 minutes and half to cover warm up time:

<SqlConfigurableRetryLogicConnection retryMethod ="CreateExponentialRetryProvider" numberOfTries ="5" deltaTime ="00:00:10" maxTime ="00:02:30" />

A word of caution

Connectivity issues can potentially manifest themselves in different forms. If a network disconnection happens after an application has already executed a certain business transaction which committed regularly, automatic retry logic for SqlCommand execution on the client side has no practical ways to discriminate if it must retry the last operation or not. If your application has complex transaction logic that is not idempotent (i.e. can lead to logical data corruption or duplication if re-executed blindly) out of the box retry logic may not fit well.
You can still configure retry logic to skip data manipulation commands like INSERT and UPDATE (or stored procedure calls) and only operate on read only commands to increase application resiliency to transient connectivity issues anyway, this will still help on many important use cases.

Next steps

We’re planning to make configurable retry logic generally available over the next few months, as soon as the initial validation process by internal and external customers will tell us that it is ready for the prime time.

In addition, we’re planning to keep working over the next months to introduce configurable retry logic for all other major client drivers for Azure SQL, like JDBC, ODBC and OLE DB so that all customers will benefit from more resilient data access applications.

Go ahead and give it a try, and let us know what you think on our GitHub repo!


Leave a comment