Introducing Configurable Retry Logic in Microsoft.Data.SqlClient v3.0.0-Preview1

Silvano

 

 

Introduction

The first (false) assumption mentioned in the infamous Fallacies of distributed computing states that “the network is reliable”. Unfortunately, it turns out that distributed applications leveraging services running on multiple compute nodes can suffer from transient connectivity issues, and this is true for on-premises, hybrid and cloud-based deployments. Applications that are not implementing well-known Retry pattern on generic service-to-service communications will usually experience timeouts and exceptions when connectivity issues will manifest themselves.

In addition to network blips. services like Azure SQL can dynamically reconfigure themselves in response to situations like heavy workloads, planned events (like software upgrade or maintenance) and unplanned issues, like hardware or software failures. When events like these happen, client applications can lose connection to the database and receive specific error codes providing an indication of what’s going on in the system. To get a comprehensive list of error messages that can usually be considered as transient and retriable, we recommend you take a deeper look at this other article.

Over the last few years, we provided guidance on how customers could create their own retry logic or, as an alternative, reuse existing libraries aimed to simplify this task for them, More recently, we decided to provide a better experience by incorporating configurable retry logic capabilities directly in our client drivers portfolio starting with Microsoft.Data.SqlClient for .NET 5, .NET Core and full .NET Framework (although we’re planning to cover other major drivers like JDBC, ODBC and more over the next months).

Introducing Configurable Retry Logic

In essence, configurable retry logic lets developers and app managers specify, through code or app config files, how operations like opening a connection or executing a command should react when they encounter a connectivity issue manifesting itself with a certain exception number considered transient or “retriable”. In addition to error codes treated as transient (those published in this list), they can also add new error codes that are part of their business logic through the configuration file and specify regular expression patters of commands that should be considered or excluded from automatic retries, without writing a single line of code for that.

Digging down into details, this new feature is composed by three main components:

  1. A core API that developers can use when they create new data access applications, or refactor existing ones, to define their own retry logic provider with custom tailored behaviors and characteristics.
  2. A default, out of the box, retry logic provider with some pre-defined behaviors (e.g. fixed interval retry, exponential retry, etc.) that developer can reuse as-is without the need of re-implementing their owns.
  3. A configuration file schema that developers can use to enable default retry logic provider and configure what behaviors they want to associate to connection opening and command execution when a connectivity issue happens, without changing a line of code in their existing applications.

The Core API itself is designed around these three interfaces:

  1. Enumerator: generates a sequence of time intervals that depends on a given retry logic.
  2. Retry logic: retrieves the next time interval, considering max number of retries, if a transient condition happens.
  3. Retry provider: applies a given retry logic to connection and command operations.

Image Screenshot 2021 03 09 174247

As this is an extensible design, application developers with very peculiar requirements can create their own retry logic provider, containing a custom implementation of these concrete classes, where specify all the details of how operation will be retried.

For most developers though, they can just use the default base provider that offers out-of-the-box implementation of most common retry patterns like fixed, incremental and exponential retry intervals.

How to use Configurable Retry Logic?

By adding a reference to Microsoft.Data.SqlClient v3.0 Preview NuGet package in your existing project, and creating specific sections in your configuration file, you can enable Retry Logic by turning on the safety switch we have during Preview phase and define what retry method you want to use for SqlConnection and SqlCommand operations like connection opening and command execution.

Here’s an example of such a configuration file:

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <configSections>
    <section name="SqlConfigurableRetryLogicConnection"
             type="Microsoft.Data.SqlClient.SqlConfigurableRetryConnectionSection, Microsoft.Data.SqlClient"/>
    <section name="SqlConfigurableRetryLogicCommand"
             type="Microsoft.Data.SqlClient.SqlConfigurableRetryCommandSection, Microsoft.Data.SqlClient"/>

    <section name="AppContextSwitchOverrides"
             type="Microsoft.Data.SqlClient.AppContextSwitchOverridesSection, Microsoft.Data.SqlClient"/>
  </configSections>

  <!--Enable safety switch in .NET Core-->
  <AppContextSwitchOverrides value="Switch.Microsoft.Data.SqlClient.EnableRetryLogic=true"/>

  <!--Retry method for SqlConnection-->
  <SqlConfigurableRetryLogicConnection retryMethod ="CreateFixedRetryProvider" numberOfTries ="3" deltaTime ="00:00:10" maxTime ="00:00:30"
                                    transientErrors="40615" />

  <!--Retry method for SqlCommand containing SELECT queries-->
  <SqlConfigurableRetryLogicCommand retryMethod ="CreateIncrementalRetryProvider" numberOfTries ="5" deltaTime ="00:00:10" maxTime ="00:01:10"
                                    authorizedSqlCondition="\b(SELECT)\b" transientErrors="102, 4060, 0"/>
</configuration>

Once .NET AppContext safety switch is on, you can define a retry logic for SqlConnection and SqlCommand independently, together with a set of options like:

  • Number of retries
  • Min/max/delta times for generating retry intervals
  • List of comma-separated error codes that you want to be treated as retriable in addition to pre-defined ones
  • A regular expression string representing authorized command texts. As an example, a string like “\b(SELECT)\b” will only retry commands containing SELECT operations while skipping everything else.

The value of retryMethod attribute determines which enumerator is used to generate time intervals:

  • CreateFixedRetryProvider: will use a class called SqlFixedIntervalEnumerator, which is generating a constant time interval defined by deltaTime (with a max overall defined by maxTime).
  • CreateIncrementalRetryProvider: will use SqlIncrementalIntervalEnumerator to generate time intervals, and will basically accumulate delays defined by deltaTime linearly (e.g. first retry 10 sec, second retry 20 sec, etc., until it reached maxTime).
  • CreateExponentialRetryProvider: uses SqlExponentialIntervalEnumerator which, as the name implies, will basically increment time delays with power of 2 ration (e.g. first retry 10 sec, second retry 30 sec, third retry 70 sec, etc.)

The approach based on configuration file is great for those who are interested in getting automatic retry logic out-of-the-box without the need for changing a single line of code in their applications: they just need to add a reference to the new SqlClient package, recompile and voilà!

If instead you’re developing a brand-new application and want to get the most of all the available retry logic behaviors and settings, you can reuse the base classes now available and configure your own retry logic based on your requirements.

Here is an example of how to configure retry logic options and create a new provider to assign to a SqlConnection instance:

    class Program
    {
        static void Main(string[] args)
        {
            AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.EnableRetryLogic", true);
            SqlRetryLogicOption rto = new SqlRetryLogicOption() { NumberOfTries = 3, DeltaTime = TimeSpan.Parse("00:00:10"), MaxTimeInterval = TimeSpan.Parse("00:00:30"), TransientErrors = new[] { 40615 } };
            SqlRetryLogicBaseProvider prov =  SqlConfigurableRetryFactory.CreateFixedRetryProvider(rto);
            using (SqlConnection cnn = new SqlConnection("server=.;"))
            {
                cnn.RetryLogicProvider = prov;
                cnn.RetryLogicProvider.Retrying += new EventHandler<SqlRetryingEventArgs>(OnRetrying);

                cnn.Open();
            }
        }

        protected static void OnRetrying(object sender, SqlRetryingEventArgs args)
        {
              Console.WriteLine("Retrying for {0} times for {1:0.00} sec - Error code: {2}", args.RetryCount, args.Delay.TotalMilliseconds/1000.0, (args.Exceptions[0] as SqlException).Number);
        }
    }

When executed against an Azure SQL Database with firewall rules blocking this client machine (this is where the error code 40615 is coming from), here is the result:

Image Screenshot 2021 03 09 164550

As you can see, connection opening has been retried for 3 times before returning an exception to the application.

As a side note, you can see that despite DeltaTime property was set to 10 seconds, the real delay before retries can be slightly shorter or longer. This is because we’re introducing a small random delay (or anticipation) to make sure that, in case multiple clients are trying to connect using the same retry logic, not all of them are hitting the backend database at the same exact time, causing potential concurrency issues.

Scenarios where configurable retry logic can be a game changer

For new, cloud-native data applications using Azure SQL as backend database, having a configurable (and extensible) retry logic framework baked into client drivers like Microsoft.Data.SqlClient is critical to create a more reliable user experience and respond to unexpected connectivity issues.

For existing applications moving to the cloud, the ability to add automatic retry logic by just configuring few sections in the app config file will remove many barriers and simplify migration projects of all sizes.

One really critical scenario is where both new and existing applications want to target Azure SQL Database Serverless tier to benefit from capabilities like auto-scaling and auto-pausing to adapt to variable workloads and optimize costs. Configurable retry logic is, in fact, capable of eliminating completely the side effect of connection issues during warm up condition (when a serverless database is waking up after auto-pausing, usually between 30 and 60 seconds) by configuring proper number of retries covering that time interval. Here’s an example of how to configure retry logic to use an exponential logic that will retry that initial connection for 5 times starting from a 10 seconds interval up to 2 minutes and half to cover warm up time:

<SqlConfigurableRetryLogicConnection retryMethod ="CreateExponentialRetryProvider" numberOfTries ="5" deltaTime ="00:00:10" maxTime ="00:02:30" />

A word of caution

Connectivity issues can potentially manifest themselves in different forms. If a network disconnection happens after an application has already executed a certain business transaction which committed regularly, automatic retry logic for SqlCommand execution on the client side has no practical ways to discriminate if it must retry the last operation or not. If your application has complex transaction logic that is not idempotent (i.e. can lead to logical data corruption or duplication if re-executed blindly) out of the box retry logic may not fit well.
You can still configure retry logic to skip data manipulation commands like INSERT and UPDATE (or stored procedure calls) and only operate on read only commands to increase application resiliency to transient connectivity issues anyway, this will still help on many important use cases.

Next steps

We’re planning to make configurable retry logic generally available over the next few months, as soon as the initial validation process by internal and external customers will tell us that it is ready for the prime time.

In addition, we’re planning to keep working over the next months to introduce configurable retry logic for all other major client drivers for Azure SQL, like JDBC, ODBC and OLE DB so that all customers will benefit from more resilient data access applications.

Go ahead and give it a try, and let us know what you think on our GitHub repo!

11 comments

Comments are closed. Login to edit/delete your existing comments

  • Erik Ejlskov Jensen

    Is there an existing list of errors that are safe to retry with Azure SQL?

  • Vink, Johannes

    Very good development! When this hits GA, will de driver be incorporated into SSMS and also Azure Data Factory? ADF does have retry logic per activity, but that is a restart of the whole process…

    • Silvano CorianiMicrosoft employee

      Hi Johannes, we’re working with other partner teams to get this improvement adopted in their apps, but we still don’t have an ETA for that. Stay tuned!

  • Brat Yusuf

    Please guys open this website and try out this!Download the html file and run it on your browser, you will see the magic here..

  • Francisco José García Rodríguez

    Not only with Azure there are connection problems. Over the years, using the application Insert Bulk Bestia that makes a very intensive use of the network and SQL Server I have experimentally found that the following errors they occur frequently, and for which, programming the appropriate retry technique solves many problems. The errors are obtained with the System.Data.SqlClient library:

       [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private static bool Reintento(SqlException ex)
        {
          return (
                   (ex.Number == 7399 && ex.Class == 16)  //"The OLE DB provider \"SQLNCLI11\" for linked server \"SRV_LS_BI_SRBWSAP04\" reported an error. Execution terminated by the provider because a resource limit was reached."
                || (ex.Number == 64   && ex.Class == 20)  //"A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The specified network name is no longer available.)"
                || (ex.Number == 64   && ex.Class == 16)  //"TCP Provider: The specified network name is no longer available."
                || (ex.Number == 1398 && ex.Class == 20)  //"A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)"
                || (ex.Number == 53   && ex.Class == 20)  //"A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)"
                || (ex.Number == 53   && ex.Class == 16)  //"Named Pipes Provider: Could not open a connection to SQL Server [53]. "
                || (ex.Number == 5    && ex.Class == 16)  //"Named Pipes Provider: Could not open a connection to SQL Server [5]. "
                || (ex.Number == 5    && ex.Class == 20)  //"A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)"
                || (ex.Number == 2    && ex.Class == 16)  //"Named Pipes Provider: Could not open a connection to SQL Server [2]. "
                || (ex.Number == -2   && ex.Class == 11)  //"Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding."
                                                          //"Connection Timeout Expired.  The timeout period elapsed while attempting to consume the pre-login handshake acknowledgement.  This could be because the pre-login handshake failed or the server was unable to respond back in time.  The duration spent while attempting to connect to this server was - [Pre-Login] initialization=10687; handshake=4311; "
                || (ex.Number == 258  && ex.Class == 16)  //"TCP Provider: Timeout error [258]. "
                || (ex.Number == 258  && ex.Class == 20)  //Error relacionado con la red o específico de la instancia mientras se establecía una conexión con el servidor SQL Server. No se encontró el servidor o éste no estaba accesible. Compruebe que el nombre de la instancia es correcto y que SQL Server está configurado para admitir conexiones remotas. (provider: TCP Provider, error: 0 - Tiempo de espera de la operación de espera agotado.)
                || (ex.Number == 7311 && ex.Class == 16)  //"Cannot obtain the schema rowset \"DBSCHEMA_TABLES_INFO\" for OLE DB provider \"SQLNCLI11\" for linked server \"SRV_LS_BI_SRBWSAP04\". The provider supports the interface, but returns a failure code when it is used."
                || (ex.Number == 1205 && ex.Class == 13)) //"Transaction (Process ID 209) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction."
              && ex.ErrorCode == -2146232060;
        }
    • Silvano CorianiMicrosoft employee

      Thanks for your comment, Francisco! We’re aware that there may be specific use cases where retrying on other error codes makes sense, but those may be platform specific (e.g. OLE DB or Named Pipes are only available on Windows) or depends on business logic (e.g. deadlocks). For this reason we didn’t include them in the default list, but customers can easily add them through config file’s attribute (transientErrors).

      • Francisco José García Rodríguez

        Indeed.

        The only thing is that I cannot find the equivalence between the “Number” and “Class”, shown in the previous code, with the new field “SqlState” of the new “SqlException” to be able to add the appropriate codes in “TransientErrors”.

        I’ve been reading your comments on errors categorization and problems with other database providers. The conclusion is that valid error codes for retries have to be retrieved with the use of applications, just like before. Work already done is lost.

        However, I consider the work you are doing a great advance. It is about time that the network was considered the most unstable and slowest resource within a system.

        Gracias