Improve JDBC application reliability with Azure SQL and connection pooling
A growing number of Azure SQL Database customers are developing new applications in Java, or porting existing ones, using a JDBC driver from Microsoft or a 3rd party.
While these drivers are providing quite extensive support and covering most of database service capabilities and performance expectations, there are a couple of areas that worth some special attention.
An important distinction in cloud environments compared to traditional on-premises enterprise solutions is that network interruptions between application tier and database service can happen more frequently, due for example to planned maintenance activities, unplanned failures or other network connectivity glitches. As explained in that document, these interruptions are generally self-resolved in few seconds from a service perspective, and applications with proper retry logic in place are usually not even affected.
Some client libraries like SqlClient, ODBC driver or EntityFramework are even providing some embedded connection retry capabilities, and customers can simply configure through connection strings without modifying application code. These will usually help in deal with transient issues happening when connections are sitting idle in connection pools, making sure that pure application logic will remain unchanged.
Retry logic frameworks
A more sophisticated use case is when these transient errors are happening while the application is actually executing some command against the database. For this scenario, command retry strategies that can be defined in your code to repeat failed commands after a fixed or exponential time interval, letting developers to implement proper logic to avoid duplicate executions and potential inconsistent state in the database.
As of today, unfortunately, Microsoft JDBC Driver for SQL Server doesn’t provide idle connection retry capabilities, mainly because some challenges residing in underlying Java classes responsible for managing TCP sockets are preventing a reliable and performant way of detecting network interruptions without executing a roundtrip with the server which would introduce performance overhead.
This, of course, doesn’t necessarily reduce the possibilities of creating reliable and efficient Java apps working with Azure SQL Database but, rather, that some attentions from application developers are needed to prevent potential issues in some circumstances. For example, in the Java-sphere, a number of frameworks exists that can be used to define sophisticated retry logic strategies to detect transient errors and repeat specific operations or entire code blocks. Some examples are Retry4j, FailSafe, and even Spring-retry, all quite straightforward to use.
Importance of default connection properties
Connection properties are playing an important role on how interaction between applications and backend database services will behave. These can be specified as part of the connection string, or through specific methods of DriverManager or DataSource classes in application code, and some of these properties are more relevant than others in making sure that application can manage transient conditions properly.
loginTimeout defines the number of seconds the driver should wait before timing out a failed connection. If we don’t specify it, timeout will be the default system timeout, which is 15 seconds. In most cases, this alone will cover a transient connection blip due to a reconfiguration or a failover, and the application will just experience a relatively short delay.
If interruptions are longer, in general just increasing the timeout may or may not be a good idea, and that depends a lot on specific workload characteristics. For interactive workloads, for example, usually keeping timeout relatively low, report the issue to the end user and eventually ask to retry the operation is a good approach. Even for batch approaches can be valuable to just catch the exception due to a transient issue, log it somewhere and retry, rather than just let the application hang for longer time.
queryTimeout and socketTimeout instead are more relevant for service interruptions that may happen while a command is getting executed or is returning results to the client.
Behind the scenes, when JDBC driver is executing a command and waiting for a response from the server, it uses Read() method Java’s InputStream class, and the method blocks until the requested number of bytes return. TCP is a synchronous communication scheme that requires “handshaking” from both sides of the connection as packets are exchanged (SYN and ACK packets). If the data flow is interrupted abruptly due to connectivity issues the application effectively hangs, and even if connection will get re-established, application will stay in that state until it will get restarted.
As explained in this quite extensive article on this subject, role of timeouts is to deal with statement executions that are taking longer than expected, but also to deal with so called “dead connections” to wait indefinitely and impact application availability.
This is because both timeouts by default are set to -1, which means that the application will wait forever and not react to what will happen on the service side. So, why those values are not set to meaningful numbers by default in the JDBC driver? The answer is that safe values will depend again a lot on specific workloads, so a best practice is actually that developers and system engineers will set them based on what “regular” and expected query execution times will be.
If max execution time under regular conditions is, let’s say, 2 seconds a reasonable queryTimeout could be for example 15 seconds (to make sure we cover temporarily slow query execution times without necessarily failing) whole an healthy socketTimeout could be 30 seconds so that, worst case, the application will be able to react to the exception and apply proper retry logic as previously described.
Connection pooling libraries
Connection pooling is a technique to maintain a cache of idle database connections that can be reused for future requests to mitigate the cost of opening new connections and manage resources efficiently. In client libraries like SqlClient or ODBC, connection pooling capabilities are natively implemented, while generally JDBC drivers are just implementing required interfaces for 3rd party components to provide these capabilities. Historically speaking, some of these libraries like HikariCP, C3P0, Dbcp2 or Vibur emerged as widely used within Java development community.
For the reasons mentioned above, connection pooling can play a role in overall application reliability. They all offer an implicit (e.g. supporting JDBC4 Connection.isValid() method) or explicit way of specify a test query to validate connections when application is requesting them, or to setup idle timeout to evict connections from the pool after a given timeout when they are not used.
HikariCP is battle-tested on transient connectivity errors, but it is important to re-iterate the importance of proper configuration on both connection pooling layer and on specific JDBC driver connection properties like the socketTimeout emphasized earlier. This one remain critical to signal to the connection pooling infrastructure that the underlying connection with database is not responding properly and then it should eliminated from the pool instead of remaining there forever in a stuck state.
On connection pooling specifics, in addition to obvious properties determining pool size (minumumIdle and maximumPoolSize, default 10 each but likely dependent on specific workload), there are some additional properties that should be carefully considered for application availability.
connectionTimeout controls the amount of time that the app will wait for getting a connection from the pool. Default is 30 sec, and it makes sense to keep it slightly higher than JDBC driver loginTimeout in case all connections in the pool are active and a new one needs to be created.
maxLifetime controls maximum lifetime of a connection sitting in the pool doing nothing. It is recommended to set this value slightly shorter than any other connection timeout limit.
validationTimeout basically controls how long validation test will wait before declaring a connection “dead” and evicted from the pool if backend will not respond. Default is 5 sec and should work well in most scenarios but can be decreased if you’re looking for faster reaction times in case of transient errors.
What this article and accompanied code sample are demonstrating is that, with few specific configuration options and additional code libraries, customers can implement Java applications for Azure SQL Database that are more reliable and less impacted by transient service reconfigurations and network disconnections.