July 18th, 2023

Latest NoSQL Java Ecosystem Updates 2023 Q1 & Q2

Theo van Kraay
Principal Program Manager

We’re always busy adding new features, fixes, patches, and improvements to our Java-based client libraries for Azure Cosmos DB for NoSQL. In this regular blog series, we share highlights of the most recent updates.

 

January – June 2023 updates

 

  1. Spring Data – enhanced multi-tenancy support
  2. Azure Active Directory authentication support for Spark Connector
  3. Java SDK Proactive Connection Management
  4. Hierarchical Partition Key support
  5. Retriable Writes
  6. Partition merge support
  7. End-to-end timeout policy
  8. Threshold-based availability optimization
  9. Priority-based throttling
  10. Computed Properties
  11. Request-level metrics filtering
  12. Spring Boot 3 support for Spring Data
  13. Kafka connector enhancements
  14. Session token mismatch optimization
  15. Fixes, patches, and enhancements

 

Spring Data – enhanced multi-tenancy support

In January 2023 we added enhanced support for building multi-tenant applications in Azure Cosmos DB using the Azure Cosmos DB Spring Data Client Library. Developers can now implement a database-per-tenant or container-per-tenant performance isolation model that is seamlessly integrated into the Spring Data Repository abstraction layer. Check out this blog for more details!

 

Azure Active Directory authentication support for Spark Connector

Azure Active Directory support was added in February 2023 for the OLTP Spark connector. Check out the documentation here, and the fully worked sample here!

 

Java SDK Proactive Connection Management

In February 2023 proactive connection management was added to the Java SDK. Developers can now warm up connections and caches for containers in both the current read region and a pre-defined number of preferred remote regions, as opposed to just the current read region. Scenarios where this helps include:

  • Improving tail latency in cross-region failover scenarios (for example with speculative processing).
  • Reducing overall latency for writes in multi-region scenarios where only a single write region is configured.

Define the number of regions for which you want connections to be warmed up, and the SDK will use the list of preferred regions configured through CosmosClientBuilder. A duration can also be specified within which connections are established aggressively in a blocking manner. Once this duration elapses, connections are established defensively but in a non-blocking manner.

// containers to which connections are to be proactively opened
CosmosContainerIdentity containerIdentity1 = new CosmosContainerIdentity("sample_db_id", "sample_container_id_1");
CosmosContainerIdentity containerIdentity2 = new CosmosContainerIdentity("sample_db_id", "sample_container_id_2");

// no. of regions to which connections are to be proactively opened
int proactiveConnectionRegionsCount = 1;

// duration for which connections are to be aggressively opened in a blocking manner
// beyond this duration connections will be defensively opened in a non-blocking manner
Duration aggressiveWarmupDuration = Duration.ofSeconds(1);

// building the client along with opening connections
CosmosAsyncClient clientWithOpenConnections = new CosmosClientBuilder()
          .endpoint("<account URL goes here>")
          .key("<account key goes here>")
          .endpointDiscoveryEnabled(true)
          .preferredRegions(Arrays.asList("sample_region_1", "sample_region_2"))
          .openConnectionsAndInitCaches(new CosmosContainerProactiveInitConfigBuilder(Arrays.asList(containerIdentity1, containerIdentity2))
                .setProactiveConnectionRegionsCount(proactiveConnectionRegionsCount)
                .setAggressiveWarmupDuration(aggressiveWarmupDuration)
                .build())
          .directMode()
          .buildAsyncClient();

 

 

Hierarchical Partition Key support

Support for hierarchical partition keys in Azure Cosmos DB was made GA for the Java SDK in March 2023. Check out code examples here! Note: this is currently available only for the core Java SDK, and is not yet supported for the other connectors at the time of writing.

 

Retriable Writes

We’ve added support for retrying writes even when they are not guaranteed to be idempotent. Previously, the SDK would only retry a write operation when the failure happened before the request was actually written to the network, or when the error code from the service guaranteed that the service never processed the request. Developers can now opt in to retrying writes at the request level, even when these conditions are not met. See here for more details.

String pkValue = "myPKValue"; // whatever the logical partition key value is      
boolean ENABLE_RETRIES = true;
boolean USE_TRACKING_ID = true;
CosmosItemRequestOptions optionsWithRetry = new CosmosItemRequestOptions()
    .setNonIdempotentWriteRetryPolicy(ENABLE_RETRIES, USE_TRACKING_ID);
asyncContainer.createItem(item, new PartitionKey(pkValue), optionsWithRetry).block();

 

Partition merge support

Merging partitions in Azure Cosmos DB is now fully supported for the Java SDK, Spark Connector, and Spring Data Client Library!

 

End-to-end timeout policy

We observed that many customers want to set aggressive end-to-end timeouts. Because requests were not being canceled when those timeouts were reached, this ultimately impacted tail latency and effectively caused availability to drop significantly.

The Java SDK now supports an end-to-end timeout policy, allowing developers to provide a timeout value that covers the whole execution of any request, including requests that span multiple partitions. Ongoing requests will be canceled if the timeout setting is reached.

This helps to optimize availability while still honoring aggressive end-to-end timeouts. Tail latency is reduced by failing faster, and request units and client-side compute costs are reduced by stopping retries after the timeout.

The timeout duration can be set on CosmosItemRequestOptions. The options can then be passed to any request sent to Azure Cosmos DB.

CosmosEndToEndOperationLatencyPolicyConfig endToEndOperationLatencyPolicyConfig = new CosmosEndToEndOperationLatencyPolicyConfigBuilder(Duration.ofSeconds(1)).build();
CosmosItemRequestOptions options = new CosmosItemRequestOptions();
options.setCosmosEndToEndOperationLatencyPolicyConfig(endToEndOperationLatencyPolicyConfig);
container.readItem("id", new PartitionKey("pk"), options, TestObject.class);
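
To make the idea of a single budget covering the whole operation more concrete, here is a minimal plain-JDK simulation. It does not call the SDK and is not how the SDK is implemented; the names EndToEndBudgetSketch, Attempt, Outcome, and run are hypothetical and exist only for this illustration. It assumes that attempts run one after another, that an attempt still in flight when the budget runs out is canceled, and that no further retries are started after that point.

```java
import java.util.List;

public class EndToEndBudgetSketch {

    // One simulated attempt: how long it takes and whether it succeeds (hypothetical type).
    public static final class Attempt {
        public final long latencyMillis;
        public final boolean succeeds;

        public Attempt(long latencyMillis, boolean succeeds) {
            this.latencyMillis = latencyMillis;
            this.succeeds = succeeds;
        }
    }

    // Result of the whole operation, across all attempts (hypothetical type).
    public static final class Outcome {
        public final boolean succeeded;
        public final int attemptsStarted;
        public final long elapsedMillis;

        public Outcome(boolean succeeded, int attemptsStarted, long elapsedMillis) {
            this.succeeded = succeeded;
            this.attemptsStarted = attemptsStarted;
            this.elapsedMillis = elapsedMillis;
        }
    }

    // Runs the attempts in order under one time budget shared by all of them.
    public static Outcome run(long budgetMillis, List<Attempt> attempts) {
        long elapsed = 0;
        int started = 0;
        for (Attempt attempt : attempts) {
            if (elapsed >= budgetMillis) {
                break; // budget already used up: do not start another retry
            }
            started++;
            long remaining = budgetMillis - elapsed;
            if (attempt.latencyMillis > remaining) {
                // The in-flight attempt is canceled when the budget runs out.
                return new Outcome(false, started, budgetMillis);
            }
            elapsed += attempt.latencyMillis;
            if (attempt.succeeds) {
                return new Outcome(true, started, elapsed);
            }
        }
        return new Outcome(false, started, elapsed);
    }

    public static void main(String[] args) {
        // Two transient failures followed by a success, 400 ms each.
        List<Attempt> attempts = List.of(
                new Attempt(400, false), new Attempt(400, false), new Attempt(400, true));

        Outcome tight = run(1000, attempts);
        System.out.println("1 s budget: succeeded=" + tight.succeeded
                + ", attempts=" + tight.attemptsStarted + ", elapsed=" + tight.elapsedMillis + " ms");

        Outcome relaxed = run(3000, attempts);
        System.out.println("3 s budget: succeeded=" + relaxed.succeeded
                + ", attempts=" + relaxed.attemptsStarted + ", elapsed=" + relaxed.elapsedMillis + " ms");
    }
}
```

With a 1 second budget the third attempt is cut off at the 1000 ms mark, so the caller fails fast instead of waiting 1200 ms. With a 3 second budget the same sequence succeeds on the third attempt.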

 

Threshold-based availability optimization

A new threshold-based parallel processing capability in the Java SDK has also been added, which can be activated when creating an end-to-end timeout policy as above, to even further improve tail latency and availability. When enabled, parallel executions of the same request (read requests only) will be sent to secondary regions, where the request that responds fastest is the one that is accepted.

How parallel execution works

The object ThresholdBasedAvailabilityStrategy takes two parameters. The first is the threshold, and the second is the threshold step.

Assume you have three regions set as `preferredRegions` in CosmosClientBuilder, in the following order:  East US, East US 2, West US.

Assume the following values for speculation threshold and threshold step:

int threshold = 500;
int thresholdStep = 100;

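  1. At time T1, a request to East US is made.
  2. If there is no response within 500 ms, a request to East US 2 is made.
  3. If there is still no response after 500 + 100 = 600 ms, a request to West US is made.

These values are passed to ThresholdBasedAvailabilityStrategy when building the end-to-end timeout policy:
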
int threshold = 500;
int thresholdStep = 100;
CosmosEndToEndOperationLatencyPolicyConfig config = new CosmosEndToEndOperationLatencyPolicyConfigBuilder(Duration.ofSeconds(3))
        .availabilityStrategy(new ThresholdBasedAvailabilityStrategy(Duration.ofMillis(threshold), Duration.ofMillis(thresholdStep)))
        .build();
CosmosItemRequestOptions options = new CosmosItemRequestOptions();
options.setCosmosEndToEndOperationLatencyPolicyConfig(config);
container.readItem("id", new PartitionKey("pk"), options, JsonNode.class).block();
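
The timing in the numbered steps above can be worked through with a small plain-JDK sketch. This is only an illustration of the arithmetic, not SDK code: HedgingSchedule and sendOffsets are hypothetical names, and extending the pattern beyond three regions (each further region one more threshold step later) is an assumption based on the example rather than something stated here.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class HedgingSchedule {

    // For each preferred region, in order, returns how long after the initial
    // request (T1) a request to that region would be sent if no response has
    // arrived yet. Illustrative model only.
    public static List<Duration> sendOffsets(Duration threshold, Duration thresholdStep, int regionCount) {
        List<Duration> offsets = new ArrayList<>();
        for (int i = 0; i < regionCount; i++) {
            if (i == 0) {
                // First preferred region: request is sent immediately at T1.
                offsets.add(Duration.ZERO);
            } else {
                // i-th additional region: threshold + (i - 1) * thresholdStep.
                offsets.add(threshold.plus(thresholdStep.multipliedBy(i - 1)));
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        List<String> regions = List.of("East US", "East US 2", "West US");
        List<Duration> offsets = sendOffsets(Duration.ofMillis(500), Duration.ofMillis(100), regions.size());
        for (int i = 0; i < regions.size(); i++) {
            System.out.println(regions.get(i) + ": T1 + " + offsets.get(i).toMillis() + " ms");
        }
    }
}
```

For the values used in this section, the sketch prints offsets of 0 ms, 500 ms, and 600 ms for East US, East US 2, and West US respectively.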

Note that when cross-region requests are sent, this can incur additional cost in terms of request units. This is a trade-off you would choose in order to drastically reduce, and in some cases eliminate, tail latency.

 

Priority-based throttling

We have introduced priority-based throttling to the Java SDK, allowing developers to set a priority on their requests. When throttling is necessary, low-priority requests are throttled first, which helps high-priority requests avoid being throttled. In the Java SDK, priority-based throttling has been exposed as a property of throughput control groups. You can read more about priority-based throttling in our blog here, and find a demo sample of priority-based throttling here. Priority-based throttling can also be enabled in the Spark Connector by setting spark.cosmos.throughputControl.priorityLevel to low or high.

 

Computed Properties

Support for computed properties is now available in the Java SDK. See here for code examples.

 

Request-level metrics filtering

In the last period, we released support for emitting metrics from the Azure Cosmos DB Java SDK via a Micrometer MeterRegistry. In recent months, we added support for filtering metrics based on certain thresholds. This reduces the overhead significantly, as request-level metrics will be sampled and only emitted when operations violate the expected thresholds (latency, Request Unit charge, etc.). You can also now apply sampling for diagnostic capturing in the Azure Cosmos DB SDK using sampleDiagnostics, to help further tune client-side resource consumption related to metrics. The sampling rate can be modified after the Azure Cosmos DB client has been initialized, so no restart is needed to change it.

PrometheusMeterRegistry prometheusRegistry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

//provide the prometheus registry to the telemetry config
CosmosClientTelemetryConfig telemetryConfig = new CosmosClientTelemetryConfig()
        .diagnosticsThresholds(
                new CosmosDiagnosticsThresholds()
                        .setRequestChargeThreshold(10)
                        .setNonPointOperationLatencyThreshold(Duration.ofDays(10))
                        .setPointOperationLatencyThreshold(Duration.ofDays(10))
        )
        .sampleDiagnostics(0.25)
        .clientCorrelationId("samplePrometheusMetrics001")
        .metricsOptions(new CosmosMicrometerMetricsOptions().meterRegistry(prometheusRegistry)
                //.configureDefaultTagNames(CosmosMetricTagName.PARTITION_KEY_RANGE_ID)
                .applyDiagnosticThresholdsForTransportLevelMeters(true)
        );

Spring Boot 3 support for Spring Data

Support for Spring Boot 3 has been added to the latest version of the Azure Cosmos DB Spring Data Client Library.

 

Kafka connector enhancements

We’ve added two configurable enhancements to Kafka Connect for Azure Cosmos DB:

  1. When using the source connector, determine whether to update the lease container continuation token based on Kafka offset using config connect.cosmos.offset.useLatest. This resolves issues that may occur depending on whether a lease container is, or is not, initialized. By default, this is set to false. See the change log for more information.
  2. When using the sink connector, determine whether to enable “compression” to remove duplicate records in a single batch, using the config connect.cosmos.sink.bulk.compression.enabled. When set to true, this ensures that when there are multiple records arriving from Kafka which have the same id value in a given batch, only the record with the latest timestamp is kept. By default, this is set to true.
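
The sink-side “compression” described in the second item can be pictured with a short plain-JDK sketch. This is not the connector's code: BatchCompressionSketch, SinkItem, and compress are hypothetical names for this illustration, the timestamp is modeled as a simple long, and letting the later record in the batch win when two timestamps are equal is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BatchCompressionSketch {

    // Hypothetical stand-in for a record arriving from Kafka.
    public static final class SinkItem {
        public final String id;
        public final long timestamp;
        public final String payload;

        public SinkItem(String id, long timestamp, String payload) {
            this.id = id;
            this.timestamp = timestamp;
            this.payload = payload;
        }
    }

    // Keeps, for each id in the batch, only the record with the latest timestamp.
    // Ids keep the order in which they first appeared in the batch.
    public static List<SinkItem> compress(List<SinkItem> batch) {
        Map<String, SinkItem> latestById = new LinkedHashMap<>();
        for (SinkItem item : batch) {
            latestById.merge(item.id, item,
                    (existing, incoming) -> incoming.timestamp >= existing.timestamp ? incoming : existing);
        }
        return new ArrayList<>(latestById.values());
    }

    public static void main(String[] args) {
        List<SinkItem> batch = List.of(
                new SinkItem("a", 100L, "v1"),
                new SinkItem("b", 105L, "v1"),
                new SinkItem("a", 200L, "v2"),
                new SinkItem("a", 150L, "stale"));

        for (SinkItem item : compress(batch)) {
            System.out.println(item.id + " @ " + item.timestamp + " -> " + item.payload);
        }
    }
}
```

In this batch, three records share the id "a"; only the one with timestamp 200 survives, alongside the single record for "b".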

 

Session token mismatch optimization

One of the consistency levels that Azure Cosmos DB provides is session consistency. With this setting, reads are guaranteed to honor the read-your-writes and write-follows-reads guarantees. After every write operation, the client receives an updated Session Token from the server. The client caches the tokens and sends them to the server for read operations in a specified region. However, there are some scenarios where 404 / 1002 (NOT_FOUND / READ_SESSION_NOT_AVAILABLE) errors are returned by the service. This can occur when:

  • The session token in the request has a higher GlobalLSN value or a higher LocalLSN value (specific to multi-write accounts) than that of the session token in the response. This points to either a lagging region or a lagging replica.
  • There is a loss of quorum among replicas due to a network partition between them.

By default, the SDK will retry all replicas in the local region in order to obtain a valid session token. However, this might incur unnecessary tail latency in a multi-region deployment if a valid session token can be obtained from a remote region. This change allows application developers to configure hints through a SessionRetryOptions instance, which signals to the SDK whether to pin retries on the local region or move more quickly to a remote region, especially when READ_SESSION_NOT_AVAILABLE errors are thrown.

First, build the SessionRetryOptions instance:

// Option 1: prioritize retries in the local region
SessionRetryOptions sessionRetryOptions = new SessionRetryOptionsBuilder()
                .setRegionSwitchHint(CosmosRegionSwitchHint.LOCAL_REGION_PREFERRED)
                .build();
// Option 2: prioritize retries in a remote region
// (use this declaration instead of the one above, not in addition to it)
SessionRetryOptions sessionRetryOptions = new SessionRetryOptionsBuilder()
                .setRegionSwitchHint(CosmosRegionSwitchHint.REMOTE_REGION_PREFERRED)
                .build();

Set the built SessionRetryOptions instance on the CosmosClientBuilder instance:

CosmosAsyncClient clientWithPreferredRegions = new CosmosClientBuilder()
    .endpoint("<account URL goes here>")
    .key("<account key goes here>")
    .sessionRetryOptions(sessionRetryOptions)
    .directMode()
    .buildAsyncClient();

To control the number of retries in the local region when REMOTE_REGION_PREFERRED is set as the region switch hint, set the following JVM system property:

int maxRetriesInLocalRegion = 5;
System.setProperty("COSMOS.MAX_RETRIES_IN_LOCAL_REGION_WHEN_REMOTE_REGION_PREFERRED", String.valueOf(maxRetriesInLocalRegion));

Using this hint can also improve CPU utilization, since the SDK is not making as many retry requests to different replicas in a single region. Read availability also improves, since the SDK leverages more of the regions that the customer has configured for their account.

 

Fixes, patches, and enhancements

In addition to all of the above features, we have of course made a large number of smaller bug fixes, security patches, enhancements, and improvements. You can track all the changes for each client library, along with the minimum version we recommend you use, by viewing the change logs:

 

Get Started with Java in Azure Cosmos DB

About Azure Cosmos DB

Azure Cosmos DB is a fully managed and serverless distributed database for modern app development, with SLA-backed speed and availability, automatic and instant scalability, and support for open source PostgreSQL, MongoDB and Apache Cassandra. Try Azure Cosmos DB for free here. To stay in the loop on Azure Cosmos DB updates, follow us on Twitter, YouTube, and LinkedIn.

To easily build your first database, watch our Get Started videos on YouTube and explore ways to dev/test free.

Author

Theo van Kraay
Principal Program Manager

Principal Program Manager on the Azure Cosmos DB engineering team. Focused on Apache Cassandra offerings, Java ecosystem, high availability, and customer success.
