Diagnosing & Troubleshooting 4 Common Client-side Issues in Azure Cosmos DB for NoSQL with Java SDK V4

Subhasish Ghosh

Introduction

Knowing how to effectively troubleshoot Azure Cosmos DB for NoSQL client-side performance issues is critical for using the service. This blog post will show you how you can leverage the Diagnostics property in the Azure Cosmos DB for NoSQL Java SDK V4 so you can design and operate highly available and scalable mission-critical applications using Azure Cosmos DB.

What are client diagnostics?

Azure Cosmos DB for NoSQL Java SDK V4 enables you to extract and represent information related to database, container, item, and query operations using the Diagnostics property in the response object of a request. This property contains information such as CPU and memory levels on the compute host for the SDK instance as well as service-side metrics covering nearly every aspect of the execution of your request.

The Diagnostics output is returned as a JSON string. With a little formatting to make it readable, the information it contains can help you solve most of the client-side issues plaguing your application. One term you will see when using client diagnostics is rntbd. This is Azure Cosmos DB's customized TCP transport layer stack, which handles the life cycle of a CRUD request end-to-end.


What kinds of issues can I solve with them?

Let’s explore how client diagnostics can help diagnose and solve the following four issues:

Scenario #1
  User-end experience: High-latency reads (not query).
  Core issue: Rate-limiting (429s) by the Azure Cosmos DB service.
  HTTP status code observed: 429

Scenario #2
  User-end experience: High-latency writes with high CPU load on the SDK client.
  Core issue: Create requests at high write volumes are either very slow, or a high percentage of create requests time out.
  HTTP status code observed: 408

Scenario #3
  User-end experience: In a high-volume burst scenario, point-reads breach a 10 ms read latency barrier at P99. The number of physical partitions and replicas for the container ranges from mid-range to large.
  Core issue: Point-reads exhibiting high latency in a high-volume, sudden burst scenario.
  HTTP status code observed: No error code, just very high latency.

Scenario #4
  User-end experience: In a very high-volume burst scenario, in-partition queries breach a 15 ms read latency barrier at P99. The number of physical partitions and replicas for the container is very high.
  Core issue: In-partition queries exhibiting high latency in a high-volume, sudden burst scenario.
  HTTP status code observed: No error code, just very high latency.

 

Helpful diagnostics values

The amount of data that comes back from the diagnostics in the Java SDK is extensive and a bit overwhelming. It can be hard to know what to look for to diagnose issues. In this section, we cover a few key diagnostics values. We’ll cover a few more important ones later in this blog post when we look at some specific issues that are common for customers to run into.

userAgent This is the name of the client instance running on your machine. A few basic components are always injected into this string:

  • “azsdk-java-cosmos” is the name of the package that has been pulled from Maven.
  • “4.38.0” is the version of the Java SDK that you are using.
  • Either a Linux version OR Windows version and JRE version that you have running on the machine.

You can customize the userAgent string when you initialize the Java SDK and inject your own parameters. For instance, say you have a client instance running on a VM in each of three separate regions; you could append something like VM-SCUS, VM-EastUS2, and so on. This can help you diagnose issues with a specific client instance or a specific region, such as an increase in cross-region calls. When multiple instances are in use, the userAgent makes it possible to deep-dive many instance-specific issues.
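This customization is done through the userAgentSuffix option on the client builder. Below is a minimal sketch reusing the AccountSettings names from the samples later in this post; "VM-SCUS" is a placeholder for whatever tag identifies your instance:

```java
// Tag this client instance with its region so the userAgent in the
// diagnostics output identifies which VM/region a request came from.
CosmosAsyncClient client = new CosmosClientBuilder()
        .endpoint(AccountSettings.HOST)
        .key(AccountSettings.MASTER_KEY)
        .userAgentSuffix("VM-SCUS")   // placeholder tag, e.g. region + VM name
        .buildAsyncClient();
```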

requestLatencyInMs This property shows how long the operation took to complete in milliseconds. A good place to start for any troubleshooting.
responseStatisticsList This is the actual meat of the JSON object. It captures everything the Azure Cosmos DB service returned for the request.
lsn

globalCommittedLsn

itemLSN

sessionToken

LSN stands for logical sequence number. It represents the transaction ID for a create operation. LSNs play a vital role in Azure Cosmos DB since they are used to coordinate transactions and data across regions (in specific global replication scenarios). LSNs are also used to coordinate Change Feed requests if you are using the Azure Cosmos DB Change Feed feature.

Azure Cosmos DB supports two types of session tokens:

  1. Simple session token which has format, {pkrangeid}:{globalLSN}
  2. Vector session token which has format, {pkrangeid}:{Version}#{GlobalLSN}#{RegionId1}={LocalLsn1}#{RegionId2}={LocalLsn2}….#{RegionIdN}={LocalLsnN}

You can read Convert session token formats in .NET SDK for further info.

partitionKeyRangeId The partitionKeyRangeId identifies the physical partition on which the request was sent.

This is helpful in scenarios where you are repeatedly seeing high latency and a specific “partitionKeyRangeId” keeps coming up. That could point to a problem on the Azure Cosmos DB service end.

But, on the other hand, if many different “partitionKeyRangeId” values show up in your logs for high-latency scenarios, and there is no single partition where you are seeing issues, then the odds are it's something on the client side that needs to be investigated. You can also use this information to observe whether you're running a cross-partition query and to diagnose the 429 errors in Scenario #1, discussed below.
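As a rough illustration of that triage step, you could tally the partitionKeyRangeId values across the diagnostics strings you have logged for slow requests. This is a plain-Java sketch (the regex assumes the property appears exactly as in the samples later in this post), not an SDK feature:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HotPartitionCheck {
    // Counts how often each partitionKeyRangeId appears in a batch of
    // logged diagnostics strings. One dominant id suggests a hot partition
    // or a service-side issue; an even spread points at the client side.
    static Map<String, Integer> countPkRangeIds(List<String> diagnosticsStrings) {
        Map<String, Integer> counts = new HashMap<>();
        Pattern p = Pattern.compile("\"partitionKeyRangeId\":\"(\\d+)\"");
        for (String d : diagnosticsStrings) {
            Matcher m = p.matcher(d);
            while (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> slowRequestLogs = List.of(
                "{\"partitionKeyRangeId\":\"0\",\"statusCode\":200}",
                "{\"partitionKeyRangeId\":\"0\",\"statusCode\":200}",
                "{\"partitionKeyRangeId\":\"3\",\"statusCode\":200}");
        System.out.println(countPkRangeIds(slowRequestLogs)); // e.g. {0=2, 3=1}
    }
}
```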

statusCode

subStatusCode

The “statusCode”:201, “subStatusCode”:0 section is vital since it tells us whether the request succeeded or not.

  • HTTP status code 200: the operation succeeded (OK).
  • HTTP status code 201: a create operation succeeded (Created).
  • If you observe any of the 400s, take a closer look at what the error is. There are a few exceptions, though: e.g., 429 is technically not an error but a rate-limiting response.

You can read HTTP Status Codes for Azure Cosmos DB for further info.
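The rules above can be sketched as a small helper; the category labels are our own shorthand for this post, not part of the SDK:

```java
public class StatusClassifier {
    // Rough classification of Azure Cosmos DB HTTP status codes,
    // following the rules described above.
    static String classify(int statusCode) {
        if (statusCode == 200 || statusCode == 201) {
            return "success";       // OK / Created
        }
        if (statusCode == 429) {
            return "rate-limited";  // technically not an error
        }
        if (statusCode >= 400) {
            return "error";         // take a closer look at the sub-status
        }
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(classify(201)); // success
        System.out.println(classify(429)); // rate-limited
    }
}
```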

requestCharge This is the same RequestCharge property that is returned in the response object in a request.
requestResourceType:Document

requestOperationType:Create

These values tell you the type of operation, such as Create, Read, Query, etc. as well as the resource type the operation is on. In this case, an Azure Cosmos DB document.
regionsContacted:[ "south india" ]

This section shows the region the request was sent to. This can be useful when latency is high, to see whether a request was retried in a region other than the primary region in the PreferredLocations property. In situations such as those, something may be wrong in that region.
retryContext This section is vital if your requests are being rate-limited by the Azure Cosmos DB service. In that case, the diagnostics contain an array of 429 status and sub-status codes, one entry per retry, followed by a total retry count and the total retry latency in milliseconds.

 

Issues: Diagnosing & Troubleshooting

In this section, we will delve deeper into some issues and understand how to diagnose them using the Diagnostics JSON object output and resolution steps.

Scenario #1: High latency reads (not query)

Read requests throttled or rate-limited by Azure Cosmos DB service.

In this scenario, end-users are seeing high latency during reads above what they normally have encountered. Note that in this scenario the value for requestOperationType is set to Read.

Diagnostics section to review:

Look for the retryContext section of the Diagnostics JSON object. This section is vital if your requests are being rate-limited by the Azure Cosmos DB service. When requests are rate limited, the service will return an HTTP Status code of 429. This is what you would typically see in this section:

"retryContext":{
    "statusAndSubStatusCodes":[
        [429, 3200],
        [429, 3200],
        [429, 3200],
        [429, 3200]
    ],
    "retryCount":4,
    "retryLatency":1347
},

Here you can see four retries for the request, all with status 429. There is also a second sub-status code returned, which provides more precise information on why the request was rate-limited. For a 429, here is the list of sub-status codes:

// 429: Request Rate Too Large
RUBudgetExceeded = 3200,            // RUs allocated has been consumed.
GatewayThrottled = 3201,            // Gateway throttled volume of requests.
StoredProcedureConcurrency = 3084   // Concurrency limit in stored procedure reached.

The [429, 3200] entry pinpoints that we have exceeded the total number of Request Units (RUs) allocated.
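The sub-status list above translates directly into a lookup helper; a small sketch (the descriptions are paraphrased from the code comments above):

```java
public class ThrottleSubStatus {
    // Maps the 429 sub-status codes listed above to a short description.
    static String describe429(int subStatusCode) {
        switch (subStatusCode) {
            case 3200: return "RU budget exceeded";
            case 3201: return "Gateway throttled";
            case 3084: return "Stored procedure concurrency limit reached";
            default:   return "Unknown 429 sub-status: " + subStatusCode;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe429(3200)); // RU budget exceeded
    }
}
```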

Resolution:

There are a number of paths to take here. You could increase the provisioned throughput on the container, or switch to autoscale throughput; either should resolve this issue. If this were a query operation facing 429 errors, you should investigate further by looking at the partition key ranges. If you find that a large number of physical partitions are being accessed, the query is likely fanning out; consider creating a second collection with a partition key better suited to that query, or using the Azure Cosmos DB integrated cache. If there are no design-related issues such as hot partitions, investigate whether the query itself is poorly designed; failing that, increase the provisioned throughput or use autoscale throughput.
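If you opt for autoscale, the switch can also be made from the SDK. A sketch, assuming you already hold a container reference; the 4,000 RU/s maximum is an example value, not a recommendation:

```java
// Replace the container's throughput with autoscale throughput that can
// scale up to a maximum of 4000 RU/s during bursts (example value).
ThroughputProperties autoscale =
        ThroughputProperties.createAutoscaledThroughput(4000);
container.replaceThroughput(autoscale).block(); // block() for the async container
```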

Scenario #2: High latency writes with high CPU load

In this scenario users are experiencing higher than normal write latency during periods of high write volumes with a high percentage of create item requests getting HTTP Status code 408 (time out). This is typically observed with an accompanying high client-side CPU load.

Diagnostics section to review:

Start by reviewing the serializationDiagnosticsList section of the Diagnostics JSON object. This section tells us how much time was spent on the client side performing serialization operations. If you are performing a write request, you will see this section populated as exhibited below:

"serializationDiagnosticsList":[
     {
         "serializationType":"ITEM_SERIALIZATION",
         "startTimeUTC":"2022-10-18T09:13:55.527664800Z",
         "endTimeUTC":"2022-10-18T09:13:55.537624400Z",
         "durationInMilliSecs":9.9596
     }
 ]

The serializationDiagnosticsList section is vital in identifying client-side latencies related to issues in serialization and/or deserialization by looking at the durationInMillisecs value.

Next, find the systemCpuLoad property in the systemInformation section. Here you may see high values as exhibited below, e.g.:

"systemCpuLoad":"
    (2022-05-02T07:27:59.495855600Z 95.0%),
    (2022-05-02T07:28:04.489517300Z 99.0%),
    (2022-05-02T07:28:09.495177300Z 99.0%),
    (2022-05-02T07:28:14.495187100Z 99.0%)"

This represents the CPU load on your client machine, sampled every 5 seconds over a rolling 30-second window. A sustained CPU load above 70% can lead to resource starvation on the client machine, which in turn causes a large volume of requests to fail with HTTP status code 408 (timeout).
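To flag this condition programmatically, you could pull the peak percentage out of the systemCpuLoad string. A plain-Java sketch, assuming the "(timestamp percent%)" layout shown above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CpuLoadCheck {
    // Extracts the peak CPU percentage from a systemCpuLoad diagnostics
    // string of the form "(2022-05-02T07:27:59Z 95.0%), (... 99.0%)".
    static double peakCpu(String systemCpuLoad) {
        Matcher m = Pattern.compile("(\\d+(?:\\.\\d+)?)%").matcher(systemCpuLoad);
        double max = 0.0;
        while (m.find()) {
            max = Math.max(max, Double.parseDouble(m.group(1)));
        }
        return max;
    }

    public static void main(String[] args) {
        String load = "(2022-05-02T07:27:59.495855600Z 95.0%), "
                    + "(2022-05-02T07:28:04.489517300Z 99.0%)";
        System.out.println(peakCpu(load)); // 99.0
    }
}
```

A monitoring hook could log the full diagnostics string only when this peak crosses the 70% threshold discussed above, keeping log volume down.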

Resolution:

There are two ways to deal with high serialization times and high CPU load.

  1. Document size. Avoid overly large document sizes. A document size of 1-5 KB is ideal. Larger document sizes (e.g., 100KB, 500KB and so on), require much more time serializing that object into JSON, then pushing it across the wire. A data modeling best practice for a high-performance NoSQL database like Azure Cosmos DB is to always minimize document sizes by stripping away properties which are not required for a specific operation.
  2. Document depth. Documents with many deeply nested components require more CPU cycles during serialization and deserialization, which can result in greater CPU load. You can optimize this by flattening document models with overly complex, deep hierarchies into a flatter design. Typically, a nesting depth of 3 or less is ideal.
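As a quick way to gauge point 2, you can estimate a document's nesting depth from its JSON text. A rough sketch that counts brace/bracket nesting; a production check should use a real JSON parser:

```java
public class DocDepth {
    // Rough nesting depth of a JSON document, counting '{' and '[' nesting.
    // Skips characters inside string literals. Depth of 3 or less is the
    // target discussed above.
    static int depth(String json) {
        int depth = 0, max = 0;
        boolean inString = false;
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (c == '"' && (i == 0 || json.charAt(i - 1) != '\\')) {
                inString = !inString;
            }
            if (inString) continue;
            if (c == '{' || c == '[') max = Math.max(max, ++depth);
            if (c == '}' || c == ']') depth--;
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println(depth("{\"a\":{\"b\":{\"c\":1}}}")); // 3
    }
}
```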

Scenario #3: Point-reads exhibiting high latency in a high-volume, sudden burst scenario.

In this scenario you are seeing point-reads with exceptionally high latency, greater than 10 ms at the 99th percentile, in a high-volume, sudden burst scenario. The number of physical partitions for the container ranges from medium (hundreds) to large (thousands). Typically there is no specific error code, just higher-than-normal latency.

Diagnostics section to review:

First, check the event durations in the rntbd transport request timeline in the JSON object. If none of the sections exhibit anomalous durations, focus next on the networking parameters the client application is using. The section below shows the default values set by the Azure Cosmos DB Java SDK V4:

"connCfg":{
         "rntbd":"(cto:PT5S, nrto:PT5S, icto:PT0S, ieto:PT1H, mcpe:130, mrpc:30, cer:true)",
         ………………………………
         ………………………………
      },

In the above,

  • “cto” stands for the ConnectionTimeout duration on the rntbd TCP stack. If you are facing connectivity issues, this is how long the stack waits before retrying that specific request. By default, cto is set to 5 seconds by the Java SDK. For specific use-cases you can choose to customize it. E.g., if you are running an eCommerce site expecting a high volume of peak requests during a Black Friday event, you can customize the “cto” value based on the query patterns in your code. Here are some example values we’ve observed customers using:
    1. Read heavy app primarily doing point-reads: 600 ms
    2. Read heavy app primarily doing in-partition queries: 600 ms
    3. An app primarily executing stored procedures: 1 sec
    4. Write heavy app: 600 ms
    5. An app which does cross-partition queries: 5 sec
    6. An app which primarily does Cross Region calls: 1 Sec
  • “nrto” stands for the NetworkRequestTimeout duration. You are instructing the TCP stack to time out a request if it does not complete within this duration. By default, nrto is set to 5 seconds. For specific use-cases you can choose to customize it. For example, suppose you have a 10-minute window in which your Azure Cosmos DB container must handle a very high volume of very low-latency CRUD requests. Because the volume is very high and the window very short, you would want requests to complete within, say, 100 milliseconds rather than 5 seconds; requests that hit the 100-millisecond cut-off are then failed or retried.
  • “icto” stands for the IdleConnectionTimeout duration for a single connection. If a channel (socket) has been created to an endpoint and nothing is happening on that channel, you can choose to terminate it; the channel is re-created when required. By default, icto is set to 0 seconds. If you have a use-case wherein you wish to retain a channel to a replica endpoint, so that the underlying setup work (e.g., SSL handshakes) is not redone at initialization time, you may wish to customize this value and set it to a value greater than zero.
  • “ieto” stands for the IdleEndpointTimeout duration and applies to the entire connection pool for a single replica. In other words: if a specific replica endpoint has not been contacted for a while (the endpoint is lying idle), should the SDK keep its connectivity to that endpoint, or tear it down? This is usually a very large value, e.g., 1 hour by default. If you have a use-case wherein you usually handle a very high volume of requests over a specific period of time (e.g., Black Friday sales), you can configure ieto to a very high value (e.g., 25 hours), which means the SDK maintains connectivity to a replica endpoint even if the replica has been lying idle for 25 hours.
  • “mcpe” stands for Max Channels Per Endpoint. This represents the maximum channels (socket connections) you wish to configure per end point. This is synonymous with a pipe to the replica endpoint. The SDK sets this to 130 by default. You can choose to override it based on your specific use-case. If you have a use-case wherein you have long-running executions (e.g., stored proc) running in a specific time interval and you want to limit the number of socket connections to a minimum viable number, you can set this to a lower number (e.g., 75). This prevents resource contention at the max channels per endpoint level.
  • “mrpc” stands for Max Requests Per Channel. This represents the maximum number of requests that you want to be put on a channel for sending across the wire. The maximum number of requests per channel is limited to 30.

Resolution:

Suppose you are configuring Azure Cosmos DB for a very low-latency, high-volume eCommerce portal. You could use code like the following to configure the networking parameters:

// Exercise caution in production environments.
// Reach out to Microsoft for specific guidance.
DirectConnectionConfig directConnectionConfig = DirectConnectionConfig.getDefaultConfig(); 
directConnectionConfig.setConnectTimeout(Duration.ofMillis(600));
directConnectionConfig.setNetworkRequestTimeout(Duration.ofSeconds(5));
directConnectionConfig.setIdleConnectionTimeout(Duration.ofSeconds(0));
directConnectionConfig.setIdleEndpointTimeout(Duration.ofHours(1));
directConnectionConfig.setMaxConnectionsPerEndpoint(130);

client = new CosmosClientBuilder()
         .endpoint(AccountSettings.HOST)
         .key(AccountSettings.MASTER_KEY)
         .consistencyLevel(ConsistencyLevel.SESSION)
         .contentResponseOnWriteEnabled(false)
         .directMode(directConnectionConfig)
         .buildAsyncClient();

Scenario #4: In-partition queries exhibiting high latency in a high-volume, sudden burst scenario.

In a very high-volume burst scenario, in-partition queries breach a 15 ms read latency barrier at P99. The number of physical partitions for the container is very high. Usually there is no specific error code associated with this situation; you just end up with higher-than-desirable latency.

Diagnostics section to review:

As with Scenario #3, check the event durations in the rntbd transport request timeline in the JSON object. If none of the sections show higher-than-normal durations, focus next on the networking parameters the client application is using. Here are the default values set by the Azure Cosmos DB Java SDK V4:

"connCfg":{
          "rntbd":"(cto:PT5S, nrto:PT5S, icto:PT0S, ieto:PT1H, mcpe:130, mrpc:30, cer:true)",
          ………………………………
          ………………………………
        },

Resolution:

You should set the IdleEndpointTimeout value to 3 hours. This timeout governs idle endpoint connections, which matters most for collections with a very high number of partitions. The max connections per endpoint is recommended to remain at 130.

Let us look at a real-life example. Assume an Azure Cosmos DB collection with 100 partitions. Each partition has 4 replicas, so we have 100 × 4 = 400 replicas; at 130 max connections per replica endpoint, up to 400 × 130 = 52,000 connections could be created.
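The arithmetic above generalizes to a simple upper-bound formula:

```java
public class ConnectionEstimate {
    // Upper bound on direct-mode TCP connections the SDK could open:
    // partitions x replicas per partition x max channels per endpoint (mcpe).
    static long maxConnections(int partitions, int replicasPerPartition, int mcpe) {
        return (long) partitions * replicasPerPartition * mcpe;
    }

    public static void main(String[] args) {
        // The example from the text: 100 partitions x 4 replicas x 130 channels.
        System.out.println(maxConnections(100, 4, 130)); // 52000
    }
}
```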

You could use the same code shown below for configuring the networking parameters:

// Exercise caution in production environments.
// Reach out to Microsoft for specific guidance.
DirectConnectionConfig directConnectionConfig = DirectConnectionConfig.getDefaultConfig();
directConnectionConfig.setConnectTimeout(Duration.ofMillis(600));
directConnectionConfig.setNetworkRequestTimeout(Duration.ofSeconds(5));
directConnectionConfig.setIdleConnectionTimeout(Duration.ofSeconds(0));
directConnectionConfig.setIdleEndpointTimeout(Duration.ofHours(3));
directConnectionConfig.setMaxConnectionsPerEndpoint(130);

client = new CosmosClientBuilder()
        .endpoint(AccountSettings.HOST)
        .key(AccountSettings.MASTER_KEY)
        .consistencyLevel(ConsistencyLevel.SESSION)
        .contentResponseOnWriteEnabled(false)
        .directMode(directConnectionConfig)
        .buildAsyncClient();

The reason the IdleEndpointTimeout value is assigned a large number is this: if you have a quiet period in which request volume is low, and you have a large number of partitions and replicas, many endpoints may not see requests for several minutes. The next time a request suddenly arrives, the SDK must re-establish the connection and redo the SSL handshake, which adds latency to that initial request to that endpoint. Setting this property to a very large value, e.g., 3 hours, means you can go 3 hours without hitting an endpoint before connectivity to that particular endpoint is torn down.

Getting started with Client Diagnostics

Logging Samples in GitHub

We hope the scenarios above help you get familiar with the diagnostics information returned by the Java SDK. To get started with client diagnostics, you need to instrument your service calls to Azure Cosmos DB. Here is an example:

// Write item and log its diagnostics
Mono<CosmosItemResponse<Family>> itemResponseMono = container.createItem(family,
               new PartitionKey(family.getLastName()),
               new CosmosItemRequestOptions());
itemResponseMono.doOnNext(itemResponse -> {
     CosmosDiagnostics diagnostics = itemResponse.getDiagnostics();
     logger.info("Create item diagnostics : {}", diagnostics);
}).subscribe();

For a complete list of samples visit the Azure Cosmos DB for NoSQL Java SDK Diagnostics Quick Start in GitHub.

Pre-requisites

  • An Azure Cosmos DB account with an active Azure subscription. You can create a free-tier Azure Cosmos DB account or try Azure Cosmos DB for free without an Azure subscription.
  • Java Development Kit (JDK) 8. Point your JAVA_HOME environment variable to the folder where the JDK is installed. We are using the Azure Cosmos DB Core (SQL) API Java SDK 4.38.0 (2022-10-12) for our examples in this blog post. Refer to Java SDK Release history Change Log for latest versions.
  • A Maven binary archive. On Ubuntu, run apt-get install maven to install Maven.
  • On Ubuntu, run sudo apt-get install git to install Git.

Tip: how to make diagnostics strings more readable

The diagnostics string returned in a response is not very easy to read. Below is a sample of the diagnostics output generated for a ‘Create Item’ operation:

{"userAgent":"azsdk-java-cosmos/4.38.0 Windows11/10.0 JRE/17.0.4","activityId":"30f6fce4-4ec5-11ed-a7e2-1934226b1acc","requestLatencyInMs":591,"requestStartTimeUTC":"2022-10-18T09:13:55.523693300Z","requestEndTimeUTC":"2022-10-18T09:13:56.115026900Z","responseStatisticsList":[{"storeResult":{"storePhysicalAddress":"rntbd://cdb-ms-prod-southindia1-be1.documents.azure.com:14344/apps/23b16621-f57f-40c6-b7b4-751b5f00808d/services/fb32b432-fbe9-498e-a92c-e1583f9a4171/partitions/afd1f1e8-6488-42cb-b971-652cfdbe32f8/replicas/133105575018894551p/","lsn":11,"globalCommittedLsn":10,"partitionKeyRangeId":"0","isValid":true,"statusCode":201,"subStatusCode":0,"isGone":false,"isNotFound":false,"isInvalidPartition":false,"isThroughputControlRequestRateTooLarge":false,"requestCharge":8.76,"itemLSN":-1,"sessionToken":"0:-1#11","backendLatencyInMs":3.884,"replicaStatusList":["14344:Unknown"],"transportRequestTimeline":[{"eventName":"created","startTimeUTC":"2022-10-………………………………………………………….

To make this more human readable, use an online JSON formatter to format the output to look like this…

{
   "userAgent":"azsdk-java-cosmos/4.38.0 Windows11/10.0 JRE/17.0.4",
   "activityId":"30f6fce4-4ec5-11ed-a7e2-1934226b1acc",
   "requestLatencyInMs":591,
   "requestStartTimeUTC":"2022-10-18T09:13:55.523693300Z",
   "requestEndTimeUTC":"2022-10-18T09:13:56.115026900Z",
   "responseStatisticsList":[
      {
         "storeResult":{
            "storePhysicalAddress":"rntbd://cdb-ms-prod-southindia1-be1.documents.azure.com:14344/apps/23b16621-f57f-40c6-b7b4-751b5f00808d/services/fb32b432-fbe9-498e-a92c-e1583f9a4171/partitions/afd1f1e8-6488-42cb-b971-652cfdbe32f8/replicas/133105575018894551p/",
            "lsn":11,
            "globalCommittedLsn":10,
            "partitionKeyRangeId":"0",
            "isValid":true,
            "statusCode":201,
            "subStatusCode":0,
            "isGone":false,
            "isNotFound":false,
            "isInvalidPartition":false,
            "isThroughputControlRequestRateTooLarge":false,
            "requestCharge":8.76,
            "itemLSN":-1,
            "sessionToken":"0:-1#11",
            "backendLatencyInMs":3.884,
            "replicaStatusList":[
               "14344:Unknown"
            ],
            "transportRequestTimeline":[
               {
                  "eventName":"created",
                  "startTimeUTC":"2022-10-18T09:13:55.832485400Z",
                  "durationInMilliSecs":0.9988
               },
               {
                  "eventName":"queued",
                  "startTimeUTC":"2022-10-18T09:13:55.833484200Z",
                  "durationInMilliSecs":0.0
               },
               {
                  "eventName":"channelAcquisitionStarted",
                  "startTimeUTC":"2022-10-18T09:13:55.833484200Z",
                  "durationInMilliSecs":245.0755
               },  ………………………………………………………….

Where to learn more

Explore the following links for further information:
