When building applications that interact with databases, developers frequently encounter scenarios where default SDK configurations don’t align with their specific operational requirements. They need to customize SDK behavior to address real-world challenges like network instability, performance bottlenecks, debugging complexity, monitoring requirements, and more. These factors become even more pronounced when working with a massively scalable, cloud-native, distributed database like Azure Cosmos DB.
This blog post explores how to customize and configure the Go SDK for Azure Cosmos DB beyond its default settings, covering techniques for modifying client behavior, implementing custom policies, and accessing operational metrics. These techniques help developers build more resilient applications, troubleshoot issues effectively, and gain deeper insights into their database interactions.
The Go SDK for Azure Cosmos DB is built on top of the core Azure Go SDK package, which implements several patterns that are applied throughout the SDK. The core SDK is designed to be quite customizable, and its configurations can be applied with the ClientOptions struct when creating a new Azure Cosmos DB client object using NewClient (and other similar functions). If you peek inside the azcore.ClientOptions struct, you will notice that it has many options for configuring the HTTP client, retry policies, timeouts, and other settings.
Let’s dive into how to make use of (and extend) these common options when building Go applications with Azure Cosmos DB.
Retry policies
Common retry scenarios are handled in the SDK. Here is a summary of errors for which retries are attempted:
| Error Type / Status Code | Retry Logic |
| --- | --- |
| Network Connection Errors | Retry after marking endpoint unavailable and waiting for defaultBackoff. |
| 403 Forbidden (with specific substatuses) | Retry after marking endpoint unavailable and updating the endpoint manager. |
| 404 Not Found (specific substatus) | Retry by switching to another session or endpoint. |
| 503 Service Unavailable | Retry by switching to another preferred location. |
The upcoming sections demonstrate some of these in action.
Non-Retriable Errors
When a request fails with a non-retriable error, the SDK does not retry the operation. This is useful for scenarios where the error indicates that the operation cannot succeed.
For example, here is a function that tries to read a database that does not exist.
func retryPolicy1() {
c, err := auth.GetClientWithDefaultAzureCredential("https://demodb.documents.azure.com:443/", nil)
if err != nil {
log.Fatal(err)
}
azlog.SetListener(func(cls azlog.Event, msg string) {
// Log retry-related events
switch cls {
case azlog.EventRetryPolicy:
fmt.Printf("Retry Policy Event: %s\n", msg)
}
})
// Set logging level to include retries
azlog.SetEvents(azlog.EventRetryPolicy)
db, err := c.NewDatabase("i_dont_exist")
if err != nil {
log.Fatal("NewDatabase call failed", err)
}
_, err = db.Read(context.Background(), nil)
if err != nil {
log.Fatal("Read call failed: ", err)
}
}
The azcore logging implementation is configured using SetListener and SetEvents to write retry policy event logs to standard output. Refer to the Logging section in the azcosmos package README for details.
The auth.GetClientWithDefaultAzureCredential function used here, like auth.GetEmulatorClientWithAzureADAuth later in this post, is part of the cosmosdb-go-sdk-helper package.
Here are the logs from code execution:
//....
Retry Policy Event: exit due to non-retriable status code
Retry Policy Event: =====> Try=1 for GET https://demodb.documents.azure.com:443/dbs/i_dont_exist
Retry Policy Event: response 404
Retry Policy Event: exit due to non-retriable status code
Read call failed: GET https://demodb-region.documents.azure.com:443/dbs/i_dont_exist
--------------------------------------------------------------------------------
RESPONSE 404: 404 Not Found
ERROR CODE: 404 Not Found
//...
When a request is made to read a non-existent database, the SDK gets a 404 (not found) response. This is recognized as a non-retriable error, so the SDK makes no further attempts; retries are only performed for retriable errors (like network issues or certain status codes). The operation fails because the database does not exist.
Retriable Errors
When a request fails with a retriable error, the SDK automatically retries the operation based on the retry policy. This is useful for transient errors that may resolve themselves after a few attempts.
This function tries to create an Azure Cosmos DB client using an invalid account endpoint. It sets up logging for retry policy events and attempts to create a database.
func retryPolicy2() {
c, err := auth.GetClientWithDefaultAzureCredential("https://iamnothere.documents.azure.com:443/", nil)
if err != nil {
log.Fatal(err)
}
azlog.SetListener(func(cls azlog.Event, msg string) {
// Log retry-related events
switch cls {
case azlog.EventRetryPolicy:
fmt.Printf("Retry Policy Event: %s\n", msg)
}
})
// Set logging level to include retries
azlog.SetEvents(azlog.EventRetryPolicy)
_, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test"}, nil)
if err != nil {
log.Fatal(err)
}
}
The logs show how the SDK handles retries when the endpoint is unreachable:
//....
Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host
Retry Policy Event: End Try #1, Delay=682.644105ms
Retry Policy Event: =====> Try=2 for GET https://iamnothere.documents.azure.com:443/
Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host
Retry Policy Event: End Try #2, Delay=2.343322179s
Retry Policy Event: =====> Try=3 for GET https://iamnothere.documents.azure.com:443/
Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host
Retry Policy Event: End Try #3, Delay=7.177314269s
Retry Policy Event: =====> Try=4 for GET https://iamnothere.documents.azure.com:443/
Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host
Retry Policy Event: MaxRetries 3 exceeded
failed to retrieve account properties: Get "https://iamnothere.docume
Each failed attempt is logged, and the SDK retries the operation several times (three times to be specific), with increasing delays between attempts. After exceeding the maximum number of retries, the operation fails with an error indicating the host could not be found – the SDK automatically retries transient network errors before giving up.
But you don’t have to stick to the default retry policy. You can customize the retry policy by setting the azcore.ClientOptions when creating the Azure Cosmos DB client.
Configurable Retries
Let’s say you want to set a custom retry policy with a maximum of two retries and an initial retry delay of one second. You can do this by creating a policy.RetryOptions struct and passing it to the azcosmos.ClientOptions when creating the client.
func retryPolicy3() {
retryPolicy := policy.RetryOptions{
MaxRetries: 2,
RetryDelay: 1 * time.Second,
}
opts := azcosmos.ClientOptions{
ClientOptions: policy.ClientOptions{
Retry: retryPolicy,
},
}
c, err := auth.GetClientWithDefaultAzureCredential("https://iamnothere.documents.azure.com:443/", &opts)
if err != nil {
log.Fatal(err)
}
log.Println(c.Endpoint())
azlog.SetListener(func(cls azlog.Event, msg string) {
// Log retry-related events
switch cls {
case azlog.EventRetryPolicy:
fmt.Printf("Retry Policy Event: %s\n", msg)
}
})
azlog.SetEvents(azlog.EventRetryPolicy)
_, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test"}, nil)
if err != nil {
log.Fatal(err)
}
}
Each failed attempt is logged, and the SDK retries the operation according to the custom policy: only two retries, with a delay of roughly one second after the first attempt and a longer delay after the second. After reaching the maximum number of retries, the operation fails with an error indicating the host could not be found.
Retry Policy Event: =====> Try=1 for GET https://iamnothere.documents.azure.com:443/
//....
Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host
Retry Policy Event: End Try #1, Delay=1.211970493s
Retry Policy Event: =====> Try=2 for GET https://iamnothere.documents.azure.com:443/
Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host
Retry Policy Event: End Try #2, Delay=3.300739653s
Retry Policy Event: =====> Try=3 for GET https://iamnothere.documents.azure.com:443/
Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host
Retry Policy Event: MaxRetries 2 exceeded
failed to retrieve account properties: Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host
exit status 1
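The delays in the two sets of logs (roughly 0.7s, 2.3s, and 7.2s with the defaults, and roughly 1.2s and 3.3s with a one-second RetryDelay) appear consistent with a delay that grows as (2^try - 1) times the initial delay, with some random jitter applied on top. The sketch below only illustrates that growth pattern; the formula is inferred from the logs, the baseDelay helper and the 60-second cap are hypothetical, and none of this is the SDK's actual implementation.

```go
package main

import (
	"fmt"
	"time"
)

// baseDelay returns the un-jittered delay before retry number try (1-based),
// ASSUMING the delay grows as (2^try - 1) * initial and is capped at maxDelay.
// The growth rule is inferred from the logged delays, not taken from the SDK.
func baseDelay(initial, maxDelay time.Duration, try int) time.Duration {
	if try < 1 {
		return 0
	}
	if try > 30 {
		// Avoid overflow for very large retry counts.
		return maxDelay
	}
	factor := int64(1)<<uint(try) - 1
	d := time.Duration(factor) * initial
	if d > maxDelay {
		return maxDelay
	}
	return d
}

func main() {
	// One-second initial delay as in retryPolicy3; the 60s cap is illustrative.
	for try := 1; try <= 3; try++ {
		fmt.Printf("try %d: about %v before jitter\n", try, baseDelay(time.Second, 60*time.Second, try))
	}
}
```

Estimating the total wait this way can help when choosing MaxRetries and RetryDelay values that fit within a request deadline.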
Fault Injection
You can customize this further by creating custom policies to inject faults into the request pipeline. This is useful for testing how your application handles various error scenarios without needing to rely on actual network failures or service outages.
For example, you can create a custom policy that injects a fault into the request pipeline. Here, we use a custom policy (FaultInjectionPolicy) that simulates a network error with a configurable probability on each request.
type FaultInjectionPolicy struct {
failureProbability float64 // e.g., 0.3 for 30% chance to fail
}
// Implement the Policy interface
func (f *FaultInjectionPolicy) Do(req *policy.Request) (*http.Response, error) {
if rand.Float64() < f.failureProbability {
// Simulate a network error
return nil, &net.OpError{
Op: "read",
Net: "tcp",
Err: errors.New("simulated network failure"),
}
}
// no failure - continue with the request
return req.Next()
}
This function configures the Azure Cosmos DB client to use this policy, sets up logging for retry events, and attempts to create a database.
func retryPolicy4() {
opts := azcosmos.ClientOptions{
ClientOptions: policy.ClientOptions{
PerRetryPolicies: []policy.Policy{&FaultInjectionPolicy{failureProbability: 0.6}},
},
}
c, err := auth.GetClientWithDefaultAzureCredential("https://ACCOUNT_NAME.documents.azure.com:443/", &opts)
if err != nil {
log.Fatal(err)
}
azlog.SetListener(func(cls azlog.Event, msg string) {
// Log retry-related events
switch cls {
case azlog.EventRetryPolicy:
fmt.Printf("Retry Policy Event: %s\n", msg)
}
})
// Set logging level to include retries
azlog.SetEvents(azlog.EventRetryPolicy)
_, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test_1"}, nil)
if err != nil {
log.Fatal(err)
}
}
Take a look at the logs generated when this code is run. Request attempts fail with the simulated network error, and the SDK logs each retry with increasing delays between attempts. After reaching the maximum number of retries (default = 3), the operation fails with an error indicating a simulated network failure.
How often a request fails depends on the failureProbability set in FaultInjectionPolicy. In this case, we set it to 0.6 (60% chance of failure), so you may see different results each time you run the code.
Retry Policy Event: =====> Try=1 for GET https://ACCOUNT_NAME.documents.azure.com:443/
//....
Retry Policy Event: MaxRetries 0 exceeded
Retry Policy Event: error read tcp: simulated network failure
Retry Policy Event: End Try #1, Delay=794.018648ms
Retry Policy Event: =====> Try=2 for GET https://ACCOUNT_NAME.documents.azure.com:443/
Retry Policy Event: error read tcp: simulated network failure
Retry Policy Event: End Try #2, Delay=2.374693498s
Retry Policy Event: =====> Try=3 for GET https://ACCOUNT_NAME.documents.azure.com:443/
Retry Policy Event: error read tcp: simulated network failure
Retry Policy Event: End Try #3, Delay=7.275038434s
Retry Policy Event: =====> Try=4 for GET https://ACCOUNT_NAME.documents.azure.com:443/
Retry Policy Event: error read tcp: simulated network failure
Retry Policy Event: MaxRetries 3 exceeded
Retry Policy Event: =====> Try=1 for GET https://ACCOUNT_NAME.documents.azure.com:443/
Retry Policy Event: error read tcp: simulated network failure
Retry Policy Event: End Try #1, Delay=968.457331ms
2025/05/05 19:53:50 failed to retrieve account properties: read tcp: simulated network failure
exit status 1
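Because the probabilistic policy gives different results on each run, a deterministic variant can be handy when you want a repeatable test. The sketch below is a different technique from the policy shown above: it uses only the standard net/http RoundTripper interface rather than the SDK's policy.Policy, and it fails the first n requests before letting later ones through. The failFirstN and okTransport types and the attempt helper are hypothetical names used for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sync/atomic"
)

// failFirstN is a hypothetical http.RoundTripper that fails the first n
// requests with a simulated error and forwards all later requests to next.
type failFirstN struct {
	n    int32
	seen atomic.Int32
	next http.RoundTripper
}

func (f *failFirstN) RoundTrip(req *http.Request) (*http.Response, error) {
	if f.seen.Add(1) <= f.n {
		return nil, errors.New("simulated network failure")
	}
	return f.next.RoundTrip(req)
}

// okTransport is a stub that answers every request with 200 and an empty body,
// so the example runs without any network access.
type okTransport struct{}

func (okTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody, Request: req}, nil
}

// attempt sends one GET through rt and reports whether it succeeded.
func attempt(rt http.RoundTripper) bool {
	req, err := http.NewRequest(http.MethodGet, "http://example.invalid/", nil)
	if err != nil {
		return false
	}
	resp, err := rt.RoundTrip(req)
	if err != nil {
		return false
	}
	resp.Body.Close()
	return true
}

func main() {
	rt := &failFirstN{n: 2, next: okTransport{}}
	for i := 1; i <= 3; i++ {
		fmt.Printf("attempt %d succeeded: %v\n", i, attempt(rt))
	}
}
```

In a real test, next would typically be http.DefaultTransport, and the RoundTripper could be wrapped in an http.Client and supplied as the Transport in ClientOptions, in the same way customHTTP1 does in the next section. Whether that fits your setup is an assumption worth verifying.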
HTTP-level customizations
There are scenarios where you may need to customize the HTTP client used by the SDK. For example, when using the Azure Cosmos DB emulator locally, you want to skip certificate verification to connect without SSL errors during development or testing.
TLSClientConfig allows you to customize TLS settings for the HTTP client, and setting InsecureSkipVerify: true disables certificate verification. This is not recommended for production, but handy for testing.
func customHTTP1() {
// Create a custom HTTP client that skips TLS certificate verification (development/testing only)
client := &http.Client{
Transport: &http.Transport{
TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
},
}
clientOptions := &azcosmos.ClientOptions{
ClientOptions: azcore.ClientOptions{
Transport: client,
},
}
c, err := auth.GetEmulatorClientWithAzureADAuth("http://localhost:8081", clientOptions)
if err != nil {
log.Fatal(err)
}
_, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test"}, nil)
if err != nil {
log.Fatal(err)
}
}
All you need to do is pass the custom HTTP client to the ClientOptions struct when creating the Azure Cosmos DB client. The SDK will use this for all requests.
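If you would rather not disable verification entirely, one possible alternative is to trust a specific certificate instead. The sketch below is a minimal example of that idea using only the standard library; the newClientTrusting helper is a hypothetical name, and it assumes you have the emulator's certificate available in PEM form (how you export it depends on your emulator setup).

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
)

// newClientTrusting builds an http.Client whose TLS configuration trusts only
// the given PEM-encoded certificate(s), instead of skipping verification.
func newClientTrusting(pemCerts []byte) (*http.Client, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pemCerts) {
		return nil, errors.New("no valid PEM certificate found")
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool},
		},
	}, nil
}

func main() {
	// In practice the bytes would be read from an exported certificate file
	// (for example with os.ReadFile); invalid input is used here to show the
	// error path without needing a real certificate.
	_, err := newClientTrusting([]byte("not a certificate"))
	fmt.Println("error:", err)
}
```

The resulting client can be passed as the Transport in ClientOptions exactly like the client in customHTTP1.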
Another scenario is when you want to set a custom header on all requests, for example to track requests or add metadata. All you need to do is implement the Do method of the policy.Policy interface and set the header in the request:
type CustomHeaderPolicy struct{}
func (c *CustomHeaderPolicy) Do(req *policy.Request) (*http.Response, error) {
correlationID := uuid.New().String()
req.Raw().Header.Set("X-Correlation-ID", correlationID)
return req.Next()
}
Looking at the logs, notice that the custom header X-Correlation-ID is added to each request (it appears in canonical form as X-Correlation-Id, with its value redacted by the logger):
//...
Request Event: ==> OUTGOING REQUEST (Try=1)
GET https://ACCOUNT_NAME.documents.azure.com:443/
Authorization: REDACTED
User-Agent: azsdk-go-azcosmos/v1.3.0 (go1.23.6; darwin)
X-Correlation-Id: REDACTED
X-Ms-Cosmos-Sdk-Supportedcapabilities: 1
X-Ms-Date: Tue, 06 May 2025 04:27:37 GMT
X-Ms-Version: 2020-11-05
Request Event: ==> OUTGOING REQUEST (Try=1)
POST https://ACCOUNT_NAME-region.documents.azure.com:443/dbs
Authorization: REDACTED
Content-Length: 27
Content-Type: application/query+json
User-Agent: azsdk-go-azcosmos/v1.3.0 (go1.23.6; darwin)
X-Correlation-Id: REDACTED
X-Ms-Cosmos-Sdk-Supportedcapabilities: 1
X-Ms-Date: Tue, 06 May 2025 04:27:37 GMT
X-Ms-Documentdb-Query: True
X-Ms-Version: 2020-11-05
//....
Query and Index Metrics
The Go SDK provides a way to access query and index metrics, which can help you optimize your queries and understand their performance characteristics.
Query Metrics
When executing queries, you can get basic metrics about the query execution. The Go SDK exposes these metrics through the QueryMetrics field of the QueryItemsResponse object. They include information such as execution times and the number of documents retrieved.
func queryMetrics() {
//....
container, err := c.NewContainer("existing_db", "existing_container")
if err != nil {
log.Fatal(err)
}
query := "SELECT * FROM c"
pager := container.NewQueryItemsPager(query, azcosmos.NewPartitionKey(), nil)
for pager.More() {
queryResp, err := pager.NextPage(context.Background())
if err != nil {
log.Fatal("query items failed:", err)
}
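// QueryMetrics is a pointer, so guard against nil before dereferencing
if queryResp.QueryMetrics != nil {
log.Println("query metrics:\n", *queryResp.QueryMetrics)
}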
//....
}
}
The query metrics are provided as a simple raw string in a key-value format (semicolon-separated), which is very easy to parse. Here is an example:
totalExecutionTimeInMs=0.34;queryCompileTimeInMs=0.04;queryLogicalPlanBuildTimeInMs=0.00;queryPhysicalPlanBuildTimeInMs=0.02;queryOptimizationTimeInMs=0.00;VMExecutionTimeInMs=0.07;indexLookupTimeInMs=0.00;instructionCount=41;documentLoadTimeInMs=0.04;systemFunctionExecuteTimeInMs=0.00;userFunctionExecuteTimeInMs=0.00;retrievedDocumentCount=9;retrievedDocumentSize=1251;outputDocumentCount=9;outputDocumentSize=2217;writeOutputTimeInMs=0.02;indexUtilizationRatio=1.00
Here is a breakdown of the metrics you can obtain from the query response:
| Metric | Unit | Description |
| ------------------------------ | ----- | ------------------------------------------------------------ |
| totalExecutionTimeInMs | ms | Total time taken to execute the query, including all phases. |
| queryCompileTimeInMs | ms | Time spent compiling the query. |
| queryLogicalPlanBuildTimeInMs | ms | Time spent building the logical plan for the query. |
| queryPhysicalPlanBuildTimeInMs | ms | Time spent building the physical plan for the query. |
| queryOptimizationTimeInMs | ms | Time spent optimizing the query. |
| VMExecutionTimeInMs | ms | Time spent executing the query. |
| indexLookupTimeInMs | ms | Time spent looking up indexes. |
| instructionCount | count | Number of instructions executed for the query. |
| documentLoadTimeInMs | ms | Time spent loading documents from storage. |
| systemFunctionExecuteTimeInMs | ms | Time spent executing system functions in the query. |
| userFunctionExecuteTimeInMs | ms | Time spent executing user-defined functions in the query. |
| retrievedDocumentCount | count | Number of documents retrieved by the query. |
| retrievedDocumentSize | bytes | Total size of documents retrieved. |
| outputDocumentCount | count | Number of documents returned as output. |
| outputDocumentSize | bytes | Total size of output documents. |
| writeOutputTimeInMs | ms | Time spent writing the output. |
| indexUtilizationRatio | ratio | Ratio of index utilization (1.0 means fully utilized). |
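Since the raw metrics string shown earlier is just a semicolon-separated list of key=value pairs, turning it into a map takes only a few lines. Here is a minimal sketch; parseQueryMetrics is a hypothetical helper (not part of the SDK), and it assumes every value is numeric, as in the example above.

```go
package main

import (
	"fmt"
	"log"
	"strconv"
	"strings"
)

// parseQueryMetrics splits a "k1=v1;k2=v2" string into a map of float64 values.
// It assumes all values are numeric, which holds for the sample in this post.
func parseQueryMetrics(raw string) (map[string]float64, error) {
	metrics := make(map[string]float64)
	for _, pair := range strings.Split(raw, ";") {
		if pair == "" {
			continue // tolerate a trailing semicolon
		}
		key, value, ok := strings.Cut(pair, "=")
		if !ok {
			return nil, fmt.Errorf("malformed metric %q", pair)
		}
		f, err := strconv.ParseFloat(value, 64)
		if err != nil {
			return nil, fmt.Errorf("metric %q: %w", key, err)
		}
		metrics[key] = f
	}
	return metrics, nil
}

func main() {
	raw := "totalExecutionTimeInMs=0.34;retrievedDocumentCount=9;indexUtilizationRatio=1.00"
	m, err := parseQueryMetrics(raw)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("documents retrieved:", m["retrievedDocumentCount"])
	fmt.Println("total time (ms):", m["totalExecutionTimeInMs"])
}
```

In the queryMetrics function above, you would pass *queryResp.QueryMetrics as the argument.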
Index Metrics
Indexing metrics show both utilized indexed paths and recommended indexed paths. You can use the indexing metrics to optimize query performance, especially in cases where you aren’t sure how to modify the indexing policy.
To enable indexing metrics in the Go SDK, set PopulateIndexMetrics to true in the QueryOptions. The index metrics data in the QueryItemsResponse is base64 encoded and needs to be decoded before it can be used.
pager := container.NewQueryItemsPager("SELECT c.id FROM c WHERE CONTAINS(LOWER(c.description), @word)", azcosmos.NewPartitionKey(), &azcosmos.QueryOptions{
PopulateIndexMetrics: true,
QueryParameters: []azcosmos.QueryParameter{
{
Name: "@word",
Value: "happy",
},
},
})
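if pager.More() {
page, err := pager.NextPage(context.Background())
if err != nil {
log.Fatal("query items failed:", err)
}
// process results
decoded, err := base64.StdEncoding.DecodeString(*page.IndexMetrics)
if err != nil {
log.Fatal("decoding index metrics failed:", err)
}
log.Println("Index metrics", string(decoded))
}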
{
"UtilizedSingleIndexes": [
{
"FilterExpression": "",
"IndexSpec": "/description/?",
"FilterPreciseSet": true,
"IndexPreciseSet": true,
"IndexImpactScore": "High"
}
],
"PotentialSingleIndexes": [],
"UtilizedCompositeIndexes": [],
"PotentialCompositeIndexes": []
}
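If you want to work with the index metrics programmatically rather than just log them, you can unmarshal the decoded JSON into a struct. The sketch below assumes the shape shown in the sample output above; the indexMetrics and singleIndex types are hypothetical (they are not SDK types), the composite index lists are left out because the sample does not show what their entries look like, and sampleEncoded simply stands in for the value of *page.IndexMetrics.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"log"
)

// singleIndex mirrors one entry of the single-index lists in the sample output.
type singleIndex struct {
	FilterExpression string
	IndexSpec        string
	FilterPreciseSet bool
	IndexPreciseSet  bool
	IndexImpactScore string
}

// indexMetrics covers only the single-index lists from the sample output.
type indexMetrics struct {
	UtilizedSingleIndexes  []singleIndex
	PotentialSingleIndexes []singleIndex
}

// decodeIndexMetrics base64-decodes the value and unmarshals the JSON payload.
func decodeIndexMetrics(encoded string) (*indexMetrics, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return nil, fmt.Errorf("base64 decode: %w", err)
	}
	var m indexMetrics
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil, fmt.Errorf("json decode: %w", err)
	}
	return &m, nil
}

// sampleEncoded stands in for *page.IndexMetrics so the example runs offline.
func sampleEncoded() string {
	sample := `{"UtilizedSingleIndexes":[{"IndexSpec":"/description/?","IndexImpactScore":"High"}],"PotentialSingleIndexes":[]}`
	return base64.StdEncoding.EncodeToString([]byte(sample))
}

func main() {
	m, err := decodeIndexMetrics(sampleEncoded())
	if err != nil {
		log.Fatal(err)
	}
	for _, idx := range m.UtilizedSingleIndexes {
		fmt.Printf("utilized index %s (impact: %s)\n", idx.IndexSpec, idx.IndexImpactScore)
	}
}
```

Entries under PotentialSingleIndexes are the ones worth reviewing when considering changes to the indexing policy.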
OpenTelemetry support
The Azure Go SDK supports distributed tracing via OpenTelemetry. This allows you to collect, export, and analyze traces for requests made to Azure services, including Azure Cosmos DB.
The azotel package is used to connect an instance of OpenTelemetry’s TracerProvider to an Azure SDK client (in this case Azure Cosmos DB). You can then configure the TracingProvider in azcore.ClientOptions to enable automatic propagation of trace context and emission of spans for SDK operations.
func getClientOptionsWithTracing() (*azcosmos.ClientOptions, *trace.TracerProvider) {
exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
if err != nil {
log.Fatalf("failed to initialize stdouttrace exporter: %v", err)
}
tp := trace.NewTracerProvider(trace.WithBatcher(exporter))
otel.SetTracerProvider(tp)
op := azcosmos.ClientOptions{
ClientOptions: policy.ClientOptions{
TracingProvider: azotel.NewTracingProvider(tp, nil),
},
}
return &op, tp
}
The above function creates a stdout exporter for OpenTelemetry (prints traces to the console). It sets up a TracerProvider, registers this as the global tracer, and returns a ClientOptions struct with the TracingProvider set, ready to be used with the Azure Cosmos DB client.
func tracing() {
op, tp := getClientOptionsWithTracing()
defer func() { _ = tp.Shutdown(context.Background()) }()
c, err := auth.GetClientWithDefaultAzureCredential("https://ACCOUNT_NAME.documents.azure.com:443/", op)
//....
container, err := c.NewContainer("existing_db", "existing_container")
if err != nil {
log.Fatal(err)
}
//ctx := context.Background()
tracer := otel.Tracer("tracer_app1")
ctx, span := tracer.Start(context.Background(), "query-items-operation")
defer span.End()
query := "SELECT * FROM c"
pager := container.NewQueryItemsPager(query, azcosmos.NewPartitionKey(), nil)
for pager.More() {
queryResp, err := pager.NextPage(ctx)
if err != nil {
log.Fatal("query items failed:", err)
}
for _, item := range queryResp.Items {
log.Printf("Queried item: %+v\n", string(item))
}
}
}
The above function calls getClientOptionsWithTracing to get tracing-enabled options and a tracer provider, and ensures the tracer provider is shut down at the end (which flushes traces). It creates an Azure Cosmos DB client with tracing enabled and executes an operation to query items in a container. The SDK call is traced automatically and exported to stdout in this case.
The traces are quite large – here is a small snippet of the output:
//...
{
"Name": "query_items democontainer",
"SpanContext": {
"TraceID": "39a650bcd34ff70d48bbee467d728211",
"SpanID": "f2c892bec75dbf5d",
"TraceFlags": "01",
"TraceState": "",
"Remote": false
},
"Parent": {
"TraceID": "39a650bcd34ff70d48bbee467d728211",
"SpanID": "b833d109450b779b",
"TraceFlags": "01",
"TraceState": "",
"Remote": false
},
"SpanKind": 3,
"StartTime": "2025-05-06T17:59:30.90146+05:30",
"EndTime": "2025-05-06T17:59:36.665605042+05:30",
"Attributes": [
{
"Key": "db.system",
"Value": {
"Type": "STRING",
"Value": "cosmosdb"
}
},
{
"Key": "db.cosmosdb.connection_mode",
"Value": {
"Type": "STRING",
"Value": "gateway"
}
},
{
"Key": "db.namespace",
"Value": {
"Type": "STRING",
"Value": "demodb-gosdk3"
}
},
//.....
Conclusion
The Go SDK for Azure Cosmos DB is designed to be flexible and customizable, allowing you to tailor it to your specific needs. In this blog, we covered how to configure and customize the Go SDK for Azure Cosmos DB. We looked at retry policies, HTTP-level customizations, OpenTelemetry support, and how to access metrics.
For more information, refer to the package documentation and the GitHub repository for the Go SDK. I hope you find this useful!
About Azure Cosmos DB
Azure Cosmos DB is a fully managed and serverless distributed database for modern app development, with SLA-backed speed and availability, automatic and instant scalability, and support for open-source PostgreSQL, MongoDB, and Apache Cassandra. To stay in the loop on Azure Cosmos DB updates, follow us on X, YouTube, and LinkedIn.
To easily build your first database, watch our Get Started videos on YouTube and explore ways to dev/test free.