Introduction
In the first part of our series on Azure Cosmos DB Vector Search with DiskANN, we explored the fundamentals of vector indexing and search in Azure Cosmos DB for NoSQL and demonstrated a few early performance and cost characteristics. In Part 2, we’ll demonstrate how to scale to a 1-billion-vector dataset while maintaining query latency under 100 milliseconds with > 90% recall, all in a cost-effective manner. Along the way, we will review concepts like partitioning for managing large-scale vector data in Azure Cosmos DB, along with optimizations and best practices for faster ingestion and lower cross-partition query latencies. You’ll learn to design, optimize, and measure large-scale vector indexes in Azure Cosmos DB for efficiency and cost-effectiveness. Let’s dive in!
Code
You can find the code to replicate the experiments in this article on GitHub: VectorIndexScenarioSuite. Refer to the MSTuringEmbeddingOnlyScenario.cs file for a holistic view of the operations. We also recommend familiarizing yourself with creating vector indexes in Azure Cosmos DB for NoSQL by reviewing Part 1 of this series.
Partitions and scale out
Azure Cosmos DB for NoSQL uses partitioning to ensure low latency across large indexes. Internally, it uses a single DiskANN instance per physical partition, which allows for faster indexing because each physical partition can use its own RU (Request Unit) allotment to index vectors independently. The number of physical partitions in a container depends on the following characteristics:
- The total amount of throughput (RU/s) provisioned on the container. Each physical partition can provide throughput up to 10,000 RU/s.
- The total amount of data (documents + index) stored in the container.
Azure Cosmos DB for NoSQL also has logical partitions, which are defined by partition key values. One or more logical partitions are mapped to a single physical partition, the underlying unit of data management.
Configuring the container with a throughput of N x 10,000 RU/s creates N physical partitions. For example, if you create a container with 50 x 10,000 RU/s = 500,000 RU/s, it will have 50 physical partitions. This setup can significantly increase indexing speed because the indexing work is distributed across all physical partitions.
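To make this concrete, here is a minimal sketch using the .NET SDK; the endpoint, key, database, container, and partition key names are placeholders for this example, and the vector embedding and indexing policy covered in Part 1 is omitted for brevity. Provisioning 500,000 RU/s of manual throughput at creation time results in roughly 50 physical partitions.

using Microsoft.Azure.Cosmos;

CosmosClient cosmosClient = new("<account-endpoint>", "<account-key>");
Database database = await cosmosClient.CreateDatabaseIfNotExistsAsync("vectordb");

// Partition key path is illustrative; see the partition key discussion below.
// The vector embedding policy and DiskANN indexing policy from Part 1 would also be set here.
ContainerProperties containerProperties = new(id: "vectors", partitionKeyPath: "/id");

// 50 x 10,000 RU/s = 500,000 RU/s, which yields about 50 physical partitions.
ThroughputProperties throughput = ThroughputProperties.CreateManualThroughput(500_000);

Container container = await database.CreateContainerIfNotExistsAsync(containerProperties, throughput);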
Partition Key
When you choose a partition key, ensure it has high cardinality, meaning it offers a wide range of distinct values; this evenly distributes data and RU usage across partitions and creates many logical partitions. For large-scale datasets, consider using hierarchical partition keys to create multi-level partitioning. Design hierarchical partition keys thoughtfully so that the first level maintains high cardinality, maximizing scalability and performance. You can learn more about partitioning in Azure Cosmos DB from our official documentation.
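If you do opt for a hierarchical partition key, it is declared at container creation time. The sketch below is a minimal illustration, assuming a recent .NET SDK version with hierarchical partition key support and reusing the database reference from the previous snippet; the property names are placeholders, and per the guidance above the first level should be a high-cardinality property.

using System.Collections.Generic;
using Microsoft.Azure.Cosmos;

// Property names are illustrative; choose a high-cardinality property for the first level.
ContainerProperties hierarchicalProperties = new(
    id: "vectors",
    partitionKeyPaths: new List<string> { "/userId", "/sessionId" });

Container hierarchicalContainer = await database.CreateContainerIfNotExistsAsync(
    hierarchicalProperties,
    ThroughputProperties.CreateManualThroughput(500_000));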
In the experiments here, we use the vector ID as the partition key. This ensures a uniform distribution of data across partitions, which helps avoid hotspots and maintain efficiency.
Optimizations for large scale vector workloads
Let’s see how to batch documents to increase indexing throughput, and how to reduce cross-partition query latency by increasing the SDK’s concurrency.
Faster Ingestion
The Azure Cosmos DB API allows us to insert multiple documents with vectors in a single batched request. This lowers the number of round trips between your client SDK and the Azure Cosmos DB service, and it also allows the vector indexer to parallelize the indexing of all vectors in the batch. This feature can be enabled by specifying the AllowBulkExecution option in the Azure Cosmos DB for NoSQL .NET, Java, and JavaScript SDKs. The Python SDK supports transactional batch but doesn’t support bulk execution at this time; however, you can use the async client for concurrent requests. The batches that the SDK creates to optimize throughput currently have a maximum of 100 operations per batch.
Here is an example of enabling AllowBulkExecution using the .NET SDK.
CosmosClientOptions cosmosClientOptions = new()
{
    ConnectionMode = ConnectionMode.Direct,
    AllowBulkExecution = true,
    // SDK will handle throttling and retry after recommended time has elapsed.
    // https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/how-to-migrate-from-bulk-executor-library
    MaxRetryAttemptsOnRateLimitedRequests = 100,
    MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(600)
};
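With bulk execution enabled, the typical pattern is to issue the individual writes concurrently and let the SDK group them into batches behind the scenes. Here is a minimal sketch under assumed names: MyVectorDocument is a hypothetical document type with an Id property, documents is an in-memory list of such items, and container is the container created earlier.

// Queue the writes concurrently; the SDK packs them into batches (up to 100 operations each).
List<Task> insertTasks = new(capacity: documents.Count);
foreach (MyVectorDocument document in documents)
{
    insertTasks.Add(container
        .CreateItemAsync(document, new PartitionKey(document.Id))
        .ContinueWith(task =>
        {
            // Surface failures without stopping the remaining inserts.
            if (task.IsFaulted)
            {
                Console.WriteLine($"Failed to insert {document.Id}: {task.Exception?.Flatten().InnerException?.Message}");
            }
        }));
}

await Task.WhenAll(insertTasks);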
Improving Query Latency
Low-latency querying is essential for developing responsive AI applications. In a scale-out configuration, the client SDK will query multiple physical partitions to expedite response times and improve latency. The Azure Cosmos DB SDK provides the MaxConcurrency option to control the number of concurrent operations run client-side during parallel query execution. If you set it to a value less than 0, the system automatically decides how many concurrent operations to run. By setting a positive value for MaxConcurrency, you can raise the number of concurrent operations to the specified value and reduce latency.
var requestOptions = new QueryRequestOptions()
{
    MaxConcurrency = 64,
};

FeedIterator<IdWithSimilarityScore> queryResultSetIterator =
    this.CosmosContainerForQuery.GetItemQueryIterator<IdWithSimilarityScore>(queryDefinition, null, requestOptions);
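For context, the queryDefinition used above is an ordinary TOP-K vector search built on the VectorDistance system function. Here is a minimal sketch, with the embedding property name, the parameter name, and the GetQueryEmbedding helper assumed for illustration:

// Hypothetical helper that returns the query embedding as a float[].
float[] queryVector = GetQueryEmbedding();

// The optional options object of VectorDistance can carry searchListSizeMultiplier,
// which controls the DiskANN search list size (the L parameter discussed later).
QueryDefinition queryDefinition = new QueryDefinition(
    "SELECT TOP 10 c.id, VectorDistance(c.embedding, @queryVector) AS SimilarityScore " +
    "FROM c " +
    "ORDER BY VectorDistance(c.embedding, @queryVector)")
    .WithParameter("@queryVector", queryVector);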
A few best practices to consider for optimizing latency:
- Ensure clients are in the same region as your Azure Cosmos DB and AI model resources.
- Reuse client instances (a singleton per AppDomain is recommended in C#).
- Avoid a “cold start” of the Azure Cosmos DB SDK by issuing warm-up queries before evaluation, as in the sketch after this list.
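A simple way to do the warm-up, reusing the queryDefinition and requestOptions from the snippets above (the iteration count here is arbitrary), is to run and fully drain a few queries before taking measurements:

// Issue a few throwaway queries so connection setup and metadata caching
// don't inflate the first measured latencies.
for (int i = 0; i < 5; i++)
{
    using FeedIterator<IdWithSimilarityScore> warmupIterator =
        this.CosmosContainerForQuery.GetItemQueryIterator<IdWithSimilarityScore>(queryDefinition, null, requestOptions);

    while (warmupIterator.HasMoreResults)
    {
        await warmupIterator.ReadNextAsync();
    }
}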
Experiments and Evaluation
Now that we’ve covered key concepts, tips, and best practices for large-scale vector indexing and search, let’s put them into action. We’ll use the VectorIndexScenarioSuite code, which makes it easy to run experiments, benchmark performance, and validate the impact of different configuration choices, such as partitioning strategy, indexing parameters, and query patterns.
Experimental Setup
We will experiment with a billion-vector dataset to illustrate the impact of partitioning on query latency (P50, P95, average), recall, and RU charges. Note that recall@k measures the overlap between the top-k results returned by the index and the true top-k results, essentially the accuracy of the index.
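As a concrete illustration, here is a tiny sketch (names assumed) of how recall@k can be computed from the IDs returned by the index and the ground-truth nearest neighbors for a query:

using System.Collections.Generic;
using System.Linq;

// recall@k = |top-k returned by the index ∩ true top-k| / k
static double RecallAtK(IReadOnlyList<string> retrievedIds, IReadOnlyList<string> groundTruthIds, int k)
{
    int hits = retrievedIds.Take(k).Intersect(groundTruthIds.Take(k)).Count();
    return (double)hits / k;
}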
We use the Microsoft Turing 1B dataset, which consists of 1 billion Bing search queries encoded by Turing AGI v5. It can be downloaded with this script: ms_turing_download.ps1. You could use an alternate dataset in your experiment, including those with 1000+ dimensions; Cosmos DB costs and query latency scale highly sublinearly with the number of dimensions as well.
RU Charges Configuration
We will manually configure the Azure Cosmos DB container with different RU settings to create different numbers of physical partitions. This allows us to observe their impact on RU consumption, query latency, and recall; a sketch of adjusting container throughput follows the list. The RU settings will be:
- 50 * 10,000 = 500,000 RU/s
- 100 * 10,000 = 1,000,000 RU/s
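Here is a minimal sketch of switching between these two settings on an existing container, reusing the container reference from the earlier snippets:

// Move from the 50-partition configuration to the 100-partition configuration.
// Scaling beyond 10,000 RU/s per existing physical partition triggers partition splits,
// which happen asynchronously and can take a while to complete.
await container.ReplaceThroughputAsync(ThroughputProperties.CreateManualThroughput(1_000_000));

// Read the provisioned throughput back to confirm the change.
ThroughputResponse throughputResponse = await container.ReadThroughputAsync(requestOptions: null);
Console.WriteLine($"Provisioned throughput: {throughputResponse.Resource.Throughput} RU/s");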
Latency Comparison
The latency measurements will include:
- Per Partition Query Latency: The time taken to process a query within each individual partition.
- Client End-to-End Latency: The total time taken from the client sending the query to receiving the results, including network latency and any client-side processing.
Latency numbers are commonly reported at different percentiles to give a clear picture of the overall distribution. The recall and latency distributions for top-K (K=10 here) vector search queries after warm-up are listed below. L is the DiskANN search list size, equal to K * searchListSizeMultiplier, where searchListSizeMultiplier is an optional parameter of the VectorDistance query system function that defaults to 10.
(Please note that these numbers are rounded and may improve in the future)
Scenario | P50 Per Partition Latency (ms) L=100/200 | P95 Per Partition Latency (ms) L=100/200 | Avg Per Partition Latency (ms) L=100/200 | Recall@k=10 L=100/200
--- | --- | --- | --- | ---
1B vectors, 500k RU/s, 50 partitions | 29/50 | 49/74 | 31/51 | 89.6/93.9
1B vectors, 1M RU/s, 100 partitions | 28/41 | 37/53 | 28/41 | 91.0/94.7
(Please note that these numbers are rounded and may improve in the future)
Scenario | P50 Client Latency (ms) L=100/200 | P95 Client Latency (ms) L=100/200 | Avg Client Latency (ms) L=100/200 | Recall@k=10 L=100/200
--- | --- | --- | --- | ---
1B vectors, 500k RU/s, 50 partitions | 61/86 | 78.33/113 | 61/88 | 89.6/93.9
1B vectors, 1M RU/s, 100 partitions | 71/100 | 85/120 | 72/100 | 91.0/94.7
Vector Search Cost
Finally, cost is an important consideration when choosing a vector database. The number of vectors, their dimensionality, and the query rate all influence the cost. In Azure Cosmos DB for NoSQL, vector searches are charged in Request Units (RUs) like any other query (learn more about RUs here). In the table below, we list the RU cost distributions for querying the TOP K = 10 closest vectors over the 1-billion-vector dataset.
(Please note that these numbers are rounded and may improve in the future)
Scenario | P50 RU Cost L=100/200 | P95 RU Cost L=100/200 | Avg RU Cost L=100/200
--- | --- | --- | ---
1B vectors, 500k RU/s, 50 partitions | 1,498/2,425 | 1,770/2,889 | 1,476/2,368
1B vectors, 1M RU/s, 100 partitions | 3,010/4,930 | 3,504/5,747 | 2,965/4,801
In the MS Turing 1B scenario, the data is distributed across multiple physical partitions, created either through pre-configured throughput or automatic scaling. To translate RUs into monthly cost (in USD) in a production environment, we use the pricing below for Azure Cosmos DB for NoSQL in the West US 2 region:
- Pricing (Full Azure Cosmos DB pricing details can be found here)
- Azure Cosmos DB for NoSQL, Provisioned Throughput (Manual): $0.008 per 100 RU/s for 1 hour
- Azure Cosmos DB for NoSQL, Provisioned Throughput (Autoscale): $0.012 per 100 RU/s for 1 hour
- Azure Cosmos DB for NoSQL, Serverless: $0.25 per 1,000,000 RUs
- Throughput:
- Manual: RU/s is manually configured to sustain 10 and 100 Queries Per Second (QPS) 100% of the time.
- Autoscale: RU/s configured to sustain the following scenarios; autoscale will automatically scale in/out to meet the throughput demand (QPS).
- 10 QPS for 100% of the time
- 10 QPS for 50% of the time, 1 QPS for 50% of the time
- 100 QPS for 100% of the time
- 100 QPS for 50% of the time, 10 QPS for 50% of the time
- RU cost estimate: We use the P95 RU cost estimate of 1,770 for a TOP 10 vector search. The container is provisioned with 500,000 RU/s during vector ingestion, then adjusted according to the different query patterns.
(Please note that these numbers are rounded and may improve in the future)
Scenario | Manual (100% 10 QPS) | Manual (100% 100 QPS) | Autoscale (50% 1 QPS, 50% 10 QPS) | Autoscale (100% 10 QPS) | Autoscale (50% 10 QPS, 50% 100 QPS) | Autoscale (100% 100 QPS) | Serverless (100% 10 QPS)
--- | --- | --- | --- | --- | --- | --- | ---
1B vectors | $1,054 | $10,535 | $917 | $1,580 | $8,428 | $15,803 | $11,470
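As a rough sanity check on the table above, here is a back-of-the-envelope sketch of how the manual and serverless estimates can be derived. The constants are our assumptions chosen to roughly reproduce the published figures (a ~31-day month of hours for provisioned throughput, a ~30-day month of seconds for serverless, and provisioned RU/s approximated as QPS x RU-per-query); the exact rounding in the table may differ.

const double RuPerQuery = 1_770;          // P95 RU cost of a TOP 10 query at L=100
const double HoursPerMonth = 744;         // ~31 days
const double SecondsPerMonth = 2_592_000; // ~30 days

// Manual provisioned throughput: $0.008 per 100 RU/s per hour.
static double ManualMonthlyCost(double qps) =>
    (qps * RuPerQuery / 100) * 0.008 * HoursPerMonth;

// Serverless: $0.25 per 1,000,000 RUs actually consumed.
static double ServerlessMonthlyCost(double qps) =>
    (qps * RuPerQuery * SecondsPerMonth / 1_000_000) * 0.25;

Console.WriteLine(ManualMonthlyCost(10));      // ~1,054 USD for a steady 10 QPS
Console.WriteLine(ManualMonthlyCost(100));     // ~10,535 USD for a steady 100 QPS
Console.WriteLine(ServerlessMonthlyCost(10));  // ~11,470 USD for a steady 10 QPS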
We can observe several key points from these experiments:
- Azure Cosmos DB can scale to billion-vector indexes while delivering <100 ms query latency with recall exceeding 90%.
- The RU consumption for vector search operations varies with the partition configuration. Configuring the container with higher RU settings (e.g., 100 * 10,000 RU/s) resulted in more physical partitions, which increased the overall RU charge per query. Conversely, fewer, more densely packed partitions lower the query cost.
- The end-to-end client latency is sensitive to the worst server-side latency across partitions. Using fewer partitions with more vectors per partition achieved better latency. As with any query in Azure Cosmos DB, providing a partition key filter where possible will reduce fan-out and make queries faster.
- The recall remained high across different partition configurations, ensuring that the vector search results were relevant. The DiskANN index effectively maintained high recall accuracy even with large-scale vector data.
Recommendations for Designing and Optimizing Large Scale Vector Indexes
Based on the experimental findings, here are some recommendations for designing and optimizing large-scale vector indexes in Azure Cosmos DB:
- Partitioning Strategy: Use fewer partitions packed with as many vectors as possible. This approach helps achieve the best latency and reduces RU charges. The query cost grows logarithmically with the number of vectors in a partition but linearly with the number of partitions.
- Maximize Concurrency: Configure the MaxConcurrency option in the SDK to issue requests to different partitions concurrently. This improves query performance by leveraging parallelism.
- Bulk Ingestion: Enable bulk execution in the SDK to achieve high throughput during data ingestion. Bulk ingestion is particularly effective for large-scale ingestion because each partition will have its own DiskANN index instance.
- Optimize Latency: Measure both per-partition server latency and client end-to-end latency. Co-locate the index in the same Azure region as the client to minimize network overhead, and use maximum concurrency to achieve the best latency.
- Optimize Cost: Begin with a higher RU/s setting so that rapid data ingestion is spread across enough partitions. When serving the index, lower the RU/s manually or use autoscale to match the workload’s request patterns.
Wrapping Up & Looking Forward
In this article, we covered how Azure Cosmos DB for NoSQL enables fast and cost-effective billion-scale vector search. We walked through tips to optimize both vector inserts and vector query performance, and introduced new tools to help you test, measure, and tune your vector search workloads… and we’re just getting started! Check back soon on Microsoft DevBlogs as we continue this series with deep dives into streaming vector indexing, real-time retrieval, and filtering techniques to supercharge your search experience.
Leave a review
Tell us about your Azure Cosmos DB experience! Leave a review on PeerSpot and we’ll gift you $50. Get started here.
About Azure Cosmos DB
Azure Cosmos DB is a fully managed and serverless NoSQL and vector database for modern app development, including AI applications. With its SLA-backed speed and availability as well as instant dynamic scalability, it is ideal for real-time NoSQL and MongoDB applications that require high performance and distributed computing over massive volumes of NoSQL and vector data.
Try Azure Cosmos DB for free here. To stay in the loop on Azure Cosmos DB updates, follow us on X, YouTube, and LinkedIn.