{"id":9928,"date":"2025-04-21T09:02:59","date_gmt":"2025-04-21T16:02:59","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cosmosdb\/?p=9928"},"modified":"2025-04-23T07:35:50","modified_gmt":"2025-04-23T14:35:50","slug":"azure-cosmos-db-with-diskann-part-2-scaling-to-1-billion-vectors-with","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cosmosdb\/azure-cosmos-db-with-diskann-part-2-scaling-to-1-billion-vectors-with\/","title":{"rendered":"Azure Cosmos DB with DiskANN Part 2: Scaling to 1 Billion Vectors with <100ms Search Latency"},"content":{"rendered":"<h1>Introduction<\/h1>\n<p>In the <a href=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/azure-cosmos-db-vector-search-with-diskann-part-1-full-space-search\/\">first part of our series on Azure Cosmos DB Vector Search with DiskANN<\/a>, we explored the fundamentals of vector indexing and search capabilities using Azure Cosmos DB for NoSQL and demonstrated a few early performance and cost characteristics. In Part 2, we\u2019ll demonstrate how to scale to a 1-billion-vector dataset while maintaining a <strong>query latency of under 100 milliseconds with &gt; 90% recall, all in a cost-effective manner. <\/strong>Along the way, we will review concepts like partitioning for managing large-scale vector data in Azure Cosmos DB, along with optimizations and best practices for faster ingestion and reduced cross-partition query latencies. You&#8217;ll learn to design, optimize, and measure large-scale vector indexes in Azure Cosmos DB for efficiency and cost-effectiveness. Let&#8217;s dive in!<\/p>\n<h2>Code<\/h2>\n<p>You can find the code to replicate the experiments in this article on GitHub: <a href=\"https:\/\/github.com\/AzureCosmosDB\/VectorIndexScenarioSuite\">VectorIndexScenarioSuite<\/a>. 
Refer to the <a href=\"https:\/\/github.com\/AzureCosmosDB\/VectorIndexScenarioSuite\/blob\/main\/src\/MSTuringEmbeddingOnlyScenario.cs\"><span style=\"font-family: terminal, monaco, monospace;\">MSTuringEmbeddingOnlyScenario.cs<\/span><\/a> file for a holistic view of the operations. We also recommend that you familiarize yourself with creating vector indexes in Azure Cosmos DB for NoSQL by reviewing <a href=\"https:\/\/devblogs.microsoft.com\/cosmosdb\/azure-cosmos-db-vector-search-with-diskann-part-1-full-space-search\/\">part 1 of this series<\/a>.<\/p>\n<h2>Partitions and scale out<\/h2>\n<p>Azure Cosmos DB for NoSQL uses partitioning to ensure low latency across large indices. 
Internally, it utilizes a single DiskANN instance per physical partition, which allows for faster indexing as each physical partition can use its own RU (Request Unit) allotment to independently index vectors. The number of physical partitions in a container depends on the following characteristics:<\/p>\n<ul>\n<li>The total amount of throughput (RU\/s) provisioned on the container. Each physical partition can provide throughput up to 10,000 RU\/s.<\/li>\n<li>The total amount of data (documents + index) stored in the container.<\/li>\n<\/ul>\n<p>Azure Cosmos DB for NoSQL has <a href=\"https:\/\/learn.microsoft.com\/azure\/cosmos-db\/partitioning-overview#logical-partitions\"><em>logical partitions<\/em><\/a> that are defined by <em>partition keys<\/em>. Many logical partitions are mapped to a single physical partition, which is the unit of data management.<\/p>\n<p>Configuring the container with a throughput of N x 10,000 RU\/s creates N physical partitions. For example, if you create a collection with 50 x 10,000 RU\/s = 500,000 RU\/s, it will create 50 physical partitions. This setup can significantly increase indexing speeds because it distributes the indexing work across all physical partitions.<\/p>\n<h4>Partition Key<\/h4>\n<p>When you choose a partition key, ensure it has high cardinality, meaning it offers a wide range of distinct values, to evenly distribute data and RU usage across partitions and create many logical partitions. For large-scale datasets, consider using hierarchical partition keys to create multi-level partitioning. 
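<\/p>\n<p>As a concrete sketch of the scale-out setup described above, a container provisioned with 500,000 RU\/s of manual throughput starts out with 50 physical partitions. The database instance, the <span style=\"font-family: terminal, monaco, monospace;\">vectors<\/span> container name, and the <span style=\"font-family: terminal, monaco, monospace;\">\/id<\/span> partition key path below are assumptions for illustration (the vector embedding and indexing policies covered in part 1 are omitted for brevity):<\/p>\n<pre class=\"prettyprint language-cs language-csharp\"><code class=\"language-cs language-csharp\">ContainerProperties containerProperties = new(id: \"vectors\", partitionKeyPath: \"\/id\");\r\n\/\/ 50 x 10,000 RU\/s of manual throughput yields 50 physical partitions at creation time.\r\nContainer container = await database.CreateContainerIfNotExistsAsync(\r\n   containerProperties,\r\n   ThroughputProperties.CreateManualThroughput(500000));<\/code><\/pre>\n<p>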
You need to thoughtfully design hierarchical partition keys to ensure the first level maintains high cardinality, maximizing scalability and performance. You can learn more about partitioning in Azure Cosmos DB from <a href=\"https:\/\/learn.microsoft.com\/azure\/cosmos-db\/partitioning-overview#physical-partitions\">our official documentation<\/a>.<\/p>\n<p>In the experiments here, we use the vector ID as the partition key. This ensures a uniform distribution of data across partitions, which helps avoid hotspots and maintain efficiency.<\/p>\n<h2>Optimizations for large-scale vector workloads<\/h2>\n<p>Let\u2019s see how to batch documents to increase indexing throughput and reduce cross-partition query latency by increasing the concurrency of the SDK.<\/p>\n<h3>Faster Ingestion<\/h3>\n<p>The Azure Cosmos DB API allows us to insert multiple documents with vectors in batched requests. This helps lower the number of round trips between your client SDK and the Azure Cosmos DB servers. This also allows the vector indexer to parallelize the indexing of all vectors in the batch. This feature can be enabled by specifying the <em>AllowBulkExecution<\/em> option in the Azure Cosmos DB for NoSQL <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/cosmos-db\/nosql\/sdk-dotnet-v3\">.NET<\/a>, <a href=\"https:\/\/learn.microsoft.com\/azure\/cosmos-db\/nosql\/sdk-java-v4\">Java<\/a>, and <a href=\"https:\/\/learn.microsoft.com\/javascript\/api\/overview\/azure\/cosmos-readme?view=azure-node-latest\">JavaScript<\/a> SDKs. The <a href=\"https:\/\/learn.microsoft.com\/python\/api\/overview\/azure\/cosmos-db?view=azure-python\">Python SDK<\/a> supports transactional batch, but doesn\u2019t support bulk execution at this time. However, you can use <a href=\"https:\/\/github.com\/Azure\/azure-sdk-for-python\/blob\/main\/sdk\/cosmos\/azure-cosmos\/README.md#using-the-async-client-as-a-workaround-to-bulk\">the async client<\/a> for concurrent requests. 
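<\/p>\n<p>With bulk execution enabled (see the snippet below), a common ingestion pattern is to queue one <span style=\"font-family: terminal, monaco, monospace;\">CreateItemAsync<\/span> call per document as a task and await them together, letting the SDK group the pending operations into batches. A minimal sketch, assuming a hypothetical <em>VectorDocument<\/em> type whose <em>Id<\/em> property is the partition key, and an existing <em>container<\/em> instance:<\/p>\n<pre class=\"prettyprint language-cs language-csharp\"><code class=\"language-cs language-csharp\">List&lt;Task&gt; insertTasks = new(documents.Count);\r\nforeach (VectorDocument document in documents)\r\n{\r\n   \/\/ Each call is queued; the bulk dispatcher groups queued operations\r\n   \/\/ into server-side batches (up to 100 operations per batch).\r\n   insertTasks.Add(container.CreateItemAsync(document, new PartitionKey(document.Id)));\r\n}\r\nawait Task.WhenAll(insertTasks);<\/code><\/pre>\n<p>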
The batches that the SDK creates to optimize throughput have a current maximum of 100 operations per batch.<\/p>\n<p>Here is an example of enabling AllowBulkExecution using the .NET SDK.<\/p>\n<pre class=\"prettyprint language-cs language-csharp\"><code class=\"language-cs language-csharp\">CosmosClientOptions cosmosClientOptions = new()\r\n{\r\n   ConnectionMode = ConnectionMode.Direct,\r\n   AllowBulkExecution = true,\r\n   \/\/ SDK will handle throttling and retry after recommended time has elapsed.\r\n   \/\/ https:\/\/learn.microsoft.com\/en-us\/azure\/cosmos-db\/nosql\/how-to-migrate-from-bulk-executor-library\r\n   MaxRetryAttemptsOnRateLimitedRequests = 100,\r\n   MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(600)\r\n};<\/code><\/pre>\n<h3>Improving Query Latency<\/h3>\n<p>Low-latency querying is essential for developing responsive AI applications. In a scale-out configuration, the client SDK will query multiple physical partitions to expedite response times and improve latency. The Azure Cosmos DB SDK provides the <em>MaxConcurrency<\/em> option to control the number of concurrent operations run client-side during parallel query execution. If you set it to less than 0, the system automatically decides the number of concurrent operations to run. 
By setting a positive value for <em>MaxConcurrency<\/em>, you can increase the number of concurrent operations to the specified value and reduce latency.<\/p>\n<pre class=\"prettyprint language-cs language-csharp\"><code class=\"language-cs language-csharp\">var requestOptions = new QueryRequestOptions()\r\n{\r\n   MaxConcurrency = 64,\r\n};\r\nFeedIterator&lt;IdWithSimilarityScore&gt; queryResultSetIterator =\r\nthis.CosmosContainerForQuery.GetItemQueryIterator&lt;IdWithSimilarityScore&gt;(queryDefinition, null, requestOptions);<\/code><\/pre>\n<p>A few best practices to consider to optimize latency:<\/p>\n<ol>\n<li>Ensure clients are in the same region as your Azure Cosmos DB and AI model resources.<\/li>\n<li>Reuse client instances (Recommended singleton per AppDomain in C#).<\/li>\n<li>Avoid a \u201ccold start\u201d of the Azure Cosmos DB SDK by issuing warm-up queries before evaluation.<\/li>\n<\/ol>\n<h2>Experiments and Evaluation<\/h2>\n<p>Now that we&#8217;ve covered key concepts, tips, and best practices for large-scale vector indexing and search, let&#8217;s put them into action. We&#8217;ll use the <a href=\"https:\/\/github.com\/AzureCosmosDB\/VectorIndexScenarioSuite\">VectorIndexScenarioSuite<\/a> code, which makes it easy to run experiments, benchmark performance, and validate the impact of different configuration choices, such as partitioning strategy, indexing parameters, and query patterns.<\/p>\n<h3>Experimental Setup<\/h3>\n<p>We will experiment with a billion-vector dataset to illustrate the impact of partitioning on query latency (P50, P95, average), recall, and RU charges. Note that recall@k measures the overlap between the top-k results returned by the index and the actual top-k results \u2013 essentially the accuracy of the index.<\/p>\n<p>We use the Microsoft Turing 1B dataset which consists of 1 billion Bing search queries encoded by Turing AGI v5. 
It can be downloaded with this script: <a href=\"https:\/\/github.com\/AzureCosmosDB\/VectorIndexScenarioSuite\/blob\/main\/dataset\/ms_turing_download.ps1\">ms_turing_download.ps1<\/a>. You could use an alternate dataset in your experiment, including those with 1000+ dimensions. Cosmos DB costs and query latency scale sublinearly with the number of dimensions as well.<\/p>\n<h3>RU Charges Configuration<\/h3>\n<p>We will manually configure the Azure Cosmos DB container with different RU settings to create different numbers of physical partitions. This allows us to observe their impact on RU consumption, query latency, and recall. The RU settings will be:<\/p>\n<ul>\n<li>50 * 10,000 = 500,000 RU\/s<\/li>\n<li>100 * 10,000 = 1,000,000 RU\/s<\/li>\n<\/ul>\n<h3>Latency Comparison<\/h3>\n<p>The latency measurements will include:<\/p>\n<ul>\n<li><strong>Per Partition Query Latency<\/strong>: The time taken to process a query within each individual partition.<\/li>\n<li><strong>Client End-to-End Latency<\/strong>: The total time taken from the client sending the query to receiving the results, including network latency and any client-side processing.<\/li>\n<\/ul>\n<p>Latency numbers are commonly reported at different percentiles for a clear picture of the overall distribution. The recall and latency distribution for top-K (K=10 here) vector search queries after warm-up are listed below. 
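<\/p>\n<p>For reference, the vector search queries measured here follow the standard TOP-K pattern built on the <a href=\"https:\/\/learn.microsoft.com\/azure\/cosmos-db\/nosql\/query\/vectordistance\">VectorDistance<\/a> system function. A representative sketch in the .NET SDK, where the <em>embedding<\/em> property name and the <em>queryVector<\/em> variable are assumptions for illustration:<\/p>\n<pre class=\"prettyprint language-cs language-csharp\"><code class=\"language-cs language-csharp\">\/\/ TOP 10 nearest neighbors, ordered by similarity to the query vector.\r\nQueryDefinition queryDefinition = new QueryDefinition(\r\n   \"SELECT TOP 10 c.id, VectorDistance(c.embedding, @embedding) AS similarityScore \" +\r\n   \"FROM c ORDER BY VectorDistance(c.embedding, @embedding)\")\r\n   .WithParameter(\"@embedding\", queryVector);<\/code><\/pre>\n<p>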
L is the DiskANN search list size parameter and equals K*searchListSizeMultiplier, where searchListSizeMultiplier is an optional parameter in the <a href=\"https:\/\/learn.microsoft.com\/azure\/cosmos-db\/nosql\/query\/vectordistance\">VectorDistance<\/a> query system function that is set to 10 by default.<\/p>\n<p><em>(Please note that these numbers are rounded and may improve in the future)<\/em><\/p>\n<table class=\" aligncenter\" style=\"height: 310px; width: 100%;\">\n<tbody>\n<tr>\n<td><strong>Scenario<\/strong><\/td>\n<td style=\"text-align: center;\"><strong>P50 Per Partition Latency (ms) <\/strong><\/p>\n<p>L=100\/200<\/td>\n<td style=\"text-align: center;\"><strong>P95 Per Partition Latency (ms) <\/strong><\/p>\n<p>L=100\/200<\/td>\n<td style=\"text-align: center;\"><strong>Avg Per Partition Latency (ms) <\/strong><\/p>\n<p>L=100\/200<\/td>\n<td style=\"text-align: center;\"><strong>Recall@k=10<\/strong><\/p>\n<p>L=100\/200<\/td>\n<\/tr>\n<tr>\n<td>1B vectors<\/p>\n<p>500k RU\/s, 50 partitions<\/td>\n<td style=\"text-align: center;\">29\/50<\/td>\n<td style=\"text-align: center;\">49\/74<\/td>\n<td style=\"text-align: center;\">31\/51<\/td>\n<td style=\"text-align: center;\">89.6\/93.9<\/td>\n<\/tr>\n<tr>\n<td>1B vectors<\/p>\n<p>1M RU\/s, 100 partitions<\/td>\n<td style=\"text-align: center;\">28\/41<\/td>\n<td style=\"text-align: center;\">37\/53<\/td>\n<td style=\"text-align: center;\">28\/41<\/td>\n<td style=\"text-align: center;\">91.0\/94.7<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p><em>(Please note that these numbers are rounded and may improve in the future)<\/em><\/p>\n<table class=\" aligncenter\" style=\"height: 234px; width: 100%;\">\n<tbody>\n<tr>\n<td style=\"width: 20.0443%;\"><strong>Scenario<\/strong><\/td>\n<td style=\"width: 22.1484%; text-align: center;\"><strong>P50 Client Latency (ms) <\/strong><\/td>\n<td style=\"width: 22.0377%; text-align: center;\"><strong>P95 Client Latency (ms) <\/strong><\/td>\n<td style=\"width: 21.9269%; text-align: 
center;\"><strong>Avg Client Latency (ms) <\/strong><\/td>\n<td style=\"width: 31.848%; text-align: center;\"><strong>Recall@k=10<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 20.0443%;\">1B vectors<\/p>\n<p>500k RU\/s, 50 partitions<\/td>\n<td style=\"width: 22.1484%; text-align: center;\">61\/86<\/td>\n<td style=\"width: 22.0377%; text-align: center;\">78.33\/113<\/td>\n<td style=\"width: 21.9269%; text-align: center;\">61\/88<\/td>\n<td style=\"width: 31.848%; text-align: center;\">89.6\/93.9<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 20.0443%;\">1B vectors<\/p>\n<p>1M RU\/s, 100 partitions<\/td>\n<td style=\"width: 22.1484%; text-align: center;\">71\/100<\/td>\n<td style=\"width: 22.0377%; text-align: center;\">85\/120<\/td>\n<td style=\"width: 21.9269%; text-align: center;\">72\/100<\/td>\n<td style=\"width: 31.848%; text-align: center;\">91.0\/94.7<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>Vector Search Cost<\/h3>\n<p>Finally, cost is an important consideration when choosing a vector database. The number of vectors, their dimension and query rate influence the cost. In Azure Cosmos DB for NoSQL, vector searches are charged as Request Units (RUs) like any other query (learn more about RUs <a href=\"https:\/\/learn.microsoft.com\/azure\/cosmos-db\/request-units\">here<\/a>). 
In the table below, we list RU cost distributions for querying the TOP K = 10 closest vectors over the 1-billion vector dataset.<\/p>\n<p style=\"text-align: center;\"><em>(Please note that these numbers are rounded and may improve in the future)<\/em><\/p>\n<p>&nbsp;<\/p>\n<table class=\" aligncenter\" style=\"height: 319px; width: 100%;\">\n<tbody>\n<tr style=\"height: 96px;\">\n<td style=\"width: 34.3753%; height: 96px;\"><strong>Scenario<\/strong><\/td>\n<td style=\"width: 17.9503%; height: 96px; text-align: center;\"><strong>P50 RU Cost<\/strong><\/p>\n<p>L=100\/200<\/td>\n<td style=\"width: 24.3678%; height: 96px; text-align: center;\"><strong>P95 RU Cost<\/strong><\/p>\n<p>L=100\/200<\/td>\n<td style=\"width: 90.6926%; height: 96px; text-align: center;\"><strong>Avg RU Cost<\/strong><\/p>\n<p>L=100\/200<\/td>\n<\/tr>\n<tr style=\"height: 96px;\">\n<td style=\"width: 34.3753%; height: 96px;\">1B vectors<\/p>\n<p>500k RU\/s, 50 partitions<\/td>\n<td style=\"width: 17.9503%; height: 96px; text-align: center;\">1,498\/2,425<\/td>\n<td style=\"width: 24.3678%; height: 96px; text-align: center;\">1,770\/2,889<\/td>\n<td style=\"width: 90.6926%; height: 96px; text-align: center;\">1,476\/2,368<\/td>\n<\/tr>\n<tr style=\"height: 101px;\">\n<td style=\"width: 34.3753%; height: 101px;\">1B vectors<\/p>\n<p>1M RU\/s, 100 partitions<\/td>\n<td style=\"width: 17.9503%; height: 101px; text-align: center;\">3,010\/4,930<\/td>\n<td style=\"width: 24.3678%; height: 101px; text-align: center;\">3,504\/5,747<\/td>\n<td style=\"width: 90.6926%; height: 101px; text-align: center;\">2,965\/4,801<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>In the MS Turing 1B scenario, the data is distributed across multiple physical partitions, whether pre-configured manually or created through automatic scaling. 
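<\/p>\n<p>As a back-of-the-envelope check on how these RU charges turn into monthly dollar figures, the arithmetic for manually provisioned throughput can be sketched as follows. The helper name is hypothetical, and it assumes the manual rate of $0.008 per 100 RU\/s per hour and a 744-hour (31-day) month:<\/p>\n<pre class=\"prettyprint language-cs language-csharp\"><code class=\"language-cs language-csharp\">static double MonthlyManualCost(double qps, double ruPerQuery)\r\n{\r\n   double ruPerSecond = qps * ruPerQuery;   \/\/ e.g. 100 QPS * 1,770 RU = 177,000 RU\/s\r\n   double ratePer100RuPerHour = 0.008;      \/\/ manual provisioned throughput rate\r\n   double hoursPerMonth = 744;              \/\/ 31-day month\r\n   return ruPerSecond \/ 100 * ratePer100RuPerHour * hoursPerMonth;\r\n}\r\n\/\/ MonthlyManualCost(100, 1770) = 10535.04, i.e. about $10,535 per month<\/code><\/pre>\n<p>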
To translate RUs into monthly cost (in USD) in a production environment, we use the RU cost table below for Azure Cosmos DB for NoSQL in the West US 2 region:<\/p>\n<ul>\n<li><strong>Pricing<\/strong> (Full Azure Cosmos DB pricing details can be found <a href=\"https:\/\/azure.microsoft.com\/en-us\/pricing\/details\/cosmos-db\/autoscale-provisioned\/\">here<\/a>)\n<ul>\n<li>Azure Cosmos DB for NoSQL, Provisioned Throughput (Manual): $0.008 per 100 RU\/s for 1 hour<\/li>\n<li>Azure Cosmos DB for NoSQL, Provisioned Throughput (Autoscale): $0.012 per 100 RU\/s for 1 hour<\/li>\n<li>Azure Cosmos DB for NoSQL, Serverless: $0.25 per 1,000,000 RUs<\/li>\n<\/ul>\n<\/li>\n<li><strong>Throughput:<\/strong>\n<ul>\n<li><strong>Manual: <\/strong>RU\/s is manually configured to sustain 10 and 100 Queries Per Second (QPS) 100% of the time.<\/li>\n<li><strong>Autoscale: <\/strong>RU\/s configured to sustain the following scenarios; autoscale will automatically scale in\/out to meet the throughput demand (QPS).\n<ul>\n<li>10 QPS for 100% of the time<\/li>\n<li>10 QPS for 50% of the time, 1 QPS for 50% of the time<\/li>\n<li>100 QPS for 100% of the time<\/li>\n<li>100 QPS for 50% of the time, 10 QPS for 50% of the time<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li><strong>RU cost estimate: <\/strong>We use the P95 RU cost estimate of 1,770 for a TOP 10 vector search. 
The container is provisioned with 500,000 RU\/s during vector ingestion, then adjusted according to different query patterns.<\/li>\n<\/ul>\n<p style=\"text-align: center;\"><em>(Please note that these numbers are rounded and may improve in the future)<\/em><\/p>\n<table style=\"height: 253px; width: 100%;\">\n<tbody>\n<tr style=\"height: 120px;\">\n<td style=\"height: 120px; width: 7.20299%;\"><strong>Scenario<\/strong><\/td>\n<td style=\"height: 120px; width: 10.29%;\"><strong>Manual<\/strong><\/p>\n<p>(100% 10 QPS)<\/td>\n<td style=\"height: 120px; width: 11.2254%;\"><strong>Manual<\/strong><\/p>\n<p>(100% 100 QPS)<\/td>\n<td style=\"height: 120px; width: 18.522%;\"><strong>Autoscale<\/strong><\/p>\n<p>(50% 1 QPS, 50% 10 QPS)<\/td>\n<td style=\"height: 120px; width: 10.0094%;\"><strong>Autoscale<\/strong><\/p>\n<p>(100% 10 QPS)<\/td>\n<td style=\"height: 120px; width: 19.3639%;\"><strong>Autoscale<\/strong><\/p>\n<p>(50% 10 QPS, 50% 100 QPS)<\/td>\n<td style=\"height: 120px; width: 11.2254%;\"><strong>Autoscale<\/strong><\/p>\n<p>(100% 100 QPS)<\/td>\n<td style=\"height: 120px; width: 31.3377%;\"><strong>Serverless<\/strong><\/p>\n<p>(100% 10 QPS)<\/td>\n<\/tr>\n<tr style=\"height: 93px;\">\n<td style=\"height: 93px; width: 7.20299%;\">1B vectors<\/td>\n<td style=\"height: 93px; width: 10.29%;\">$1,054<\/td>\n<td style=\"height: 93px; width: 11.2254%;\">$10,535<\/td>\n<td style=\"height: 93px; width: 18.522%;\">$917<\/td>\n<td style=\"height: 93px; width: 10.0094%;\">$1,580<\/td>\n<td style=\"height: 93px; width: 19.3639%;\">$8,428<\/td>\n<td style=\"height: 93px; width: 11.2254%;\">$15,803<\/td>\n<td style=\"height: 93px; width: 31.3377%;\">$11,470<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>We can observe several key points from these experiments:<\/p>\n<ul>\n<li>Azure Cosmos DB can scale to billion-vector indices while allowing &lt;100ms query latency with recall exceeding 90%.<\/li>\n<li>The RU consumption for vector search operations varies with 
partition configurations. Configuring the container with higher RU settings (e.g., 100 * 10,000 RU\/s) resulted in more physical partitions, which increased the overall RU charges. However, fewer, more densely packed partitions result in a lower overall cost.<\/li>\n<li>The end-to-end client latency is sensitive to the worst latency on the server side. Using fewer partitions with more vectors per partition achieved better latency. As with any query in Azure Cosmos DB, providing a partition key filter for your data, where possible, will reduce fan-out and make queries faster.<\/li>\n<li>The recall remained high across different partition configurations, ensuring that the vector search results were relevant. The DiskANN index effectively maintained high recall accuracy even with large-scale vector data.<\/li>\n<\/ul>\n<h2>Recommendations for Designing and Optimizing Large-Scale Vector Indexes<\/h2>\n<p>Based on the experimental findings, here are some recommendations for designing and optimizing large-scale vector indexes in Azure Cosmos DB:<\/p>\n<ul>\n<li><strong>Partitioning Strategy: <\/strong>Use fewer partitions packed with as many vectors as possible. This approach helps achieve the best latency and reduces RU charges. The query cost grows logarithmically with the number of vectors in a partition but linearly with the number of partitions.<\/li>\n<li><strong>Maximize Concurrency:<\/strong> Configure the <em>MaxConcurrency<\/em> option in the SDK to issue requests to different partitions concurrently. This improves query performance by leveraging parallelism.<\/li>\n<li><strong>Bulk Ingestion:<\/strong> Enable bulk execution in the SDK to achieve high throughput during data ingestion. Bulk ingestion is particularly effective for large-scale ingestion because each partition will have its own DiskANN index instance.<\/li>\n<li><strong>Optimize Latency: <\/strong>Measure both server latency per partition and client end-to-end latency. 
Co-locate the index in the same Azure region as the client to minimize network latency, and use maximum concurrency to achieve the best latency.<\/li>\n<li><strong>Optimize Cost:<\/strong> Begin with a higher Request Unit (RU\/s) setting so that there are enough physical partitions for rapid data ingestion. When serving the index, configure lower RUs manually or use automatic scaling to match the workload&#8217;s request patterns.<\/li>\n<\/ul>\n<h2>Wrapping Up &amp; Looking Forward<\/h2>\n<p>In this article, we covered how Azure Cosmos DB for NoSQL enables fast and cost-effective billion-scale vector search. We walked through tips to optimize both vector inserts and vector query performance, and introduced new tools to help you test, measure, and tune your vector search workloads&#8230; and we\u2019re just getting started! Check back soon on Microsoft DevBlogs as we continue this series with deep dives into streaming vector indexing, real-time retrieval, and filtering techniques to supercharge your search experience.<\/p>\n<h2>Leave a review<\/h2>\n<p>Tell us about your Azure Cosmos DB experience! Leave a review on PeerSpot and we\u2019ll gift you $50. <a href=\"https:\/\/peerspotdotcom.my.site.com\/proReviews\/?SalesOpportunityProduct=00kPy000004TKXJIA4&amp;productPeerspotNumber=30881&amp;CalendlyAccount=peerspot&amp;CalendlyFormLink=peerspot-product-reviews-ps-gc-vi-sf-50&amp;giftCard=50\">Get started here<\/a>.<\/p>\n<h2>About Azure Cosmos DB<\/h2>\n<p>Azure Cosmos DB is a fully managed and serverless NoSQL and vector database for modern app development, including AI applications. 
With its SLA-backed speed and availability as well as instant dynamic scalability, it is ideal for real-time NoSQL and MongoDB applications that require high performance and distributed computing over massive volumes of NoSQL and vector data.<\/p>\n<p><a href=\"https:\/\/cosmos.azure.com\/try\/\">Try Azure Cosmos DB for free here.<\/a> To stay in the loop on Azure Cosmos DB updates, follow us on <a href=\"https:\/\/twitter.com\/AzureCosmosDB\">X<\/a>, <a href=\"https:\/\/aka.ms\/AzureCosmosDBYouTube\">YouTube<\/a>, and <a href=\"https:\/\/www.linkedin.com\/company\/azure-cosmos-db\/\">LinkedIn<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction In the first part of our series on Azure Cosmos DB Vector Search with DiskANN, we explored the fundamentals of vector indexing and search capabilities using Azure Cosmos DB for NoSQL and demonstrated a few early performance and cost characteristics. In Part 2, we\u2019ll demonstrate how to scale to 1 billion vector datasets, while [&hellip;]<\/p>\n","protected":false},"author":118435,"featured_media":9934,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1610,14],"tags":[1176,499,1808,1917,1797,1866,1867,1868],"class_list":["post-9928","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","category-core-sql-api","tag-autoscale","tag-azure-cosmos-db","tag-best-practices","tag-diskann","tag-performance","tag-vector-database","tag-vector-db","tag-vector-search"],"acf":[],"blog_post_summary":"<p>Introduction In the first part of our series on Azure Cosmos DB Vector Search with DiskANN, we explored the fundamentals of vector indexing and search capabilities using Azure Cosmos DB for NoSQL and demonstrated a few early performance and cost characteristics. 
In Part 2, we\u2019ll demonstrate how to scale to 1 billion vector datasets, while [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/posts\/9928","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/users\/118435"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/comments?post=9928"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/posts\/9928\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/media\/9934"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/media?parent=9928"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/categories?post=9928"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/tags?post=9928"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}