{"id":152,"date":"2019-11-11T09:00:03","date_gmt":"2019-11-11T17:00:03","guid":{"rendered":"http:\/\/devblogs.microsoft.com\/cosmosdb\/?p=152"},"modified":"2021-02-19T14:02:55","modified_gmt":"2021-02-19T22:02:55","slug":"introducing-bulk-support-in-the-net-sdk","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cosmosdb\/introducing-bulk-support-in-the-net-sdk\/","title":{"rendered":"Introducing Bulk support in the .NET SDK"},"content":{"rendered":"<p>The Azure Cosmos DB .NET SDK has <a href=\"https:\/\/docs.microsoft.com\/azure\/cosmos-db\/sql-api-sdk-dotnet-standard#-340---2019-11-04\" target=\"_blank\" rel=\"noopener noreferrer\">recently released<\/a> Bulk support in version 3.4.0.<\/p>\n<h2>What exactly is \u201cBulk\u201d?<\/h2>\n<p>Bulk refers to scenarios that require a high degree of throughput, where you need to dump a <strong>big volume of data<\/strong>, and you need to do it with <strong>as much throughput as possible<\/strong>.<\/p>\n<p>Are you doing a nightly dump of 2 million records into your Cosmos DB container? That is bulk.<\/p>\n<p>Are you processing a stream of data that comes in batches of 100 thousand items you need to update? That is bulk too.<\/p>\n<p>Are you dynamically generating groups of operations that execute concurrently? Yeah, that is also bulk.<\/p>\n<h2>What is NOT \u201cBulk\u201d?<\/h2>\n<p>Bulk is about throughput, <strong>not latency<\/strong>. 
So, if you are working on a latency-sensitive scenario, where point operations need to resolve as quickly as possible, then bulk is not the right tool.<\/p>\n<h2>How do I enable Bulk in .NET SDK V3?<\/h2>\n<p>Enabling bulk is rather easy: when you create the <strong>CosmosClient<\/strong>, set the <strong>AllowBulkExecution<\/strong> flag in the <strong>CosmosClientOptions<\/strong> like so:<\/p>\n<pre class=\"lang:c# decode:true\" title=\"Creating a CosmosClient with Bulk enabled\">CosmosClientOptions options = new CosmosClientOptions() { AllowBulkExecution = true };\r\nCosmosClient cosmosClient = new CosmosClient(connectionString, options);<\/pre>\n<p>After that, all you need to do is get an instance of the <strong>Container<\/strong> where you want to perform the operations, and create a <strong>list of Tasks<\/strong> that represent the operations you want to perform based on your input data (I\u2019m omitting how the data is obtained from the source for simplicity\u2019s sake, and because the origin of the data will vary in each case):<\/p>\n<pre class=\"lang:c# decode:true\" title=\"Creating a list of concurrent Tasks\">Container container = cosmosClient.GetContainer(\"myDb\", \"myCollection\");\r\n\r\n\/\/ Assuming you have your data available to be inserted or read\r\nList&lt;Task&gt; concurrentTasks = new List&lt;Task&gt;();\r\nforeach(Item itemToInsert in ReadYourData())\r\n{\r\n    concurrentTasks.Add(container.CreateItemAsync(itemToInsert, new PartitionKey(itemToInsert.MyPk)));\r\n}\r\n\r\nawait Task.WhenAll(concurrentTasks);\r\n<\/pre>\n<h2>Wait, what is happening with those Tasks?<\/h2>\n<p>Normally, if you issue a hundred CreateItemAsync operations <strong>in parallel<\/strong>, each one will generate a service request and response independently. 
The illustration below shows what execution looks like when individual threads each insert new items.<\/p>\n<p><img decoding=\"async\" class=\"aligncenter wp-image-153 size-full\" src=\"http:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2019\/11\/nobulk.gif\" alt=\"Operations resolving when Bulk support is not enabled\" width=\"960\" height=\"384\" \/><\/p>\n<p>But the magic here happens when the SDK detects that the Bulk mode is enabled.<\/p>\n<p>The SDK will create batches: all the <strong>concurrent<\/strong> operations will be <strong>grouped<\/strong> by <a href=\"https:\/\/docs.microsoft.com\/azure\/cosmos-db\/partition-data\" target=\"_blank\" rel=\"noopener noreferrer\">physical partition<\/a> affinity and distributed across these batches. When a batch fills up, it gets dispatched, and a new batch is created to be filled with more concurrent operations. Each batch will contain many operations, so this <strong>greatly reduces the number of back-end requests<\/strong>. There could be many batches being dispatched in parallel targeting different partitions, so the more evenly distributed the operations, the better the results.<\/p>\n<p>Now, here comes the best part.<\/p>\n<p>When you (the user) call an operation method (CreateItemAsync in this example), the SDK returns a <strong>Task<\/strong>. In a normal, non-bulk CosmosClient, this Task represents the service request for that operation, and completes when the request gets its response. But in Bulk, that Task holds the <strong>promise of a result<\/strong>; it does not map to a service request. The SDK is grouping and squeezing operations into batches.<\/p>\n<p>When a batch is completed, the SDK then <strong>unwraps<\/strong> all the results for all the operations the batch contained and <strong>completes the related Tasks<\/strong> with the result. 
In the illustration below, you can see how batching makes things more efficient and allows you to consume more throughput than you could with individual threads.<\/p>\n<p><img decoding=\"async\" class=\"aligncenter size-full wp-image-154\" src=\"http:\/\/devblogs.microsoft.com\/cosmosdb\/wp-content\/uploads\/sites\/52\/2019\/11\/bulk.gif\" alt=\"Operations resolving when Bulk support is enabled\" width=\"960\" height=\"384\" \/><\/p>\n<p>From the user\u2019s perspective, it\u2019s transparent.<\/p>\n<h2>What are the caveats?<\/h2>\n<p>Since Bulk is geared towards obtaining as much throughput as possible, and the volume of data will be higher than when executing the operations individually, the <strong>provisioned throughput<\/strong> (<a href=\"https:\/\/docs.microsoft.com\/azure\/cosmos-db\/request-units\" target=\"_blank\" rel=\"noopener noreferrer\">RU\/s<\/a>) consumed will be higher. Make sure you adjust it based on the volume of operations you want to push.<\/p>\n<p>If your container gets throttled during a Bulk process, it means that the provisioned throughput is now the bottleneck; raising it will let you push even more operations per second. The .NET SDK will automatically retry when throttling happens, but the overall effect will be that processing the data will be slower.<\/p>\n<p>Another caveat is the <strong>size of the documents<\/strong>. The batches that the SDK creates to optimize throughput have a current maximum of 2 MB or 100 operations per batch. The smaller the documents, the greater the optimization that can be achieved (the bigger the documents, the more batches need to be used).<\/p>\n<p>And finally, the <strong>number of operations<\/strong>. 
As mentioned before, the SDK constructs batches and groups operations. When a batch is full, it gets dispatched, but if the batch doesn&#8217;t fill up, there is a <strong>timer<\/strong> that will dispatch it to make sure its operations complete. This timer is currently set to 100 milliseconds. So if the batch does not get filled up (for example, you are just sending 50 concurrent operations), then the overall latency might be affected.<\/p>\n<p>In general, to <strong>troubleshoot<\/strong> this scenario it is good to:<\/p>\n<ul>\n<li>Check the <a href=\"https:\/\/docs.microsoft.com\/azure\/cosmos-db\/use-metrics#understand-how-many-requests-are-succeeding-or-causing-errors\" target=\"_blank\" rel=\"noopener noreferrer\">metrics on the Azure Portal<\/a>. A rise in HTTP 429s during your Bulk operation would indicate throttling.<\/li>\n<li>Log the <a href=\"https:\/\/docs.microsoft.com\/dotnet\/api\/microsoft.azure.cosmos.itemresponse-1.diagnostics?view=azure-dotnet#Microsoft_Azure_Cosmos_ItemResponse_1_Diagnostics\" target=\"_blank\" rel=\"noopener noreferrer\">Diagnostics<\/a> property, which helps identify whether retries are happening and understand the dispatch times.<\/li>\n<li>Collect <a href=\"https:\/\/github.com\/Azure\/azure-cosmos-dotnet-v2\/blob\/master\/docs\/documentdb-sdk_capture_etl.md\" target=\"_blank\" rel=\"noopener noreferrer\">tracing from the SDK<\/a> operations as an alternative.<\/li>\n<\/ul>\n<h2>Are there good practices or tips?<\/h2>\n<p class=\"ew\"><strong>Yes!<\/strong> Whenever possible, <strong>provide the PartitionKey<\/strong> to your operations, even if you are using the Typed APIs. This avoids the SDK needing to extract it from your data.<\/p>\n<p id=\"3f4e\" class=\"ew\" data-selectable-paragraph=\"\">If your information already comes as a Stream, use the <strong>Stream APIs<\/strong> (for example, CreateItemStreamAsync); this avoids any and all serialization.<\/p>\n<p data-selectable-paragraph=\"\">In 
scenarios where your information might already be separated by partition key, you can choose to create one <strong>Worker Task<\/strong> per partition key, and each Worker Task can spawn and coordinate the Tasks that do each item operation, like so:<\/p>\n<pre class=\"lang:c# decode:true \" title=\"Worker tasks approach\">public async Task Worker(Container container, string partitionKeyValue)\r\n{\r\n    PartitionKey partitionKey = new PartitionKey(partitionKeyValue);\r\n    int maxDegreeOfParallelismPerWorker = 10; \/\/ This number is just an example\r\n\r\n    \/\/ Obtain the items from the source, and while there are items to be saved in the container, generate groups of Tasks\r\n    while (true)\r\n    {\r\n        List&lt;Item&gt; itemsToSave = await GetMoreDataForPartitionKeyUpTo(maxDegreeOfParallelismPerWorker);\r\n        if (itemsToSave.Count == 0)\r\n        {\r\n            break; \/\/ Nothing more to import\r\n        }\r\n        List&lt;Task&gt; concurrentTasks = new List&lt;Task&gt;(maxDegreeOfParallelismPerWorker);\r\n        foreach(Item itemToInsert in itemsToSave)\r\n        {\r\n            concurrentTasks.Add(container.CreateItemAsync(itemToInsert, partitionKey));\r\n        }\r\n\r\n        await Task.WhenAll(concurrentTasks);\r\n    }\r\n}\r\n\r\n\r\nContainer container = cosmosClient.GetContainer(\"myDb\", \"myCollection\");\r\n\r\n\/\/ Start one Worker per known partition key and wait for all of them to finish\r\nList&lt;Task&gt; workerTasks = new List&lt;Task&gt;();\r\nforeach(string partitionKey in KnownPartitionKeys)\r\n{\r\n    workerTasks.Add(Worker(container, partitionKey));\r\n}\r\n\r\nawait Task.WhenAll(workerTasks);\r\n\r\n<\/pre>\n<h2>Next steps<\/h2>\n<p data-selectable-paragraph=\"\">If you want to try out Bulk, you can follow our <a href=\"https:\/\/docs.microsoft.com\/azure\/cosmos-db\/tutorial-sql-api-dotnet-bulk-import\" target=\"_blank\" rel=\"noopener noreferrer\">quickstart<\/a>. 
Please share any feedback on our <a href=\"https:\/\/github.com\/Azure\/azure-cosmos-dotnet-v3\/issues\" target=\"_blank\" rel=\"noopener noreferrer\">official GitHub repository<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Learn how to use the Azure Cosmos DB .NET SDK Bulk support to create high throughput data migration and ingestion applications that are optimized to take advantage of your provisioned throughput<\/p>\n","protected":false},"author":9477,"featured_media":61,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[14,19],"tags":[],"class_list":["post-152","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-core-sql-api","category-tips-and-tricks"],"acf":[],"blog_post_summary":"<p>Learn how to use the Azure Cosmos DB .NET SDK Bulk support to create high throughput data migration and ingestion applications that are optimized to take advantage of your provisioned 
throughput<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/posts\/152","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/users\/9477"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/comments?post=152"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/posts\/152\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/media\/61"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/media?parent=152"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/categories?post=152"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/tags?post=152"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}