When transferring data with the Azure Storage client libraries, a lot is happening behind the scenes. These inner workings can affect speed, memory usage, and sometimes whether the transfer succeeds. This post will help you get the most out of Storage client library data transfers.
These concepts apply to the Azure.Storage.Blobs and Azure.Storage.Files.DataLake packages. Specifically, we’re looking at APIs that accept StorageTransferOptions as a parameter. Commonly used examples are:
BlobClient.UploadAsync(Stream stream, ...)
BlobClient.UploadAsync(string path, ...)
BlobClient.DownloadToAsync(Stream stream, ...)
BlobClient.DownloadToAsync(string path, ...)
DataLakeFileClient.UploadAsync(Stream stream, ...)
DataLakeFileClient.UploadAsync(string path, ...)
DataLakeFileClient.ReadToAsync(Stream stream, ...)
DataLakeFileClient.ReadToAsync(string path, ...)
StorageTransferOptions
StorageTransferOptions is the key class for tuning your performance. Storage transfers are partitioned into several subtransfers based on the values in this class. Here, you define values for the following properties, which are the basis for managing your transfer:
- MaximumConcurrency: the maximum number of parallel subtransfers that can take place at once.
  - From launch until the present (Azure.Storage.Blobs 12.10.0 and Azure.Storage.Files.DataLake 12.8.0), only asynchronous operations can parallelize transfers. Synchronous operations will ignore this value and work in sequence.
  - The effectiveness of this value is subject to the restrictions set by .NET’s connection pool limit, which may hinder you by default. For more information about these restrictions, see this blog post.
- MaximumTransferSize: the maximum data size of a subtransfer, in bytes.
  - To keep data moving, the client libraries may not always reach this value for every subtransfer for several reasons.
  - Different REST APIs have different maximum values they support for transfer, and those values have changed across service versions. Check your documentation to determine the limits you can select for this value.
You can also define a value for InitialTransferSize. Contrary to what the name suggests, your MaximumTransferSize does not limit this value. In fact, you often want InitialTransferSize to be at least as large as your MaximumTransferSize, if not larger. InitialTransferSize defines a separate data size limit for an initial attempt to do the entire operation at once with no subtransfers. Using a single transfer cuts down on overhead, leading to faster transfers for some data lengths based on your MaximumTransferSize. If unsure of what’s best for you, setting this property to the same value used for MaximumTransferSize is a safe option.
While the class contains nullable values, the client libraries will use defaults for each individual value when not provided. These defaults are fine in a data center environment, but likely unsuitable for home consumer environments. Poorly tuned StorageTransferOptions can result in excessively long operations and even timeouts. You should always be proactive in determining your values for this class.
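As a minimal sketch of how the three properties fit together, the following builds a StorageTransferOptions value and passes it to an upload through BlobUploadOptions. The class name, the method names, and the specific sizes (4 MiB, concurrency of 4) are illustrative assumptions rather than recommendations; tune them for the environment your client runs in.

```csharp
using System.Threading.Tasks;
using Azure.Storage;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

public static class TransferOptionsExample
{
    // Illustrative values only (assumptions, not library defaults).
    public static StorageTransferOptions BuildOptions() => new StorageTransferOptions
    {
        // At most four subtransfers in flight at once (async APIs only).
        MaximumConcurrency = 4,
        // Each subtransfer carries at most 4 MiB.
        MaximumTransferSize = 4 * 1024 * 1024,
        // Try a single request first when the data is 4 MiB or smaller.
        InitialTransferSize = 4 * 1024 * 1024,
    };

    // The caller supplies the BlobClient and a local file path.
    public static async Task UploadFileAsync(BlobClient blob, string localPath)
    {
        var uploadOptions = new BlobUploadOptions { TransferOptions = BuildOptions() };
        await blob.UploadAsync(localPath, uploadOptions);
    }
}
```

Any property left unset stays null, in which case the client library falls back to its own default for that value.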
Uploads
The Storage client libraries will split a given upload stream into various subuploads based on provided StorageTransferOptions, each with their own dedicated REST call. With BlobClient, this operation will be Put Block and with DataLakeFileClient, this operation will be Append Data. The Storage client libraries manage these REST operations in parallel (depending on transfer options) to complete the total upload.
Note: block blobs have a maximum block count of 50,000. Your blob, then, has a maximum size of 50,000 times MaximumTransferSize.
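The arithmetic behind that note can be sketched as follows. The helper names are hypothetical and the code only models the 50,000-block rule stated above; it doesn't account for per-block size limits, which vary by service version.

```csharp
using System;

public static class BlockBlobLimits
{
    // Maximum number of blocks in a block blob, as stated in the note above.
    public const long MaxBlockCount = 50_000;

    // Largest blob that can be uploaded in parts of the given size.
    public static long MaxBlobSizeBytes(long maximumTransferSize)
        => checked(MaxBlockCount * maximumTransferSize);

    // Smallest MaximumTransferSize that keeps a blob of the given size
    // within the block count limit (ceiling division).
    public static long MinTransferSizeFor(long blobSizeBytes)
        => (blobSizeBytes + MaxBlockCount - 1) / MaxBlockCount;
}
```

For example, a MaximumTransferSize of 4 MiB caps the blob at 209,715,200,000 bytes (about 195 GiB), and a 1 TiB blob needs a MaximumTransferSize of at least 21,990,233 bytes.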
Buffering on uploads
The Storage REST layer doesn’t support picking up a REST upload where you left off. Individual transfers are either completed or lost. To ensure resiliency, if a stream isn’t seekable, the Storage client libraries will buffer the data for each individual REST call before starting the upload. Outside of network speed, this behavior is also why you may be interested in setting a smaller value for MaximumTransferSize even when uploading in sequence. MaximumTransferSize is the maximum division of data to be retried after a connection failure.
If uploading with parallel REST calls to maximize network throughput, the client libraries need sources they can read from in parallel. Since streams are sequential, when uploading in parallel, the Storage client libraries will buffer the data for each individual REST call before starting the upload even if the provided stream is already seekable.
To avoid the Storage client libraries buffering your data for upload, you must provide a seekable stream and ensure MaximumConcurrency is set to 1. While this strategy should suffice in most situations, your code could be using other features of the client libraries that require buffering anyway. In this case, buffering will still occur.
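The two conditions above (a seekable stream and a MaximumConcurrency of 1) might look like this in practice. This is a sketch under assumptions: the class and method names are hypothetical, and the part size is whatever the caller chooses.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Azure.Storage;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

public static class UnbufferedUploadExample
{
    // Options for a strictly sequential upload with the given part size.
    public static BlobUploadOptions SequentialOptions(long maximumTransferSize)
        => new BlobUploadOptions
        {
            TransferOptions = new StorageTransferOptions
            {
                MaximumConcurrency = 1,
                MaximumTransferSize = maximumTransferSize,
            },
        };

    public static async Task UploadSequentiallyAsync(
        BlobClient blob, string localPath, long maximumTransferSize)
    {
        // A FileStream over a regular file is seekable.
        using FileStream source = File.OpenRead(localPath);
        if (!source.CanSeek)
        {
            throw new InvalidOperationException("A seekable stream is required.");
        }

        await blob.UploadAsync(source, SequentialOptions(maximumTransferSize));
    }
}
```

As noted above, other client library features can still force buffering, so treat this as a necessary setup rather than a guarantee.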
InitialTransferSize on upload
When a seekable stream is provided, its length is checked against this value. If the stream length is within this value, the entire stream will be uploaded as a single REST call. Otherwise, upload will be done in parts as described previously in this document.
Note: when using BlobClient, an upload within the InitialTransferSize will be performed using Put Blob, rather than Put Block.
InitialTransferSize has no effect on an unseekable stream and will be ignored.
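The decision described in this section can be modeled with a few lines of code. This is an illustrative model of the documented behavior, not the client library’s actual implementation; UploadPlan and UploadPlanner are hypothetical names.

```csharp
using System.IO;

public enum UploadPlan
{
    SingleRequest, // one REST call (Put Blob when using BlobClient)
    Partitioned,   // multiple subuploads (Put Block when using BlobClient)
}

public static class UploadPlanner
{
    // Models the rule above: only a seekable stream whose length is within
    // InitialTransferSize is sent as a single request.
    public static UploadPlan Choose(Stream content, long initialTransferSize)
    {
        if (content.CanSeek && content.Length <= initialTransferSize)
        {
            return UploadPlan.SingleRequest;
        }

        // Unseekable streams have no known length, so the initial size is ignored.
        return UploadPlan.Partitioned;
    }
}
```

Under this model, a 100-byte MemoryStream with an InitialTransferSize of 100 goes up as a single request, while a non-seekable stream such as a DeflateStream is always partitioned.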
Downloads
The Storage client libraries will split a given download request into various subdownloads based on provided StorageTransferOptions, each with their own dedicated REST call. The client libraries manage these REST operations in parallel (depending on transfer options) to complete the total download.
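On the download side, StorageTransferOptions is passed to DownloadToAsync directly. The sketch below assumes a BlobClient supplied by the caller; the class name and the chosen values (1 MiB ranges, two at a time) are illustrative assumptions only.

```csharp
using System.Threading.Tasks;
using Azure.Storage;
using Azure.Storage.Blobs;

public static class DownloadExample
{
    // Illustrative values only: small ranges with modest parallelism.
    public static readonly StorageTransferOptions Options = new StorageTransferOptions
    {
        MaximumConcurrency = 2,
        MaximumTransferSize = 1024 * 1024,
        InitialTransferSize = 1024 * 1024,
    };

    public static async Task DownloadFileAsync(BlobClient blob, string destinationPath)
    {
        // The named argument selects the overload that accepts transfer options.
        await blob.DownloadToAsync(destinationPath, transferOptions: Options);
    }
}
```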
Buffering on downloads
Receiving multiple HTTP responses simultaneously with body contents will have memory implications. However, the Storage client libraries don’t explicitly add a buffer step for downloaded contents. Incoming responses are processed in order. The client libraries configure a 16-kilobyte buffer for copying from the HTTP response stream to the caller-provided destination stream or file path.
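To give a sense of what a copy with a 16-kilobyte buffer looks like, here is a minimal sketch using the standard Stream.CopyToAsync overload that takes a buffer size. It illustrates the idea only; the helper name is hypothetical and this isn’t the client library’s internal code.

```csharp
using System.IO;
using System.Threading.Tasks;

public static class StreamCopyExample
{
    // 16 KiB, matching the buffer size mentioned above.
    public const int CopyBufferSize = 16 * 1024;

    // Copies a response body to the destination in chunks of at most 16 KiB,
    // so the copy step itself holds only a small buffer in memory.
    public static Task CopyAsync(Stream responseBody, Stream destination)
        => responseBody.CopyToAsync(destination, CopyBufferSize);
}
```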
InitialTransferSize on download
The Storage client libraries will make one download range request using InitialTransferSize before anything else. Upon downloading that range, total resource size will be known. If the initial request downloaded the whole content, we’re done! Otherwise, the download steps described previously will begin.
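The sequence above can be modeled as a list of byte ranges: one initial range, followed by ranges of at most MaximumTransferSize for whatever remains. This is a simplified sketch, not the library’s implementation. In particular, it takes the total size as a parameter, whereas the real client only learns it from the first response. DownloadRangePlanner is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

public static class DownloadRangePlanner
{
    // Returns the (offset, length) of each range request, in order.
    public static List<(long Offset, long Length)> Plan(
        long totalSize, long initialTransferSize, long maximumTransferSize)
    {
        if (initialTransferSize <= 0) throw new ArgumentOutOfRangeException(nameof(initialTransferSize));
        if (maximumTransferSize <= 0) throw new ArgumentOutOfRangeException(nameof(maximumTransferSize));

        var ranges = new List<(long Offset, long Length)>();

        // The first request always asks for InitialTransferSize bytes.
        long first = Math.Min(initialTransferSize, totalSize);
        ranges.Add((0, first));

        // Anything left is split into subdownloads of at most MaximumTransferSize.
        for (long offset = first; offset < totalSize; offset += maximumTransferSize)
        {
            ranges.Add((offset, Math.Min(maximumTransferSize, totalSize - offset)));
        }

        return ranges;
    }
}
```

For a 10-byte resource with an InitialTransferSize of 4 and a MaximumTransferSize of 3, this yields three ranges: (0, 4), (4, 3), and (7, 3). A 3-byte resource with the same options completes in the single initial request.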
Summary
StorageTransferOptions contains the tools to optimize your transfers. It provides options that affect transfer speeds and memory usage. Unless you’re working with trivial file sizes, be proactive in configuring these options based on the environment in which your client will run.