Tuning your uploads and downloads with the Azure Storage client library for .NET

James Schreppler

When transferring data with the Azure Storage client libraries, a lot happens behind the scenes. These inner workings can affect speed, memory usage, and sometimes whether the transfer succeeds at all. This post will help you get the most out of Storage client library data transfers.

These concepts apply to the Azure.Storage.Blobs and Azure.Storage.Files.DataLake packages. Specifically, we’re looking at APIs that accept StorageTransferOptions as a parameter. Commonly used examples are:

  • BlobClient.UploadAsync(Stream stream, ...)
  • BlobClient.UploadAsync(string path, ...)
  • BlobClient.DownloadToAsync(Stream stream, ...)
  • BlobClient.DownloadToAsync(string path, ...)
  • DataLakeFileClient.UploadAsync(Stream stream, ...)
  • DataLakeFileClient.UploadAsync(string path, ...)
  • DataLakeFileClient.ReadToAsync(Stream stream, ...)
  • DataLakeFileClient.ReadToAsync(string path, ...)
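Each of these methods accepts a StorageTransferOptions value that governs how the transfer is split up. As a minimal sketch of where that parameter goes (the connection string, container, and blob names below are placeholders, not real values):

```csharp
using System.IO;
using System.Threading.Tasks;
using Azure.Storage;
using Azure.Storage.Blobs;

class Example
{
    static async Task Main()
    {
        // Placeholder connection details -- substitute your own.
        var blob = new BlobClient(
            "<connection-string>", "<container-name>", "<blob-name>");

        // Transfer tuning is passed alongside the destination.
        var transferOptions = new StorageTransferOptions
        {
            MaximumConcurrency = 4
        };

        using Stream destination = File.OpenWrite("download.bin");
        await blob.DownloadToAsync(
            destination, conditions: null, transferOptions: transferOptions);
    }
}
```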

StorageTransferOptions

StorageTransferOptions is the key class for tuning your performance. Storage transfers are partitioned into several subtransfers based on the values in this class. Here, you define values for the following properties, which are the basis for managing your transfer:

  • MaximumConcurrency: the maximum number of parallel subtransfers that can take place at once.
    • From launch until the present (Azure.Storage.Blobs 12.10.0 and Azure.Storage.Files.DataLake 12.8.0), only asynchronous operations can parallelize transfers. Synchronous operations will ignore this value and work in sequence.
    • The effectiveness of this value is subject to the restrictions set by .NET’s connection pool limit, which may hinder you by default. For more information about these restrictions, see this blog post.
  • MaximumTransferSize: the maximum data size of a subtransfer, in bytes.
    • To keep data moving, the client libraries may not reach this value on every subtransfer.
    • Different REST APIs have different maximum values they support for transfer, and those values have changed across service versions. Check your documentation to determine the limits you can select for this value.
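On .NET Framework, the connection pool limit mentioned above comes from ServicePointManager, whose default of two connections per endpoint (for non-ASP.NET applications) will quietly cap any higher MaximumConcurrency you set. A minimal sketch; the value 32 is an arbitrary illustration, not a recommendation:

```csharp
using System.Net;

// On .NET Framework, raise the per-endpoint connection limit before
// creating any clients; otherwise parallel subtransfers queue behind
// the default limit. (.NET Core / .NET 5+ do not impose this limit
// by default, so this line is a no-op safeguard there.)
ServicePointManager.DefaultConnectionLimit = 32;
```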

You can also define a value for InitialTransferSize. Despite what the name suggests, MaximumTransferSize does not limit this value. In fact, you often want InitialTransferSize to be at least as large as MaximumTransferSize, if not larger. InitialTransferSize defines a separate size limit for an initial attempt to perform the entire operation at once, with no subtransfers. A single transfer cuts down on overhead, leading to faster transfers for some data lengths relative to your MaximumTransferSize. If you're unsure what's best for you, setting this property to the same value as MaximumTransferSize is a safe option.

While the class contains nullable values, the client libraries will use defaults for each individual value when not provided. These defaults are fine in a data center environment, but likely unsuitable for home consumer environments. Poorly tuned StorageTransferOptions can result in excessively long operations and even timeouts. You should always be proactive in determining your values for this class.
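Putting the three properties together, here is a sketch of explicitly tuned options, with InitialTransferSize matching MaximumTransferSize as the safe starting point discussed above. The specific numbers are illustrative, not recommendations:

```csharp
using Azure.Storage;

var transferOptions = new StorageTransferOptions
{
    // Up to 8 parallel subtransfers (async operations only).
    MaximumConcurrency = 8,

    // Cap each subtransfer at 4 MiB.
    MaximumTransferSize = 4 * 1024 * 1024,

    // Attempt the whole operation in one request when the data fits;
    // matching MaximumTransferSize is a safe default.
    InitialTransferSize = 4 * 1024 * 1024
};
```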

Uploads

The Storage client libraries will split a given upload stream into several subuploads based on the provided StorageTransferOptions, each with its own dedicated REST call. With BlobClient, this operation is Put Block; with DataLakeFileClient, it's Append Data. The Storage client libraries manage these REST operations in parallel (depending on transfer options) to complete the total upload.

Note: block blobs have a maximum block count of 50,000. Your blob, then, has a maximum size of 50,000 times MaximumTransferSize.
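That ceiling can be computed directly; the 8 MiB figure below is an arbitrary illustrative value, not a recommendation:

```csharp
using System;

// A block blob allows at most 50,000 blocks, and each subupload
// writes one block of at most MaximumTransferSize bytes.
const long maxBlockCount = 50_000;
const long maximumTransferSize = 8 * 1024 * 1024; // 8 MiB, illustrative

long maxBlobSizeBytes = maxBlockCount * maximumTransferSize;
Console.WriteLine($"{maxBlobSizeBytes / (1024.0 * 1024 * 1024):F1} GiB");
```

With an 8 MiB MaximumTransferSize, the largest blob you could upload in parts is roughly 390 GiB; a larger MaximumTransferSize raises that ceiling proportionally.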

Buffering on uploads

The Storage REST layer doesn’t support picking up a REST upload where you left off. Individual transfers are either completed or lost. To ensure resiliency, if a stream isn’t seekable, the Storage client libraries will buffer the data for each individual REST call before starting the upload. Outside of network speed, this behavior is also why you may be interested in setting a smaller value for MaximumTransferSize even when uploading in sequence. MaximumTransferSize is the maximum division of data to be retried after a connection failure.

If uploading with parallel REST calls to maximize network throughput, the client libraries need sources they can read from in parallel. Since streams are sequential, when uploading in parallel, the Storage client libraries will buffer the data for each individual REST call before starting the upload even if the provided stream is already seekable.

To avoid the Storage client libraries buffering your data for upload, you must provide a seekable stream and ensure MaximumConcurrency is set to 1. While this strategy should suffice in most situations, your code could be using other features of the client libraries that require buffering anyway. In this case, buffering will still occur.
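Concretely, a sketch of an upload that avoids client-side buffering: a FileStream is seekable, and MaximumConcurrency is pinned to 1. The connection details and file names are placeholders:

```csharp
using System.IO;
using System.Threading.Tasks;
using Azure.Storage;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

class UnbufferedUpload
{
    static async Task Main()
    {
        // Placeholder connection details -- substitute your own.
        var blob = new BlobClient(
            "<connection-string>", "<container-name>", "<blob-name>");

        var options = new BlobUploadOptions
        {
            TransferOptions = new StorageTransferOptions
            {
                // Sequential subuploads only: with a seekable stream,
                // the library can re-read on retry without buffering.
                MaximumConcurrency = 1,
                MaximumTransferSize = 4 * 1024 * 1024
            }
        };

        // FileStream is seekable, so each 4 MiB slice can be rewound
        // and resent on failure instead of being copied into memory.
        using FileStream source = File.OpenRead("local-file.bin");
        await blob.UploadAsync(source, options);
    }
}
```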

InitialTransferSize on upload

When a seekable stream is provided, its length is checked against this value. If the stream length is within this value, the entire stream will be uploaded as a single REST call. Otherwise, upload will be done in parts as described previously in this document.

Note: when using BlobClient, an upload within the InitialTransferSize will be performed using Put Blob, rather than Put Block.

InitialTransferSize has no effect on an unseekable stream and will be ignored.

Downloads

The Storage client libraries will split a given download request into several subdownloads based on the provided StorageTransferOptions, each with its own dedicated REST call. The client libraries manage these REST operations in parallel (depending on transfer options) to complete the total download.

Buffering on downloads

Receiving multiple HTTP responses simultaneously, each with body contents, has memory implications. However, the Storage client libraries don't explicitly add a buffering step for downloaded contents. Incoming responses are processed in order, and the client libraries use a 16-kilobyte buffer when copying data from each HTTP response stream to the caller-provided destination stream or file path.

InitialTransferSize on download

The Storage client libraries will make one download range request using InitialTransferSize before anything else. Upon downloading that range, total resource size will be known. If the initial request downloaded the whole content, we’re done! Otherwise, the download steps described previously will begin.
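On the Data Lake side, the same tuning applies to ReadToAsync. A sketch, assuming the overload that accepts StorageTransferOptions directly; the connection details, file path, and sizes are placeholders:

```csharp
using System.Threading.Tasks;
using Azure.Storage;
using Azure.Storage.Files.DataLake;

class TunedDownload
{
    static async Task Main()
    {
        // Placeholder connection details -- substitute your own.
        var file = new DataLakeFileClient(
            "<connection-string>", "<filesystem-name>", "<file-path>");

        var transferOptions = new StorageTransferOptions
        {
            // A modest initial range: small files finish in one request,
            // larger ones fall back to parallel subdownloads.
            InitialTransferSize = 4 * 1024 * 1024,
            MaximumTransferSize = 4 * 1024 * 1024,
            MaximumConcurrency = 4
        };

        await file.ReadToAsync("download.bin", transferOptions: transferOptions);
    }
}
```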

Summary

StorageTransferOptions contains the tools to optimize your transfers. It provides options that affect transfer speeds and memory usage. Unless you’re working with trivial file sizes, be proactive in configuring these options based on the environment in which your client will run.

Azure SDK Blog Contributions

Thanks for reading this Azure SDK blog post. We hope you learned something new, and we welcome you to share the post. We’re open to Azure SDK blog contributions from our readers. To get started, contact us at azsdkblog@microsoft.com with your idea, and we’ll set you up as a guest blogger.
