Azure SDK network reliability

This blog post describes how the Azure SDKs reliably upload large streams of data to an Azure service. While this post uses Go and Azure Storage blobs as examples, the concepts apply to all the Azure SDK languages and other Azure services that accept large streams of data.

I’ve been working with distributed cloud applications for many decades now. Distributed refers to applications whose parts communicate via networking. Cloud refers to embracing failure and writing your application networking code in a reliable way so it can recover from failures. After all, distributed systems fail all the time. See the Fallacies of Distributed Computing.

A major feature of the Azure SDKs is that they can retry an HTTP operation when they detect a network failure or timeout. In this way, your application’s communication with Azure services becomes reliable simply by using an Azure SDK. Almost all Azure services use HTTP/REST and send/receive JSON objects in HTTP bodies. These JSON objects are serialized from a structure into an in-memory string. The string is sent in the body for each retry of an HTTP request.

However, some HTTP operations don’t send structures/JSON objects. Instead, the body content is passed in by you, our SDK customer. The canonical example is when you want to pass a stream of bytes to the Azure Storage service to create a blob. The byte stream could represent a text file, image file, document file, audio file, or whatever you desire.

To upload a stream to a block blob, you can call blockblob.Client’s public Upload or StageBlock methods:

func (bb *Client) Upload(
   ctx     context.Context,
   body    io.ReadSeekCloser,
   options *UploadOptions) (UploadResponse, error)

func (bb *Client) StageBlock(
   ctx           context.Context,
   base64BlockID string,
   body          io.ReadSeekCloser,
   options       *StageBlockOptions) (StageBlockResponse, error)

Notice that both methods accept an io.ReadSeekCloser for the body. You might expect the body to be just an io.ReadCloser because Go’s http.Request structure contains a Body field whose type is io.ReadCloser. So, why do we require seeking? The Azure SDK methods need to seek back to the beginning of the stream whenever the operation must be retried to make the operation reliable in the face of network failures. As a side benefit, these methods can also get the number of bytes in the stream by seeking to the end. Then they can set the HTTP request’s Content-Length header properly.
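
To make the role of seeking concrete, here is a minimal sketch that uses only Go's standard library. It is not the SDK's implementation: bodyLength, sendWithRetry, uploadResult, and simulateUpload are hypothetical names invented for this illustration, and the "network" is a function that fails on purpose. The sketch shows the two uses of Seek described above: measuring the stream for Content-Length and rewinding before every retry.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// bodyLength returns the total size of body by seeking to its end, then
// rewinds to the beginning so the body can be sent from the first byte.
func bodyLength(body io.ReadSeeker) (int64, error) {
	size, err := body.Seek(0, io.SeekEnd)
	if err != nil {
		return 0, err
	}
	_, err = body.Seek(0, io.SeekStart)
	return size, err
}

// sendWithRetry is a hypothetical retry loop (not the SDK's code). It rewinds
// body before every attempt so each retry resends the stream from the start.
// maxTries must be at least 1.
func sendWithRetry(body io.ReadSeeker, maxTries int, send func(body io.Reader, contentLength int64) error) error {
	size, err := bodyLength(body)
	if err != nil {
		return err
	}
	for try := 1; try <= maxTries; try++ {
		if _, err = body.Seek(0, io.SeekStart); err != nil {
			return err
		}
		if err = send(body, size); err == nil {
			return nil
		}
	}
	return fmt.Errorf("giving up after %d tries: %w", maxTries, err)
}

// uploadResult records what the simulated service observed.
type uploadResult struct {
	Received      string
	ContentLength int64
	Tries         int
}

// simulateUpload fakes a flaky network: the first `failures` tries consume
// half of the body and then fail; the following try reads the whole body.
func simulateUpload(text string, failures int) (uploadResult, error) {
	var res uploadResult
	err := sendWithRetry(strings.NewReader(text), failures+1, func(body io.Reader, contentLength int64) error {
		res.Tries++
		res.ContentLength = contentLength
		if res.Tries <= failures {
			_, _ = io.CopyN(io.Discard, body, contentLength/2)
			return errors.New("simulated network failure")
		}
		data, err := io.ReadAll(body)
		res.Received = string(data)
		return err
	})
	return res, err
}

func main() {
	res, err := simulateUpload("Text to upload to a blob", 2)
	fmt.Println(res.Tries, res.ContentLength, res.Received, err)
}
```

Even though the two failed tries each consumed half of the body, the final try still receives the complete text, because the body was rewound first. A plain io.Reader offers no way to do that.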

If you’re trying to upload a string/byte slice, you can turn it into an io.ReadSeeker by wrapping it with a strings.Reader/bytes.Reader. Then you can use our streaming.NopCloser function to make it an io.ReadSeekCloser. Here’s an example of what your code might look like:

upload, err := blockBlobClient.Upload(
   context.TODO(),
   streaming.NopCloser(strings.NewReader("Text to upload to a blob")),
   nil)

Now, we know that these two low-level building-block methods are inconvenient to use when uploading large amounts of data. To simplify things for many of our customers, we provide the UploadBuffer and UploadFile convenience methods:

func (bb *Client) UploadBuffer(
   ctx    context.Context,
   buffer []byte,
   o      *UploadBufferOptions) (UploadBufferResponse, error)

func (bb *Client) UploadFile(
   ctx  context.Context,
   file *os.File,
   o    *UploadFileOptions) (UploadFileResponse, error)

These functions upload the large buffer/file more quickly using multiple goroutines. Each goroutine calls StageBlock, passing it a section (block) of the buffer/file. The caller controls the maximum concurrency (bandwidth usage) via the fields of the UploadBufferOptions or UploadFileOptions structure.

UploadBuffer allocates no buffers internally. It just creates smaller byte slices over sections of the large buffer and passes each byte slice to separate goroutines that call StageBlock. UploadFile accepts an os.File, which we split into blocks using io.NewSectionReader. Each io.SectionReader implements io.ReadSeeker, and then we use goroutines to pass each SectionReader to StageBlock.
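
The splitting described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the SDK's source: splitBuffer, splitReaderAt, and readSections are hypothetical helpers, and a strings.Reader stands in for the *os.File (both implement io.ReaderAt, which is all io.NewSectionReader needs).

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// splitBuffer returns sub-slices of buf, each at most blockSize bytes long.
// The sub-slices share buf's memory, so nothing is copied or allocated
// beyond the slice headers. blockSize must be greater than zero.
func splitBuffer(buf []byte, blockSize int) [][]byte {
	var blocks [][]byte
	for off := 0; off < len(buf); off += blockSize {
		end := off + blockSize
		if end > len(buf) {
			end = len(buf)
		}
		blocks = append(blocks, buf[off:end])
	}
	return blocks
}

// splitReaderAt carves a source of the given size into independent
// SectionReaders. Each one implements io.ReadSeeker over its own range.
// blockSize must be greater than zero.
func splitReaderAt(src io.ReaderAt, size, blockSize int64) []*io.SectionReader {
	var sections []*io.SectionReader
	for off := int64(0); off < size; off += blockSize {
		n := blockSize
		if size-off < n {
			n = size - off
		}
		sections = append(sections, io.NewSectionReader(src, off, n))
	}
	return sections
}

// readSections reads every section of text back as a string. The
// strings.Reader is a stand-in for a file opened with os.Open.
func readSections(text string, blockSize int64) ([]string, error) {
	src := strings.NewReader(text)
	var out []string
	for _, sec := range splitReaderAt(src, int64(len(text)), blockSize) {
		data, err := io.ReadAll(sec)
		if err != nil {
			return nil, err
		}
		// Each section can be rewound on its own, which is what would
		// allow a single block to be retried.
		if _, err := sec.Seek(0, io.SeekStart); err != nil {
			return nil, err
		}
		out = append(out, string(data))
	}
	return out, nil
}

func main() {
	for i, b := range splitBuffer([]byte("abcdefghij"), 4) {
		fmt.Printf("buffer block %d: %q\n", i, b)
	}
	sections, err := readSections("abcdefghij", 4)
	fmt.Println(sections, err)
}
```

In a real upload, each block or section would be handed to its own goroutine and passed to StageBlock; that part is omitted here to keep the focus on the splitting.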

We design and test these convenience methods and document/support them in perpetuity for all customers. We believe that almost all our customers will find these convenience methods useful. But these methods don’t expose all the “bells and whistles” possible when uploading a buffer/file. If some customers find them insufficient for their needs, then customers can always write their own function that internally calls our publicly exposed, low-level building block method: StageBlock. Customers can also use the source code of our convenience method to help bootstrap their effort.

After a while, we received numerous pieces of customer feedback. In addition to the convenience methods like UploadBuffer and UploadFile, customers wanted us to offer a method that uploads from an io.Reader to a blob. Initially, we resisted implementing it in our SDK because an io.Reader isn’t seekable. Such a method would fail whenever the network connection failed, because our SDK would be unable to retry the operation. Imagine if this method always worked while debugging/testing an app because there were no networking issues but then occasionally failed when deployed to production! This kind of non-deterministic behavior is unacceptable to us and would cause customer issues and confusion.

After thinking about this problem, we came up with a way to reliably implement an UploadStream method:

func (bb *Client) UploadStream(
   ctx  context.Context,
   body io.Reader,
   o    *UploadStreamOptions) (UploadStreamResponse, error)

But, to provide customers with a consistent and reliable implementation, we had no choice but to have it internally allocate buffers, read from the io.Reader stream into the buffers, and then internally call StageBlock to upload the buffers. Each buffer is seekable, so we can retry the upload should a network failure occur. If there’s an error while reading from the io.Reader, there’s nothing our UploadStream method can do, and it will fail. It’s up to you to ensure that the io.Reader is itself reliable.
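
The buffering idea can be sketched as follows. This is a simplified, sequential illustration and not the SDK's implementation: stageFunc is a hypothetical stand-in for a call such as StageBlock, and uploadStream, stageWithRetry, and simulateStreamUpload are names made up for this example. The point it demonstrates is that a non-seekable io.Reader is copied into a buffer, and the buffer (wrapped in a bytes.Reader) is what gets retried.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"strings"
)

// stageFunc is a hypothetical stand-in for a call such as StageBlock.
type stageFunc func(block io.ReadSeeker) error

// uploadStream reads body sequentially into a blockSize buffer and stages
// each filled buffer. A read error from body is fatal because a plain
// io.Reader cannot be rewound.
func uploadStream(body io.Reader, blockSize, maxTries int, stage stageFunc) error {
	buf := make([]byte, blockSize)
	for {
		n, readErr := io.ReadFull(body, buf)
		atEnd := readErr == io.EOF || readErr == io.ErrUnexpectedEOF
		if readErr != nil && !atEnd {
			return fmt.Errorf("reading source stream: %w", readErr)
		}
		if n > 0 {
			// The in-memory block is seekable even though body is not.
			if err := stageWithRetry(bytes.NewReader(buf[:n]), maxTries, stage); err != nil {
				return err
			}
		}
		if atEnd {
			return nil
		}
	}
}

// stageWithRetry rewinds the block before every try. maxTries must be >= 1.
func stageWithRetry(block io.ReadSeeker, maxTries int, stage stageFunc) error {
	var err error
	for try := 1; try <= maxTries; try++ {
		if _, err = block.Seek(0, io.SeekStart); err != nil {
			return err
		}
		if err = stage(block); err == nil {
			return nil
		}
	}
	return fmt.Errorf("staging block failed after %d tries: %w", maxTries, err)
}

// simulateStreamUpload uploads text through a reader with no Seek method.
// Every odd-numbered stage call reads one byte and then fails, so each
// block needs exactly two calls. It returns the stored data and call count.
func simulateStreamUpload(text string, blockSize int) (string, int, error) {
	source := struct{ io.Reader }{strings.NewReader(text)} // hides Seek
	var stored bytes.Buffer
	calls := 0
	err := uploadStream(source, blockSize, 3, func(block io.ReadSeeker) error {
		calls++
		if calls%2 == 1 {
			_, _ = io.CopyN(io.Discard, block, 1)
			return errors.New("simulated network failure")
		}
		_, err := io.Copy(&stored, block)
		return err
	})
	return stored.String(), calls, err
}

func main() {
	stored, calls, err := simulateStreamUpload("abcdefghij", 4)
	fmt.Println(stored, calls, err)
}
```

The cost of this reliability is memory: every byte of the stream passes through a buffer owned by the uploader before it is sent.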

To improve performance, our UploadStream method lazily allocates buffers and uses multiple goroutines to upload blocks in parallel. Reading from the io.Reader must be done sequentially, so a single goroutine reads from the stream into the buffer(s) and spawns other goroutines that call StageBlock. If StageBlock is fast, then UploadStream efficiently reuses the one buffer it allocated, avoiding extra buffer allocations. But, if StageBlock is slow, then UploadStream allocates more buffers/goroutines to process the stream in parallel quickly. You can also control the size and maximum number of buffers by setting the BlockSize and Concurrency fields in the UploadStreamOptions structure.
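
One way to picture the single-reader, many-uploader arrangement is the sketch below. It is an assumption-laden illustration rather than the SDK's code: uploadStreamParallel and simulateParallelUpload are hypothetical names, the stage callback stands in for StageBlock, retries are left out, and ordinary heap slices are used for the buffers. Buffers are allocated only when no returned buffer is available, up to the concurrency limit.

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"sync"
)

// uploadStreamParallel reads body on the calling goroutine and stages each
// block on its own goroutine. It returns how many buffers were allocated.
// blockSize and concurrency must both be at least 1.
func uploadStreamParallel(body io.Reader, blockSize, concurrency int,
	stage func(index int, block []byte) error) (int, error) {
	free := make(chan []byte, concurrency) // buffers handed back for reuse
	allocated := 0
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	fail := func(err error) {
		mu.Lock()
		if firstErr == nil {
			firstErr = err
		}
		mu.Unlock()
	}
	failed := func() bool {
		mu.Lock()
		defer mu.Unlock()
		return firstErr != nil
	}
	for index := 0; !failed(); index++ {
		var buf []byte
		select {
		case buf = <-free: // a finished upload returned its buffer
		default:
			if allocated < concurrency {
				buf = make([]byte, blockSize) // lazy allocation
				allocated++
			} else {
				buf = <-free // at the limit: wait for a buffer
			}
		}
		n, readErr := io.ReadFull(body, buf)
		atEnd := readErr == io.EOF || readErr == io.ErrUnexpectedEOF
		if readErr != nil && !atEnd {
			fail(fmt.Errorf("reading source stream: %w", readErr))
			break
		}
		if n > 0 {
			wg.Add(1)
			go func(index int, buf []byte, n int) {
				defer wg.Done()
				if err := stage(index, buf[:n]); err != nil {
					fail(err)
				}
				free <- buf // never blocks: capacity covers every buffer
			}(index, buf, n)
		}
		if atEnd {
			break
		}
	}
	wg.Wait()
	return allocated, firstErr
}

// simulateParallelUpload stores each block by index and reassembles them.
func simulateParallelUpload(text string, blockSize, concurrency int) (string, int, error) {
	source := struct{ io.Reader }{strings.NewReader(text)} // hides Seek
	var mu sync.Mutex
	blocks := map[int]string{}
	allocated, err := uploadStreamParallel(source, blockSize, concurrency,
		func(index int, block []byte) error {
			mu.Lock()
			defer mu.Unlock()
			blocks[index] = string(block) // copy: the buffer is reused
			return nil
		})
	var sb strings.Builder
	for i := 0; i < len(blocks); i++ {
		sb.WriteString(blocks[i])
	}
	return sb.String(), allocated, err
}

func main() {
	data, allocated, err := simulateParallelUpload("abcdefghijklmnopqrstuvwxyz", 4, 3)
	fmt.Println(data, allocated, err)
}
```

How many buffers end up allocated depends on timing: when staging finishes before the reader needs its next buffer, the same buffer is reused, and when staging lags, more are allocated until the limit is reached.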

One more thing to note is that UploadStream allocates its buffers as anonymous memory-mapped files (MMFs). This way, they aren’t under control of Go’s garbage collector, and we ensure that these buffers are explicitly freed when UploadStream returns. MMFs help to reduce address space fragmentation and keep the process’ working set small.

Summary

This blog post emphasizes how the Azure SDK team prioritizes its design principles when implementing its client libraries. The things we consider, in order, are:

1. A well-architected and sustainable code base with documentation, testing, examples, and customer support.
2. Help customers build applications resilient to network failures with Azure services.
3. Help customers build applications that efficiently use system resources (CPU, memory, disk, and so on).

We also consider principles related to ease of use, security/authentication, observability, debuggability, and so on, which may be addressed in future blog posts.