September 1st, 2020

Java NIO FileSystem APIs and the new Azure SDK for Java

Rick Ley
Software Engineer

The Azure SDK for Java recently released a preview of a custom implementation of Java’s FileSystem APIs (the azure-storage-blob-nio package on Maven), enabling developers to access Azure Blob Storage through a familiar file system interface. By adding this new dependency, you can easily instruct the JVM to point all file system operations to Azure Blob Storage rather than the local system. In this article, I will discuss why this tool is useful and some common use cases. I will then describe how this tool fits into the Java ecosystem using code samples that demonstrate how to leverage it. Finally, I will discuss nuances of this tool and thoughts to keep in mind when using it.

I will expand on this later, but it should be emphasized up front that while viewing Blob Storage through the lens of a file system can offer the benefits discussed here, the fact remains that Blob Storage is not a file system, and even this project can only make a best effort at converging the two. The end of the article will discuss implications of this in more depth.

Why We Built azure-storage-blob-nio

Azure Blob Storage is fundamentally an object store and not a file system in the traditional sense. Consequently, it has a different interface, different semantics, and a different contract than most typical file systems. There are good reasons for this, many of which center on being able to store tremendous amounts of data and operate on that data with extremely low latency. Still, traditional file system semantics are often more familiar and therefore easier to reason about. Furthermore, the JDK confers some added benefits on storage systems that implement a specific interface.

The benefit for file systems implementing the JDK interfaces is that the JVM is able to dynamically load them at runtime. This means all that’s needed to add a backend file store to your application is to add a new type to your class path along with a few configurations the first time the new implementation is accessed. Then, as all implementations adhere to a standard interface, storage services, either local or remote, may be swapped in or out as needed during the run of the application; no code needs to change to switch between any two storage services.
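To make the "no code needs to change" point concrete, here is a minimal sketch that runs one helper against two different backends. It uses only providers that ship with the JDK (the local default file system and the zip provider registered under the jar scheme) as stand-ins, so it does not need an Azure account. The class name StorageAgnostic and the helper writeGreeting are illustrative names, not part of any package.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;

final class StorageAgnostic {
    /** Writes a small text file; nothing here depends on which provider backs {@code dir}. */
    static Path writeGreeting(Path dir) throws IOException {
        Path file = dir.resolve("greeting.txt");
        Files.write(file, "hello".getBytes(StandardCharsets.UTF_8));
        return file;
    }

    public static void main(String[] args) throws IOException {
        // Backend 1: the local default file system.
        Path localDir = Files.createTempDirectory("nio-demo");
        System.out.println(Files.readAllLines(writeGreeting(localDir)));

        // Backend 2: a zip archive, served by the JDK's built-in "jar" provider.
        Path zip = localDir.resolve("demo.zip");
        URI zipUri = URI.create("jar:" + zip.toUri());
        try (FileSystem zipFs = FileSystems.newFileSystem(zipUri, Collections.singletonMap("create", "true"))) {
            System.out.println(Files.readAllLines(writeGreeting(zipFs.getPath("/"))));
        }
    }
}
```

The helper only ever sees a Path, so the same call should work with a Path obtained from any other installed provider.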

Side-by-Side Comparison

As an example, let’s look at the code needed to simply create a directory blob in each case.

First, using the SDK. Keep in mind that this sample only shows the code necessary to create the blob itself. Implementing all the checks necessary to abide by the JDK contract for this operation–validating the path, avoiding overwrites, ensuring the parent exists, etc.–is significantly more complex when using the blob SDK alone and cannot fit here.

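@Override
public void createDirectory(Path path, FileAttribute<?>... fileAttributes) throws IOException {

    // Build a client that points at the blob which will represent the directory.
    BlobClient blobClient = new BlobClientBuilder()
        .endpoint("<your-storage-account-url>")
        .sasToken("<your-sasToken>")
        .containerName("mycontainer")
        .blobName("directoryPath")
        .buildClient();

    // Mark the blob as a directory through its metadata.
    Map<String, String> blobMetadata = new HashMap<>();
    blobMetadata.put(DIR_METADATA_MARKER, "true");

    // Commit an empty block list to create a zero-length blob.
    // blobHeaders and requestConditions are assumed to be defined elsewhere in the class.
    blobClient.getBlockBlobClient().commitBlockListWithResponse(Collections.emptyList(), this.blobHeaders,
        blobMetadata, null, requestConditions, null, null);
}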

And now using the nio package, which includes all the extra behavior mentioned above:

Map<String, Object> config = new HashMap<>();
String stores = "mycontainer";
config.put(AzureFileSystem.AZURE_STORAGE_ACCOUNT_KEY, "<your_account_key>");
config.put(AzureFileSystem.AZURE_STORAGE_FILE_STORES, stores);
FileSystem myFs = FileSystems.newFileSystem(new URI("azb://?account=myaccount"), config);
Files.createDirectory(myFs.getPath("mycontainer:/directoryPath"));

While the SDK excels at accessing Blob Storage as a blob store, mimicking a file system is evidently non-trivial. Moreover, equivalent code would need to be written for each storage platform using whatever APIs they natively support. As the number of storage services supporting the app grows, the complexity of understanding each one and maintaining this interface also grows. By contrast, using the nio package, all the code is uniform even across storage services; it is sufficient to simply include and configure new storage services and then treat them the same.

Typical Use Case

A typical example of how this is used is scientific engines, which may have their data source and output locations spread across a variety of local and remote stores. Rather than writing code specific to each location, the engine can simply load the implementation appropriate for the given operation and have everything work seamlessly.

What are Java’s NIO FileSystem APIs?

Java 7 introduced the nio package as a new way to interact with file systems. For our purposes, the key additions are a set of interfaces and extensible types that Java developers were invited to implement. Although it is rarely used directly by application developers, the FileSystemProvider type is particularly interesting to us here. Because, as the name indicates, this type is a “provider”, the introduction of this type means that the Java 7 file APIs can dovetail with another Java paradigm—Service Provider Interfaces (SPI).

Java 6 standardized the Service Provider Interface pattern with the java.util.ServiceLoader type. Let’s break that down. A service is a standard set of operations that collectively implement some functionality. A provider may be thought of as a factory which generates instances of this service. An interface, as with all other interfaces, offers a standard set of types and methods that developers may write code against and be assured of certain behavior. Here, though, the implementation may be picked at runtime rather than being hard coded.

Putting that together, Azure Storage has implemented the FileSystemProvider type (the provider), which may be used to create new FileSystems (the service) backed by Azure Blob Storage, and any available file operations are common to all other implementations (the interface). I’ll explain below how the file system maps to Blob Storage in our case. The point here is that by simply adding our package as a dependency, you can tell the JVM to load this implementation whenever you want, and all your file system calls will be directed to Blob Storage.
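One way to see which providers the JVM has discovered is to ask FileSystemProvider for its installed providers and print their URI schemes. The sketch below uses only JDK calls; the class name InstalledProviders is illustrative. The expectation that an azb entry shows up once azure-storage-blob-nio is on the class path follows from the scheme described in this article and is an assumption here, not something this snippet verifies.

```java
import java.nio.file.spi.FileSystemProvider;
import java.util.ArrayList;
import java.util.List;

final class InstalledProviders {
    /** Returns the URI scheme of every FileSystemProvider the JVM has discovered. */
    static List<String> schemes() {
        List<String> schemes = new ArrayList<>();
        for (FileSystemProvider provider : FileSystemProvider.installedProviders()) {
            schemes.add(provider.getScheme());
        }
        return schemes;
    }

    public static void main(String[] args) {
        // The default provider ("file") is always present. With azure-storage-blob-nio
        // on the class path, "azb" would be expected to appear in this list as well.
        schemes().forEach(System.out::println);
    }
}
```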

How to Use azure-storage-blob-nio

In short, keep in mind that a file system maps to an Azure Blob Storage account. A root directory or file store (like a C: drive on Windows) maps to a container. And files and directories map to blobs.

Importing the package

As mentioned, all that needs to be done to begin working is to include the dependency and create a FileSystem instance. To include the dependency, just add the following to your Maven project’s pom (be sure to check Maven for the latest version).

<dependency>
  <groupId>com.azure</groupId>
  <artifactId>azure-storage-blob-nio</artifactId>
  <version>12.0.0-beta.2</version> 
</dependency>

Creating a FileSystem

Now, let’s create a FileSystem. The standard way to do this is with the FileSystems.newFileSystem method. The first parameter is a URI which will uniquely identify the file system, and is of the format “azb://?account=<your_account_name>”. There are two components to this URI: a scheme and a query. The scheme will always be azb, which tells the JVM to use the implementation found in the azure-storage-blob-nio package. The query component is of the form “?account=<your_account_name>” and indicates that the returned file system will be backed by some number of containers in your account (see the next paragraph).

The other necessary component is configuration. To configure the file system, pass in a map with the desired values. We offer a set of constants on AzureFileSystem which you can use to help build this configuration map. You are required to at least pass a comma separated list of containers which will serve as your FileStores/root directories as specified by the configuration value AZURE_STORAGE_FILE_STORES.

You must also set some form of authentication, either a shared key (AZURE_STORAGE_ACCOUNT_KEY) or a SAS token (AZURE_STORAGE_SAS_TOKEN). If you choose to use a SAS token, ensure that it has the permissions necessary to perform the desired operations (including the connection check, which checks for the existence of containers and/or creates them), and that it remains valid long enough to execute the workload. All other configurations are optional.

Please see the documentation on AzureFileSystemProvider and newFileSystem for more details on these requirements. Below is an example of how this might look. The result is a file system which will be backed by the account myaccount and have two file stores backed by containers: container1 and container2. This also implies there are two root directories for the file system: container1: and container2: as described next.

Map<String, Object> config = new HashMap<>();
String stores = "container1,container2";
config.put(AzureFileSystem.AZURE_STORAGE_ACCOUNT_KEY, "<your_account_key>");
config.put(AzureFileSystem.AZURE_STORAGE_FILE_STORES, stores);
FileSystem myFs = FileSystems.newFileSystem(new URI("azb://?account=myaccount"), config);

FileSystem Operations

From here, let’s do some actual file system operations. Most of these are contained in the static helper type Files. You’ll notice that most of these methods take a Path object. You can use the FileSystem.getPath method to get a Path which points to a resource in this file system. By creating a path from a specific FileSystem instance, the provider and file system information are carried with it, and you no longer need to worry about specifying the scheme and query components; you can stick to just worrying about the path to your blob in your account.

The structure of an absolute path in this context is as follows: “mycontainername:/dir1/file”. The root component is a container name followed by a ':'. The ':' is what we use to indicate a root directory, which is always a container, as they have special properties, such as being unable to be deleted. The rest is a series of names separated by our path separator '/'. Everything after the root component will be the name of the blob. In the above example, the path would point to a blob called “dir1/file” in the container “mycontainername”.

Note that creating a Path object does not guarantee that a resource exists at that location or that the Path is logically valid for the given operation, only that it is syntactically valid. There may be any number of directories in the path. A path which does not have a root component will be resolved against the default directory, which is the root directory backed by the first container listed during configuration.
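The same "syntactic only" behavior can be observed with any provider. The sketch below uses the local default file system as a stand-in, since an azb path needs a configured account; the class name PathSyntaxDemo and the helper hasRoot are illustrative. With the azb provider, the default directory used for resolution would be the first configured container, as described above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class PathSyntaxDemo {
    /** True when the path carries a root component (a container root in the azb case). */
    static boolean hasRoot(Path path) {
        return path.getRoot() != null;
    }

    public static void main(String[] args) {
        // Building a Path never touches storage, so this succeeds whether or not anything exists there.
        Path relative = Paths.get("no-such-dir", "no-such-file.txt");
        System.out.println("exists:   " + Files.exists(relative));

        // A path without a root component is relative...
        System.out.println("has root: " + hasRoot(relative));

        // ...and is resolved against the file system's default directory when made absolute.
        Path absolute = relative.toAbsolutePath();
        System.out.println("resolved: " + absolute + " (has root: " + hasRoot(absolute) + ")");
    }
}
```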

At the moment, you will be able to create and iterate over directories, copy and delete files, and open input and output streams to given file locations.

The following demonstrates how to create a directory using the file system we created above. For more samples, refer to the project’s README.

Path dirPath = myFs.getPath("dir");
Files.createDirectory(dirPath);
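Iterating, copying, and deleting go through the same Files helpers. The following is a rough sketch that runs against a local temporary directory so it can be tried without an account; in a real application the directory would instead come from something like myFs.getPath("dir"). The class name DirectoryOps and the helper childNames are illustrative.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class DirectoryOps {
    /** Lists the immediate children of a directory by file name, sorted for stable output. */
    static List<String> childNames(Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        // A DirectoryStream holds resources, so close it with try-with-resources.
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                names.add(child.getFileName().toString());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Local stand-in for a directory obtained from another FileSystem.
        Path dir = Files.createTempDirectory("nio-ops");

        Path original = dir.resolve("a.txt");
        Files.write(original, "data".getBytes(StandardCharsets.UTF_8));
        Files.copy(original, dir.resolve("b.txt"));
        System.out.println(childNames(dir)); // [a.txt, b.txt]

        Files.delete(original);
        System.out.println(childNames(dir)); // [b.txt]
    }
}
```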

Further Details

Here we describe additional details on how to work with this file system implementation and how to reason about what is going on in your account as a result. We’ll also highlight which operations the package is currently optimized for and which ones are not supported or may behave unexpectedly.

You should always keep in mind, as mentioned at the start, that we are making something which is not a file system to look like a file system. Consequently, you should always take care to read the documentation on each method to understand any nuance present, and you should be cognizant of the fact that a remote file system will have a different performance profile and hit different network-related failures than a typical local file system.

Relatedly, there is an expectation that no other applications are modifying the contents of this system concurrently. This package does not account for outside actors interfering with the data, and unexpected errors or behavior may arise if this assumption is violated.

Caching

We do not keep a local cache of the remote files. This is because we have chosen to optimize, at least initially, for random reads and full writes. Enabling the ability to only read and therefore download a piece of a very large blob is in conflict with a local cache, as building a local cache would require first downloading the entire file in order to then offer random reads. As a consequence, because we are then always writing directly to Blob Storage and not to a local cache, and Blob Storage does not itself offer random writes, this package also does not offer random writes at the moment. Files must always be written completely and will be flushed upon closing the output stream.
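In practice, this suggests writing a file in one pass and relying on close() to finish the write, which try-with-resources handles. The sketch below shows that pattern along with reading only the leading bytes of a file through an input stream. It uses standard JDK stream calls and runs against a local temporary file as a stand-in; the class and method names are illustrative, and it is not a statement about how the package transfers data internally.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class StreamRoundTrip {
    /** Writes the whole payload in one pass; closing the stream completes the write. */
    static void writeFully(Path target, byte[] data) throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            out.write(data);
        } // close() runs here, even if write() throws
    }

    /** Reads at most maxBytes from the start of the file, without reading the rest. */
    static byte[] readPrefix(Path source, int maxBytes) throws IOException {
        try (InputStream in = Files.newInputStream(source)) {
            byte[] buffer = new byte[maxBytes];
            int total = 0;
            int read;
            while (total < maxBytes && (read = in.read(buffer, total, maxBytes - total)) != -1) {
                total += read;
            }
            return Arrays.copyOf(buffer, total);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("nio-stream", ".txt"); // local stand-in for a blob path
        writeFully(file, "hello world".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(readPrefix(file, 5), StandardCharsets.UTF_8)); // hello
    }
}
```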

Directories

One final thing to note is our handling of directories. Directories, too, are not native to Azure Blob Storage. Because of this, they may occasionally behave unexpectedly. In particular, there are two kinds of directory representations with slightly different behavior. If a set of data is pre-loaded in an account and simply uses the path separator “/” in blob names to indicate virtual directories, we will recognize the presence and validity of these directories, but they will also disappear when they become empty. On the other hand, directories created through this package will be “concrete” and create a 0 length blob with special metadata which indicates that it is a directory and therefore preserve its existence even when empty. You will never have to inspect and code around this and other distinctions between “virtual” and “concrete” directories, but you should be aware of how that behavior may manifest in your application.
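The difference between the two kinds of directories can be illustrated with a toy in-memory model. This is not the package's code: the class DirectoryModel, the marker key is_directory, and all method names are hypothetical, and a plain map stands in for a container. It only mirrors the behavior described above, where a virtual directory exists while some blob name starts with its prefix and a concrete directory exists because a marker blob says so.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a container (blob name -> metadata). Illustrative only, not the package's implementation. */
final class DirectoryModel {
    // Hypothetical marker key; the real package defines its own metadata key.
    static final String DIR_MARKER = "is_directory";

    private final Map<String, Map<String, String>> blobs = new HashMap<>();

    void putBlob(String name) {
        blobs.put(name, new HashMap<>());
    }

    /** A "concrete" directory: an empty blob carrying the marker metadata. */
    void createDirectory(String name) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(DIR_MARKER, "true");
        blobs.put(name, metadata);
    }

    void delete(String name) {
        blobs.remove(name);
    }

    boolean isDirectory(String name) {
        Map<String, String> metadata = blobs.get(name);
        if (metadata != null && "true".equals(metadata.get(DIR_MARKER))) {
            return true; // concrete: survives even when empty
        }
        String prefix = name + "/";
        // virtual: exists only while at least one blob lives under the prefix
        return blobs.keySet().stream().anyMatch(blob -> blob.startsWith(prefix));
    }

    public static void main(String[] args) {
        DirectoryModel container = new DirectoryModel();

        container.putBlob("virtual/file");
        System.out.println(container.isDirectory("virtual"));  // true
        container.delete("virtual/file");
        System.out.println(container.isDirectory("virtual"));  // false

        container.createDirectory("concrete");
        container.putBlob("concrete/file");
        container.delete("concrete/file");
        System.out.println(container.isDirectory("concrete")); // true
    }
}
```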

Conclusion

When working with a variety of back end storage services, using Java’s nio package can greatly simplify the code needed to interface with each of them. Rather than maintaining code specific to each one, you can work with a single, uniform, familiar interface. Now, you can use azure-storage-blob-nio to interact with Azure Blob Storage in that same way.

For more information, please visit the project’s homepage, which includes more samples and links to further documentation.

Azure SDK Blog Contributions

Thank you for reading this Azure SDK blog post! We hope that you learned something new and welcome you to share this post. We are open to Azure SDK blog contributions. Please contact us at azsdkblog@microsoft.com with your topic and we’ll get you set up as a guest blogger.

Author

Rick Ley
Software Engineer

Software Developer for Microsoft working on Azure Storage Developer Experience
