{"id":3443,"date":"2025-06-26T08:00:12","date_gmt":"2025-06-26T15:00:12","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/azure-sdk\/?p=3443"},"modified":"2025-06-25T14:31:54","modified_gmt":"2025-06-25T21:31:54","slug":"introducing-the-azure-storage-connector-for-pytorch","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/azure-sdk\/introducing-the-azure-storage-connector-for-pytorch\/","title":{"rendered":"Introducing the Azure Storage Connector for PyTorch"},"content":{"rendered":"<p>We&#8217;re excited to introduce the Azure Storage Connector for PyTorch (<code>azstoragetorch<\/code>), a new library that brings seamless, performance-optimized integration between Azure Storage and PyTorch. The library makes it easy to access and store data in Azure Blob Storage directly within your training workflows.<\/p>\n<h2>What is PyTorch?<\/h2>\n<p><a href=\"https:\/\/pytorch.org\/\">PyTorch<\/a> is a widely used open-source machine learning framework, known for its flexibility and strong support for both research and production deployments. Training models with PyTorch often involves handling large volumes of data. This can include loading massive datasets, saving and restoring model checkpoints, and managing data pipelines. 
These pipelines can live across environments like local machines, cloud virtual machines, and distributed compute clusters, adding complexity.<\/p>\n<h2>How can the Azure Storage Connector for PyTorch help?<\/h2>\n<p>The Azure Storage Connector for PyTorch (<code>azstoragetorch<\/code>) bridges the powerful storage capabilities of Azure Storage with PyTorch, enabling seamless integration with key PyTorch APIs in your model training workflows.<\/p>\n<p>The package supports saving and loading model checkpoints directly to and from Azure Storage with <code>torch.save()<\/code> and <code>torch.load()<\/code>, as well as loading data from Azure Storage into PyTorch <code>Dataset<\/code> classes.<\/p>\n<h2>Use the Azure Storage Connector for PyTorch<\/h2>\n<h3>Installation<\/h3>\n<p>The Azure Storage Connector for PyTorch is published on PyPI, and you can install it with your favorite package manager. This example uses <code>pip<\/code>:<\/p>\n<pre><code class=\"language-bash\">pip install azstoragetorch<\/code><\/pre>\n<p>This installs the <code>azstoragetorch<\/code> library along with its dependencies, including <code>torch<\/code> and <code>azure-storage-blob<\/code>.<\/p>\n<h3>Authentication<\/h3>\n<p>The Azure Storage Connector for PyTorch package&#8217;s interfaces default to the Azure Identity library&#8217;s <a href=\"https:\/\/learn.microsoft.com\/python\/api\/azure-identity\/azure.identity.defaultazurecredential?view=azure-python\"><code>DefaultAzureCredential<\/code><\/a>, which automatically retrieves Microsoft Entra ID tokens based on your current environment. 
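<\/p>\n<p>If you need to override the default credential chain, the library&#8217;s interfaces also accept an optional <code>credential<\/code> keyword argument. The following is a minimal sketch (assuming the <code>credential<\/code> keyword argument and a placeholder account and container name):<\/p>\n<pre><code class=\"language-python\">from azure.identity import AzureCliCredential\r\nfrom azstoragetorch.io import BlobIO\r\n\r\n# Placeholder container URL\r\nCONTAINER_URL = \"https:\/\/&lt;my-storage-account-name&gt;.blob.core.windows.net\/&lt;my-container-name&gt;\"\r\n\r\n# Pass an explicit credential instead of relying on DefaultAzureCredential\r\nwith BlobIO(f\"{CONTAINER_URL}\/example.txt\", \"rb\", credential=AzureCliCredential()) as f:\r\n    data = f.read()<\/code><\/pre>\n<p>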
For more information, see the <a href=\"http:\/\/aka.ms\/azsdk\/python\/identity\/credential-chains#defaultazurecredential-overview\">DefaultAzureCredential overview<\/a>.<\/p>\n<p>As long as one of the credentials in the <code>DefaultAzureCredential<\/code> chain is available on your machine, authentication is handled automatically and securely, and you&#8217;re ready to start using the library.<\/p>\n<h3>Save and Load a Model Checkpoint<\/h3>\n<p>The Azure Storage Connector for PyTorch provides the <a href=\"https:\/\/azure.github.io\/azure-storage-for-pytorch\/api.html#azstoragetorch.io.BlobIO\"><code>azstoragetorch.io.BlobIO<\/code><\/a> class to save and load models directly to and from Azure Blob Storage. This class implements the file-like object interface that <code>torch.save()<\/code> and <code>torch.load()<\/code> accept for model checkpointing in PyTorch.<\/p>\n<p>The <code>BlobIO<\/code> class takes an Azure Storage Blob URL and the mode to operate in &#8211; either write (<code>\"wb\"<\/code>) for saving or read (<code>\"rb\"<\/code>) for loading. Because <code>BlobIO<\/code> is a file-like object, it can be managed safely with the <code>with<\/code> statement.<\/p>\n<pre><code class=\"language-python\">import torch\r\nimport torchvision.models # Install this separately, e.g. 
pip install torchvision\r\nfrom azstoragetorch.io import BlobIO\r\n\r\n# The URL to your container\r\nCONTAINER_URL = \"https:\/\/&lt;my-storage-account-name&gt;.blob.core.windows.net\/&lt;my-container-name&gt;\"\r\n\r\n# Your model of choice, in this case ResNet18 for image recognition\r\nmodel = torchvision.models.resnet18(weights=\"DEFAULT\")\r\n\r\n# Save model weights to an Azure Storage Blob named \"model_weights.pth\" in the container\r\nwith BlobIO(f\"{CONTAINER_URL}\/model_weights.pth\", \"wb\") as f:\r\n    torch.save(model.state_dict(), f)\r\n\r\n# Load model weights from model_weights.pth in the Azure Storage Container\r\nwith BlobIO(f\"{CONTAINER_URL}\/model_weights.pth\", \"rb\") as f:\r\n    model.load_state_dict(torch.load(f))<\/code><\/pre>\n<h3>Sample: Use PyTorch Datasets on Azure Storage<\/h3>\n<p>PyTorch has two primitives for loading data samples, <a href=\"https:\/\/docs.pytorch.org\/tutorials\/beginner\/basics\/data_tutorial.html#datasets-dataloaders\"><code>Dataset<\/code> and <code>DataLoader<\/code><\/a>. The Azure Storage Connector for PyTorch provides implementations for both PyTorch dataset types, <a href=\"https:\/\/docs.pytorch.org\/docs\/stable\/data.html#map-style-datasets\">map-style<\/a> and <a href=\"https:\/\/docs.pytorch.org\/docs\/stable\/data.html#iterable-style-datasets\">iterable-style<\/a>.<\/p>\n<p>The <a href=\"https:\/\/azure.github.io\/azure-storage-for-pytorch\/api.html#azstoragetorch.datasets.BlobDataset\"><code>azstoragetorch.datasets.BlobDataset<\/code><\/a> class is a map-style dataset, enabling random access to data samples. 
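<\/p>\n<p>As a minimal sketch (with placeholder blob URLs), a map-style dataset supports <code>len()<\/code> and integer indexing:<\/p>\n<pre><code class=\"language-python\">from azstoragetorch.datasets import BlobDataset\r\n\r\n# Placeholder blob URLs\r\nblob_urls = [\r\n    \"https:\/\/&lt;account&gt;.blob.core.windows.net\/&lt;container&gt;\/image_0001.jpg\",\r\n    \"https:\/\/&lt;account&gt;.blob.core.windows.net\/&lt;container&gt;\/image_0002.jpg\",\r\n]\r\ndataset = BlobDataset.from_blob_urls(blob_urls)\r\n\r\nprint(len(dataset))  # number of blobs in the dataset\r\nsample = dataset[0]  # random access to any sample by index<\/code><\/pre>\n<p>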
The <a href=\"https:\/\/azure.github.io\/azure-storage-for-pytorch\/api.html#azstoragetorch.datasets.IterableBlobDataset\"><code>azstoragetorch.datasets.IterableBlobDataset<\/code><\/a> class is an iterable-style dataset, which is better suited for large datasets that may not fit in memory.<\/p>\n<p>Both classes provide two factory methods: <code>from_container_url()<\/code> and <code>from_blob_urls()<\/code>. The <code>from_container_url()<\/code> method instantiates a dataset by listing blobs in a container, and the <code>from_blob_urls()<\/code> method creates a dataset from a list of blob URLs.<\/p>\n<p>These <code>Dataset<\/code> integrations fit naturally into a PyTorch workflow. Let&#8217;s dive into an image example using the <a href=\"https:\/\/data.caltech.edu\/records\/mzrjq-6wc02\">caltech101<\/a> dataset and the <code>resnet18<\/code> model.<\/p>\n<p>The caltech101 dataset contains about 9,000 images across various categories, and the <code>resnet18<\/code> model is a residual neural network introduced in the paper <a href=\"https:\/\/arxiv.org\/abs\/1512.03385\">Deep Residual Learning for Image Recognition<\/a>.<\/p>\n<h4>Prerequisites and Setup<\/h4>\n<p>First, install the prerequisite packages <code>azstoragetorch<\/code>, <code>pillow<\/code>, and <code>torchvision<\/code>:<\/p>\n<pre><code class=\"language-bash\">pip install azstoragetorch pillow torchvision<\/code><\/pre>\n<p>Once the packages are installed, ensure the caltech101 dataset is in your Azure Storage account. To ease this setup, clone the <a href=\"https:\/\/github.com\/Azure\/azure-storage-for-pytorch\/tree\/main\">GitHub repository<\/a> and run the provided <a href=\"https:\/\/github.com\/Azure\/azure-storage-for-pytorch\/blob\/main\/samples\/intro_notebook\/bootstrap.py\">setup script<\/a>.<\/p>\n<p>Lastly, it&#8217;s helpful to understand how dataset output is transformed for PyTorch operations. 
By default, each dataset sample in the <code>azstoragetorch<\/code> package is a dictionary representing a blob in the dataset. It&#8217;s often necessary to transform this data into the shape needed for your PyTorch workflows.<\/p>\n<p>To override the default output, we can pass a <code>transform<\/code> callable to the <code>from_blob_urls<\/code> or <code>from_container_url<\/code> methods; the callable accepts a single argument of type <a href=\"https:\/\/azure.github.io\/azure-storage-for-pytorch\/api.html#azstoragetorch.datasets.Blob\"><code>azstoragetorch.datasets.Blob<\/code><\/a>. The transform callable in the following code sample is based on the <a href=\"https:\/\/pytorch.org\/hub\/pytorch_vision_resnet\/\">PyTorch documentation<\/a>. To learn more about using the <code>transform<\/code> callable in the <code>azstoragetorch<\/code> library, visit the <a href=\"https:\/\/azure.github.io\/azure-storage-for-pytorch\/user-guide.html#transforming-dataset-output\">documentation<\/a>.<\/p>\n<pre><code class=\"language-python\">from torch.utils.data import DataLoader\r\nimport torchvision.transforms\r\nfrom PIL import Image\r\n\r\nfrom azstoragetorch.datasets import BlobDataset\r\n\r\n# Transform a blob into an (image category, torch.Tensor) pair\r\n# To learn more about why these particular transforms were chosen,\r\n# visit the documentation site: https:\/\/pytorch.org\/hub\/pytorch_vision_resnet\/\r\ndef blob_to_category_and_tensor(blob):\r\n    with blob.reader() as f:\r\n        img = Image.open(f).convert(\"RGB\")\r\n        img_transform = torchvision.transforms.Compose([\r\n            torchvision.transforms.Resize(256),\r\n            torchvision.transforms.CenterCrop(224),\r\n            torchvision.transforms.ToTensor(),\r\n            torchvision.transforms.Normalize(\r\n                mean=[0.485, 0.456, 0.406],\r\n                std=[0.229, 0.224, 0.225]\r\n            ),\r\n        ])\r\n        img_tensor = img_transform(img)\r\n\r\n    # Get second to last 
component of blob name which will be the image category\r\n    # Example: blob.blob_name -&gt; datasets\/caltech101\/dalmatian\/image_0001.jpg\r\n    # category -&gt; dalmatian\r\n    category = blob.blob_name.split(\"\/\")[-2]\r\n    return category, img_tensor\r\n\r\n# The URL to your container\r\nCONTAINER_URL = \"https:\/\/&lt;my-storage-account-name&gt;.blob.core.windows.net\/&lt;my-container-name&gt;\"\r\n\r\n# Initialize dataset with the azstoragetorch library\r\ndataset = BlobDataset.from_container_url(\r\n    CONTAINER_URL,\r\n    prefix=\"datasets\/caltech101\",\r\n    transform=blob_to_category_and_tensor,\r\n)\r\n\r\n# Set up data loader\r\nloader = DataLoader(dataset, batch_size=32)\r\n\r\nfor categories, img_tensors in loader:\r\n    print(categories, img_tensors.size())<\/code><\/pre>\n<h2>Conclusion<\/h2>\n<p>The Azure Storage Connector for PyTorch is designed around the principle that cloud storage integration for your ML workflows shouldn&#8217;t require learning new paradigms. 
Key features include:<\/p>\n<ul>\n<li>Zero configuration: Automatic credential discovery means no setup code<\/li>\n<li>Familiar patterns: If you know <code>open()<\/code> and PyTorch datasets, you already know this library<\/li>\n<li>Framework integration: Direct compatibility with <code>torch.save()<\/code>, <code>torch.load()<\/code>, and <code>DataLoader<\/code><\/li>\n<li>Flexible access: Support for both container-wide and specific blob\/object access patterns for reads and writes<\/li>\n<li>Debugging friendly: Clear error messages and standard Python exceptions<\/li>\n<\/ul>\n<p>Install <code>azstoragetorch<\/code> today to enable several machine learning use cases with your data stored in Azure Blob Storage:<\/p>\n<ul>\n<li>Distributed training: Save and load model checkpoints across multiple nodes without shared file systems<\/li>\n<li>Model sharing: Easily share trained models across teams and environments<\/li>\n<li>Dataset management: Access large datasets stored in Azure Blob Storage without local storage constraints<\/li>\n<li>Experimentation: Quickly iterate on different models and datasets without data movement overhead<\/li>\n<\/ul>\n<p>The Azure Storage Connector for PyTorch is in Public Preview, and we&#8217;re actively seeking feedback from the community. 
The project is open source and available on GitHub, where we&#8217;d love to hear your feedback, feature requests, and ideas for future integrations.<\/p>\n<h2>Resources<\/h2>\n<ul>\n<li><a href=\"https:\/\/aka.ms\/azstoragetorch\">Azure Storage Connector for PyTorch (azstoragetorch) (Preview)<\/a><\/li>\n<li><a href=\"https:\/\/aka.ms\/azstoragetorch\/samples\">azstoragetorch Samples and Quickstart<\/a><\/li>\n<li><a href=\"https:\/\/build.microsoft.com\/sessions\/BRK192\">Data-intensive AI Training and Inferencing with Azure Blob Storage<\/a><\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/azure\/storage\/blobs\/storage-quickstart-blobs-python\">Quickstart: Azure Blob Storage client library for Python<\/a><\/li>\n<\/ul>\n<p>For feature requests, bug reports, or general support, <a href=\"https:\/\/github.com\/Azure\/azure-storage-for-pytorch\/issues\">open an issue<\/a> in the repository on GitHub.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post announces the Azure Storage Connector for PyTorch (azstoragetorch), integrating files from Azure Blob Storage into your PyTorch training pipeline.<\/p>\n","protected":false},"author":98329,"featured_media":3444,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[162,948,732,738],"class_list":["post-3443","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-azure-sdk","tag-python","tag-pytorch","tag-release","tag-storage"],"acf":[],"blog_post_summary":"<p>This post announces the Azure Storage Connector for PyTorch (azstoragetorch), integrating files from Azure Blob Storage into your PyTorch training 
pipeline.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/posts\/3443","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/users\/98329"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/comments?post=3443"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/posts\/3443\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/media\/3444"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/media?parent=3443"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/categories?post=3443"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/tags?post=3443"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}