{"id":3825,"date":"2026-05-19T08:00:07","date_gmt":"2026-05-19T15:00:07","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/azure-sdk\/?p=3825"},"modified":"2026-07-05T22:43:54","modified_gmt":"2026-07-06T05:43:54","slug":"eliminate-llm-cold-starts-load-models-up-to-6x-faster-with-azure-blob-storage-and-runai-model-streamer","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/azure-sdk\/eliminate-llm-cold-starts-load-models-up-to-6x-faster-with-azure-blob-storage-and-runai-model-streamer\/","title":{"rendered":"Eliminate LLM Cold starts: Load models up to 6x Faster with Azure Blob Storage and Run:AI Model Streamer"},"content":{"rendered":"<p><a href=\"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-content\/uploads\/sites\/58\/2026\/05\/RunAIPerfGraph-title.webp\"><img decoding=\"async\" class=\"alignnone size-medium wp-image-3830\" src=\"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-content\/uploads\/sites\/58\/2026\/05\/RunAIPerfGraph-title-300x158.webp\" alt=\"RunAIPerfGraph title image\" width=\"300\" height=\"158\" srcset=\"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-content\/uploads\/sites\/58\/2026\/05\/RunAIPerfGraph-title-300x158.webp 300w, https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-content\/uploads\/sites\/58\/2026\/05\/RunAIPerfGraph-title-1024x538.webp 1024w, https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-content\/uploads\/sites\/58\/2026\/05\/RunAIPerfGraph-title-768x403.webp 768w, https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-content\/uploads\/sites\/58\/2026\/05\/RunAIPerfGraph-title.webp 1200w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p><em>Stop paying for idle GPUs while model weights copy to disk. Stream them straight into GPU memory instead with Run:AI Streamer from Azure Blob Storage. 
<\/em><\/p>\n<h2>The Problem: Every Cold Start Costs You More Than Money<\/h2>\n<p>GPU compute is among the most expensive cloud infrastructure, and every second a GPU is allocated but unavailable for serving is real money lost. The cost also goes beyond your Azure bill: slow cold starts can delay responses, stress SLAs, and degrade user experience during traffic spikes, when users need capacity most.<\/p>\n<p>In many conventional inference deployments, a cold start triggered by auto-scaling, spot eviction, rolling deploy, restart, or model swap follows the same basic pattern: fetch model weights from object storage to local disk, then load them into GPU memory. In our tests, a 232.8 GiB model took roughly 3 to 5 minutes of allocated GPU capacity with the default vLLM loader, before the replica could serve requests.<\/p>\n<p>Cold starts are not rare in production. Auto-scalers add replicas during spikes, spot VMs can be reclaimed, rolling deploys eventually touch every replica, and multi-tenant serving systems may swap models on demand. Each event can pay the same download-then-load tax unless the serving path is designed to avoid it.<\/p>\n<p>While a large model is moving from object storage to local disk, then into GPU memory, several problems can stack up at once:<\/p>\n<ul>\n<li><strong>The replicas already running absorb all traffic.<\/strong> Queues grow, responses slow down, and the slowest users wait even longer.<\/li>\n<li><strong>The autoscaler can continue adding replicas<\/strong> based on lagging capacity signals. 
Each new replica also needs time to load, so usable capacity arrives after the spike has already hurt latency.<\/li>\n<li><strong>Some requests can start timing out.<\/strong> If queues grow past common 30-to-60-second gateway or client timeouts, users see errors and may retry, adding more pressure.<\/li>\n<li><strong>Each restart adds another unavailable window.<\/strong> A rolling deployment across many replicas can stretch into a long operational event, and a spot reclaim leaves that replica unavailable until loading completes.<\/li>\n<\/ul>\n<p>Run:AI Model Streamer reduces that load window from minutes to seconds in our benchmark, which gives autoscaling, rollout, and recovery systems a much better chance of absorbing the event before the queue cascade starts.<\/p>\n<p><strong>The default loader runs two sequential steps:<\/strong> download the full model from Azure Blob Storage to local disk, then read it from disk into GPU memory. The GPU sits idle through both, and local disk becomes an extra copy stop and a bandwidth bottleneck.<\/p>\n<p><strong>Run:AI Model Streamer skips the local disk hop<\/strong> for model weights. It reads model data from Azure Blob Storage through CPU memory into GPU memory. Removing the extra copy step and local disk bottleneck <strong>lets the replica start serving in seconds<\/strong> rather than minutes in our benchmark, reducing the most expensive idle window in a cold start.<\/p>\n<p>The <a href=\"https:\/\/github.com\/run-ai\/runai-model-streamer\">Run:AI Model Streamer<\/a> is now natively wired into two widely used open-source inference engines: <a href=\"https:\/\/docs.vllm.ai\/\"><strong>vLLM<\/strong><\/a> (a fast and easy-to-use library for LLM inference and serving) and <a href=\"https:\/\/sglang.io\/\"><strong>SGLang<\/strong><\/a> (a high-performance serving framework for large language models and multimodal models). 
Both engines stream weights directly from Azure Blob Storage via <code>az:\/\/<\/code> URIs, so users on Azure can start serving requests in seconds rather than minutes.<\/p>\n<h2>Performance: Why Streaming Beats Downloading<\/h2>\n<p>Production autoscalers typically run on tens-of-seconds polling cadences, often once every 30 to 60 seconds (e.g., <a href=\"https:\/\/keda.sh\/docs\/latest\/reference\/scaledobject-spec\/\">KEDA<\/a>, <a href=\"https:\/\/huggingface.co\/docs\/inference-endpoints\/autoscaling\">Hugging Face Inference Endpoints<\/a>). A cold start that runs several minutes longer than those cycles leaves the autoscaler reacting to traffic that has already moved on, and the cascade from the previous section kicks in. Below are the numbers on a <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/virtual-machines\/sizes\/gpu-accelerated\/ndh100v5-series\">Standard_ND96isr_H100_v5<\/a> VM (8x NVIDIA H100 80 GB, 80 Gbps network) streaming from a Premium block blob storage account in the same region.<\/p>\n<h3>vLLM Model Load Times: Run:ai Streamer vs. 
Default Loader<\/h3>\n<table style=\"width: 98.1055%; height: 78px;\">\n<thead>\n<tr style=\"height: 48px;\">\n<th style=\"height: 48px; width: 23.5067%;\">Model<\/th>\n<th style=\"height: 48px; width: 16.7494%;\">Run:ai Streamer (s)<\/th>\n<th style=\"height: 48px; width: 21.2972%;\">Default vLLM Loader (s)<\/th>\n<th style=\"height: 48px; width: 10.4062%;\">Speedup<\/th>\n<th style=\"height: 48px; width: 99.755%;\">Loads within one autoscaler cycle?<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr style=\"height: 96px;\">\n<td style=\"height: 10px; width: 23.5067%;\">Meta-Llama-3.1-8B-Instruct (14.99 GiB)<\/td>\n<td style=\"height: 10px; width: 16.7494%; text-align: center;\"><strong>3.61 +\/- 0.17<\/strong><\/td>\n<td style=\"height: 10px; width: 21.2972%; text-align: center;\">15.48 +\/- 8.69<\/td>\n<td style=\"height: 10px; width: 10.4062%; text-align: center;\">~4.3x<\/td>\n<td style=\"height: 10px; width: 99.755%;\">Default: Yes;\u00a0Streamer: Yes<\/td>\n<\/tr>\n<tr style=\"height: 96px;\">\n<td style=\"height: 10px; width: 23.5067%;\">GPT-OSS-120B (60.8 GiB)<\/td>\n<td style=\"height: 10px; width: 16.7494%; text-align: center;\"><strong>12.76 +\/- 1.11<\/strong><\/td>\n<td style=\"height: 10px; width: 21.2972%; text-align: center;\">42.29 +\/- 25.96<\/td>\n<td style=\"height: 10px; width: 10.4062%; text-align: center;\">~3.3x<\/td>\n<td style=\"height: 10px; width: 99.755%;\">Default: Almost;\u00a0Streamer: Yes<\/td>\n<\/tr>\n<tr style=\"height: 48px;\">\n<td style=\"height: 10px; width: 23.5067%;\">Qwen3.5-122B-A10B (232.8 GiB)<\/td>\n<td style=\"height: 10px; width: 16.7494%; text-align: center;\"><strong>37.14 +\/- 0.79<\/strong><\/td>\n<td style=\"height: 10px; width: 21.2972%; text-align: center;\">225.57 +\/- 81.00<\/td>\n<td style=\"height: 10px; width: 10.4062%; text-align: center;\">~6.1x<\/td>\n<td style=\"height: 10px; width: 99.755%;\">Default: No;\u00a0Streamer: Yes<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Each configuration was run 5 times under 
cold-start conditions; results are averages +\/- standard deviation.<\/p>\n<p><strong>Key takeaways:<\/strong><\/p>\n<ul>\n<li><strong>Cold starts fit one autoscaler cycle, not just run faster.<\/strong> On the 233 GiB Qwen model, the default loader averages ~3.7 minutes with 80-second swings. That triggers the cascade above (queue buildup, over-provisioning, 5xx on tight gateways). The streamer averages ~37 seconds with sub-second variance, so the autoscaler&#8217;s next decision sees the new replica online and traffic redistributes cleanly.<\/li>\n<li><strong>Consistent, saturated throughput.<\/strong> The streamer holds a steady 80 Gbps across the full load. The default loader peaks at ~40 Gbps and drops as low as ~10 Gbps on the largest model, leaving half to seven-eighths of the network pipe idle.<\/li>\n<li><strong>Speedups grow with model size.<\/strong> The streamer&#8217;s lead widens from ~4.3x at 15 GiB to ~6.1x at 233 GiB. The bigger the model, the more disk-hop overhead the streamer skips, which is exactly the regime where cold starts matter most for autoscaling.<\/li>\n<\/ul>\n<p>For full benchmark methodology and detailed results, see the <a href=\"https:\/\/gist.github.com\/kyleknap\/bfc194b159fdcf5263daed6a89f28662\">complete benchmark report<\/a>.<\/p>\n<h2>Quick Start: Serve Models from Azure Blob Storage<\/h2>\n<p>Both <strong>vLLM<\/strong> and <strong>SGLang<\/strong> support streaming directly from Azure Blob Storage via <code>az:\/\/<\/code> URIs. Here is how to get started.<\/p>\n<h3>Prerequisites: Get Model Weights into Azure Blob Storage<\/h3>\n<p>Before streaming, you need SafeTensors model weights in an Azure Blob Storage container. Here&#8217;s the quickest path:<\/p>\n<p><strong>1. Download the model from Hugging Face:<\/strong><\/p>\n<pre><code class=\"language-bash\">huggingface-cli download meta-llama\/Llama-3.1-8B-Instruct --local-dir llama-3.1-8b<\/code><\/pre>\n<p><strong>2. 
Upload to your Azure Blob Storage container with azcopy:<\/strong><\/p>\n<pre><code class=\"language-bash\">cd llama-3.1-8b\r\nazcopy copy . \"https:\/\/&lt;your_account&gt;.blob.core.windows.net\/&lt;your-container&gt;\/models\/\" --recursive<\/code><\/pre>\n<p>The model weights are now accessible at <code>az:\/\/&lt;your-container&gt;\/models\/llama-3.1-8b<\/code> and ready to stream. Repeat for any model you want to serve.<\/p>\n<p><strong>Note:<\/strong> Paths use the form <code>az:\/\/&lt;container&gt;\/&lt;path&gt;<\/code>; the storage account is passed separately via the <code>AZURE_STORAGE_ACCOUNT_NAME<\/code> environment variable. The streamer requires SafeTensors weights (the default on Hugging Face), so make sure your model includes <code>.safetensors<\/code> files, not just <code>.bin<\/code>.<\/p>\n<h3>Using with vLLM<\/h3>\n<p><strong>Install vLLM with Run:AI support:<\/strong><\/p>\n<pre><code class=\"language-bash\">uv pip install \"vllm[runai]\"<\/code><\/pre>\n<p><strong>Set your storage account once for all subsequent commands:<\/strong><\/p>\n<pre><code class=\"language-bash\">export AZURE_STORAGE_ACCOUNT_NAME=\"&lt;your_account_name&gt;\"<\/code><\/pre>\n<p><strong>Serve a model directly from Azure Blob Storage:<\/strong><\/p>\n<pre><code class=\"language-bash\">vllm serve az:\/\/&lt;your-container&gt;\/models\/llama-3.1-8b \\\r\n  --load-format runai_streamer<\/code><\/pre>\n<p>No local copies, no staging scripts.<\/p>\n<p>For multi-GPU serving, enable distributed streaming and optionally tune concurrency and memory limits via <code>--model-loader-extra-config<\/code>:<\/p>\n<pre><code class=\"language-bash\">vllm serve az:\/\/&lt;your-container&gt;\/models\/llama-3.1-405b \\\r\n  --load-format runai_streamer \\\r\n  --tensor-parallel-size 8 \\\r\n  --model-loader-extra-config '{\"distributed\": true, \"concurrency\": 32}'<\/code><\/pre>\n<p><strong>Tip:<\/strong> <code>concurrency<\/code> controls parallel download streams. 
Higher values (32, sometimes 64) better saturate high-throughput NICs.<\/p>\n<p>Authentication uses <code>DefaultAzureCredential<\/code>, which supports <code>az login<\/code>, managed identity, environment variables (<code>AZURE_CLIENT_ID<\/code>, <code>AZURE_TENANT_ID<\/code>, <code>AZURE_CLIENT_SECRET<\/code>), and other methods, so it works out of the box on AKS, Azure ML, and VMs with managed identity.<\/p>\n<p>For full details, see the <a href=\"https:\/\/docs.vllm.ai\/en\/stable\/models\/extensions\/runai_model_streamer\/\">vLLM Run:AI Model Streamer documentation<\/a>.<\/p>\n<h3>Using with SGLang<\/h3>\n<p>SGLang also supports <a href=\"https:\/\/docs.sglang.io\/advanced_features\/object_storage.html\">loading models directly from object storage<\/a> via the Run:AI Model Streamer, including Azure Blob Storage with <code>az:\/\/<\/code> URIs.<\/p>\n<p><strong>Install SGLang with Run:AI support:<\/strong><\/p>\n<pre><code class=\"language-bash\">uv pip install \"sglang[runai]\" --prerelease=allow<\/code><\/pre>\n<p><strong>Start the server pointing at your Azure Blob model path:<\/strong><\/p>\n<pre><code class=\"language-bash\">export AZURE_STORAGE_ACCOUNT_NAME=\"&lt;your_account_name&gt;\"\r\npython -m sglang.launch_server \\\r\n  --model-path az:\/\/&lt;your-container&gt;\/models\/llama-3.1-8b \\\r\n  --load-format runai_streamer \\\r\n  --served-model-name llama-3.1-8b<\/code><\/pre>\n<p>For multi-GPU setups, SGLang supports distributed streaming to parallelize weight loading across devices. 
Enable it with tensor parallelism and the distributed flag:<\/p>\n<pre><code class=\"language-bash\">python -m sglang.launch_server \\\r\n  --model-path az:\/\/&lt;your-container&gt;\/models\/llama-3.1-405b \\\r\n  --load-format runai_streamer \\\r\n  --tp 8 \\\r\n  --model-loader-extra-config '{\"distributed\": true, \"concurrency\": 32}'<\/code><\/pre>\n<p>SGLang uses a two-phase approach: metadata files (config, tokenizer) are downloaded once to a local cache, while model weights are streamed directly from Azure Blob Storage into GPU memory on demand.<\/p>\n<h2>Tuning Performance<\/h2>\n<p>The streamer exposes environment variables for tuning performance across all storage backends, including Azure Blob Storage:<\/p>\n<table>\n<thead>\n<tr>\n<th>Variable<\/th>\n<th>Default<\/th>\n<th>Description<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>RUNAI_STREAMER_CONCURRENCY<\/code><\/td>\n<td>8 (object storage) \/ 16 (filesystem)<\/td>\n<td>Number of concurrent I\/O threads<\/td>\n<\/tr>\n<tr>\n<td><code>RUNAI_STREAMER_MEMORY_LIMIT<\/code><\/td>\n<td>-1 (distributed) \/ 40 GiB (non-distributed)<\/td>\n<td>CPU buffer memory cap; set to a byte value to limit<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Increasing concurrency (for example, from the default 8 to 32 or 64) can improve throughput, especially when streaming from Azure Blob Storage where the backend can handle high parallelism. Retries and timeouts use Azure SDK defaults automatically. See the <a href=\"https:\/\/github.com\/run-ai\/runai-model-streamer\/blob\/master\/docs\/src\/usage.md\">streamer usage docs<\/a> for the full list of configuration options.<\/p>\n<h2>Getting Involved<\/h2>\n<p>The Run:AI Model Streamer is fully open source under the Apache 2.0 license. 
The Azure Blob Storage plugin was developed in collaboration between Run:AI and Microsoft to bring first-class streaming support to Azure customers.<\/p>\n<ul>\n<li><strong>GitHub:<\/strong> <a href=\"https:\/\/github.com\/run-ai\/runai-model-streamer\">run-ai\/runai-model-streamer<\/a><\/li>\n<li><strong>PyPI:<\/strong> <code>pip install runai-model-streamer[azure]<\/code><\/li>\n<li><strong>Azure Blob PR:<\/strong> <a href=\"https:\/\/github.com\/run-ai\/runai-model-streamer\/pull\/116\">#116<\/a>, the full implementation story and review discussion<\/li>\n<\/ul>\n<p>For deeper technical details on the streaming architecture, see the <a href=\"https:\/\/github.com\/run-ai\/runai-model-streamer\">Run:AI Model Streamer documentation<\/a>.<\/p>\n<p><em>Have questions or want to share your benchmarks? Open an issue on <a href=\"https:\/\/github.com\/run-ai\/runai-model-streamer\/issues\">GitHub<\/a>.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Stop paying for idle GPUs while model weights copy to disk. Stream them straight into GPU memory instead with Run:AI Streamer from Azure Blob Storage. The Problem: Every Cold Start Costs You More Than Money GPU compute is among the most expensive cloud infrastructure, and every second a GPU is allocated but unavailable for serving [&hellip;]<\/p>\n","protected":false},"author":199674,"featured_media":3928,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-3825","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-azure-sdk"],"acf":[],"blog_post_summary":"<p>Stop paying for idle GPUs while model weights copy to disk. Stream them straight into GPU memory instead with Run:AI Streamer from Azure Blob Storage. 
The Problem: Every Cold Start Costs You More Than Money GPU compute is among the most expensive cloud infrastructure, and every second a GPU is allocated but unavailable for serving [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/posts\/3825","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/users\/199674"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/comments?post=3825"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/posts\/3825\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/media\/3928"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/media?parent=3825"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/categories?post=3825"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-json\/wp\/v2\/tags?post=3825"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}