May 21st, 2025

Foundry Local: A New Era of Edge AI

At Microsoft Build 2025, Microsoft unveiled its groundbreaking Foundry Local solution for edge devices—an efficient platform specifically designed for local AI inference. As a critical component of Microsoft’s AI strategy, Foundry Local empowers developers to smoothly deploy and run Small Language Models (SLMs) on resource-constrained edge devices, opening new possibilities for the convergence of edge computing and artificial intelligence.

Core Architecture and Technical Advantages

Technical Foundation: The ONNX Ecosystem

Foundry Local is built on ONNX (Open Neural Network Exchange)—a mature, open standard for model interoperability. As a widely recognized model exchange format in machine learning and deep learning, ONNX brings significant advantages to Foundry Local:

  • Broad Compatibility: Supports models converted from various deep learning frameworks (PyTorch, TensorFlow, JAX, etc.)
  • Cross-Platform Optimization: Delivers highly optimized inference performance across different hardware architectures (CPU, GPU, NPU)
  • Rich Tooling Ecosystem: Leverages mature tools like Microsoft Olive for model optimization and quantization
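
To make the interoperability point concrete, here is a minimal, hypothetical example of exporting a tiny PyTorch model to the ONNX format that Foundry Local consumes. Production SLM conversion is handled by Microsoft Olive (covered later in this post); this sketch only illustrates the framework-to-ONNX step.

# Minimal sketch: export a toy PyTorch model to ONNX (hypothetical model, not an SLM)
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

    def forward(self, x):
        return self.net(x)

model = TinyClassifier().eval()
dummy_input = torch.randn(1, 16)

# Export to ONNX so the model can be served by ONNX Runtime-based stacks
torch.onnx.export(
    model,
    dummy_input,
    "tiny_classifier.onnx",
    input_names=["features"],
    output_names=["logits"],
    dynamic_axes={"features": {0: "batch"}, "logits": {0: "batch"}},
)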

Comprehensive Development Toolkit

Foundry Local offers a one-stop development experience:

  • Diverse Interface Options
    • Command Line Interface (CLI): Provides powerful model management, deployment, and testing capabilities
    • Multi-language SDKs: Currently supports Node.js and Python, offering native programming experiences
    • RESTful API: Standardized, OpenAI-compatible interface for seamless integration with a wide range of applications (a quick REST call sketch follows this list)
  • Developer-Optimized Experience
    • Clean, intuitive API design
    • Comprehensive documentation and code examples
    • Built-in model management and monitoring tools
  • Edge-First Performance Optimization
    • Memory utilization optimized for resource-constrained environments
    • Intelligent caching and inference acceleration techniques
    • Flexible deployment options supporting various device configurations
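
As a quick illustration of the REST option, the sketch below calls the OpenAI-compatible chat completions endpoint directly with the requests library. It uses FoundryLocalManager only to discover the local endpoint and API key (mirroring what the OpenAI client does with base_url in the SDK example later in this post); the model alias is a placeholder for whichever model you have loaded.

# Rough sketch: call the OpenAI-compatible REST endpoint directly
import requests
from foundry_local import FoundryLocalManager

manager = FoundryLocalManager()

response = requests.post(
    f"{manager.endpoint}/chat/completions",   # endpoint discovered from the local service
    headers={"Authorization": f"Bearer {manager.api_key}"},
    json={
        "model": "phi-3.5-mini",  # placeholder alias for a loaded model
        "messages": [{"role": "user", "content": "Summarize edge AI in one sentence."}],
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])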

Breakthrough Technical Advantages

Foundry Local brings four key advantages to edge AI applications:

  • Ultra-Low Latency Experience
    • Local inference eliminates network communication overhead
    • Millisecond-level response times, enabling native-like fluid interactions
    • Adaptive batch processing optimization to increase throughput
  • Complete Offline Capability
    • No continuous internet connection required, suitable for network-limited environments
    • Full functionality maintained in offline states
    • Ideal for remote areas, disconnected devices, or environments with strict network security requirements
  • Enterprise-Grade Data Privacy
    • Sensitive data processed entirely locally, no cloud uploads required
    • Compliant with strict regulatory requirements
    • Reduced data breach risks, enhanced customer trust
  • Maximized Resource Efficiency
    • Finely tuned model quantization significantly reduces memory requirements
    • Dynamic resource allocation adapts to device load variations
    • Battery-friendly design extends edge device usage time

Building Cloud-Edge Collaborative AI Solutions

By combining Azure AI Foundry’s cloud-based Model Catalog, powerful computing resources, and unified management platform, developers can build customized EdgeAI solutions that meet various business requirements. This “cloud training, edge inference” model enables enterprises to balance computational costs, performance, and privacy requirements.

Exploring and Selecting Ideal Models

Built-in Model Library

Let’s start by exploring the pre-configured models provided by Foundry Local. With a simple CLI command, we can view all available options:

foundry model list

After executing the command above, you’ll see output similar to the following:

(Screenshot: output of the foundry model list command showing the available models)

Currently, Foundry Local natively supports multiple high-quality small language models, including:

  • Microsoft Phi Series: Phi-3, Phi-3.5, Phi-4-mini, and Phi-4-mini-reasoning, which is tuned specifically for reasoning tasks
  • Alibaba Qwen Series: Lightweight models with excellent Chinese language capabilities
  • Mistral AI Series: Open-source models that perform exceptionally well with small parameter counts

Expanding Model Selection

For enterprise applications, pre-configured models may not satisfy specific requirements. In such cases, Azure AI Foundry’s Model Catalog offers a vast selection of more than 11,000 models.

(Screenshot: the Azure AI Foundry Model Catalog)

When selecting models suitable for edge deployment, consider the following factors:

Model Parameters | Recommended Devices | Use Cases
1B–3B | Edge devices, IoT devices | Simple conversations, classification tasks
3B–7B | Edge servers, high-end devices | Complex reasoning, multimodal tasks

For EdgeAI applications, Microsoft Phi, Mistral AI, and Llama series models with parameter counts between 1B and 7B are typically ideal choices, striking a good balance between performance and resource consumption.
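
As a rough rule of thumb, you can estimate the weight footprint of a model from its parameter count and quantization precision. The snippet below is a back-of-the-envelope calculation only; real memory use is higher because of the KV cache, activations, and runtime overhead.

# Back-of-the-envelope weight footprint: parameters * bits-per-weight / 8 bytes
def weight_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / (1024 ** 3)

for params in (1, 3, 7):
    print(
        f"{params}B params: "
        f"FP16 ~ {weight_footprint_gb(params, 16):.1f} GB, "
        f"INT8 ~ {weight_footprint_gb(params, 8):.1f} GB, "
        f"INT4 ~ {weight_footprint_gb(params, 4):.1f} GB"
    )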

After selecting a base model, developers can use it directly or fine-tune it for specific tasks. This article focuses on direct usage scenarios (for detailed fine-tuning guides, please refer to my separate specialized blog post).

Efficient Model Conversion and Quantization

Foundry Local requires models in the ONNX format. To convert models obtained from Azure AI Foundry into formats suitable for edge deployment, we need to perform model conversion and quantization.

Choosing a Conversion Environment

You can perform model conversion on your local workstation or leverage Azure ML cloud environments. For large models, Azure ML is strongly recommended as it provides:

  • Flexible Computing Resources: On-demand selection of CPU or GPU for conversion
  • Scalability: Process large models without local hardware limitations
  • Pre-configured Environments: Skip complex environment setup

Using Microsoft Olive for Model Optimization

Microsoft Olive is Microsoft’s model optimization tool, specifically designed to convert various models to high-performance ONNX format. It supports optimization of mainstream models like Phi, Llama, Mistral, and Qwen.

Environment Setup:

Create a new Notebook in Azure ML Studio, select the “Python 3.10-Azure ML” environment, and install the following key dependencies:

!pip install olive-ai==0.8.0 onnxruntime-genai==0.7.1 onnxruntime==1.21.1 transformers==4.51.3

Executing Model Conversion:

Use the following command to convert the model to INT4 quantized ONNX format, significantly reducing model size while preserving inference performance:

!olive auto-opt \
  --model_name_or_path {Your Model at Azure Model Location} \
  --provider CPUExecutionProvider \
  --use_model_builder \
  --precision int4 \
  --output_path {Your ONNX Model output path} \
  --log_level 1 \
  --trust_remote_code

Tip: For edge devices, INT4 quantization typically reduces model size to approximately 25% of the original while maintaining about 95% of the quality. For scenarios that need higher accuracy, consider INT8 quantization instead, at the cost of a larger model.

Cloud Model Management and Version Control

Saving converted models to the Azure ML model registry is a best practice, as it provides version control, access management, and deployment tracking.

Model Registration and Management

Azure ML model registry enables teams to:

  • Centralized Management: Store and manage all models in a single location
  • Version Control: Track model evolution and roll back to previous versions at any time
  • Metadata Tagging: Add key information to models, such as accuracy, size, and purpose
  • Access Control: Set fine-grained permissions for security

The image below shows the model management interface in Azure ML:

(Screenshot: the model management interface in Azure ML Studio)

Here is example code for registering an ONNX model to Azure ML: https://github.com/microsoft/Build25-LAB329/blob/main/Lab329/Notebook/04.AzureML_RegisterToAzureML.ipynb
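
For reference, a minimal sketch of that registration step with the azure-ai-ml SDK might look like the following. Subscription, workspace, and model names are placeholders; see the linked notebook for the full version.

# Minimal sketch: register the converted ONNX model in the Azure ML model registry
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Register the folder produced by `olive auto-opt` as a custom model asset
model = Model(
    path="<Your ONNX Model output path>",
    type=AssetTypes.CUSTOM_MODEL,
    name="llama-3.2-1b-onnx-int4",  # placeholder model name
    description="INT4-quantized ONNX model converted with Microsoft Olive",
    tags={"precision": "int4", "format": "onnx"},
)
registered = ml_client.models.create_or_update(model)
print(registered.name, registered.version)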

Deploying Models to Edge Devices

Foundry Local has designed a straightforward deployment process, enabling developers to quickly integrate custom models into edge applications. Here’s a detailed step-by-step guide:

  1. Retrieve Optimized Models from the Cloud

First, download the optimized ONNX model from the Azure ML model registry. The following notebook shows the complete workflow: https://github.com/microsoft/Build25-LAB329/blob/main/Lab329/Notebook/05.Local_Download.ipynb
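
A minimal sketch of that download step with the azure-ai-ml SDK might look like this (names and versions are placeholders; the linked notebook shows the complete workflow):

# Minimal sketch: pull a registered model back down from the Azure ML model registry
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Downloads the model files into ./downloaded-model/<model-name>/
ml_client.models.download(
    name="llama-3.2-1b-onnx-int4",  # placeholder model name
    version="1",
    download_path="./downloaded-model",
)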

  2. Place Model Files Correctly

Place the downloaded model files in the Foundry Local model directory:

# Create directory for model
mkdir -p ./models/llama/

# Move model files to appropriate location
# Note: May need to adjust based on specific model structure
mv ./downloaded-model/* ./models/llama/

  3. Create an inference_model.json configuration file in the model directory, defining model metadata and prompt templates:

{
  "Name": "llama-3.2-1b-onnx",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are my EdgeAI assistant, help me to answer question<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{Content}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
  }
}

Configuration Notes:

  • Name: Unique identifier for the model, used to reference it in Foundry Local
  • PromptTemplate: Defines how inputs and outputs are formatted, supports custom system prompts and special tokens
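
If you prefer to generate this file programmatically (which avoids escaping mistakes around the special tokens), a small sketch like the following produces the same configuration:

# Small sketch: write inference_model.json from Python
import json

config = {
    "Name": "llama-3.2-1b-onnx",
    "PromptTemplate": {
        "assistant": "{Content}",
        "prompt": (
            "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
            "You are my EdgeAI assistant, help me to answer question<|eot_id|>"
            "<|start_header_id|>user<|end_header_id|>\n\n{Content}<|eot_id|>"
            "<|start_header_id|>assistant<|end_header_id|>\n\n"
        ),
    },
}

# json.dump escapes the newlines, matching the hand-written file above
with open("./models/llama/inference_model.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
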
  4. Verify Model Deployment

Use the Foundry Local CLI to test if the model is correctly deployed and can run:

foundry cache cd models

foundry model run llama-3.2-1b-onnx --verbose

After execution, you’ll see output similar to the image below, indicating the model has been successfully loaded and can respond to queries:

(Screenshot: foundry model run output showing the model loaded and responding to a test query)

Integrating Foundry Local into Applications

After successfully deploying the model, you can use Foundry Local in your applications through multiple approaches, including SDKs and REST APIs. Let’s explore the integration options:

Using SDK for Integration

Foundry Local provides native SDKs for Python and Node.js, allowing developers to easily integrate edge AI capabilities into existing applications. Its APIs are intentionally designed to be compatible with the OpenAI API, making migration from cloud to edge straightforward.

Example with Streaming:

import openai
from foundry_local import FoundryLocalManager

# Initialize the Foundry Local manager and get connection details
manager = FoundryLocalManager()
alias = "llama-3.2-1b-onnx"  # the Name defined in inference_model.json

# Create a client using the OpenAI-compatible interface
client = openai.OpenAI(
    base_url=manager.endpoint,
    api_key=manager.api_key
)

# Stream responses for better UX
stream = client.chat.completions.create(
    model=alias,
    messages=[{"role": "user", "content": "explain 1+1=2?"}],
    stream=True
)

# Process the streamed response
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
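
For simpler request/response cases, the same client also works without streaming. This short variant reuses the manager, client, and alias from the example above:

# Non-streaming variant: one request, one complete response object
response = client.chat.completions.create(
    model=alias,
    messages=[{"role": "user", "content": "Give me three practical uses of edge AI."}],
)
print(response.choices[0].message.content)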

Application Scenarios and Best Practices

Foundry Local’s flexible architecture supports various edge AI application scenarios:

Key Application Areas

  • Smart Home Devices: Provide offline voice assistant capabilities for smart speakers, home control centers, etc.
  • Industrial IoT: Deploy intelligent monitoring systems in factory floors without transmitting sensitive data to the cloud
  • Medical Devices: Provide AI-assisted diagnostic capabilities for medical devices while complying with regulations like HIPAA
  • Field Service Applications: Deliver offline AI support for service personnel in remote or poorly networked areas
  • Retail Smart Terminals: Offer personalized recommendations and customer service while protecting customer privacy

Deployment Best Practices

  • Cache Optimization: Configure appropriate cache sizes to improve response times for common queries
  • Concurrent Processing: Adjust batch processing parameters to optimize throughput in multi-user scenarios
  • Monitoring and Updates: Implement model performance monitoring and update models regularly to improve accuracy

Security and Compliance Considerations

When deploying AI at the edge, security becomes a critical concern. Foundry Local implements several security features:

  • Model Integrity: Cryptographic verification of models to prevent tampering
  • Access Control: API key authentication and role-based access for multi-user deployments
  • Data Protection: Local processing eliminates data transmission risks
  • Audit Logging: Comprehensive logging of model usage and performance metrics

For regulated industries, Foundry Local’s on-device processing simplifies compliance with data-protection regulations and industry-specific standards by keeping sensitive data within organizational boundaries.

Conclusion

Foundry Local represents an important step in bringing artificial intelligence technology from the cloud to the edge. By bringing AI capabilities directly to user devices, it not only addresses key challenges of latency, privacy, and connection reliability but also provides developers with a powerful foundation for building next-generation intelligent applications.

As edge AI technology continues to develop, we can expect to see more and more innovative applications emerging across various industries, bringing users smarter, more private, and more efficient experiences. Whether you’re just beginning your AI development journey or seeking to optimize existing solutions as a professional developer, Foundry Local offers a platform worth exploring.

Resources

Official resources for learning more about Foundry Local and related technologies:

  1. Microsoft Foundry Local Repository – Official codebase, documentation, and examples
  2. Microsoft Olive Repository – Tool for optimizing and converting models
  3. Custom Model Deployment Guide – Detailed deployment documentation and examples
  4. Azure AI Foundry Overview – Learn about the cloud AI platform
  5. Azure AI Model Catalog – Explore available pre-trained models
  6. Fine-Tune End-to-End Distillation Models with Azure AI Foundry Models


Author

kinfeylo
Senior Cloud Advocate

Kinfey Lo, a Microsoft Senior Cloud Advocate, concentrates on the development and operationalization of Small Language Models (SLMs) within Edge AI ecosystems. He is the author of the "Phi Cookbook," a resource for working with Phi series SLMs. His expertise lies in constructing GenAIOps strategies tailored for the unique demands of Edge AI.

leestott
Principal Cloud Advocate Manager

Lee Stott is a Principal Cloud Advocate Manager at Microsoft, where he leads initiatives that empower developers and organisations to harness the full potential of Microsoft’s cloud and AI technologies, drawing on over 20 years of experience in software development, artificial intelligence, and cloud computing.
