May 21st, 2025

Foundry Local: A New Era of Edge AI

At Microsoft Build 2025, Microsoft unveiled its groundbreaking Foundry Local solution for edge devices—an efficient platform specifically designed for local AI inference. As a critical component of Microsoft’s AI strategy, Foundry Local empowers developers to smoothly deploy and run Small Language Models (SLMs) on resource-constrained edge devices, opening new possibilities for the convergence of edge computing and artificial intelligence.

Core Architecture and Technical Advantages

Technical Foundation: The ONNX Ecosystem

Foundry Local is built on ONNX (Open Neural Network Exchange)—a mature, open standard for model interoperability. As a widely recognized model exchange format in machine learning and deep learning, ONNX brings significant advantages to Foundry Local:

  • Broad Compatibility: Supports models converted from various deep learning frameworks (PyTorch, TensorFlow, JAX, etc.)
  • Cross-Platform Optimization: Delivers highly optimized inference performance across different hardware architectures (CPU, GPU, NPU)
  • Rich Tooling Ecosystem: Leverages mature tools like Microsoft Olive for model optimization and quantization
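
To make the interoperability point concrete, here is a minimal, hypothetical example of exporting a tiny PyTorch model to the ONNX format that Foundry Local consumes. Production SLM conversion is handled by Microsoft Olive (covered later in this post); this sketch only illustrates the framework-to-ONNX step.

# Minimal sketch: export a toy PyTorch model to ONNX (hypothetical model, not an SLM)
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

    def forward(self, x):
        return self.net(x)

model = TinyClassifier().eval()
dummy_input = torch.randn(1, 16)

# Export to ONNX so the model can be served by ONNX Runtime-based stacks
torch.onnx.export(
    model,
    dummy_input,
    "tiny_classifier.onnx",
    input_names=["features"],
    output_names=["logits"],
    dynamic_axes={"features": {0: "batch"}, "logits": {0: "batch"}},
)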

Comprehensive Development Toolkit

Foundry Local offers a one-stop development experience:

  • Diverse Interface Options
    • Command Line Interface (CLI): Provides powerful model management, deployment, and testing capabilities
    • Multi-language SDKs: Currently supports Node.js and Python, offering native programming experiences
    • RESTful API: Standardized, OpenAI-compatible interface for seamless integration with a wide range of applications (a quick REST call sketch follows this list)
  • Developer-Optimized Experience
    • Clean, intuitive API design
    • Comprehensive documentation and code examples
    • Built-in model management and monitoring tools
  • Edge-First Performance Optimization
    • Memory utilization optimized for resource-constrained environments
    • Intelligent caching and inference acceleration techniques
    • Flexible deployment options supporting various device configurations
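
As a quick illustration of the REST option, the sketch below calls the OpenAI-compatible chat completions endpoint directly with the requests library. It uses FoundryLocalManager only to discover the local endpoint and API key (mirroring what the OpenAI client does with base_url in the SDK example later in this post); the model alias is a placeholder for whichever model you have loaded.

# Rough sketch: call the OpenAI-compatible REST endpoint directly
import requests
from foundry_local import FoundryLocalManager

manager = FoundryLocalManager()

response = requests.post(
    f"{manager.endpoint}/chat/completions",   # endpoint discovered from the local service
    headers={"Authorization": f"Bearer {manager.api_key}"},
    json={
        "model": "phi-3.5-mini",  # placeholder alias for a loaded model
        "messages": [{"role": "user", "content": "Summarize edge AI in one sentence."}],
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])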

Breakthrough Technical Advantages

Foundry Local brings four key advantages to edge AI applications:

  • Ultra-Low Latency Experience
    • Local inference eliminates network communication overhead
    • Millisecond-level response times, enabling native-like fluid interactions
    • Adaptive batch processing optimization to increase throughput
  • Complete Offline Capability
    • No continuous internet connection required, suitable for network-limited environments
    • Full functionality maintained in offline states
    • Ideal for remote areas, disconnected devices, or environments with strict network security requirements
  • Enterprise-Grade Data Privacy
    • Sensitive data processed entirely locally, no cloud uploads required
    • Compliant with strict regulatory requirements
    • Reduced data breach risks, enhanced customer trust
  • Maximized Resource Efficiency
    • Finely tuned model quantization significantly reduces memory requirements
    • Dynamic resource allocation adapts to device load variations
    • Battery-friendly design extends edge device usage time

Building Cloud-Edge Collaborative AI Solutions

By combining Azure AI Foundry’s cloud-based Model Catalog, powerful computing resources, and unified management platform, developers can build customized EdgeAI solutions that meet various business requirements. This “cloud training, edge inference” model enables enterprises to balance computational costs, performance, and privacy requirements.

Exploring and Selecting Ideal Models

Built-in Model Library

Let’s start by exploring the pre-configured models provided by Foundry Local. With a simple CLI command, we can view all available options:

foundry model list

After executing the command above, you’ll see output similar to the following:

(Screenshot: output of the foundry model list command showing the available models)

Currently, Foundry Local natively supports multiple high-quality small language models, including:

  • Microsoft Phi Series: Phi-3, Phi-3.5, Phi-4-mini, and Phi-4-mini-reasoning, which is tuned specifically for reasoning tasks
  • Alibaba Qwen Series: Lightweight models with excellent Chinese language capabilities
  • Mistral AI Series: Open-source models that perform exceptionally well with small parameter counts

Expanding Model Selection

For enterprise applications, pre-configured models may not satisfy specific requirements. In such cases, Azure AI Foundry’s Model Catalog offers a vast selection of more than 11,000 models.

(Screenshot: the Azure AI Foundry Model Catalog)

When selecting models suitable for edge deployment, consider the following factors:

Model Parameters | Recommended Devices | Use Cases
1B–3B | Edge devices, IoT devices | Simple conversations, classification tasks
3B–7B | Edge servers, high-end devices | Complex reasoning, multimodal tasks

For EdgeAI applications, Microsoft Phi, Mistral AI, and Llama series models with parameter counts between 1B and 7B are typically ideal choices, striking a good balance between performance and resource consumption.
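
As a rough rule of thumb, you can estimate the weight footprint of a model from its parameter count and quantization precision. The snippet below is a back-of-the-envelope calculation only; real memory use is higher because of the KV cache, activations, and runtime overhead.

# Back-of-the-envelope weight footprint: parameters * bits-per-weight / 8 bytes
def weight_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / (1024 ** 3)

for params in (1, 3, 7):
    print(
        f"{params}B params: "
        f"FP16 ~ {weight_footprint_gb(params, 16):.1f} GB, "
        f"INT8 ~ {weight_footprint_gb(params, 8):.1f} GB, "
        f"INT4 ~ {weight_footprint_gb(params, 4):.1f} GB"
    )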

After selecting a base model, developers can use it directly or fine-tune it for specific tasks. This article focuses on direct usage scenarios (for detailed fine-tuning guides, please refer to my separate specialized blog post).

Efficient Model Conversion and Quantization

Foundry Local requires models in the ONNX format. To convert models obtained from Azure AI Foundry into formats suitable for edge deployment, we need to perform model conversion and quantization.

Choosing a Conversion Environment

You can perform model conversion on your local workstation or leverage Azure ML cloud environments. For large models, Azure ML is strongly recommended as it provides:

  • Flexible Computing Resources: On-demand selection of CPU or GPU for conversion
  • Scalability: Process large models without local hardware limitations
  • Pre-configured Environments: Skip complex environment setup

Using Microsoft Olive for Model Optimization

Microsoft Olive is Microsoft’s model optimization tool, specifically designed to convert various models to high-performance ONNX format. It supports optimization of mainstream models like Phi, Llama, Mistral, and Qwen.

Environment Setup:

Create a new Notebook in Azure ML Studio, select the “Python 3.10-Azure ML” environment, and install the following key dependencies:

!pip install olive-ai==0.8.0 onnxruntime-genai==0.7.1 onnxruntime==1.21.1 transformers==4.51.3

Executing Model Conversion:

Use the following command to convert the model to INT4 quantized ONNX format, significantly reducing model size while preserving inference performance:

!olive auto-opt \
  --model_name_or_path {Your Model at Azure Model Location} \
  --provider CPUExecutionProvider \
  --use_model_builder \
  --precision int4 \
  --output_path {Your ONNX Model output path} \
  --log_level 1 \
  --trust_remote_code

Tip: For edge devices, INT4 quantization typically reduces model size to approximately 25% of the original while maintaining about 95% of the quality. For scenarios that need higher accuracy, consider INT8 quantization instead, at the cost of a larger model.

Cloud Model Management and Version Control

Saving converted models to the Azure ML model registry is a best practice, as it provides version control, access management, and deployment tracking.

Model Registration and Management

Azure ML model registry enables teams to:

  • Centralized Management: Store and manage all models in a single location
  • Version Control: Track model evolution and roll back to previous versions at any time
  • Metadata Tagging: Add key information to models, such as accuracy, size, and purpose
  • Access Control: Set fine-grained permissions for security

The image below shows the model management interface in Azure ML:

(Screenshot: the model management interface in Azure ML Studio)

Here is example code for registering an ONNX model to Azure ML: https://github.com/microsoft/Build25-LAB329/blob/main/Lab329/Notebook/04.AzureML_RegisterToAzureML.ipynb
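
For reference, a minimal sketch of that registration step with the azure-ai-ml SDK might look like the following. Subscription, workspace, and model names are placeholders; see the linked notebook for the full version.

# Minimal sketch: register the converted ONNX model in the Azure ML model registry
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Register the folder produced by `olive auto-opt` as a custom model asset
model = Model(
    path="<Your ONNX Model output path>",
    type=AssetTypes.CUSTOM_MODEL,
    name="llama-3.2-1b-onnx-int4",  # placeholder model name
    description="INT4-quantized ONNX model converted with Microsoft Olive",
    tags={"precision": "int4", "format": "onnx"},
)
registered = ml_client.models.create_or_update(model)
print(registered.name, registered.version)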

Deploying Models to Edge Devices

Foundry Local has designed a straightforward deployment process, enabling developers to quickly integrate custom models into edge applications. Here’s a detailed step-by-step guide:

  1. Retrieve Optimized Models from the Cloud

First, download the optimized ONNX model from the Azure ML model registry. The following notebook shows the complete workflow: https://github.com/microsoft/Build25-LAB329/blob/main/Lab329/Notebook/05.Local_Download.ipynb
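
A minimal sketch of that download step with the azure-ai-ml SDK might look like this (names and versions are placeholders; the linked notebook shows the complete workflow):

# Minimal sketch: pull a registered model back down from the Azure ML model registry
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Downloads the model files into ./downloaded-model/<model-name>/
ml_client.models.download(
    name="llama-3.2-1b-onnx-int4",  # placeholder model name
    version="1",
    download_path="./downloaded-model",
)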

  2. Place Model Files Correctly

Place the downloaded model files in the Foundry Local model directory:

# Create directory for model
mkdir -p ./models/llama/

# Move model files to appropriate location
# Note: May need to adjust based on specific model structure
mv ./downloaded-model/* ./models/llama/

  3. Create an inference_model.json configuration file in the model directory, defining model metadata and prompt templates:

{
  "Name": "llama-3.2-1b-onnx",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are my EdgeAI assistant, help me to answer question<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{Content}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
  }
}

Configuration Notes:

  • Name: Unique identifier for the model, used to reference it in Foundry Local
  • PromptTemplate: Defines how inputs and outputs are formatted, supports custom system prompts and special tokens
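
If you prefer to generate this file programmatically (which avoids escaping mistakes around the special tokens), a small sketch like the following produces the same configuration:

# Small sketch: write inference_model.json from Python
import json

config = {
    "Name": "llama-3.2-1b-onnx",
    "PromptTemplate": {
        "assistant": "{Content}",
        "prompt": (
            "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
            "You are my EdgeAI assistant, help me to answer question<|eot_id|>"
            "<|start_header_id|>user<|end_header_id|>\n\n{Content}<|eot_id|>"
            "<|start_header_id|>assistant<|end_header_id|>\n\n"
        ),
    },
}

# json.dump escapes the newlines, matching the hand-written file above
with open("./models/llama/inference_model.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
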
  4. Verify Model Deployment

Use the Foundry Local CLI to test if the model is correctly deployed and can run:

foundry cache cd models

foundry model run llama-3.2-1b-onnx --verbose

After execution, you’ll see output similar to the image below, indicating the model has been successfully loaded and can respond to queries:

(Screenshot: foundry model run output showing the model loaded and responding to a test query)

Integrating Foundry Local into Applications

After successfully deploying the model, you can use Foundry Local in your applications through multiple approaches, including SDKs and REST APIs. Let’s explore the integration options:

Using SDK for Integration

Foundry Local provides native SDKs for Python and Node.js, allowing developers to easily integrate edge AI capabilities into existing applications. Its APIs are intentionally designed to be compatible with the OpenAI API, making migration from cloud to edge straightforward.

Example with Streaming:

import openai
from foundry_local import FoundryLocalManager

# Initialize the Foundry Local manager and get connection details
manager = FoundryLocalManager()
alias = "llama-3.2-1b-onnx"  # the Name defined in inference_model.json

# Create a client using the OpenAI-compatible interface
client = openai.OpenAI(
    base_url=manager.endpoint,
    api_key=manager.api_key
)

# Stream responses for better UX
stream = client.chat.completions.create(
    model=alias,
    messages=[{"role": "user", "content": "explain 1+1=2?"}],
    stream=True
)

# Process the streamed response
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
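
For simpler request/response cases, the same client also works without streaming. This short variant reuses the manager, client, and alias from the example above:

# Non-streaming variant: one request, one complete response object
response = client.chat.completions.create(
    model=alias,
    messages=[{"role": "user", "content": "Give me three practical uses of edge AI."}],
)
print(response.choices[0].message.content)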

Application Scenarios and Best Practices

Foundry Local’s flexible architecture supports various edge AI application scenarios:

Key Application Areas

  • Smart Home Devices: Provide offline voice assistant capabilities for smart speakers, home control centers, etc.
  • Industrial IoT: Deploy intelligent monitoring systems in factory floors without transmitting sensitive data to the cloud
  • Medical Devices: Provide AI-assisted diagnostic capabilities for medical devices while complying with regulations like HIPAA
  • Field Service Applications: Deliver offline AI support for service personnel in remote or poorly networked areas
  • Retail Smart Terminals: Offer personalized recommendations and customer service while protecting customer privacy

Deployment Best Practices

  • Cache Optimization: Configure appropriate cache sizes to improve response times for common queries
  • Concurrent Processing: Adjust batch processing parameters to optimize throughput in multi-user scenarios
  • Monitoring and Updates: Implement model performance monitoring and update models regularly to improve accuracy

Security and Compliance Considerations

When deploying AI at the edge, security becomes a critical concern. Foundry Local implements several security features:

  • Model Integrity: Cryptographic verification of models to prevent tampering
  • Access Control: API key authentication and role-based access for multi-user deployments
  • Data Protection: Local processing eliminates data transmission risks
  • Audit Logging: Comprehensive logging of model usage and performance metrics

For regulated industries, Foundry Local’s on-device processing simplifies compliance with data-protection regulations and industry-specific standards by keeping sensitive data within organizational boundaries.

Conclusion

Foundry Local represents an important step in bringing artificial intelligence technology from the cloud to the edge. By bringing AI capabilities directly to user devices, it not only addresses key challenges of latency, privacy, and connection reliability but also provides developers with a powerful foundation for building next-generation intelligent applications.

As edge AI technology continues to develop, we can expect to see more and more innovative applications emerging across various industries, bringing users smarter, more private, and more efficient experiences. Whether you’re just beginning your AI development journey or seeking to optimize existing solutions as a professional developer, Foundry Local offers a platform worth exploring.

Resources

Official resources for learning more about Foundry Local and related technologies:

  1. Microsoft Foundry Local Repository – Official codebase, documentation, and examples
  2. Microsoft Olive Repository – Tool for optimizing and converting models
  3. Custom Model Deployment Guide – Detailed deployment documentation and examples
  4. Azure AI Foundry Overview – Learn about the cloud AI platform
  5. Azure AI Model Catalog – Explore available pre-trained models
  6. Fine-Tune End-to-End Distillation Models with Azure AI Foundry Models


Author

kinfeylo
Senior Cloud Advocate

Kinfey Lo, a Microsoft Senior Cloud Advocate, concentrates on the development and operationalization of Small Language Models (SLMs) within Edge AI ecosystems. He is the author of the "Phi Cookbook," a resource for working with Phi series SLMs. His expertise lies in constructing GenAIOps strategies tailored for the unique demands of Edge AI.

leestott
Principal Cloud Advocate Manager

Lee Stott is a Principal Cloud Advocate Manager at Microsoft, where he leads initiatives that empower developers and organisations to harness the full potential of Microsoft’s cloud and AI technologies, drawing on over 20 years of experience in software development, artificial intelligence, and cloud computing.
