ONNX GenAI Connector for Python (Experimental)
With the latest update we added support for running models locally with onnxruntime-genai. The onnxruntime-genai package is powered by ONNX Runtime under the hood, so let's first clarify what ONNX, ONNX Runtime, and ONNX Runtime GenAI are.
ONNX is an open-source format for AI models, both for Deep Learning and traditional Machine Learning. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types.
ONNX Runtime executes the weights and operations stored in the ONNX format. The runtime is optimized to run inference on different hardware, such as NVIDIA CUDA GPUs, Qualcomm NPUs, or Apple CoreML. Each runtime build targets specific hardware, and choosing the right one for your machine lets the model run as fast as possible.
ONNX Runtime GenAI is an optimized version of the runtime dedicated to generative AI models. While ONNX Runtime already lets us run inference on a model, the GenAI variant implements the full generation loop for language models: it computes the probability distribution of the next token, appends the sampled token to the sequence, and uses caching tricks (such as the KV cache) to boost the overall performance of this repetitive, iterative process.
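To make this concrete, below is a rough sketch of the token-by-token loop the GenAI runtime performs, written against the onnxruntime-genai 0.4 Python API (the Semantic Kernel connector does all of this for you; the model path and prompt shown here are placeholders):

import onnxruntime_genai as og

# Load a model folder that contains the ONNX weights and genai_config.json
model = og.Model("path/to/onnx-model-folder")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Configure the generation loop and feed in the prompt tokens
params = og.GeneratorParams(model)
params.set_search_options(max_length=256, temperature=0.0)
params.input_ids = tokenizer.encode("<|user|>\nWhat is ONNX?<|end|>\n<|assistant|>\n")

# The iterative process: forward pass (reusing the KV cache), pick the next
# token from the probability distribution, append it, repeat until done
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    print(tokenizer_stream.decode(generator.get_next_tokens()[0]), end="", flush=True)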
How can we use it in Semantic Kernel?
With the ONNX Connector now available in the Python version of Semantic Kernel, you can use one of the fastest local inference engines on the market (Source: ONNX Runtime | Accelerating Phi-2, CodeLlama, Gemma and other Gen AI models with ONNX Runtime). This enables customers to run models offline at maximum speed, making Semantic Kernel a valuable orchestration engine for edge use cases.
Use Cases
- An offline program that anonymizes sentences before sending them to the cloud, to stay GDPR compliant.
- On-premises RAG applications, using local memory connectors that do not expose the data, ensuring privacy.
- High-availability use cases, deploying applications to edge devices, for example a Jetson Nano running inside a car or an elevator. These are scenarios where a stable internet connection is not guaranteed, and a local Small Language Model (SLM) ensures 100% availability.
- The connector implements a tokenizer and a multimodal processor, which also enables multimodality, for example running a Phi-3 Vision model with the connector.
Running a Phi3-Vision model locally on your PC
The full demo code can be found here (github.com).
Note that we will run the model on our CPU since it is easier to set up. Inference can also run on a GPU with the onnxruntime-genai-cuda package, but be aware that you must install the corresponding versions of CUDA and cuDNN for this. GPU inference is recommended for best performance.
Install Dependencies
pip install semantic-kernel
pip install onnxruntime-genai==0.4.0
(or install the corresponding GPU package instead, e.g. onnxruntime-genai-cuda)
Please note that on macOS the pip package is not available on PyPI and needs to be built from source.
Download the Model from Hugging Face
Make sure the Hugging Face CLI (huggingface-cli) is installed
pip install -U "huggingface_hub[cli]"
Download Phi3-Vision
huggingface-cli download microsoft/Phi-3-vision-128k-instruct-onnx-cpu --include cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
Download an Example Image
curl https://onnxruntime.ai/images/coffee.png -o coffee.png
Use the connector in Semantic Kernel
Step 1: Load the ONNX connector
import asyncio
from semantic_kernel.connectors.ai.onnx import OnnxGenAIChatCompletion, OnnxGenAIPromptExecutionSettings
from semantic_kernel.contents import AuthorRole, ChatMessageContent, ImageContent, ChatHistory
chat_completion = OnnxGenAIChatCompletion(
    ai_model_path="./cpu-int4-rtn-block-32-acc-level-4",
    template="phi3v",
)
# The max length property is important to allocate RAM
# If the value is too big, you may run out of memory
# If the value is too small, your input is limited
settings = OnnxGenAIPromptExecutionSettings(
    temperature=0.0,
    max_length=7680,
)
Step 2: Create the ChatHistory
system_message = """
You are a helpful assistant.
You know about provided images and the history of the conversation.
"""
chat_history = ChatHistory(system_message=system_message)
chat_history.add_message(
    ChatMessageContent(
        role=AuthorRole.USER,
        items=[
            ImageContent.from_image_path(image_path="coffee.png"),
        ],
    ),
)
chat_history.add_user_message("Describe the image.")
Step 3: Run the Model
answer = asyncio.run(
    chat_completion.get_chat_message_content(
        chat_history=chat_history,
        settings=settings,
    )
)
print(f"Answer: {answer}")
Step 3.5: Running multimodal models on CPU can take some time (1-2 minutes), so make sure you grab yourself a coffee. 🙂
Answer: The image shows a cup of coffee with a latte art design on top.
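If you would rather stream the answer token by token instead of waiting for the complete response, the connector also exposes a streaming counterpart of the method used above. A minimal sketch, assuming the same chat_completion, chat_history, and settings objects as above (method names may differ slightly between Semantic Kernel versions):

async def stream_answer():
    # Chunks arrive as they are generated instead of one final message
    async for chunk in chat_completion.get_streaming_chat_message_content(
        chat_history=chat_history,
        settings=settings,
    ):
        print(str(chunk), end="", flush=True)

asyncio.run(stream_answer())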
Running the Connector with Accelerated Hardware
ONNX Runtime can currently also run with NVIDIA CUDA, DirectML, or Qualcomm NPUs. If you want to use the connector while leveraging your dedicated hardware, please check out the corresponding documentation and install the matching pip package.
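For example, assuming you want to target NVIDIA CUDA or DirectML, install the matching GenAI package instead of the CPU one (package names taken from PyPI at the time of writing; the CUDA package additionally requires matching CUDA and cuDNN installations, as noted above):

# NVIDIA CUDA GPUs
pip install onnxruntime-genai-cuda==0.4.0
# DirectML on Windows
pip install onnxruntime-genai-directml==0.4.0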
Known Issues
There are known issues with image inference. If you experience any ONNX Runtime exceptions, please comment out the image and use text-only inference. We are working on fixing this and will graduate the connector once it is stable.