{"id":3705,"date":"2024-12-10T08:32:03","date_gmt":"2024-12-10T16:32:03","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/semantic-kernel\/?p=3705"},"modified":"2024-12-10T08:32:03","modified_gmt":"2024-12-10T16:32:03","slug":"onnx-genai-connector-for-python-experimental","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/agent-framework\/onnx-genai-connector-for-python-experimental\/","title":{"rendered":"ONNX GenAI Connector for Python (Experimental)\u00a0"},"content":{"rendered":"<h3 aria-level=\"1\"><span style=\"font-size: 18pt;\">ONNX GenAI Connector for Python (Experimental)\u00a0<\/span><\/h3>\n<p><span style=\"font-size: 12pt;\">With the latest update, we added support for running models locally with <a href=\"https:\/\/github.com\/microsoft\/onnxruntime-genai\">onnxruntime-genai<\/a>. The onnxruntime-genai package is powered by ONNX Runtime under the hood, but first let\u2019s clarify what ONNX, ONNX Runtime, and ONNX Runtime GenAI are.\u00a0<\/span><\/p>\n<p aria-level=\"2\"><a href=\"https:\/\/onnx.ai\/\"><span style=\"font-size: 14pt;\"><strong>ONNX<\/strong><\/span><\/a><\/p>\n<p><span style=\"font-size: 12pt;\">ONNX is an open-source format for AI models, covering both deep learning and traditional machine learning. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types.\u00a0<\/span><\/p>\n<p aria-level=\"2\"><a href=\"https:\/\/onnxruntime.ai\/\"><span style=\"font-size: 14pt;\"><strong>ONNX Runtime<\/strong><\/span><\/a><\/p>\n<p><span style=\"font-size: 12pt;\">ONNX Runtime executes the weighted operations saved in the ONNX format. The runtime is optimized to run inference on different hardware, such as NVIDIA CUDA GPUs, Qualcomm NPUs, or Apple CoreML. 
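<\/span><\/p>\n<p><span style=\"font-size: 12pt;\">To build intuition for the computation graph that ONNX describes, here is a toy, pure-Python sketch (not the real ONNX API, which stores the graph as a protobuf) that evaluates a tiny graph of built-in operators in order:<\/span><\/p>\n<pre class=\"prettyprint language-py\"><code class=\"language-py\"># Toy illustration only: a real ONNX graph is a protobuf consumed by ONNX Runtime.\r\nOPS = {\"Add\": lambda a, b: a + b, \"Mul\": lambda a, b: a * b, \"Relu\": lambda a: max(a, 0.0)}\r\n\r\ndef run_graph(nodes, inputs):\r\n    # Each node is (op_name, input_names, output_name); values flow by name.\r\n    values = dict(inputs)\r\n    for op, in_names, out_name in nodes:\r\n        values[out_name] = OPS[op](*(values[n] for n in in_names))\r\n    return values\r\n\r\n# y = Relu(x * w + b)\r\nnodes = [(\"Mul\", [\"x\", \"w\"], \"xw\"), (\"Add\", [\"xw\", \"b\"], \"z\"), (\"Relu\", [\"z\"], \"y\")]\r\nprint(run_graph(nodes, {\"x\": 2.0, \"w\": 3.0, \"b\": -1.0})[\"y\"])  # 5.0<\/code><\/pre>\n<p><span style=\"font-size: 12pt;\">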
Each runtime build targets specific hardware, and choosing the right build for your machine lets the model run as fast as possible.\u00a0<\/span><\/p>\n<p aria-level=\"2\"><a href=\"https:\/\/onnxruntime.ai\/generative-ai\"><span style=\"font-size: 14pt;\"><strong>ONNX Runtime GenAI<\/strong><\/span><\/a><\/p>\n<p><span style=\"font-size: 12pt;\">ONNX Runtime GenAI is an optimized version of the runtime dedicated to generative AI models. While ONNX Runtime already allows us to run inference on a model, the GenAI version runs language models in a highly optimized way: computing the probability distribution of the next token, appending generated tokens to the sequence, and using caching tricks to boost overall performance in this repetitive, iterative process.\u00a0\u00a0<\/span><\/p>\n<p aria-level=\"2\"><span style=\"font-size: 14pt;\"><strong>How can we use it in Semantic Kernel?\u00a0<\/strong><\/span><\/p>\n<p><span style=\"font-size: 12pt;\">With the ONNX connector now available in the Python version of Semantic Kernel, you can use one of the fastest local inference engines on the market (Source: <a href=\"https:\/\/onnxruntime.ai\/blogs\/accelerating-phi-2\">ONNX Runtime | Accelerating Phi-2, CodeLlama, Gemma and other Gen AI models with ONNX Runtime<\/a>). This enables customers to run models offline with maximum speed, making Semantic Kernel a valuable orchestration engine for edge use cases.<\/span><\/p>\n<p><strong><span style=\"font-size: 14pt;\">Use Cases<\/span><\/strong><\/p>\n<ol>\n<li><span style=\"font-size: 12pt;\">An offline program that anonymizes sentences before sending them to the cloud, helping you stay GDPR compliant. <\/span><\/li>\n<li>On-premises RAG applications that use local memory connectors, ensuring privacy by never exposing the data.<\/li>\n<li>High-availability use cases that deploy applications to edge devices, for example a Jetson Nano running inside a car or an elevator. 
Those are great examples where a stable internet connection is not guaranteed; a local Small Language Model (SLM) keeps the application fully available.<\/li>\n<li>The connector implements a tokenizer and a multimodal processor, which also enables multimodality, for example running Phi3-Vision.<\/li>\n<\/ol>\n<p aria-level=\"2\"><strong><span style=\"font-size: 18pt;\">Running a Phi3-Vision model locally on your PC\u00a0<\/span><\/strong><\/p>\n<p><span style=\"font-size: 12pt;\">The full demo code can be found <a href=\"https:\/\/github.com\/microsoft\/semantic-kernel\/blob\/main\/python\/samples\/concepts\/local_models\/onnx_phi3_vision_completion.py\">here (github.com)<\/a>.\u00a0\u00a0<\/span><\/p>\n<p><span style=\"font-size: 12pt;\">Note that we will run the model on the CPU, since it is easier to set up. Inference can also run on a GPU via the onnxruntime-genai-cuda package; be aware that you must install the corresponding versions of CUDA &amp; cuDNN for this. GPU inference is recommended for best performance.\u00a0<\/span><\/p>\n<p><span style=\"font-size: 14pt;\"><a href=\"https:\/\/onnxruntime.ai\/docs\/genai\/howto\/install\">Install | onnxruntime<\/a>\u00a0<\/span><\/p>\n<p aria-level=\"3\"><strong><span style=\"font-size: 14pt;\">Install Dependencies\u00a0<\/span><\/strong><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">pip install semantic-kernel\r\npip install onnxruntime-genai==0.4.0  # or the GPU package: onnxruntime-genai-cuda<\/code><\/pre>\n<p><span style=\"font-size: 12pt;\">Please note that on a Mac the pip package needs to be built from source and is not available on PyPI.\u00a0<\/span><\/p>\n<p aria-level=\"3\"><strong><span style=\"font-size: 14pt;\">Download the Model from Hugging Face\u00a0<\/span><\/strong><\/p>\n<p><span style=\"font-size: 14pt;\">Make sure the Hugging Face CLI is installed:\u00a0\u00a0<\/span><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">pip 
install -U \"huggingface_hub[cli]\"<\/code><\/pre>\n<p><span style=\"font-size: 14pt;\">Download Phi3-Vision:\u00a0<\/span><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">huggingface-cli download microsoft\/Phi-3-vision-128k-instruct-onnx-cpu --include cpu-int4-rtn-block-32-acc-level-4\/* --local-dir .<\/code><\/pre>\n<p aria-level=\"3\"><strong><span style=\"font-size: 14pt;\">Download an Example Image\u00a0<\/span><\/strong><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">curl https:\/\/onnxruntime.ai\/images\/coffee.png -o coffee.png<\/code><\/pre>\n<p aria-level=\"3\"><strong><span style=\"font-size: 14pt;\">Use the connector in Semantic Kernel<\/span><\/strong><\/p>\n<p aria-level=\"4\"><span style=\"font-size: 12pt;\">Step 1: Load the ONNX connector\u00a0<\/span><\/p>\n<pre class=\"prettyprint language-py\"><code class=\"language-py\">import asyncio\r\n\r\nfrom semantic_kernel.connectors.ai.onnx import OnnxGenAIChatCompletion, OnnxGenAIPromptExecutionSettings\r\nfrom semantic_kernel.contents import AuthorRole, ChatMessageContent, ImageContent, ChatHistory\r\n\r\nchat_completion = OnnxGenAIChatCompletion(\r\n    ai_model_path=\".\/cpu-int4-rtn-block-32-acc-level-4\",\r\n    template=\"phi3v\",\r\n)\r\n\r\n# The max_length setting determines how much memory is allocated.\r\n# If the value is too high, you may run out of memory.\r\n# If the value is too low, your input length is limited.\r\nsettings = OnnxGenAIPromptExecutionSettings(\r\n    temperature=0.0,\r\n    max_length=7680,\r\n)<\/code><\/pre>\n<p aria-level=\"4\"><span style=\"font-size: 12pt;\">Step 2: Create the ChatHistory<\/span><\/p>\n<pre class=\"prettyprint language-py\"><code class=\"language-py\">system_message = \"\"\"\r\nYou are a helpful assistant.\r\nYou know about provided images and the history of the conversation.\r\n\"\"\"\r\n\r\nchat_history = ChatHistory(system_message=system_message)\r\nchat_history.add_message(\r\n    
ChatMessageContent(\r\n        role=AuthorRole.USER,\r\n        items=[\r\n            ImageContent.from_image_path(image_path=\"coffee.png\"),\r\n        ],\r\n    ),\r\n)\r\nchat_history.add_user_message(\"Describe the image.\")<\/code><\/pre>\n<p aria-level=\"4\"><span style=\"font-size: 12pt;\">Step 3: Run the Model<\/span><\/p>\n<pre class=\"prettyprint language-py\"><code class=\"language-py\">answer = asyncio.run(\r\n    chat_completion.get_chat_message_content(\r\n        chat_history=chat_history,\r\n        settings=settings,\r\n    )\r\n)\r\n\r\nprint(f\"Answer: {answer}\")<\/code><\/pre>\n<p aria-level=\"4\"><span style=\"font-size: 12pt;\">Step 3.5: Running multimodal models on a CPU can take some time (1-2 minutes), so make sure you grab yourself a coffee. \ud83d\ude42\u00a0<\/span><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">Answer: The image shows a cup of coffee with a latte art design on top.<\/code><\/pre>\n<p aria-level=\"2\"><span style=\"font-size: 14pt;\"><strong>Running the Connector with Accelerated Hardware\u00a0<\/strong><\/span><\/p>\n<p><span style=\"font-size: 12pt;\">ONNX Runtime can currently also run on NVIDIA CUDA, DirectML, or Qualcomm NPUs. 
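<\/span><\/p>\n<p><span style=\"font-size: 12pt;\">When you hand ONNX Runtime a list of execution providers, it falls back through the list and uses the first one that is actually available on the machine. A toy, pure-Python sketch of that fallback logic (not the real API, which performs this selection internally):<\/span><\/p>\n<pre class=\"prettyprint language-py\"><code class=\"language-py\"># Toy illustration only: onnxruntime chooses among installed execution providers itself.\r\ndef pick_provider(preferred, available):\r\n    # Return the first preferred execution provider that is installed.\r\n    for provider in preferred:\r\n        if provider in available:\r\n            return provider\r\n    raise RuntimeError(\"No usable execution provider found\")\r\n\r\npreferred = [\"CUDAExecutionProvider\", \"DmlExecutionProvider\", \"CPUExecutionProvider\"]\r\nprint(pick_provider(preferred, [\"CPUExecutionProvider\"]))  # CPUExecutionProvider<\/code><\/pre>\n<p><span style=\"font-size: 12pt;\">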
If you want to use the connector while leveraging your dedicated hardware, please check out the following sites and use the corresponding pip package.\u00a0<\/span><\/p>\n<ul>\n<li><span style=\"font-size: 12pt;\"><a href=\"https:\/\/onnxruntime.ai\/docs\/execution-providers\/\">Execution Providers | onnxruntime<\/a>\u00a0<\/span><\/li>\n<li><span style=\"font-size: 12pt;\"><a href=\"https:\/\/onnxruntime.ai\/docs\/genai\/howto\/build-from-source.html\">Build from source | onnxruntime<\/a>\u00a0<\/span><\/li>\n<\/ul>\n<h3 aria-level=\"2\"><span style=\"font-size: 18pt;\">Known Issues\u00a0<\/span><\/h3>\n<p><span style=\"font-size: 12pt;\">There are known issues with image inference. If you experience any ONNX Runtime exceptions, please comment out the image and use text-only inference. We are working on a fix and will graduate the connector once it is stable.\u00a0<\/span><\/p>\n<ul>\n<li><span style=\"font-size: 12pt;\"><a href=\"https:\/\/github.com\/microsoft\/onnxruntime-genai\/issues\/823\">Some answers in phi3-vision just return &lt;\/s&gt; \u00b7 Issue #823 \u00b7 microsoft\/onnxruntime-genai<\/a>\u00a0<\/span><\/li>\n<li><span style=\"font-size: 12pt;\"><a href=\"https:\/\/github.com\/microsoft\/onnxruntime-genai\/issues\/954\">phi3.5 genai converted model output garbage results with input length around 3000 and 8000. \u00b7 Issue #954 \u00b7 microsoft\/onnxruntime-genai<\/a>\u00a0<\/span><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>ONNX GenAI Connector for Python (Experimental)\u00a0 With the latest update we added support for running models locally with the onnxruntime-genai. 
The onnxruntime-genai package is powered by the ONNX Runtime in the background, but first let\u2019s clarify what ONNX, ONNX Runtime and ONNX Runtime-GenAI are.\u00a0 ONNX ONNX is an open-source format for AI models, both for [&hellip;]<\/p>\n","protected":false},"author":149071,"featured_media":2302,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[48,63,53,9],"class_list":["post-3705","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-semantic-kernel","tag-ai","tag-microsoft-semantic-kernel","tag-python","tag-semantic-kernel"],"acf":[],"blog_post_summary":"<p>ONNX GenAI Connector for Python (Experimental)\u00a0 With the latest update we added support for running models locally with the onnxruntime-genai. The onnxruntime-genai package is powered by the ONNX Runtime in the background, but first let\u2019s clarify what ONNX, ONNX Runtime and ONNX Runtime-GenAI are.\u00a0 ONNX ONNX is an open-source format for AI models, both for 
[&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/posts\/3705","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/users\/149071"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/comments?post=3705"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/posts\/3705\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/media\/2302"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/media?parent=3705"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/categories?post=3705"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/tags?post=3705"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}