{"id":4601,"date":"2025-04-03T09:31:56","date_gmt":"2025-04-03T16:31:56","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/semantic-kernel\/?p=4601"},"modified":"2025-04-03T09:32:54","modified_gmt":"2025-04-03T16:32:54","slug":"using-openais-audio-preview-model-with-semantic-kernel","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/agent-framework\/using-openais-audio-preview-model-with-semantic-kernel\/","title":{"rendered":"Using OpenAI&#8217;s Audio-Preview Model with Semantic Kernel"},"content":{"rendered":"<p><center><a href=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2025\/04\/ghibli-audio-preview.jpg\"><img decoding=\"async\" class=\"aligncenter wp-image-4602\" src=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2025\/04\/ghibli-audio-preview.jpg\" alt=\"OpenAI Audio-preview support\" width=\"512\" height=\"512\" srcset=\"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2025\/04\/ghibli-audio-preview.jpg 1024w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2025\/04\/ghibli-audio-preview-300x300.jpg 300w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2025\/04\/ghibli-audio-preview-150x150.jpg 150w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2025\/04\/ghibli-audio-preview-768x768.jpg 768w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2025\/04\/ghibli-audio-preview-24x24.jpg 24w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2025\/04\/ghibli-audio-preview-48x48.jpg 48w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2025\/04\/ghibli-audio-preview-96x96.jpg 96w\" sizes=\"(max-width: 512px) 100vw, 512px\" \/><\/a><\/center>OpenAI&#8217;s <strong>gpt-4o-audio-preview<\/strong>\u00a0is a powerful 
multimodal model that enables audio input and output capabilities, allowing developers to create more natural and accessible AI interactions. This model supports both speech-to-text and text-to-speech functionalities in a single API call through the Chat Completions API, making it suitable for building voice-enabled applications where turn-based interactions are appropriate.<\/p>\n<p class=\"code-line\" dir=\"auto\" style=\"text-align: left;\" data-line=\"4\">In this post, we&#8217;ll explore how to use the audio-preview model with Semantic Kernel in both C# and Python to create voice-enabled AI applications.<\/p>\n<h3 id=\"best-use-cases\" class=\"code-line\" dir=\"auto\" data-line=\"6\">Best Use Cases<\/h3>\n<p class=\"code-line code-active-line\" dir=\"auto\" data-line=\"8\">The model works best for turn-based interactions in which each complete audio message is processed as a discrete unit. Typical applications include voice-based Q&amp;A systems, audio transcription with AI responses, and asynchronous voice messaging, where real-time interaction isn&#8217;t critical.<\/p>\n<h2 id=\"key-features-of-openais-audio-preview-model-with-chat-completions-api\" class=\"code-line\" dir=\"auto\" data-line=\"16\">Key Features of OpenAI&#8217;s Audio-Preview Model with Chat Completions API<\/h2>\n<ul class=\"code-line\" dir=\"auto\" data-line=\"12\">\n<li class=\"code-line\" dir=\"auto\" data-line=\"12\">\n<p class=\"code-line\" dir=\"auto\" data-line=\"12\"><strong>Multimodal Input\/Output<\/strong>: Process both text and audio inputs, and generate both text and audio outputs in a single API call.<\/p>\n<\/li>\n<li class=\"code-line\" dir=\"auto\" data-line=\"14\">\n<p class=\"code-line\" dir=\"auto\" data-line=\"14\"><strong>Turn-Based Voice Interactions<\/strong>: Suitable for non-real-time, turn-based conversational applications where each interaction is a complete request-response cycle.<\/p>\n<\/li>\n<li class=\"code-line\" dir=\"auto\" data-line=\"16\">\n<p class=\"code-line\" 
dir=\"auto\" data-line=\"16\"><strong>Voice Synthesis Options<\/strong>: Generate speech with support for multiple voices (like Alloy, Echo, Fable, Onyx, Nova, and Shimmer).<\/p>\n<\/li>\n<li class=\"code-line\" dir=\"auto\" data-line=\"18\">\n<p class=\"code-line\" dir=\"auto\" data-line=\"18\"><strong>Audio Understanding<\/strong>: Transcribe and comprehend spoken language from audio files.<\/p>\n<\/li>\n<li class=\"code-line\" dir=\"auto\" data-line=\"20\">\n<p class=\"code-line\" dir=\"auto\" data-line=\"20\"><strong>Multilingual Support<\/strong>: Process and generate audio in multiple languages, making it accessible to global users.<\/p>\n<\/li>\n<li class=\"code-line\" dir=\"auto\" data-line=\"22\">\n<p class=\"code-line\" dir=\"auto\" data-line=\"22\"><strong>Integration with Function Calling<\/strong>: Combine audio capabilities with function calling to create voice-controlled applications that can perform actions.<\/p>\n<\/li>\n<li class=\"code-line\" dir=\"auto\" data-line=\"24\">\n<p class=\"code-line\" dir=\"auto\" data-line=\"24\"><strong>Simplified Development<\/strong>: Single API for both audio input processing and audio output generation, reducing the complexity of building voice-enabled applications.<\/p>\n<\/li>\n<li class=\"code-line\" dir=\"auto\" data-line=\"26\">\n<p class=\"code-line\" dir=\"auto\" data-line=\"26\"><strong>Batch Processing<\/strong>: Well-suited for applications where complete audio messages are processed as discrete units rather than continuous streams.<\/p>\n<\/li>\n<\/ul>\n<p class=\"code-line\" dir=\"auto\" data-line=\"28\"><strong>Note<\/strong>: For truly low-latency, real-time voice interactions, OpenAI&#8217;s Realtime API is the more appropriate choice. 
The Chat Completions API with audio capabilities is better suited for non-real-time applications where some latency is acceptable.<\/p>\n<h2 id=\"using-audio-preview-in-semantic-kernel\" class=\"code-line\" dir=\"auto\" data-line=\"36\">Using Audio-Preview in Semantic Kernel<\/h2>\n<p class=\"code-line\" dir=\"auto\" data-line=\"32\">Semantic Kernel provides a straightforward way to integrate with OpenAI&#8217;s audio-preview model. Let&#8217;s see how to implement basic audio input and output functionality in both C# and Python.<\/p>\n<h3 id=\"in-net-c\" class=\"code-line\" dir=\"auto\" data-line=\"40\">In .NET (C#)<\/h3>\n<p class=\"code-line\" dir=\"auto\" data-line=\"36\">For a C# project using Semantic Kernel, you can add the audio-preview model as an OpenAI chat completion service. Make sure you have your OpenAI API key (or Azure OpenAI endpoint and key if using Azure):<\/p>\n<pre><code class=\"code-line language-csharp\" dir=\"auto\" data-line=\"38\"><span class=\"hljs-keyword\">using<\/span> Microsoft.SemanticKernel;\r\n<span class=\"hljs-keyword\">using<\/span> Microsoft.SemanticKernel.ChatCompletion;\r\n<span class=\"hljs-keyword\">using<\/span> Microsoft.SemanticKernel.Connectors.OpenAI;\r\n<span class=\"hljs-keyword\">using<\/span> OpenAI.Chat; <span class=\"hljs-comment\">\/\/ ChatAudioOptions, ChatOutputAudioVoice, ChatOutputAudioFormat, ChatResponseModalities<\/span>\r\n\r\n<span class=\"hljs-comment\">\/\/ Initialize the OpenAI chat completion service with the audio-preview model<\/span>\r\n<span class=\"hljs-keyword\">var<\/span> kernel = Kernel.CreateBuilder()\r\n    .AddOpenAIChatCompletion(\r\n        modelId: <span class=\"hljs-string\">\"gpt-4o-audio-preview\"<\/span>,\r\n        apiKey: <span class=\"hljs-string\">\"YOUR_OPENAI_API_KEY\"<\/span>\r\n    )\r\n    .Build();\r\n\r\n<span class=\"hljs-keyword\">var<\/span> chatCompletionService = kernel.GetRequiredService&lt;IChatCompletionService&gt;();\r\n\r\n<span class=\"hljs-comment\">\/\/ Configure settings for audio output<\/span>\r\n<span class=\"hljs-keyword\">var<\/span> settings = <span class=\"hljs-keyword\">new<\/span> OpenAIPromptExecutionSettings\r\n{\r\n    Audio = 
<span class=\"hljs-keyword\">new<\/span> ChatAudioOptions(\r\n        ChatOutputAudioVoice.Shimmer, <span class=\"hljs-comment\">\/\/ Choose from available voices<\/span>\r\n        ChatOutputAudioFormat.Mp3     <span class=\"hljs-comment\">\/\/ Choose output format<\/span>\r\n    ),\r\n    Modalities = ChatResponseModalities.Text | ChatResponseModalities.Audio <span class=\"hljs-comment\">\/\/ Request both text and audio<\/span>\r\n};\r\n\r\n<span class=\"hljs-comment\">\/\/ Create a chat history and add an audio message<\/span>\r\n<span class=\"hljs-keyword\">var<\/span> chatHistory = <span class=\"hljs-keyword\">new<\/span> ChatHistory(<span class=\"hljs-string\">\"You are a helpful assistant.\"<\/span>);\r\n\r\n<span class=\"hljs-comment\">\/\/ Add audio input (from a file or recorded audio)<\/span>\r\n<span class=\"hljs-built_in\">byte<\/span>[] audioBytes = File.ReadAllBytes(<span class=\"hljs-string\">\"user_question.wav\"<\/span>);\r\nchatHistory.AddUserMessage([<span class=\"hljs-keyword\">new<\/span> AudioContent(audioBytes, <span class=\"hljs-string\">\"audio\/wav\"<\/span>)]);\r\n\r\n<span class=\"hljs-comment\">\/\/ Get the model's response with both text and audio<\/span>\r\n<span class=\"hljs-keyword\">var<\/span> result = <span class=\"hljs-keyword\">await<\/span> chatCompletionService.GetChatMessageContentAsync(chatHistory, settings);\r\n\r\n<span class=\"hljs-comment\">\/\/ Access the text response<\/span>\r\nConsole.WriteLine(<span class=\"hljs-string\">$\"Assistant &gt; <span class=\"hljs-subst\">{result}<\/span>\"<\/span>);\r\n\r\n<span class=\"hljs-comment\">\/\/ Access the audio response (if available)<\/span>\r\n<span class=\"hljs-keyword\">if<\/span> (result.Items.OfType&lt;AudioContent&gt;().Any())\r\n{\r\n    <span class=\"hljs-keyword\">var<\/span> audioContent = result.Items.OfType&lt;AudioContent&gt;().First();\r\n    <span class=\"hljs-comment\">\/\/ Save or play the audio response<\/span>\r\n    File.WriteAllBytes(<span 
class=\"hljs-string\">\"assistant_response.mp3\"<\/span>, audioContent.Data!.Value.ToArray());\r\n}\r\n<\/code><\/pre>\n<p class=\"code-line\" dir=\"auto\" data-line=\"85\">We have also created a C# sample in the Semantic Kernel repository that uses the audio-preview model:<\/p>\n<ul class=\"code-line\" dir=\"auto\" data-line=\"86\">\n<li class=\"code-line\" dir=\"auto\" data-line=\"86\"><a href=\"https:\/\/github.com\/microsoft\/semantic-kernel\/blob\/main\/dotnet\/samples\/Concepts\/ChatCompletion\/OpenAI_ChatCompletionWithAudio.cs\" data-href=\"https:\/\/github.com\/microsoft\/semantic-kernel\/blob\/main\/dotnet\/samples\/Concepts\/ChatCompletion\/OpenAI_ChatCompletionWithAudio.cs\">Chat Completion with Audio<\/a><\/li>\n<\/ul>\n<h2 id=\"conclusion\" class=\"code-line\" dir=\"auto\" data-line=\"91\">Conclusion<\/h2>\n<p class=\"code-line\" dir=\"auto\" data-line=\"91\">OpenAI&#8217;s audio-preview model represents a significant advancement in creating more natural and accessible AI interactions. 
With Semantic Kernel&#8217;s straightforward integration, developers can build voice-enabled applications that provide an enhanced user experience.<\/p>\n<h2 id=\"references\" class=\"code-line\" dir=\"auto\" data-line=\"95\">References<\/h2>\n<ul class=\"code-line\" dir=\"auto\" data-line=\"95\">\n<li class=\"code-line\" dir=\"auto\" data-line=\"95\"><a href=\"https:\/\/platform.openai.com\/docs\/guides\/audio\" data-href=\"https:\/\/platform.openai.com\/docs\/guides\/audio\">OpenAI Platform Documentation &#8211; Audio and Speech<\/a><\/li>\n<li class=\"code-line\" dir=\"auto\" data-line=\"96\"><a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/openai\/concepts\/models\" data-href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/openai\/concepts\/models\">Azure OpenAI Service Models<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>OpenAI&#8217;s gpt-4o-audio-preview\u00a0is a powerful multimodal model that enables audio input and output capabilities, allowing developers to create more natural and accessible AI interactions. This model supports both speech-to-text and text-to-speech functionalities in a single API call through the Chat Completions API, making it suitable for building voice-enabled applications where turn-based interactions are appropriate. 
In this [&hellip;]<\/p>\n","protected":false},"author":63983,"featured_media":4602,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[78,27,2,1],"tags":[79,48,49,69,31,63,9],"class_list":["post-4601","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-net","category-agents","category-samples","category-semantic-kernel","tag-net","tag-ai","tag-ai-agents","tag-api","tag-c","tag-microsoft-semantic-kernel","tag-semantic-kernel"],"acf":[],"blog_post_summary":"<p>OpenAI&#8217;s gpt-4o-audio-preview\u00a0is a powerful multimodal model that enables audio input and output capabilities, allowing developers to create more natural and accessible AI interactions. This model supports both speech-to-text and text-to-speech functionalities in a single API call through the Chat Completions API, making it suitable for building voice-enabled applications where turn-based interactions are appropriate. 
In this [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/posts\/4601","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/users\/63983"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/comments?post=4601"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/posts\/4601\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/media\/4602"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/media?parent=4601"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/categories?post=4601"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/tags?post=4601"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}