
In this post, we’ll explore how to use the audio-preview model with Semantic Kernel in both C# and Python to create voice-enabled AI applications.
Best Use Cases
The audio-preview model is best suited for turn-based interactions in which complete audio messages are processed as discrete units: voice-based Q&A systems, audio transcription with AI responses, or asynchronous voice messaging where real-time interaction isn't critical.
Key Features of OpenAI’s Audio-Preview Model with Chat Completions API
- Multimodal Input/Output: Process both text and audio inputs, and generate both text and audio outputs in a single API call.
- Turn-Based Voice Interactions: Suitable for non-real-time, turn-based conversational applications where each interaction is a complete request-response cycle.
- Voice Synthesis Options: Generate speech with support for multiple voices (such as Alloy, Echo, Fable, Onyx, Nova, and Shimmer).
- Audio Understanding: Transcribe and comprehend spoken language from audio files.
- Multilingual Support: Process and generate audio in multiple languages, making it accessible to global users.
- Integration with Function Calling: Combine audio capabilities with function calling to create voice-controlled applications that can perform actions.
- Simplified Development: A single API handles both audio input processing and audio output generation, reducing the complexity of building voice-enabled applications.
- Batch Processing: Well-suited for applications where complete audio messages are processed as discrete units rather than continuous streams.
Note: For truly low-latency, real-time voice interactions, OpenAI’s Realtime API is the more appropriate choice. The Chat Completions API with audio capabilities is better suited for non-real-time applications where some latency is acceptable.
Using Audio-Preview in Semantic Kernel
Semantic Kernel provides a straightforward way to integrate with OpenAI’s audio-preview model. Let’s see how to implement basic audio input and output functionality in both C# and Python.
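Before looking at the SDK code, it can help to see what either SDK ultimately sends over the wire: the Chat Completions API receives audio inline as a base64-encoded `input_audio` content part, and the request declares which modalities it wants back plus the output voice and format. A minimal Python sketch of that request body (just the JSON shape as documented for the Chat Completions audio feature; no SDK, and the WAV bytes here are a stand-in):

```python
import base64
import json

# Stand-in for real WAV data read from disk or a microphone
audio_bytes = b"RIFF....WAVEfmt "

# Body of a chat-completions request that sends audio in and asks
# for both text and audio out
payload = {
    "model": "gpt-4o-audio-preview",
    "modalities": ["text", "audio"],                  # what to return
    "audio": {"voice": "shimmer", "format": "mp3"},   # output voice/format
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        # Audio travels inline as base64 text
                        "data": base64.b64encode(audio_bytes).decode("ascii"),
                        "format": "wav",
                    },
                }
            ],
        },
    ],
}

# The body is plain JSON, ready to POST to the chat completions endpoint
body = json.dumps(payload)
print(body[:60])
```

Semantic Kernel's connectors build and send this payload for you; the point of the sketch is only that audio input and output ride through the same single request-response cycle described above.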
In .NET (C#)
For a C# project using Semantic Kernel, you can add the audio-preview model as an OpenAI chat completion service. Make sure you have your OpenAI API key (or Azure OpenAI endpoint and key if using Azure):
```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using OpenAI.Chat;

// Initialize the OpenAI chat completion service with the audio-preview model
var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion(
        modelId: "gpt-4o-audio-preview",
        apiKey: "YOUR_OPENAI_API_KEY")
    .Build();

var chatCompletionService = kernel.GetRequiredService<IChatCompletionService>();

// Configure settings for audio output
var settings = new OpenAIPromptExecutionSettings
{
    Audio = new ChatAudioOptions(
        ChatOutputAudioVoice.Shimmer, // Choose from available voices
        ChatOutputAudioFormat.Mp3),   // Choose output format
    Modalities = ChatResponseModalities.Text | ChatResponseModalities.Audio // Request both text and audio
};

// Create a chat history and add an audio message
var chatHistory = new ChatHistory("You are a helpful assistant.");

// Add audio input (from a file or recorded audio)
byte[] audioBytes = File.ReadAllBytes("user_question.wav");
chatHistory.AddUserMessage([new AudioContent(audioBytes, "audio/wav")]);

// Get the model's response with both text and audio
var result = await chatCompletionService.GetChatMessageContentAsync(chatHistory, settings);

// Access the text response
Console.WriteLine($"Assistant > {result}");

// Access the audio response (if available)
if (result.Items.OfType<AudioContent>().Any())
{
    var audioContent = result.Items.OfType<AudioContent>().First();

    // Save or play the audio response
    File.WriteAllBytes("assistant_response.mp3", audioContent.Data.ToArray());
}
```
We have also created a C# sample using the audio-preview model in the Semantic Kernel repository here:
Conclusion
OpenAI’s audio-preview model represents a significant advancement in creating more natural and accessible AI interactions. With Semantic Kernel’s straightforward integration, developers can build voice-enabled applications that provide an enhanced user experience.