April 3rd, 2025

Using OpenAI’s Audio-Preview Model with Semantic Kernel

Roger Barreto
Senior Software Engineer

OpenAI Audio-preview support
OpenAI’s gpt-4o-audio-preview is a powerful multimodal model that enables audio input and output capabilities, allowing developers to create more natural and accessible AI interactions. This model supports both speech-to-text and text-to-speech functionalities in a single API call through the Chat Completions API, making it suitable for building voice-enabled applications where turn-based interactions are appropriate.

In this post, we’ll explore how to use the audio-preview model with Semantic Kernel in both C# and Python to create voice-enabled AI applications.

Best Use Cases

The audio-preview model is best for turn-based interactions where complete audio messages are processed as discrete units. It is suitable for applications such as voice-based Q&A systems, audio transcription with AI responses, or asynchronous voice messaging where real-time interaction isn’t critical.

Key Features of OpenAI’s Audio-Preview Model with Chat Completions API

  • Multimodal Input/Output: Process both text and audio inputs, and generate both text and audio outputs in a single API call.

  • Turn-Based Voice Interactions: Suitable for non-real-time, turn-based conversational applications where each interaction is a complete request-response cycle.

  • Voice Synthesis Options: Generate speech with support for multiple voices (like Alloy, Echo, Fable, Onyx, Nova, and Shimmer).

  • Audio Understanding: Transcribe and comprehend spoken language from audio files.

  • Multilingual Support: Process and generate audio in multiple languages, making it accessible to global users.

  • Integration with Function Calling: Combine audio capabilities with function calling to create voice-controlled applications that can perform actions.

  • Simplified Development: Single API for both audio input processing and audio output generation, reducing the complexity of building voice-enabled applications.

  • Batch Processing: Well-suited for applications where complete audio messages are processed as discrete units rather than continuous streams.

Note: For truly low-latency, real-time voice interactions, OpenAI’s Realtime API is the more appropriate choice. The Chat Completions API with audio capabilities is better suited for non-real-time applications where some latency is acceptable.

Using Audio-Preview in Semantic Kernel

Semantic Kernel provides a straightforward way to integrate with OpenAI’s audio-preview model. Let’s see how to implement basic audio input and output functionality in both C# and Python.

In .NET (C#)

For a C# project using Semantic Kernel, you can add the audio-preview model as an OpenAI chat completion service. Make sure you have your OpenAI API key (or Azure OpenAI endpoint and key if using Azure):

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;

// Initialize the OpenAI chat completion service with the audio-preview model
var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion(
        modelId: "gpt-4o-audio-preview",
        apiKey: "YOUR_OPENAI_API_KEY"
    )
    .Build();

var chatCompletionService = kernel.GetRequiredService<IChatCompletionService>();

// Configure settings for audio output
var settings = new OpenAIPromptExecutionSettings
{
    Audio = new ChatAudioOptions(
        ChatOutputAudioVoice.Shimmer, // Choose from available voices
        ChatOutputAudioFormat.Mp3     // Choose output format
    ),
    Modalities = ChatResponseModalities.Text | ChatResponseModalities.Audio // Request both text and audio
};

// Create a chat history and add an audio message
var chatHistory = new ChatHistory("You are a helpful assistant.");

// Add audio input (from a file or recorded audio)
byte[] audioBytes = File.ReadAllBytes("user_question.wav");
chatHistory.AddUserMessage([new AudioContent(audioBytes, "audio/wav")]);

// Get the model's response with both text and audio
var result = await chatCompletionService.GetChatMessageContentAsync(chatHistory, settings);

// Access the text response
Console.WriteLine($"Assistant > {result}");

// Access the audio response (if available)
var audioContent = result.Items.OfType<AudioContent>().FirstOrDefault();
if (audioContent?.Data is { } audioData)
{
    // Save or play the audio response (Data is nullable, so unwrap it first)
    File.WriteAllBytes("assistant_response.mp3", audioData.ToArray());
}
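As noted in the feature list, audio input also composes with function calling. Below is a minimal sketch extending the setup above; the `TimePlugin` class and its `GetCurrentTime` function are hypothetical, illustrative names, not part of Semantic Kernel:

```csharp
using System.ComponentModel;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.OpenAI;

// Register the plugin so the model can invoke it from a spoken request
var builder = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion(
        modelId: "gpt-4o-audio-preview",
        apiKey: "YOUR_OPENAI_API_KEY");
builder.Plugins.AddFromType<TimePlugin>();
var kernel = builder.Build();

// Allow automatic function invocation alongside the audio settings shown earlier
var settings = new OpenAIPromptExecutionSettings
{
    FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
};

// A hypothetical plugin exposing a function the model may call
public class TimePlugin
{
    [KernelFunction, Description("Gets the current local time.")]
    public string GetCurrentTime() => DateTime.Now.ToString("t");
}
```

With `FunctionChoiceBehavior.Auto()`, a spoken question like “what time is it?” can trigger `GetCurrentTime` before the text and audio response is generated.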

We have also created a C# sample using the audio-preview model in the Semantic Kernel repository.
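In Python

The Python experience is analogous. The sketch below assumes the `semantic-kernel` package’s API mirrors the C# surface shown above; the `audio` and `modalities` setting names are assumptions based on the C# equivalents, so verify them against the Semantic Kernel Python documentation. The API key is expected in the `OPENAI_API_KEY` environment variable.

```python
import asyncio

from semantic_kernel.connectors.ai.open_ai import (
    OpenAIChatCompletion,
    OpenAIChatPromptExecutionSettings,
)
from semantic_kernel.contents import AudioContent, ChatHistory


async def main():
    # Initialize the chat completion service with the audio-preview model;
    # the API key is read from the OPENAI_API_KEY environment variable
    chat_service = OpenAIChatCompletion(ai_model_id="gpt-4o-audio-preview")

    # Request both text and audio output
    # (field names assumed to mirror the C# ChatAudioOptions settings)
    settings = OpenAIChatPromptExecutionSettings(
        modalities=["text", "audio"],
        audio={"voice": "shimmer", "format": "mp3"},
    )

    chat_history = ChatHistory(system_message="You are a helpful assistant.")
    # Add audio input from a file as a user message
    chat_history.add_user_message(
        [AudioContent.from_audio_file("user_question.wav")]
    )

    # Get the model's response
    result = await chat_service.get_chat_message_content(chat_history, settings)
    print(f"Assistant > {result}")


asyncio.run(main())
```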

Conclusion

OpenAI’s audio-preview model represents a significant advancement in creating more natural and accessible AI interactions. With Semantic Kernel’s straightforward integration, developers can build voice-enabled applications that provide an enhanced user experience.

