April 3rd, 2025

Using OpenAI’s Audio-Preview Model with Semantic Kernel

Roger Barreto
Senior Software Engineer

OpenAI Audio-preview support
OpenAI’s gpt-4o-audio-preview is a powerful multimodal model that enables audio input and output capabilities, allowing developers to create more natural and accessible AI interactions. This model supports both speech-to-text and text-to-speech functionalities in a single API call through the Chat Completions API, making it suitable for building voice-enabled applications where turn-based interactions are appropriate.

In this post, we’ll explore how to use the audio-preview model with Semantic Kernel in both C# and Python to create voice-enabled AI applications.

Best Use Cases

The audio-preview model is best for turn-based interactions where complete audio messages are processed as discrete units. It is suitable for applications such as voice-based Q&A systems, audio transcription with AI responses, or asynchronous voice messaging where real-time interaction isn’t critical.

Key Features of OpenAI’s Audio-Preview Model with Chat Completions API

  • Multimodal Input/Output: Process both text and audio inputs, and generate both text and audio outputs in a single API call.

  • Turn-Based Voice Interactions: Suitable for non-real-time, turn-based conversational applications where each interaction is a complete request-response cycle.

  • Voice Synthesis Options: Generate speech with support for multiple voices (like Alloy, Echo, Fable, Onyx, Nova, and Shimmer).

  • Audio Understanding: Transcribe and comprehend spoken language from audio files.

  • Multilingual Support: Process and generate audio in multiple languages, making it accessible to global users.

  • Integration with Function Calling: Combine audio capabilities with function calling to create voice-controlled applications that can perform actions.

  • Simplified Development: Single API for both audio input processing and audio output generation, reducing the complexity of building voice-enabled applications.

  • Batch Processing: Well-suited for applications where complete audio messages are processed as discrete units rather than continuous streams.

Note: For truly low-latency, real-time voice interactions, OpenAI’s Realtime API is the more appropriate choice. The Chat Completions API with audio capabilities is better suited for non-real-time applications where some latency is acceptable.

Using Audio-Preview in Semantic Kernel

Semantic Kernel provides a straightforward way to integrate with OpenAI’s audio-preview model. Let’s see how to implement basic audio input and output functionality in both C# and Python.

In .NET (C#)

For a C# project using Semantic Kernel, you can add the audio-preview model as an OpenAI chat completion service. Make sure you have your OpenAI API key (or Azure OpenAI endpoint and key if using Azure):

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;

// Initialize the OpenAI chat completion service with the audio-preview model
var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion(
        modelId: "gpt-4o-audio-preview",
        apiKey: "YOUR_OPENAI_API_KEY"
    )
    .Build();

var chatCompletionService = kernel.GetRequiredService<IChatCompletionService>();

// Configure settings for audio output
var settings = new OpenAIPromptExecutionSettings
{
    Audio = new ChatAudioOptions(
        ChatOutputAudioVoice.Shimmer, // Choose from available voices
        ChatOutputAudioFormat.Mp3     // Choose output format
    ),
    Modalities = ChatResponseModalities.Text | ChatResponseModalities.Audio // Request both text and audio
};

// Create a chat history and add an audio message
var chatHistory = new ChatHistory("You are a helpful assistant.");

// Add audio input (from a file or recorded audio)
byte[] audioBytes = File.ReadAllBytes("user_question.wav");
chatHistory.AddUserMessage([new AudioContent(audioBytes, "audio/wav")]);

// Get the model's response with both text and audio
var result = await chatCompletionService.GetChatMessageContentAsync(chatHistory, settings);

// Access the text response
Console.WriteLine($"Assistant > {result}");

// Access the audio response (if available)
var audioContent = result.Items.OfType<AudioContent>().FirstOrDefault();
if (audioContent?.Data is { } audioData)
{
    // Save or play the audio response (Data is nullable, so unwrap it first)
    File.WriteAllBytes("assistant_response.mp3", audioData.ToArray());
}
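As noted in the feature list, audio input also composes with function calling. Below is a minimal sketch extending the setup above; the `TimePlugin` class and its `GetCurrentTime` function are hypothetical, illustrative names, not part of Semantic Kernel:

```csharp
using System.ComponentModel;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.OpenAI;

// Register the plugin so the model can invoke it from a spoken request
var builder = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion(
        modelId: "gpt-4o-audio-preview",
        apiKey: "YOUR_OPENAI_API_KEY");
builder.Plugins.AddFromType<TimePlugin>();
var kernel = builder.Build();

// Allow automatic function invocation alongside the audio settings shown earlier
var settings = new OpenAIPromptExecutionSettings
{
    FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
};

// A hypothetical plugin exposing a function the model may call
public class TimePlugin
{
    [KernelFunction, Description("Gets the current local time.")]
    public string GetCurrentTime() => DateTime.Now.ToString("t");
}
```

With `FunctionChoiceBehavior.Auto()`, a spoken question like “what time is it?” can trigger `GetCurrentTime` before the text and audio response is generated.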

We have also created a C# sample using the audio-preview model in the Semantic Kernel repository.
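In Python

The Python experience is analogous. The sketch below assumes the `semantic-kernel` package’s API mirrors the C# surface shown above; the `audio` and `modalities` setting names are assumptions based on the C# equivalents, so verify them against the Semantic Kernel Python documentation. The API key is expected in the `OPENAI_API_KEY` environment variable.

```python
import asyncio

from semantic_kernel.connectors.ai.open_ai import (
    OpenAIChatCompletion,
    OpenAIChatPromptExecutionSettings,
)
from semantic_kernel.contents import AudioContent, ChatHistory


async def main():
    # Initialize the chat completion service with the audio-preview model;
    # the API key is read from the OPENAI_API_KEY environment variable
    chat_service = OpenAIChatCompletion(ai_model_id="gpt-4o-audio-preview")

    # Request both text and audio output
    # (field names assumed to mirror the C# ChatAudioOptions settings)
    settings = OpenAIChatPromptExecutionSettings(
        modalities=["text", "audio"],
        audio={"voice": "shimmer", "format": "mp3"},
    )

    chat_history = ChatHistory(system_message="You are a helpful assistant.")
    # Add audio input from a file as a user message
    chat_history.add_user_message(
        [AudioContent.from_audio_file("user_question.wav")]
    )

    # Get the model's response
    result = await chat_service.get_chat_message_content(chat_history, settings)
    print(f"Assistant > {result}")


asyncio.run(main())
```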

Conclusion

OpenAI’s audio-preview model represents a significant advancement in creating more natural and accessible AI interactions. With Semantic Kernel’s straightforward integration, developers can build voice-enabled applications that provide an enhanced user experience.

