We are pleased to announce the arrival of audio support in Semantic Kernel Python. This new audio functionality enables you to create more interactive and accessible user experiences. In this blog post, I will detail the new interfaces and the existing connectors, and provide samples. Please continue reading for more information.
Audio-to-Text
The first feature we are introducing is the ability to transcribe audio into text. The foundational class for this feature is AudioToTextClientBase, which includes two public methods: get_text_contents and get_text_content. The former returns a list of possible transcriptions based on the number requested, while the latter returns a single transcription. The transcriptions are returned as TextContent objects. Notably, get_text_content internally calls get_text_contents and simply returns the first transcription from the list.
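This delegation pattern can be sketched as follows. Note that the classes below are a simplified illustration of the contract described above, not the actual Semantic Kernel implementation:

```python
from abc import ABC, abstractmethod


class TextContent:
    """Simplified stand-in for Semantic Kernel's TextContent."""

    def __init__(self, text: str):
        self.text = text


class AudioToTextClientBase(ABC):
    """Illustrative sketch of the audio-to-text base-class contract."""

    @abstractmethod
    async def get_text_contents(self, audio, **kwargs) -> list[TextContent]:
        """Return a list of candidate transcriptions."""

    async def get_text_content(self, audio, **kwargs) -> TextContent:
        # The single-result method simply returns the first candidate
        # from the list-returning method.
        return (await self.get_text_contents(audio, **kwargs))[0]
```

In the real connectors, get_text_contents is where each service calls its underlying OpenAI or Azure OpenAI endpoint; a subclass only needs to implement that one method to get both behaviors.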
As of the publication of this blog post, the available services include OpenAIAudioToText and AzureAudioToText, allowing you to utilize either your OpenAI endpoints or Azure deployments.
AudioContent
With the introduction of audio support, we are also introducing a new content type known as AudioContent. Instances of the AudioContent class encapsulate either the binary audio data or a URI pointing to the location of the audio data. Additionally, this class offers a convenient method that allows you to create an AudioContent object directly from a file:
AudioContent.from_audio_file(path=PATH_TO_AUDIO_FILE)
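Conceptually, creating content from a file means reading the file's bytes into the content object. The sketch below illustrates that idea with a hypothetical stand-in class; it is not the real AudioContent API:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class AudioData:
    """Hypothetical stand-in showing what an audio content object carries."""

    data: Optional[bytes] = None  # raw audio bytes, or
    uri: Optional[str] = None     # a URI pointing to the audio

    @classmethod
    def from_audio_file(cls, path: str) -> "AudioData":
        # Read the file's binary contents into the object.
        return cls(data=Path(path).read_bytes())
```

Holding either raw bytes or a URI lets downstream services accept audio regardless of whether it lives on disk, in memory, or at a remote location.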
Audio-to-Text Example Using Azure OpenAI
Please ensure that Semantic Kernel is updated to the latest version. To process audio input, the following components are required:
- A speech-to-text model, such as whisper-1
- An audio input device
To begin, create the service with the following code:
from semantic_kernel.connectors.ai.open_ai.services.azure_audio_to_text import AzureAudioToText
audio_to_text_service = AzureAudioToText(api_key="...", deployment_name="...", endpoint="...")
Next, you will need to create the audio content that will be transcribed into text:
from semantic_kernel.contents.audio_content import AudioContent
audio_content = AudioContent.from_audio_file(path="...")
Finally, you can invoke the service to get the transcription:
user_input = await audio_to_text_service.get_text_content(audio_content)
print(user_input)
To further create an interactive chat app that takes audio as input, please read this blog post or see the sample app in our GitHub repository.
Text-to-Audio
The second feature we are introducing is the ability to generate audio from text. The foundational class for this feature is TextToAudioClientBase, which includes two public methods: get_audio_contents and get_audio_content. Similar to AudioToTextClientBase, the former returns a list of possible audio generations based on the number requested, while the latter returns a single audio generation. The audio generations are returned as AudioContent objects that contain the audio data. To listen to the audio data, AudioContent provides another convenient method to save the data to an audio file:
audio_content = ...
audio_content.write_to_file(path=PATH_TO_FILE)
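Saving amounts to writing the raw audio bytes back to disk. As a rough sketch, again using a hypothetical stand-in class rather than the real AudioContent:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class AudioData:
    """Hypothetical stand-in for an audio content object."""

    data: bytes

    def write_to_file(self, path: str) -> None:
        # Persist the raw audio bytes so any media player can open the file.
        Path(path).write_bytes(self.data)
```

The file extension you choose should match the audio format the service returned, so that your player decodes it correctly.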
As of the publication of this blog post, the available services include OpenAITextToAudio and AzureTextToAudio, allowing you to utilize either your OpenAI endpoints or Azure deployments.
Text-to-Audio Example Using Azure OpenAI
Please ensure that Semantic Kernel is updated to the latest version. To process audio output, the following components are required:
- A text-to-speech model, such as tts-1
- An audio output device
To begin, create the service with the following code:
from semantic_kernel.connectors.ai.open_ai.services.azure_text_to_audio import AzureTextToAudio
text_to_audio_service = AzureTextToAudio(api_key="...", deployment_name="...", endpoint="...")
Next, you can invoke the service to get an audio generation:
audio_content = await text_to_audio_service.get_audio_content("Hello World!")
Finally, save the audio content so that you can listen to it with your favorite player:
audio_content.write_to_file(path="...")
To further create an interactive chat app that outputs audio, please read more in this blog post or see the sample app in our GitHub repository.
Conclusion
To learn more about Semantic Kernel, visit our learn site as well as our GitHub repository. Please reach out if you have any questions or feedback through our Semantic Kernel GitHub Discussion Channel. We look forward to hearing from you! We would also love your support: if you've enjoyed using Semantic Kernel, give us a star on GitHub.