November 15th, 2024

Working with Audio in Semantic Kernel Python

We are pleased to announce the arrival of audio support in Semantic Kernel Python. This new audio functionality will enable you to create more interactive and accessible user experiences. In this blog post, I will describe the new interface, list the existing connectors, and provide samples.

Audio-to-Text

The first feature we are introducing is the ability to transcribe audio into text. The foundational class for this feature is AudioToTextClientBase, which includes two public methods: get_text_contents and get_text_content. The former returns a list of possible transcriptions based on the number requested, while the latter returns a single transcription. The transcriptions are returned as TextContent objects. Notably, get_text_content internally calls get_text_contents and simply returns the first transcription from the list.
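The relationship between the two methods can be sketched as follows. This is a simplified illustration of the delegation pattern, not the library's source: SketchTextContent, SketchAudioToTextBase, and EchoAudioToText are hypothetical stand-ins, and the real classes' signatures may differ.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SketchTextContent:
    """Hypothetical stand-in for TextContent: holds one transcription."""

    text: str


class SketchAudioToTextBase(ABC):
    """Hypothetical stand-in for the base class; not the library's code."""

    @abstractmethod
    async def get_text_contents(self, audio: bytes) -> list[SketchTextContent]:
        """Return every candidate transcription for the audio."""

    async def get_text_content(self, audio: bytes) -> SketchTextContent:
        # Delegate to the plural method and keep only the first candidate.
        contents = await self.get_text_contents(audio)
        return contents[0]


class EchoAudioToText(SketchAudioToTextBase):
    """Toy connector that 'transcribes' by decoding the bytes as UTF-8."""

    async def get_text_contents(self, audio: bytes) -> list[SketchTextContent]:
        text = audio.decode("utf-8")
        return [SketchTextContent(text), SketchTextContent(text.upper())]


first = asyncio.run(EchoAudioToText().get_text_content(b"hello"))
print(first.text)
```

With this shape, a connector only has to implement the plural method; the singular one comes for free from the base class.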

As of the publication of this blog post, the available services include OpenAIAudioToText and AzureAudioToText, allowing you to utilize either your OpenAI endpoints or Azure deployments.

AudioContent

With the introduction of audio support, we are also introducing a new content type known as AudioContent. Instances of the AudioContent class should encapsulate either the binary data or the URI pointing to the location of the audio data. Additionally, this class offers a convenient method that allows you to create an AudioContent object directly from a file:

AudioContent.from_audio_file(path=PATH_TO_AUDIO_FILE)
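If you do not have a recording at hand, the sketch below writes a short placeholder WAV file using only the Python standard library, so that from_audio_file has something to load. The helper name write_test_tone and its parameters are hypothetical, and a pure tone will not produce a meaningful transcription; it is only useful for checking that the file handling works.

```python
import math
import struct
import wave


def write_test_tone(path: str, seconds: float = 1.0, freq_hz: float = 440.0,
                    sample_rate: int = 16000) -> int:
    """Write a mono 16-bit PCM sine tone to `path`; return the frame count."""
    n_frames = int(seconds * sample_rate)
    frames = bytearray()
    for i in range(n_frames):
        # Scale to 30% of full range to keep the tone quiet.
        sample = int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        frames += struct.pack("<h", sample)  # little-endian signed 16-bit

    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return n_frames
```

The resulting path can then be passed as PATH_TO_AUDIO_FILE in the call above.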

Audio-to-Text Example Using Azure OpenAI

Please ensure that Semantic Kernel is updated to the latest version. To process audio input, the following components are required:

  1. A speech-to-text model, such as whisper-1
  2. An audio input device

To begin, create the service with the following code:

from semantic_kernel.connectors.ai.open_ai.services.azure_audio_to_text import AzureAudioToText


audio_to_text_service = AzureAudioToText(api_key="...", deployment_name="...", endpoint="...")

Next, you will need to create the audio content that will be transcribed into text:

from semantic_kernel.contents.audio_content import AudioContent


audio_content = AudioContent.from_audio_file(path="...")

Finally, you can invoke the service to get the transcription:

user_input = await audio_to_text_service.get_text_content(audio_content)
print(user_input)
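In a regular Python script, await is only valid inside a coroutine, so the call above needs to run in an event loop. A minimal sketch of one way to do that follows; transcribe is a hypothetical helper, not part of Semantic Kernel, and it only assumes that the service exposes the async get_text_content method described earlier.

```python
import asyncio


async def transcribe(service, audio_content) -> str:
    """Return the transcription for one piece of audio as a plain string.

    `service` is any object with an async `get_text_content` method;
    `transcribe` itself is a hypothetical helper for illustration.
    """
    text_content = await service.get_text_content(audio_content)
    # Convert to str, mirroring the print(user_input) call in the post.
    return str(text_content)


# Usage, assuming the service and audio content created in the steps above:
# print(asyncio.run(transcribe(audio_to_text_service, audio_content)))
```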

To go further and build an interactive chat app that takes audio as input, please read this blog post or see the sample app in our GitHub repository.

Text-to-Audio

The second feature we are introducing is the ability to create audio from text. The foundational class for this feature is TextToAudioClientBase, which includes two public methods: get_audio_contents and get_audio_content. As with AudioToTextClientBase, the former returns a list of possible audio generations based on the number requested, while the latter returns a single audio generation. The audio generations are returned as AudioContent objects that contain the audio data. To listen to the audio data, AudioContent provides another convenient method that saves the data to an audio file:

audio_content = ...
audio_content.write_to_file(path=PATH_TO_FILE)

As of the publication of this blog post, the available services include OpenAITextToAudio and AzureTextToAudio, allowing you to utilize either your OpenAI endpoints or Azure deployments.

Text-to-Audio Example Using Azure OpenAI

Please ensure that Semantic Kernel is updated to the latest version. To process audio output, the following components are required:

  1. A text-to-speech model, such as tts
  2. An audio output device

To begin, create the service with the following code:

from semantic_kernel.connectors.ai.open_ai.services.azure_text_to_audio import AzureTextToAudio


text_to_audio_service = AzureTextToAudio(api_key="...", deployment_name="...", endpoint="...")

Next, you can invoke the service to get an audio generation:

audio_content = await text_to_audio_service.get_audio_content("Hello World!")

Finally, save the audio content so that you can listen to it with your favorite player:

audio_content.write_to_file(path="...")
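The two steps above can be combined into one coroutine and run from a regular script. This is a minimal sketch: synthesize_to_file is a hypothetical helper, not part of Semantic Kernel, and it relies only on the get_audio_content and write_to_file calls shown in this post.

```python
import asyncio


async def synthesize_to_file(service, text: str, path: str) -> str:
    """Generate speech for `text` and save it to `path`; return the path.

    `service` is any object with an async `get_audio_content` method whose
    result offers `write_to_file(path=...)`, as described in this post.
    """
    audio_content = await service.get_audio_content(text)
    audio_content.write_to_file(path=path)
    return path


# Usage, assuming the text_to_audio_service created above and an output
# path of your choice:
# asyncio.run(synthesize_to_file(text_to_audio_service, "Hello World!", "..."))
```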

To go further and build an interactive chat app that outputs audio, please read this blog post or see the sample app in our GitHub repository.

Conclusion

To learn more about Semantic Kernel, visit our learn site and our GitHub repository. If you have any questions or feedback, please reach out through our Semantic Kernel GitHub Discussion Channel. We look forward to hearing from you! We would also love your support: if you’ve enjoyed using Semantic Kernel, give us a star on GitHub.
