November 15th, 2024

Allow users to talk and listen to your chatbot using Semantic Kernel Python

Until now, Semantic Kernel Python only allowed for the development of text-based AI applications. However, this is no longer the case, as we have expanded its capabilities to include audio as one of the supported modalities. In this article, I will provide a detailed, step-by-step guide on how to create a chatbot that can both speak to and listen to your users.

At the time of writing, OpenAI has also announced the release of the Realtime API; you can find more information here. Please note that this blog post is not intended as a tutorial on the Realtime API. The Semantic Kernel team remains committed to bringing the latest advancements in AI to all developers, so stay tuned for future updates.

Step 1: Create a chatbot

  1. Please make sure you have the latest Semantic Kernel Python installed (see the install command after this list).
  2. Please make sure you have an Azure OpenAI chat completion model deployment or an OpenAI endpoint.
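If you haven't installed Semantic Kernel Python yet, or want to make sure you are on a recent release (the audio connectors are only available in newer versions), you can install or upgrade it with pip:

pip install --upgrade semantic-kernel

The AzureChatCompletion connector used below can pick up its endpoint, API key, and deployment name from environment variables or a .env file, or you can pass them explicitly to the constructor. With that in place, the text-only chatbot looks like this: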
# Copyright (c) Microsoft. All rights reserved.

import asyncio
import logging
import os

from semantic_kernel.connectors.ai.open_ai import (
    AzureChatCompletion,
    OpenAIChatPromptExecutionSettings,
)
from semantic_kernel.contents import ChatHistory

logging.basicConfig(level=logging.WARNING)

system_message = """
You are a chat bot. Your name is Mosscap and
you have one goal: figure out what people need.
Your full name, should you need to know it, is
Splendid Speckled Mosscap. You communicate
effectively, but you tend to answer with long
flowery prose.
"""


chat_service = AzureChatCompletion()

history = ChatHistory(system_message=system_message)
history.add_user_message("Hi there, who are you?")
history.add_assistant_message("I am Mosscap, a chat bot. I'm trying to figure out what people need.")


async def chat() -> bool:
    try:
        user_input = input("User:> ")
    except KeyboardInterrupt:
        print("\n\nExiting chat...")
        return False
    except EOFError:
        print("\n\nExiting chat...")
        return False

    if "exit" in user_input.lower():
        print("\n\nExiting chat...")
        return False

    history.add_user_message(user_input)

    chunks = chat_service.get_streaming_chat_message_content(
        chat_history=history,
        settings=OpenAIChatPromptExecutionSettings(
            max_tokens=2000,
            temperature=0.7,
            top_p=0.8,
        ),
    )

    print("Mosscap:> ", end="")
    answer = ""
    async for message in chunks:
        print(str(message), end="")
        answer += str(message)
    print("\n")

    history.add_assistant_message(str(answer))

    return True


async def main() -> None:
    chatting = True
    while chatting:
        chatting = await chat()


if __name__ == "__main__":
    asyncio.run(main())

In the code snippet provided, we begin by defining a system message that establishes the personality of the chatbot. Subsequently, we create a chat completion service utilizing the Azure OpenAI connector, along with a chat history containing pre-populated messages to initiate the conversation. Finally, we implement a loop that captures user input and generates a streaming response. It is important to note that both user inputs and model responses will be stored in the chat history, allowing the chatbot to maintain the context of the conversation throughout each iteration.
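If you are using OpenAI directly rather than Azure OpenAI, you can swap in the OpenAI chat completion connector instead. A minimal sketch, assuming your OpenAI API key is available to the connector (for example via an environment variable or .env file); the model id shown is only an example:

from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion

# Example only: use any chat-capable model you have access to.
chat_service = OpenAIChatCompletion(ai_model_id="gpt-4o")

The rest of the code stays the same, since both connectors implement the same chat completion interface.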

Step 2: Allow the chatbot to listen to you

  1. Please make sure you have an Azure OpenAI speech-to-text model (e.g., whisper) deployment or an OpenAI endpoint.
  2. Python dependency: pyaudio for working with audio
  3. Python dependency: keyboard for controlling audio input duration
pip install pyaudio
pip install keyboard

Our goal is to convert audio into text, a step commonly referred to as “transcription”.

from semantic_kernel.connectors.ai.open_ai import AzureAudioToText

audio_to_text_service = AzureAudioToText()

We first create an audio-to-text service, here using the AzureAudioToText connector.
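If you are working against OpenAI rather than Azure OpenAI, the corresponding OpenAI connector can be used in the same way. A minimal sketch, where the model id is an example and the API key is assumed to come from your environment:

from semantic_kernel.connectors.ai.open_ai import OpenAIAudioToText

# Example only: whisper-1 is OpenAI's hosted speech-to-text model.
audio_to_text_service = OpenAIAudioToText(ai_model_id="whisper-1")

Next, we need a way to actually capture audio from the microphone.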

# Copyright (c) Microsoft. All rights reserved.

import os
import wave
from typing import ClassVar

import keyboard
import pyaudio
from pydantic import BaseModel


class AudioRecorder(BaseModel):
    """A class to record audio from the microphone and save it to a WAV file.

    To start recording, press the spacebar. To stop recording, release the spacebar.

    Use it as a context manager to automatically remove the output file after exiting the context:
    ```
    with AudioRecorder(output_filepath="output.wav") as recorder:
        recorder.start_recording()
        # Do something with the recorded audio
        ...
    ```
    """

    # Audio recording parameters
    FORMAT: ClassVar[int] = pyaudio.paInt16
    CHANNELS: ClassVar[int] = 1
    RATE: ClassVar[int] = 44100
    CHUNK: ClassVar[int] = 1024

    output_filepath: str

    def start_recording(self) -> None:
        # Wait for the spacebar to be pressed to start recording
        keyboard.wait("space")

        # Start recording
        audio = pyaudio.PyAudio()
        stream = audio.open(
            format=self.FORMAT,
            channels=self.CHANNELS,
            rate=self.RATE,
            input=True,
            frames_per_buffer=self.CHUNK,
        )
        frames = []

        while keyboard.is_pressed("space"):
            data = stream.read(self.CHUNK)
            frames.append(data)

        # Recording stopped as the spacebar is released
        stream.stop_stream()
        stream.close()

        # Save the recorded data as a WAV file
        with wave.open(self.output_filepath, "wb") as wf:
            wf.setnchannels(self.CHANNELS)
            wf.setsampwidth(audio.get_sample_size(self.FORMAT))
            wf.setframerate(self.RATE)
            wf.writeframes(b"".join(frames))

        audio.terminate()

    def remove_output_file(self) -> None:
        os.remove(self.output_filepath)

    def __enter__(self) -> "AudioRecorder":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        self.remove_output_file()

Next, we will create a helper class named AudioRecorder, which facilitates easier interaction with audio functionality on your system. This class initiates recording when the user presses and holds the space bar on the keyboard, and it ceases recording upon the release of the key. The recorded audio is saved as a file on the disk. Additionally, when used as a context manager, the audio file will be automatically deleted after the audio processing is completed.

from semantic_kernel.contents import AudioContent

AUDIO_FILEPATH = os.path.join(os.path.dirname(__file__), "output.wav")

try:
    print("User:> ", end="", flush=True)
    with AudioRecorder(output_filepath=AUDIO_FILEPATH) as recorder:
        recorder.start_recording()
        user_input = await audio_to_text_service.get_text_content(AudioContent.from_audio_file(AUDIO_FILEPATH))
        print(user_input.text)
except KeyboardInterrupt:
    print("\n\nExiting chat...")
    return False
except EOFError:
    print("\n\nExiting chat...")
    return False

if "exit" in user_input.text.lower():
    print("\n\nExiting chat...")
    return False

history.add_user_message(user_input.text)

Finally, we use the AudioRecorder to capture user input and transcribe the audio into text for the chat completion service. It is important to note that, since the audio-to-text service returns a TextContent object, we must retrieve the text by accessing its text property.

To view the complete sample, please visit this link to our GitHub repository.

Now, please run the application. Hold down the space bar on your keyboard and begin speaking. Once you have finished, release the key and wait for the response to be displayed on the screen.

Step 3: Allow the chatbot to talk to you

What we have achieved thus far is promising, but it is not yet complete. It is time to incorporate the final component.

  1. Please make sure you have an Azure OpenAI text-to-speech model (e.g., tts) deployment or an OpenAI endpoint.

Our goal is to convert the response generated by the chat completion service back to audio.

from semantic_kernel.connectors.ai.open_ai import AzureTextToAudio

text_to_audio_service = AzureTextToAudio()

We first create a text-to-audio service, here using the AzureTextToAudio connector.
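As with the other services, an OpenAI equivalent is available if you are not on Azure OpenAI. A minimal sketch, where the model id is an example and the API key is assumed to come from your environment:

from semantic_kernel.connectors.ai.open_ai import OpenAITextToAudio

# Example only: tts-1 is OpenAI's hosted text-to-speech model.
text_to_audio_service = OpenAITextToAudio(ai_model_id="tts-1")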

from semantic_kernel.connectors.ai.open_ai import OpenAITextToAudioExecutionSettings

audio_content = await text_to_audio_service.get_audio_content(
    response.content, OpenAITextToAudioExecutionSettings(response_format="wav")
)

Next, we will invoke the text-to-audio service to generate audio for the response. At this stage, we are specifying the output format for reasons that will be addressed later. With the OpenAITextToAudioExecutionSettings, you can also define the type of voice and the speed of the audio. Please feel free to experiment with different settings to discover those that you find most comfortable.
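As a sketch of what tweaking those settings could look like, the example below sets a voice and playback speed. The voice and speed field names are assumed here to mirror the parameters of the OpenAI audio API, so double-check them against the version of the settings class you are using:

audio_content = await text_to_audio_service.get_audio_content(
    response.content,
    OpenAITextToAudioExecutionSettings(
        response_format="wav",  # keep WAV so the AudioPlayer below can parse it with the wave module
        voice="alloy",          # example voice name; other voices are available
        speed=1.0,              # 1.0 is normal speed; higher values speak faster
    ),
)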

# Copyright (c) Microsoft. All rights reserved.

import io
import logging
import wave
from typing import ClassVar

import pyaudio
from pydantic import BaseModel

from semantic_kernel.contents import AudioContent

logging.basicConfig(level=logging.WARNING)
logger: logging.Logger = logging.getLogger(__name__)


class AudioPlayer(BaseModel):
    """A class to play an audio file to the default audio output device."""

    # Audio replay parameters
    CHUNK: ClassVar[int] = 1024

    audio_content: AudioContent

    def play(self, text: str | None = None) -> None:
        """Play the audio content to the default audio output device.

        Args:
            text (str, optional): The text to display while playing the audio. Defaults to None.
        """
        audio_stream = io.BytesIO(self.audio_content.data)
        with wave.open(audio_stream, "rb") as wf:
            audio = pyaudio.PyAudio()
            stream = audio.open(
                format=audio.get_format_from_width(wf.getsampwidth()),
                channels=wf.getnchannels(),
                rate=wf.getframerate(),
                output=True,
            )

            if text:
                # Simulate the output of text while playing the audio
                data_frames = []

                data = wf.readframes(self.CHUNK)
                while data:
                    data_frames.append(data)
                    data = wf.readframes(self.CHUNK)

                if len(data_frames) < len(text):
                    logger.warning(
                        "The audio is too short to play the entire text. "
                        "The text will be displayed without synchronization."
                    )
                    print(text)
                else:
                    for data_frame, text_frame in self._zip_text_and_audio(text, data_frames):
                        stream.write(data_frame)
                        print(text_frame, end="", flush=True)
                    print()
            else:
                data = wf.readframes(self.CHUNK)
                while data:
                    stream.write(data)
                    data = wf.readframes(self.CHUNK)

            stream.stop_stream()
            stream.close()
            audio.terminate()

    def _zip_text_and_audio(self, text: str, audio_frames: list) -> zip:
        """Zip the text and audio frames together so that they can be displayed in sync.

        This is done by evenly distributing empty strings between the characters and
        appending the remaining empty strings at the end.

        Args:
            text (str): The text to display while playing the audio.
            audio_frames (list): The audio frames to play.

        Returns:
            zip: The zipped text and audio frames.
        """
        text_frames = list(text)
        empty_string_count = len(audio_frames) - len(text_frames)
        # Guard against edge cases: no padding needed, or far more audio frames than characters.
        if empty_string_count == 0:
            return zip(audio_frames, text_frames)
        empty_string_spacing = max(1, len(text_frames) // empty_string_count)

        modified_text_frames = []
        current_empty_string_count = 0
        for i, text_frame in enumerate(text_frames):
            modified_text_frames.append(text_frame)
            if current_empty_string_count < empty_string_count and i % empty_string_spacing == 0:
                modified_text_frames.append("")
                current_empty_string_count += 1

        if current_empty_string_count < empty_string_count:
            modified_text_frames.extend([""] * (empty_string_count - current_empty_string_count))

        return zip(audio_frames, modified_text_frames)

Now we have the audio; however, we still lack a way to play it for the user. To address this, we create another helper class called AudioPlayer, which accepts an AudioContent and plays the audio through your speakers. This helper class also synchronizes the audio with the text when text is provided, creating a streaming effect.

print("Mosscap:> ", end="", flush=True)
AudioPlayer(audio_content=audio_content).play(text=response.content)

Finally, we play the audio and display the response on the screen.
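Putting all three steps together, a single turn of the voice-enabled chat loop looks roughly like the sketch below, reusing the names defined earlier in this post. A non-streaming chat completion call is assumed here so that the full response text is available for speech synthesis, and error handling is omitted for brevity; see the linked sample for the exact code.

async def chat() -> bool:
    # Step 2: record the user's speech and transcribe it
    print("User:> ", end="", flush=True)
    with AudioRecorder(output_filepath=AUDIO_FILEPATH) as recorder:
        recorder.start_recording()
        user_input = await audio_to_text_service.get_text_content(AudioContent.from_audio_file(AUDIO_FILEPATH))
        print(user_input.text)

    if "exit" in user_input.text.lower():
        print("\n\nExiting chat...")
        return False

    history.add_user_message(user_input.text)

    # Step 1: generate the assistant's reply (non-streaming, so the full text is available at once)
    response = await chat_service.get_chat_message_content(
        chat_history=history,
        settings=OpenAIChatPromptExecutionSettings(max_tokens=2000, temperature=0.7, top_p=0.8),
    )

    # Step 3: synthesize the reply and play it while printing the text in sync
    audio_content = await text_to_audio_service.get_audio_content(
        response.content, OpenAITextToAudioExecutionSettings(response_format="wav")
    )
    print("Mosscap:> ", end="", flush=True)
    AudioPlayer(audio_content=audio_content).play(text=response.content)

    history.add_assistant_message(response.content)

    return True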

To view the complete sample, please visit this link to our GitHub repository.

Now, please run the application. Press and hold the space bar on your keyboard while you begin speaking. Once you have finished, release the key and wait for the response to be spoken to you.

Conclusion

In this blog post, we showed you how to incorporate audio into your AI application and elevate its experience with just a few lines of code. In fact, recording and playing audio on your computer takes more code than the Semantic Kernel integration itself! To learn more about the basics, you can read more in this blog post or visit our learn site as well as our GitHub repository.

 

The Semantic Kernel team is dedicated to empowering developers by providing access to the latest advancements in the industry. We encourage you to leverage your creativity and build remarkable solutions with SK! Please reach out if you have any questions or feedback through our Semantic Kernel GitHub Discussion Channel. We look forward to hearing from you! We would also love your support: if you've enjoyed using Semantic Kernel, give us a star on GitHub.
