Multi-modal interfaces are becoming increasingly popular with app developers. These interfaces let users interact with apps in a variety of ways by combining different modes of input and output, such as voice, touch, and visuals, to create a more interactive and engaging user experience. In this blog we will walk through how you can use Semantic Kernel in a multi-modal example.
Why use multi-modal experiences in your application?
Applications with multi-modal integration offer a more versatile and engaging user experience, making them an increasingly popular choice for application developers. By incorporating multi-modal experiences into your application, you can broaden the application's reach, improve end-user satisfaction, and drive product-led growth. Below are more key benefits of multi-modal implementation.
- Enhanced User Experience: Multi-modal applications provide users with more options for interacting with the application, making it more accessible and engaging. Users can choose the input mode that is most comfortable for them, such as voice, touch, or gesture-based input, and receive feedback through different modalities.
- Improved Accessibility: By supporting multiple modes of input and output, multi-modal applications can be more inclusive and accessible to users with disabilities or impairments. For example, users with visual impairments can use voice-based input and receive audio feedback.
- Increased Efficiency: Multi-modal applications can help users complete tasks more quickly and efficiently. By using voice-based input, users can input information more quickly than typing, while visual and audio feedback can provide quick confirmation and eliminate the need to switch between different modes of input and output.
- Personalization: Multi-modal applications can personalize the user experience by allowing users to choose the input and output modes that are most comfortable and convenient for them.
- Flexibility: Multi-modal applications can be used in a variety of contexts and environments, from hands-free operation in a car to touch-based input on a mobile device.
What are some use cases for multi-modal experiences?
- Voice Control: App developers can incorporate voice control into their apps, allowing users to interact with the app using their voice. This can be used to control various functions of the app, such as playing music, setting alarms, or searching for information.
- Gesture Control: App developers can also incorporate gesture control into their apps, allowing users to interact with the app using gestures such as swiping and tapping. This can be used to control various functions of the app, such as scrolling through menus or selecting options.
- Health and Wellness Apps: Health and wellness apps can use multi-modal input and output to help users track their fitness goals, monitor their health, and receive personalized feedback. For example, a fitness app might allow users to input their workouts through voice-based input and provide visual feedback on their progress.
- Education and Training: Education and training apps can use multi-modal input and output to provide learners with a more engaging and interactive experience. For example, a language-learning app might allow users to input their responses through voice-based input and provide visual feedback on their pronunciation.
- Entertainment and Gaming: Entertainment and gaming apps can use multi-modal input and output to create more immersive and engaging experiences for users. For example, a virtual reality game might use gesture-based input and provide audio and visual feedback to create a more realistic and immersive environment.
- Automotive: Automotive applications can use multi-modal input and output to provide drivers with a more natural and intuitive way to interact with their vehicles. For example, voice-based input can be used to input navigation commands, while visual and audio feedback can provide confirmation and alerts.
So how can you try a multi-modal experience with Semantic Kernel?
Follow along in this video or dive right in. First, run this notebook to enter your configuration. For this sample we will be using OpenAI models, so you need an OpenAI API key.
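The configuration notebook prompts you for the key and stores it for the notebooks that follow. If you would rather wire the values up yourself, here is a minimal sketch that reads them from environment variables instead; the variable names OPENAI_API_KEY and OPENAI_ORG_ID are assumptions for illustration, not something the sample requires.

```csharp
// Minimal sketch: read OpenAI credentials from environment variables.
// The notebook uses its own settings step; this is just an alternative way to supply the key.
using System;

var apiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY")
             ?? throw new InvalidOperationException("Set OPENAI_API_KEY before running the sample.");
var orgId = Environment.GetEnvironmentVariable("OPENAI_ORG_ID"); // optional for most OpenAI accounts
```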
Now we are ready to run the notebook example. In this notebook we combine OpenAI’s ChatGPT text-based large language model with OpenAI’s DALL-E 2 image model. We leverage Semantic Kernel’s extensible orchestration, which gives us the flexibility to invoke different model types from the same kernel.
The notebook starts by importing packages. We then define the models we will be using and create an AI service instance for each one. Next we create a new chat instance and describe what we will chat about; a sketch of this setup is shown below.
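Here is a rough sketch of that setup step in C#, based on the pre-1.0 Semantic Kernel SDK the notebooks used at the time. Exact package versions, namespaces, and builder method names vary between Semantic Kernel releases, and the system message shown is only an illustrative placeholder, so treat this as a sketch rather than the exact notebook code.

```csharp
// Sketch of the setup: register ChatGPT and DALL-E 2 with the kernel, then resolve
// the two AI services. Method and namespace names differ between SDK versions.
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.AI.ChatCompletion;
using Microsoft.SemanticKernel.AI.ImageGeneration;

var kernel = new KernelBuilder()
    .WithOpenAIChatCompletionService("gpt-3.5-turbo", apiKey, orgId)  // ChatGPT for text
    .WithOpenAIImageGenerationService(apiKey, orgId)                  // DALL-E 2 for images
    .Build();

var chatGPT = kernel.GetService<IChatCompletion>();
var dallE = kernel.GetService<IImageGeneration>();

// Start a new chat and describe what the conversation is about (placeholder instructions).
var systemMessage = "You're chatting with a user. Instead of replying directly, " +
                    "describe an image that expresses what you want to say.";
var chat = chatGPT.CreateNewChat(systemMessage);
```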
That’s it! We are ready to chat.
We create a loop to take a user’s text input and obtain a response from ChatGPT. We then feed ChatGPT’s text response to DALL-E 2 to render an image, and display that image as the answer to the user’s prompt; a sketch of the loop follows.
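Here is a sketch of that loop, continuing from the setup above and with the same version caveats. For simplicity it prints the image URL returned by the DALL-E 2 connector, whereas the notebook displays the image itself.

```csharp
// Sketch of the chat loop. Each turn: user text -> ChatGPT reply -> DALL-E 2 image.
while (true)
{
    Console.Write("User: ");
    var userMessage = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(userMessage)) break;  // empty input ends the chat

    chat.AddUserMessage(userMessage);

    // Get ChatGPT's text reply; we use it as the image description.
    var reply = await chatGPT.GenerateMessageAsync(chat);
    chat.AddAssistantMessage(reply);

    // Ask DALL-E 2 to render the reply; the connector returns a URL to the generated image.
    var imageUrl = await dallE.GenerateImageAsync(reply, 256, 256);
    Console.WriteLine($"Bot (as an image): {imageUrl}");
}
```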
We can continue the conversation, with the user’s input as text and the response displayed as an image. Check out the video to see this notebook in action. The example really shows how you can combine different model types to create new experiences beyond ‘traditional’ text chat.
Try the notebook and give us your feedback!
Next Steps:
Explore the sample on GitHub
Learn more about Semantic Kernel
Join the community and let us know what you think: https://aka.ms/sk/discord