September 5th, 2024

Guest Blog: Vision Buddy – Empowering the Visually Impaired with AI and .NET Semantic Kernel

Today we’re featuring a guest author, Jonathan David. He’s written an article, shared below, about how he created Vision Buddy, a web app that uses .NET and Semantic Kernel to empower visually impaired individuals. We’ll turn it over to Jonathan to share more!

Discover how this proof-of-concept leverages AI using .NET Semantic Kernel to provide visually impaired users with real-time audio descriptions of their surroundings. This web app showcases the potential power of AI in Assistive Technology.

Vision Buddy - Empowering the Visually Impaired with AI and .NET Semantic Kernel

Introduction

Today’s technological landscape is evolving at an unprecedented pace, with AI and machine learning at the forefront of innovation. These technologies have the potential to revolutionize the way we interact with the world around us, making it more accessible and inclusive for everyone. The development of assistive technologies is not just about innovation; it’s about creating tools that can significantly enhance the quality of life for people with disabilities.

In recent years, I’ve focused on creating accessible frontends and web applications. This work has taught me a lot about using proper HTML tags, ARIA attributes, and screen reader testing. I’ve also gained valuable insights into various types of impairments, enhancing my understanding of inclusive design.

But I wanted to go further. I wanted to explore how AI could be used to empower impaired users and help them navigate the world around them. This is how the idea for Vision Buddy was born.

Vision Buddy is a proof of concept designed to assist visually impaired individuals: it uses Azure OpenAI to transform images into descriptive audio, giving users an enhanced sense of their surroundings through accessible technology.

The Power of AI in Assistive Technology

The WHO states in its 2022 Global Report on Assistive Technology that more than 2.5 billion people need at least one form of assistive technology and predicts that this number will rise to 3.5 billion by 2050. This includes devices, software, and systems that help people with disabilities live more independently and participate more fully in society.

Assistive technology can take many forms, from screen readers and magnifiers to speech recognition software and communication devices. The goal is to remove barriers to access and enable individuals to perform everyday tasks with greater ease and efficiency.

AI is revolutionizing the way scientists and developers approach accessibility. With recent strides in computer vision and natural language processing, for example, AI systems can now interpret visual data, generate meaningful descriptions, and even convert text to speech with remarkable accuracy and naturalness. These advancements hold increasing potential for visually impaired individuals, enabling them to interact with the world in ways that were once unimaginable.

Introducing Vision Buddy: A Proof of Concept

The idea behind the web app I dubbed ‘Vision Buddy’ is to use AI’s ability to process and interpret visual data to serve as digital “eyes” for those who cannot perceive the world around them as well as others.

The application consists of several key components:

  • Azure OpenAI Services:
    • A GPT-4o model generates natural-language descriptions of images, using a tuned system prompt to strike a good balance between the length and the level of detail of the response.
    • A TTS model converts the text to speech, providing users with natural-sounding audio feedback. As of this writing, the TTS model is still considered experimental and is only available in English.
  • .NET Service: The backbone of the application, this service orchestrates the flow of data between the user interface and the AI models. It relies heavily on .NET Semantic Kernel to interact with the Azure OpenAI services; a minimal setup sketch follows this list.
  • Vue 3 Web App: The user interface, built with Vue 3, lets users take pictures, send them to the .NET service, and receive audio descriptions for playback.
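To give a sense of how little plumbing this requires, here is a minimal setup sketch showing how such a kernel could be configured. The deployment names, endpoint, and key handling are placeholders, and the connector methods assume a recent Semantic Kernel release, so the original project may wire things up slightly differently.

// Minimal Semantic Kernel setup sketch (placeholder names, not the original project's code).
// The text-to-audio connector is still experimental, hence the SKEXP pragma.
#pragma warning disable SKEXP0001, SKEXP0010

using Microsoft.SemanticKernel;

var builder = Kernel.CreateBuilder();

// GPT-4o deployment used to generate the image descriptions.
builder.AddAzureOpenAIChatCompletion(
    deploymentName: "gpt-4o",                                   // placeholder deployment name
    endpoint: "https://<your-resource>.openai.azure.com/",
    apiKey: Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!);

// TTS deployment used to turn the descriptions into audio (English only at the time of writing).
builder.AddAzureOpenAITextToAudio(
    deploymentName: "tts",                                      // placeholder deployment name
    endpoint: "https://<your-resource>.openai.azure.com/",
    apiKey: Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!);

Kernel kernel = builder.Build();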

In a nutshell, Vision Buddy works as follows:

A user takes a picture using the Vue 3 web app on their mobile device. This image is sent to the .NET service, which then relays it to Azure OpenAI’s GPT-4o model. The model processes the image and returns a short, meaningful description, which is then converted into speech using the TTS model and played back to the user through the web app.
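To make the round trip a bit more concrete, the following is a rough sketch of what the .NET service’s endpoint could look like as an ASP.NET Core minimal API. The route and the JSON response shape are illustrative assumptions rather than the original implementation; the actual Semantic Kernel calls are sketched further down in the Technical Details section.

// Illustrative minimal API endpoint for the image-to-audio round trip.
// The route and response contract are assumptions, not the original implementation.
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.MapPost("/api/describe", async (HttpRequest request) =>
{
    // Read the JPEG bytes sent by the Vue 3 web app.
    using var buffer = new MemoryStream();
    await request.Body.CopyToAsync(buffer);
    byte[] imageBytes = buffer.ToArray();

    // Hand the image to GPT-4o and the TTS model via Semantic Kernel
    // (see the DescribeImageAsync sketch in the Technical Details section).
    string description = "...";                 // placeholder for the generated description
    byte[] audio = Array.Empty<byte>();         // placeholder for the synthesized audio

    // Return both so the web app can show the text and play the audio.
    return Results.Ok(new { description, audio = Convert.ToBase64String(audio) });
});

app.Run();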

Technical Details

Once the user opens the web app, they are presented with a simple interface for taking a picture. During loading, the app asks for permission to use the camera by harnessing the Media Capture and Streams API available in modern browsers.

An HTML video element displays the camera feed, and below it is a single button for taking a picture. When the user clicks the button, the current frame of the video is drawn onto a canvas element. The canvas content is then exported as a JPEG image and sent to the .NET service.

On the .NET service, the image is stored in a temporary location that is accessible to the Azure OpenAI services. The service uses Semantic Kernel’s ChatCompletionService with a specific system prompt and sends the prompt along with the image URL to the GPT-4o model. The model generates a description of the image and returns it to the .NET service. The service then uses Semantic Kernel’s TextToAudioService to send the description to the TTS model, which converts it into an audio response. After receiving the audio data, the service returns the audio file and the textual description to the web app.
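The sketch below shows what these two Semantic Kernel calls could look like. The system prompt, the voice, and the exact method shapes are assumptions based on a recent Semantic Kernel release, not the project’s actual implementation.

// Sketch of the two Semantic Kernel calls described above (assumed API shapes; placeholder prompt and voice).
#pragma warning disable SKEXP0001, SKEXP0010

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using Microsoft.SemanticKernel.TextToAudio;

async Task<(string Description, byte[] Audio)> DescribeImageAsync(Kernel kernel, Uri imageUrl)
{
    // 1. Ask GPT-4o for a short description of the image.
    var chat = kernel.GetRequiredService<IChatCompletionService>();
    var history = new ChatHistory(
        "You describe images for visually impaired users. Keep it short, concrete, and helpful."); // placeholder system prompt
    history.AddUserMessage(new ChatMessageContentItemCollection
    {
        new TextContent("Describe this image."),
        new ImageContent(imageUrl)
    });
    ChatMessageContent reply = await chat.GetChatMessageContentAsync(history);
    string description = reply.Content ?? string.Empty;

    // 2. Convert the description to speech with the experimental TTS service.
    var tts = kernel.GetRequiredService<ITextToAudioService>();
    AudioContent audio = await tts.GetAudioContentAsync(
        description,
        new OpenAITextToAudioExecutionSettings { Voice = "alloy" }); // placeholder voice

    return (description, audio.Data?.ToArray() ?? Array.Empty<byte>());
}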

On receiving the audio file, the data is transformed into a blob and played back using an HTML audio element, while the description is shown on the screen.

DEMO

Loading times have been shortened for the purpose of this video. (Original duration was around 5-6 seconds per interaction)

[Demo video]

Taking the Idea Further

Vision Buddy has the potential to make an impact on the lives of visually impaired individuals. By providing them with near real-time, descriptive feedback about their surroundings, the app enhances their ability to navigate the world independently. Whether it’s identifying objects in a room, understanding street signs, or describing landscapes and settings, this proof of concept opens up new possibilities for interaction and engagement.

While the initial focus is on assisting visually impaired users, the potential applications for Vision Buddy could extend beyond this group. For example:

  • Educational Applications: Vision Buddy can be adapted for use in schools, helping students with learning disabilities understand visual content through audio descriptions.
  • Language Learning: The app can be used by language learners, migrants, or refugees to help them associate words with images, enhancing their vocabulary acquisition.
  • Assistance for Cognitive Impairments: Vision Buddy could assist individuals with cognitive impairments by providing simple, easy-to-understand descriptions of their environment.

Future Developments and Challenges

While Vision Buddy is a promising proof of concept, there is always room for improvement. Future developments could include:

  • Enhanced Performance: Optimizing the app’s performance to reduce latency and improve response times.
    • This could be achieved by leveraging caching mechanisms, background processing, and making the application flow more asynchronous.
  • Multi-language Support: Expanding the app’s capabilities to support multiple languages, making it accessible to a wider audience.
    • This is entirely dependent on the availability of the TTS model in other languages.
  • User Customization: Allowing users to customize the type, details, playback speed or length of information they receive based on their specific needs or preferences.
  • File Upload: Allowing users to upload images to the application and receive audio descriptions.

Conclusion

For me, Vision Buddy is more than just a proof of concept; it’s a glimpse into the future of AI-assisted accessibility. By harnessing the power of AI and the ease of integrating it through .NET Semantic Kernel, this web app demonstrates how technology can be leveraged to create tools that empower individuals, particularly those with disabilities. As we continue to explore the possibilities of AI in accessibility, it’s clear that the potential to improve lives is immense.

The tools and frameworks available today make it easier than ever to integrate AI into applications, opening the door to a new generation of assistive technologies. By continuing to innovate and think outside the box, not always focusing on business cases, we can ensure that the benefits of technology are available to all, regardless of their physical abilities.

From the Semantic Kernel team, we’d like to thank Jonathan for his time and all of his great work. Please reach out through our Semantic Kernel GitHub Discussion Channel if you have any questions or feedback. We look forward to hearing from you! We would also love your support: if you’ve enjoyed using Semantic Kernel, give us a star on GitHub.
