June 11th, 2025

Multimodal Voice Intelligence with .NET MAUI

David Ortinau
Principal Product Manager

One of the most interesting ways to enhance your existing applications with AI is to enable more ways for your users to interact with them. Today you probably handle text input, and perhaps some touch gestures for your power users. Now it’s easier than ever to expand that to voice and vision, especially when your users’ primary input is a mobile device.

At Microsoft Build 2025 I demonstrated expanding the .NET MAUI sample “to do” app from text input to supporting voice and vision when those capabilities are detected. Let me show you how .NET MAUI and our fantastic ecosystem of plugins make this rather painless to do with a single implementation that works across all platforms, starting with voice.

Screenshots showing the voice input flow in the .NET MAUI app with recording, transcription, and task extraction screens

Talk to me

Being able to talk to an app isn’t anything revolutionary. We’ve all spoken to Siri, Alexa, and our dear Cortana a time or two, and the key is in knowing the keywords and recipes of things they can comprehend and act on. “Start a timer”, “turn down the volume”, “tell me a joke”, and everyone’s favorite “I wasn’t talking to you”.

The new and powerful capability we now have with large language models is that they can take our unstructured ramblings and make sense of them, fitting what we say into the structured format our apps expect and require.

Listening to audio

The first thing to do is add the Plugin.Maui.Audio NuGet package, which helps us request permission to use the microphone and start capturing an audio stream. The plugin is also capable of playback.

dotnet add package Plugin.Maui.Audio --version 4.0.0

In MauiProgram.cs configure the recording settings and add the IAudioService from the plugin to the services container.

public static class MauiProgram
{
    public static MauiApp CreateMauiApp()
    {
        var builder = MauiApp.CreateBuilder();
        builder
            .UseMauiApp<App>()
            .AddAudio(recordingOptions =>
            {
#if IOS || MACCATALYST
                recordingOptions.Category = AVFoundation.AVAudioSessionCategory.Record;
                recordingOptions.Mode = AVFoundation.AVAudioSessionMode.Default;
                recordingOptions.CategoryOptions = AVFoundation.AVAudioSessionCategoryOptions.MixWithOthers;
#endif
            });

        builder.Services.AddSingleton<IAudioService, AudioService>();
        // more code

        return builder.Build();
    }
}

Be sure to also review and implement any additional configuration steps in the documentation.
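On Android that means declaring the RECORD_AUDIO permission, and on iOS and Mac Catalyst a microphone usage description. As a quick reminder (the plugin docs remain the source of truth for the current steps), the entries look roughly like this:

<!-- Platforms/Android/AndroidManifest.xml -->
<uses-permission android:name="android.permission.RECORD_AUDIO" />

<!-- Platforms/iOS/Info.plist and Platforms/MacCatalyst/Info.plist -->
<key>NSMicrophoneUsageDescription</key>
<string>This app uses the microphone to record your voice notes.</string>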

Now the app is ready to capture some audio. In VoicePage the user will tap the microphone button, start speaking, and tap again to end the recording.

This is a trimmed version of the actual code for starting and stopping the recording.

[RelayCommand]
private async Task ToggleRecordingAsync()
{
    if (!IsRecording)
    {
        var status = await Permissions.CheckStatusAsync<Permissions.Microphone>();
        if (status != PermissionStatus.Granted)
        {
            status = await Permissions.RequestAsync<Permissions.Microphone>();
            if (status != PermissionStatus.Granted)
            {
                // more code
                return;
            }
        }

        _recorder = _audioManager.CreateRecorder();
        await _recorder.StartAsync();

        IsRecording = true;
        RecordButtonText = "⏹ Stop";
    }
    else
    {
        _audioSource = await _recorder.StopAsync();
        IsRecording = false;
        RecordButtonText = "🎤 Record";

        // more code

        await TranscribeAsync();
    }
}

Once it has the audio stream, it can start transcribing and processing it. (source)

private async Task TranscribeAsync()
{
    string audioFilePath = Path.Combine(FileSystem.CacheDirectory, $"recording_{DateTime.Now:yyyyMMddHHmmss}.wav");

    if (_audioSource != null)
    {
        await using (var fileStream = File.Create(audioFilePath))
        {
            var audioStream = _audioSource.GetAudioStream();
            await audioStream.CopyToAsync(fileStream);            
        }

        Transcript = await _transcriber.TranscribeAsync(audioFilePath, CancellationToken.None);

        await ExtractTasksAsync();
    }
}

In this sample app, I used Microsoft.Extensions.AI with OpenAI to perform the transcription with the whisper-1 model, which is trained specifically for this use case. There are certainly other ways to do this, including on-device recognition with SpeechToText in the .NET MAUI Community Toolkit.
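As a rough sketch of that on-device route (not what this sample does), the Community Toolkit’s ISpeechToText can be used along these lines; the exact method signatures vary by Toolkit version, so treat it as illustrative:

using System.Globalization;
using CommunityToolkit.Maui.Media;

// Illustrative only: on-device speech recognition with the .NET MAUI Community Toolkit.
async Task<string?> ListenOnDeviceAsync(ISpeechToText speechToText, CancellationToken ct)
{
    var granted = await speechToText.RequestPermissions(ct);
    if (!granted)
        return null;

    var result = await speechToText.ListenAsync(
        CultureInfo.CurrentCulture,
        new Progress<string>(partial => { /* surface partial results in the UI */ }),
        ct);

    return result.IsSuccessful ? result.Text : null;
}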

By using Microsoft.Extensions.AI, I can easily swap in another cloud-based AI service, use a local LLM with ONNX, or later choose another on-device solution.

using Microsoft.Extensions.AI;
using OpenAI;

namespace Telepathic.Services;

public class WhisperTranscriptionService : ITranscriptionService
{        
    public async Task<string> TranscribeAsync(string path, CancellationToken ct)
    {
        var openAiApiKey = Preferences.Default.Get("openai_api_key", string.Empty);
        var client = new OpenAIClient(openAiApiKey);

        try
        {
            await using var stream = File.OpenRead(path);
            var result = await client
                            .GetAudioClient("whisper-1")
                            .TranscribeAudioAsync(stream, "file.wav", cancellationToken: ct);

            return result.Value.Text.Trim();
        }
        catch (Exception ex)
        {
            // Will add better error handling in Phase 5
            throw new Exception($"Failed to transcribe audio: {ex.Message}", ex);
        }
    }
}
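For reference, the ITranscriptionService contract this class implements can be inferred from how it’s used above; a minimal version would be:

public interface ITranscriptionService
{
    Task<string> TranscribeAsync(string path, CancellationToken ct);
}

It would then be registered in MauiProgram.cs alongside the other services:

builder.Services.AddSingleton<ITranscriptionService, WhisperTranscriptionService>();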

Making sense and structure

Once I have the transcript, I can have my AI service make sense of it to return projects and tasks using the same client. This happens in the ExtractTasksAsync method referenced above. The key parts of this method are below. (source)

private async Task ExtractTasksAsync()
{
    var prompt = $@"
        Extract projects and tasks from this voice memo transcript. 
        Analyze the text to identify actionable tasks I need to keep track of. Use the following instructions:
        1. Tasks are actionable items that can be completed, such as 'Buy groceries' or 'Call Mom'.
        2. Projects are larger tasks that may contain multiple smaller tasks, such as 'Plan birthday party' or 'Organize closet'.
        3. Tasks must be grouped under a project and cannot be grouped under multiple projects.
        4. Any mentioned due dates use the YYYY-MM-DD format

        Here's the transcript: {Transcript}";

    var chatClient = _chatClientService.GetClient();
    var response = await chatClient.GetResponseAsync<ProjectsJson>(prompt);

    if (response?.Result != null)
    {
        Projects = response.Result.Projects;
    }
}
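The ProjectsJson type the response is deserialized into isn’t shown here; a hypothetical shape, inferred from the prompt and how the result is used (the actual types live in the sample’s source), could be:

// Hypothetical types inferred from the prompt; see the sample source for the real definitions.
public class ProjectsJson
{
    public List<Project> Projects { get; set; } = new();
}

public class Project
{
    public string Name { get; set; } = string.Empty;
    public List<ProjectTask> Tasks { get; set; } = new();
}

public class ProjectTask
{
    public string Title { get; set; } = string.Empty;
    public string? DueDate { get; set; } // YYYY-MM-DD, per the prompt instructions
}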

The _chatClientService is an injected service class that handles the creation and retrieval of the IChatClient instance provided by Microsoft.Extensions.AI. Here I call GetResponseAsync with a prompt and a strongly typed result, and the LLM (gpt-4o-mini in this case) returns a ProjectsJson response. The response includes a Projects list I can work with from there.
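If you want to build something similar, a minimal sketch of such a service, assuming the Microsoft.Extensions.AI.OpenAI adapter package, might look like the following. Treat the class shape and the AsIChatClient() call as assumptions; the adapter method name has changed across preview releases, and the real implementation lives in the sample’s source.

using Microsoft.Extensions.AI;
using OpenAI;

namespace Telepathic.Services;

// Hypothetical sketch of a chat client service; not the sample's actual implementation.
public class ChatClientService
{
    private IChatClient? _client;

    public IChatClient GetClient()
    {
        if (_client is null)
        {
            var apiKey = Preferences.Default.Get("openai_api_key", string.Empty);

            // AsIChatClient() comes from Microsoft.Extensions.AI.OpenAI; the exact
            // adapter method name depends on the package version you're using.
            _client = new OpenAIClient(apiKey)
                .GetChatClient("gpt-4o-mini")
                .AsIChatClient();
        }

        return _client;
    }
}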

Co-creation

Now I’ve gone from having an app that only took data entry input via a form, to an app that can also take unstructured voice input and produce structured data. While I was tempted to just insert the results into the database and claim success, there was yet more to do to make this a truly satisfying experience.

There’s a reasonable chance that the project name needs to be adjusted for clarity, or that some task was misheard or, worse yet, omitted. To address this, I add an approval step where the user can see the projects and tasks as recommendations and choose to accept them as-is or with changes. This is not much different from the experience we have now in Copilot when changes are made and we have the option to iterate further, keep, or discard.

For more guidance like this on designing great AI experiences in your apps, consider checking out the HAX Toolkit and Microsoft AI Principles.

Resources

Here are key resources mentioned in this article to help you implement multimodal AI capabilities in your .NET MAUI apps:

Summary

In this article, we explored how to enhance .NET MAUI applications with multimodal AI capabilities, focusing on voice interaction. We covered how to implement audio recording using Plugin.Maui.Audio, transcribe speech using Microsoft.Extensions.AI with OpenAI’s Whisper model, and extract structured data from unstructured voice input.

By combining these technologies, you can transform a traditional form-based app into one that accepts voice commands and intelligently processes them into actionable data. The implementation works across all platforms with a single codebase, making it accessible for any .NET MAUI developer.

With these techniques, you can significantly enhance user experience by supporting multiple interaction modes, making your applications more accessible and intuitive, especially on mobile devices where voice input can be much more convenient than typing.

Author

David Ortinau
Principal Product Manager

David is a Principal Product Manager for .NET at Microsoft, focused on .NET MAUI. A .NET developer since 2002, and versed in a range of programming languages, David has developed web, environmental, and mobile experiences for a wide variety of industries. After several successes with tech startups and running his own software company, David joined Microsoft to follow his passion: crafting tools that help developers create better app experiences. When not at a computer or with his family, David ...

