{"id":57146,"date":"2025-06-17T12:30:00","date_gmt":"2025-06-17T19:30:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/dotnet\/?p=57146"},"modified":"2025-06-17T12:29:31","modified_gmt":"2025-06-17T19:29:31","slug":"multimodal-vision-intelligence-with-dotnet-maui","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/dotnet\/multimodal-vision-intelligence-with-dotnet-maui\/","title":{"rendered":"Multimodal Vision Intelligence with .NET MAUI"},"content":{"rendered":"<p>Expanding the many ways in which users can interact with our apps is one of the most exciting parts of working with modern AI models and device capabilities. With .NET MAUI, it&#8217;s easy to enhance your app from a text-based experience to one that supports <strong>voice<\/strong>, <strong>vision<\/strong>, and more.<\/p>\n<p>Previously I covered adding <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/multimodal-voice-intelligence-with-dotnet-maui\"><strong>voice<\/strong> support<\/a> to the &#8220;to do&#8221; app from <a href=\"https:\/\/www.youtube.com\/watch?v=tFOFU7LDQlA\">our Microsoft Build 2025 session<\/a>. Now I&#8217;ll review the <strong>vision<\/strong> side of multimodal intelligence. I want to let users capture or select an image and have AI extract actionable information from it to create a project and tasks in the <a href=\"https:\/\/github.com\/davidortinau\/telepathy\">Telepathic<\/a> sample app. 
This goes well beyond OCR scanning: an AI agent applies context and prompting to turn what it sees into meaningful, structured input.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2025\/06\/multimodal-vision.png\" alt=\"Screenshots showing the photo capture and processing flow in the .NET MAUI app with camera, gallery, and AI analysis screens\" \/><\/p>\n<h2>See what I see<\/h2>\n<p>From the floating action button menu on <code>MainPage<\/code>, the user selects the camera button, immediately transitioning to the <code>PhotoPage<\/code> where <code>MediaPicker<\/code> takes over. <code>MediaPicker<\/code> provides a single cross-platform API for picking media from the photo gallery and taking photos with the camera. It was recently modernized in .NET 10 Preview 4.<\/p>\n<p>The <code>PhotoPageModel<\/code> handles both photo capture and file picking, starting from the <code>PageAppearing<\/code> lifecycle event that I&#8217;ve easily tapped into using the <code>EventToCommandBehavior<\/code> <a href=\"https:\/\/learn.microsoft.com\/dotnet\/communitytoolkit\/maui\/behaviors\/event-to-command-behavior\">from the Community Toolkit for .NET MAUI<\/a>.<\/p>\n<pre><code class=\"language-xml\">&lt;ContentPage.Behaviors&gt;\n    &lt;toolkit:EventToCommandBehavior\n        EventName=\"Appearing\"\n        Command=\"{Binding PageAppearingCommand}\"\/&gt;\n&lt;\/ContentPage.Behaviors&gt;<\/code><\/pre>\n<p>The <code>PageAppearing<\/code> method is decorated with <code>[RelayCommand]<\/code>, which generates a command thanks to the <a href=\"https:\/\/learn.microsoft.com\/dotnet\/communitytoolkit\/mvvm\/generators\/relaycommand\">Community Toolkit for MVVM<\/a> (yes, toolkits are a recurring theme of adoration that you&#8217;ll hear from me). I then check the type of device being used and choose to pick or take a photo. 
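To make that wiring concrete, here is a minimal sketch of the view model shell; the class members shown are hypothetical placeholders (the real implementation lives in the Telepathic repo):

```csharp
using System.Threading.Tasks;
using CommunityToolkit.Mvvm.ComponentModel;
using CommunityToolkit.Mvvm.Input;

// Hypothetical sketch: [RelayCommand] makes the MVVM Toolkit source generator
// emit a PageAppearingCommand property, which the XAML behavior binds to.
public partial class PhotoPageModel : ObservableObject
{
    [RelayCommand]
    private async Task PageAppearing()
    {
        // Decide whether to pick or capture based on the device idiom here.
        await Task.CompletedTask; // placeholder for the MediaPicker calls
    }
}
```

Because the generated command is named after the method (`PageAppearing` → `PageAppearingCommand`), the binding in the behavior resolves without any manual `ICommand` boilerplate.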
.NET MAUI&#8217;s cross-platform APIs for <code>DeviceInfo<\/code> and <code>MediaPicker<\/code> save me a ton of time navigating through platform-specific idiosyncrasies.<\/p>\n<pre><code class=\"language-csharp\">if (DeviceInfo.Idiom == DeviceIdiom.Desktop)\n{\n    result = await MediaPicker.PickPhotoAsync(new MediaPickerOptions\n    {\n        Title = \"Select a photo\"\n    });\n}\nelse\n{\n    if (!MediaPicker.IsCaptureSupported)\n    {\n        return;\n    }\n\n    result = await MediaPicker.CapturePhotoAsync(new MediaPickerOptions\n    {\n        Title = \"Take a photo\"\n    });\n}<\/code><\/pre>\n<p>Another advantage of using the built-in <code>MediaPicker<\/code> is that users get the native photo-input experience they&#8217;re already accustomed to. When you&#8217;re implementing this, be sure to perform the <a href=\"https:\/\/learn.microsoft.com\/dotnet\/maui\/platform-integration\/device-media\/picker\">necessary platform-specific setup as documented<\/a>.<\/p>\n<h2>Processing the image<\/h2>\n<p>Once an image is received, it&#8217;s displayed on screen along with an optional <code>Editor<\/code> field to capture any additional context and instructions the user might want to provide. 
I build the prompt with <code>StringBuilder<\/code> (in other apps I like to use Scriban templates), grab an instance of <code>Microsoft.Extensions.AI<\/code>&#8217;s <a href=\"https:\/\/learn.microsoft.com\/dotnet\/api\/microsoft.extensions.ai.ichatclient\"><code>IChatClient<\/code><\/a> from a service, get the image bytes, and supply everything to the chat client using a <a href=\"https:\/\/learn.microsoft.com\/dotnet\/api\/microsoft.extensions.ai.chatmessage\"><code>ChatMessage<\/code><\/a> that packs <a href=\"https:\/\/learn.microsoft.com\/dotnet\/api\/microsoft.extensions.ai.textcontent\"><code>TextContent<\/code><\/a> and <a href=\"https:\/\/learn.microsoft.com\/dotnet\/api\/microsoft.extensions.ai.datacontent\"><code>DataContent<\/code><\/a>.<\/p>\n<pre><code class=\"language-csharp\">private async Task ExtractTasksFromImageAsync()\n{\n    \/\/ more code\n\n    var prompt = new System.Text.StringBuilder();\n    prompt.AppendLine(\"# Image Analysis Task\");\n    prompt.AppendLine(\"Analyze the image for task lists, to-do items, notes, or any content that could be organized into projects and tasks.\");\n    prompt.AppendLine();\n    prompt.AppendLine(\"## Instructions:\");\n    prompt.AppendLine(\"1. Identify any projects and tasks (to-do items) visible in the image\");\n    prompt.AppendLine(\"2. Format handwritten text, screenshots, or photos of physical notes into structured data\");\n    prompt.AppendLine(\"3. Group related tasks into projects when appropriate\");\n\n    if (!string.IsNullOrEmpty(AnalysisInstructions))\n    {\n        prompt.AppendLine($\"4. 
{AnalysisInstructions}\");\n    }\n    prompt.AppendLine();\n    prompt.AppendLine(\"If no projects\/tasks are found, return an empty projects array.\");\n\n    var client = _chatClientService.GetClient();\n    byte[] imageBytes = await File.ReadAllBytesAsync(ImagePath);\n\n    var msg = new Microsoft.Extensions.AI.ChatMessage(ChatRole.User,\n    [\n        new TextContent(prompt.ToString()),\n        new DataContent(imageBytes, mediaType: \"image\/png\")\n    ]);\n\n    var apiResponse = await client.GetResponseAsync&lt;ProjectsJson&gt;(msg);\n\n    if (apiResponse?.Result?.Projects != null)\n    {\n        Projects = apiResponse.Result.Projects.ToList();\n    }\n\n    \/\/ more code\n}<\/code><\/pre>\n<h2>Human-AI Collaboration<\/h2>\n<p>Just like with the voice experience, the photo flow doesn&#8217;t blindly assume the agent got everything right. After processing, the user is shown a proposed set of projects and tasks for review and confirmation.<\/p>\n<p>This ensures users remain in control while benefiting from AI-augmented assistance. You can learn more about designing these kinds of flows with best practices from the <a href=\"https:\/\/www.microsoft.com\/research\/project\/hax-toolkit\">HAX Toolkit<\/a>.<\/p>\n<h2>Resources<\/h2>\n<ul>\n<li><a href=\"https:\/\/github.com\/davidortinau\/telepathy\">Telepathic App Source Code<\/a><\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/dotnet\/ai\/\">Microsoft.Extensions.AI<\/a><\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/dotnet\/maui\/platform-integration\/device-media\/picker\">MediaPicker Documentation<\/a><\/li>\n<li><a href=\"https:\/\/www.microsoft.com\/research\/project\/hax-toolkit\">HAX Toolkit<\/a><\/li>\n<li><a href=\"https:\/\/aka.ms\/RAI\">Microsoft AI Principles<\/a><\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/dotnet\/ai\/\">AI for .NET Developers<\/a><\/li>\n<\/ul>\n<h2>Summary<\/h2>\n<p>We\u2019ve now extended our .NET MAUI app to see as well as hear. 
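One piece not shown above is the <code>ProjectsJson</code> type that <code>GetResponseAsync&lt;T&gt;</code> deserializes the model's structured output into. A plausible shape, inferred purely from how the result is used in the snippet (the property and class names besides <code>ProjectsJson.Projects</code> are hypothetical; the actual definitions are in the Telepathic source):

```csharp
using System.Collections.Generic;

// Hypothetical sketch of the structured-output contract: the chat client
// asks the model for JSON matching this type and deserializes the reply.
public class ProjectsJson
{
    // The snippet reads apiResponse.Result.Projects, so a Projects
    // collection is the one property we know exists.
    public List<ProjectJson> Projects { get; set; } = new();
}

public class ProjectJson
{
    public string Name { get; set; } = string.Empty;
    public List<TaskJson> Tasks { get; set; } = new();
}

public class TaskJson
{
    public string Title { get; set; } = string.Empty;
}
```

Returning an empty <code>Projects</code> list (as the prompt instructs when nothing is found) keeps the null check in the calling code simple.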
With just a few lines of code and a clear UX pattern, the app can take in images, analyze them using vision-capable AI models, and return structured, actionable data like tasks and projects.<\/p>\n<p>Multimodal experiences are more accessible and powerful than ever. With cross-platform support from .NET MAUI and the modularity of <code>Microsoft.Extensions.AI<\/code>, you can rapidly evolve your apps to meet your users where they are, whether that\u2019s typing, speaking, or snapping a photo.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Enhance your .NET MAUI app with photo-based AI by capturing images and extracting structured information using Microsoft.Extensions.AI.<\/p>\n","protected":false},"author":553,"featured_media":57147,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[685,7233,7781],"tags":[7238,8047,8049,7793],"class_list":["post-57146","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-dotnet","category-maui","category-ai","tag-net-maui","tag-ai-foundry","tag-computer-vision","tag-copilot"],"acf":[],"blog_post_summary":"<p>Enhance your .NET MAUI app with photo-based AI by capturing images and extracting structured information using 
Microsoft.Extensions.AI.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/57146","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/users\/553"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/comments?post=57146"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/57146\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media\/57147"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media?parent=57146"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/categories?post=57146"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/tags?post=57146"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}