{"id":3322,"date":"2023-07-06T12:06:05","date_gmt":"2023-07-06T19:06:05","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/surface-duo\/?p=3322"},"modified":"2024-01-03T16:25:01","modified_gmt":"2024-01-04T00:25:01","slug":"multimodal-augmented-inputs-azure-cognitive-services","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/surface-duo\/multimodal-augmented-inputs-azure-cognitive-services\/","title":{"rendered":"Multimodal Augmented Inputs in LLMs using Azure Cognitive Services"},"content":{"rendered":"<p>\n  Hello AI enthusiasts,\n<\/p>\n<p>\n  This week, we\u2019ll be talking about how you can use Azure Cognitive Services to enhance the types of inputs your Android AI scenarios can support.\n<\/p>\n<h2>What makes an LLM multimodal?<\/h2>\n<p>\n  Popular LLMs like ChatGPT are trained on vast amounts of text from the internet. They accept text as input and provide text as output. \n<\/p>\n<p>\n  Extending that logic a bit further, multimodal models like GPT4 are trained on various datasets containing different types of data, like text and images. 
As a result, the model can accept multiple data types as input.\n<\/p>\n<p>\n  In a paper titled <a href=\"https:\/\/arxiv.org\/pdf\/2302.14045.pdf\"><em>Language Is Not All You Need: Aligning Perception with Language Models<\/em><\/a>, researchers listed the datasets that they used to train their multimodal LLM, KOSMOS-1, and shared some outputs where the model can recognize images and answer questions about them (Figure 1).\n<\/p>\n<p>\n  <img decoding=\"async\" width=\"526\" height=\"328\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/word-image-3322-1.png\" class=\"wp-image-3323\" srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/word-image-3322-1.png 526w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/word-image-3322-1-300x187.png 300w\" sizes=\"(max-width: 526px) 100vw, 526px\" \/><br\/><em>Figure 1 \u2013 KOSMOS-1 multimodal LLM responding to image and text prompts (Huang et al., Language is not all you need: Aligning perception with language models 2023)<\/em>\n<\/p>\n<p>\n  By that logic, if we can figure out a way to pass image data to ChatGPT (Figure 2 &amp; Figure 3), does that mean we\u2019ve made ChatGPT multimodal? Well, not really. 
But it\u2019s better than nothing.\n<\/p>\n<p>\n  <a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/figure-2-3-600.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/figure-2-3-600.png\" alt=\"Two Android phones running an AI chat app showing an image from a whiteboard and a description of it\" width=\"583\" height=\"599\" class=\"alignnone size-full wp-image-3334\" srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/figure-2-3-600.png 583w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/figure-2-3-600-292x300.png 292w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/figure-2-3-600-24x24.png 24w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/figure-2-3-600-48x48.png 48w\" sizes=\"(max-width: 583px) 100vw, 583px\" \/><\/a><br\/><em>Figure 2 &amp; 3 \u2013 Chatbot running ChatGPT responding to image and text inputs.<\/em>\n<\/p>\n<p>\n  In this blog we\u2019ll try to achieve the multimodal nature of KOSMOS-1 in the text-based model GPT 3.5.\n<\/p>\n<h2>How does this apply to Android?<\/h2>\n<p>\n  Mobile users interact with their devices in a variety of ways. It\u2019s a language of taps, swipes, pictures, recordings, and quick messages. 
The language of mobile is complex, and the LLMs that support mobile experiences should embrace as much of it as they can.\n<\/p>\n<h2>Analyzing complex data types<\/h2>\n<p>\n  Since the LLM that we want to experiment with (GPT 3.5) has no concept of audio, video, images, or other complex data types, we need to process the data in some way to convert it into a text format before adding it to a prompt and passing it to the LLM.\n<\/p>\n<p><a href=\"https:\/\/azure.microsoft.com\/products\/cognitive-services\/#features\">Azure Cognitive Services<\/a> (ACS) is a set of APIs that developers can leverage to perform different AI tasks. These APIs cover a broad range of AI scenarios, including:\n<\/p>\n<ul>\n<li>\n    Speech\n  <\/li>\n<li>\n    Language\n  <\/li>\n<li>\n    Vision\n  <\/li>\n<li>\n    Decision\n  <\/li>\n<\/ul>\n<p>\n  Input types like audio and video can be passed through Azure Cognitive Services, analyzed, and added as context to the LLM prompt (Figure 4).\n<\/p>\n<p>\n  <img decoding=\"async\" width=\"1704\" height=\"1084\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/data-is-passed-through-azure-cognitive-services-a.png\" class=\"wp-image-3325\" alt=\"Data is passed through Azure Cognitive Services, analyzed, and added as context to the LLM prompt.\" srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/data-is-passed-through-azure-cognitive-services-a.png 1704w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/data-is-passed-through-azure-cognitive-services-a-300x191.png 300w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/data-is-passed-through-azure-cognitive-services-a-1024x651.png 1024w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/data-is-passed-through-azure-cognitive-services-a-768x489.png 768w, 
https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/data-is-passed-through-azure-cognitive-services-a-1536x977.png 1536w\" sizes=\"(max-width: 1704px) 100vw, 1704px\" \/><br\/><em>Figure 4 \u2013 Data flow diagram showing data being passed to ACS before being sent to ChatGPT.<\/em>\n<\/p>\n<p>\n  Pictures and the camera are an integral part of the mobile experience, so we\u2019ll focus on passing images to our LLM.\n<\/p>\n<h2>Figuring out which services to use<\/h2>\n<p>\n  Like the Graph Explorer for Microsoft Graph, Azure Cognitive Services has an explorer for vision endpoints called \u201cVision Studio\u201d &#8211; <a href=\"https:\/\/portal.vision.cognitive.azure.com\/\">https:\/\/portal.vision.cognitive.azure.com\/<\/a> \n<\/p>\n<p>\n  When you first sign in to Vision Studio, you will be prompted to choose a resource (Figure 5). This resource will be used to cover any costs accrued while using the Vision APIs.\n<\/p>\n<p>\n  Feel free to select \u201cDo this later\u201d; Vision Studio is a great way to explore which resources you would need to create for any given feature.\n<\/p>\n<p>\n  <img decoding=\"async\" width=\"622\" height=\"398\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/word-image-3322-5.png\" class=\"wp-image-3326\" srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/word-image-3322-5.png 622w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/word-image-3322-5-300x192.png 300w\" sizes=\"(max-width: 622px) 100vw, 622px\" \/><br\/><em>Figure 5 \u2013 Vision Studio dialogue to help users select an Azure Cognitive Resource to test with.<\/em>\n<\/p>\n<p>\n  Since we\u2019re trying to provide as much information about images as we can to our LLM, the \u201cImage Analysis\u201d section is a good starting point (Figure 6).\n<\/p>\n<p>\n  <img decoding=\"async\" width=\"1291\" 
height=\"969\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/word-image-3322-6.png\" class=\"wp-image-3327\" srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/word-image-3322-6.png 1291w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/word-image-3322-6-300x225.png 300w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/word-image-3322-6-1024x769.png 1024w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/word-image-3322-6-768x576.png 768w\" sizes=\"(max-width: 1291px) 100vw, 1291px\" \/><br\/><em>Figure 6 \u2013 Vision Studio\u2019s image analysis catalogue.<\/em>\n<\/p>\n<h2>Setting up a Computer Vision resource<\/h2>\n<p>\n  Once we know what ACS features we\u2019d like to use, we can set up an Azure resource. This will work similarly to the API key needed for using OpenAI endpoints.\n<\/p>\n<p>\n  You can either follow the Vision Studio dialog boxes from Figure 5 to set up a resource, or follow this Microsoft Learn tutorial to <a href=\"https:\/\/learn.microsoft.com\/azure\/cognitive-services\/cognitive-services-apis-create-account?tabs=multiservice%2Canomaly-detector%2Clanguage-service%2Ccomputer-vision%2Clinux\">Create a Cognitive Services Resource with Azure Portal<\/a>.\n<\/p>\n<p>\n  In our case, we want to use the <a href=\"https:\/\/learn.microsoft.com\/azure\/cognitive-services\/computer-vision\/concept-describe-images-40?source=recommendations&amp;tabs=image\">Captions endpoint<\/a> to help our LLM understand what the image represents. 
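As a preview of how a caption gets used: once the Captions endpoint returns a short description of the image, we can fold that description into the text prompt sent to GPT 3.5. A minimal sketch of that augmentation step (the <code>buildAugmentedPrompt<\/code> helper and its template wording are our own illustration, not part of any SDK):

```kotlin
// Hypothetical helper (not part of any SDK): fold an ACS image caption
// into the text prompt sent to a text-only model like GPT 3.5.
fun buildAugmentedPrompt(userMessage: String, imageCaption: String?): String {
    // With no caption, the user's message passes through unchanged.
    if (imageCaption.isNullOrBlank()) return userMessage
    // Otherwise, describe the attached image in plain text so the
    // text-only LLM still receives the visual context.
    return "The user attached an image described as: \"$imageCaption\".\n" +
            "With that image in mind, respond to: $userMessage"
}
```

The exact template wording matters less than making the caption explicit \u2013 the model just needs the image description spelled out as plain text. 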
After testing the endpoint in Vision Studio, we know there are only a couple of valid Azure regions our resource can be assigned to (Figure 7).\n<\/p>\n<p>\n  <img decoding=\"async\" width=\"878\" height=\"68\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/word-image-3322-7.png\" class=\"wp-image-3328\" srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/word-image-3322-7.png 878w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/word-image-3322-7-300x23.png 300w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/word-image-3322-7-768x59.png 768w\" sizes=\"(max-width: 878px) 100vw, 878px\" \/><br\/><em>Figure 7 \u2013 Image Analysis endpoint caveats.<\/em>\n<\/p>\n<h2>Analyzing an image<\/h2>\n<p>\n  Now that we have a valid Azure resource and know that the Captions endpoint is the one for us, the next step is calling it in our app!\n<\/p>\n<p>\n  The Image Analysis APIs that we\u2019ve been testing in Vision Studio can be found here: <a href=\"https:\/\/learn.microsoft.com\/azure\/cognitive-services\/computer-vision\/how-to\/call-analyze-image-40?tabs=rest\">https:\/\/learn.microsoft.com\/azure\/cognitive-services\/computer-vision\/how-to\/call-analyze-image-40?tabs=rest<\/a>.\n<\/p>\n<p>\n  Since there isn\u2019t an Android Client SDK to make these calls, we\u2019ll use the REST API to pass in our image, getting a JSON response from the Captions endpoint.\n<\/p>\n<pre>  val request = Request.Builder()\r\n      .url(\r\n          \"${Constants.AZURE_ENDPOINT_WEST_US}computervision\/imageanalysis:analyze?api-version=2023-02-01-preview&amp;features=caption\"\r\n      )\r\n      .addHeader(\"Ocp-Apim-Subscription-Key\", Constants.AZURE_SUBSCRIPTION_KEY)\r\n      .addHeader(\"Content-Type\", \"application\/octet-stream\")\r\n      .post(encodeToRequestBody(image))\r\n      .build()\r\n\r\n  val 
client = OkHttpClient.Builder()\r\n      .build()\r\n\r\n  val response = client.newCall(request).execute()<\/pre>\n<p>\n  The <code>encodeToRequestBody()<\/code> function compresses our in-app Bitmap image into a PNG-formatted <code>RequestBody<\/code>.\n<\/p>\n<pre>  private fun encodeToRequestBody(image: Bitmap): RequestBody {\r\n      ByteArrayOutputStream().use { baos -&gt;\r\n          image.compress(Bitmap.CompressFormat.PNG, 100, baos)\r\n          return baos.toByteArray().toRequestBody(\"image\/png\".toMediaType())\r\n      }\r\n  }<\/pre>\n<p>\n  Given the image in Figure 8&#8230;\n<\/p>\n<p>\n  <img decoding=\"async\" width=\"546\" height=\"726\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/word-image-3322-8.jpeg\" class=\"wp-image-3329\" srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/word-image-3322-8.jpeg 546w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/word-image-3322-8-226x300.jpeg 226w\" sizes=\"(max-width: 546px) 100vw, 546px\" \/><br\/><em>Figure 8 \u2013 Whiteboard drawing of mountains.<\/em>\n<\/p>\n<p>\n  \u2026 we get the following result from the Captions ACS endpoint!\n<\/p>\n<pre>  {\"captionResult\":{\"text\":\"a drawing of mountains and trees on a whiteboard\"}}<\/pre>\n<p>\n  This result can be passed to the LLM as the final prompt (Figure 9) or added to a prompt template, augmenting the prompt and helping add context to a different request.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/figure-9-600.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/figure-9-600.png\" alt=\"Two Android phones running an AI chat app showing an image from a whiteboard and a description 
of it\" width=\"278\" height=\"600\" class=\"alignnone size-full wp-image-3335\" srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/figure-9-600.png 278w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/07\/figure-9-600-139x300.png 139w\" sizes=\"(max-width: 278px) 100vw, 278px\" \/><\/a><br\/><em>Figure 9 \u2013 Final result of requesting an image caption from ACS, passing it to ChatGPT, and getting a response.<\/em>\n<\/p>\n<p>\n  And now our chatbot can accept images as input even though our LLM is text-based! Notice how the results are similar to Figure 1 \u2013 we have the Generative AI component of LLMs, with an added layer of AI preprocessing from ACS that traditional multimodal models don\u2019t have.\n<\/p>\n<h2>Tradeoffs with Actual Multimodal Models<\/h2>\n<p>\n  In this blog we\u2019ve been ignoring one big question \u2013 <em>why would anyone want to do this instead of using a multimodal LLM?<\/em>\n<\/p>\n<p>\n  As with many other LLM evaluations, the decision comes down to a few major factors.\n<\/p>\n<ol>\n<li>\n  Cost \u2013 Multimodal models like GPT4 are substantially more expensive than their text-based counterparts. Using other AI services like Azure Cognitive Services can help offset the cost of supporting image inputs.\n<\/li>\n<li>\n  Performance \u2013 Any amount of AI preprocessing on images is going to lose some information. We\u2019re essentially projecting a 3-dimensional shape onto a 2-D plane. But for many use cases, a simplified version of multimodal behavior is fine.\n<\/li>\n<li>\n  Availability \u2013 This factor is arguably the most likely to change in the future. Not everyone currently has access to the best, most cutting-edge models. 
For those who only have access to mid-tier models, this can be a good way to close the gap.\n<\/li>\n<\/ol>\n<h2>Resources and feedback<\/h2>\n<p>\n  Here\u2019s a summary of the links shared in this post:\n<\/p>\n<ul>\n<li><a href=\"https:\/\/arxiv.org\/pdf\/2302.14045.pdf\"><em>Language Is Not All You Need: Aligning Perception with Language Models<\/em><\/a>\n  <\/li>\n<li><a href=\"https:\/\/azure.microsoft.com\/products\/cognitive-services\/#features\">Azure Cognitive Services<\/a>\n  <\/li>\n<li><a href=\"https:\/\/portal.vision.cognitive.azure.com\/\">Vision Studio<\/a>\n  <\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/azure\/cognitive-services\/computer-vision\/how-to\/call-analyze-image-40?tabs=rest\">Image Analysis API<\/a>\n  <\/li>\n<\/ul>\n<p>\n  If you have any questions, use the <a href=\"http:\/\/aka.ms\/SurfaceDuoSDK-Feedback\">feedback forum<\/a> or message us on <a href=\"https:\/\/twitter.com\/surfaceduodev\">Twitter @surfaceduodev<\/a>.\n<\/p>\n<p>\n  There won\u2019t be a livestream this week, but you can check out the <a href=\"https:\/\/youtube.com\/c\/surfaceduodev\">archives on YouTube<\/a>.\n<\/p>\n<h2>Citations<\/h2>\n<p>\n  Huang, S., Dong, L., Wang, W., Hao, Y., Singhal, S., Ma, S., Lv, T., Cui, L., Mohammed, O. K., Patra, B., Liu, Q., Aggarwal, K., Chi, Z., Bjorck, J., Chaudhary, V., Som, S., Song, X., &amp; Wei, F. (2023, March 1). Language is not all you need: Aligning perception with language models. <a href=\"https:\/\/arxiv.org\/pdf\/2302.14045.pdf\">https:\/\/arxiv.org\/pdf\/2302.14045.pdf<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hello AI enthusiasts, This week, we\u2019ll be talking about how you can use Azure Cognitive Services to enhance the types of inputs your Android AI scenarios can support. What makes an LLM multimodal? Popular LLMs like ChatGPT are trained on vast amounts of text from the internet. 
They accept text as input and provide text [&hellip;]<\/p>\n","protected":false},"author":90683,"featured_media":3323,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[741],"tags":[739,734,733],"class_list":["post-3322","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","tag-azure","tag-chatgpt","tag-openai"],"acf":[],"blog_post_summary":"<p>Hello AI enthusiasts, This week, we\u2019ll be talking about how you can use Azure Cognitive Services to enhance the types of inputs your Android AI scenarios can support. What makes an LLM multimodal? Popular LLMs like ChatGPT are trained on vast amounts of text from the internet. They accept text as input and provide text [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/posts\/3322","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/users\/90683"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/comments?post=3322"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/posts\/3322\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/media\/3323"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/media?parent=3322"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/categories?post=3322"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microso
ft.com\/surface-duo\/wp-json\/wp\/v2\/tags?post=3322"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}