{"id":2133,"date":"2024-03-21T10:27:39","date_gmt":"2024-03-21T17:27:39","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/semantic-kernel\/?p=2133"},"modified":"2024-03-27T08:32:31","modified_gmt":"2024-03-27T15:32:31","slug":"image-to-text-with-semantic-kernel-and-huggingface","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/agent-framework\/image-to-text-with-semantic-kernel-and-huggingface\/","title":{"rendered":"Image to Text with Semantic Kernel and HuggingFace"},"content":{"rendered":"<p>We are thrilled to introduce a new feature in Semantic Kernel: an Image to Text modality service abstraction, with a new HuggingFace Service implementation that uses it.<\/p>\n<p><strong>A Glimpse into the Demonstration<\/strong><\/p>\n<p>In the video below, we\u2019ll walk through a demonstration of a simple Windows Forms application that showcases the ImageToText feature in Semantic Kernel, introduced together with the latest update to our Hugging Face connector.<\/p>\n<p>The sample opens by asking for a folder path in your environment that contains image files. Once you provide it, the images in that folder are displayed in the initial window.<\/p>\n<p>The application is interactive: you can click on any image. When you do, the application uses Semantic Kernel&#8217;s HuggingFace ImageToText Service to fetch a description of the clicked image.<\/p>\n<p>A critical aspect of the implementation is how the application captures the binary content of the image, sends a request to the ImageToText Service, awaits the descriptive text, and updates the UI. 
This process is a key highlight, showcasing how simple and easy it is to integrate powerful capabilities into your application with Semantic Kernel.<\/p>\n<p><div style=\"width: 640px;\" class=\"wp-video\"><video class=\"wp-video-shortcode\" id=\"video-2133-1\" width=\"640\" height=\"360\" preload=\"metadata\" controls=\"controls\"><source type=\"video\/mp4\" src=\"https:\/\/learn.microsoft.com\/video\/media\/173a2358-379c-4e0a-a44d-6d7f66ca22fe\/ImageToText%20Demo_1710195597346_1920x1080_AACAudio_1161.mp4?_=1\" \/><a href=\"https:\/\/learn.microsoft.com\/video\/media\/173a2358-379c-4e0a-a44d-6d7f66ca22fe\/ImageToText%20Demo_1710195597346_1920x1080_AACAudio_1161.mp4\">https:\/\/learn.microsoft.com\/video\/media\/173a2358-379c-4e0a-a44d-6d7f66ca22fe\/ImageToText%20Demo_1710195597346_1920x1080_AACAudio_1161.mp4<\/a><\/video><\/div><\/p>\n<p><a href=\"https:\/\/learn.microsoft.com\/video\/media\/173a2358-379c-4e0a-a44d-6d7f66ca22fe\/ImageToText%20Demo_1710195597346_1920x1080_AACAudio_1161.mp4\">Click here to watch the ImageToText demo video<\/a><\/p>\n<p>When building your own app using HuggingFace ImageToText, you will need the following packages:<\/p>\n<ul>\n<li>Microsoft.SemanticKernel<\/li>\n<li>Microsoft.SemanticKernel.Connectors.HuggingFace<\/li>\n<\/ul>\n<p>Here&#8217;s a glimpse of the C# code snippet required to kickstart the integration:<\/p>\n<pre class=\"prettyprint language-cs language-csharp\"><code class=\"language-cs language-csharp\">\/\/ Initializes the Kernel\r\nvar kernel = Kernel\r\n    .CreateBuilder()\r\n    .AddHuggingFaceImageToText(\"Salesforce\/blip-image-captioning-base\")\r\n    .Build();\r\n\r\n\/\/ Gets the ImageToText Service\r\nvar service = 
kernel.GetRequiredService&lt;IImageToTextService&gt;();\r\n\r\n\/\/ Get the binary content of a JPEG image:\r\nvar imageBinary = File.ReadAllBytes(\"path\/to\/file.jpg\");\r\n\r\n\/\/ Prepare the image to be sent to the LLM\r\nvar imageContent = new ImageContent(imageBinary) { MimeType = \"image\/jpeg\" };\r\n\r\n\/\/ Retrieves the image description\r\nvar textContent = await service.GetTextContentAsync(imageContent);<\/code><\/pre>\n<p><strong>Under the Hood: Seamless Integration<\/strong><\/p>\n<p>Central to this implementation is the integration between the application and the HuggingFace Image to Text Service. The application captures the binary content of the selected image, dispatches a request to the service, and awaits the descriptive text in return. This process exemplifies the power and fluidity of our latest enhancement, showcasing its potential to transform how we interact with visual content.<\/p>\n<p><strong>Getting Started<\/strong><\/p>\n<p>Here is the <a href=\"https:\/\/github.com\/microsoft\/semantic-kernel\/tree\/main\/dotnet\/samples\/HuggingFaceImageTextExample\">location<\/a> of the full code example we&#8217;ll walk through below.<\/p>\n<p>To use the Image to Text feature in your own applications, make sure you have the necessary packages installed:<\/p>\n<ul>\n<li>Microsoft.SemanticKernel<\/li>\n<li>Microsoft.SemanticKernel.Connectors.HuggingFace<\/li>\n<\/ul>\n<p>The demonstration uses a simple Windows Forms application with Semantic Kernel and the Hugging Face connector to get descriptions of the images in a local folder provided by the user.<\/p>\n<p>Steps to use the demo:<\/p>\n<ol>\n<li>Clone the Semantic Kernel repository<\/li>\n<li>Open your favorite IDE, for example:<\/li>\n<\/ol>\n<p>VSCode:<\/p>\n<ol>\n<li>Open the root repository folder<\/li>\n<\/ol>\n<p><a 
href=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText1.png\"><img decoding=\"async\" class=\"alignnone wp-image-2199 size-full\" src=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText1.png\" alt=\"Image ImagetoText1\" width=\"312\" height=\"445\" srcset=\"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText1.png 312w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText1-210x300.png 210w\" sizes=\"(max-width: 312px) 100vw, 312px\" \/><\/a><\/p>\n<p>2. Go into Run and Debug (Control + Shift + D) and Select <strong>HuggingFaceImageTextSample<\/strong> to start.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText2.png\"><img decoding=\"async\" class=\"alignnone wp-image-2200 size-full\" src=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText2.png\" alt=\"Image ImagetoText2\" width=\"464\" height=\"304\" srcset=\"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText2.png 464w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText2-300x197.png 300w\" sizes=\"(max-width: 464px) 100vw, 464px\" \/><\/a><\/p>\n<p>Visual Studio:<\/p>\n<ol>\n<li>Open <strong>SK-dotnet.sln <\/strong>solution file inside &lt;repository root folder&gt;\/dotnet. 
This opens the solution in Visual Studio.<\/li>\n<\/ol>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText3.png\"><img decoding=\"async\" class=\"alignnone wp-image-2201 size-full\" src=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText3.png\" alt=\"Image ImagetoText3\" width=\"276\" height=\"154\" \/><\/a><\/p>\n<p>2. The Hugging Face image sample is in the samples folder among the solution folders.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText4.png\"><img decoding=\"async\" class=\"alignnone wp-image-2202 size-full\" src=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText4.png\" alt=\"Image ImagetoText4\" width=\"462\" height=\"363\" srcset=\"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText4.png 462w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText4-300x236.png 300w\" sizes=\"(max-width: 462px) 100vw, 462px\" \/><\/a><\/p>\n<p>3. 
On the Debug menu bar, select the <strong>HuggingFaceImageTextExample <\/strong>project as the startup project and run it.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText5.png\"><img decoding=\"async\" class=\"alignnone wp-image-2203 size-full\" src=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText5.png\" alt=\"Image ImagetoText5\" width=\"576\" height=\"235\" srcset=\"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText5.png 576w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText5-300x122.png 300w\" sizes=\"(max-width: 576px) 100vw, 576px\" \/><\/a><\/p>\n<p>Using the Sample:<\/p>\n<ol>\n<li>When the application launches, a folder selection prompt asks for a folder containing the images to use with the sample.<\/li>\n<\/ol>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText6.png\"><img decoding=\"async\" class=\"alignnone wp-image-2204 size-full\" src=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText6.png\" alt=\"Image ImagetoText6\" width=\"527\" height=\"213\" srcset=\"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText6.png 527w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText6-300x121.png 300w\" sizes=\"(max-width: 527px) 100vw, 527px\" \/><\/a><\/p>\n<p>2. 
After you select the folder, the application lists every image of a supported type (jpg, gif, png) in that folder.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText7.png\"><img decoding=\"async\" class=\"alignnone wp-image-2205 size-full\" src=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText7.png\" alt=\"Image ImagetoText7\" width=\"557\" height=\"315\" srcset=\"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText7.png 557w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText7-300x170.png 300w\" sizes=\"(max-width: 557px) 100vw, 557px\" \/><\/a><\/p>\n<p>3. After you click an image, an asynchronous request is sent to the HuggingFace <strong>Salesforce\/blip-image-captioning-base<\/strong> ImageToText model, which generates a description of the image. This may take a few seconds.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText8.png\"><img decoding=\"async\" class=\"alignnone wp-image-2206 size-full\" src=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText8.png\" alt=\"Image ImagetoText8\" width=\"480\" height=\"174\" srcset=\"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText8.png 480w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText8-300x109.png 300w\" sizes=\"(max-width: 480px) 100vw, 480px\" \/><\/a><\/p>\n<p>4. 
Because the HuggingFace inference API provides a common interface for model generation, you can try different ImageToText models by changing the target model in the HuggingFaceImageToText Service initialization.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText9.png\"><img decoding=\"async\" class=\"alignnone wp-image-2207 size-full\" src=\"https:\/\/devblogs.microsoft.com\/semantic-kernel\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText9.png\" alt=\"Image ImagetoText9\" width=\"552\" height=\"131\" srcset=\"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText9.png 552w, https:\/\/devblogs.microsoft.com\/agent-framework\/wp-content\/uploads\/sites\/78\/2024\/03\/ImagetoText9-300x71.png 300w\" sizes=\"(max-width: 552px) 100vw, 552px\" \/><\/a><\/p>\n<p><strong>Dive Deeper<\/strong><\/p>\n<p>Please reach out if you have any questions or feedback through our <a href=\"https:\/\/github.com\/microsoft\/semantic-kernel\/discussions\/categories\/general\">Semantic Kernel GitHub Discussion Channel<\/a>. We look forward to hearing from you! We would also love your support: if you&#8217;ve enjoyed using Semantic Kernel, give us a star on <a href=\"https:\/\/github.com\/microsoft\/semantic-kernel\">GitHub<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We are thrilled to introduce a new feature within Semantic Kernel that promises to improve AI capabilities: Image to Text modality service abstraction, with a new HuggingFace Service implementation using this capability. 
A Glimpse into the Demonstration In the video below, we\u2019ll walk through a compelling demonstration of a simple Windows Forms application, showcasing the [&hellip;]<\/p>\n","protected":false},"author":149071,"featured_media":2365,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[35],"class_list":["post-2133","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-semantic-kernel","tag-semantic-kernel-huggingface-imagetotext"],"acf":[],"blog_post_summary":"<p>We are thrilled to introduce a new feature within Semantic Kernel that promises to improve AI capabilities: Image to Text modality service abstraction, with a new HuggingFace Service implementation using this capability. A Glimpse into the Demonstration In the video below, we\u2019ll walk through a compelling demonstration of a simple Windows Forms application, showcasing the [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/posts\/2133","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/users\/149071"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/comments?post=2133"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/posts\/2133\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/media\/2365"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/media?parent=2133"}],"wp:term":[
{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/categories?post=2133"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/agent-framework\/wp-json\/wp\/v2\/tags?post=2133"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}