{"id":47283,"date":"2023-09-06T05:00:00","date_gmt":"2023-09-06T12:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/dotnet\/?p=47283"},"modified":"2024-12-13T14:11:45","modified_gmt":"2024-12-13T22:11:45","slug":"demystifying-retrieval-augmented-generation-with-dotnet","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/dotnet\/demystifying-retrieval-augmented-generation-with-dotnet\/","title":{"rendered":"Demystifying Retrieval Augmented Generation with .NET"},"content":{"rendered":"<p><i>This post was edited on 2\/1\/2024 to update it for Semantic Kernel 1.3.0.<\/i><\/p>\n<p>Generative AI, or using AI to create text, images, audio, or basically anything else, has taken the world by storm over the last year. Developers for all manner of applications are now exploring how these systems can be incorporated to the benefit of their users. Yet while the technology advances at a breakneck pace, new models are released every day, and new SDKs constantly pop out of the woodwork, it can be challenging for developers to figure out how to actually get started. There are a variety of polished end-to-end sample applications that .NET developers can use as a reference (<a href=\"https:\/\/github.com\/Azure-Samples\/azure-search-openai-demo-csharp\/\">example<\/a>). However, I personally do better when I can build something up incrementally, learning the minimal concepts first, and then expanding on it and making it robust and beautiful later.<\/p>\n<p>To that end, this post focuses on building a simple console-based .NET chat application from the ground up, with minimal dependencies and minimal fuss. The end goal is to be able to ask questions and get answers not only based on the data on which our model was trained, but also on additional data supplied dynamically. 
Along the way, every code sample shown is a complete application, so you can just copy-and-paste it into a <code>Program.cs<\/code> file, run it, play with it, and then copy-and-paste it into your real application, where you can refine and augment it to your heart&#8217;s content.<\/p>\n<h2>Let&#8217;s Get Chattin&#8217;<\/h2>\n<p>To begin, make sure you have .NET 8 installed, and create a simple console app (.NET 6, .NET 7, and .NET Framework will also work, just with a few tweaks to the project file):<\/p>\n<pre><code class=\"language-sh\">dotnet new console -o chatapp\r\ncd chatapp<\/code><\/pre>\n<p>This creates a new directory <code>chatapp<\/code> and populates it with two files: <code>chatapp.csproj<\/code> and <code>Program.cs<\/code>. We then need to bring in one NuGet package: <code>Microsoft.SemanticKernel<\/code>.<\/p>\n<pre><code class=\"language-sh\">dotnet add package Microsoft.SemanticKernel<\/code><\/pre>\n<p>We could choose to reference specific AI-related packages, like <a href=\"https:\/\/www.nuget.org\/packages\/Azure.AI.OpenAI\">Azure.AI.OpenAI<\/a>, but I&#8217;ve instead turned to <a href=\"https:\/\/learn.microsoft.com\/semantic-kernel\/overview\/\">Semantic Kernel<\/a> (SK) as a way to simplify various interactions and more easily swap in and out different implementations in order to more quickly experiment. SK provides a set of libraries that makes it easier to work with Large Language Models (LLMs). It provides abstractions for various AI concepts so that you can code to the abstraction and more easily substitute different implementations. It provides many concrete implementations of those abstractions, wrapping a multitude of other SDKs. It provides support for planning and orchestration, such that you can ask AI to create a plan for how to achieve a certain goal. It provides support for plug-ins. And much more. 
We&#8217;ll touch on a variety of those aspects throughout this post, but I&#8217;m primarily using SK for its abstractions.<\/p>\n<p>While for the purposes of this post I&#8217;ve tried to keep dependencies to a minimum, there&#8217;s one more I can&#8217;t avoid: you need access to an LLM. The easiest way to get access is via either <a href=\"https:\/\/platform.openai.com\/signup\">OpenAI<\/a> or <a href=\"https:\/\/learn.microsoft.com\/azure\/ai-services\/openai\/how-to\/create-resource\">Azure OpenAI<\/a>. For this post, I&#8217;m using OpenAI, but switching to use Azure OpenAI instead requires changing just one line in each of the samples. You&#8217;ll need three pieces of information for the remainder of the post:<\/p>\n<ul>\n<li>Your API key, provided to you in the portal for your service. (If you&#8217;re using Azure OpenAI instead of OpenAI, you&#8217;ll also need your endpoint, which is provided to you in your Azure portal.)<\/li>\n<li>A chat model. I&#8217;m using <code>gpt-3.5-turbo-0125<\/code>, which as of this writing has a context window of ~16K tokens. We&#8217;ll talk more about what this is later. Note that if you&#8217;re using Azure OpenAI instead of OpenAI, you won&#8217;t refer to the model by name; instead, you&#8217;ll create a deployment of that model and refer to the deployment name.<\/li>\n<li>An embedding model. I&#8217;m using <code>text-embedding-3-small<\/code>.<\/li>\n<\/ul>\n<p>With that out of the way, we can dive in. Believe it or not, we can create a simple chat app in just a few lines of code. 
Copy-and-paste this into your <code>Program.cs<\/code>:<\/p>\n<pre><code class=\"language-C#\">using Microsoft.SemanticKernel;\r\n\r\nstring apikey = Environment.GetEnvironmentVariable(\"AI:OpenAI:APIKey\")!;\r\n\r\n\/\/ Initialize the kernel\r\nKernel kernel = Kernel.CreateBuilder()\r\n    .AddOpenAIChatCompletion(\"gpt-3.5-turbo-0125\", apikey)\r\n    .Build();\r\n\r\n\/\/ Q&A loop\r\nwhile (true)\r\n{\r\n    Console.Write(\"Question: \");\r\n    Console.WriteLine(await kernel.InvokePromptAsync(Console.ReadLine()!));\r\n    Console.WriteLine();\r\n}<\/code><\/pre>\n<p>To avoid accidentally leaking my API key (which needs to be protected like you would a password) to the world in this post, I&#8217;ve stored it in an environment variable. Thus, I read in the API key via <code>GetEnvironmentVariable<\/code>. I then create a new &#8220;kernel&#8221; with the SK APIs, asking it to add into the kernel an OpenAI chat completion service. The <code>Microsoft.SemanticKernel<\/code> package we pulled in earlier includes references to client support for both OpenAI and Azure OpenAI, so we don&#8217;t need anything additional to be able to talk to these services. And with that configured, we can now run our chat app (<code>dotnet run<\/code>), typing out questions and getting answers back from the service:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/SimpleQA.png\" alt=\"Simple question and answer with a chat agent\" \/><\/p>\n<p>The <code>await kernel.InvokePromptAsync(Console.ReadLine()!)<\/code> expression in there is the entirety of the interaction with the LLM. This reads in the user&#8217;s question and sends it off to the LLM, getting back a <code>string<\/code> response. SK supports multiple kinds of functions that can be invoked, including prompt functions (text-based interactions with AI) and normal .NET methods that can do anything C# code can do. 
The invocation of these functions can be issued directly by the consumer, as we&#8217;re doing here, but they can also be invoked as part of a &#8220;plan&#8221;: you supply a set of functions, each of which is capable of accomplishing something, and you ask the LLM for a plan for how to use one or more of those functions to achieve some described goal&#8230; SK can then invoke the functions according to the plan (we&#8217;ll see an example of that later). Some models also support &#8220;function calling&#8221;, which we&#8217;ll see in action later; this is also something SK simplifies.<\/p>\n<p>The &#8220;function&#8221; in this example is just whatever the user typed, e.g. if the user typed &#8220;What color is the sky?&#8221;, that&#8217;s the function, asking of the LLM &#8220;What color is the sky?&#8221;, since that&#8217;s what we passed to <code>InvokePromptAsync<\/code>. We can make this function nature a bit clearer by separating the function out into its own entity, via the <code>CreateFunctionFromPrompt<\/code> method, and then reusing that one function repeatedly. In doing so, we&#8217;re no longer creating a new function per user input, and thus need some way to parameterize the created function with the user&#8217;s input. For that, SK includes support for prompt templates, where you supply the prompt but with placeholders that SK will fill in based on the variables and functions available to it. For example, if I run the previous sample again, and this time ask for the current time, the LLM is unable to provide me with an answer:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/LackOfTime.png\" alt=\"LLM doesn&#039;t know the current time\" \/>\nHowever, if we expect such questions, we can proactively provide the LLM with the information it needs as part of the prompt. Here I&#8217;ve registered with the kernel a function that returns the current date and time. 
Then I&#8217;ve created a prompt function, with a prompt template that will invoke this function as part of rendering the prompt, and that will also include the value of the <code>$input<\/code> variable (any number of arbitrarily-named arguments can be supplied via a <code>KernelArguments<\/code> dictionary, and I&#8217;ve simply chosen to name one &#8220;input&#8221;). Functions are grouped into &#8220;plugins&#8221;:<\/p>\n<pre><code class=\"language-C#\">using Microsoft.SemanticKernel;\r\n\r\nstring apikey = Environment.GetEnvironmentVariable(\"AI:OpenAI:APIKey\")!;\r\n\r\n\/\/ Initialize the kernel\r\nKernel kernel = Kernel.CreateBuilder()\r\n    .AddOpenAIChatCompletion(\"gpt-3.5-turbo-0125\", apikey)\r\n    .Build();\r\n\r\n\/\/ Create the prompt function as part of a plugin and add it to the kernel.\r\n\/\/ These operations can be done separately, but helpers also enable doing\r\n\/\/ them in one step.\r\nkernel.ImportPluginFromFunctions(\"DateTimeHelpers\",\r\n[\r\n    kernel.CreateFunctionFromMethod(() => $\"{DateTime.UtcNow:r}\", \"Now\", \"Gets the current date and time\")\r\n]);\r\n\r\nKernelFunction qa = kernel.CreateFunctionFromPrompt(\"\"\"\r\n    The current date and time is {{ datetimehelpers.now }}.\r\n    {{ $input }}\r\n    \"\"\");\r\n\r\n\/\/ Q&A loop\r\nvar arguments = new KernelArguments();\r\nwhile (true)\r\n{\r\n    Console.Write(\"Question: \");\r\n    arguments[\"input\"] = Console.ReadLine();\r\n    Console.WriteLine(await qa.InvokeAsync(kernel, arguments));\r\n    Console.WriteLine();\r\n}<\/code><\/pre>\n<p>When that function is invoked, it will render the prompt, filling in those placeholders by invoking the registered <code>Now<\/code> function and substituting its result into the prompt. 
Now when I ask the same question, the answer is more satisfying:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/SuccessfulTime.png\" alt=\"LLM now knows the current time\" \/><\/p>\n<h2>A Trip Down Memory Lane<\/h2>\n<p>We&#8217;re making good progress: in just a few lines of code, we&#8217;ve been able to create a simple chat agent to which we can repeatedly pose questions and get answers, and we&#8217;ve been able to provide additional information in the prompt to help it answer questions it would have otherwise been unable to answer. However, we&#8217;ve also created a chat agent with no memory, such that it has no concept of things previously discussed:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/NoMemory.png\" alt=\"LLM has no memory of the previous messages in the chat\" \/><\/p>\n<p>These LLMs are stateless. To address the lack of memory, we need to keep track of our chat history and send it back as part of the prompt on each request. We could do so manually, rendering it into the prompt ourselves, or we could rely on SK to do it for us (and it can rely on the underlying clients for Azure OpenAI, OpenAI, or whatever other chat service is plugged in). 
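<\/p>
<p>For intuition, here&#8217;s roughly what the manual option would look like (a minimal sketch with no SK dependency; the role labels and single flattened string are purely illustrative, since real chat services accept structured message lists rather than one big string):<\/p>

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Keep every turn of the conversation, and re-render all of it into the
// prompt on every request; re-sending the transcript is what gives the
// stateless model its "memory".
List<(string Role, string Content)> history = new()
{
    ("system", "You are an AI assistant that helps people find information."),
};

string RenderPrompt(string question)
{
    history.Add(("user", question));
    StringBuilder prompt = new();
    foreach ((string role, string content) in history)
        prompt.AppendLine($"{role}: {content}");
    prompt.Append("assistant:"); // the model continues from here
    return prompt.ToString();
}

Console.WriteLine(RenderPrompt("What color is the sky?"));
```

<p>Each answer would be appended to <code>history<\/code> as well, so the payload regrows on every request; doing this by hand gets tedious quickly, which is why it&#8217;s convenient to let SK and the underlying clients manage the history.<\/p>
<p>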
This does the latter, getting the registered <code>IChatCompletionService<\/code>, creating a new chat (which is essentially just a list of all messages added to it), and then not only issuing requests and printing out responses, but also storing both into the chat history.<\/p>\n<pre><code class=\"language-C#\">using Microsoft.SemanticKernel;\r\nusing Microsoft.SemanticKernel.ChatCompletion;\r\n\r\nstring apikey = Environment.GetEnvironmentVariable(\"AI:OpenAI:APIKey\")!;\r\n\r\n\/\/ Initialize the kernel\r\nKernel kernel = Kernel.CreateBuilder()\r\n    .AddOpenAIChatCompletion(\"gpt-3.5-turbo-0125\", apikey)\r\n    .Build();\r\n\r\n\/\/ Create a new chat\r\nIChatCompletionService ai = kernel.GetRequiredService&lt;IChatCompletionService&gt;();\r\nChatHistory chat = new(\"You are an AI assistant that helps people find information.\");\r\n\r\n\/\/ Q&A loop\r\nwhile (true)\r\n{\r\n    Console.Write(\"Question: \");\r\n    chat.AddUserMessage(Console.ReadLine()!);\r\n\r\n    var answer = await ai.GetChatMessageContentAsync(chat);\r\n    chat.AddAssistantMessage(answer.Content!);\r\n    Console.WriteLine(answer);\r\n\r\n    Console.WriteLine();\r\n}<\/code><\/pre>\n<p>With that chat history rendered into an appropriate prompt, we then get back much more satisfying results:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/WithMemory.png\" alt=\"LLM now sees all messages in the chat\" \/><\/p>\n<p>In a real implementation, you&#8217;d need to pay attention to many other details, like the fact that all of these language models today have limits on the amount of data they&#8217;re able to process (the &#8220;context window&#8221;). 
The model I&#8217;m using has a context window of ~16,000 tokens (a &#8220;token&#8221; is the unit at which an LLM operates, sometimes a whole word, sometimes a portion of a word, sometimes a single character), plus there&#8217;s a per-token cost associated with every request\/response, so once this graduates from experimentation to &#8220;let&#8217;s put this into production,&#8221; we&#8217;d need to start paying a lot more attention to things like how much data is actually in the chat history, clearing out portions of it, etc.<\/p>\n<p>We can also add a tiny bit of code to help make the interaction feel snappier. These LLMs work based on generating the next token in the response, so although up until now we&#8217;ve only been printing out the response when the whole thing has arrived, we can actually stream the results so that we print out portions of the response as it&#8217;s available. This is exposed in SK via <code>IAsyncEnumerable&lt;T&gt;<\/code>, making it convenient to work with via <code>await foreach<\/code> loops.<\/p>\n<pre><code class=\"language-C#\">using Microsoft.SemanticKernel;\r\nusing Microsoft.SemanticKernel.ChatCompletion;\r\nusing System.Text;\r\n\r\nstring apikey = Environment.GetEnvironmentVariable(\"AI:OpenAI:APIKey\")!;\r\n\r\n\/\/ Initialize the kernel\r\nKernel kernel = Kernel.CreateBuilder()\r\n    .AddOpenAIChatCompletion(\"gpt-3.5-turbo-0125\", apikey)\r\n    .Build();\r\n\r\n\/\/ Create a new chat\r\nIChatCompletionService ai = kernel.GetRequiredService&lt;IChatCompletionService&gt;();\r\nChatHistory chat = new(\"You are an AI assistant that helps people find information.\");\r\nStringBuilder builder = new();\r\n\r\n\/\/ Q&A loop\r\nwhile (true)\r\n{\r\n    Console.Write(\"Question: \");\r\n    chat.AddUserMessage(Console.ReadLine()!);\r\n\r\n    builder.Clear();\r\n    await foreach (StreamingChatMessageContent message in ai.GetStreamingChatMessageContentsAsync(chat))\r\n    {\r\n        Console.Write(message);\r\n        
builder.Append(message.Content);\r\n    }\r\n    Console.WriteLine();\r\n    chat.AddAssistantMessage(builder.ToString());\r\n\r\n    Console.WriteLine();\r\n}<\/code><\/pre>\n<p>Now when we run this, we can see the response streaming in:\n<div style=\"width: 640px;\" class=\"wp-video\"><video class=\"wp-video-shortcode\" id=\"video-47283-1\" width=\"640\" height=\"360\" preload=\"metadata\" controls=\"controls\"><source type=\"video\/mp4\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/PoemAboutDogs-1.mp4?_=1\" \/><a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/PoemAboutDogs-1.mp4\">https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/PoemAboutDogs-1.mp4<\/a><\/video><\/div><\/p>\n<h2>Mind the Gap<\/h2>\n<p>So, we&#8217;re now able to submit questions and get back answers. We&#8217;re able to keep a history of these interactions and use them to influence the answers. And we&#8217;re able to stream our results. Are we done? Not exactly.<\/p>\n<p>Thus far, the only information the LLM has to provide answers is the data on which it was trained, plus anything we proactively put into the prompt (e.g. the current time in a previous example). That means if we ask questions about things the LLM wasn&#8217;t trained on or for which it has significant gaps in its knowledgebase, the answers we get back are likely to be unhelpful, misleading, or blatantly wrong (aka &#8220;hallucinations&#8221;). 
For example, consider this question and answer:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/IncorrectAnswerAboutNonBacktracking.png\" alt=\"Incorrect answer about functionality released after LLM model was trained\" \/>\nThe questions are asking about functionality introduced in .NET 7, which was released after this version of the GPT 3.5 Turbo model was trained (a newer version of GPT 3.5 Turbo is out in preview as of the time of this writing). The model has no information about the functionality, so for the first question, it gives an outdated answer, and for the second question, it starts hallucinating and just making up stuff. We need to find a way to teach it about the things the user is asking about.<\/p>\n<p>We&#8217;ve already seen a way of teaching the LLM things: put it in the prompt. So let&#8217;s try that. The blog post <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance_improvements_in_net_7\/\">Performance Improvements in .NET 7<\/a>, which was also posted after this version of GPT 3.5 Turbo model was trained, contains a lengthy section on <code>Regex<\/code> improvements in .NET 7, including about that new <code>RegexOptions.NonBacktracking<\/code> option, so if we put it all into the prompt, that should provide the LLM with what it needs. 
Here I&#8217;ve just augmented the previous example with an additional section of code that downloads the contents of the web page, does a hack job of cleaning up the contents a bit, and then adds it all into a user message.<\/p>\n<pre><code class=\"language-C#\">using Microsoft.SemanticKernel;\r\nusing Microsoft.SemanticKernel.ChatCompletion;\r\nusing System.Net;\r\nusing System.Text;\r\nusing System.Text.RegularExpressions;\r\n\r\nstring apikey = Environment.GetEnvironmentVariable(\"AI:OpenAI:APIKey\")!;\r\n\r\n\/\/ Initialize the kernel\r\nKernel kernel = Kernel.CreateBuilder()\r\n    .AddOpenAIChatCompletion(\"gpt-3.5-turbo-0125\", apikey)\r\n    .Build();\r\n\r\n\/\/ Create a new chat\r\nIChatCompletionService ai = kernel.GetRequiredService&lt;IChatCompletionService&gt;();\r\nChatHistory chat = new(\"You are an AI assistant that helps people find information.\");\r\nStringBuilder builder = new();\r\n\r\n\/\/ Download a document and add all of its contents to our chat\r\nusing (HttpClient client = new())\r\n{\r\n    string s = await client.GetStringAsync(\"https:\/\/devblogs.microsoft.com\/dotnet\/performance_improvements_in_net_7\");\r\n    s = WebUtility.HtmlDecode(Regex.Replace(s, @\"&lt;[^&gt;]+&gt;|&amp;nbsp;\", \"\"));\r\n    chat.AddUserMessage(\"Here's some additional information: \" + s); \/\/ uh oh!\r\n}\r\n\r\n\/\/ Q&A loop\r\nwhile (true)\r\n{\r\n    Console.Write(\"Question: \");\r\n    chat.AddUserMessage(Console.ReadLine()!);\r\n\r\n    builder.Clear();\r\n    await foreach (var message in ai.GetStreamingChatMessageContentsAsync(chat))\r\n    {\r\n        Console.Write(message);\r\n        builder.Append(message.Content);\r\n    }\r\n    Console.WriteLine();\r\n    chat.AddAssistantMessage(builder.ToString());\r\n\r\n    Console.WriteLine();\r\n}<\/code><\/pre>\n<p>And the result?<\/p>\n<pre><code class=\"language-text\">Unhandled exception. 
Microsoft.SemanticKernel.AI.AIException: Invalid request: The request is not valid, HTTP status: 400\r\n ---&gt; Azure.RequestFailedException: This model's maximum context length is 16384 tokens. However, your messages resulted in 155751 tokens. Please reduce the length of the messages.\r\nStatus: 400 (model_error)\r\nErrorCode: context_length_exceeded<\/code><\/pre>\n<p>Oops! Even without any additional history, we exceeded the context window by almost 10 times. We obviously need to include less information, but we still need to ensure it&#8217;s relevant information. RAG to the rescue.<\/p>\n<p>&#8220;RAG,&#8221; or Retrieval Augmented Generation, is just a fancy way of saying &#8220;look up some stuff and put it into the prompt.&#8221; Rather than putting all possible information into the prompt, we&#8217;ll instead index all of the additional information we care about, and then when a question is asked, we&#8217;ll use that question to find the most relevant indexed content and put just that additional content into the prompt. And to help with that, we need embeddings.<\/p>\n<p>Think of an &#8220;embedding&#8221; as a vector (array) of floating-point values that represents some content and its semantic meaning. We can ask a model specifically focused on embeddings to create such a vector for a particular input, and then we can store both the vector and the text that seeded it into a database. Later on, when a question is asked, we can similarly run that question through the same model, and we can use the resulting vector to look up the most relevant embeddings in our database. We&#8217;re not necessarily looking for exact matches, just ones that are close enough. And you can take &#8220;close&#8221; here literally; the lookups are typically performed using functions that use a distance measure, such as <a href=\"https:\/\/en.wikipedia.org\/wiki\/Cosine_similarity\">cosine similarity<\/a>. 
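<\/p>
<p>Cosine similarity itself is nothing exotic: it&#8217;s the dot product of the two vectors divided by the product of their magnitudes, so two vectors pointing in the same direction score 1.0, and the score falls as they diverge. Here&#8217;s a hand-rolled sketch of the computation (the <code>TensorPrimitives.CosineSimilarity<\/code> call used in the next sample computes the same thing, just far more efficiently):<\/p>

```csharp
using System;

// Cosine similarity: dot(a, b) / (|a| * |b|).
static float CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    float dot = 0f, normA = 0f, normB = 0f;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}

Console.WriteLine(CosineSimilarity([1f, 2f, 3f], [1f, 2f, 3f])); // same direction: 1
Console.WriteLine(CosineSimilarity([1f, 0f], [0f, 1f]));         // orthogonal: 0
```

<p>Embedding models typically return vectors normalized to unit length, in which case cosine similarity reduces to a plain dot product.<\/p>
<p>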
For example, consider this program (to run this, you&#8217;ll need to add the <code>System.Numerics.Tensors<\/code> nuget package in order to have access to the <code>TensorPrimitives<\/code> type):<\/p>\n<pre><code class=\"language-C#\">using Microsoft.SemanticKernel.Connectors.OpenAI;\r\nusing System.Numerics.Tensors;\r\n\r\n#pragma warning disable SKEXP0011\r\n\r\nstring apikey = Environment.GetEnvironmentVariable(\"AI:OpenAI:APIKey\")!;\r\n\r\nvar embeddingGen = new OpenAITextEmbeddingGenerationService(\"text-embedding-3-small\", apikey);\r\n\r\nstring input = \"What is an amphibian?\";\r\nstring[] examples =\r\n{\r\n    \"What is an amphibian?\",\r\n    \"Cos'\u00e8 un anfibio?\",\r\n    \"A frog is an amphibian.\",\r\n    \"Frogs, toads, and salamanders are all examples.\",\r\n    \"Amphibians are four-limbed and ectothermic vertebrates of the class Amphibia.\",\r\n    \"They are four-limbed and ectothermic vertebrates.\",\r\n    \"A frog is green.\",\r\n    \"A tree is green.\",\r\n    \"It's not easy bein' green.\",\r\n    \"A dog is a mammal.\",\r\n    \"A dog is a man's best friend.\",\r\n    \"You ain't never had a friend like me.\",\r\n    \"Rachel, Monica, Phoebe, Joey, Chandler, Ross\",\r\n};\r\n\r\n\/\/ Generate embeddings for each piece of text\r\nReadOnlyMemory<float> inputEmbedding = (await embeddingGen.GenerateEmbeddingsAsync([input]))[0];\r\nIList<ReadOnlyMemory<float>> exampleEmbeddings = await embeddingGen.GenerateEmbeddingsAsync(examples);\r\n\r\n\/\/ Print the cosine similarity between the input and each example\r\nfloat[] similarity = exampleEmbeddings.Select(e => TensorPrimitives.CosineSimilarity(e.Span, inputEmbedding.Span)).ToArray();\r\nsimilarity.AsSpan().Sort(examples.AsSpan(), (f1, f2) => f2.CompareTo(f1));\r\nConsole.WriteLine(\"Similarity Example\");\r\nfor (int i = 0; i < similarity.Length; i++)\r\n    Console.WriteLine($\"{similarity[i]:F6}   {examples[i]}\");<\/code><\/pre>\n<p>This uses the OpenAI embedding generation service 
to get an embedding vector (using the <code>text-embedding-3-small<\/code> model I mentioned at the beginning of the post) for both an input and a bunch of other pieces of text. It then compares the resulting embedding for the input against the resulting embedding for each of those other texts, sorts the results by similarity, and prints them out:<\/p>\n<pre><code>Similarity Example\r\n1.000000   What is an amphibian?\r\n0.937651   A frog is an amphibian.\r\n0.902491   Amphibians are four-limbed and ectothermic vertebrates of the class Amphibia.\r\n0.873569   Cos'\u00e8 un anfibio?\r\n0.866632   Frogs, toads, and salamanders are all examples.\r\n0.857454   A frog is green.\r\n0.842596   They are four-limbed and ectothermic vertebrates.\r\n0.802171   A dog is a mammal.\r\n0.784479   It's not easy bein' green.\r\n0.778341   A tree is green.\r\n0.756669   A dog is a man's best friend.\r\n0.734219   You ain't never had a friend like me.\r\n0.721176   Rachel, Monica, Phoebe, Joey, Chandler, Ross<\/code><\/pre>\n<p>Let's incorporate this concept into our chat app. In this iteration, I've augmented the previous chat example with a few things:<\/p>\n<ul>\n<li>To better help us see what's going on under the covers, I've enabled logging in the kernel. The kernel is actually a lightweight wrapper around a few pieces of information, including an <code>IServiceProvider<\/code>, which means you can use all of the services you're familiar with from elsewhere in .NET, including <code>ILoggerFactory<\/code>. The <code>IKernelBuilder<\/code> has a <code>Services<\/code> property that's an <code>IServiceCollection<\/code>, which means we can use all of the support from <code>Microsoft.Extensions.Logging<\/code> and friends to enable logging, just as we would in ASP.NET. 
That also means we need to add a couple of additional packages:\n<pre><code class=\"language-text\">dotnet add package Microsoft.Extensions.Logging\r\ndotnet add package Microsoft.Extensions.Logging.Console<\/code><\/pre>\n<\/li>\n<li>Since we can bring in arbitrary services, we can also use the great support available in the ecosystem for resiliency. We're making many requests here over HTTP, which can fail for various infrastructure-related reasons (e.g. a server that's temporarily unreachable). And LLMs themselves often introduce their own failure modes, e.g. the caller's account only permits a certain number of interactions per second. Thus, we want to enable smart retries. To do that, we'll use <code class=\"language-text\">dotnet add package Microsoft.Extensions.Http.Resilience<\/code> to bring in automated resilience support that can be imported into the kernel. Then any HTTP requests created by any of the other components via the kernel will get retries applied automatically.<\/li>\n<li>To enable SK to do the embedding generation on our behalf via its abstractions, we'll also need to add its Memory package (note the \"--prerelease\"... this is an evolving space, so while some of the SK components are considered stable, others are still evolving and are thus still marked as \"prerelease\"):\n<pre><code class=\"language-text\">dotnet add package Microsoft.SemanticKernel.Plugins.Memory --prerelease<\/code><\/pre>\n<\/li>\n<li>I then need to create an <code>ISemanticTextMemory<\/code> to use for querying, which I do by using <code>MemoryBuilder<\/code> to combine an embeddings generator with a database. I've used the <code>WithOpenAITextEmbeddingGeneration<\/code> method to specify I want to use the OpenAI service as my embeddings generator, and I've used <code>WithMemoryStore<\/code> to register a <code>VolatileMemoryStore<\/code> instance as the store (we'll change that later, but this will suffice for now). 
<code>VolatileMemoryStore<\/code> is simply an implementation of SK's <code>IMemoryStore<\/code> abstraction wrapping an in-memory dictionary.<\/li>\n<li>I've taken the downloaded text, used SK's <code>TextChunker<\/code> to break it fairly arbitrarily into pieces, and then I've used <code>SaveInformationAsync<\/code> to save each of those pieces to the memory store. That call will use the embedding service to generate an embedding for the text and then store the resulting vector and the input text into the aforementioned dictionary.<\/li>\n<li>Then, when it's time to ask a question, rather than just adding the question to the chat history and submitting that, we first use the question to <code>SearchAsync<\/code> on the memory store. That will again use the embedding service to get an embedding vector for the question, and then search the store for the closest vectors to that input. I've arbitrarily had it return the three closest matches, for which it then appends together the associated text, adds the results into the chat history, and submits that. 
After submitting the request, I've also then removed this additional context from my chat history, so that it's not sent again on subsequent requests; this additional information can consume much of the allowed context window.<\/li>\n<\/ul>\n<p>Here's our resulting program:<\/p>\n<pre><code class=\"language-C#\">using Microsoft.Extensions.DependencyInjection;\r\nusing Microsoft.Extensions.Logging;\r\nusing Microsoft.SemanticKernel;\r\nusing Microsoft.SemanticKernel.ChatCompletion;\r\nusing Microsoft.SemanticKernel.Connectors.OpenAI;\r\nusing Microsoft.SemanticKernel.Memory;\r\nusing Microsoft.SemanticKernel.Text;\r\nusing System.Net;\r\nusing System.Text;\r\nusing System.Text.RegularExpressions;\r\n\r\n#pragma warning disable SKEXP0003, SKEXP0011, SKEXP0052, SKEXP0055 \/\/ Experimental\r\n\r\nstring apikey = Environment.GetEnvironmentVariable(\"AI:OpenAI:APIKey\")!;\r\n\r\n\/\/ Initialize the kernel\r\nIKernelBuilder kb = Kernel.CreateBuilder();\r\nkb.AddOpenAIChatCompletion(\"gpt-3.5-turbo-0125\", apikey);\r\nkb.Services.AddLogging(c => c.AddConsole().SetMinimumLevel(LogLevel.Trace));\r\nkb.Services.ConfigureHttpClientDefaults(c => c.AddStandardResilienceHandler());\r\nKernel kernel = kb.Build();\r\n\r\n\/\/ Download a document and create embeddings for it\r\nISemanticTextMemory memory = new MemoryBuilder()\r\n    .WithLoggerFactory(kernel.LoggerFactory)\r\n    .WithMemoryStore(new VolatileMemoryStore())\r\n    .WithOpenAITextEmbeddingGeneration(\"text-embedding-3-small\", apikey)\r\n    .Build();\r\nstring collectionName = \"net7perf\";\r\nusing (HttpClient client = new())\r\n{\r\n    string s = await client.GetStringAsync(\"https:\/\/devblogs.microsoft.com\/dotnet\/performance_improvements_in_net_7\");\r\n    List<string> paragraphs =\r\n        TextChunker.SplitPlainTextParagraphs(\r\n            TextChunker.SplitPlainTextLines(\r\n                WebUtility.HtmlDecode(Regex.Replace(s, @\"&lt;[^&gt;]+&gt;|&amp;nbsp;\", \"\")),\r\n                128),\r\n       
     1024);\r\n    for (int i = 0; i < paragraphs.Count; i++)\r\n        await memory.SaveInformationAsync(collectionName, paragraphs[i], $\"paragraph{i}\");\r\n}\r\n\r\n\/\/ Create a new chat\r\nvar ai = kernel.GetRequiredService&lt;IChatCompletionService&gt;();\r\nChatHistory chat = new(\"You are an AI assistant that helps people find information.\");\r\nStringBuilder builder = new();\r\n\r\n\/\/ Q&amp;A loop\r\nwhile (true)\r\n{\r\n    Console.Write(\"Question: \");\r\n    string question = Console.ReadLine()!;\r\n\r\n    builder.Clear();\r\n    await foreach (var result in memory.SearchAsync(collectionName, question, limit: 3))\r\n        builder.AppendLine(result.Metadata.Text);\r\n    int contextToRemove = -1;\r\n    if (builder.Length != 0)\r\n    {\r\n        builder.Insert(0, \"Here's some additional information: \");\r\n        contextToRemove = chat.Count;\r\n        chat.AddUserMessage(builder.ToString());\r\n    }\r\n\r\n    chat.AddUserMessage(question);\r\n\r\n    builder.Clear();\r\n    await foreach (var message in ai.GetStreamingChatMessageContentsAsync(chat))\r\n    {\r\n        Console.Write(message);\r\n        builder.Append(message.Content);\r\n    }\r\n    Console.WriteLine();\r\n    chat.AddAssistantMessage(builder.ToString());\r\n\r\n    if (contextToRemove >= 0) chat.RemoveAt(contextToRemove);\r\n    Console.WriteLine();\r\n}<\/code><\/pre>\n<p>When I run this, I now see lots of logging happening:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2024\/02\/LoggingFromEmbeddingGeneration.png\" alt=\"Embedding creation logging from Semantic Kernel\" \/>\nThe text chunking code split the document into 163 \"paragraphs,\" leading to 163 embeddings being generated and stored in our database. 
A couple of the resulting requests were also throttled, with the service responding that too many requests had been issued in too short a period of time; the <code>HttpClient<\/code> used by SK automatically retried after a few seconds, at which point it was able to continue successfully. The cool thing is that with all of those embeddings in place, asking a question now pulls the most relevant material from the database and adds that text to the prompt, so when we ask the same questions we did earlier, we get a much more helpful and accurate response:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2024\/02\/CorrectAnswerAboutNonBacktracking.png\" alt=\"LLM now correctly answers questions about functionality created after the model was trained\" \/>\nSweet.<\/p>\n<h2>Persistence of Memory<\/h2>\n<p>Of course, we don't want to have to index the material every time the app restarts. Imagine this were a site enabling chatting with thousands of documents; reindexing all of that content every time a process restarted would not only be time-consuming, it would be unnecessarily expensive (the <a href=\"https:\/\/azure.microsoft.com\/pricing\/details\/cognitive-services\/openai-service\/\">pricing details<\/a> for the Azure OpenAI embedding model I'm using here highlight that at the time of this writing it costs $0.0001 per 1,000 tokens, which means just this one document costs a few cents to index). So, we want to switch to using a database. SK provides a multitude of <code>IMemoryStore<\/code> implementations, and we can easily switch to one that actually persists the results. For example, let's switch to one based on Sqlite. 
For this, we need another NuGet package:<\/p>\n<pre><code class=\"language-sh\">dotnet add package Microsoft.SemanticKernel.Connectors.Sqlite --prerelease<\/code><\/pre>\n<p>and with that, we can change just one line of code to switch from the <code>VolatileMemoryStore<\/code>:<\/p>\n<pre><code class=\"language-C#\">.WithMemoryStore(new VolatileMemoryStore())<\/code><\/pre>\n<p>to the <code>SqliteMemoryStore<\/code>:<\/p>\n<pre><code class=\"language-C#\">.WithMemoryStore(await SqliteMemoryStore.ConnectAsync(\"mydata.db\"))<\/code><\/pre>\n<p>Sqlite is an embedded SQL database engine that runs in the same process and stores its data in regular disk files. Here, it'll connect to a <code>mydata.db<\/code> file, creating it if it doesn't already exist. Now, if we were to run that, we'd still end up creating the embeddings again, as in our previous example there wasn't any guard checking to see whether the data already existed. Thus, our final change is simply to guard that work:<\/p>\n<pre><code class=\"language-C#\">IList&lt;string&gt; collections = await memory.GetCollectionsAsync();\r\nif (!collections.Contains(\"net7perf\"))\r\n{\r\n    ... \/\/ same code as before to download and process the document\r\n}<\/code><\/pre>\n<p>You get the idea. 
Here's the full version using Sqlite:<\/p>\n<pre><code class=\"language-C#\">using Microsoft.Extensions.DependencyInjection;\r\nusing Microsoft.Extensions.Logging;\r\nusing Microsoft.SemanticKernel;\r\nusing Microsoft.SemanticKernel.ChatCompletion;\r\nusing Microsoft.SemanticKernel.Connectors.OpenAI;\r\nusing Microsoft.SemanticKernel.Connectors.Sqlite;\r\nusing Microsoft.SemanticKernel.Memory;\r\nusing Microsoft.SemanticKernel.Text;\r\nusing System.Net;\r\nusing System.Text;\r\nusing System.Text.RegularExpressions;\r\n\r\n#pragma warning disable SKEXP0003, SKEXP0011, SKEXP0028, SKEXP0052, SKEXP0055 \/\/ Experimental\r\n\r\nstring apikey = Environment.GetEnvironmentVariable(\"AI:OpenAI:APIKey\")!;\r\n\r\n\/\/ Initialize the kernel\r\nIKernelBuilder kb = Kernel.CreateBuilder();\r\nkb.AddOpenAIChatCompletion(\"gpt-3.5-turbo-0125\", apikey);\r\nkb.Services.AddLogging(c => c.AddConsole());\r\nkb.Services.ConfigureHttpClientDefaults(c => c.AddStandardResilienceHandler());\r\nKernel kernel = kb.Build();\r\n\r\n\/\/ Download a document and create embeddings for it\r\nISemanticTextMemory memory = new MemoryBuilder()\r\n    .WithLoggerFactory(kernel.LoggerFactory)\r\n    .WithMemoryStore(await SqliteMemoryStore.ConnectAsync(\"mydata.db\"))\r\n    .WithOpenAITextEmbeddingGeneration(\"text-embedding-3-small\", apikey)\r\n    .Build();\r\n\r\nIList&lt;string&gt; collections = await memory.GetCollectionsAsync();\r\nstring collectionName = \"net7perf\";\r\nif (collections.Contains(collectionName))\r\n{\r\n    Console.WriteLine(\"Found database\");\r\n}\r\nelse\r\n{\r\n    using HttpClient client = new();\r\n    string s = await client.GetStringAsync(\"https:\/\/devblogs.microsoft.com\/dotnet\/performance_improvements_in_net_7\");\r\n    List&lt;string&gt; paragraphs =\r\n        TextChunker.SplitPlainTextParagraphs(\r\n            TextChunker.SplitPlainTextLines(\r\n                WebUtility.HtmlDecode(Regex.Replace(s, @\"&lt;[^&gt;]+&gt;|&amp;nbsp;\", \"\")),\r\n                
128),\r\n            1024);\r\n    for (int i = 0; i < paragraphs.Count; i++)\r\n        await memory.SaveInformationAsync(collectionName, paragraphs[i], $\"paragraph{i}\");\r\n    Console.WriteLine(\"Generated database\");\r\n}\r\n\r\n\/\/ Create a new chat\r\nvar ai = kernel.GetRequiredService&lt;IChatCompletionService&gt;();\r\nChatHistory chat = new(\"You are an AI assistant that helps people find information.\");\r\nStringBuilder builder = new();\r\n\r\n\/\/ Q&amp;A loop\r\nwhile (true)\r\n{\r\n    Console.Write(\"Question: \");\r\n    string question = Console.ReadLine()!;\r\n\r\n    builder.Clear();\r\n    await foreach (var result in memory.SearchAsync(collectionName, question, limit: 3))\r\n        builder.AppendLine(result.Metadata.Text);\r\n\r\n    int contextToRemove = -1;\r\n    if (builder.Length != 0)\r\n    {\r\n        builder.Insert(0, \"Here's some additional information: \");\r\n        contextToRemove = chat.Count;\r\n        chat.AddUserMessage(builder.ToString());\r\n    }\r\n\r\n    chat.AddUserMessage(question);\r\n\r\n    builder.Clear();\r\n    await foreach (var message in ai.GetStreamingChatMessageContentsAsync(chat))\r\n    {\r\n        Console.Write(message);\r\n        builder.Append(message.Content);\r\n    }\r\n    Console.WriteLine();\r\n    chat.AddAssistantMessage(builder.ToString());\r\n\r\n    if (contextToRemove >= 0) chat.RemoveAt(contextToRemove);\r\n    Console.WriteLine();\r\n}<\/code><\/pre>\n<p>Now when we run that, on first invocation we still end up indexing everything, but after that, the data has all been indexed:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/SqliteDb.png\" alt=\"Sqlite database on disk\" \/>\nand subsequent invocations are able to simply use it.\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/FoundDatabase.png\" alt=\"Embeddings found in Sqlite database\" 
\/><\/p>\n<p>Of course, while Sqlite is an awesome tool, it's not optimized for doing these kinds of searches. In fact, the code for this <code>SqliteMemoryStore<\/code> in SK is simply enumerating the full contents of the database and doing a <code>CosineSimilarity<\/code> check on each:<\/p>\n<pre><code class=\"language-C#\">\/\/ from https:\/\/github.com\/microsoft\/semantic-kernel\/blob\/52e317a79651898a6c135124241c9e7dcb0c02ae\/dotnet\/src\/Connectors\/Connectors.Memory.Sqlite\/SqliteMemoryStore.cs#L136\r\nawait foreach (var record in this.GetAllAsync(collectionName, cancellationToken))\r\n{\r\n    if (record != null)\r\n    {\r\n        double similarity = TensorPrimitives.CosineSimilarity(embedding.Span, record.Embedding.Span);\r\n        ...<\/code><\/pre>\n<p>For real scale, and to be able to share the data between multiple frontends, we'd want a real \"vector database,\" one that's been designed for storing and searching embeddings. There are a multitude of such vector databases now available, including Azure AI Search, Chroma, Milvus, Pinecone, Qdrant, Weaviate, and more, all of which have memory store implementations for SK. We can simply stand up one of those (most of which have docker images readily available), change our <code>WithMemoryStore<\/code> call to use the appropriate connector, and we're cooking with gas.<\/p>\n<p>So let's do that. You can choose whichever of these databases works well for your needs; for the purposes of this post, I've arbitrarily chosen Qdrant. Ensure you have docker up and running, and then issue the following command to pull down the Qdrant image:<\/p>\n<pre><code class=\"language-sh\">docker pull qdrant\/qdrant<\/code><\/pre>\n<p>Once you have that, you can start a container with it:<\/p>\n<pre><code class=\"language-sh\">docker run -p 6333:6333 -v \/qdrant_storage:\/qdrant\/storage qdrant\/qdrant<\/code><\/pre>\n<p>And that's it; we now have a vector database up and running locally. 
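To sanity-check that the container is listening (assuming the default port mapping used above), you can hit its REST API and list its collections:<\/p>\n<pre><code class=\"language-sh\">curl http:\/\/localhost:6333\/collections<\/code><\/pre>\n<p>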
Now we just need to use it instead. I add the relevant SK \"connector\" to my project:<\/p>\n<pre><code class=\"language-sh\">dotnet add package Microsoft.SemanticKernel.Connectors.Qdrant --prerelease<\/code><\/pre>\n<p>and then change two lines of code, from:<\/p>\n<pre><code class=\"language-C#\">using Microsoft.SemanticKernel.Connectors.Sqlite;\r\n...\r\n.WithMemoryStore(await SqliteMemoryStore.ConnectAsync(\"mydata.db\"))<\/code><\/pre>\n<p>to:<\/p>\n<pre><code class=\"language-C#\">using Microsoft.SemanticKernel.Connectors.Qdrant;\r\n...\r\n.WithMemoryStore(new QdrantMemoryStore(\"http:\/\/localhost:6333\/\", 1536))<\/code><\/pre>\n<p>And that's it! Now when I run it, I see a flurry of logging activity coming from Qdrant as the app stores all the embeddings:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/QdrantLogging.png\" alt=\"Console logging from Qdrant\" \/>\nWe can use its dashboard to inspect the data that was stored.\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/QdrantDashboard.png\" alt=\"Qdrant web dashboard\" \/>\nAnd of course the app continues working happily:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/UsingQdrantFromChatApp.png\" alt=\"Using Qdrant from the chat console app\" \/><\/p>\n<h2>Hearing the Call<\/h2>\n<p>We've just implemented an end-to-end use of embeddings with a vector database, examining the input query, getting additional content based on that query, and augmenting the prompt submitted to the LLM in order to give it more context. That's the essence of RAG. However, there are other ways content can be retrieved, and you as the developer don't always need to be the one doing it. 
In fact, models themselves may be trained to ask for more information; the OpenAI models, for example, have been trained to support tools \/ function calls: as part of the prompt, they can be told about a set of functions they could invoke if they deem it valuable. See the \"Function Calling\" section of <a href=\"https:\/\/openai.com\/blog\/function-calling-and-other-api-updates\">OpenAI's announcement<\/a>. Essentially, as part of the chat message, you include a schema for any available function the LLM might want to use, and then if the LLM detects an opportunity to use it, rather than sending back a textual response to your message, it sends back a request for you to invoke the function, replete with the arguments that should be provided. You then invoke the function and reissue your request, this time with both its function request and your function's response in the chat history.<\/p>\n<p>As we saw earlier, SK supports creating strongly-typed function objects (<code>KernelFunction<\/code>), and collections of these functions (referred to as a \"plugin\") can be added to a <code>Kernel<\/code>. SK is then able to automatically handle all aspects of that function calling lifecycle for you: it can describe the shape of the functions, include the schema for the parameters in the chat message, parse function call request responses, invoke the relevant function, and send back the results, all without the developer needing to be in the loop (though the developer can be if desired).<\/p>\n<p>Let's look at an example. Here I'm creating a <code>Kernel<\/code> that contains a single plugin, which in turn contains a single function (there are multiple ways these can be expressed and brought into a Kernel; here I'm just using one based on lambda functions in order to keep the post concise). 
When invoked, that function will look at the name of the person specified and return that person's age; I've hardcoded those ages here, but obviously this function could do anything and look anywhere in order to retrieve that information. Notice that the function is just returning an integer: SK handles marshaling data in and out of the function, so arbitrary data types can be used, with SK taking care of the conversions. I've also added some metadata to the function, so that SK can appropriately describe this function and its parameters to the LLM.<\/p>\n<pre><code class=\"language-C#\">kernel.ImportPluginFromFunctions(\"Demographics\",\r\n[\r\n    kernel.CreateFunctionFromMethod(\r\n        [Description(\"Gets the age of the named person\")]\r\n        ([Description(\"The name of a person\")] string name) => name switch\r\n        {\r\n            \"Elsa\" => 21,\r\n            \"Anna\" => 18,\r\n            _ => -1,\r\n        }, \"get_person_age\")\r\n]);<\/code><\/pre>\n<p>The only thing we then need to do is tell the <code>IChatCompletionService<\/code> that we want it to opt in to automatic function calling, which we do by providing it with a <code>PromptExecutionSettings<\/code> object that's been configured appropriately. 
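With the OpenAI connector, that settings object is <code>OpenAIPromptExecutionSettings<\/code>, and the relevant property is <code>ToolCallBehavior<\/code>:<\/p>\n<pre><code class=\"language-C#\">\/\/ Opt in to SK automatically invoking any kernel functions requested by the LLM\r\nOpenAIPromptExecutionSettings settings = new() { ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions };\r\n...\r\nawait foreach (var message in ai.GetStreamingChatMessageContentsAsync(chat, settings, kernel))<\/code><\/pre>\n<p>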
The end result of our whole program looks like this:<\/p>\n<pre><code class=\"language-C#\">using Microsoft.Extensions.DependencyInjection;\r\nusing Microsoft.Extensions.Logging;\r\nusing Microsoft.SemanticKernel;\r\nusing Microsoft.SemanticKernel.ChatCompletion;\r\nusing Microsoft.SemanticKernel.Connectors.OpenAI;\r\nusing System.ComponentModel;\r\nusing System.Text;\r\n\r\nstring apikey = Environment.GetEnvironmentVariable(\"AI:OpenAI:APIKey\")!;\r\n\r\n\/\/ Initialize the kernel\r\nIKernelBuilder kb = Kernel.CreateBuilder();\r\nkb.AddOpenAIChatCompletion(\"gpt-3.5-turbo-0125\", apikey);\r\nkb.Services.AddLogging(c => c.AddConsole().SetMinimumLevel(LogLevel.Trace));\r\nKernel kernel = kb.Build();\r\n\r\nkernel.ImportPluginFromFunctions(\"Demographics\",\r\n[\r\n    kernel.CreateFunctionFromMethod(\r\n        [Description(\"Gets the age of the named person\")]\r\n        ([Description(\"The name of a person\")] string name) => name switch\r\n        {\r\n            \"Elsa\" => 21,\r\n            \"Anna\" => 18,\r\n            _ => -1,\r\n        }, \"get_person_age\")\r\n]);\r\n\r\n\/\/ Create a new chat\r\nvar ai = kernel.GetRequiredService&lt;IChatCompletionService&gt;();\r\nChatHistory chat = new(\"You are an AI assistant that helps people find information.\");\r\nStringBuilder builder = new();\r\nOpenAIPromptExecutionSettings settings = new() { ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions };\r\n\r\n\/\/ Q&A loop\r\nwhile (true)\r\n{\r\n    Console.Write(\"Question: \");\r\n    chat.AddUserMessage(Console.ReadLine()!);\r\n\r\n    builder.Clear();\r\n    await foreach (var message in ai.GetStreamingChatMessageContentsAsync(chat, settings, kernel))\r\n    {\r\n        Console.Write(message);\r\n        builder.Append(message.Content);\r\n    }\r\n    Console.WriteLine();\r\n    chat.AddAssistantMessage(builder.ToString());\r\n}<\/code><\/pre>\n<p>And with that, we can see the LLM not only has access to what was included in the prompt, but it 
is also able to effectively invoke this function to get the additional information it needs:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2024\/02\/FunctionsResults.png\" alt=\"Showing the results of function invocations\" \/><\/p>\n<h2>What's Next?<\/h2>\n<p>Whew! I've obviously omitted a lot of important details that any real application will need to consider. How should the data being indexed be cleaned, normalized, and chunked? How should errors be handled? How should we restrict how much data is sent as part of each request (e.g. limiting chat history, limiting the size of the found embeddings)? In a service, where should all of this information be persisted? And a multitude of others, including making the UI much prettier than my amazing <code>Console.WriteLine<\/code> calls. But even with all of those details missing, it should hopefully be obvious now that you can get started incorporating this kind of functionality into your applications immediately. As mentioned, the space is evolving very quickly, and your feedback about what works well and what doesn't would be invaluable for the teams working on these solutions. 
I encourage you to join the discussions in repos like for <a href=\"https:\/\/github.com\/microsoft\/semantic-kernel\">Semantic Kernel<\/a> and <a href=\"https:\/\/github.com\/Azure\/azure-sdk-for-net\/tree\/main\/sdk\/openai\/Azure.AI.OpenAI\">Azure OpenAI client library<\/a>.<\/p>\n<p>Happy coding!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Build a chat-based console app with Retrieval Augmented Generation (RAG) from scratch.<\/p>\n","protected":false},"author":360,"featured_media":47284,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[685,7781],"tags":[7701],"class_list":["post-47283","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-dotnet","category-ai","tag-dotnet-8"],"acf":[],"blog_post_summary":"<p>Build a chat-based console app with Retrieval Augmented Generation (RAG) from scratch.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/47283","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/users\/360"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/comments?post=47283"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/47283\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media\/47284"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media?parent=47283"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/categories
?post=47283"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/tags?post=47283"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}