September 6th, 2023

Demystifying Retrieval Augmented Generation with .NET

Stephen Toub - MSFT
Partner Software Engineer

This post was edited on 2/1/2024 to update it for Semantic Kernel 1.3.0.

Generative AI, or using AI to create text, images, audio, or basically anything else, has taken the world by storm over the last year. Developers for all manner of applications are now exploring how these systems can be incorporated to the benefit of their users. Yet with the technology advancing at a breakneck pace, new models released every day, and new SDKs constantly popping out of the woodwork, it can be challenging for developers to figure out how to actually get started. There are a variety of polished end-to-end sample applications that .NET developers can use as a reference (example). However, I personally do better when I can build something up incrementally, learning the minimal concepts first, and then expanding on it and making it robust and beautiful later.

To that end, this post focuses on building a simple console-based .NET chat application from the ground up, with minimal dependencies and minimal fuss. The end goal is to be able to ask questions and get answers not only based on the data on which our model was trained, but also on additional data supplied dynamically. Along the way, every code sample shown is a complete application, so you can just copy-and-paste it into a Program.cs file, run it, play with it, and then copy-and-paste into your real application, where you can refine and augment it to your heart’s content.

Let’s Get Chattin’

To begin, make sure you have .NET 8 installed, and create a simple console app (.NET 6, .NET 7, and .NET Framework will also work, just with a few tweaks to the project file):

dotnet new console -o chatapp
cd chatapp

This creates a new directory chatapp and populates it with two files: chatapp.csproj and Program.cs. We then need to bring in one NuGet package: Microsoft.SemanticKernel.

dotnet add package Microsoft.SemanticKernel

We could choose to reference specific AI-related packages, like Azure.AI.OpenAI, but I’ve instead turned to Semantic Kernel (SK) as a way to simplify various interactions and more easily swap in and out different implementations in order to more quickly experiment. SK provides a set of libraries that makes it easier to work with Large Language Models (LLMs). It provides abstractions for various AI concepts so that you can code to the abstraction and more easily substitute different implementations. It provides many concrete implementations of those abstractions, wrapping a multitude of other SDKs. It provides support for planning and orchestration, such that you can ask AI to create a plan for how to achieve a certain goal. It provides support for plug-ins. And much more. We’ll touch on a variety of those aspects throughout this post, but I’m primarily using SK for its abstractions.

While for the purposes of this post I’ve tried to keep dependencies to a minimum, there’s one more I can’t avoid: you need access to an LLM. The easiest way to get access is via either OpenAI or Azure OpenAI. For this post, I’m using OpenAI, but switching to use Azure OpenAI instead requires changing just one line in each of the samples. You’ll need three pieces of information for the remainder of the post:

  • Your API key, provided to you in the portal for your service. (If you’re using Azure OpenAI instead of OpenAI, you’ll also need your endpoint, which is provided to you in your Azure portal.)
  • A chat model. I’m using gpt-3.5-turbo-0125, which as of this writing has a context window of ~16K tokens. We’ll talk more about what this is later. Note that if you’re using Azure OpenAI instead of OpenAI, you won’t refer to the model by name; instead, you’ll create a deployment of that model and refer to the deployment name.
  • An embedding model. I’m using text-embedding-3-small.

With that out of the way, we can dive in. Believe it or not, we can create a simple chat app in just a few lines of code. Copy-and-paste this into your Program.cs:

using Microsoft.SemanticKernel;

string apikey = Environment.GetEnvironmentVariable("AI:OpenAI:APIKey")!;

// Initialize the kernel
Kernel kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion("gpt-3.5-turbo-0125", apikey)
    .Build();

// Q&A loop
while (true)
{
    Console.Write("Question: ");
    Console.WriteLine(await kernel.InvokePromptAsync(Console.ReadLine()!));
    Console.WriteLine();
}

To avoid accidentally leaking my API key (which needs to be protected like you would a password) to the world in this post, I've stored it in an environment variable. Thus, I read in the API key via GetEnvironmentVariable. I then create a new "kernel" with the SK APIs, asking it to add into the kernel an OpenAI chat completion service. The Microsoft.SemanticKernel package we pulled in earlier includes references to client support for both OpenAI and Azure OpenAI, so we don't need anything additional to be able to talk to these services. And with that configured, we can now run our chat app (dotnet run), typing out questions and getting answers back from the service: Simple question and answer with a chat agent
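
As an aside, if you're using Azure OpenAI rather than OpenAI, the chat completion registration is the line to swap. Here's a minimal sketch of what that might look like; the deployment name and environment variable names below are placeholders you'd replace with your own (check the exact overloads available in your SK version):

// Hypothetical Azure OpenAI equivalent of the chat completion registration.
// The deployment name and environment variable names are placeholders.
Kernel kernel = Kernel.CreateBuilder()
    .AddAzureOpenAIChatCompletion(
        "my-gpt-35-turbo-deployment",                                    // your deployment name, not the model name
        Environment.GetEnvironmentVariable("AI:AzureOpenAI:Endpoint")!,  // e.g. https://yourresource.openai.azure.com/
        Environment.GetEnvironmentVariable("AI:AzureOpenAI:APIKey")!)
    .Build();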

The await kernel.InvokePromptAsync(Console.ReadLine()!) expression in there is the entirety of the interaction with the LLM. This reads in the user's question and sends it off to the LLM, getting back a string response. SK supports multiple kinds of functions that can be invoked, including prompt functions (text-based interactions with AI) and normal .NET methods that can do anything C# code can do. The invocation of these functions can be issued directly by the consumer, as we're doing here, but they can also be invoked as part of a "plan": you supply a set of functions, each of which is capable of accomplishing something, and you ask the LLM for a plan for how to use one or more of those functions to achieve some described goal… SK can then invoke the functions according to the plan (we'll see an example of that later). Some models also support "function calling", which we'll see in action later; this is also something SK simplifies.

The "function" in this example is just whatever the user typed, e.g. if the user typed "What color is the sky?", that's the function, asking of the LLM "What color is the sky?", since that's what we passed to InvokePromptAsync. We can make the function-like nature of this a bit clearer by separating the function out into its own entity, via the CreateFunctionFromPrompt method, and then reusing that one function repeatedly. In doing so, we're no longer creating a new function per user input, and thus need some way to parameterize the created function with the user's input. For that, SK includes support for prompt templates, where you supply the prompt but with placeholders that SK will fill in based on the variables and functions available to it. For example, if I run the previous sample again, and this time ask for the current time, the LLM is unable to provide me with an answer: LLM doesn't know the current time. However, if we expect such questions, we can proactively provide the LLM with the information it needs as part of the prompt. Here I've registered with the kernel a function that returns the current date and time. Then I've created a prompt function, with a prompt template that will invoke this function as part of rendering the prompt, and that will also include the value of the $input variable (any number of arbitrarily-named arguments can be supplied via a KernelArguments dictionary, and I've simply chosen to name one "input"). Functions are grouped into "plugins":

using Microsoft.SemanticKernel;

string apikey = Environment.GetEnvironmentVariable("AI:OpenAI:APIKey")!;

// Initialize the kernel
Kernel kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion("gpt-3.5-turbo-0125", apikey)
    .Build();

// Create the prompt function as part of a plugin and add it to the kernel.
// These operations can be done separately, but helpers also enable doing
// them in one step.
kernel.ImportPluginFromFunctions("DateTimeHelpers",
[
    kernel.CreateFunctionFromMethod(() => $"{DateTime.UtcNow:r}", "Now", "Gets the current date and time")
]);

KernelFunction qa = kernel.CreateFunctionFromPrompt("""
    The current date and time is {{ datetimehelpers.now }}.
    {{ $input }}
    """);

// Q&A loop
var arguments = new KernelArguments();
while (true)
{
    Console.Write("Question: ");
    arguments["input"] = Console.ReadLine();
    Console.WriteLine(await qa.InvokeAsync(kernel, arguments));
    Console.WriteLine();
}

When that function is invoked, it will render the prompt, filling in those placeholders by invoking the registered Now function and substituting its result into the prompt. Now when I ask the same question, the answer is more satisfying: LLM now knows the current time

A Trip Down Memory Lane

We’re making good progress: in just a few lines of code, we’ve been able to create a simple chat agent to which we can repeatedly pose questions and get answers, and we’ve been able to provide additional information in the prompt to help it answer questions it would have otherwise been unable to answer. However, we’ve also created a chat agent with no memory, such that it has no concept of things previously discussed: LLM has no memory of the previous messages in the chat

These LLMs are stateless. To address the lack of memory, we need to keep track of our chat history and send it back as part of the prompt on each request. We could do so manually, rendering it into the prompt ourselves, or we could rely on SK to do it for us (and it can rely on the underlying clients for Azure OpenAI, OpenAI, or whatever other chat service is plugged in). This does the latter, getting the registered IChatCompletionService, creating a new chat (which is essentially just a list of all messages added to it), and then not only issuing requests and printing out responses, but also storing both into the chat history.

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

string apikey = Environment.GetEnvironmentVariable("AI:OpenAI:APIKey")!;

// Initialize the kernel
Kernel kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion("gpt-3.5-turbo-0125", apikey)
    .Build();

// Create a new chat
IChatCompletionService ai = kernel.GetRequiredService<IChatCompletionService>();
ChatHistory chat = new("You are an AI assistant that helps people find information.");

// Q&A loop
while (true)
{
    Console.Write("Question: ");
    chat.AddUserMessage(Console.ReadLine()!);

    var answer = await ai.GetChatMessageContentAsync(chat);
    chat.AddAssistantMessage(answer.Content!);
    Console.WriteLine(answer);

    Console.WriteLine();
}

With that chat history rendered into an appropriate prompt, we then get back much more satisfying results: LLM now sees all messages in the chat

In a real implementation, you’d need to pay attention to many other details, like the fact that all of these language models today have limits on the amount of data they’re able to process (the “context window”). The model I’m using has a context window of ~16,000 tokens (a “token” is the unit at which an LLM operates, sometimes a whole word, sometimes a portion of a word, sometimes a single character), plus there’s a per-token cost associated with every request/response, so once this graduates from experimentation to “let’s put this into production,” we’d need to start paying a lot more attention to things like how much data is actually in the chat history, clearing out portions of it, etc.
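
If you want to experiment with that, one simplistic approach is to cap the number of messages kept in the ChatHistory, pruning the oldest user/assistant messages once the cap is exceeded. This is just an illustrative sketch (a real implementation would count tokens rather than messages, and might summarize old context rather than drop it):

// Illustrative only: keep the system message (index 0) plus the most recent
// maxMessages entries, dropping the oldest user/assistant messages.
static void TrimHistory(ChatHistory chat, int maxMessages = 20)
{
    while (chat.Count > maxMessages + 1)
    {
        chat.RemoveAt(1); // remove the oldest non-system message
    }
}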

We can also add a tiny bit of code to help make the interaction feel snappier. These LLMs work based on generating the next token in the response, so although up until now we've only been printing out the response when the whole thing has arrived, we can actually stream the results so that we print out portions of the response as they're available. This is exposed in SK via IAsyncEnumerable<T>, making it convenient to work with via await foreach loops.

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using System.Text;

string apikey = Environment.GetEnvironmentVariable("AI:OpenAI:APIKey")!;

// Initialize the kernel
Kernel kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion("gpt-3.5-turbo-0125", apikey)
    .Build();

// Create a new chat
IChatCompletionService ai = kernel.GetRequiredService<IChatCompletionService>();
ChatHistory chat = new("You are an AI assistant that helps people find information.");
StringBuilder builder = new();

// Q&A loop
while (true)
{
    Console.Write("Question: ");
    chat.AddUserMessage(Console.ReadLine()!);

    builder.Clear();
    await foreach (StreamingChatMessageContent message in ai.GetStreamingChatMessageContentsAsync(chat))
    {
        Console.Write(message);
        builder.Append(message.Content);
    }
    Console.WriteLine();
    chat.AddAssistantMessage(builder.ToString());

    Console.WriteLine();
}

Now when we run this, we can see the response streaming in:

Mind the Gap

So, we’re now able to submit questions and get back answers. We’re able to keep a history of these interactions and use them to influence the answers. And we’re able to stream our results. Are we done? Not exactly.

Thus far, the only information the LLM has to provide answers is the data on which it was trained, plus anything we proactively put into the prompt (e.g. the current time in a previous example). That means if we ask questions about things the LLM wasn't trained on or for which it has significant gaps in its knowledge base, the answers we get back are likely to be unhelpful, misleading, or blatantly wrong (aka "hallucinations"). For example, consider this question and answer: Incorrect answer about functionality released after LLM model was trained. The questions are asking about functionality introduced in .NET 7, which was released after this version of the GPT 3.5 Turbo model was trained (a newer version of GPT 3.5 Turbo is out in preview as of the time of this writing). The model has no information about the functionality, so for the first question, it gives an outdated answer, and for the second question, it starts hallucinating and just making up stuff. We need to find a way to teach it about the things the user is asking about.

We've already seen a way of teaching the LLM things: put it in the prompt. So let's try that. The blog post Performance Improvements in .NET 7, which was also posted after this version of the GPT 3.5 Turbo model was trained, contains a lengthy section on Regex improvements in .NET 7, including about that new RegexOptions.NonBacktracking option, so if we put it all into the prompt, that should provide the LLM with what it needs. Here I've just augmented the previous example with an additional section of code that downloads the contents of the web page, does a hack job of cleaning up the contents a bit, and then adds it all into a user message.

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

string apikey = Environment.GetEnvironmentVariable("AI:OpenAI:APIKey")!;

// Initialize the kernel
Kernel kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion("gpt-3.5-turbo-0125", apikey)
    .Build();

// Create a new chat
IChatCompletionService ai = kernel.GetRequiredService<IChatCompletionService>();
ChatHistory chat = new("You are an AI assistant that helps people find information.");
StringBuilder builder = new();

// Download a document and add all of its contents to our chat
using (HttpClient client = new())
{
    string s = await client.GetStringAsync("https://devblogs.microsoft.com/dotnet/performance_improvements_in_net_7");
    s = WebUtility.HtmlDecode(Regex.Replace(s, @"<[^>]+>|&nbsp;", ""));
    chat.AddUserMessage("Here's some additional information: " + s); // uh oh!
}

// Q&A loop
while (true)
{
    Console.Write("Question: ");
    chat.AddUserMessage(Console.ReadLine()!);

    builder.Clear();
    await foreach (var message in ai.GetStreamingChatMessageContentsAsync(chat))
    {
        Console.Write(message);
        builder.Append(message.Content);
    }
    Console.WriteLine();
    chat.AddAssistantMessage(builder.ToString());

    Console.WriteLine();
}

And the result?

Unhandled exception. Microsoft.SemanticKernel.AI.AIException: Invalid request: The request is not valid, HTTP status: 400
 ---> Azure.RequestFailedException: This model's maximum context length is 16384 tokens. However, your messages resulted in 155751 tokens. Please reduce the length of the messages.
Status: 400 (model_error)
ErrorCode: context_length_exceeded

Oops! Even without any additional history, we exceeded the context window by almost 10 times. We obviously need to include less information, but we still need to ensure it’s relevant information. RAG to the rescue.

“RAG,” or Retrieval Augmented Generation, is just a fancy way of saying “look up some stuff and put it into the prompt.” Rather than putting all possible information into the prompt, we’ll instead index all of the additional information we care about, and then when a question is asked, we’ll use that question to find the most relevant indexed content and put just that additional content into the prompt. And to help with that, we need embeddings.
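
In code terms, the overall shape of what we're about to build looks roughly like this (just an outline; the concrete pieces follow over the rest of the post):

// At indexing time (once, or whenever the source content changes):
//   1. Split the source documents into chunks.
//   2. Ask an embedding model for a vector representing each chunk.
//   3. Store each (vector, chunk text) pair.
//
// At question time (for every user question):
//   1. Ask the same embedding model for a vector representing the question.
//   2. Find the stored vectors closest to the question's vector (e.g. via cosine similarity).
//   3. Add the corresponding chunk text to the prompt alongside the question.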

Think of an “embedding” as a vector (array) of floating-point values that represents some content and its semantic meaning. We can ask a model specifically focused on embeddings to create such a vector for a particular input, and then we can store both the vector and the text that seeded it into a database. Later on, when a question is asked, we can similarly run that question through the same model, and we can use the resulting vector to look up the most relevant embeddings in our database. We’re not necessarily looking for exact matches, just ones that are close enough. And you can take “close” here literally; the lookups are typically performed using functions that use a distance measure, such as cosine similarity. For example, consider this program (to run this, you’ll need to add the System.Numerics.Tensors nuget package in order to have access to the TensorPrimitives type):

using Microsoft.SemanticKernel.Connectors.OpenAI;
using System.Numerics.Tensors;

#pragma warning disable SKEXP0011

string apikey = Environment.GetEnvironmentVariable("AI:OpenAI:APIKey")!;

var embeddingGen = new OpenAITextEmbeddingGenerationService("text-embedding-3-small", apikey);

string input = "What is an amphibian?";
string[] examples =
{
    "What is an amphibian?",
    "Cos'è un anfibio?",
    "A frog is an amphibian.",
    "Frogs, toads, and salamanders are all examples.",
    "Amphibians are four-limbed and ectothermic vertebrates of the class Amphibia.",
    "They are four-limbed and ectothermic vertebrates.",
    "A frog is green.",
    "A tree is green.",
    "It's not easy bein' green.",
    "A dog is a mammal.",
    "A dog is a man's best friend.",
    "You ain't never had a friend like me.",
    "Rachel, Monica, Phoebe, Joey, Chandler, Ross",
};

// Generate embeddings for each piece of text
ReadOnlyMemory<float> inputEmbedding = (await embeddingGen.GenerateEmbeddingsAsync([input]))[0];
IList<ReadOnlyMemory<float>> exampleEmbeddings = await embeddingGen.GenerateEmbeddingsAsync(examples);

// Print the cosine similarity between the input and each example
float[] similarity = exampleEmbeddings.Select(e => TensorPrimitives.CosineSimilarity(e.Span, inputEmbedding.Span)).ToArray();
similarity.AsSpan().Sort(examples.AsSpan(), (f1, f2) => f2.CompareTo(f1));
Console.WriteLine("Similarity Example");
for (int i = 0; i < similarity.Length; i++)
    Console.WriteLine($"{similarity[i]:F6}   {examples[i]}");

This uses the OpenAI embedding generation service to get an embedding vector (using the text-embedding-3-small model I mentioned at the beginning of the post) for both an input and a bunch of other pieces of text. It then compares the resulting embedding for the input against the resulting embedding for each of those other texts, sorts the results by similarity, and prints them out:

Similarity Example
1.000000   What is an amphibian?
0.937651   A frog is an amphibian.
0.902491   Amphibians are four-limbed and ectothermic vertebrates of the class Amphibia.
0.873569   Cos'è un anfibio?
0.866632   Frogs, toads, and salamanders are all examples.
0.857454   A frog is green.
0.842596   They are four-limbed and ectothermic vertebrates.
0.802171   A dog is a mammal.
0.784479   It's not easy bein' green.
0.778341   A tree is green.
0.756669   A dog is a man's best friend.
0.734219   You ain't never had a friend like me.
0.721176   Rachel, Monica, Phoebe, Joey, Chandler, Ross
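
For intuition, the cosine similarity of two vectors is just their dot product divided by the product of their magnitudes; identical directions score 1.0, and unrelated directions score near 0. TensorPrimitives.CosineSimilarity does this for us with vectorized code, but a hand-rolled equivalent, shown purely for illustration, would look like:

// Illustrative only: cosine(a, b) = (a · b) / (|a| * |b|)
static float Cosine(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    float dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}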

Let's incorporate this concept into our chat app. In this iteration, I've augmented the previous chat example with a few things:

  • To better help us see what's going on under the covers, I've enabled logging in the kernel. The kernel is actually a lightweight wrapper for a few pieces of information, including an IServiceProvider, which means you can use all of the services you're familiar with from elsewhere in .NET, including ILoggerFactory. The IKernelBuilder has a Services property that's an IServiceCollection, which means we can use Microsoft.Extensions.Logging and friends to enable logging just as we're used to doing in ASP.NET. That also means we need to add a couple of additional packages:
    dotnet add package Microsoft.Extensions.Logging
    dotnet add package Microsoft.Extensions.Logging.Console
  • Since we can bring in arbitrary services, we can also use the great support available in the ecosystem for resiliency. We're making many requests here over HTTP, which can fail for various infrastructure-related reasons (e.g. a server that's temporarily unreachable). And LLMs themselves often introduce their own failure modes, e.g. the caller's account only permits a certain number of interactions per second. Thus, we want to enable smart retries. To do that, we'll use dotnet add package Microsoft.Extensions.Http.Resilience to bring in automated resilience support that can be imported into the kernel. Then any HTTP requests created by any of the other components via the kernel will get retries applied automatically.
  • To enable SK to do the embedding generation on our behalf via its abstractions, we'll also need to add its Memory package (note the "--prerelease"... this is an evolving space, so while some of the SK components are considered stable, others are still evolving and are thus still marked as "prerelease"):
    dotnet add package Microsoft.SemanticKernel.Plugins.Memory --prerelease
  • I then need to create an ISemanticTextMemory to use for querying, which I do by using MemoryBuilder to combine an embeddings generator with a database. I've used the WithOpenAITextEmbeddingGeneration method to specify I want to use the OpenAI embedding service as my embeddings generator, and I've used WithMemoryStore to register a VolatileMemoryStore instance as the store (we'll change that later, but this will suffice for now). VolatileMemoryStore is simply an implementation of SK's IMemoryStore abstraction wrapping an in-memory dictionary.
  • I've taken the downloaded text, used SK's TextChunker to break it fairly arbitrarily into pieces, and then I've used SaveInformationAsync to save each of those pieces to the memory store. That call will use the embedding service to generate an embedding for the text and then store the resulting vector and the input text into the aforementioned dictionary.
  • Then, when it's time to ask a question, rather than just adding the question to the chat history and submitting that, we first use the question in a SearchAsync call on the memory store. That will again use the embedding service to get an embedding vector for the question, and then search the store for the closest vectors to that input. I've arbitrarily had it return the three closest matches; the code then appends together their associated text, adds that into the chat history, and submits the request. After submitting the request, I've also then removed this additional context from my chat history, so that it's not sent again on subsequent requests; this additional information can consume much of the allowed context window.

Here's our resulting program:

using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using Microsoft.SemanticKernel.Memory;
using Microsoft.SemanticKernel.Text;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

#pragma warning disable SKEXP0003, SKEXP0011, SKEXP0052, SKEXP0055 // Experimental

string apikey = Environment.GetEnvironmentVariable("AI:OpenAI:APIKey")!;

// Initialize the kernel
IKernelBuilder kb = Kernel.CreateBuilder();
kb.AddOpenAIChatCompletion("gpt-3.5-turbo-0125", apikey);
kb.Services.AddLogging(c => c.AddConsole().SetMinimumLevel(LogLevel.Trace));
kb.Services.ConfigureHttpClientDefaults(c => c.AddStandardResilienceHandler());
Kernel kernel = kb.Build();

// Download a document and create embeddings for it
ISemanticTextMemory memory = new MemoryBuilder()
    .WithLoggerFactory(kernel.LoggerFactory)
    .WithMemoryStore(new VolatileMemoryStore())
    .WithOpenAITextEmbeddingGeneration("text-embedding-3-small", apikey)
    .Build();
string collectionName = "net7perf";
using (HttpClient client = new())
{
    string s = await client.GetStringAsync("https://devblogs.microsoft.com/dotnet/performance_improvements_in_net_7");
    List<string> paragraphs =
        TextChunker.SplitPlainTextParagraphs(
            TextChunker.SplitPlainTextLines(
                WebUtility.HtmlDecode(Regex.Replace(s, @"<[^>]+>|&nbsp;", "")),
                128),
            1024);
    for (int i = 0; i < paragraphs.Count; i++)
        await memory.SaveInformationAsync(collectionName, paragraphs[i], $"paragraph{i}");
}

// Create a new chat
var ai = kernel.GetRequiredService<IChatCompletionService>();
ChatHistory chat = new("You are an AI assistant that helps people find information.");
StringBuilder builder = new();

// Q&A loop
while (true)
{
    Console.Write("Question: ");
    string question = Console.ReadLine()!;

    builder.Clear();
    await foreach (var result in memory.SearchAsync(collectionName, question, limit: 3))
        builder.AppendLine(result.Metadata.Text);
    int contextToRemove = -1;
    if (builder.Length != 0)
    {
        builder.Insert(0, "Here's some additional information: ");
        contextToRemove = chat.Count;
        chat.AddUserMessage(builder.ToString());
    }

    chat.AddUserMessage(question);

    builder.Clear();
    await foreach (var message in ai.GetStreamingChatMessageContentsAsync(chat))
    {
        Console.Write(message);
        builder.Append(message.Content);
    }
    Console.WriteLine();
    chat.AddAssistantMessage(builder.ToString());

    if (contextToRemove >= 0) chat.RemoveAt(contextToRemove);
    Console.WriteLine();
}

When I run this, I now see lots of logging happening: Embedding creation logging from Semantic Kernel. The text chunking code split the document into 163 "paragraphs," leading to 163 embeddings being generated and stored in our database. One or two of the resulting requests were also throttled, with the service sending back an error saying that too many requests were being issued in too short a period of time; the HttpClient used by SK automatically retried after a few seconds, at which point it was able to continue successfully. The cool thing is that with all of those embeddings in place, asking a question now pulls the most relevant material from the database and adds that additional text to the prompt, so when we ask the same questions we did earlier, we get a much more helpful and accurate response: LLM now correctly answers questions about functionality created after the model was trained. Sweet.

Persistence of Memory

Of course, we don't want to have to index the material every time the app restarts. Imagine this was a site that was enabling chatting with thousands of documents; reindexing all of that content every time a process was restarted would not only be time consuming, it would be unnecessarily expensive (the pricing details for the OpenAI embedding model I'm using here highlight that at the time of this writing it costs $0.0001 per 1,000 tokens, which means just this one document costs a few cents to index). So, we want to switch to using a database. SK provides a multitude of IMemoryStore implementations, and we can easily switch to one that actually persists the results. For example, let's switch to one based on Sqlite. For this, we need another NuGet package:

dotnet add package Microsoft.SemanticKernel.Connectors.Sqlite --prerelease

and with that, we can change just one line of code to switch from the VolatileMemoryStore:

.WithMemoryStore(new VolatileMemoryStore())

to the SqliteMemoryStore:

.WithMemoryStore(await SqliteMemoryStore.ConnectAsync("mydata.db"))

Sqlite is an embedded SQL database engine that runs in the same process and stores its data in regular disk files. Here, it'll connect to a mydata.db file, creating it if it doesn't already exist. Now, if we were to run that, we'd still end up creating the embeddings again, as in our previous example there wasn't any guard checking to see whether the data already existed. Thus, our final change is simply to guard that work:

IList<string> collections = await memory.GetCollectionsAsync();
if (!collections.Contains("net7perf"))
{
    ... // same code as before to download and process the document
}

You get the idea. Here's the full version using Sqlite:

using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using Microsoft.SemanticKernel.Connectors.Sqlite;
using Microsoft.SemanticKernel.Memory;
using Microsoft.SemanticKernel.Text;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

#pragma warning disable SKEXP0003, SKEXP0011, SKEXP0028, SKEXP0052, SKEXP0055 // Experimental

string apikey = Environment.GetEnvironmentVariable("AI:OpenAI:APIKey")!;

// Initialize the kernel
IKernelBuilder kb = Kernel.CreateBuilder();
kb.AddOpenAIChatCompletion("gpt-3.5-turbo-0125", apikey);
kb.Services.AddLogging(c => c.AddConsole());
kb.Services.ConfigureHttpClientDefaults(c => c.AddStandardResilienceHandler());
Kernel kernel = kb.Build();

// Download a document and create embeddings for it
ISemanticTextMemory memory = new MemoryBuilder()
    .WithLoggerFactory(kernel.LoggerFactory)
    .WithMemoryStore(await SqliteMemoryStore.ConnectAsync("mydata.db"))
    .WithOpenAITextEmbeddingGeneration("text-embedding-3-small", apikey)
    .Build();

IList<string> collections = await memory.GetCollectionsAsync();
string collectionName = "net7perf";
if (collections.Contains(collectionName))
{
    Console.WriteLine("Found database");
}
else
{
    using HttpClient client = new();
    string s = await client.GetStringAsync("https://devblogs.microsoft.com/dotnet/performance_improvements_in_net_7");
    List<string> paragraphs =
        TextChunker.SplitPlainTextParagraphs(
            TextChunker.SplitPlainTextLines(
                WebUtility.HtmlDecode(Regex.Replace(s, @"<[^>]+>|&nbsp;", "")),
                128),
            1024);
    for (int i = 0; i < paragraphs.Count; i++)
        await memory.SaveInformationAsync(collectionName, paragraphs[i], $"paragraph{i}");
    Console.WriteLine("Generated database");
}

// Create a new chat
var ai = kernel.GetRequiredService<IChatCompletionService>();
ChatHistory chat = new("You are an AI assistant that helps people find information.");
StringBuilder builder = new();

// Q&A loop
while (true)
{
    Console.Write("Question: ");
    string question = Console.ReadLine()!;

    builder.Clear();
    await foreach (var result in memory.SearchAsync(collectionName, question, limit: 3))
        builder.AppendLine(result.Metadata.Text);

    int contextToRemove = -1;
    if (builder.Length != 0)
    {
        builder.Insert(0, "Here's some additional information: ");
        contextToRemove = chat.Count;
        chat.AddUserMessage(builder.ToString());
    }

    chat.AddUserMessage(question);

    builder.Clear();
    await foreach (var message in ai.GetStreamingChatMessageContentsAsync(chat))
    {
        Console.Write(message);
        builder.Append(message.Content);
    }
    Console.WriteLine();
    chat.AddAssistantMessage(builder.ToString());

    if (contextToRemove >= 0) chat.RemoveAt(contextToRemove);
    Console.WriteLine();
}

Now when we run that, on first invocation we still end up indexing everything, but after that the data has all been indexed (Sqlite database on disk), and subsequent invocations are able to simply use it (Embeddings found in Sqlite database).

Of course, while Sqlite is an awesome tool, it's not optimized for doing these kinds of searches. In fact, the code for this SqliteMemoryStore in SK is simply enumerating the full contents of the database and doing a CosineSimilarity check on each:

// from https://github.com/microsoft/semantic-kernel/blob/52e317a79651898a6c135124241c9e7dcb0c02ae/dotnet/src/Connectors/Connectors.Memory.Sqlite/SqliteMemoryStore.cs#L136
await foreach (var record in this.GetAllAsync(collectionName, cancellationToken))
{
    if (record != null)
    {
        double similarity = TensorPrimitives.CosineSimilarity(embedding.Span, record.Embedding.Span);
        ...

For real scale, and to be able to share the data between multiple frontends, we'd want a real "vector database," one that's been designed for storing and searching embeddings. There are a multitude of such vector databases now available, including Azure AI Search, Chroma, Milvus, Pinecone, Qdrant, Weaviate, and more, all of which have memory store implementations for SK. We can simply stand up one of those (most of which have docker images readily available), change our WithMemoryStore call to use the appropriate connector, and we're cooking with gas.

So let's do that. You can choose whichever of these databases works well for your needs; for the purposes of this post, I've arbitrarily chosen Qdrant. Ensure you have docker up and running, and then issue the following command to pull down the Qdrant image:

docker pull qdrant/qdrant

Once you have that, you can start a container with it:

docker run -p 6333:6333 -v /qdrant_storage:/qdrant/storage qdrant/qdrant

And that's it; we now have a vector database up and running locally. Now we just need to use it instead. I add the relevant SK "connector" to my project:

dotnet add package Microsoft.SemanticKernel.Connectors.Qdrant --prerelease

and then change two lines of code, from:

using Microsoft.SemanticKernel.Connectors.Sqlite;
...
.WithMemoryStore(await SqliteMemoryStore.ConnectAsync("mydata.db"))

to:

using Microsoft.SemanticKernel.Connectors.Qdrant;
...
.WithMemoryStore(new QdrantMemoryStore("http://localhost:6333/", 1536))

And that's it! Now when I run it, I see a flurry of logging activity coming from Qdrant as the app stores all the embeddings: Console logging from Qdrant. We can use its dashboard to inspect the data that was stored: Qdrant web dashboard. And of course the app continues working happily: Using Qdrant from the chat console app

Hearing the Call

We've just implemented an end-to-end use of embeddings with a vector database, examining the input query, getting additional content based on that query, and augmenting the prompt submitted to the LLM in order to give it more context. That's the essence of RAG. However, there are other ways content can be retrieved, and you as the developer don't always need to be the one doing it. In fact, models themselves may be trained to ask for more information; the OpenAI models, for example, have been trained to support tools / function calls, where as part of the prompt they can be told about a set of functions they could invoke if they deem it valuable. See the "Function Calling" section of https://openai.com/blog/function-calling-and-other-api-updates. Essentially, as part of the chat message, you include a schema for any available function the LLM might want to use, and then if the LLM detects an opportunity to use it, rather than sending back a textual response to your message, it sends back a request for you to invoke the function, replete with the argument that should be provided. You then invoke the function and reissue your request, this time with both its function request and your function's response in the chat history.

As we saw earlier, SK supports creating strongly-typed function objects (KernelFunction), and collections of these functions (referred to as a "plugin") can be added to a Kernel. SK is then able to automatically handle all aspects of that function calling lifecycle for you: it can describe the shape of the functions, include the schema for the parameters in the chat message, parse function call request responses, invoke the relevant function, and send back the results, all without the developer needing to be in the loop (though the developer can be if desired).

Let's look at an example. Here I'm creating a Kernel that contains a single plugin, which in turn contains a single function (there are multiple ways these can be expressed and brought into a Kernel; here I'm just using one based on lambda functions in order to keep the post concise). When invoked, that function will look at the name of the person specified and return that person's age; I've hardcoded those ages here, but obviously this function could do anything and look anywhere in order to retrieve that information. Notice that the function is just returning an integer: SK handles marshaling of data in and out of the function, so arbitrary data types can be used and it handles the conversion of those types. I've also added some metadata to the function, so that SK can appropriately describe this function and its parameters to the LLM.

kernel.ImportPluginFromFunctions("Demographics",
[
    kernel.CreateFunctionFromMethod(
        [Description("Gets the age of the named person")]
        ([Description("The name of a person")] string name) => name switch
        {
            "Elsa" => 21,
            "Anna" => 18,
            _ => -1,
        }, "get_person_age")
]);

The only thing we then need to do is tell the IChatCompletionService that we want it to opt-in to automatic function calling, which we do by providing it with a PromptExecutionSettings object that's been configured appropriately. The end result of our whole program looks like this:

using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using System.ComponentModel;
using System.Text;

string apikey = Environment.GetEnvironmentVariable("AI:OpenAI:APIKey")!;

// Initialize the kernel
IKernelBuilder kb = Kernel.CreateBuilder();
kb.AddOpenAIChatCompletion("gpt-3.5-turbo-0125", apikey);
kb.Services.AddLogging(c => c.AddConsole().SetMinimumLevel(LogLevel.Trace));
Kernel kernel = kb.Build();

kernel.ImportPluginFromFunctions("Demographics",
[
    kernel.CreateFunctionFromMethod(
        [Description("Gets the age of the named person")]
        ([Description("The name of a person")] string name) => name switch
        {
            "Elsa" => 21,
            "Anna" => 18,
            _ => -1,
        }, "get_person_age")
]);

// Create a new chat
var ai = kernel.GetRequiredService<IChatCompletionService>();
ChatHistory chat = new("You are an AI assistant that helps people find information.");
StringBuilder builder = new();
OpenAIPromptExecutionSettings settings = new() { ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions };

// Q&A loop
while (true)
{
    Console.Write("Question: ");
    chat.AddUserMessage(Console.ReadLine()!);

    builder.Clear();
    await foreach (var message in ai.GetStreamingChatMessageContentsAsync(chat, settings, kernel))
    {
        Console.Write(message);
        builder.Append(message.Content);
    }
    Console.WriteLine();
    chat.AddAssistantMessage(builder.ToString());
}

And with that, we can see the LLM not only has access to what was included in the prompt, but is also able to effectively invoke this function to get the additional information it needs: Showing the results of function invocations

What's Next?

Whew! I've obviously omitted a lot of important details that any real application will need to consider. How should the data being indexed be cleaned and normalized and chunked? How should errors be handled? How should we restrict how much data is sent as part of each request (e.g. limiting chat history, limiting the size of the found embeddings)? In a service, where should all of this information be persisted? And a multitude of others, including making the UI much prettier than my amazing Console.WriteLine calls. But even with all of those details missing, it should hopefully be obvious now that you can get started incorporating this kind of functionality into your applications immediately. As mentioned, the space is evolving very quickly, and your feedback about what works well and what doesn't work well for you would be invaluable for the teams working on these solutions. I encourage you to join the discussions in the repos for Semantic Kernel and the Azure OpenAI client library.

Happy coding!

Topics
.net 8

Author

Stephen Toub - MSFT
Partner Software Engineer

Stephen Toub is a developer on the .NET team at Microsoft.

23 comments


  • Coz

    Very interesting post. I wanted to start building the chat app myself but unfortunately I cannot create an Open AI Azure resource. Apparently "Your subscription is not enabled to deploy Azure OpenAI Service. The service is currently in Limited Access Public Preview. Please click the banner to apply for access.". When I checked what's needed to apply for access I see that this is only for enterprises.
    I wonder how everyone else was able to...

    • King David Consulting LLC

      You can use open ai platform for this as well. https://platform.openai.com/overview

              if (_options.IsAzure)
              {
                  _builder.WithAzureChatCompletionService(
                      _options.ModelId,
                      _options.Endpoint.ToString(),
                      _options.Key);
              }
              else
              {
                  _builder.WithOpenAIChatCompletionService(
                      _options.Endpoint.ToString(),
                      _options.Key);
              }
  • Wouter Van Ranst · Edited

    Great article and excellent walkthrough!!

    As of 8/oct triggered the content management policy :) I m yet to find a workaround (same happened on gpt4, when changing dog to cat, when changing the name from Jane to Steve)

    <code>

    • Stephen Toub - MSFT · Microsoft employee · Author

      Thanks for pointing that out! I’ve updated the example to not trip the policy, and I’ve also followed up with the relevant teams about it.

    • Wouter Van Ranst

      Changing the age from 8 to 24 does the trick

      • Wouter Van Ranst

        I got a subsequent error where it cannot do basic math

        <code>

        I added this custom function (and needed to put the age otherwise the dog's age triggers another content management policy :))

        <code>

        this is the full listing

        <code>

  • Dhananjaya Kuppu

    Thanks great article !! you explain so well that is easier to understand. Looking forward for more articles on dotnet Open AI

    • Stephen Toub - MSFT · Microsoft employee · Author

      Thanks great article !! you explain so well that is easier to understand.

      Excellent, thanks!

  • Dalibor Čarapić

    @Stephen Toub
    Can you perhaps recommend some additional blog posts/articles which deal with using LLMs as a programmer?

  • Dalibor Čarapić

    Thank you very much for this. Very nice post and it helped me a lot to understand why all of that stuff even exists.

  • Midnight

    Great article, thank you! The explanations are easy and clear for getting started. Will definetly try the topic.

    But still a lot of black boxes around (which isn’t a problem at all, for now). Probably in future we’ll need some deep dives like “How async/await really works” but about the parts that makes it fly.

    P. S. Waiting for the you-know-what-posts-are-the-longest-you’ve-ever-seen series. 🙂

  • Frenchy.André

    I often print out those long articles and read them offline. Though, the screenshots from the Terminal makes it really wasteful on the ink. Could you possibly use a light-theme for the terminal in future posts? Would really appreciate it 🙂

    • Stephen Toub - MSFT · Microsoft employee · Author

      Thanks for the feedback. Hadn’t occurred to me to change the terminal theme. I’ll certainly consider that for the future.

  • enrico sabbadin

    As far as I know 0301 has 4k tokens .. 0613 has 16k

    • Stephen Toub - MSFT · Microsoft employee · Author

      If I submit a gigantic prompt, I get the following errors:
      For gpt-35-turbo 0301: “This model’s maximum context length is 16384 tokens”
      For gpt-35-turbo 0613: “This model’s maximum context length is 4096 tokens”
      For gpt-35-turbo-16k 0613: “This model’s maximum context length is 16384 tokens”

  • MgSam

    Thanks for the walkthrough. As always, this is really well done.

    I hope in the future there will be better ways of passing data to these models besides just putting everything into a giant text prompt, which seems to be the only method right now. This current generation of the LLMs being these magical black boxes that we throw text into seems really clumsy and awkward and leads to a lot of severe limitations.

    • Stephen Toub - MSFT · Microsoft employee · Author

      Thanks for the walkthrough. As always, this is really well done.

      Thanks.

  • 王宏亮 · Edited

    Looking forward to your brilliant insights into .net 8 performance improvements!

      • Daniel Smith

        Hey, now I’m wondering if you use generative AI to write those epic performance improvement blog posts! 😂

      • Stephen Toub - MSFT · Microsoft employee · Author · Edited

        Funny you should ask that: I actually talk about that in the intro to the post, but you’ll need to wait a week to see it 🙂

        I do appreciate all the interest. Hopefully it lives up to the hype 🙂