Build Intelligent Applications using ChatGPT & Azure Cosmos DB
“Cosmos DB is the go-to database powering the world’s most demanding workloads at any scale.
OpenAI relies on Cosmos DB to dynamically scale their ChatGPT service – one of the fastest-growing consumer apps ever – enabling high reliability and low maintenance.” – Satya Nadella, Microsoft chairman and chief executive officer
As the need for intelligent applications expands, developers are increasingly relying on artificial intelligence (AI) and machine learning (ML) to enrich their applications’ capabilities. Among the numerous conversational AI delivery methods, chatbots have become highly sought after. OpenAI’s ChatGPT, a robust large language model (LLM), is an excellent solution for creating chatbots that can comprehend natural language and provide intelligent replies. ChatGPT has gained popularity since its launch in November 2022.
In this blog post, my colleague Sandeep Nair and I walk through our experience learning about the large language models that power OpenAI’s ChatGPT service and API by creating a Cosmos DB + ChatGPT sample application that mimics its functionality, albeit to a lesser extent. Our sample combines ChatGPT with Azure Cosmos DB, which persists all of the data for the application. Throughout this blog post, as we go through the sample we built, we’ll also explore other ways that combining a database like Azure Cosmos DB with a large language model can enhance the experience for users of intelligent applications. As you’ll come to see, combining Azure Cosmos DB with ChatGPT provided benefits that turned out to be more valuable than we anticipated.
To run this sample application, you need to have access to Azure OpenAI Service. To get access within your Azure Subscription, apply here for Azure OpenAI Service.
Our Sample ChatGPT App
Let’s get to the application. Our application shows some of the ChatGPT service functionality that people are familiar with. Down the left-hand side is a list of conversations, or “chat sessions”. You can click on each of these to open a different chat session, and you can also rename or delete them. Within each chat session are “chat messages”, each identified by a “sender”, either User or Assistant. The messages are listed in ascending chronological order and displayed using the Humanizer NuGet package. A text box at the bottom is used to type a new prompt to add to the session. Finally, the number of tokens consumed by requests to the Azure OpenAI Service is displayed, both for each chat message and for the entire session. We’ll talk more about tokens later.
Before getting too far here, some definitions. When interacting with a Large Language Model (LLM) service like ChatGPT, you would typically ask a question and the service gives you an answer. To put these into the vernacular, the user supplies a “prompt”, and the service provides a “completion”. This sample was initially created using the text-davinci-003 model. The current ChatGPT application is based upon a newer model, gpt-3.5-turbo-0301.
Below is the architecture for our sample. The front-end is a Blazor web application hosted in Azure App Service. This connects to Azure Cosmos DB as the database and the Azure OpenAI service which hosts the ChatGPT model. To make it as easy as possible to deploy our sample application, look for the “Deploy to Azure” button in the readme file for our sample on GitHub. The ARM template will handle all the connection information, so you don’t have to copy and paste keys. It’s completely zero-touch deployment.
Here is the data model for our application. It’s very simple, with only two classes: chat session and chat message. Everything is stored in a single container.
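The actual model classes are C# in the sample’s GitHub repo, but conceptually they look something like the sketch below (a Python approximation; the property names are illustrative, not the sample’s verbatim definitions):

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class ChatMessage:
    # Both document types share the session's id as the partition key,
    # so an entire conversation can be read with a single-partition query.
    chat_session_id: str
    sender: str            # "User" or "Assistant"
    text: str
    tokens: int = 0        # cost of this prompt or completion
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    type: str = "ChatMessage"

@dataclass
class ChatSession:
    chat_session_id: str
    name: str              # summarized title, renamed by the user or by ChatGPT
    tokens: int = 0        # running total for the whole session
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    type: str = "ChatSession"
```

Keeping both document types in one container, discriminated by a `type` property, is a common Cosmos DB pattern for small models like this.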
Giving ChatGPT Memories
If you’ve used ChatGPT at chat.openai.com you may have noticed that in addition to just answering single prompts, you can have a conversation with it as well. ChatGPT gives you an answer, you ask a follow-up without any additional context and ChatGPT responds in a contextually correct way as if you were having a conversation with it.
When designing our sample application using the underlying LLM model, my initial assumption was that LLMs have some clever way of preserving user context between completions, and that this was somehow part of the LLM available on Azure. I wanted to see this in action, so I tested the service’s API with prompts I had typed in. However, the service couldn’t remember anything. The completions were completely disjointed.
Let me show you. In the chat below I asked what the seating capacity of Lumen Field in Seattle is. However, in my follow-up, “Is Dodger Stadium bigger?”, it gives a response that is incorrect. You can’t answer yes to that question when the stadium in the follow-up is smaller, not larger, than the one asked about previously. It looks as though our app is either a huge Dodger fan in denial about the stadium size, or hallucinating that Dodger Stadium is larger than Lumen Field.
What exactly is happening here? Well, it turns out my assumption was not correct. The underlying large language model, whether gpt-3.5-turbo-0301 or any other LLM, does not have some clever mechanism for maintaining context to enable a conversation. That is something YOU need to provide so that the LLM can respond appropriately.
One way to do this is to send the previous prompts and completions back to the service with the latest prompt appended for it to respond to. With the full history of the conversation, it now has the information necessary to respond in context and correctly. It can now infer what we mean when we ask, “Is that bigger than Dodger Stadium?”.
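In pseudocode terms, each request concatenates the prior exchange with the new prompt. Here is a minimal Python sketch of that idea (the sample itself is a C# Blazor app, so this is an approximation, not its actual code):

```python
def build_prompt_with_history(history, new_prompt):
    """Prepend every earlier prompt/completion so the model can answer in context.

    `history` is a list of (sender, text) tuples in chronological order;
    the new user prompt is appended at the end.
    """
    lines = [f"{sender}: {text}" for sender, text in history]
    lines.append(f"User: {new_prompt}")
    return "\n".join(lines)
```

With the full transcript in front of it, the model can resolve references like “that” or “it” in a follow-up question.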
Let’s look at the same chat when I send the previous prompts and completions in my follow-up.
Clearly the response is now contextually aligned with the rest of the conversation. The sample application is now able to mimic ChatGPT’s conversational experience for users.
Some Practical Considerations
While providing a history of a chat is a simple solution, there are some limitations. These large language models limit how much text you can send in a request. This is gated by “tokens”. Tokens are a form of compute currency and can vary in length from a single character to a whole word. They are allocated by the service on a per-request basis for a deployed model. The maximum allowed varies across models; most of the GPT-family LLMs have a maximum of 4,096 tokens. In the sample we built, we tested various values for tokens. Here are some things to consider.
First, tokens are consumed in both the request and the response. If you use 4096 tokens sending a complete history in the request, you won’t get anything back in the response because your tokens were all consumed processing the text you sent in the request.
Second, setting the token limit aside, sending large payloads of text on every request is just not something you really want to do. It’s expensive on the client, consumes lots of bandwidth, and increases overall latency.
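The first consideration can be made concrete with simple arithmetic: request and response draw from the same per-call limit, so whatever the prompt consumes is unavailable for the completion. A trivial sketch (the 4,096 figure is the limit discussed above):

```python
MODEL_TOKEN_LIMIT = 4096

def completion_budget(request_tokens, limit=MODEL_TOKEN_LIMIT):
    """Tokens left over for the completion after the request is counted.

    If the request alone hits the limit, the service has nothing left
    to spend on generating a response.
    """
    return max(limit - request_tokens, 0)
```

So a request that packs in 4,096 tokens of history leaves a budget of zero for the answer.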
How We Implemented Memory
To deal with these practical considerations, we limited the amount of memory by sending only the most recent prompts and completions for that conversation. This allows the model to respond with the context it needs to carry on a conversation most of the time. However, if a conversation drifts significantly over time, context is increasingly lost, and follow-up questions about older prompts result in contextually incorrect responses again. Still, this works well enough.
Our initial implementation was quite simple: we just set a maxConversationLength of 2,000 bytes. A token can represent anything from a single character up to a whole word, so to be cautious we figured that 2,000 bytes would give us enough history to provide some conversational memory without sacrificing complete responses.
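A sketch of that first, byte-counting approach (Python for illustration; the real implementation is C#, and 2,000 is the cap we chose):

```python
def trim_history_by_length(messages, max_length=2000):
    """Keep only the most recent messages whose combined text fits the cap.

    `messages` is a list of message texts in chronological order; we walk
    it backwards so the newest context survives, then restore the
    original order before sending the history to the model.
    """
    kept, used = [], 0
    for text in reversed(messages):
        if used + len(text) > max_length:
            break
        kept.append(text)
        used += len(text)
    kept.reverse()
    return kept
```

The weakness, as noted below, is that character count is only a rough proxy for token count.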
Our next iteration however took a more precise approach. OpenAI’s API can tell you how many tokens your requests (prompts) and responses (completions) cost. With that data it was simple to add a tokens property to our object model and save the cost for each prompt and completion as well as a total for the entire session. We could then set an arbitrary amount for tokens to maintain our conversation. This approach made it much easier to raise it up and down during testing until we were satisfied.
Here you can see our implementation. We loop through the messages, adding them to an array and counting the tokens consumed, then break out of the loop once the total goes over the _maxConversationTokens amount. Because we build the conversation from most recent to oldest, the array is built in reverse. To put the conversation in the correct order, we call Reverse() on the List object in a Join function and pass the result back to the caller.
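The C# method itself lives in the sample’s repo, but the logic translates to something like this Python sketch (field names are approximations of the sample’s):

```python
def get_conversation_history(messages, max_conversation_tokens):
    """Walk the chat from newest to oldest, keeping messages until their
    recorded token cost exceeds the cap, then restore chronological order.

    Each message carries the token count the OpenAI API reported for it,
    so no re-tokenization is needed here.
    """
    recent = []
    token_count = 0
    for message in reversed(messages):          # newest first
        token_count += message["tokens"]
        if token_count > max_conversation_tokens:
            break
        recent.append(message["text"])
    recent.reverse()                            # back to oldest-first
    return "\n".join(recent)
```

Because the API reports exact token costs, raising or lowering the cap during testing is a one-line change.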
Make ChatGPT Do It
There is another approach for maintaining conversational context which I’ve seen others suggest and you might consider. You can maintain context for a conversation by asking ChatGPT to summarize or categorize the conversation, then use that instead of sending the full or partial history of the conversation in each request. The benefit of course is drastically reducing the amount of text needed to maintain that context.
There are however some limitations to this approach. For instance, how often should you refresh the summary? How do subsequent refreshes maintain fidelity? Can one maintain fidelity over a long conversation with limited tokens? After pondering these and other questions, we decided this wasn’t the correct approach for us. Our sample application’s implementation was good enough for most use cases.
However, we did end up using this approach, just for something different. If you’ve used ChatGPT, you may have noticed it renames the chat with a summary of what you asked. We decided to do the same thing and use ChatGPT to summarize the chat session based on the first prompt. Here is how we did it.
We tried a bunch of different prompts here as well. It turns out that just telling the model what you want and why works best. It’s worth pointing out that this was much easier with the older text-davinci-003 model. When we migrated to gpt-3.5-turbo-0301, what we had simply would not work anymore. We spent a few weeks testing different prompts as well as tweaking Temperature and NucleusSamplingFactor to try to find the right combination. It turns out that gpt-3.5-turbo-0301 does not do a very good job of following system prompts, which is likely why we struggled. It still sometimes returns a summary of the chat that isn’t quite right, but that’s ok.
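The exact prompt we settled on is in the sample’s repo, but the shape of the call looks roughly like this (a Python sketch using the chat-message format; the instruction text is illustrative, not our verbatim prompt):

```python
def build_summarize_request(first_prompt):
    """Ask the model for a short session title. Plainly telling it what
    we want and why worked better than clever prompt tricks."""
    return [
        {"role": "system",
         "content": ("Summarize the following user text in three words "
                     "or fewer so it can be used as a short chat session "
                     "title. Do not use quotation marks or punctuation.")},
        {"role": "user", "content": first_prompt},
    ]
```

The resulting completion is then written back to the chat session document as its new name.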
Instructing Chat Models
It turns out that what I’ve described in giving an LLM a memory is part of a larger concept around prompts that provide the model with the greater context needed to interact with users. There are different types of prompts as well. In addition to user prompts that provide context for a conversation, you can use starting, or system, prompts to instruct a chat model to behave a certain way. This can cover lots of things, from being “friendly and helpful” to “factual and concise” or even “playful and snarky”.
One thing that most people don’t realize is that large language models like GPT are probabilistic, not deterministic. Each time you send a prompt for a completion, there are “probabilities” at work in generating that completion. It’s a bit strange to wrap your head around. Most developers expect things to work the same every time they run the same code; that’s not what happens in these systems. Two things that can help in this regard are Temperature and NucleusSamplingFactor. In our testing, we found that the combination we used reduced the chances of ChatGPT returning a poor summarization for a given user prompt.
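For reference, the .NET SDK’s Temperature and NucleusSamplingFactor properties correspond to the `temperature` and `top_p` parameters in the underlying REST API. A sketch of pinning them down for more repeatable output (the specific values here are illustrative, not the sample’s tuned settings):

```python
def summarization_options(prompt_messages):
    """Request options biased toward repeatable completions.

    Lower temperature and a tighter nucleus (top_p) narrow the set of
    tokens the model samples from, making output more consistent,
    though never fully deterministic.
    """
    return {
        "messages": prompt_messages,
        "temperature": 0.0,   # prefer the highest-probability tokens
        "top_p": 1.0,         # nucleus sampling factor; 1.0 = no truncation
        "max_tokens": 30,     # a session title only needs a few tokens
    }
```

OpenAI’s guidance is generally to adjust one of the two sampling parameters rather than both at once.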
Exploring the possibilities
This process of writing a sample application gave me lots to think about. Our intention was to create a simple intelligent chat application that had some ChatGPT features using Cosmos DB to store chat messages. But it got me starting to think about how else I could combine a database like Azure Cosmos DB with a large language model like gpt-3.5-turbo-0301. It didn’t take me long to think of various scenarios.
For instance, if I was building a chatbot retail experience, I could use starting prompts to load user profile information for a user’s web session. This could include information on product or other recommendations. “Hi Mark, your mom’s birthday is next month, would you like to see some gift ideas?” Or simply being able to ask natural language questions without having to scroll through products. “Do you have red socks in size 10 available?” There are limitless possibilities.
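One way to sketch that retail idea: pull the user’s profile from the database at session start and fold it into the starting prompt (a hypothetical Python illustration; the field names and wording are invented, not part of our sample):

```python
def build_retail_system_prompt(profile):
    """Seed the chat with per-user context loaded from the database,
    so the assistant can proactively surface recommendations."""
    reminders = "; ".join(profile.get("upcoming_events", []))
    return (
        f"You are a friendly retail assistant. The customer's name is "
        f"{profile['name']}. Upcoming events to mention if relevant: "
        f"{reminders or 'none'}."
    )
```

The same pattern works for product recommendations, order history, or loyalty data: anything queryable per-user at session start.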
Prompts are great at providing information and arming a chatbot with some simple context. However, there’s a limit to how much data you can feed into a prompt. What if you wanted to build an intelligent chat that had access to millions of documents and domain-specific data? You can’t cram gigabytes of data into a prompt. For this scenario you need to do what is called model fine-tuning. Not only does this result in better responses, but it saves tokens and reduces latency thanks to smaller request payloads.
This is, however, not an approach that makes sense for most users. There is another way, probably a better approach for most people, called Retrieval Augmented Generation (RAG). There is a good blog post from Pablo Castro that describes what it is and how it works using Azure Cognitive Search. It also links to a GitHub repository where you can try it for yourself.
I’m already working on implementing another sample that combines Azure Cosmos DB with other services to build this RAG capability. All this starts with the data in your database!
We’ve shown here what you can do with just a simple example, and we’re going to look further. Our team is busy working to enable users to create intelligent applications and services using Azure OpenAI and Azure Cosmos DB. Stay tuned, as you can expect more from us in this space: product announcements, additional samples, and blog posts. What an exciting time for all of us.
Start your journey here!
- Download and explore our Azure Cosmos DB + ChatGPT Sample App on GitHub
- Get your Free Azure Cosmos DB Trial
- OpenAI Platform documentation