{"id":3499,"date":"2023-09-21T11:46:07","date_gmt":"2023-09-21T18:46:07","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/surface-duo\/?p=3499"},"modified":"2024-01-03T16:19:46","modified_gmt":"2024-01-04T00:19:46","slug":"android-openai-chatgpt-19","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-19\/","title":{"rendered":"Infinite chat with history embeddings"},"content":{"rendered":"<p>\n  Hello prompt engineers,\n<\/p>\n<p>\n  The last few posts have been about the different ways to create an \u2018infinite chat\u2019, where the conversation between the user and an LLM model is not limited by the token size limit and as much historical context as possible can be used to answer future queries. We previously covered:\n<\/p>\n<ul>\n<li><a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-16\/\">Sliding window<\/a>\n  <\/li>\n<li><a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-18\/\">Summarization<\/a>\n  <\/li>\n<\/ul>\n<p>\n  These are techniques to help better manage the message history, but they don\u2019t really provide for \u201cinfinite\u201d memory. This week, we will investigate storing the entire chat history with embeddings, which should get us closer to the idea of \u201cinfinite chat\u201d.\n<\/p>\n<p>\n  One of the first features added to the <a href=\"https:\/\/github.com\/conceptdev\/droidcon-sf-23\/\">Jetchat droidcon demo<\/a> was <a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-7\/\">using embeddings to answer questions<\/a> about conference sessions. In this post, we\u2019ll use the same technique to (hopefully) create a chat which \u201cremembers\u201d the conversation history. Spoiler alert \u2013 it\u2019s relatively easy to build the <em>mechanism<\/em>, but it also exposes the challenges of saving the right data, in the right format, and retrieving it at the right time. 
You can play around with all of these aspects in your own implementations \ud83d\ude0a\n<\/p>\n<p>\n  For more information on the underlying theory of chat memory using a vector store, see the <a href=\"https:\/\/python.langchain.com\/docs\/modules\/memory\/types\/vectorstore_retriever_memory\">LangChain docs<\/a>, which discuss a similar approach.\n<\/p>\n<h2>Background and testing<\/h2>\n<p>\n  The JetchatAI sample has multiple chat \u201cchannels\u201d:\n<\/p>\n<ul>\n<li><strong>#jetchat-ai<\/strong> \u2013 Basic open-ended chat with the OpenAI endpoint, including the <a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-9-functions\/\">live US weather function<\/a> and using the <a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-5\/#generate-an-image\">image generation endpoint<\/a>.\n  <\/li>\n<li><strong>#droidcon-chat <\/strong>\u2013 Directed chat aimed at answering questions about the droidcon SF conference, includes embeddings of the conference sessions and functions to list conference info as well as save and view favorites.\n  <\/li>\n<\/ul>\n<p>\n  The <strong>#droidcon-chat<\/strong> channel already includes the sliding window and history summary discussed in the past few weeks, and it\u2019s already backed by a static embedding data store. In this post, we\u2019re going back to the basics and adding the chat history embeddings to the <strong>#jetchat-ai<\/strong> channel. 
\n<\/p>\n<p>\n  <img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-chat-description-automatically-4.png\" class=\"wp-image-3500\" alt=\"Screenshot of channel chooser in Jetchat sample\" width=\"600\" srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-chat-description-automatically-4.png 922w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-chat-description-automatically-4-300x118.png 300w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-chat-description-automatically-4-768x302.png 768w\" sizes=\"(max-width: 922px) 100vw, 922px\" \/><br\/><em>Figure 1: switching \u201cchannels\u201d in JetchatAI<\/em>\n<\/p>\n<p>\n  Because the <strong>#jetchat-ai<\/strong> channel is just an open discussion, testing will be based on questions and answers about travel, food, and any of the kinds of things that people like asking ChatGPT!\n<\/p>\n<h3>First implement the sliding window<\/h3>\n<p>\n  In the <a href=\"https:\/\/github.com\/conceptdev\/droidcon-sf-23\/pull\/14\/files\">pull request for this feature<\/a>, you will see that before building the <em>embeddings history<\/em>, we first added the sliding window feature that was previously implemented in the #droidcon-chat. The code changes in <strong>OpenAIWrapper.kt<\/strong>, <strong>SlidingWindow.kt<\/strong> and <strong>SummarizeHistory.kt<\/strong> are mostly to port this already built-and-tested feature. Refer to the earlier <a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-16\/\">blog post<\/a> for details on how this works.\n<\/p>\n<p>\n  With the sliding window implemented, the chat will not hit the API token limit and can be conducted over a long period of time. 
\n<\/p>\n<blockquote><p>Note that the demo app doesn\u2019t persist chats across app restarts, so while we\u2019re building the functionality for an \u201cinfinite chat\u201d, it can\u2019t be literally infinite just yet! For now, we\u2019re focusing on when the context would otherwise be outside the API token limit.<\/p><\/blockquote>\n<h2>Embedding the chat history<\/h2>\n<p>\n  The mechanism for creating and retrieving embeddings of messages is basically the same approach used for the droidcon conference schedule:\n<\/p>\n<ol>\n<li>\n  Call the embedding API with text from each chat message (we\u2019ll combine the user query and the model response into one piece of text).\n<\/li>\n<li>\n  Store the resulting vector along with the text (preferably in a database, but we\u2019ll start with in-memory).\n<\/li>\n<li>\n  Create an embedding vector for each new user query.\n<\/li>\n<li>\n  Match the new vector against everything in history, looking for similar context (high embedding similarity).\n<\/li>\n<li>\n  Add the text from similar embeddings to the prompt along with the new user query. If there are no matches, no additional context is added to the prompt.\n<\/li>\n<li>\n  The model takes the included context into account and, hopefully, answers with the expected historical knowledge.\n<\/li>\n<\/ol>\n<h3>Make history<\/h3>\n<p>\n  The code for steps 1 and 2 is shown in Figure 2. Every time an interaction occurs (a user query and a response from the model), the two are concatenated and an embedding is generated for the combined text. The vector and its text are added to a datastore that subsequent user queries are matched against.\n<\/p>\n<p>\n  The query and response are concatenated so that the resulting embedding has the best chance of matching future user queries that might be related. 
Choosing what to embed is a huge part of the success (or not) of this approach (and similar approaches like <a href=\"https:\/\/www.pinecone.io\/learn\/chunking-strategies\/\">chunking<\/a> and embedding documents), and the right design choice for your application could be different.\n<\/p>\n<pre>suspend fun storeInHistory (openAI: OpenAI, dbHelper: HistoryDbHelper, user: CustomChatMessage, bot: CustomChatMessage) {\r\n    val contentToEmbed = user.userContent + \" \" + bot.userContent  \/\/ concatenate query and response\r\n    val embeddingRequest = EmbeddingRequest(\r\n                model = ModelId(Constants.OPENAI_EMBED_MODEL),\r\n                input = listOf(contentToEmbed)\r\n    )\r\n    val embedding = openAI.embeddings(embeddingRequest)\r\n    val vector = embedding.embeddings[0].embedding.toDoubleArray()\r\n    historyCache[vector] = contentToEmbed  \/\/ add to in-memory history cache\r\n}<\/pre>\n<p><em>Figure 2: This function creates an embedding for each user interaction with the model<\/em>\n<\/p>\n<h3>RAG-time<\/h3>\n<p>\n  Steps 3, 4 and 5 are addressed in the code in Figure 3 below. Just like the <a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-7\/\">droidcon conference demo<\/a>, we\u2019ve created a \u201cretrieval-augmented generation\u201d where the query will include the historical context with close vector similarity, wrapped in prompt text that guides the model to answer:\n<\/p>\n<pre>suspend fun groundInHistory (openAI: OpenAI, dbHelper: HistoryDbHelper, message: String): String {\r\n    var messagePreamble = \"\"\r\n    var messageVector: DoubleArray? 
= null\r\n    val embeddingRequest = EmbeddingRequest(\r\n        model = ModelId(Constants.OPENAI_EMBED_MODEL),\r\n        input = listOf(message)\r\n    )\r\n    val embedding = openAI.embeddings(embeddingRequest)\r\n    messageVector = embedding.embeddings[0].embedding.toDoubleArray()\r\n    \/\/ find the best-matching history items\r\n    var sortedVectors: SortedMap&lt;Double, String&gt; = sortedMapOf()\r\n    for (pastMessagePair in historyCache) {\r\n        val v = messageVector!! dot pastMessagePair.key\r\n        sortedVectors[v] = pastMessagePair.value\r\n    }\r\n    if (sortedVectors.size &lt;= 0) return \"\" \/\/ nothing found\r\n    if (sortedVectors.lastKey() &gt; 0.8) { \/\/ arbitrary match threshold\r\n        messagePreamble =\r\n            \"Following are some older interactions from this chat:\\n\\n\"\r\n        for (pastMessagePair in sortedVectors.tailMap(0.8)) {\r\n            messagePreamble += pastMessagePair.value + \"\\n\\n\"\r\n        }\r\n        messagePreamble += \"\\n\\nUse the above information to answer the following question:\\n\\n\"\r\n    }\r\n    return messagePreamble\r\n}<\/pre>\n<p><em>Figure 3: Code that creates an embedding for the user query, finds similar entries in the history, and pulls that content into the prompt as grounding<\/em>\n<\/p>\n<p>\n  There are two important implementation choices to call out in this function:\n<\/p>\n<ul>\n<li>\n    The arbitrary match threshold of <code>0.8<\/code> determines how much historical context is pulled in. The value of <code>0.8<\/code> was selected purely based on how well it worked in the droidcon demo. Each application will need to adapt the criteria for matching historical context according to its user scenarios. \n  <\/li>\n<li>\n    The historical context itself is wrapped in a prompt to guide the model. The wording of this prompt should try to ensure the model only refers to the context when needed, and that it doesn\u2019t rely on it too heavily. 
The basic format is like this:<br\/><\/p>\n<pre>Following are some older interactions from this chat:\r\n\r\n&lt;MATCHING EMBEDDINGS&gt;\r\n\r\nUse the above information to answer the following question:\r\n\r\n&lt;USER QUERY GOES HERE&gt;<\/pre>\n<\/li>\n<\/ul>\n<p>\n  Once the grounding prompt has been created, we just need to pass it back to the model along with the new user query:\n<\/p>\n<pre>val relevantHistory = EmbeddingHistory.groundInHistory(openAI, dbHelper, message)\r\nval userMessage = CustomChatMessage(\r\n    role = ChatRole.User,\r\n    grounding = relevantHistory,\r\n    userContent = message\r\n)\r\nconversation.add(userMessage)<\/pre>\n<p><em>Figure 4: The history returned from the embeddings is added as grounding context before the user query is sent to the model<\/em>\n<\/p>\n<p>\n  The model will respond using the retrieved historical context to generate the correct answer, which itself will be added to the history datastore for future reference!\n<\/p>\n<h3>Example retrieval<\/h3>\n<p>\n  The screenshots below show a conversation where facts discussed early in the conversation are correctly recalled after the sliding window limit has been exceeded. 
In this case, I commented that I liked the Eiffel Tower, and this context was used to answer the query \u201cwhat\u2019s my favorite paris landmark\u201d:\n<\/p>\n<p>\n  <img decoding=\"async\" width=\"1363\" height=\"829\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/screens-screenshot-of-a-phone-description-automat.png\" class=\"wp-image-3501\" alt=\"Screenshots of a JetchatAI chat where earlier interactions are recalled\" srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/screens-screenshot-of-a-phone-description-automat.png 1363w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/screens-screenshot-of-a-phone-description-automat-300x182.png 300w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/screens-screenshot-of-a-phone-description-automat-1024x623.png 1024w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/screens-screenshot-of-a-phone-description-automat-768x467.png 768w\" sizes=\"(max-width: 1363px) 100vw, 1363px\" \/>\n<\/p>\n<p><em>Figure 5: chat history where the initial interactions are \u2018remembered\u2019 even after they\u2019ve slipped out of the sliding window<\/em>\n<\/p>\n<p>\n  If you try these queries and watch the logcat, you\u2019ll see that both of the query+response pairs on the left screen are vector similarity matches for the query \u201cwhat\u2019s my favorite paris landmark\u201d (0.844 and 0.855 respectively) even though the first query won\u2019t help answer the question. The prompt looked like this:\n<\/p>\n<pre>Following are some older interactions from this chat:\r\n\r\ni like the eiffel tower - tell me more about it The Eiffel Tower is an iconic landmark in Paris, France. It is a wrought-iron lattice tower that stands at a height of 330 meters (1,083 feet). 
Built in 1889, it was initially criticized but soon became a symbol of Paris and a major tourist attraction. Visitors can enjoy breathtaking views of the city from its observation decks at different levels. The Eiffel Tower is also illuminated at night, creating a stunning spectacle.\r\n\r\nwhat are some popular landmarks in paris Some popular landmarks in Paris include the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, Champs-\u00c9lys\u00e9es, and Montmartre.\r\n\r\nUse the above information to answer the following question:\r\n\r\nWhat\u2019s my favorite paris landmark<\/pre>\n<p>\n  Compare this result with a different question about \u201cwhat\u2019s my favorite sydney landmark\u201d. Sydney has not been mentioned in this chat before; however, the previous query \u201cwhat\u2019s my favorite paris landmark\u201d still has a strong vector similarity, so it is included in the prompt:\n<\/p>\n<pre>Following are some older interactions from this chat:\r\n\r\nWhat\u2019s my favorite paris landmark Your favorite Paris landmark is the Eiffel Tower.\r\n\r\nUse the above information to answer the following question:\r\n\r\nWhat\u2019s my favorite sydney landmark<\/pre>\n<p>\n  The model figures it out \u2013 ignoring the irrelevant context we retrieved \u2013 and answers with a figurative shrug:\n<\/p>\n<p>\n  <img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-chat-description-automatically-5.png\" class=\"wp-image-3502\" alt=\"\" width=\"500\" srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-chat-description-automatically-5.png 1434w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-chat-description-automatically-5-300x105.png 300w, 
https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-chat-description-automatically-5-1024x358.png 1024w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-chat-description-automatically-5-768x269.png 768w\" sizes=\"(max-width: 1434px) 100vw, 1434px\" \/><br\/><em>Figure 6: When grounding doesn\u2019t answer the question<\/em>\n<\/p>\n<h2>Easier said than done<\/h2>\n<p>\n  As I hinted at the start of this blog, writing the <em>mechanism<\/em> is easy, but tuning it for best results can be a challenge. There are at least three important design choices made in just the few lines of code shared above:\n<\/p>\n<ul>\n<li>\n    Format and content of the embeddings text\n  <\/li>\n<li>\n    Threshold for matching historical embeddings with the current user query\n  <\/li>\n<li>\n    Prompt text used to introduce the context as grounding\n  <\/li>\n<\/ul>\n<p>\n  You\u2019ll notice I\u2019ve also conveniently neglected to discuss the other context that is <em>still in the sliding window<\/em>. In a more sophisticated embodiment, we\u2019d probably keep track of which messages are still \u201cin the window\u201d and omit those from the vector comparison\u2026 or would we?!\n<\/p>\n<p>This channel also includes the live weather function. The implementation above would add those interactions to the history along with everything else, but it might make sense to <i>exclude<\/i> function results, since it makes more sense to get the live data each time?<\/p>\n<p>\n  And then there\u2019s the question of what to do as the history datastore grows larger and larger. We will need to de-duplicate and\/or summarize context that we use for grounding, possibly keeping track of dates and times so we can make inferences about stale data. 
Some of these problems might even be solved with additional LLM queries \u2013 but with the added cost and latency before providing an answer.\n<\/p>\n<h2>Resources and feedback<\/h2>\n<p>\n  The <a href=\"https:\/\/community.openai.com\/\">OpenAI developer community<\/a> forum has lots of discussion about API usage and other developer questions, and the <a href=\"https:\/\/python.langchain.com\/docs\/\">LangChain docs<\/a> have lots of interesting examples and discussion of LLM theory.\n<\/p>\n<p>\n  We\u2019d love your feedback on this post, including any tips or tricks you\u2019ve learned from playing around with ChatGPT prompts.\n<\/p>\n<p>\n  If you have any thoughts or questions, use the <a href=\"http:\/\/aka.ms\/SurfaceDuoSDK-Feedback\">feedback forum<\/a> or message us on <a href=\"https:\/\/twitter.com\/surfaceduodev\">Twitter @surfaceduodev<\/a>.\n<\/p>\n<p>\n  There will be no livestream this week, but you can check out the <a href=\"https:\/\/youtube.com\/c\/surfaceduodev\">archives on YouTube<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hello prompt engineers, The last few posts have been about the different ways to create an \u2018infinite chat\u2019, where the conversation between the user and an LLM model is not limited by the token size limit and as much historical context as possible can be used to answer future queries. 
We previously covered: Sliding window [&hellip;]<\/p>\n","protected":false},"author":570,"featured_media":3503,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[741],"tags":[734,733],"class_list":["post-3499","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","tag-chatgpt","tag-openai"],"acf":[],"blog_post_summary":"<p>Hello prompt engineers, The last few posts have been about the different ways to create an \u2018infinite chat\u2019, where the conversation between the user and an LLM model is not limited by the token size limit and as much historical context as possible can be used to answer future queries. We previously covered: Sliding window [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/posts\/3499","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/users\/570"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/comments?post=3499"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/posts\/3499\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/media\/3503"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/media?parent=3499"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/categories?post=3499"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-du
o\/wp-json\/wp\/v2\/tags?post=3499"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}