{"id":3485,"date":"2023-09-14T09:53:53","date_gmt":"2023-09-14T16:53:53","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/surface-duo\/?p=3485"},"modified":"2024-01-03T16:20:13","modified_gmt":"2024-01-04T00:20:13","slug":"android-openai-chatgpt-18","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-18\/","title":{"rendered":"\u201cInfinite\u201d chat with history summarization"},"content":{"rendered":"<p>\n  Hello prompt engineers,\n<\/p>\n<p>\n  A few weeks ago we talked about <a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-15\/\">token limits<\/a> on LLM chat APIs and how this prevents an infinite amount of history being remembered as context. A <a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-16\/\">sliding window<\/a> can limit the overall context size, and making the sliding window <a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-16\/\">more efficient<\/a> can help maximize the amount of context sent with each new chat query. \n<\/p>\n<p>\n  However, to include MORE relevant context from a chat history, different approaches are required, such as history summarization or using embeddings of past context.\n<\/p>\n<p>\n  In this post, we\u2019ll consider how summarizing the conversation history that\u2019s beyond the sliding window can help preserve context available to answer future queries while still keeping within the model\u2019s token limit.\n<\/p>\n<h2>The test case<\/h2>\n<p>\n  In the droidcon sample, one possible query that requires a lot of historical context could be <strong>&#8220;list all the sessions discussed in this chat&#8221;<\/strong>. Ideally, this would be able to list ALL the sessions in the conversation, not just the ones still mentioned in the sliding window. 
To test this, we\u2019ll first need to chat back and forth with the model, asking questions and getting responses until the sliding window is full.\n<\/p>\n<blockquote><p>An interesting side effect of all the features we already added is that queries like this trigger the <code>AskDatabaseFunction<\/code> from <a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-13\/\">Dynamic Sqlite queries with OpenAI chat functions<\/a>. The model generates a <code>SELECT * FROM sessions<\/code> query and attempts to answer with ALL the sessions in the database. For simplicity, I have commented out that function in <strong>DroidconEmbeddingsWrapper.kt<\/strong> for the purposes of testing the history summarization feature discussed in this post. Tweaking the prompt for that function will be an exercise for another time&#8230;<\/p><\/blockquote>\n<p>\n  To make testing easier, I wrote a test conversation to <strong>logcat<\/strong> and then hardcoded it to simulate a longer conversation (the initial generation of the test data proved that the \u2018history accumulation\u2019 code worked). The test conversation can be seen in <a href=\"https:\/\/github.com\/conceptdev\/droidcon-sf-23\/blob\/655c022a1a325ec3bf03d06aca2a08f625c88b1b\/Jetchat\/app\/src\/main\/java\/com\/example\/compose\/jetchat\/data\/SlidingWindow.kt#L84-L144\">this commit<\/a>.\n<\/p>\n<h2>Summarizing the message history<\/h2>\n<p>\n  Older messages that fall outside of the sliding window need to be \u2018compressed\u2019 somehow so that they can still be included in API calls while staying under the token limit. To perform this \u2018compression\u2019, we make an additional LLM completion call to generate a summary of as many messages as can fit inside a separate completion request (which is limited by its own maximum token size).\n<\/p>\n<p>\n  The message history will be formatted as plain text, with each message prefixed with \u201cUSER:\u201d or \u201cASSISTANT:\u201d. 
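<\/p>
<p>
  As a minimal sketch of that flattening step (the <code>Message<\/code> type and function name here are hypothetical, not the sample\u2019s actual code), the transcript can be built by joining one \u201cROLE: content\u201d line per message:
<\/p>

```kotlin
// Hypothetical types and names for illustration only (not the sample's actual code).
data class Message(val role: String, val content: String)

// Flatten chat messages into the plain text transcript format used for
// summarization: one "ROLE: content" line per message.
fun formatForSummary(messages: List<Message>): String =
    messages.joinToString("\n") { "${it.role.uppercase()}: ${it.content}" }
```

<p>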
Here\u2019s an example of the <a href=\"https:\/\/github.com\/conceptdev\/droidcon-sf-23\/blob\/655c022a1a325ec3bf03d06aca2a08f625c88b1b\/Jetchat\/app\/src\/main\/java\/com\/example\/compose\/jetchat\/data\/SlidingWindow.kt#L84-L144\">format<\/a>:\n<\/p>\n<pre>USER: are there any sessions about AI?\r\nASSISTANT: Yes, there is a session about AI titled \u201cAI for Android on- and off- device\u201d presented by Craig Dunn, a software engineer at Microsoft. The session will take place on June 9th at 16:30 in Robertson 1.\r\nUSER:...<\/pre>\n<p><em>Figure 1: Example text representation of chat history<\/em>\n<\/p>\n<p>\n  Conversation history like this, that extends beyond the sliding window, is what we need to succinctly summarize to keep \u2018context\u2019 while still using our token limit efficiently.\n<\/p>\n<blockquote><p>NOTE: the prompt used to generate the summary will have a big impact on the success of the feature. With this in mind, I tested a number of different prompts directly in the <a href=\"https:\/\/platform.openai.com\/playground\">OpenAI Playground<\/a> before working on the Android sample in Kotlin.<\/p><\/blockquote>\n<p>\n  <img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-computer-description-automatica.png\" class=\"wp-image-3486\" alt=\"Screenshot of the OpenAI playground showing a test prompt along with the sample hardcoded conversation\" width=\"450\" srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-computer-description-automatica.png 2032w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-computer-description-automatica-300x223.png 300w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-computer-description-automatica-1024x762.png 1024w, 
https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-computer-description-automatica-768x571.png 768w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-computer-description-automatica-1536x1143.png 1536w\" sizes=\"(max-width: 2032px) 100vw, 2032px\" \/><br\/><em>Figure 2: OpenAI playground showing the summary prompt testing<\/em>\n<\/p>\n<p>\n  The following table shows some of the different prompts that were tested on the hardcoded <a href=\"https:\/\/github.com\/conceptdev\/droidcon-sf-23\/blob\/655c022a1a325ec3bf03d06aca2a08f625c88b1b\/Jetchat\/app\/src\/main\/java\/com\/example\/compose\/jetchat\/data\/SlidingWindow.kt#L84-L144\">sample conversation<\/a>:\n<\/p>\n<table>\n<tr>\n<td>\n<p><strong>Prompt<\/strong>\n<\/p>\n<\/td>\n<td>\n<p><strong>Summarized text<\/strong>\n<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>\n  Summarize the following conversation and extract key points:\n<\/p>\n<p>\n  &lt;CONVERSATION&gt;\n<\/p>\n<\/td>\n<td>\n<p>\n  The droidcon SF conference will feature several Jetpack Compose and Gradle sessions. Jetpack Compose sessions will be held on June 8th and 9th, with speakers VINAY GABA, NICK DIPATRI, ALEJANDRO SANCHEZ and TASHA RAMESH, while Gradle sessions will be held on June 8th and 9th, with speakers NELSON OSACKY, IURY SOUZA and AURIMAS LIUTIKAS.\n<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>\n  Summarize the following conversation and extract key points, especially from USER: &lt;CONVERSATION&gt;\n<\/p>\n<\/td>\n<td>\n<p>\n  The user asked if there were any Jetpack Compose and Gradle sessions at the droidcon SF conference. The assistant provided five sessions for Jetpack Compose and three sessions for Gradle. 
The user was provided with information regarding the speakers, topics, dates, times, and locations of the sessions.\n<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>\n  Extract all the session names from this discussion:\n<\/p>\n<p>\n  &lt;CONVERSATION&gt;\n<\/p>\n<\/td>\n<td>\n<p>\n  Jetpack Compose sessions:<br\/>\n  1. Panel Discussion: Adopting Jetpack Compose @ Scale<br\/>\n  2. Practical Compose Navigation with a Red Siren<br\/>\n  3. Creative Coding with Compose: The Next Chapter<br\/>\n  4. Opening the shutter on snapshots<br\/>\n  Gradle sessions:<br\/>\n  1. Improving Developer Experience with Gradle Build Scans<br\/>\n  2. Crash Course in building your First Gradle Plugin<br\/>\n  3. Gradle under a microscope: profiling and optimizing builds\n<\/p>\n<\/td>\n<\/tr>\n<\/table>\n<p><em>Figure 3: testing different prompts for the best summarization of message history<\/em>\n<\/p>\n<p>\n  In this test scenario the message history consisted of 502 tokens, and the summaries range from 60-100 tokens. This suggests we can get at least five times as much \u2018information\u2019 into our chat context by summarizing it!\n<\/p>\n<p>\n  Choose a prompt that gives you the summary that seems most useful for the types of questions you expect the user to ask. For the droidcon conference chat, the last prompt seems to produce the most useful output \u2013 the chat will retain a record of all the sessions that were previously discussed. A more open-ended chat application might prefer one of the other prompt styles.\n<\/p>\n<p>\n  To actually generate the summary, the code will call the <a href=\"https:\/\/platform.openai.com\/docs\/guides\/gpt\/completions-api\">OpenAI completions API<\/a>, totally independently of the ongoing chat API calls. This is done using the <a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-2\/\">completion endpoint to summarize text<\/a> that was covered in a blog post in April. 
The new <code>SummarizeHistory.summarize<\/code> function is shown in Figure 4:\n<\/p>\n<pre>  suspend fun summarize(history: String): String {\r\n      val summarizePrompt = \"\"\"Extract all the session names from this discussion:\r\n          #####\r\n          $history\r\n          #####\"\"\".trimIndent()\r\n      \/\/ request a completion that summarizes the history\r\n      val completionRequest = completionRequest {\r\n          model = ModelId(Constants.OPENAI_COMPLETION_MODEL)\r\n          prompt = summarizePrompt\r\n          maxTokens = 500\r\n      }\r\n      val textCompletion: TextCompletion = openAI.completion(completionRequest)\r\n      return textCompletion.choices.first().text\r\n  }<\/pre>\n<p><em>Figure 4: Code to summarize the message history using OpenAI<\/em>\n<\/p>\n<p>\n  This <code>summarize<\/code> function is called from the <code>SlidingWindow.chatHistoryToWindow<\/code> function, where the loop through the message history has been updated. Instead of calling <code>break<\/code> when the sliding window is full, the loop now concatenates the older messages into a single string, which is sent to the summarize function.\n<\/p>\n<p>\n  Once the older messages have been concatenated into the <code>messagesForSummarization<\/code> variable, they are summarized using the method shown above and then wrapped in further instructions so that the model knows the \u2018meaning\u2019 of the summarized data:\n<\/p>\n<pre>  val history = SummarizeHistory.summarize(messagesForSummarization)\r\n  val historyContext = \"\"\"These sessions have been discussed previously:\r\n       $history\r\n       Only use this information if requested.\"\"\".trimIndent()<\/pre>\n<p><em>Figure 5: the summarized history is wrapped in a prompt message to help the model understand the context<\/em>\n<\/p>\n<blockquote><p>Without the additional prompt instructions, the model is prone to using the data in <code>$history<\/code> to answer <em>every<\/em> question the user asks\u2026 i.e. 
a form of \u201challucination\u201d! As with all prompt instructions, small changes in wording (or changing the model you use) can have a big effect on the output.<\/p><\/blockquote>\n<h2>Where to reference the summary<\/h2>\n<p>\n  Now that we have a historical context summary, we need to pass it back to the model in subsequent API calls. There are at least three options for where in the chat API we could insert the summary, each of which could behave differently depending on your prompts and model choice:\n<\/p>\n<ul>\n<li>\n    System prompt\n  <\/li>\n<li>\n    Insert as the oldest user message\n  <\/li>\n<li>\n    As grounding for the current user query\n  <\/li>\n<\/ul>\n<p>\n  It\u2019s not clear that any one of these options is superior to the others; the best choice likely depends on the model you are using, your built-in prompts, and the user queries! The options are discussed below:\n<\/p>\n<h3>System prompt<\/h3>\n<p>\n  The system prompt seems like an ideal place to include additional context. However, some models (e.g., GPT 3.5) do not give as much weight to this context as others. Testing will be required to see whether context added here is \u2018observed\u2019 in completions or ignored (for your application\u2019s specific use case).\n<\/p>\n<h3>\u201cFirst\/oldest\u201d user query<\/h3>\n<p>\n  Sending the history summary as the first\/oldest message reflects the fact that the data <em>was<\/em> earlier in the conversation (prior to being summarized). In theory, this suggests the model should give it the same \u201clevel of consideration\u201d as the original (longer) message history when generating new completions.\n<\/p>\n<h3>Current user query<\/h3>\n<p>\n  If either of the first two methods fails to cause the summary to be referenced effectively, adding it to the <em>current<\/em> user query as additional grounding should cause it to be considered when the model formulates its answer. 
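<\/p>
<p>
  A rough sketch of this third option (the function name is hypothetical; the wrapper text mirrors the instructions from Figure 5) might prepend the summary to the user\u2019s latest message:
<\/p>

```kotlin
// Hypothetical helper (not the sample's actual code): prepend the
// summarized history to the current user query as grounding text.
fun groundCurrentQuery(summary: String, userQuery: String): String =
    "These sessions have been discussed previously:\n" +
    summary + "\n" +
    "Only use this information if requested.\n\n" +
    userQuery
```

<p>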
The downside may be that it\u2019s considered more important than other context within the sliding window, resulting in confusing responses.\n<\/p>\n<h3>Further research<\/h3>\n<p>\n  In addition to conducting testing with your scenarios, there is also more general research available on how models use the context, such as the paper <a href=\"https:\/\/arxiv.org\/pdf\/2307.03172.pdf\">Lost in the Middle: How Language Models Use Long Contexts<\/a>, which discusses a bias in responses towards information at the beginning or end of the context.\n<\/p>\n<h2>Add the summary to the chat<\/h2>\n<p>\n  For this demonstration, I\u2019ve chosen the first option &#8211; to add the history summary to the system prompt. The updated code is shown in figure 6:\n<\/p>\n<pre>  if (history.isNullOrEmpty()) { \/\/ no history summary available\r\n       messagesInWindow.add(systemMessage)\r\n  } else { \/\/ combine system message with history summary\r\n       messagesInWindow.add(\r\n           ChatMessage(\r\n               role = ChatRole.System,\r\n               content = (systemMessage.content + \"\\n\\n\" + historyContext)\r\n          )\r\n      )\r\n  }<\/pre>\n<p><em>Figure 6: if a history summary exists, add it to the system prompt<\/em>\n<\/p>\n<h2>It works!<\/h2>\n<p>\n  The screenshot below shows a conversation where the most recent chat response (about an AI session), as well as Gradle and Compose sessions from the summarized history, are all included in response to the test query <strong>\u201clist all the sessions discussed in this chat\u201d<\/strong>:\n<\/p>\n<p>\n  <img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-chat-description-automatically-3.png\" class=\"wp-image-3487\" alt=\"Screenshot of the Jetchat application showing a conversation that has been answered with context from the chat history\" width=\"450\" 
srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-chat-description-automatically-3.png 948w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-chat-description-automatically-3-243x300.png 243w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-chat-description-automatically-3-831x1024.png 831w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/09\/a-screenshot-of-a-chat-description-automatically-3-768x946.png 768w\" sizes=\"(max-width: 948px) 100vw, 948px\" \/><br\/><em>Figure 7: a question about all sessions in the chat includes both the most recent response AND sessions from the summarized history<\/em>\n<\/p>\n<p>\n  If you look closely at the response, you\u2019ll notice that only the 8<sup>th<\/sup> session includes the presenter name. This is a good indication that the summarized history in combination with recent messages is being used to generate this response \u2013 in the history summary example above, you can see that only the session names are included and not the speakers.\n<\/p>\n<p>Test different types of query to verify that the historical context is being used, for example <strong>&#8220;what was the first session mentioned?&#8221;<\/strong> correctly responds with the &#8220;Panel Discussion: Adopting Jetpack Compose @ Scale&#8221; from the summarized history.<\/p>\n<h2>One last thing&#8230;<\/h2>\n<p>\n  Scaling the summarization process as the history grows really long will require repeated API calls which could end up potentially summarizing summaries. 
Although summarizing history to keep <em>some<\/em> context can work for a given amount of history, the token limit means that eventually historical information will be lost.\n<\/p>\n<p>\n  While summarization might be a good way to extend the amount of content remembered in some chat conversations, it can\u2019t be relied upon to support a true \u201cinfinite chat\u201d. For another possible solution, the next post will discuss embeddings as a way to recall past chat interactions.\n<\/p>\n<h2>Resources and feedback<\/h2>\n<p>\n  The <a href=\"https:\/\/community.openai.com\/\">OpenAI developer community<\/a> forum has lots of discussion about API usage and other developer questions. In particular, it discusses how the <a href=\"https:\/\/platform.openai.com\/docs\/guides\/gpt\/chat-completions-vs-completions\">OpenAI completions API<\/a> is now considered legacy, and that for performance and cost reasons the chat models are preferred (either <code>gpt-3.5-turbo<\/code> or <code>gpt-4<\/code>). 
\n<\/p>\n<p>\n  The code added specifically for this blog post can be viewed in <a href=\"https:\/\/github.com\/conceptdev\/droidcon-sf-23\/pull\/13\/\">pull request #13<\/a> on the sample\u2019s GitHub repo.\n<\/p>\n<p>\n  We\u2019d love your feedback on this post, including any tips or tricks you\u2019ve learned from playing around with OpenAI.\n<\/p>\n<p>\n  If you have any thoughts or questions, use the <a href=\"http:\/\/aka.ms\/SurfaceDuoSDK-Feedback\">feedback forum<\/a> or message us on <a href=\"https:\/\/twitter.com\/surfaceduodev\">Twitter @surfaceduodev<\/a>.\n<\/p>\n<p>\n  There will be no livestream this week, but you can check out the <a href=\"https:\/\/youtube.com\/c\/surfaceduodev\">archives on YouTube<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hello prompt engineers, A few weeks ago we talked about token limits on LLM chat APIs and how this prevents an infinite amount of history being remembered as context. A sliding window can limit the overall context size, and making the sliding window more efficient can help maximize the amount of context sent with each [&hellip;]<\/p>\n","protected":false},"author":570,"featured_media":3488,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[741],"tags":[734,733],"class_list":["post-3485","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","tag-chatgpt","tag-openai"],"acf":[],"blog_post_summary":"<p>Hello prompt engineers, A few weeks ago we talked about token limits on LLM chat APIs and how this prevents an infinite amount of history being remembered as context. 
A sliding window can limit the overall context size, and making the sliding window more efficient can help maximize the amount of context sent with each [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/posts\/3485","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/users\/570"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/comments?post=3485"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/posts\/3485\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/media\/3488"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/media?parent=3485"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/categories?post=3485"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/tags?post=3485"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}