{"id":3449,"date":"2023-08-31T11:19:46","date_gmt":"2023-08-31T18:19:46","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/surface-duo\/?p=3449"},"modified":"2024-01-03T16:21:51","modified_gmt":"2024-01-04T00:21:51","slug":"android-openai-chatgpt-16","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-16\/","title":{"rendered":"Infinite chat using a sliding window"},"content":{"rendered":"<p>\n  Hello prompt engineers,\n<\/p>\n<p>\n  There are a number of different strategies to support an \u2018infinite chat\u2019 using an LLM, required because large language models do not store \u2018state\u2019 across API requests and there is a limit to how large a single request can be.\n<\/p>\n<p>\n  In this OpenAI community question on <a href=\"https:\/\/community.openai.com\/t\/question-about-token-limit-differences-in-api-vs-chat\/227601\">token limit differences in API vs Chat<\/a>, user <a href=\"https:\/\/community.openai.com\/u\/damc4\/summary\">damc4<\/a> outlines three well-known methods to implement infinite chat:\n<\/p>\n<ul>\n<li>\n    Sliding context window\n  <\/li>\n<li>\n    Summarization\n  <\/li>\n<li>\n    Embeddings\n  <\/li>\n<\/ul>\n<p>\n  The thread also suggests tools like <a href=\"https:\/\/docs.langchain.com\/docs\/\">Langchain<\/a> can help to implement these approaches, but for learning purposes, we\u2019ll examine them from first principles within the context of the Jetchat sample. This week\u2019s blog discusses the first method \u2013 a sliding context window &#8211; in our Android OpenAI chat sample.\n<\/p>\n<h2>Before we begin\u2026<\/h2>\n<p>\n  We\u2019re going to add a sliding window feature to the droidcon conference chat code in the <a href=\"https:\/\/github.com\/conceptdev\/droidcon-sf-23\/tree\/main\/Jetchat\">Jetchat sample on GitHub<\/a>.\n<\/p>\n<h3>Configuration changes<\/h3>\n<p>\n  The sliding window is required because without it, eventually a long conversation will hit the token limit. In order to make this easier to test, set the <code>OPENAI_CHAT_MODEL<\/code> in <strong>Constants.kt<\/strong> to <code>gpt-3.5-turbo<\/code> which (currently) has a 4,096 token limit. Using the model with the lowest limit will make it easier to hit the boundary condition because we\u2019ll need fewer interactions to test hitting the limit. \n<\/p>\n<h3>Calculating tokens<\/h3>\n<p>\n  Last week\u2019s post discussed <a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-15\/\">OpenAI tokens and limits<\/a>, including a link to the <a href=\"https:\/\/platform.openai.com\/tokenizer\">interactive tokenizer<\/a>. The page includes links to <a href=\"https:\/\/github.com\/openai\/tiktoken\">Python<\/a> and <a href=\"https:\/\/www.npmjs.com\/package\/gpt-3-encoder\">Javascript<\/a> code for tokenizing inputs. Unfortunately, there doesn\u2019t seem to be anything \u2018off the shelf\u2019 for Kotlin right now, so for now we\u2019re going to use a rough approximation, which you can find in the <code>Tokenizer<\/code> class in the <a href=\"https:\/\/github.com\/conceptdev\/droidcon-sf-23\/blob\/main\/Jetchat\/app\/src\/main\/java\/com\/example\/compose\/jetchat\/data\/Tokenizer.kt\">Jetchat example on GitHub<\/a>.\n<\/p>\n<h3>Test data<\/h3>\n<p>\n  The other thing that will help with testing is user inputs that have a \u2018roughly known\u2019 token size, so that we can reproduce the failure case easily and repeatably. 
From the examples <a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-15\/\">last week<\/a>, queries for sessions about \u201cgradle\u201d and \u201cjetpack compose\u201d result in a large number of matching embeddings that use up one- to two-thousand tokens. With just those two queries, it\u2019s possible to trigger the error state:\n<\/p>\n<p>\n  <img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/08\/a-screenshot-of-a-message-description-automatical.png\" class=\"wp-image-3450\" alt=\"Jetchat conversation screenshot where the model returns an error that the maximum content length of 4096 tokens has been exceeded\" width=\"450\" srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/08\/a-screenshot-of-a-message-description-automatical.png 844w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/08\/a-screenshot-of-a-message-description-automatical-300x116.png 300w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/08\/a-screenshot-of-a-message-description-automatical-768x298.png 768w\" sizes=\"(max-width: 844px) 100vw, 844px\" \/>\n<\/p>\n<p><em>Figure 1: error when token limit is exceeded<\/em>\n<\/p>\n<p>\n  By choosing a model with a low token limit and by testing with queries that require a large number of tokens to process, we can easily reproduce the issue, then build and test a fix.\n<\/p>\n<h3>Inexact science<\/h3>\n<p>\n  As mentioned above, the Kotlin tokenizer makes a very rough guess based on averages, rather than parsing text the way the official tokenizer implementations do. Similarly, because the Kotlin OpenAI client library abstracts away the serialization of arguments to and from JSON, we also don\u2019t have access to the <em>exact<\/em> payload that is sent to and received from the API. We also can\u2019t predict with accuracy how many tokens to expect in the response.\n<\/p>\n<p>\n  To get around this issue, the token calculations err on the side of overestimating, so that we\u2019re less likely to hit edge cases where we send the API a message that exceeds the limit. With more research it would be possible to better understand the underlying message sizes and write a better tokenizer\u2026 these are left as an exercise for the reader.\n<\/p>\n<h2>Building a sliding window (by discarding older results)<\/h2>\n<p>\n  \u201cSliding window\u201d is just a fancy name for a FIFO (first in, first out) queue. For an AI chat feature, as the number of tokens in the conversation history approaches the maximum allowed by the model, we discard the OLDEST interactions (both user messages and model responses). Those older interactions will be \u2018forgotten\u2019 by the model when it constructs responses to new queries (although there\u2019s nothing to stop the UI from continuing to show the older messages as a long conversation thread).\n<\/p>\n<h3>Implementation in Jetchat<\/h3>\n<p>\n  There are three new classes in the <a href=\"https:\/\/github.com\/conceptdev\/droidcon-sf-23\/tree\/main\/Jetchat\">Jetchat demo project<\/a> to support the sliding window:\n<\/p>\n<ul>\n<li>\n    <code>Tokenizer<\/code> \u2013 a basic averages-based token counter (number of characters divided by four).\n  <\/li>\n<li>\n    <code>CustomChatMessage<\/code> \u2013 a wrapper for the <code>ChatMessage<\/code> class that\u2019s used to calculate the token usage for the message. 
As noted above, this too can only approximate the token size of the underlying JSON payload (a simplified sketch of this wrapper follows Figure 2 below).\n  <\/li>\n<li>\n    <code>SlidingWindow<\/code> \u2013 helper class that contains the algorithm to loop through the conversation history, checking the token size of each message and building a new message collection that is \u201cguaranteed\u201d to be smaller than the model\u2019s token limit.\n  <\/li>\n<\/ul>\n<p>\n  The <code>SlidingWindow<\/code> class is used in the <code>DroidconEmbeddingsWrapper<\/code> class right before the <code>chatCompletionRequest<\/code> is sent, to trim down the data used as the completion request parameter. The <code>SlidingWindow.chatHistoryToWindow()<\/code> function performs the following steps:\n<\/p>\n<ol>\n<li>\n  Sets up variables and values for the algorithm.\n<\/li>\n<li>\n  Checks for a system message and saves it for later (otherwise it\u2019s the \u2018oldest\u2019 message and could be ignored).\n<\/li>\n<li>\n  Loops through the entire message history from newest to oldest, because we want to preserve the newest messages.\n<\/li>\n<li>\n  Adds the system message back, reorders, and returns the message \u201cwindow\u201d.\n<\/li>\n<\/ol>\n<p>\n  The code is shown here:\n<\/p>\n<pre>  fun chatHistoryToWindow (conversation: MutableList&lt;ChatMessage&gt;): MutableList&lt;ChatMessage&gt; {\r\n      val tokenLimit = Constants.OPENAI_MAX_TOKENS\r\n      val expectedResponseSizeTokens = 500 \/\/ hardcoded estimate of max response size we expect (in tokens)\r\n      var tokensUsed = 0\r\n      var systemMessage: ChatMessage? = null\r\n      val tokenMax = tokenLimit - expectedResponseSizeTokens\r\n      var messagesInWindow = mutableListOf&lt;ChatMessage&gt;()\r\n      \/\/ check for system message (to preserve it even if others are removed)\r\n      if (conversation[0].role == ChatRole.System) {\r\n          systemMessage = conversation[0]\r\n          var systemMessageTokenCount = Tokenizer.countTokensIn(systemMessage.content)\r\n          tokensUsed += systemMessageTokenCount\r\n      }\r\n      \/\/ loop through messages until one takes us over the token limit\r\n      for (message in conversation.reversed()) {\r\n          if (message.role != ChatRole.System) {\r\n              var m = CustomChatMessage(message.role, \"\", message.content, message.name, message.functionCall)\r\n              if ((tokensUsed + m.getTokenCount()) &lt; tokenMax) {\r\n                  messagesInWindow.add(message)\r\n                  tokensUsed += m.getTokenCount()\r\n              } else {\r\n                  break \/\/ could optionally keep adding subsequent, smaller messages to context up until token limit\r\n              }\r\n          }\r\n      }\r\n      \/\/ add system message back if it existed\r\n      if (systemMessage != null) {\r\n          messagesInWindow.add(systemMessage)\r\n      }\r\n      \/\/ re-order so that system message is [0]\r\n      var orderedMessageWindow = messagesInWindow.reversed().toMutableList()\r\n      return orderedMessageWindow\r\n  }<\/pre>\n<p><em>Figure 2: sliding window algorithm<\/em>\n<\/p>\n<p>\n  The usage in the <code>DroidconEmbeddingsWrapper.chat()<\/code> function is <em>right before<\/em> the creation of the <code>chatCompletionRequest<\/code>. 
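Before looking at that call site, note what the <code>getTokenCount()<\/code> helper used in the loop above needs to do: estimate the tokens for the message content plus a small allowance for the role, name, and JSON formatting. A simplified, hypothetical sketch of the wrapper (the class in the repo takes more parameters and also accounts for function calls) might look like this:\n<\/p>\n<pre>  \/\/ Hypothetical simplified sketch of a ChatMessage wrapper that estimates the\r\n  \/\/ token cost of a single message (content plus some per-message overhead).\r\n  \/\/ (imports from the openai-kotlin client library omitted, as in the other snippets)\r\n  class CustomChatMessage(\r\n      val role: ChatRole,\r\n      val content: String? = null,\r\n      val name: String? = null\r\n  ) {\r\n      companion object {\r\n          \/\/ rough allowance for role, name, and JSON formatting in the payload\r\n          const val TOKEN_OVERHEAD_PER_MESSAGE = 4\r\n      }\r\n\r\n      fun getTokenCount(): Int {\r\n          return Tokenizer.countTokensIn(content) +\r\n                 Tokenizer.countTokensIn(name) +\r\n                 TOKEN_OVERHEAD_PER_MESSAGE\r\n      }\r\n  }<\/pre>\n<p>\n  Overestimating here is deliberate \u2013 as discussed in the \u201cInexact science\u201d section, it is safer to reserve slightly too many tokens per message than too few. 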
The updated code passes the full <code>conversation<\/code> list of messages to the <code>chatHistoryToWindow<\/code> function, and then uses the subset of messages (the \u201csliding window\u201d) that\u2019s returned as the parameter for the <code>chatCompletionRequest<\/code>:\n<\/p>\n<pre>  \/\/ implement sliding window\r\n  val chatWindowMessages = SlidingWindow.chatHistoryToWindow(conversation)\r\n  \/\/ build the OpenAI network request\r\n  val chatCompletionRequest = chatCompletionRequest {\r\n          model = ModelId(Constants.OPENAI_CHAT_MODEL)\r\n          messages = chatWindowMessages \/\/ previously sent the entire conversation\r\n  \/\/...<\/pre>\n<p><em>Figure 3: snippet of code where the sliding window function is added into the existing chat feature<\/em>\n<\/p>\n<p>\n  The remainder of the code is unchanged \u2013 subsequent model responses are added back to the main <code>conversation<\/code> list and the sliding window is recalculated for the next user query.\n<\/p>\n<h3>Fixed it!<\/h3>\n<p>\n  To \u201cprove\u201d that the new algorithm works, the conversation in Figure 4 mimics the input used to trigger the error in Figure 1. The first two user queries \u2013 about \u201cgradle\u201d and \u201cjetpack compose\u201d \u2013 each result in a large number of embedding matches being included in the prompt (exemplified by the number of results for each \u2013 five sessions listed after the first query and ten sessions listed for the second). Prior to implementing the sliding window, the third query (about \u201cAI\u201d) would have been expected to trigger the maximum-tokens error; instead, it succeeds:\n<\/p>\n<p>\n  <img decoding=\"async\" width=\"1600\" height=\"650\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/08\/a-screenshot-of-a-phone-description-automatically.png\" class=\"wp-image-3451\" alt=\"Three screenshots of Jetchat showing a conversation about the droidcon schedule. The user queries require embeddings with a token count that exceeds the maximum model limit, proving the sliding window is working\" srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/08\/a-screenshot-of-a-phone-description-automatically.png 1600w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/08\/a-screenshot-of-a-phone-description-automatically-300x122.png 300w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/08\/a-screenshot-of-a-phone-description-automatically-1024x416.png 1024w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/08\/a-screenshot-of-a-phone-description-automatically-768x312.png 768w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/08\/a-screenshot-of-a-phone-description-automatically-1536x624.png 1536w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/>\n<\/p>\n<p><em>Figure 4: a three-interaction conversation where the first two questions have large embedding matches<\/em>\n<\/p>\n<p>\n  The \u201cproof\u201d that the algorithm is working can be seen in the <strong>logcat<\/strong> output (filtered by <code>LLM-SW<\/code>) shown in Figure 5. 
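(The exact log statements in the repo may differ; conceptually, <code>Log<\/code> calls along the following lines inside the <code>chatHistoryToWindow<\/code> loop are what produce that output.)\n<\/p>\n<pre>  \/\/ Hypothetical sketch: logging added inside the existing if\/else of the window loop,\r\n  \/\/ using the same \"LLM-SW\" tag that the logcat output is filtered by.\r\n  if ((tokensUsed + m.getTokenCount()) &lt; tokenMax) {\r\n      messagesInWindow.add(message)\r\n      tokensUsed += m.getTokenCount()\r\n      Log.i(\"LLM-SW\", \"Added ${message.role} message (${m.getTokenCount()} tokens), $tokensUsed tokens used\")\r\n  } else {\r\n      Log.i(\"LLM-SW\", \"Message needs ${m.getTokenCount()} tokens but only ${tokenMax - tokensUsed} available; stopping\")\r\n      break\r\n  }<\/pre>\n<p>\n  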
The log shows each message from newest to oldest being compared against the \u201cavailable tokens\u201d, until one of the messages has so many tokens that including it would exceed the limit (114 tokens \u2018available\u2019 under the limit, and the \u201cGradle\u201d query with embeddings requires 2033 tokens). That message is not included in the window, so the algorithm stops adding messages and returns the subset to be sent to the model.\n<\/p>\n<p>\n  <img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/08\/a-screenshot-of-a-computer-program-description-au.png\" class=\"wp-image-3452\" alt=\"Screenshot of logcat output showing the steps the sliding window algorithm took to exclude the embeddings that would have exceeded the model limit\" width=\"450\" srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/08\/a-screenshot-of-a-computer-program-description-au.png 785w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/08\/a-screenshot-of-a-computer-program-description-au-300x228.png 300w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/08\/a-screenshot-of-a-computer-program-description-au-768x584.png 768w\" sizes=\"(max-width: 785px) 100vw, 785px\" \/>\n<\/p>\n<p><em>Figure 5: Logcat shows the sliding window calculation (in reverse chronological order) where the initial Gradle question and embeddings are removed from the context sent to the model<\/em>\n<\/p>\n<h2>Sliding window for functions<\/h2>\n<p>\n  The above examples and code snippets show queries that use embeddings to augment the model\u2019s response. However, as discussed in the post on <a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-10\/\">combining OpenAI functions with embeddings<\/a> and <a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-15\/\">OpenAI tokens and limits<\/a>, the addition of function calls also has an impact on token usage. 
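When the model responds with a function call, both that assistant message (containing the <code>functionCall<\/code>) and the locally generated function result are appended to the <code>conversation<\/code> history, along the lines of this simplified sketch (hypothetical \u2013 the variable names and function name are placeholders for whatever the app actually called):\n<\/p>\n<pre>  \/\/ Hypothetical sketch: append the function-calling messages to the history\r\n  \/\/ before making the follow-up completion request.\r\n  conversation.add(assistantResponseMessage) \/\/ assistant message containing the functionCall\r\n  conversation.add(\r\n      ChatMessage(\r\n          role = ChatRole.Function,\r\n          name = requestedFunctionName,   \/\/ placeholder: name of the function the model asked for\r\n          content = localFunctionResult   \/\/ placeholder: string result produced by the local function\r\n      )\r\n  )<\/pre>\n<p>\n  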
The local result of function calling requires that another chat completion request be sent to the model to get the model\u2019s final response.\n<\/p>\n<p>\n  To prevent the subsequent function call completion request from exceeding the token limit, we re-apply the sliding window algorithm to the updated <code>conversation<\/code> list, after the function-calling messages have been added, as shown in Figure 6:\n<\/p>\n<pre>  \/\/ sliding window - with the function call messages we might need to remove more from the conversation history\r\n  val functionChatWindowMessages = SlidingWindow.chatHistoryToWindow(conversation)\r\n  \/\/ send the function request\/response back to the model\r\n  val functionCompletionRequest = chatCompletionRequest {\r\n      model = ModelId(Constants.OPENAI_CHAT_MODEL)\r\n      messages = functionChatWindowMessages\r\n  }<\/pre>\n<p><em>Figure 6: re-apply the sliding window algorithm after including the function call request and local response<\/em>\n<\/p>\n<p>\n  Recalculating the sliding window at this point may exclude additional older context from the start of the conversation, to make room for the local function response in the chat completion request.\n<\/p>\n<h2>Resources and feedback<\/h2>\n<p>\n  Upcoming posts will continue the discussion about different ways to implement a long-term chat beyond a simple sliding window.\n<\/p>\n<p>\n  The <a href=\"https:\/\/community.openai.com\/\">OpenAI developer community<\/a> forum has lots of discussion about API usage and other developer questions.\n<\/p>\n<p>\n  We\u2019d love your feedback on this post, including any tips or tricks you\u2019ve learned from playing around with ChatGPT prompts.\n<\/p>\n<p>\n  If you have any thoughts or questions, use the <a href=\"http:\/\/aka.ms\/SurfaceDuoSDK-Feedback\">feedback forum<\/a> or message us on <a href=\"https:\/\/twitter.com\/surfaceduodev\">Twitter @surfaceduodev<\/a>.\n<\/p>\n<p>\n  There will be no livestream this week, but you can check out the <a href=\"https:\/\/youtube.com\/c\/surfaceduodev\">archives on YouTube<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hello prompt engineers, There are a number of different strategies to support an \u2018infinite chat\u2019 using an LLM, required because large language models do not store \u2018state\u2019 across API requests and there is a limit to how large a single request can be. In this OpenAI community question on token limit differences in API vs [&hellip;]<\/p>\n","protected":false},"author":570,"featured_media":3450,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[741],"tags":[734,733],"class_list":["post-3449","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","tag-chatgpt","tag-openai"],"acf":[],"blog_post_summary":"<p>Hello prompt engineers, There are a number of different strategies to support an \u2018infinite chat\u2019 using an LLM, required because large language models do not store \u2018state\u2019 across API requests and there is a limit to how large a single request can be. 
In this OpenAI community question on token limit differences in API vs [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/posts\/3449","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/users\/570"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/comments?post=3449"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/posts\/3449\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/media\/3450"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/media?parent=3449"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/categories?post=3449"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/tags?post=3449"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}