{"id":3593,"date":"2023-11-12T16:42:30","date_gmt":"2023-11-13T00:42:30","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/surface-duo\/?p=3593"},"modified":"2024-01-03T16:05:29","modified_gmt":"2024-01-04T00:05:29","slug":"android-openai-chatgpt-25","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-25\/","title":{"rendered":"Chunking for citations in a document chat"},"content":{"rendered":"<p>\n  Hello prompt engineers,\n<\/p>\n<p>\n  Last week\u2019s blog introduced a simple \u201c<a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-24\/\">chat over documents<\/a>\u201d Android implementation, using some example content from this <a href=\"https:\/\/github.com\/azure-samples\/azure-search-openai-demo\">Azure demo<\/a>. However, if you take a look at the Azure sample, the output is not only summarized from the input PDFs, but it\u2019s also able to cite which document the answer is drawn from (shown in Figure 1). 
In this blog, we\u2019ll investigate how to add citations to the responses in JetchatAI.\n<\/p>\n<p>\n  <img decoding=\"async\" width=\"1219\" height=\"392\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-screenshot-of-a-computer-description-automatica.png\" class=\"wp-image-3594\" alt=\"\" srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-screenshot-of-a-computer-description-automatica.png 1219w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-screenshot-of-a-computer-description-automatica-300x96.png 300w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-screenshot-of-a-computer-description-automatica-1024x329.png 1024w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-screenshot-of-a-computer-description-automatica-768x247.png 768w\" sizes=\"(max-width: 1219px) 100vw, 1219px\" \/><br\/><em>Figure 1: Azure OpenAI demo result shows citations for the information presented in the response<\/em>\n<\/p>\n<p>\n  In order to provide similar information in the <em>JetchatAI<\/em> document chat on Android, we\u2019ll need to update the document parsing (chunking) so that we have enough context to answer questions <em>and<\/em> identify the source.\n<\/p>\n<h2>Prompt engineering playground<\/h2>\n<p>\n  Before spending a lot of time on the parsing algorithm, it makes sense to confirm that we can get the model to understand what we want to achieve. 
To quickly iterate prototypes for this feature, I simulated a request\/response in the <a href=\"https:\/\/platform.openai.com\/playground\">OpenAI Playground<\/a>, using the existing prompts from the app and some test embeddings from testing the document chat feature: \n<\/p>\n<p>\n  <img decoding=\"async\" width=\"2401\" height=\"1293\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-screenshot-of-a-computer-description-automatica-1.png\" class=\"wp-image-3595\" alt=\"\" srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-screenshot-of-a-computer-description-automatica-1.png 2401w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-screenshot-of-a-computer-description-automatica-1-300x162.png 300w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-screenshot-of-a-computer-description-automatica-1-1024x551.png 1024w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-screenshot-of-a-computer-description-automatica-1-768x414.png 768w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-screenshot-of-a-computer-description-automatica-1-1536x827.png 1536w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-screenshot-of-a-computer-description-automatica-1-2048x1103.png 2048w\" sizes=\"(max-width: 2401px) 100vw, 2401px\" \/><br\/><em>Figure 2: OpenAI playground for testing prompt ideas, with a prototype for answering a document chat with cited sources<\/em>\n<\/p>\n<p>\n  Figure 2 shows an example chat interaction based on the documents we added to the app in the previous blog. The facts listed in the USER prompt (#4) are examples of the embeddings from testing the existing feature. 
Each element of the \u201cprompt prototype\u201d is explained below:\n<\/p>\n<ol>\n<li>\n  Existing system prompt and grounding introduction (unchanged).\n<\/li>\n<li>\n  Specify which plan the user has, to help answer questions more specifically.\n<\/li>\n<li>\n  Updates to the system prompt and the grounding prompt to teach the model how to cite sources. The system prompt explains what citations should \u201clook like\u201d, with <code>[1]<\/code> numbered square brackets, and the grounding reinforces that citations should be used and added to the end of the response.\n<\/li>\n<li>\n  The similar embeddings are now grouped by the document they were extracted from, and the <code>#<\/code> markdown-style heading on the filename helps the model to group the data that follows. This test data consists of actual embeddings captured during earlier testing of the document chat feature.\n<\/li>\n<li>\n  The user\u2019s query, which is added to the end of the grounding data (from embeddings) and prompt.\n<\/li>\n<li>\n  The model\u2019s response attempts to refer to \u201cYour plan\u201d and hopefully distinguishes the plan mentioned in the system prompt (#2) from other plan features.\n<\/li>\n<li>\n  Two citations are provided in the response, because the vision and immunization chunks are from different source documents.\n<\/li>\n<li>\n  The model correctly adds the cited documents at the end of the response.\n<\/li>\n<\/ol>\n<p>\n  Slightly changing the user prompt to \u201cdoes my plan cover contact lenses\u201d (without mentioning immunizations), we can confirm that the answer and cited documents change:\n<\/p>\n<p>\n  <img decoding=\"async\" width=\"1540\" height=\"338\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-close-up-of-a-message-description-automatically.png\" class=\"wp-image-3596\" alt=\"\" 
srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-close-up-of-a-message-description-automatically.png 1540w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-close-up-of-a-message-description-automatically-300x66.png 300w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-close-up-of-a-message-description-automatically-1024x225.png 1024w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-close-up-of-a-message-description-automatically-768x169.png 768w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-close-up-of-a-message-description-automatically-1536x337.png 1536w\" sizes=\"(max-width: 1540px) 100vw, 1540px\" \/>\n<br\/><em>Figure 3: OpenAI playground example where only one source document is cited<\/em>\n<\/p>\n<p>\n  Note that in Figure 3 the citation numbering seems to reflect the position of the \u201cdocument\u201d in the grounding prompt. Although this should be numbered from one, I\u2019m going to ignore it for now (another exercise for the reader). The updated prompt and grounding format works well enough to be added to the app for further testing.\n<\/p>\n<h2>Updated chunking and embeddings <\/h2>\n<p>\n  Now that we\u2019ve established a prompt that works in the OpenAI playground, we need to update the app to parse the documents differently so that we can re-create the grounding format in code. \n<\/p>\n<p><strong>Currently<\/strong>, the sentence embeddings are all added without keeping track of the source document. 
When they\u2019re added to the grounding data, they are ordered by similarity score (highest first).\n<\/p>\n<p><strong>To implement<\/strong> the prompt and grounding prototyped above, we need to:\n<\/p>\n<ol>\n<li>\n  Alter the document parsing so that we keep track of which document each embedding comes from,\n<\/li>\n<li>\n  After we\u2019ve identified the most similar embeddings, group them by document name, and \n<\/li>\n<li>\n  Update the system and grounding prompts to train the model to create citations.\n<\/li>\n<\/ol>\n<p>\n  The code for these changes is shown below (and is in this <a href=\"https:\/\/github.com\/conceptdev\/droidcon-sf-23\/pull\/21\/files\">pull request<\/a>), followed by final app testing.\n<\/p>\n<h2>Chunking changes<\/h2>\n<p>\n  Because the code from last week was already keeping track of \u2018document id\u2019 as it parsed the resource files, minimal changes were needed to keep track of the actual filenames. \n<\/p>\n<p>\n  Firstly, a new list <code>rawFilenames<\/code> contains the user-friendly filename representation for each resource:\n<\/p>\n<pre>val rawResources = listOf(R.raw.benefit_options, R.raw.northwind_standard_benefits_details)\r\nval rawFilenames = listOf&lt;String&gt;(\"Benefit-options.pdf\", \"Northwind-Standard-benefits-details.pdf\")<\/pre>\n<p><em>Figure 4: adding the user-friendly filename strings (must match the resources order)<\/em>\n<\/p>\n<p>\n  Then as the code is looping through the resources, we add the user-friendly filename to a cache, keyed by the \u2018document id\u2019 we already have stored as part of the embeddings key:\n<\/p>\n<pre>for (resId in rawResources) {\r\n    documentId++\r\n    documentNameCache[\"$documentId\"] = rawFilenames[documentId]  \/\/ filename will be shown to user<\/pre>\n<p><em>Figure 5: storing the filename to match the <code>documentId<\/code> for later retrieval<\/em>\n<\/p>\n<p>\n  It\u2019s now possible to determine which document a given sentence was found 
in.\n<\/p>\n<h2>Grounding tweaks<\/h2>\n<p>\n  When the document filename is stored for each embedding, the code building the grounding prompt can group the embeddings under document \u201cheadings\u201d so that the model can better understand the context for the embedding strings.\n<\/p>\n<p>\n  For the document filenames to be useful, the system prompt must be updated to match the prototype in Figure 2. Figure 6 below shows the updated system prompt from the <strong>DocumentChatWrapper.kt<\/strong> <code>init<\/code> function:\n<\/p>\n<pre>grounding = \"\"\"\r\n   You are a personal assistant for Contoso employees. \r\n   You will answer questions about Contoso employee benefits from various employee manuals.\r\n   Your answers will be short and concise. \r\n   Only use the functions you have been provided with.\r\n   The user has Northwind Standard health plan.\r\n   For each piece of information you provide, cite the source in brackets like so: [1].\r\n   At the end of the answer, always list each source with its corresponding number and provide the document name, like so [1] Filename.doc\"\"\".trimIndent()<\/pre>\n<p><em>Figure 6: updated system prompt (including a personalization statement about the user\u2019s current plan)<\/em>\n<\/p>\n<p>\n  The code in Figure 7 shows the <code>grounding<\/code> function changes to support citations, producing output similar to the prototype grounding in Figure 2. 
After ranking the embeddings by similarity (and ignoring results with a similarity score below 0.8), it loops through and groups sentences by document filename:\n<\/p>\n<pre>val matches = sortedVectors.tailMap(0.8)\r\n\/\/ re-sort based on key, to group by filename\r\nval sortedMatches: SortedMap&lt;String, String&gt; = sortedMapOf()\r\nfor (dpKey in matches) {\r\n    val fileId = dpKey.value.split('-')[0] \/\/ the document id is the first part of the embedding key\r\n    val filename = documentNameCache[fileId]!!\r\n    val content = documentCache[dpKey.value]!!\r\n    if (sortedMatches.containsKey(filename)) { \/\/ add to current \u2018file\u2019 matching sentences\r\n        sortedMatches[filename] += \"\\n\\n$content\"\r\n    } else { \/\/ first match for this filename\r\n        sortedMatches[filename] = content\r\n    }\r\n}\r\n\/\/ loop through filenames and output the matching sentences for each file\r\nmessagePreamble = \"The following information is extracted from Contoso employee handbooks and health plan documents:\"\r\nfor (file in sortedMatches) {\r\n    messagePreamble += \"\\n\\n# ${file.key}\\n\\n${file.value}\\n\\n#####\\n\\n\" \/\/ use the # pound markdown-like heading syntax for the filename\r\n}\r\nmessagePreamble += \"\\n\\nUse the above information to answer the following question, providing numbered citations for document sources used (mention the cited documents at the end by number). 
Synthesize the information into a summary paragraph:\\n\\n\"\r\n<\/pre>\n<p><em>Figure 7: updated <code>grounding<\/code> function<\/em>\n<\/p>\n<p>\n  To recap, the code has been updated to:\n<\/p>\n<ol>\n<li>\n  Keep track of which document each embedding sentence was found in, \n<\/li>\n<li>\n  Group high-similarity embedding results by document filename, and \n<\/li>\n<li>\n  Add instructions in the system and grounding prompts to cite the source of facts in the model\u2019s response.\n<\/li>\n<\/ol>\n<p>\n  The responses in the JetchatAI document chat should now include numbered citations.\n<\/p>\n<h2>Citations in the chat<\/h2>\n<p>\n  With these relatively small changes in the code, the #document-chat conversation in JetchatAI will now add citations when asked questions about the fictitious Contoso employee benefits documents that are referenced via RAG principles:\n<\/p>\n<p>\n  <img decoding=\"async\" width=\"1928\" height=\"1168\" src=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-screenshot-of-a-phone-description-automatically.png\" class=\"wp-image-3597\" alt=\"Two screenshots of the JetchatAI app running on Android, with user questions and model answers containing numbered citations.\" srcset=\"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-screenshot-of-a-phone-description-automatically.png 1928w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-screenshot-of-a-phone-description-automatically-300x182.png 300w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-screenshot-of-a-phone-description-automatically-1024x620.png 1024w, https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-screenshot-of-a-phone-description-automatically-768x465.png 768w, 
https:\/\/devblogs.microsoft.com\/surface-duo\/wp-content\/uploads\/sites\/53\/2023\/11\/a-screenshot-of-a-phone-description-automatically-1536x931.png 1536w\" sizes=\"(max-width: 1928px) 100vw, 1928px\" \/>\n<\/p>\n<p><em>Figure 8: JetchatAI showing citations when referencing source documents<\/em>\n<\/p>\n<h2>Feedback and resources<\/h2>\n<p>\n  This post is closely related to the <a href=\"https:\/\/devblogs.microsoft.com\/surface-duo\/android-openai-chatgpt-24\/\">document chat implementation<\/a> post.\n<\/p>\n<p>\n  We\u2019d love your feedback on this post, including any tips or tricks you\u2019ve learned from playing around with ChatGPT prompts.\n<\/p>\n<p>\n  If you have any thoughts or questions, use the <a href=\"http:\/\/aka.ms\/SurfaceDuoSDK-Feedback\">feedback forum<\/a> or message us on <a href=\"https:\/\/twitter.com\/surfaceduodev\">Twitter @surfaceduodev<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hello prompt engineers, Last week\u2019s blog introduced a simple \u201cchat over documents\u201d Android implementation, using some example content from this Azure demo. However, if you take a look at the Azure sample, the output is not only summarized from the input PDFs, but it\u2019s also able to cite which document the answer is drawn from [&hellip;]<\/p>\n","protected":false},"author":570,"featured_media":3598,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[741],"tags":[734,733],"class_list":["post-3593","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","tag-chatgpt","tag-openai"],"acf":[],"blog_post_summary":"<p>Hello prompt engineers, Last week\u2019s blog introduced a simple \u201cchat over documents\u201d Android implementation, using some example content from this Azure demo. 
However, if you take a look at the Azure sample, the output is not only summarized from the input PDFs, but it\u2019s also able to cite which document the answer is drawn from [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/posts\/3593","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/users\/570"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/comments?post=3593"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/posts\/3593\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/media\/3598"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/media?parent=3593"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/categories?post=3593"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/surface-duo\/wp-json\/wp\/v2\/tags?post=3593"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}