Chunking for citations in a document chat

Craig Dunn

Hello prompt engineers,

Last week’s blog introduced a simple “chat over documents” Android implementation, using some example content from this Azure demo. However, if you take a look at the Azure sample, the output is not only summarized from the input PDFs, but it’s also able to cite which document the answer is drawn from (shown in Figure 1). In this blog, we’ll investigate how to add citations to the responses in JetchatAI.


Figure 1: Azure OpenAI demo result shows citations for the information presented in the response

In order to provide similar information in the JetchatAI document chat on Android, we’ll need to update the document parsing (chunking) so that we have enough context to answer questions and identify the source.

Prompt engineering playground

Before spending a lot of time on the parsing algorithm, it makes sense to confirm that we can get the model to understand what we want to achieve. To quickly iterate on prototypes for this feature, I simulated a request/response in the OpenAI Playground, using the existing prompts from the app and some embeddings captured while testing the document chat feature:


Figure 2: OpenAI playground for testing prompt ideas, with a prototype for answering a document chat with cited sources

Figure 2 shows an example chat interaction based on the documents we added to the app in the previous blog. The facts listed in the USER prompt (#4) are examples of the embeddings from testing the existing feature. Each element of the “prompt prototype” is explained below:

  1. Existing system prompt and grounding introduction (unchanged).
  2. Specify which plan the user has, to help answer questions more specifically.
  3. Updates to the system prompt and the grounding prompt to teach the model how to cite sources. The system prompt explains what citations should “look like”, using numbers in square brackets like [1], and the grounding reinforces that citations should be used and listed at the end of the response.
  4. The similar embeddings are now grouped by the document they were extracted from, and the # markdown-style heading on each filename helps the model to group the data that follows. This test data consists of actual embeddings captured while testing the document chat feature previously.
  5. The user’s query, which is added to the end of the grounding data (from embeddings) and prompt.
  6. The model’s response attempts to refer to “your plan”, hopefully distinguishing the plan mentioned in the system prompt (#2) from the features of other plans.
  7. Two citations are provided in the response, because the vision and immunization chunks are from different source documents.
  8. The model correctly adds the cited documents at the end of the response.
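In outline, the prototype in Figure 2 is structured something like this (angle-bracket placeholders stand in for the actual test embeddings and query; the numbers in parentheses match the list above):

SYSTEM     <existing system prompt and grounding introduction> (1)
           The user has Northwind Standard health plan. (2)
           For each piece of information you provide, cite the source
           in brackets like so: [1]. At the end of the answer, always
           list each source with its corresponding number and provide
           the document name, like so [1] Filename.doc (3)

USER       <grounding introduction> (1)
           # <document filename> (4)
           <similar sentences extracted from this document>
           ...repeated for each source document...
           <instruction to answer using the above, with citations> (3)
           <user question> (5)

ASSISTANT  <answer referencing “your plan”, with [1] and [2] citations> (6, 7)
           <cited documents listed by number at the end> (8)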

Slightly changing the user prompt to “does my plan cover contact lenses” (without mentioning immunizations), we can confirm that the answer and the cited documents change:


Figure 3: OpenAI playground example where only one source document is cited

Note that in Figure 3 the citation numbering seems to reflect the position of the “document” in the grounding prompt. Although citations would ideally be numbered from one, I’m going to ignore this for now (another exercise for the reader). The updated prompt and grounding format works well enough to be added to the app for further testing.

Updated chunking and embeddings

Now that we’ve established a prompt that works in the OpenAI playground, we need to update the app to parse the documents differently so that we can re-create the grounding format in code.

Currently, the sentence embeddings are all added without keeping track of the source document. When they’re added to the grounding data, they are ordered by similarity score (highest first).
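For reference, here is a minimal sketch of the data structures the following snippets assume (the names match the code below, but the actual declarations in JetchatAI may differ):

import java.util.SortedMap

// embedding key ("documentId-chunkIndex") -> sentence text
val documentCache = mutableMapOf<String, String>()
// document id -> user-friendly filename (added in this post)
val documentNameCache = mutableMapOf<String, String>()
// similarity score -> embedding key; with natural ordering,
// tailMap(0.8) keeps only the entries scoring 0.8 and above
val sortedVectors: SortedMap<Double, String> = sortedMapOf()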

To implement the prompt and grounding prototyped above, we need to:

  1. Alter the document parsing so that we keep track of which document each embedding comes from,
  2. After we’ve identified the most similar embeddings, group them by document name, and
  3. Update the system and grounding prompts to train the model to create citations.

The code for these changes is shown below (and is in this pull request), followed by final app testing.

Chunking changes

Because the code from last week was already keeping track of a ‘document id’ as it parsed the resource files, minimal changes were needed to keep track of the actual filenames.

First, a new list rawFilenames contains the user-friendly filename for each resource:

val rawResources = listOf(R.raw.benefit_options, R.raw.northwind_standard_benefits_details)
val rawFilenames = listOf<String>("Benefit-options.pdf", "Northwind-Standard-benefits-details.pdf")

Figure 4: adding the user-friendly filename strings (must match the resources order)
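Because the two lists must stay in the same order, a defensive size check could be added here (my suggestion; not part of the original code):

// fail fast if the filename list falls out of sync with the resource list
require(rawResources.size == rawFilenames.size) {
    "rawFilenames must contain one entry per rawResources entry, in the same order"
}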

Then, as the code loops through the resources, we add the user-friendly filename to a cache (documentNameCache), keyed by the ‘document id’ we already store as part of the embeddings key:

for (resId in rawResources) {
    documentId++
    documentNameCache["$documentId"] = rawFilenames[documentId]  // filename will be shown to user

Figure 5: storing the filename to match the documentId for later retrieval

It’s now possible to determine which document a given sentence was found in.
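As a quick illustration (using a hypothetical embedding key; the split('-') parsing matches the grounding code in Figure 7):

val embeddingKey = "1-17"                 // hypothetical key: document id, then chunk index
val fileId = embeddingKey.split('-')[0]   // -> "1"
val filename = documentNameCache[fileId]  // -> the matching rawFilenames entry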

Grounding tweaks

Now that the document filename is stored for each document id, the code building the grounding prompt can group the embeddings under document “headings” so that the model can better understand the context for the embedding strings.

For the document filenames to be useful, the system prompt must be updated to match the prototype in Figure 2. Figure 6 below shows the updated system prompt from the DocumentChatWrapper.kt init function:

grounding = """
   You are a personal assistant for Contoso employees. 
   You will answer questions about Contoso employee benefits from various employee manuals.
   Your answers will be short and concise. 
   Only use the functions you have been provided with.
   The user has Northwind Standard health plan.
   For each piece of information you provide, cite the source in brackets like so: [1].
   At the end of the answer, always list each source with its corresponding number and provide the document name, like so [1] Filename.doc""".trimMargin()

Figure 6: updated system prompt (including a personalization statement about the user’s current plan)

The code in Figure 7 shows the grounding function changes to support citations, producing output similar to the prototype grounding in Figure 2. After ranking the embeddings by similarity (and ignoring results with a similarity score below 0.8), it loops through and groups the matching sentences by document filename:

// keep only results with a similarity score of 0.8 or above
val matches = sortedVectors.tailMap(0.8)
// re-sort based on filename, to group the matching sentences by document
val sortedMatches: SortedMap<String, String> = sortedMapOf()
for (match in matches) {
    val fileId = match.value.split('-')[0] // the document id is the first part of the embedding key
    val filename = documentNameCache[fileId]!!
    val content = documentCache[match.value]!!
    if (sortedMatches.contains(filename)) { // add to the current file's matching sentences
        sortedMatches[filename] += "\n\n$content"
    } else { // first match for this filename
        sortedMatches[filename] = content
    }
}
// loop through the filenames and output the matching sentences for each file
messagePreamble = "The following information is extracted from Contoso employee handbooks and health plan documents:"
for (file in sortedMatches) {
    messagePreamble += "\n\n# ${file.key}\n\n${file.value}\n\n#####\n\n" // use the # markdown-style heading syntax for the filename
}
messagePreamble += "\n\nUse the above information to answer the following question, providing numbered citations for document sources used (mention the cited documents at the end by number). Synthesize the information into a summary paragraph:\n\n"

Figure 7: updated grounding function
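For a question that matches sentences in both documents, the grounding produced by this code would look something like the following (illustrative only; the angle-bracket placeholders stand in for the actual high-similarity sentences):

The following information is extracted from Contoso employee handbooks and health plan documents:

# Benefit-options.pdf

<high-similarity sentences from this document>

#####

# Northwind-Standard-benefits-details.pdf

<high-similarity sentences from this document>

#####

Use the above information to answer the following question, providing numbered citations for document sources used (mention the cited documents at the end by number). Synthesize the information into a summary paragraph: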

Now that the code has been updated to:

  1. Keep track of which document each embedding sentence was found in,
  2. Group high-similarity embedding results by document filename, and
  3. Add instructions in the system and grounding prompts to cite the source of facts in the model’s response,

the responses in the JetchatAI document chat should include numbered citations.

Citations in the chat

With these relatively small changes in the code, the #document-chat conversation in JetchatAI will now add citations when asked questions about the fictitious Contoso employee benefits documents that are retrieved using RAG (retrieval-augmented generation) principles:


Figure 8: JetchatAI showing citations when referencing source documents

Feedback and resources

This post is closely related to the document chat implementation post.

We’d love your feedback on this post, including any tips or tricks you’ve learned from playing around with ChatGPT prompts.

If you have any thoughts or questions, use the feedback forum or message us on Twitter @surfaceduodev.
