November 4th, 2024

Managing Chat History for Large Language Models (LLMs)

Large Language Models (LLMs) operate with a defined limit on the number of tokens they can process at once, referred to as the context window. Large inputs increase cost and latency, and requests that exceed the limit will typically fail. Therefore, it is essential to manage the size of the input sent to the LLM, particularly when using chat completion models, where the chat history grows with every turn. This involves effectively managing chat history and implementing strategies to truncate it when it becomes too large.

Key Considerations for Truncating Chat History

When truncating chat history, consider the following:

  1. System Message: This is typically the first message in the chat history and guides the model’s responses. It’s crucial to retain this message to avoid unpredictable behaviour from the LLM.
  2. Function Calling Messages: These messages consist of pairs of requests and responses that facilitate interaction with external functions. Dropping a request while keeping its corresponding response (or vice versa) results in an invalid sequence of messages, so these pairs must be kept or removed together; a minimal check for identifying them is sketched below.
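To make the second point concrete, a truncation strategy needs a way to recognize function calling related messages so it can keep or drop a request/response pair as a unit. The helper below is a minimal sketch (not part of the official samples) that checks for Semantic Kernel's FunctionCallContent and FunctionResultContent items; real function calling sequences can span several messages, so treat it only as a starting point.
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

internal static class FunctionCallingMessages
{
    /// <summary>
    /// Returns true if the message is part of a function calling exchange, i.e. an assistant
    /// message containing function call requests or a tool message containing function results.
    /// A truncation strategy should keep or drop these messages as a complete sequence.
    /// </summary>
    public static bool IsFunctionCallingMessage(this ChatMessageContent message) =>
        message.Role == AuthorRole.Tool ||
        message.Items.Any(item => item is FunctionCallContent or FunctionResultContent);
}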

Example Scenario

Imagine you are developing a co-pilot that provides information about books related to different cities. The system message used might be: “You are a librarian and expert on books about cities.”

If a user asks the following questions:

  1. Recommend a list of books about Seattle.
  2. Recommend a list of books about Dublin.
  3. Recommend a list of books about Amsterdam.
  4. Recommend a list of books about Paris.
  5. Recommend a list of books about London.

Sending the entire chat history for each query might consume approximately 9000 tokens (assuming you’re using a model like gpt-4o-mini). This approach is inefficient, as the LLM only requires the System Message and the last user message.
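If you want to see this cost for yourself, you can estimate the token count of a chat history with a tokenizer library before sending it. The sketch below assumes the Microsoft.ML.Tokenizers package and gives only an approximation, since providers add per-message overhead that a plain text tokenizer does not account for.
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel.ChatCompletion;

internal static class TokenEstimator
{
    /// <summary>
    /// Rough estimate of the tokens a chat history will consume.
    /// Uses the o200k_base encoding shared by the gpt-4o model family.
    /// </summary>
    public static int EstimateTokenCount(ChatHistory chatHistory)
    {
        Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
        return chatHistory.Sum(message => tokenizer.CountTokens(message.Content ?? string.Empty));
    }
}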

Strategies for Truncating Chat History

Several strategies can be employed to truncate chat history effectively:

  1. Sending Only the Last N Messages: Retain the last few messages, including the System Message.
  2. Limiting Based on Maximum Token Count: Ensure the total token count remains within a specified limit.
  3. Summarizing Older Messages: Create a summary of previous messages to maintain context while reducing token usage.

In the following sections, we will explore each of these strategies in detail, along with sample implementations.

Defining a Chat History Reducer Abstraction

First, we define an IChatHistoryReducer interface that supports various strategies for reducing chat history. This interface allows for asynchronous processing, which is beneficial for strategies that involve LLM calls for summarization.

The code provided in this blog post is sample code and is not part of the officially supported Semantic Kernel API. Our goal is to collaborate with the .NET team to define a set of abstractions and add them to Microsoft.Extensions.AI. The code shown here is also not complete; please refer to MultipleProvider_ChatHistoryReducer.cs for the full source code.
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

/// <summary>
/// Interface for reducing the chat history before sending it to the chat completion provider.
/// </summary>
public interface IChatHistoryReducer
{
    /// <summary>
    /// Reduce the <see cref="ChatHistory"/> before sending it to the <see cref="IChatCompletionService"/>.
    /// </summary>
    /// <param name="chatHistory">Instance of <see cref="ChatHistory"/> to be reduced.</param>
    /// <param name="cancellationToken">Cancellation token.</param>
    /// <returns>The reduced collection of messages, or null if no reduction was performed.</returns>
    Task<IEnumerable<ChatMessageContent>?> ReduceAsync(ChatHistory chatHistory, CancellationToken cancellationToken);
}
To integrate a chat history reducer, the samples use the Decorator pattern to add this behaviour to an existing chat completion service. The decorator optionally reduces the chat history to a subset of the messages and passes only that subset to the underlying chat completion service.
using System.Runtime.CompilerServices;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

/// <summary>
/// Extension methods for <see cref="IChatCompletionService"/>
/// </summary>
internal static class ChatCompletionServiceExtensions
{
    /// <summary>
    /// Adds a wrapper to an instance of <see cref="IChatCompletionService"/> which will use
    /// the provided instance of <see cref="IChatHistoryReducer"/> to reduce the size of
    /// the <see cref="ChatHistory"/> before sending it to the model.
    /// </summary>
    /// <param name="service">Instance of <see cref="IChatCompletionService"/></param>
    /// <param name="reducer">Instance of <see cref="IChatHistoryReducer"/></param>
    public static IChatCompletionService UsingChatHistoryReducer(this IChatCompletionService service, IChatHistoryReducer reducer)
    {
        return new ChatCompletionServiceWithReducer(service, reducer);
    }
}

/// <summary>
/// Instance of <see cref="IChatCompletionService"/> which will invoke a delegate
/// to reduce the size of the <see cref="ChatHistory"/> before sending it to the model.
/// </summary>
public sealed class ChatCompletionServiceWithReducer(IChatCompletionService service, IChatHistoryReducer reducer) : IChatCompletionService
{
    /// <inheritdoc/>
    public IReadOnlyDictionary<string, object?> Attributes => service.Attributes;

    /// <inheritdoc/>
    public async Task<IReadOnlyList<ChatMessageContent>> GetChatMessageContentsAsync(
        ChatHistory chatHistory,
        PromptExecutionSettings? executionSettings = null,
        Kernel? kernel = null,
        CancellationToken cancellationToken = default)
    {
        var reducedMessages = await reducer.ReduceAsync(chatHistory, cancellationToken).ConfigureAwait(false);
        var history = reducedMessages is null ? chatHistory : new ChatHistory(reducedMessages);

        return await service.GetChatMessageContentsAsync(history, executionSettings, kernel, cancellationToken).ConfigureAwait(false);
    }

    /// <inheritdoc/>
    public async IAsyncEnumerable<StreamingChatMessageContent> GetStreamingChatMessageContentsAsync(
        ChatHistory chatHistory,
        PromptExecutionSettings? executionSettings = null,
        Kernel? kernel = null,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        var reducedMessages = await reducer.ReduceAsync(chatHistory, cancellationToken).ConfigureAwait(false);
        var history = reducedMessages is null ? chatHistory : new ChatHistory(reducedMessages);

        var messages = service.GetStreamingChatMessageContentsAsync(history, executionSettings, kernel, cancellationToken);
        await foreach (var message in messages)
        {
            yield return message;
        }
    }
}
The next sections contain samples which show the effect of the different strategies for reducing the size of the chat history sent in each request to the LLM.

Truncating Based on Message Count

The simplest option for reducing chat history is to send only the last N messages to the LLM. This strategy assumes the most relevant context is available in the most recent messages.
Care must be taken to avoid sending incomplete sequences of function calling related messages to the LLM. In the sample code provided, any function calling related messages at the start of the truncated window are skipped and only the LLM response (which contains its interpretation of the function calling results) is included.
The sample below truncates the chat history so that only two messages are sent to the LLM with each request, i.e., the system message and the last user message. In this case there is no degradation in the responses from the LLM, and approximately 6000 tokens are saved.
OpenAIChatCompletionService openAiChatService = new(
    modelId: TestConfiguration.OpenAI.ChatModelId,
    apiKey: TestConfiguration.OpenAI.ApiKey);

var truncatedSize = 2; // keep system message and last user message only
IChatCompletionService chatService = openAiChatService.UsingChatHistoryReducer(new TruncatingChatHistoryReducer(truncatedSize));

var chatHistory = new ChatHistory("You are a librarian and expert on books about cities");

string[] userMessages = [
    "Recommend a list of books about Seattle",
    "Recommend a list of books about Dublin",
    "Recommend a list of books about Amsterdam",
    "Recommend a list of books about Paris",
    "Recommend a list of books about London"
];

int totalTokenCount = 0;
foreach (var userMessage in userMessages)
{
    chatHistory.AddUserMessage(userMessage);
    Console.WriteLine($"\n>>> User:\n{userMessage}");

    var response = await chatService.GetChatMessageContentAsync(chatHistory);
    chatHistory.AddAssistantMessage(response.Content!);
    Console.WriteLine($"\n>>> Assistant:\n{response.Content!}");

    if (response.InnerContent is OpenAI.Chat.ChatCompletion chatCompletion)
    {
        totalTokenCount += chatCompletion.Usage?.TotalTokenCount ?? 0;
    }
}

// Example total token usage is approximately: 3000
Console.WriteLine($"Total Token Count: {totalTokenCount}");

Truncating Based on Maximum Token Count

The next strategy truncates based on a specified maximum token count. The reducer implementation always retains the system message, which is very important to ensure the LLM responds in an appropriate manner.
This approach does a good job of restricting overall token usage but highlights one of the downsides of reducing the chat history. You will notice in this sample that the user messages are related, i.e., the user asks about restaurants in Seattle and then, more specifically, about Italian and Korean restaurants. However, when you run this sample you will see that, because the history is truncated, the context of the current city is lost.
Assert.NotNull(TestConfiguration.OpenAI.ChatModelId);
Assert.NotNull(TestConfiguration.OpenAI.ApiKey);

OpenAIChatCompletionService openAiChatService = new(
    modelId: TestConfiguration.OpenAI.ChatModelId,
    apiKey: TestConfiguration.OpenAI.ApiKey);
IChatCompletionService chatService = openAiChatService.UsingChatHistoryReducer(new MaxTokensChatHistoryReducer(100));

var chatHistory = new ChatHistory();
chatHistory.AddSystemMessageWithTokenCount("You are an expert on the best restaurants in the world. Keep responses short.");

string[] userMessages = [
    "Recommend restaurants in Seattle",
    "What is the best Italian restaurant?",
    "What is the best Korean restaurant?",
    "Recommend restaurants in Dublin",
    "What is the best Indian restaurant?",
    "What is the best Japanese restaurant?",
];

int totalTokenCount = 0;
foreach (var userMessage in userMessages)
{
    chatHistory.AddUserMessageWithTokenCount(userMessage);
    Console.WriteLine($"\n>>> User:\n{userMessage}");

    var response = await chatService.GetChatMessageContentAsync(chatHistory);
    chatHistory.AddAssistantMessageWithTokenCount(response.Content!);
    Console.WriteLine($"\n>>> Assistant:\n{response.Content!}");

    if (response.InnerContent is OpenAI.Chat.ChatCompletion chatCompletion)
    {
        totalTokenCount += chatCompletion.Usage?.TotalTokenCount ?? 0;
    }
}

// Example total token usage is approximately: 3000
Console.WriteLine($"Total Token Count: {totalTokenCount}");

Summarizing Older Messages

The final strategy summarizes the older chat history and sends the system message, the chat history summary, and the most recent messages to the LLM. This approach helps with the problem of maintaining context: for the sample set of messages used, it retains the correct context and gets the expected responses. The sample updates the main chat history with the summary messages and sets a special entry in the chat message metadata so these summary messages can be identified. If you use this approach in your application, you may want to hide the summary messages in the user interface.
To summarize older chat messages you can also try different models, e.g., a local Small Language Model (SLM) might provide good enough performance.
OpenAIChatCompletionService openAiChatService = new(
        modelId: TestConfiguration.OpenAI.ChatModelId,
        apiKey: TestConfiguration.OpenAI.ApiKey);
IChatCompletionService chatService = openAiChatService.UsingChatHistoryReducer(new SummarizingChatHistoryReducer(openAiChatService, 2, 4));

var chatHistory = new ChatHistory("You are an expert on the best restaurants in every city. Answer for the city the user has asked about.");

string[] userMessages = [
    "Recommend restaurants in Seattle",
    "What is the best Italian restaurant?",
    "What is the best Korean restaurant?",
    "Recommend restaurants in Dublin",
    "What is the best Indian restaurant?",
    "What is the best Japanese restaurant?",
];

int totalTokenCount = 0;
foreach (var userMessage in userMessages)
{
    chatHistory.AddUserMessage(userMessage);
    Console.WriteLine($"\n>>> User:\n{userMessage}");

    var response = await chatService.GetChatMessageContentAsync(chatHistory);
    chatHistory.AddAssistantMessage(response.Content!);
    Console.WriteLine($"\n>>> Assistant:\n{response.Content!}");

    if (response.InnerContent is OpenAI.Chat.ChatCompletion chatCompletion)
    {
        totalTokenCount += chatCompletion.Usage?.TotalTokenCount ?? 0;
    }
}

// Example total token usage is approximately: 3000
Console.WriteLine($"Total Token Count: {totalTokenCount}");

Please reach out if you have any questions or feedback through our Semantic Kernel GitHub Discussion Channel. We look forward to hearing from you! We would also love your support — if you’ve enjoyed using Semantic Kernel, give us a star on GitHub.

Author

Mark Wallace
Principal Engineering Manager

