Large Language Models (LLMs) have a fixed limit on the number of tokens they can process in a single request, referred to as the context window. Exceeding this limit causes requests to fail, and even within it, larger inputs increase cost and latency. Therefore, it is essential to manage the size of the input sent to the LLM, particularly when using chat completion models. This involves managing chat history effectively and truncating it when it grows too large.
Key Considerations for Truncating Chat History
When truncating chat history, consider the following:
- System Message: This is typically the first message in the chat history and guides the model’s responses. It is crucial to retain this message; dropping it can lead to unpredictable behavior from the LLM.
- Function Calling Messages: These messages consist of pairs of requests and responses that facilitate interaction with external functions. Omitting a request without its corresponding response can lead to an invalid sequence of messages.
Example Scenario
Imagine you are developing a co-pilot that provides information about books related to different cities. The system message used might be: “You are a librarian and expert on books about cities.”
If a user asks the following questions:
- Recommend a list of books about Seattle.
- Recommend a list of books about Dublin.
- Recommend a list of books about Amsterdam.
- Recommend a list of books about Paris.
- Recommend a list of books about London.
Sending the entire chat history with each query consumes approximately 9000 tokens in total across these requests (assuming a model like gpt-4o-mini). This approach is inefficient, because the LLM only needs the System Message and the last user message to answer each question.
Strategies for Truncating Chat History
Several strategies can be employed to truncate chat history effectively:
- Sending Only the Last N Messages: Retain the last few messages, including the System Message.
- Limiting Based on Maximum Token Count: Ensure the total token count remains within a specified limit.
- Summarizing Older Messages: Create a summary of previous messages to maintain context while reducing token usage.
In the following sections, we will explore each of these strategies in detail, along with sample implementations.
Defining a Chat History Reducer Abstraction
First, we define an IChatHistoryReducer interface that supports various strategies for reducing chat history. This interface allows for asynchronous processing, which is beneficial for strategies that involve LLM calls for summarization.
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

/// <summary>
/// Interface for reducing the chat history before sending it to the chat completion provider.
/// </summary>
public interface IChatHistoryReducer
{
    /// <summary>
    /// Reduce the <see cref="ChatHistory"/> before sending it to the <see cref="IChatCompletionService"/>.
    /// </summary>
    /// <param name="chatHistory">Instance of <see cref="ChatHistory"/> to be reduced.</param>
    /// <param name="cancellationToken">Cancellation token.</param>
    Task<IEnumerable<ChatMessageContent>?> ReduceAsync(ChatHistory chatHistory, CancellationToken cancellationToken);
}
Next, we add an extension method and a decorator that wrap an existing IChatCompletionService so the reducer is applied automatically before each request:

using System.Runtime.CompilerServices;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

/// <summary>
/// Extension methods for <see cref="IChatCompletionService"/>.
/// </summary>
internal static class ChatCompletionServiceExtensions
{
    /// <summary>
    /// Adds a wrapper to an instance of <see cref="IChatCompletionService"/> which will use
    /// the provided instance of <see cref="IChatHistoryReducer"/> to reduce the size of
    /// the <see cref="ChatHistory"/> before sending it to the model.
    /// </summary>
    /// <param name="service">Instance of <see cref="IChatCompletionService"/>.</param>
    /// <param name="reducer">Instance of <see cref="IChatHistoryReducer"/>.</param>
    public static IChatCompletionService UsingChatHistoryReducer(this IChatCompletionService service, IChatHistoryReducer reducer)
    {
        return new ChatCompletionServiceWithReducer(service, reducer);
    }
}
/// <summary>
/// Instance of <see cref="IChatCompletionService"/> which will invoke the provided
/// <see cref="IChatHistoryReducer"/> to reduce the size of the <see cref="ChatHistory"/>
/// before sending it to the model.
/// </summary>
public sealed class ChatCompletionServiceWithReducer(IChatCompletionService service, IChatHistoryReducer reducer) : IChatCompletionService
{
    /// <inheritdoc/>
    public IReadOnlyDictionary<string, object?> Attributes => service.Attributes;

    /// <inheritdoc/>
    public async Task<IReadOnlyList<ChatMessageContent>> GetChatMessageContentsAsync(
        ChatHistory chatHistory,
        PromptExecutionSettings? executionSettings = null,
        Kernel? kernel = null,
        CancellationToken cancellationToken = default)
    {
        var reducedMessages = await reducer.ReduceAsync(chatHistory, cancellationToken).ConfigureAwait(false);
        var history = reducedMessages is null ? chatHistory : new ChatHistory(reducedMessages);

        return await service.GetChatMessageContentsAsync(history, executionSettings, kernel, cancellationToken).ConfigureAwait(false);
    }

    /// <inheritdoc/>
    public async IAsyncEnumerable<StreamingChatMessageContent> GetStreamingChatMessageContentsAsync(
        ChatHistory chatHistory,
        PromptExecutionSettings? executionSettings = null,
        Kernel? kernel = null,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        var reducedMessages = await reducer.ReduceAsync(chatHistory, cancellationToken).ConfigureAwait(false);
        var history = reducedMessages is null ? chatHistory : new ChatHistory(reducedMessages);

        var messages = service.GetStreamingChatMessageContentsAsync(history, executionSettings, kernel, cancellationToken);
        await foreach (var message in messages)
        {
            yield return message;
        }
    }
}
Truncating Based on Message Count
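The sample below uses a TruncatingChatHistoryReducer that keeps the system message plus the most recent messages, up to a configured count. Its implementation is not shown above; the following is a minimal sketch built on the IChatHistoryReducer interface defined earlier (it assumes .NET implicit usings for LINQ and tasks, and it ignores the function-calling pairing concern discussed at the start, which a production reducer should handle):

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

/// <summary>
/// Reduces the chat history to the system message (if present) plus the most recent messages.
/// </summary>
public sealed class TruncatingChatHistoryReducer(int truncatedSize) : IChatHistoryReducer
{
    public Task<IEnumerable<ChatMessageContent>?> ReduceAsync(ChatHistory chatHistory, CancellationToken cancellationToken)
    {
        // Returning null signals that no reduction was necessary.
        if (chatHistory.Count <= truncatedSize)
        {
            return Task.FromResult<IEnumerable<ChatMessageContent>?>(null);
        }

        var reduced = new List<ChatMessageContent>();

        // Always keep the system message so the model retains its instructions.
        var systemMessage = chatHistory.FirstOrDefault(m => m.Role == AuthorRole.System);
        if (systemMessage is not null)
        {
            reduced.Add(systemMessage);
        }

        // Fill the remaining budget with the most recent non-system messages.
        var budget = Math.Max(1, truncatedSize - reduced.Count);
        reduced.AddRange(chatHistory.Where(m => m.Role != AuthorRole.System).TakeLast(budget));

        return Task.FromResult<IEnumerable<ChatMessageContent>?>(reduced);
    }
}

With truncatedSize set to 2, only the system message and the latest user message are sent to the model, which is why the token usage in the sample below drops so sharply.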
OpenAIChatCompletionService openAiChatService = new(
    modelId: TestConfiguration.OpenAI.ChatModelId,
    apiKey: TestConfiguration.OpenAI.ApiKey);

var truncatedSize = 2; // keep system message and last user message only
IChatCompletionService chatService = openAiChatService.UsingChatHistoryReducer(new TruncatingChatHistoryReducer(truncatedSize));

var chatHistory = new ChatHistory("You are a librarian and expert on books about cities");
string[] userMessages = [
    "Recommend a list of books about Seattle",
    "Recommend a list of books about Dublin",
    "Recommend a list of books about Amsterdam",
    "Recommend a list of books about Paris",
    "Recommend a list of books about London"
];

int totalTokenCount = 0;
foreach (var userMessage in userMessages)
{
    chatHistory.AddUserMessage(userMessage);
    Console.WriteLine($"\n>>> User:\n{userMessage}");

    var response = await chatService.GetChatMessageContentAsync(chatHistory);
    chatHistory.AddAssistantMessage(response.Content!);
    Console.WriteLine($"\n>>> Assistant:\n{response.Content!}");

    if (response.InnerContent is OpenAI.Chat.ChatCompletion chatCompletion)
    {
        totalTokenCount += chatCompletion.Usage?.TotalTokenCount ?? 0;
    }
}

// Example total token usage is approximately: 3000
Console.WriteLine($"Total Token Count: {totalTokenCount}");
Truncating Based on Maximum Token Count
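This strategy requires knowing how many tokens each message contributes. In the sample below, the AddSystemMessageWithTokenCount, AddUserMessageWithTokenCount and AddAssistantMessageWithTokenCount helpers are assumed to attach a token count to each message's metadata (a sketch of these helpers follows the sample), and the MaxTokensChatHistoryReducer drops the oldest messages until the remainder fits within the budget. Here is a minimal sketch of such a reducer; the "TokenCount" metadata key and the exact trimming behavior are assumptions, not necessarily what the original sample uses:

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

/// <summary>
/// Reduces the chat history so that the total token count stays under a maximum,
/// always keeping the system message and preferring the most recent messages.
/// </summary>
public sealed class MaxTokensChatHistoryReducer(int maxTokenCount) : IChatHistoryReducer
{
    private const string TokenCountKey = "TokenCount";

    public Task<IEnumerable<ChatMessageContent>?> ReduceAsync(ChatHistory chatHistory, CancellationToken cancellationToken)
    {
        var systemMessage = chatHistory.FirstOrDefault(m => m.Role == AuthorRole.System);
        var budget = maxTokenCount - GetTokenCount(systemMessage);

        // Walk backwards from the most recent message, keeping messages while they fit.
        var kept = new List<ChatMessageContent>();
        for (int i = chatHistory.Count - 1; i >= 0; i--)
        {
            var message = chatHistory[i];
            if (message.Role == AuthorRole.System)
            {
                continue;
            }

            var tokenCount = GetTokenCount(message);
            if (tokenCount > budget)
            {
                break;
            }

            budget -= tokenCount;
            kept.Add(message);
        }

        // Nothing was dropped, so signal "no reduction needed".
        if (kept.Count == chatHistory.Count - (systemMessage is null ? 0 : 1))
        {
            return Task.FromResult<IEnumerable<ChatMessageContent>?>(null);
        }

        kept.Reverse();
        if (systemMessage is not null)
        {
            kept.Insert(0, systemMessage);
        }

        return Task.FromResult<IEnumerable<ChatMessageContent>?>(kept);
    }

    // Reads the token count that the Add*MessageWithTokenCount helpers stored in metadata.
    // Messages without the metadata entry count as zero in this sketch.
    private static int GetTokenCount(ChatMessageContent? message) =>
        message?.Metadata?.TryGetValue(TokenCountKey, out var value) == true && value is int count ? count : 0;
}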
Assert.NotNull(TestConfiguration.OpenAI.ChatModelId);
Assert.NotNull(TestConfiguration.OpenAI.ApiKey);

OpenAIChatCompletionService openAiChatService = new(
    modelId: TestConfiguration.OpenAI.ChatModelId,
    apiKey: TestConfiguration.OpenAI.ApiKey);
IChatCompletionService chatService = openAiChatService.UsingChatHistoryReducer(new MaxTokensChatHistoryReducer(100));

var chatHistory = new ChatHistory();
chatHistory.AddSystemMessageWithTokenCount("You are an expert on the best restaurants in the world. Keep responses short.");

string[] userMessages = [
    "Recommend restaurants in Seattle",
    "What is the best Italian restaurant?",
    "What is the best Korean restaurant?",
    "Recommend restaurants in Dublin",
    "What is the best Indian restaurant?",
    "What is the best Japanese restaurant?",
];

int totalTokenCount = 0;
foreach (var userMessage in userMessages)
{
    chatHistory.AddUserMessageWithTokenCount(userMessage);
    Console.WriteLine($"\n>>> User:\n{userMessage}");

    var response = await chatService.GetChatMessageContentAsync(chatHistory);
    chatHistory.AddAssistantMessageWithTokenCount(response.Content!);
    Console.WriteLine($"\n>>> Assistant:\n{response.Content!}");

    if (response.InnerContent is OpenAI.Chat.ChatCompletion chatCompletion)
    {
        totalTokenCount += chatCompletion.Usage?.TotalTokenCount ?? 0;
    }
}

// Example total token usage is approximately: 3000
Console.WriteLine($"Total Token Count: {totalTokenCount}");
Summarizing Older Messages
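The SummarizingChatHistoryReducer used below takes the chat service itself so it can call the model to summarize older messages. A minimal sketch follows; interpreting its two numeric arguments as the number of recent messages to keep verbatim and the number of additional messages allowed before summarization kicks in is an assumption, and the summarization prompt is illustrative:

using System.Text;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

/// <summary>
/// Reduces the chat history by summarizing older messages with the provided chat service,
/// keeping the system message and the most recent messages verbatim.
/// </summary>
public sealed class SummarizingChatHistoryReducer(
    IChatCompletionService service,
    int truncatedSize,
    int summarizationThreshold) : IChatHistoryReducer
{
    public async Task<IEnumerable<ChatMessageContent>?> ReduceAsync(ChatHistory chatHistory, CancellationToken cancellationToken)
    {
        var systemMessage = chatHistory.FirstOrDefault(m => m.Role == AuthorRole.System);
        var conversation = chatHistory.Where(m => m.Role != AuthorRole.System).ToList();

        // Only reduce once the history has grown past the configured threshold.
        if (conversation.Count <= truncatedSize + summarizationThreshold)
        {
            return null;
        }

        // Messages older than the last 'truncatedSize' messages get summarized.
        var olderMessages = conversation.Take(conversation.Count - truncatedSize);
        var recentMessages = conversation.Skip(conversation.Count - truncatedSize);

        var transcript = new StringBuilder();
        foreach (var message in olderMessages)
        {
            transcript.AppendLine($"{message.Role}: {message.Content}");
        }

        // Ask the model to compress the older part of the conversation.
        var summarizationHistory = new ChatHistory("Summarize the following conversation, preserving key facts and user preferences.");
        summarizationHistory.AddUserMessage(transcript.ToString());
        var summary = await service.GetChatMessageContentAsync(summarizationHistory, cancellationToken: cancellationToken).ConfigureAwait(false);

        var reduced = new List<ChatMessageContent>();
        if (systemMessage is not null)
        {
            reduced.Add(systemMessage);
        }
        reduced.Add(new ChatMessageContent(AuthorRole.Assistant, $"Summary of earlier conversation: {summary.Content}"));
        reduced.AddRange(recentMessages);
        return reduced;
    }
}

Note that this sketch re-summarizes on every request once the threshold is crossed; a production implementation would typically write the summary back into the history or cache it to avoid repeated summarization calls.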
OpenAIChatCompletionService openAiChatService = new(
    modelId: TestConfiguration.OpenAI.ChatModelId,
    apiKey: TestConfiguration.OpenAI.ApiKey);
IChatCompletionService chatService = openAiChatService.UsingChatHistoryReducer(new SummarizingChatHistoryReducer(openAiChatService, 2, 4));

var chatHistory = new ChatHistory("You are an expert on the best restaurants in every city. Answer for the city the user has asked about.");
string[] userMessages = [
    "Recommend restaurants in Seattle",
    "What is the best Italian restaurant?",
    "What is the best Korean restaurant?",
    "Recommend restaurants in Dublin",
    "What is the best Indian restaurant?",
    "What is the best Japanese restaurant?",
];

int totalTokenCount = 0;
foreach (var userMessage in userMessages)
{
    chatHistory.AddUserMessage(userMessage);
    Console.WriteLine($"\n>>> User:\n{userMessage}");

    var response = await chatService.GetChatMessageContentAsync(chatHistory);
    chatHistory.AddAssistantMessage(response.Content!);
    Console.WriteLine($"\n>>> Assistant:\n{response.Content!}");

    if (response.InnerContent is OpenAI.Chat.ChatCompletion chatCompletion)
    {
        totalTokenCount += chatCompletion.Usage?.TotalTokenCount ?? 0;
    }
}

// Example total token usage is approximately: 3000
Console.WriteLine($"Total Token Count: {totalTokenCount}");
Please reach out if you have any questions or feedback through our Semantic Kernel GitHub Discussion Channel. We look forward to hearing from you! We would also love your support — if you’ve enjoyed using Semantic Kernel, give us a star on GitHub.