February 24th, 2025

Hybrid Model Orchestration

Sergey Menshykh
Principal Software Engineer

Hybrid model orchestration is a powerful technique that AI applications can use to intelligently select and switch between multiple models based on various criteria, all while remaining transparent to the calling code. Models can be chosen by factors such as the prompt’s input token size and each model’s min/max token capacity, or by data sensitivity, where sensitive inference runs against local models and everything else against cloud models; the orchestrator can then return the fastest response, the most relevant response, or the first available model’s response. The technique also provides a robust fallback mechanism: if one model fails, another can seamlessly take over. In this blog post, we will explore the fallback mechanism, which is just one implementation of the technique, and demonstrate its application through a practical example.

Benefits of Hybrid Model Orchestration

  • Enhanced Flexibility: The application can dynamically choose the best model based on the current context or requirements.
  • Seamless Integration: The consumer code does not need to be aware that it is interacting with an orchestrator. This transparency simplifies integration and reduces complexity for developers.
  • Enhanced Reliability: By having multiple models available, the application can continue to function smoothly even if one model fails, ensuring continuous operation.

Example Implementation

Let’s look at an example implementation of hybrid model orchestration. The following code demonstrates how to use a FallbackChatClient to perform chat completion, falling back to an available model when the primary model is unavailable.

public async Task FallbackToAvailableModelAsync()
{
    IKernelBuilder kernelBuilder = Kernel.CreateBuilder();

    // Create and register an unavailable chat client that fails with a 503 Service Unavailable HTTP status code
    kernelBuilder.Services.AddSingleton<IChatClient>(CreateUnavailableOpenAIChatClient());

    // Create and register a cloud available chat client
    kernelBuilder.Services.AddSingleton<IChatClient>(CreateAzureOpenAIChatClient());

    // Create and register a fallback chat client that will fall back to the available chat client when the unavailable one fails
    kernelBuilder.Services.AddSingleton((sp) =>
    {
        IEnumerable<IChatClient> chatClients = sp.GetServices<IChatClient>();

        return new FallbackChatClient(chatClients.ToList()).AsChatCompletionService();
    });

    Kernel kernel = kernelBuilder.Build();
    kernel.ImportPluginFromFunctions("Weather", [KernelFunctionFactory.CreateFromMethod(() => "It's sunny", "GetWeather")]);

    AzureOpenAIPromptExecutionSettings settings = new()
    {
        FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
    };

    FunctionResult result = await kernel.InvokePromptAsync("Do I need an umbrella?", new(settings));

    Output.WriteLine(result);
}
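
The CreateUnavailableOpenAIChatClient and CreateAzureOpenAIChatClient helpers are not shown above; they simply construct the underlying chat clients. As a rough, hypothetical stand-in for the first helper, the snippet below returns a client that always fails with 503 Service Unavailable, which is enough to exercise the fallback path. The UnavailableChatClient type is an assumption made for illustration only; refer to the linked sample for the actual helpers. CreateAzureOpenAIChatClient would, in the same spirit, wrap a real Azure OpenAI chat deployment and is omitted here.

// Hypothetical stand-in for CreateUnavailableOpenAIChatClient: always fails with 503 Service Unavailable.
private static IChatClient CreateUnavailableOpenAIChatClient() => new UnavailableChatClient();

private sealed class UnavailableChatClient : IChatClient
{
    public Task<ChatResponse> GetResponseAsync(IList<ChatMessage> chatMessages, ChatOptions? options = null, CancellationToken cancellationToken = default)
        => throw new HttpRequestException("Simulated outage.", null, HttpStatusCode.ServiceUnavailable);

    public IAsyncEnumerable<ChatResponseUpdate> GetStreamingResponseAsync(IList<ChatMessage> chatMessages, ChatOptions? options = null, CancellationToken cancellationToken = default)
        => throw new HttpRequestException("Simulated outage.", null, HttpStatusCode.ServiceUnavailable);

    public object? GetService(Type serviceType, object? serviceKey = null) => null;

    public void Dispose() { }
}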

internal sealed class FallbackChatClient : IChatClient  
{  
    private readonly IList<IChatClient> _chatClients;  
    private static readonly List<HttpStatusCode> s_defaultFallbackStatusCodes = new()  
    {  
        HttpStatusCode.InternalServerError,  
        HttpStatusCode.NotImplemented,  
        HttpStatusCode.BadGateway,  
        HttpStatusCode.ServiceUnavailable,  
        HttpStatusCode.GatewayTimeout  
    };  
  
    public FallbackChatClient(IList<IChatClient> chatClients)  
    {  
        this._chatClients = chatClients?.Any() == true ? chatClients : throw new ArgumentException("At least one chat client must be provided.", nameof(chatClients));  
    }  
  
    public List<HttpStatusCode>? FallbackStatusCodes { get; set; }  
  
    public async Task<ChatResponse> GetResponseAsync(IList<ChatMessage> chatMessages, ChatOptions? options = null, CancellationToken cancellationToken = default)  
    {  
        for (int i = 0; i < this._chatClients.Count; i++)  
        {  
            var chatClient = this._chatClients.ElementAt(i);  
            try  
            {  
                return await chatClient.GetResponseAsync(chatMessages, options, cancellationToken).ConfigureAwait(false);  
            }  
            catch (Exception ex)  
            {  
                if (this.ShouldFallbackToNextClient(ex, i, this._chatClients.Count))  
                {  
                    continue;  
                }  
  
                throw;  
            }  
        }  
  
        throw new InvalidOperationException("Neither of the chat clients could complete the inference.");  
    }

    public async IAsyncEnumerable<ChatResponseUpdate> GetStreamingResponseAsync(IList<ChatMessage> chatMessages, ChatOptions? options = null, [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        // Applies the same fallback logic as GetResponseAsync to the streaming API: try each client
        // in turn and move on to the next one when a fallback-eligible failure occurs. The full
        // implementation is omitted here for brevity; see the sample linked below. The yield break
        // keeps this placeholder compilable.
        yield break;
    }
  
    private bool ShouldFallbackToNextClient(Exception ex, int clientIndex, int numberOfClients)  
    {  
        if (clientIndex == numberOfClients - 1)  
        {  
            return false;  
        }  
  
        HttpStatusCode? statusCode = ex switch  
        {  
            HttpOperationException operationException => operationException.StatusCode,  
            HttpRequestException httpRequestException => httpRequestException.StatusCode,  
            ClientResultException clientResultException => (HttpStatusCode?)clientResultException.Status,  
            _ => throw new InvalidOperationException($"Unsupported exception type: {ex.GetType()}."),  
        };  
  
        if (statusCode is null)  
        {  
            throw new InvalidOperationException("The exception does not contain an HTTP status code.");  
        }  
  
        return (this.FallbackStatusCodes ?? s_defaultFallbackStatusCodes).Contains(statusCode!.Value);  
    }  
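
    // In addition to the members above, IChatClient requires GetService and Dispose.
    // Simple no-op implementations are assumed to be sufficient here, since the decorator
    // does not own the lifetime of the wrapped clients.
    public object? GetService(Type serviceType, object? serviceKey = null) => null;

    public void Dispose()
    {
        // Nothing to dispose.
    }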
}

For a full implementation, refer to the sample provided by Microsoft Semantic Kernel on GitHub.

Explanation

In this example, the FallbackChatClient class is designed to handle multiple chat clients and switch between them based on their availability. The FallbackToAvailableModelAsync method demonstrates how to use this class to perform chat completion with fallback support.

  • Creating and Registering Chat Clients:
    • The code initializes an unavailable chat client that simulates a failure by returning a 503 Service Unavailable HTTP status code.
    • It also creates a cloud-based available chat client that represents a functional service.
    • Both the unavailable and available chat clients are registered with the service container using dependency injection.
  • Setting Up Fallback Mechanism:
    • The FallbackChatClient is set up with the list of registered chat clients.
    • This client will first attempt to use the unavailable client, and if it fails, it will automatically switch to the available client.
    • The FallbackChatClient is registered as an IChatCompletionService via the AsChatCompletionService() adapter, which is the chat completion abstraction the Kernel consumes.
  • Building the Kernel:
    • The Kernel is built using the registered services and is configured with a plugin that returns weather information.
  • Invoking the Prompt:
    • The InvokePromptAsync method is called on the kernel with the prompt “Do I need an umbrella?” and the specified execution settings.
    • InvokePromptAsync eventually invokes the registered chat completion service, which uses the fallback mechanism to handle the chat completion.
  • Handling Exceptions:
    • Within the FallbackChatClient, the GetResponseAsync method iterates through the list of chat clients and handles exceptions.
    • If a request fails with one of a configurable set of HTTP status codes (such as 503 Service Unavailable), it falls back to the next client in the list.
    • If all clients fail, an exception is thrown indicating that none of the clients could complete the request.
  • Customization and Streaming:
    • The FallbackChatClient also supports customization of fallback status codes via the FallbackStatusCodes property, as shown in the sketch after this list.
    • Additionally, it provides a method for handling streaming responses, similar to the primary completion method.
  • Decorator Pattern:
    • The FallbackChatClient implements the same interface as the chat clients it wraps, making it a decorator.
    • This design pattern allows it to add additional functionality – in this case, fallback logic – without changing either the caller code or the underlying chat client implementations.
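
To make the customization and decorator points concrete, here is a minimal usage sketch of the FallbackChatClient outside of Semantic Kernel. Only the FallbackChatClient type comes from the code above; the two clients passed in and the chosen status codes are illustrative assumptions.

// Hypothetical usage sketch: the caller only sees IChatClient, while the decorator adds the fallback behavior.
public static async Task<string> AskWithFallbackAsync(IChatClient primaryClient, IChatClient backupClient)
{
    IChatClient client = new FallbackChatClient(new List<IChatClient> { primaryClient, backupClient })
    {
        // Optionally override the default set of status codes that trigger a fallback.
        FallbackStatusCodes = new List<HttpStatusCode>
        {
            HttpStatusCode.TooManyRequests,
            HttpStatusCode.ServiceUnavailable,
            HttpStatusCode.GatewayTimeout
        }
    };

    ChatResponse response = await client.GetResponseAsync(
        new List<ChatMessage> { new(ChatRole.User, "Do I need an umbrella?") });

    return response.Text;
}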

Potential Scenarios

Hybrid model orchestration can be effectively utilized in situations beyond fallback. For instance, when selecting models by token size, the orchestrator compares the prompt’s input token size against each model’s minimum and maximum token capacity to decide which model to call. In data-sensitivity scenarios, it routes sensitive prompts to local models and everything else to cloud models. In either case, it can return the fastest model’s response, the most relevant response, or the first available model’s response. A simple sketch of the token-size case follows.
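
The sketch below picks the smallest registered model whose input limit can hold the prompt. The client list, the token limits, and the rough characters-per-token estimate are illustrative assumptions, not part of the sample above.

// Hypothetical sketch of token-size-based selection. The clients, their input limits,
// and the ~4-characters-per-token estimate are assumptions for illustration only.
internal sealed class TokenSizeClientSelector
{
    private readonly List<(IChatClient Client, int MaxInputTokens)> _clients;

    public TokenSizeClientSelector(List<(IChatClient Client, int MaxInputTokens)> clients)
    {
        this._clients = clients;
    }

    public IChatClient SelectClient(IList<ChatMessage> chatMessages)
    {
        // Very rough token estimate: ~4 characters per token.
        int estimatedTokens = chatMessages.Sum(message => message.Text?.Length ?? 0) / 4;

        // Prefer the smallest model whose input limit can accommodate the prompt.
        foreach (var (client, maxInputTokens) in this._clients.OrderBy(entry => entry.MaxInputTokens))
        {
            if (estimatedTokens <= maxInputTokens)
            {
                return client;
            }
        }

        throw new InvalidOperationException("No registered model can accommodate the prompt size.");
    }
}

A selector like this could sit behind the same IChatClient decorator pattern shown earlier, and a similar one could route by data sensitivity, choosing a local client for sensitive prompts and a cloud client otherwise.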

Conclusion

Hybrid model orchestration is a powerful technique for enhancing the flexibility, integration, and reliability of AI applications. By dynamically selecting the best model based on context, seamlessly integrating with consumer code, and providing a robust fallback mechanism, applications can ensure continuous and efficient operation. The example provided demonstrates how to implement hybrid model orchestration in a C# application, highlighting its practical benefits and ease of use. This approach not only improves the resilience of AI systems but also simplifies the development process, making it an invaluable tool for developers.

