{"id":973,"date":"2025-05-13T21:18:06","date_gmt":"2025-05-13T21:18:06","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/all-things-azure\/?p=973"},"modified":"2025-05-13T22:25:14","modified_gmt":"2025-05-13T22:25:14","slug":"accelerate-ai-applications-with-semantic-caching-on-azure-managed-redis","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/all-things-azure\/accelerate-ai-applications-with-semantic-caching-on-azure-managed-redis\/","title":{"rendered":"Accelerate AI Applications with Semantic Caching on Azure Managed Redis"},"content":{"rendered":"<p>In this post, we look at implementing semantic caching with <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/azure-cache-for-redis\/managed-redis\/managed-redis-overview\">Azure Managed Redis<\/a> (AMR). Azure has been at the forefront of providing caching solutions for over a decade with Azure Cache for Redis. This service gave developers a high-performance, scalable cache that significantly improved the responsiveness of their applications. Azure has now taken another step forward in this area with its latest offering, AMR, which is currently in public preview. AMR runs on the Redis Enterprise stack, which offers significant advantages over the community edition of Redis. AMR provides advanced features such as Active Geo-Replication, Vector Storage &amp; Search, Semantic Caching, Automatic Zone Redundancy, and Entra ID auth support across all its SKUs and tiers.<\/p>\n<h2>Why semantic caching?<\/h2>\n<p>Redis use cases have expanded over the years from traditional data caching, API response caching, and session storage to AI application use cases such as <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/redis\/overview-vector-similarity\">Vector Store, Vector Search, and Semantic cache<\/a>. 
Calling an LLM often adds significant latency (due to generation time) and cost (due to per-token pricing) to an application. Semantic caching addresses both problems by storing each past LLM output together with the text and the vector representation of the query that produced it. A subsequent call to the LLM can then be preceded by a cache lookup that uses the vector representation of the new query; this pattern fits AI application implementations very well.<\/p>\n<p>Common scenarios for semantic caching include faster FAQ retrieval in a chatbot, where a vector search of the user query runs against the cache before any call to the LLM completion API, and storing users' previous interactions and context to provide more relevant, personalized responses much faster than relying on the LLM to generate every response.<\/p>\n<p>Let\u2019s look at the architecture of an AI app implementing the semantic caching pattern:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/redis-semantic-caching-scaled.jpg\"><img decoding=\"async\" class=\"alignnone size-full wp-image-981\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/redis-semantic-caching-scaled.jpg\" alt=\"redis semantic caching image\" width=\"2500\" height=\"1388\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/redis-semantic-caching-scaled.jpg 2500w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/redis-semantic-caching-300x167.jpg 300w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/redis-semantic-caching-1024x568.jpg 1024w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/redis-semantic-caching-768x426.jpg 768w, 
https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/redis-semantic-caching-1536x853.jpg 1536w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/redis-semantic-caching-2048x1137.jpg 2048w\" sizes=\"(max-width: 2500px) 100vw, 2500px\" \/><\/a><\/p>\n<p>As depicted above, the AI app looks up the semantic cache (AMR) before invoking the LLM chat completion API. The cache lookup is not based on text alone: the AI app first gets a vector representation of the query by invoking the LLM embedding API, which enables semantic search over user queries. Besides the AI app itself, semantic caching logic can also be implemented in API Management (APIM) when using the AI Gateway pattern. APIM provides built-in <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/api-management\/azure-openai-semantic-cache-lookup-policy\">semantic caching policies<\/a> that make implementing this use case very straightforward.<\/p>\n<h2>Example: Python App with Semantic Cache<\/h2>\n<p>Let\u2019s now look at a sample Python application that uses the semantic caching pattern, and at the improvement in response times it achieves.<\/p>\n<p>In this section of the app, we import the required libraries and configure API credentials to work with Azure OpenAI and Azure Redis:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-1.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-974\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-1.png\" alt=\"REDIS blog image 1 image\" width=\"936\" height=\"577\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-1.png 936w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-1-300x185.png 300w, 
https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-1-768x473.png 768w\" sizes=\"(max-width: 936px) 100vw, 936px\" \/><\/a><\/p>\n<p>Next, we create a semantic cache index in Azure Redis. Note that the <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/openai\/how-to\/embeddings?tabs=console\">Azure OpenAI Embedding API<\/a> is set as the embedding provider for this semantic cache instance. Another important setting here is the semantic cache distance threshold: the lower the value, the more similar the cached and new queries must be to count as a match in the cache. Valid values for this parameter are between 0 and 1.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-2.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-980\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-2.png\" alt=\"REDIS blog image 2 image\" width=\"936\" height=\"222\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-2.png 936w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-2-300x71.png 300w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-2-768x182.png 768w\" sizes=\"(max-width: 936px) 100vw, 936px\" \/><\/a><\/p>\n<p>The last section of the app gets a prompt from the user and checks the semantic cache for a similar earlier prompt\/query. If there is a match, the response is returned from the cache and the call to the Chat Completion operation in Azure OpenAI is skipped. 
If there is no match in the cache, then the Chat Completion operation is invoked, and the response is stored in the semantic cache index of Azure Redis.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-3.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-979\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-3.png\" alt=\"REDIS blog image 3 image\" width=\"936\" height=\"640\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-3.png 936w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-3-300x205.png 300w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-3-768x525.png 768w\" sizes=\"(max-width: 936px) 100vw, 936px\" \/><\/a><\/p>\n<p>To see this in action, we run the application and pass the following query to the application:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-4.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-978\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-4.png\" alt=\"REDIS blog image 4 image\" width=\"786\" height=\"91\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-4.png 786w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-4-300x35.png 300w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-4-768x89.png 768w\" sizes=\"(max-width: 786px) 100vw, 786px\" \/><\/a><\/p>\n<p>As expected, the cache lookup resulted in a miss and Azure 
OpenAI was called to get the response; the execution took more than 3 seconds to complete:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-5.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-977\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-5.png\" alt=\"REDIS blog image 5 image\" width=\"607\" height=\"181\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-5.png 607w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-5-300x89.png 300w\" sizes=\"(max-width: 607px) 100vw, 607px\" \/><\/a><\/p>\n<p>We now run the application again with a differently worded query:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-6.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-976\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-6.png\" alt=\"REDIS blog image 6 image\" width=\"766\" height=\"97\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-6.png 766w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-6-300x38.png 300w\" sizes=\"(max-width: 766px) 100vw, 766px\" \/><\/a><\/p>\n<p>Although the wording of this query is different, the cache lookup resulted in a hit because the query is semantically similar to the first one. 
The response time was less than 200 milliseconds, more than 10 times faster than calling the LLM.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-7.png\"><img decoding=\"async\" class=\"alignnone wp-image-975 size-full\" src=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-7.png\" alt=\"REDIS blog image 7 image\" width=\"936\" height=\"113\" srcset=\"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-7.png 936w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-7-300x36.png 300w, https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-content\/uploads\/sites\/83\/2025\/05\/REDIS-blog-image-7-768x93.png 768w\" sizes=\"(max-width: 936px) 100vw, 936px\" \/><\/a><\/p>\n<p>As depicted above, using the semantic search capabilities of Azure Managed Redis where applicable can lower latency and reduce costs for your intelligent applications.<\/p>\n<h2>Resources<\/h2>\n<ul>\n<li>You can learn more about the Vector Embeddings and Vector Search capabilities of Azure Redis <a href=\"https:\/\/aka.ms\/RedisVectorSimilarity\">here<\/a>.<\/li>\n<li>To run a sample application that implements semantic caching with Azure Redis, check out the samples <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/redis\/tutorial-semantic-cache\">here<\/a>.<\/li>\n<li>The complete code repo associated with the screenshots in the article can be found <a href=\"https:\/\/github.com\/sdadha\/azureredis-semanticcache\/tree\/main\">here<\/a>.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>In this blog, we will look at implementing the Semantic caching use case using Azure Managed Redis. 
Azure has been at the forefront of providing caching solutions for over a decade with the Azure Cache for Redis enterprise. This service empowered developers with a high-performance, scalable cache that significantly enhanced the responsiveness of their applications. [&hellip;]<\/p>\n","protected":false},"author":172722,"featured_media":981,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[71,74,73,70,72,76,75],"class_list":["post-973","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-azure","tag-azure-managed-redis","tag-azure-openai","tag-azure-redis","tag-redis","tag-semantic-caching","tag-vector-embeddings","tag-vector-search"],"acf":[],"blog_post_summary":"<p>In this blog, we will look at implementing the Semantic caching use case using Azure Managed Redis. Azure has been at the forefront of providing caching solutions for over a decade with the Azure Cache for Redis enterprise. This service empowered developers with a high-performance, scalable cache that significantly enhanced the responsiveness of their applications. 
[&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/posts\/973","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/users\/172722"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/comments?post=973"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/posts\/973\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/media\/981"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/media?parent=973"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/categories?post=973"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/all-things-azure\/wp-json\/wp\/v2\/tags?post=973"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}