Reducing latency by 50% and scaling intelligent CX for SMBs
This article was co-authored by Sergey Galchenko, Chief Technology Officer, IntelePeer, and Subhash Ramamoorthi, Director, IntelePeer AI Hub.
Don’t miss: Discover how IntelePeer enhances agent intelligence and powers its multi-agent applications at Ignite! Join us for “From DEV to PROD: How to build agentic memory with Azure Cosmos DB” on Thursday, Nov. 20, at 11 AM PST.
You don’t need to be an AI expert, software engineer, or data scientist to understand the importance of system reliability and performance in digital customer service platforms. If you’ve ever tried to schedule a dental appointment, contact your bank about your account, or check on the status of your takeout order, then you’ve experienced it firsthand.
While large companies have long used sophisticated AI-powered call center solutions, we work with many small and medium-sized businesses (SMBs) that are now adopting next-generation customer experience (CX) platforms like IntelePeer. Our solutions offer high reliability, conversational AI accuracy, low latency across voice, LLM, and API processing, and—most importantly—scalability to maintain system performance during spikes in customer contact volumes. These surges in demand don’t discriminate by industry. For example:
- Retail: During the holiday season—especially Black Friday and Cyber Monday—retail call centers see call volumes increase by up to 41 percent as customers inquire about order status, returns, and promotions.
- Healthcare: Open enrollment periods for insurance or the start of a new year for dental/medical plans trigger a rush of calls from patients seeking appointments and coverage details.
- Travel: Airlines and travel agencies face spikes during summer vacations, winter holidays, or when flight disruptions create a wave of rebooking inquiries that overwhelm agents.
- Utilities: Severe weather like hurricanes or snowstorms can damage infrastructure, causing a flood of calls from customers seeking outage updates.
- Finance: Tax season and year-end financial reporting lead to increased inquiries about filings, refunds, and account management.
- Technology: Major updates or service outages can result in sudden spikes in support calls as customers seek troubleshooting help.
To meet these challenges, we designed IntelePeer with scalable infrastructure and an AI framework that allows SMBs to deliver quality call center services even during high-traffic surges. Elasticity and scalability are crucial for maintaining resilience, cost-effectiveness, and CX performance in dynamic environments. They’re also part of the reason why we migrated our AI framework to Azure and adopted Azure Cosmos DB.
Moving the AI framework to Azure to minimize latency
When it comes to conversational AI, managing latency is essential for ensuring natural-sounding and efficient interactions for customers. The IntelePeer platform handles this complex task by continuously monitoring and adjusting for multiple factors that affect latency: inference speed for the large language model (LLM), time to first byte for the text-to-speech engine, processing delay for the APIs, network latency, and others.
We deployed our first-generation AI framework in IntelePeer data centers. But after observing telemetry data and analyzing latency factors in the value delivery chain, we decided to implement our next-generation agentic AI framework in Azure to minimize network-induced latency. We were already heavy users of Azure OpenAI in Foundry Models, and this move brought our AI framework closer to the LLMs. At the same time, we profiled database performance and chose Azure Cosmos DB as our persistence and data layer.
Since migrating to Azure Cosmos DB, we’ve seen at least a 50 percent decrease in network and data access-related latency—going from roughly 35 milliseconds (ms) to 15 ms for each transaction round trip. We’ve also started using the small language model (SLM) Phi-4 for some use cases as we consolidated all our database workloads (configuration persistence, session storage/agent short-term memory, and vector search) to Azure Cosmos DB.
With Azure Cosmos DB, it’s night and day compared to self-managed clusters
Before moving to Azure, our platform ran on self-managed MongoDB clusters hosted in our data centers. While this setup worked initially, we encountered scaling issues as we grew. We had to size our clusters for seasonal peak performance, even though 90 percent of the time we didn’t need that capacity. That meant wasted resources and higher costs.
Even with overprovisioning, we had physical limitations on the amount of I/O we could give to database clusters. Because datasets with vastly different usage patterns shared the same physical infrastructure and the same I/O pools, when one workload started pulling harder, it could affect the performance of other applications.
We don’t have that problem in the Azure Cosmos DB environment because of its auto-scaling and logical isolation capabilities. It gives us fine-grained performance controls, so we can configure the I/O profile for each container. That way we can decide which long-running queries are fine as they are and which need to complete within 15 ms because of upstream SLAs. That separation is night and day compared to self-managed clusters.
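To make that concrete, here is a simplified sketch of per-container autoscale provisioning using the Azure Cosmos DB Python SDK for the NoSQL API. The account, database, container names, and throughput ceilings below are placeholders for illustration, not our production configuration.

```python
from azure.cosmos import CosmosClient, PartitionKey, ThroughputProperties

# Placeholder endpoint and key; substitute your own account credentials.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.create_database_if_not_exists("cx_platform")

# Latency-critical session store: autoscale with a higher RU/s ceiling.
db.create_container_if_not_exists(
    id="session_memory",
    partition_key=PartitionKey(path="/sessionId"),
    offer_throughput=ThroughputProperties(auto_scale_max_throughput=10000),
)

# Analytics container that serves long-running queries: isolated, lower ceiling.
db.create_container_if_not_exists(
    id="analytics_events",
    partition_key=PartitionKey(path="/tenantId"),
    offer_throughput=ThroughputProperties(auto_scale_max_throughput=4000),
)
```

Because each container carries its own throughput setting, a heavy analytics query can no longer starve the latency-sensitive session store.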
Azure Cosmos DB’s MongoDB compatibility was another major advantage. It allowed us to migrate our existing codebase with minimal changes, while its built-in change data capture (CDC) capability made it easy to stream data into our analytics pipeline. We also plan to use native replication to maintain a live copy of our production dataset outside of Azure for disaster recovery purposes.
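Because the API is MongoDB-compatible, the change feed can be consumed through standard change streams and forwarded into downstream pipelines. The sketch below is illustrative only: the connection string, database, collection, and analytics sink are placeholders, and depending on the account type the change stream may need an explicit aggregation pipeline.

```python
from pymongo import MongoClient

def publish_to_analytics(document: dict) -> None:
    """Placeholder for the real analytics-pipeline producer."""
    print("forwarding to analytics:", document.get("_id"))

# Placeholder connection string for an Azure Cosmos DB for MongoDB account.
client = MongoClient("<cosmos-mongodb-connection-string>")
sessions = client["cx_platform"]["session_memory"]

# Change data capture: every insert/update surfaces here as it happens
# and is forwarded to the analytics pipeline.
with sessions.watch(full_document="updateLookup") as stream:
    for change in stream:
        if change.get("fullDocument") is not None:
            publish_to_analytics(change["fullDocument"])
```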
For our agentic AI framework, we use Azure Cosmos DB for both short-term and long-term memory. A very useful feature in the short-term memory subsystem has been time to live (TTL) indexing. We can now set automatic expiration policies at the container or item level, letting Azure Cosmos DB automatically purge old data without impacting performance. This has been crucial for managing log data, session information, and temporary caches that previously cluttered our databases and degraded performance over time. Additionally, we use CDC to integrate a data processing pipeline that ingests short-term session data and propagates it into the long-term memory subsystem. The long-term memory layer persists agent configuration states, user personalization parameters, organizational metrics, and other contextual data, enabling continuous improvement of AI agent behavior and enhancing overall system performance.
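As an illustration of the TTL pattern (names and values below are placeholders, shown with the NoSQL API Python SDK), a short-term memory container can carry a 24-hour default with per-item overrides:

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.create_database_if_not_exists("cx_platform")

# Container-level default: session documents expire after 24 hours.
container = db.create_container_if_not_exists(
    id="short_term_memory",
    partition_key=PartitionKey(path="/sessionId"),
    default_ttl=86400,  # seconds
)

# Item-level override: a temporary cache entry that expires after 5 minutes.
container.upsert_item({
    "id": "cache-greeting-prompt",
    "sessionId": "session-123",
    "payload": {"greeting": "Thanks for calling, how can I help?"},
    "ttl": 300,  # overrides the container default for this item only
})
```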
Detecting voicemail boxes at 97% accuracy with Phi-4
This architecture has served us well across the board, but for voicemail recognition specifically, we’ve made huge gains using the SLM Phi-4. In outbound campaigns—whether it’s appointment reminders, payment collections, or follow-up notifications—we need to know right away whether our platform is connecting to a live person or a voicemail. Our old detection system relied on hard-coded rules and call control signals. Accuracy hovered under 40 percent, which is no better than flipping a coin.
When we decided to use AI for voicemail detection, we ran the first 15 seconds of conversations through an LLM and accuracy immediately jumped to 90 percent. That was a big improvement, but it came at a cost. Running every call through an LLM was too expensive for a simple repeatable task.
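Conceptually, that first pass was just a classification prompt over the opening transcript. Here is a trimmed-down sketch; the deployment name, API version, and prompt are illustrative rather than our production setup.

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<resource>.openai.azure.com",
    api_key="<key>",
    api_version="2024-06-01",
)

def classify_call_opening(transcript: str) -> str:
    """Label the first ~15 seconds of a call as 'voicemail' or 'live_person'."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative deployment name
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You label call openings. Reply with exactly one word: "
                        "'voicemail' or 'live_person'."},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content.strip().lower()
```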
Then we turned to our agentic long-term memory subsystem for the training data. Using Azure Cosmos DB, it was easy to identify the conversations that successfully detected voicemails. We used that data as a self-labeled training dataset for an SLM and fine-tuned smaller models with a LoRA approach. After experimenting with various open-weight SLMs, we ultimately chose Phi-4. Fine-tuned in Azure OpenAI in Foundry Models, Phi-4 delivered 97 percent accuracy. That’s better than a human agent performing detection live, and it’s dramatically cheaper than pushing everything through a larger but more generic model.
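The self-labeling step is essentially a query plus an export. Roughly, with hypothetical collection and field names, building a chat-format fine-tuning file from confirmed outcomes looks like this:

```python
import json
from pymongo import MongoClient

client = MongoClient("<cosmos-mongodb-connection-string>")
long_term_memory = client["cx_platform"]["long_term_memory"]

# Conversations whose voicemail detection was later confirmed become
# self-labeled training examples in the chat fine-tuning format.
with open("voicemail_finetune.jsonl", "w") as out:
    query = {"detection_outcome": {"$in": ["voicemail", "live_person"]}}
    for doc in long_term_memory.find(query):
        example = {
            "messages": [
                {"role": "system",
                 "content": "Reply with exactly one word: 'voicemail' or 'live_person'."},
                {"role": "user", "content": doc["opening_transcript"]},
                {"role": "assistant", "content": doc["detection_outcome"]},
            ]
        }
        out.write(json.dumps(example) + "\n")
```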
Converting FAQ docs into embeddings for smarter AI agents
With our agentic AI workflows up and running on Azure, we wanted to make sure our conversational AI was providing the best possible answers to customers. Many of the questions our AI agents handle are about store hours, policies, or post-appointment instructions. That information exists in PDFs, websites, and knowledge bases, but the trick is making it accessible in real time during a voice conversation. To that end, we implemented a Retrieval-Augmented Generation (RAG) pattern with vector search as the data store.
Our vector search uses text-embedding-ada-002 with 1,536 dimensions for generating high-quality embeddings that capture semantic meaning and nuance across our knowledge base and conversation data while using manageable amounts of processing power.
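Generating those embeddings is a single call per document chunk. A minimal sketch, with placeholder resource and deployment names:

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<resource>.openai.azure.com",
    api_key="<key>",
    api_version="2024-06-01",
)

def embed(text: str) -> list[float]:
    """Return a 1,536-dimension embedding for a knowledge-base chunk."""
    result = client.embeddings.create(
        model="text-embedding-ada-002",  # deployment name assumed to match the model
        input=text,
    )
    return result.data[0].embedding
```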
When evaluating Azure Cosmos DB’s flat, quantized flat, and DiskANN vector indexing options, we settled on DiskANN. It provides the low-latency performance critical for real-time applications while maintaining the accuracy needed for production workloads. During live user sessions, agents execute vector-based queries against the knowledge base. For complex user inputs, the agent computes an embedding representation of the query and performs a similarity search over pre-indexed knowledge documents to retrieve the most relevant context.
This happens in real time during voice calls, where every millisecond counts. The 15 ms vector query latency we achieve with DiskANN ensures natural conversation flow without awkward pauses. The retrieved data is combined with LLM reasoning capabilities, delivering accurate answers with extremely low latency.
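For illustration, here is roughly what a DiskANN-indexed knowledge container and a similarity query look like with the NoSQL API Python SDK; the container, field names, and query shape are placeholders rather than our production schema.

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.create_database_if_not_exists("cx_platform")

# Knowledge-base container with a 1,536-dimension vector field and a DiskANN index.
kb = db.create_container_if_not_exists(
    id="knowledge_base",
    partition_key=PartitionKey(path="/tenantId"),
    vector_embedding_policy={
        "vectorEmbeddings": [{
            "path": "/embedding",
            "dataType": "float32",
            "distanceFunction": "cosine",
            "dimensions": 1536,
        }]
    },
    indexing_policy={
        "includedPaths": [{"path": "/*"}],
        "excludedPaths": [{"path": "/embedding/*"}],
        "vectorIndexes": [{"path": "/embedding", "type": "diskANN"}],
    },
)

def top_k_context(query_embedding: list[float]) -> list[dict]:
    """Retrieve the most similar knowledge chunks for a user query."""
    return list(kb.query_items(
        query=(
            "SELECT TOP 5 c.content, VectorDistance(c.embedding, @q) AS score "
            "FROM c ORDER BY VectorDistance(c.embedding, @q)"
        ),
        parameters=[{"name": "@q", "value": query_embedding}],
        enable_cross_partition_query=True,
    ))
```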
Investing in Azure to deliver agentic AI for SMBs
Our investment in Azure has more than paid off. We cut latency in half, simplified our infrastructure, and focused our resources on innovating new capabilities rather than managing hardware.
At IntelePeer, we’re focused on equipping SMBs with the same level of intelligent CX performance that was historically reserved for enterprise-scale organizations. By using Azure services like Azure Cosmos DB and Azure OpenAI in Foundry Models, we’ve developed scalable, AI-driven systems that enable SMBs to measurably improve efficiency, costs, and customer and patient satisfaction.
More importantly, we’re demonstrating that world-class CX solutions aren’t just practical and affordable for SMBs—they’re vital for solving the pressing operational and financial challenges facing the business, healthcare, and service organizations we all rely on every day.
About Azure Cosmos DB
Azure Cosmos DB is a fully managed and serverless NoSQL and vector database for modern app development, including AI applications. With its SLA-backed speed and availability as well as instant dynamic scalability, it is ideal for real-time NoSQL and MongoDB applications that require high performance and distributed computing over massive volumes of NoSQL and vector data.
To stay in the loop on Azure Cosmos DB updates, follow us on X, YouTube, and LinkedIn.

