November 18th, 2025

How Kraft Heinz achieved 1,400x faster data lineage with Azure DocumentDB’s DiskANN vector search and hybrid search

Azure Cosmos DB Team

This article is authored by Gurvinder Singh, Principal Cloud Engineer at The Kraft Heinz Company.


ICYMI: Hear how Kraft Heinz built an AI-powered data lineage solution leveraging Azure DocumentDB hybrid search and graph querying capabilities – watch the Ignite session here: Move fast, save more with MongoDB-compatible workloads on DocumentDB


The enterprise-scale challenge: Data lineage across 50+ interconnected systems

At Kraft Heinz, we operate one of the world’s largest CPG data ecosystems with more than 200 brands and 50+ interconnected enterprise systems—including SAP, Oracle, Snowflake, and Power BI—connected through 1,000+ data transformation pipelines. Our data footprint stretches from ingredient sourcing to retail analytics. Every decision, whether it’s optimizing a production line or forecasting demand, depends on data moving reliably through dozens of interconnected systems.

With so much data in motion, even small changes can have big effects. A missed pipeline job or schema update can ripple through analytics, dashboards, and reports. That’s where data lineage becomes essential. Data lineage is the GPS for enterprise data, giving teams visibility into how information flows, transforms, and impacts downstream systems.

Our lineage platform applies graph principles, modeling data as nodes and relationships so dependencies can be traced at a glance. This helps data engineers pinpoint issues quickly, assess downstream impact, and make confident changes without slowing innovation. For a company operating at global scale, lineage isn’t optional—it’s foundational to maintaining data quality and trust.

The first iteration: data lineage with a third-party graph database

Our first iteration used a third-party graph database platform to build our data lineage solution. It proved powerful and delivered measurable results. Root cause analysis dropped from hours to minutes, data quality decisions accelerated, and data engineers gained full visibility into data flows.

But there were some limitations that emerged over time:

  1. Cost structure: Significant annual recurring licensing fees, with no consumption-based pricing model available
  2. Performance: Complex lineage queries with multi-level graph traversal were slow or timed out
  3. Operational overhead: A separate database island outside of the Azure ecosystem

In this blog, I’ll share why we chose Azure DocumentDB for data lineage, how we rebuilt the platform on Azure, and where we’re heading next as we bring AI into lineage.

Solving two challenges: reducing costs and operational overhead

We set two clear objectives and key results (OKRs): reduce annual licensing costs and eliminate operational overhead from managing a separate third-party graph database island outside of the Azure ecosystem. Working with our Microsoft account team, we selected Azure DocumentDB—built natively on Azure alongside our data infrastructure.

But the real win? The CosmosAIGraph solution we built, which took our lineage platform to enterprise scale. We’ll cover it in the next section.

Graph reimagined using CosmosAIGraph

Data lineage at enterprise scale requires two fundamentally different capabilities:

  1. Deterministic, authoritative lineage (high precision): The ability to answer relationship questions with absolute certainty
  2. Semantic similarity: The ability to find datasets even when keywords do not match

The CosmosAIGraph solution, as shown in the diagram below, combines both capabilities using a dual-pattern RAG architecture that integrates GraphRAG and OmniRAG. Let’s explore why a dual-pattern RAG architecture is essential and how it works.

The solution has three main elements:

  1. Azure DocumentDB: Persistent store for lineage nodes, edges, high-dimensional vector embeddings, and session data (an illustrative document shape follows this list)
  2. Two microservices, deployed in an AKS cluster:
    • Graph microservice: Constructs an in-memory RDF knowledge graph with full ontology reasoning, using OWL schema definitions
    • Web microservice: Handles user queries and response completion
  3. Azure OpenAI: Responsible for intelligent intent detection in real time
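
To make the data model concrete, here is a minimal sketch of how lineage nodes and edges might be stored in Azure DocumentDB through its MongoDB-compatible API. The collection names, fields, and embedding dimension are illustrative assumptions, not the actual Kraft Heinz schema.

```python
# Illustrative sketch only: hypothetical document shapes for lineage nodes and
# edges in a MongoDB-compatible Azure DocumentDB database; not the actual
# Kraft Heinz schema.
from pymongo import MongoClient

# Replace with your DocumentDB connection string.
client = MongoClient("mongodb://localhost:27017")
db = client["lineage"]

# Placeholder embedding; in practice, the output of an embedding model.
embedding = [0.0] * 1536

# A dataset node: identity, searchable description, and its vector embedding.
db.nodes.insert_one({
    "_id": "snowflake.sales.customer_master",
    "type": "table",
    "system": "Snowflake",
    "description": "Golden record of customer accounts across all brands.",
    "embedding": embedding,
})

# A directed edge: one hop of lineage between two nodes.
db.edges.insert_one({
    "from": "sap.ecc.kna1",
    "to": "snowflake.sales.customer_master",
    "relationship": "feeds",
    "pipeline": "adf_customer_master_daily",
})
```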

[Architecture diagram: the Kraft Heinz data lineage solution on Azure DocumentDB]

The GraphRAG pattern leverages the RDF graph microservice. This microservice loads entity and relationship data from Azure DocumentDB at startup and constructs an in-memory RDF knowledge graph using triples and ontology. Using SPARQL queries, it provides deterministic, multi-hop lineage traversal—answering questions like:

  • “Which downstream reports depend on this table?”
  • “What happens if I change this field in SAP?”

The RDF microservice maintains the complete ontology schema using Web Ontology Language (OWL) and triple relationships in memory, enabling fast logical inference and semantic reasoning.
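
To ground this, the hedged sketch below builds a tiny in-memory RDF graph with rdflib and answers a multi-hop dependency question through a SPARQL property path. The namespace, predicate, and node names are assumptions; the production microservice additionally loads its triples from Azure DocumentDB and layers OWL ontology reasoning on top.

```python
# Minimal sketch of the GraphRAG pattern: load lineage triples into an
# in-memory RDF graph and answer a multi-hop question with SPARQL.
# The namespace, predicate, and node names are illustrative assumptions.
from rdflib import Graph, Namespace

LIN = Namespace("http://example.com/lineage#")
g = Graph()

# In the real service, these triples are loaded from Azure DocumentDB at startup.
edges = [
    ("sap_ecc_kna1", "snowflake_customer_master"),
    ("snowflake_customer_master", "powerbi_churn_dashboard"),
]
for src, dst in edges:
    g.add((LIN[src], LIN.feeds, LIN[dst]))

# "Which downstream reports depend on this table?" The `feeds+` property path
# follows the relationship across any number of hops: deterministic traversal.
query = """
PREFIX lin: <http://example.com/lineage#>
SELECT ?downstream WHERE { lin:sap_ecc_kna1 lin:feeds+ ?downstream . }
"""
for row in g.query(query):
    print(row.downstream)  # prints both downstream nodes, one and two hops away
```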

However, while RDF works well for relationship traversal and deterministic lineage, it’s not enough for enterprise data lineage on its own: it lacks semantic discovery and keyword search at scale. This is where Azure DocumentDB and the OmniRAG pattern come in.

Example: Imagine a table or attribute with a long textual description that defines important business context. If someone searches for that information by keyword or in different terms, deterministic GraphRAG would fail, and a completely different approach is needed.

OmniRAG: Intelligent Orchestration Across Multiple Data Sources

The OmniRAG pattern uses Azure OpenAI and Semantic Kernel to detect the intent of a user query and route it to the most suitable data source. Here are some examples (a routing sketch follows the list):

  • “What downstream reports depend on this table?” → Route to RDF Graph (SPARQL) via GraphRAG
  • “Find datasets similar to customer master” → Route to DocumentDB Vector Search
  • “Find datasets containing ‘PII’” → Route to DocumentDB Full-Text Search
  • “Show me customer data updated recently” → Route to Multiple Sources + Merge
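
The sketch below illustrates the routing step under stated assumptions: the real solution orchestrates through Semantic Kernel, whereas this example calls Azure OpenAI directly, and the deployment name, prompt, and intent labels are all hypothetical.

```python
# Hedged sketch of OmniRAG-style intent routing. The deployment name, prompt,
# and intent labels are illustrative assumptions; the real solution uses
# Semantic Kernel, while this sketch calls Azure OpenAI directly.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

ROUTES = {
    "GRAPH": "RDF graph microservice (SPARQL traversal)",
    "VECTOR": "DocumentDB vector search",
    "FULLTEXT": "DocumentDB full-text search",
    "HYBRID": "Multiple sources, merged",
}

def route(user_query: str) -> str:
    """Ask the model to classify intent, then map it to a data source."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed deployment name
        messages=[
            {"role": "system", "content":
                "Classify the user's data-lineage question as exactly one of: "
                "GRAPH (relationship/dependency traversal), VECTOR (semantic "
                "similarity), FULLTEXT (exact keyword), HYBRID (mixed). "
                "Reply with the label only."},
            {"role": "user", "content": user_query},
        ],
    )
    label = resp.choices[0].message.content.strip().upper()
    return ROUTES.get(label, ROUTES["HYBRID"])  # fall back to merged search

print(route("Which downstream reports depend on this table?"))  # -> GRAPH route
```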

When OmniRAG routes queries to Azure DocumentDB, it leverages both vector and full-text search through Hybrid Search:

  • SSD-backed DiskANN vector search: Finds datasets by meaning, using high-dimensional vector embeddings
  • Full-text search (keyword intelligence): Finds datasets by exact keywords

These two methods produce separate result sets. Azure DocumentDB’s Hybrid Search uses Reciprocal Rank Fusion (RRF) to intelligently merge them, making sure documents that match both semantic AND keyword criteria rank highest.
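
For intuition, here is a minimal re-implementation of that merging step. DocumentDB performs this fusion server-side; the function below simply shows the standard RRF formula, where each document’s score is the sum over result lists of 1/(k + rank), with the conventional constant k = 60 (an assumption here).

```python
# Illustrative re-implementation of Reciprocal Rank Fusion (RRF), the merging
# step Azure DocumentDB's hybrid search performs server-side. k=60 is the
# conventional constant from the original RRF paper; an assumption here.
def rrf_merge(vector_hits: list[str], keyword_hits: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked result lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents appearing high in BOTH lists accumulate the largest scores.
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["customer_master", "customer_accounts", "vendor_master"]
keyword_hits = ["customer_accounts", "customer_master", "pii_audit_log"]
print(rrf_merge(vector_hits, keyword_hits))
```

Because scores accumulate across lists, a document ranked near the top of both the vector and keyword results outranks one that appears in only a single list.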

This dual-pattern RAG architecture—GraphRAG for deterministic accuracy and OmniRAG for smart orchestration—enables enterprise-scale data lineage. By directing each query to its best source, the solution avoids unnecessary, costly knowledge graph traversals for simple lookups and overcomes semantic limits on relationship queries.

Proven impact: Annual licensing fees eliminated and a 1,400x performance gain

Modernizing our data lineage platform delivered measurable gains in cost, speed, and operational efficiency. Moving from a fixed-license, third-party graph database to the consumption-based vCore model of Azure DocumentDB eliminated the annual licensing fees entirely, reducing overhead and freeing engineers from maintaining a separate database island.

Performance gains have been equally striking. Query response time dropped from 18,634 milliseconds to as low as 3 to 13 milliseconds—a 1,400x to 6,200x improvement. Graph traversals that once stalled analytics now execute almost instantly, keeping engineers focused on insight rather than waiting for data to load. The shift from third-party graph database infrastructure to Azure-native, cloud-optimized services redefined what speed and responsiveness look like at enterprise scale.

The new lineage platform has also reshaped how our data engineers work. Managing it through the same Azure pipelines, monitoring, and security controls that govern the rest of our environment has simplified daily operations and reduced the risk of errors. Teams no longer move between systems or manage separate credentials, which means faster responses and fewer points of failure.

 

Gurvinder Singh is Principal Cloud Engineer at Kraft Heinz, with 15+ years empowering product teams and elevating excellence in developer enablement and enterprise transformation on Microsoft Azure. He leads federated product development initiatives and delivers sustainable, secure, and cost-optimized cloud solutions that enhance developer agility. By pioneering reusable architecture templates and automation frameworks, Gurvinder empowers teams to deliver future-ready, cloud-native solutions.

As co-author of three Microsoft Press books published by Pearson, Gurvinder contributes to the broader developer community through thought leadership and technical expertise.

About Azure Cosmos DB

Azure Cosmos DB is a fully managed and serverless NoSQL and vector database for modern app development, including AI applications. With its SLA-backed speed and availability as well as instant dynamic scalability, it is ideal for real-time NoSQL and MongoDB applications that require high performance and distributed computing over massive volumes of NoSQL and vector data.

To stay in the loop on Azure Cosmos DB updates, follow us on X, YouTube, and LinkedIn.

Author

Azure Cosmos DB Team

Azure Cosmos DB is a fully managed NoSQL, relational, and vector database. It offers single-digit millisecond response times, automatic and instant scalability, along with guaranteed speed at any scale. Business continuity is assured with SLA-backed availability and enterprise-grade security.
