This article is authored by Gurvinder Singh, Principal Cloud Engineer at The Kraft Heinz Company.
ICYMI: Hear how Kraft Heinz built an AI-powered data lineage solution leveraging Azure DocumentDB hybrid search and graph querying capabilities – watch the Ignite session here: Move fast, save more with MongoDB-compatible workloads on DocumentDB
The enterprise-scale challenge: Data lineage across 50+ interconnected systems
At Kraft Heinz, we operate one of the world’s largest CPG data ecosystems with more than 200 brands and 50+ interconnected enterprise systems—including SAP, Oracle, Snowflake, and Power BI—connected through 1,000+ data transformation pipelines. Our data footprint stretches from ingredient sourcing to retail analytics. Every decision, whether it’s optimizing a production line or forecasting demand, depends on data moving reliably through dozens of interconnected systems.
With so much data in motion, even small changes can have big effects. A missed pipeline job or schema update can ripple through analytics, dashboards, and reports. That’s where data lineage becomes essential. Data lineage is the GPS for enterprise data, giving teams visibility into how information flows, transforms, and impacts downstream systems.
Our lineage platform applies graph principles, modeling data as nodes and relationships so dependencies can be traced at a glance. This helps data engineers pinpoint issues quickly, assess downstream impact, and make confident changes without slowing innovation. For a company operating at global scale, lineage isn’t optional—it’s foundational to maintaining data quality and trust.
The first iteration: data lineage with a third-party graph database
Our first iteration used a third-party graph database platform to build our data lineage solution. It proved powerful and delivered measurable results. Root cause analysis dropped from hours to minutes, data quality decisions accelerated, and data engineers gained full visibility into data flows.
But some limitations emerged over time:
- Cost structure: Significant annual recurring licensing fees, with no consumption-based pricing model available
- Performance: Complex lineage graph queries with multi-level graph traversal were slow or timing out
- Operational overhead: Separate database island outside of Azure ecosystem
In this blog, I’ll share why we chose Azure DocumentDB for data lineage, how we rebuilt the platform on Azure, and where we’re heading next as we bring AI into lineage.
Solving two challenges: reducing costs and operational overhead
We set two clear objectives and key results (OKRs): reduce annual licensing costs and eliminate operational overhead from managing a separate third-party graph database island outside of the Azure ecosystem. Working with our Microsoft account team, we selected Azure DocumentDB—built natively on Azure alongside our data infrastructure.
But the real win? The CosmosAIGraph solution we built, which took our lineage platform to enterprise scale. We’ll cover it in the next section.
Graph reimagined using CosmosAIGraph
Data lineage at enterprise scale requires two fundamentally different capabilities:
- Deterministic, authoritative lineage (high precision): The ability to answer relationship questions with absolute certainty
- Semantic similarity: The ability to find datasets even when keywords don’t match
The CosmosAIGraph solution, as shown in the diagram below, combines both capabilities using a dual-pattern RAG architecture that integrates GraphRAG and OmniRAG. Let’s explore why dual-pattern RAG architecture is essential and how this works.
The solution has three main elements:
- Azure DocumentDB: Persistent store for lineage nodes, edges, high-dimensional vector embeddings, and session data
- Two microservices: Deployed in an AKS cluster
  - Graph microservice: Constructs an in-memory RDF knowledge graph with full ontology reasoning using OWL schema definitions
  - Web microservice: Handles user queries and response completion
- Azure OpenAI: Responsible for intelligent intent detection in real time
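To make the storage layer concrete, here is a hypothetical shape for a single lineage-node document in Azure DocumentDB. The field names and values are illustrative only, not the actual Kraft Heinz schema:

```python
# A hypothetical lineage-node document as it might be stored in Azure DocumentDB.
# All field names and values are illustrative, not the production schema.
lineage_node = {
    "_id": "table::orders",
    "type": "table",
    "system": "SAP",
    "name": "OrdersTable",
    "description": "Customer order line items, refreshed hourly.",
    # edges let the graph microservice rebuild the in-memory RDF graph at startup
    "edges": [{"predicate": "dependsOn", "target": "extract::sap_orders"}],
    # truncated stand-in for a high-dimensional embedding used by vector search
    "embedding": [0.012, -0.094, 0.188],
}
```

Storing edges and embeddings on the same document lets one store serve both the graph microservice (which loads edges at startup) and hybrid search (which queries embeddings and text).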
The GraphRAG pattern leverages the RDF graph microservice. This microservice loads entity and relationship data from Azure DocumentDB at startup and constructs an in-memory RDF knowledge graph using triples and ontology. Using SPARQL queries, it provides deterministic, multi-hop lineage traversal—answering questions like:
- “Which downstream reports depend on this table?”
- “What happens if I change this field in SAP?”
The RDF microservice maintains the complete ontology schema using Web Ontology Language (OWL) and triple relationships in memory, enabling fast logical inference and semantic reasoning.
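The multi-hop traversal idea can be sketched in pure Python. The production microservice uses RDF triples queried with SPARQL (a property path such as `?asset ex:dependsOn+ ex:source` expresses the same transitive question); the triples and names below are illustrative only:

```python
from collections import deque

# Illustrative lineage triples (subject, predicate, object), mirroring the
# RDF graph the microservice builds in memory. Names are hypothetical.
TRIPLES = [
    ("SalesDashboard", "dependsOn", "OrdersTable"),
    ("OrdersTable",    "dependsOn", "SapOrderExtract"),
    ("ChurnReport",    "dependsOn", "CustomerTable"),
]

def downstream_of(source: str) -> set[str]:
    """Deterministic 'what depends on this?' traversal: the pure-Python
    equivalent of a SPARQL property-path query over dependsOn edges."""
    # Reverse index: object -> subjects that depend on it
    rdeps: dict[str, list[str]] = {}
    for s, p, o in TRIPLES:
        if p == "dependsOn":
            rdeps.setdefault(o, []).append(s)
    # Breadth-first walk collects every transitive dependent exactly once
    seen: set[str] = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for dep in rdeps.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen
```

For example, changing `SapOrderExtract` is flagged as impacting both `OrdersTable` and, transitively, `SalesDashboard`, which is exactly the kind of answer that must be exact rather than approximate.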
However, while RDF works well for relationship traversal and deterministic lineage, it isn’t enough on its own for enterprise data lineage: it lacks semantic discovery and keyword search at scale. This is where Azure DocumentDB and the OmniRAG pattern come in.
Example: Imagine a table or attribute with a long textual description that defines important business context. If someone searches for that information by keyword or in different terms, deterministic GraphRAG would fail to find it, and a completely different approach is needed.
OmniRAG: Intelligent Orchestration Across Multiple Data Sources
The OmniRAG pattern utilizes Azure OpenAI and Semantic Kernel to detect the intent of user queries and route them to the most suitable data source. Here are some examples:
- “What downstream reports depend on this table?” → Route to RDF Graph via GraphRAG (SPARQL)
- “Find datasets similar to customer master” → Route to DocumentDB Vector Search
- “Find datasets containing ‘PII'” → Route to DocumentDB Full-Text Search
- “Show me customer data updated recently” → Route to Multiple Sources + Merge
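A toy version of this routing step might look like the following. The real system uses Azure OpenAI with Semantic Kernel for intent classification, so the keyword rules and route names here are purely illustrative:

```python
def route_query(query: str) -> str:
    """Toy intent router. Production OmniRAG classifies intent with an LLM;
    these substring rules only approximate that behavior for illustration."""
    q = query.lower()
    # Relationship/impact questions need deterministic graph traversal
    if any(w in q for w in ("depend", "downstream", "impact", "lineage")):
        return "rdf_graph_sparql"
    # "Similar to ..." questions need semantic vector search
    if "similar" in q or "like" in q:
        return "documentdb_vector_search"
    # Quoted or exact-term searches map to full-text search
    if "'" in query or '"' in query or "containing" in q:
        return "documentdb_fulltext_search"
    # Everything else fans out to multiple sources and merges results
    return "hybrid_multi_source"
```

The value of the pattern is less in the classification mechanics than in the routing contract: every query lands on the cheapest source that can answer it correctly.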
When OmniRAG routes queries to Azure DocumentDB, it leverages both vector and full-text search through Hybrid Search:
- SSD-backed DiskANN vector search: Finds datasets by meaning using high-dimensional vector embeddings
- Full-text search (keyword intelligence): Finds datasets by exact keywords
These two methods produce separate result sets. Azure DocumentDB’s Hybrid Search uses Reciprocal Rank Fusion (RRF) to intelligently merge them, making sure documents that match both semantic AND keyword criteria rank highest.
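RRF itself is simple to sketch. The standard formulation scores each document as the sum of 1/(k + rank) over the result lists it appears in; k = 60 is a commonly used constant, and DocumentDB’s internal implementation may differ in its details:

```python
def rrf_merge(vector_hits: list[str], keyword_hits: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank).
    Documents appearing in both lists accumulate score from each, so they rank highest."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc in enumerate(hits, start=1):  # ranks are 1-based
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For instance, merging `["a", "b", "c"]` (vector) with `["b", "d"]` (keyword) ranks `b` first, because it is the only document matched by both methods.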
This dual-pattern RAG architecture—GraphRAG for deterministic accuracy and OmniRAG for smart orchestration—enables enterprise-scale data lineage. By directing each query to its best source, the solution avoids unnecessary, costly knowledge graph traversals for simple lookups and overcomes semantic limits on relationship queries.
Proven impact: Annual licensing fees eliminated and a 1,400x performance gain
Modernizing our data lineage platform delivered measurable gains in cost, speed, and operational efficiency. Moving from a fixed-license, third-party graph database to the consumption-based vCore model of Azure DocumentDB eliminated annual licensing fees, reduced overhead, and freed engineers from maintaining a separate database island.
Performance gains have been equally striking. Query response time dropped from 18,634 milliseconds to as low as 3 to 13 milliseconds—a 1,400x to 6,200x improvement. Graph traversals that once stalled analytics now execute almost instantly, keeping engineers focused on insight rather than waiting for data to load. The shift from third-party graph database infrastructure to Azure-native, cloud-optimized services redefined what speed and responsiveness look like at enterprise scale.
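The speedup range follows directly from the measured latencies quoted above:

```python
# Latencies quoted in the post; the speedup range is simple division.
baseline_ms = 18_634          # typical third-party graph DB query time
best_ms, worst_ms = 3, 13     # observed range on Azure DocumentDB

speedup_low = baseline_ms / worst_ms    # ~1,433x at the slow end
speedup_high = baseline_ms / best_ms    # ~6,211x at the fast end
```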
The new lineage platform has also reshaped how our data engineers work. Managing it through the same Azure pipelines, monitoring, and security controls that govern the rest of our environment has simplified daily operations and reduced the risk of errors. Teams no longer move between systems or manage separate credentials, which means faster responses and fewer points of failure.
About Azure Cosmos DB
Azure Cosmos DB is a fully managed and serverless NoSQL and vector database for modern app development, including AI applications. With its SLA-backed speed and availability as well as instant dynamic scalability, it is ideal for real-time NoSQL and MongoDB applications that require high performance and distributed computing over massive volumes of NoSQL and vector data.
To stay in the loop on Azure Cosmos DB updates, follow us on X, YouTube, and LinkedIn.

