Best Practices for Re-modeling a Graph Database to Azure Cosmos DB for NoSQL

Subhasish Ghosh

Faiz Chachiya

Sandeep S. Nair

Introduction

Azure Cosmos DB is Microsoft’s premier fully managed NoSQL database for modern app development. It is ideal for solutions including artificial intelligence, digital commerce, Internet of Things, booking management and other types of use cases. It offers single-digit millisecond response times, automatic and instant scalability along with guaranteed speed at any scale. Azure Cosmos DB’s NoSQL offering stands out with its exceptional performance and robust feature set that can be an alternative to using graph databases.

This blog post outlines the approach and best practices that we, the authors, learned while working with a customer who wanted to transition a graph workload to Azure Cosmos DB for NoSQL. Our insights are based on our collaboration with various teams at Walmart, including software engineering and Line-of-Business (LOB) application teams.

When should I do this?

Azure Cosmos DB for NoSQL is recommended for building green-field applications. However, if users are committed to refactoring their code, it can also be used where they want to squeeze the most performance out of existing applications. The scenario highlighted here is an application that currently uses a graph database as its back end for storing, querying and analyzing data relationships. In this blog post we use a Gremlin graph database as the example and provide guidance on re-modeling it for Azure Cosmos DB for NoSQL.

The three important qualifying factors wherein this approach is valid include:

  1. Your use-case leverages point-reads most of the time (> 80%) for querying data.
  2. Your use-case leverages 1-x traversals most of the time (> 80%) for querying data.
  3. Your use-case leverages a combination of point-reads and 1-x traversals most of the time (>80%) for querying data.

This approach is not recommended if most of your queries perform traversals two, three or more levels deep over complex edge relationships, and/or if the database is larger than 5 TB.
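
The qualifying factors above can be turned into a quick check against a summary of your query log. The sketch below is only illustrative: the `WorkloadProfile` class and its field names are hypothetical, and it reads "1-x traversals" as single-level traversals.

```python
from dataclasses import dataclass


@dataclass
class WorkloadProfile:
    """Hypothetical summary of a query log; all field names are illustrative."""
    point_reads: int          # lookups by id and partition key
    one_hop_traversals: int   # single-level traversals (assumed meaning of "1-x")
    deeper_traversals: int    # traversals two or more levels deep
    data_size_tb: float       # total database size in TB


def is_candidate(profile: WorkloadProfile) -> bool:
    """Apply the thresholds from the post: > 80% simple queries, at most 5 TB."""
    total = (profile.point_reads + profile.one_hop_traversals
             + profile.deeper_traversals)
    if total == 0:
        return False  # no query data, so nothing can be concluded
    simple_share = (profile.point_reads + profile.one_hop_traversals) / total
    return simple_share > 0.80 and profile.data_size_tb <= 5
```

A workload with 700 point-reads, 200 single-level traversals and 100 deeper traversals on a 1 TB database would qualify under this reading; the same mix on a 6 TB database would not.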

NOTE: This post is not a recommendation for all customers using Gremlin API or any graph database, to migrate to Azure Cosmos DB for NoSQL. This blog post is intended to highlight how one customer approached this given their needs and the specific factors mentioned above.

Approach

Ideally, the recommended approach is to simulate the Gremlin API data model, and vertex (aka node) to vertex traversals via edges, using Azure Cosmos DB for NoSQL. The table below documents the source and destination:

Currently in Azure Cosmos DB for Gremlin | Proposed in Azure Cosmos DB for NoSQL
Vertex (Node) | Item
Edge (Relationship) | Item
A forward edge (i.e., a forward traversal). E.g., A –> B | Item
A reverse edge (i.e., a backward traversal). E.g., B –> A | Item

In essence the approach is to create 4 items. 

Re-modeling – an example

A graph is just a collection of vertices and edges – or, in other words, a set of nodes and the relationships that connect them. Graphs represent entities as nodes and the ways in which those entities relate to the world as relationships. This general-purpose, expressive structure allows us to model all kinds of scenarios. In the example below, for a retail online sales scenario, we are building a graphical representation of a customer data point. Our vertex has id: Cust001, label: customer and properties including email address, address, first seen, last seen, device1, device2 and so on. Device identified by id Device_1 is in turn shared by customer identified by id: Cust006, and device identified by id Device_2 is shared by two other customers, identified by id: Cust81 and Cust431.

Graph model representation of an e-commerce data model.

This Gremlin API data model vertex could be represented in an Azure Cosmos DB for NoSQL data model in the following JSON representation: 

{
     "doc_type": "node",
     "id": "<<generate guid>>",                 // generate a GUID
     "type": "[Type]",
     "key": "[Key]",
     "pk": "[Type + Key]",                      // generate a partition key
     "first_seen": "[first_seen]",
     "last_seen": "[last_seen]",
     "details": {
          "a1": "v1",
          "a2": "v2",
          ".com": {
               "x": {
                    "p": "abc",
                    "g": "def"
               },
               "com_fs": "",
               "com_ls": "",
               "mp_fs": "",
               "mp_ls": "",
               "last_"
          }
     }
}

The edge (forward direction) traversal item could be represented as: 

{
     "doc_type": "edge",
     "id": "<<generate guid>>",                 // [Edge.Node1.key].[Edge.Node2.key].[Edge.Type]
     "type": "[Type]",                          // alternatively, "[Edge.Node1.key]"
     "key": "[Key]",
     "pk": "[Type + Key]",
     "first_seen": "[first_seen]",
     "last_seen": "[last_seen]",
     "details": {
          "a1": "v1",
          "a2": "v2",
          ".com": {
               "x": {
                    "p": "abc",
                    "g": "def"
               },
               "com_fs": "",
               "com_ls": "",
               "mp_fs": "",
               "mp_ls": "",
               "last_"
          }
     }
}
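
As a rough illustration, the two templates above could be filled in by small helper functions like the ones below. This is a sketch under stated assumptions, not the code used in the engagement: the function names, the "_" separator in `pk`, the choice of the source vertex key as the edge `key`, and the `uses_device` edge type are all hypothetical.

```python
import uuid


def make_node(item_type, key, first_seen, last_seen, details=None):
    """Build a vertex item following the node template."""
    return {
        "doc_type": "node",
        "id": str(uuid.uuid4()),            # generate a GUID
        "type": item_type,
        "key": key,
        "pk": f"{item_type}_{key}",         # Type + Key; "_" separator is assumed
        "first_seen": first_seen,
        "last_seen": last_seen,
        "details": details or {},
    }


def make_forward_edge(source_key, target_key, edge_type,
                      first_seen, last_seen, details=None):
    """Build a forward edge item following the edge template."""
    return {
        "doc_type": "edge",
        # Deterministic id, as in the template comment:
        # [Edge.Node1.key].[Edge.Node2.key].[Edge.Type]
        "id": f"{source_key}.{target_key}.{edge_type}",
        "type": edge_type,
        "key": source_key,                  # assumption: keyed by the source vertex
        "pk": f"{edge_type}_{source_key}",  # Type + Key
        "first_seen": first_seen,
        "last_seen": last_seen,
        "details": details or {},
    }
```

For the retail example, `make_node("customer", "Cust001", ...)` produces an item with `pk` set to `customer_Cust001`, and `make_forward_edge("Cust001", "Device_1", "uses_device", ...)` produces an edge item whose `id` is `Cust001.Device_1.uses_device`.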

A practical example

Gremlin API data model and queries

We use the air-routes graph model, which lets you plan air travel by finding the shortest path, or multiple routes, from a source to a destination. We have also migrated the air-routes data to Azure Cosmos DB for Gremlin, which is available here for you to experiment with.
Image air routes

Click here for Image Reference

The GitHub sample contains 4 vertex types. The vertices airport, country, and continent are important. The version vertex exists for the purpose of tracking the version of the graph. The following shows the number of vertices for each type.

Image air routes1

Airport vertex has a few properties associated with it which provide airport related information like number of runways, elevation, geographical location, etc. 

Image air routes2

Routes are edges that connect the countries and the airports. They have a distance property to calculate the distance between the two connected airport vertices. The following shows the number of edges for each type.

Image air routes3

The following shows a sample edge connection between two airports with properties label and distance between the airports.

Image air routes4

The image below represents the sample model with all the vertices and edges.

Image air routes5

Let us execute a few queries on the model and then simulate the same in NoSQL.

Single Traversal

// List all the airports for the country India 
g.V().has("country","desc","India").out("contains").values("city")

The output is as exhibited below:

Image air routes6

Approximate Request Unit (RU) consumption = 25 RUs

Double Traversal  

// Mumbai to San Francisco with 1 hop/layover
g.V().has("code","BOM").out().out().has("code","SFO").path().by("code") 

The output is as exhibited below:

Image air routes7

Approximate RU consumption = 770 RUs

// Find the top ten overall in terms of incoming routes
g.V().hasLabel('airport').order().by(__.in('route').count(),decr).limit(10).project('ap','routes').by('code').by(__.in('route').count()) 

Image air routes8

Approximate RU consumption = 11,900 RUs

Azure Cosmos DB for NoSQL model and queries 

Understanding query access patterns is critical before designing a data model for your use-case in Azure Cosmos DB for NoSQL. Based on your application specific query patterns, it is advisable to create a schema and choose the partition key for your container. This results in a highly optimized and cost-efficient data model in Azure Cosmos DB for NoSQL. 

In this case, the schema of the model will include forward and reverse traversal for the vertices and edges. Additionally, the query access pattern will help us in defining the partition key and additional attributes.  

A basic Gremlin API data model when translated into a NoSQL data model looks similar to:

Image air routes9

How do we map vertex and edge to NoSQL documents effectively?  

  • The partition key is determined based on the query to be performed and different vertices and edge combinations.  
  • For an edge, a generic combination such as “vertex-type”+“vertex-key”+“edge-type” helps you traverse from one vertex to another using the edge relationship. You can traverse to any depth with this combination.
  • For a vertex, a generic combination such as “vertex-type”+“vertex-key” is good enough to grab the relevant information.

Note: The partition key combination will vary depending on your use case.
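
A minimal sketch of how such partition key values could be composed is shown below. The function names are hypothetical; the "_" separator follows values such as 'country_India_contains' used in the queries in this post.

```python
def vertex_partition_key(vertex_type: str, vertex_key: str) -> str:
    """Compose "vertex-type" + "vertex-key" for a vertex item."""
    return f"{vertex_type}_{vertex_key}"


def edge_partition_key(vertex_type: str, vertex_key: str, edge_type: str) -> str:
    """Compose "vertex-type" + "vertex-key" + "edge-type" for an edge item."""
    return f"{vertex_type}_{vertex_key}_{edge_type}"
```

With this scheme, all edges of one type leaving one vertex share a logical partition, so a single-level traversal becomes a single-partition query.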

Let us take the air-routes graph data and model it into the Azure Cosmos DB for NoSQL model that would help us execute similar queries.

Image Gremlin to NoSQL QueryPatterns

Single Traversal 

// List all the airports for the country India 
select c.destination, c.destinationType from c where c.partitionKey='country_India_contains' and c.doc_type='edge'

Approximate RU: 3-5 RUs

This query returns the forward linkages from the country to its airports, i.e., the destination vertices. Once these are available, the node-related details can easily be fetched using point-read operations.
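
From application code, the single traversal might look like the sketch below. It assumes `container` is a container client from the azure-cosmos Python SDK (or any object exposing the same `query_items` method), that the container is partitioned on `/partitionKey`, and that a parameterized query is acceptable in place of the literal shown above. The function name is hypothetical.

```python
def airports_in_country(container, country: str) -> list:
    """Return the destination of every 'contains' edge for one country."""
    partition_key = f"country_{country}_contains"
    edges = container.query_items(
        query=(
            "SELECT c.destination, c.destinationType FROM c "
            "WHERE c.partitionKey = @pk AND c.doc_type = 'edge'"
        ),
        parameters=[{"name": "@pk", "value": partition_key}],
        # Passing the partition key keeps the query inside one logical partition.
        partition_key=partition_key,
    )
    return [edge["destination"] for edge in edges]
```

Each returned destination can then be fetched with a point-read (`read_item` in the same SDK), provided the application knows the id and partition key of the corresponding vertex item.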

Double Traversal 

// List all the flights between Mumbai to San Francisco with one layover  

Since this is a double traversal (i.e., like chaining multiple out steps in Gremlin), we execute two queries: the first gets all the routes from Mumbai, and the second gets all the routes flowing into San Francisco (like an in operation).

First Query

select c.destination, c.destinationType from c where c.partitionKey='airport_Mumbai_route' and c.doc_type='edge'

Second Query

select c.destination, c.destinationType from c where c.partitionKey='airport_San Francisco_route' and c.doc_type='edge'

The destinations common to both queries (i.e., the intersection of the two result sets) are the possible layover airports, which gives all the one-stop routes between Mumbai and San Francisco.

Approximate RU: 11-15 RUs for both the queries.
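
The two result sets are combined on the client. A minimal sketch is shown below; it keeps the airports present in both result sets, since a layover must be reachable from Mumbai and also connect onward to San Francisco. It assumes the `destination` field of the second result set holds the airport at the other end of each route into San Francisco. The function name and the sample codes in the usage note are illustrative.

```python
def one_stop_layovers(routes_from_origin, routes_into_destination):
    """Return airports that appear in both result sets, sorted for stable output."""
    reachable = {edge["destination"] for edge in routes_from_origin}
    connecting = {edge["destination"] for edge in routes_into_destination}
    return sorted(reachable & connecting)
```

For example, if the first query returns DXB, LHR and DEL and the second returns LHR, DXB and JFK, the candidate layovers are DXB and LHR.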

// Find the top ten overall in terms of incoming routes 
select COUNT(1) as inroutes, c.destination from c where ENDSWITH(c.partitionKey, '_route') and c.doc_type='edge' GROUP BY c.destination

The query would list the destination with the number of incoming routes to the destination airport. The results would not be ordered but this can be done easily within the code. 
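
Ordering the grouped results in code could be as simple as the sketch below, which assumes each row is a dictionary with the `inroutes` and `destination` fields projected by the query. The function name is hypothetical.

```python
import heapq


def top_incoming(rows, limit=10):
    """Return the rows with the highest inroutes count, largest first."""
    return heapq.nlargest(limit, rows, key=lambda row: row["inroutes"])
```

`heapq.nlargest` avoids sorting the full result set when only the top ten destinations are needed.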

Approximate RU: 2,935 RUs.

Seven best practices to follow 

The following best practices are recommended:

  1. Store the vertex properties as a separate item rather than storing them inline. This reduces storage costs and keeps the data consistent whether it is read from the forward or the reverse direction.
  2. For high-volume low-latency scenarios, create all 3 items (edge-details, and 2 traversal items) as part of a single transaction.
  3. Query these items using the source partition key (pk) and src, which represents the source.
  4. Use the Patch API for updating specific properties within an item. The following operations are supported: Add, Set, Replace, Remove, Increment and Move.
  5. If you have a high-volume low-latency scenario wherein you need to update specific properties within an item, use Patch API for updating all 3 documents (e.g., in our use case scenario, the edge-details and 2 traversal items) as part of a single transaction.
  6. In the queries above, RU consumption was lower in Azure Cosmos DB for NoSQL than in the Gremlin API (for example, roughly 3-5 versus 25 for the single traversal).
  7. Finally, you can leverage Change Feed to create/update the reverse traversal item. This keeps client code simple. Please note the downside of this approach: new items and updates to items will not be reflected in the reverse traversal item instantly.
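
To make the last practice more concrete, the sketch below derives a reverse traversal item from each forward edge seen in a batch of Change Feed documents. Everything about the item shape is an assumption made for illustration: the `source`, `sourceType`, `edgeType` and `direction` fields, the `_in` suffix for the reverse edge type, and both function names are hypothetical and not taken from the customer implementation.

```python
def build_reverse_item(forward: dict) -> dict:
    """Swap source and destination of a forward edge to form its reverse item."""
    reverse_type = f"{forward['edgeType']}_in"   # hypothetical naming convention
    return {
        "id": f"{forward['destination']}.{forward['source']}.{reverse_type}",
        "doc_type": "edge",
        "direction": "reverse",
        "edgeType": reverse_type,
        "source": forward["destination"],
        "sourceType": forward["destinationType"],
        "destination": forward["source"],
        "destinationType": forward["sourceType"],
        "partitionKey": (
            f"{forward['destinationType']}_{forward['destination']}_{reverse_type}"
        ),
    }


def reverse_items_for(changes):
    """Yield reverse items for the forward edges in one Change Feed batch."""
    for doc in changes:
        # Skip vertices and reverse items, so that reverse items written back
        # to the container do not trigger further writes.
        if doc.get("doc_type") == "edge" and doc.get("direction") == "forward":
            yield build_reverse_item(doc)
```

A Change Feed consumer would upsert each yielded item into the container. Because the reverse item id is derived from the forward edge, reprocessing the same change overwrites the existing reverse item instead of creating a duplicate.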

Limitations 

  1. 2x and more levels of traversals will be expensive in terms of cost (RU/s).
  2. 2x and more levels of traversals will have higher latency.
  3. Analytical queries and aggregations are not recommended.
  4. Complex queries that touch multiple edges using both out and in operations are not recommended.

Conclusion 

Using Azure Cosmos DB for NoSQL is recommended for building green-field applications. Use the information in this blog post as guidance to remodel an existing Azure Cosmos DB for Gremlin data model into Azure Cosmos DB for NoSQL. It is advisable to test your application-specific scenarios and query access patterns, and to choose the option that gives you the best price-to-performance ratio. Reach out to Microsoft for guidance on any specific use case. Let us know your thoughts or drop any specific questions in the comments section below.

Where to learn more

Explore the following links for further information:

  • Azure Cosmos DB Data Modeling 
  • Sample air-routes data model, available here for you to experiment with.

About Azure Cosmos DB

Azure Cosmos DB is a fully managed and serverless distributed database for modern app development, with SLA-backed speed and availability, automatic and instant scalability, and support for open-source PostgreSQL, MongoDB, and Apache Cassandra. Try Azure Cosmos DB for free here. To stay in the loop on Azure Cosmos DB updates, follow us on Twitter, YouTube, and LinkedIn.
