October 4th, 2022

Azure Synapse Link support for Azure Cosmos DB Gremlin API now in preview

Rodrigo Souza
SR Program Manager

Azure Cosmos DB’s Gremlin API combines the power of graph database algorithms with highly scalable, managed infrastructure, providing a flexible solution to the common data problems caused by the rigidity of relational approaches. For more information, click here.

[Image: Synapse Link for Gremlin]

Use Cases

The objective of this new capability is to unlock graph analytics workloads, so that customers can analyze the relationships between their graph entities. Typical use cases are social networks, recommendation engines, Customer 360, telecommunications networks, supply chain, and IoT. For more information, click here.

As an example, customers can now use Azure Synapse Link for network analysis such as centrality, connectivity, shortest path, and community detection. This can be achieved by:

  • Batch graph analytics, with the results written elsewhere: for example, bulk updating properties on existing vertices and edges, or creating new edges for the relationships discovered.
  • Large scan and aggregation reporting, which avoids expensive cross-partition Gremlin queries using group() and order() that would likely have a high RU cost and slow performance. The objective is to produce tabular reporting on graph data, populating reports and dashboards.

How to enable Synapse Link for Gremlin API

Currently, customers can use the Azure CLI to enable Synapse Link for Gremlin API. PowerShell will be supported soon. The required steps are:

First, enable Synapse Link in your Gremlin database account:

az cosmosdb update --capabilities EnableGremlin --name MyCosmosDBGremlinDatabaseAccount --resource-group MyResourceGroup --enable-analytical-storage true

 

Then, enable Synapse Link in your graph:

az cosmosdb gremlin graph update -g MyResourceGroup -a MyCosmosDBGremlinDatabaseAccount -d MyGremlinDB -n MyGraph --analytical-storage-ttl -1
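
If you want to script these two steps, the commands can be assembled programmatically. The following is a minimal sketch in Python, assuming the Azure CLI is installed and you are already signed in. The helper name enable_synapse_link_commands is hypothetical, and the sketch uses the long-form option names (--resource-group, --account-name, --database-name, --name). It only prints the commands as a dry run.

```python
import shlex


def enable_synapse_link_commands(account, resource_group, database, graph, ttl=-1):
    """Return the two az CLI commands from this post as argument lists.

    Hypothetical helper for illustration: it builds the commands and
    does not call Azure.
    """
    account_cmd = [
        "az", "cosmosdb", "update",
        "--capabilities", "EnableGremlin",
        "--name", account,
        "--resource-group", resource_group,
        "--enable-analytical-storage", "true",
    ]
    graph_cmd = [
        "az", "cosmosdb", "gremlin", "graph", "update",
        "--resource-group", resource_group,
        "--account-name", account,
        "--database-name", database,
        "--name", graph,
        "--analytical-storage-ttl", str(ttl),
    ]
    return [account_cmd, graph_cmd]


commands = enable_synapse_link_commands(
    account="MyCosmosDBGremlinDatabaseAccount",
    resource_group="MyResourceGroup",
    database="MyGremlinDB",
    graph="MyGraph",
)

# Dry run: print each command as a shell-safe string.
for cmd in commands:
    print(shlex.join(cmd))
```

To actually run a command, each list can be passed to subprocess.run(cmd, check=True). The account step comes first because the graph-level setting depends on analytical storage being enabled on the account.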

 

Do you need to create a Gremlin database account, database, or graph?

Check these Gremlin CLI scripts. Please note that you can also enable Synapse Link when creating your Gremlin database account and your graph. Just use --enable-analytical-storage true with az cosmosdb create to create your Synapse Link enabled database account, and --analytical-storage-ttl -1 with az cosmosdb gremlin graph create to create your Synapse Link enabled graph.

For more information about Synapse Link time to live (TTL) and analytical store data retention, click here. After you enable analytical store in your graph, you can view the analytical TTL in the Azure portal Data Explorer.

Another important detail is that well-defined schema is the default option for Gremlin API. For more information about schema representation, click here.

 

How to analyze your data with Synapse Workspaces

You need an Azure Synapse workspace to analyze Cosmos DB data through Synapse Link. Azure Synapse Studio is the tool within the workspace that you use to create SQL queries or Spark notebooks. This is true for all Cosmos DB APIs that support Synapse Link: SQL, MongoDB, and Gremlin.

Since Synapse Link for Gremlin API is in preview, there are some limitations in Azure Synapse Studio:

Linked Service

A linked service isn’t required to use Azure Synapse Link for Cosmos DB, but it’s a great option to reduce coding and better visualize your Cosmos DB data. The steps to create a linked service for your Gremlin API are:

  • Create a linked service and choose the Azure Cosmos DB (SQL API) type.
  • Instead of the default “From Azure subscription” option, choose “Enter manually”.
  • Copy and paste the .NET SDK URI and the account key for your Gremlin API account. You can use any key, primary or secondary.
  • Enter the name of your graph database.

Now you will be able to see your linked service in the data explorer tree view. Currently you won’t be able to see your graphs listed in the tree view, but you can still query your data.

Querying data with Azure Synapse SQL Serverless

To query your graphs using Azure Synapse SQL serverless, let’s assume these names for a hypothetical scenario:

  • The database account name is MyGremlinAccount
  • The database name is MyGremlinDB
  • The graph name is MyGraph

You can query your data using two different syntaxes:

OPENROWSET with your Gremlin API account key

SELECT TOP 10 * FROM OPENROWSET(
       'CosmosDB',
       'Account=MyGremlinAccount;Database=MyGremlinDB;Key=<your-account-key>',
       MyGraph) as MyGraph
GO

 

OPENROWSET with credential

CREATE CREDENTIAL MyGremlinCredential WITH IDENTITY = 'SHARED ACCESS SIGNATURE', SECRET = '<your-account-key>'
GO

SELECT TOP 10 *
FROM OPENROWSET(
      PROVIDER = 'CosmosDB',
      CONNECTION = 'Account=MyGremlinAccount;Database=MyGremlinDB',
      OBJECT = 'MyGraph',
      SERVER_CREDENTIAL = 'MyGremlinCredential'
    ) as MyGraph
GO

Please note that the credential is created once, which saves you from pasting your database account key into every query. For more information about Synapse Link and SQL Serverless, click here.
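
If you query several graphs, it can help to generate the credential-based query text instead of editing it by hand. The following is a minimal sketch in Python under stated assumptions: the function name build_openrowset_query and the fixed alias graph_data are hypothetical, and the name check is a simple safeguard chosen for this example, not a rule from Synapse. It only builds the query string; running it still requires a connection to your serverless SQL endpoint.

```python
import re

# Conservative pattern for names in this sketch (an assumption, not a Synapse rule).
_NAME = re.compile(r"^[A-Za-z0-9_][A-Za-z0-9_-]*$")


def build_openrowset_query(account, database, graph, credential, top=10):
    """Build the credential-based OPENROWSET query shown in this post."""
    for value in (account, database, graph, credential):
        if not _NAME.match(value):
            raise ValueError(f"unexpected characters in name: {value!r}")
    return (
        f"SELECT TOP {int(top)} *\n"
        "FROM OPENROWSET(\n"
        "      PROVIDER = 'CosmosDB',\n"
        f"      CONNECTION = 'Account={account};Database={database}',\n"
        f"      OBJECT = '{graph}',\n"
        f"      SERVER_CREDENTIAL = '{credential}'\n"
        "    ) AS graph_data"
    )


print(build_openrowset_query(
    "MyGremlinAccount", "MyGremlinDB", "MyGraph", "MyGremlinCredential"
))
```

Note that every quoted value is closed before the comma that follows it, and that the credential name has no leading or trailing spaces inside the quotes.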

Querying data with Azure Synapse Spark

The example below uses GraphFrames, a package for Apache Spark which provides DataFrame-based Graphs. It provides high-level APIs in Scala, Java, and Python. It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. For more information, click here.

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.graphframes._
import org.graphframes.GraphFrame

val df_olap = spark.read.format("cosmos.olap").option("spark.synapse.linkedService", "<Your-Linked-Service-Name>").option("spark.cosmos.container", "MyGraph").load()

//display first 10 entries
//display(df_olap)

var vertices = df_olap.filter($"_sink".isNull).select($"id", $"name",$"age".getItem(0).getItem("_value").as("age"))

val df_edges = (df_olap.filter($"_sink".isNotNull).drop("_isEdge"))

var edges = df_edges.select("_vertexId", "_sink", "label")

edges = edges.withColumnRenamed("_vertexId", "src")

edges = edges.withColumnRenamed("_sink", "dst")

edges = edges.withColumnRenamed("label", "relationship")

// Optional: inspect the vertices and edges
display(vertices)
display(edges)

val graph = GraphFrame(vertices, edges)

// Label Propagation Algorithm to detect communities of vertices based on their connections.

import org.apache.spark.sql.DataFrame

val result = graph.labelPropagation.maxIter(5).run()

result.select("id", "name", "label").orderBy("label").show()
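
To make the Scala steps above easier to follow without a Spark pool, here is a minimal sketch in plain Python that mirrors them on a handful of in-memory rows. The sample rows are hypothetical and only imitate the columns the Scala code reads (rows without _sink are vertices, rows with _sink are edges, and age is a list of objects with a _value field). The label_propagation function is a simplified synchronous variant with a smallest-label tie-break chosen for determinism; it is not GraphFrames' implementation, and GraphFrames may assign different labels.

```python
from collections import Counter, defaultdict

# Hypothetical rows shaped like the columns used in the Scala example.
rows = [
    {"id": "v1", "name": "Ana", "age": [{"_value": 31}]},
    {"id": "v2", "name": "Ben", "age": [{"_value": 45}]},
    {"id": "v3", "name": "Cy", "age": [{"_value": 28}]},
    {"id": "v4", "name": "Dee", "age": [{"_value": 39}]},
    {"id": "v5", "name": "Eli", "age": [{"_value": 52}]},
    {"id": "v6", "name": "Fay", "age": [{"_value": 23}]},
    {"id": "e1", "label": "knows", "_vertexId": "v1", "_sink": "v2"},
    {"id": "e2", "label": "knows", "_vertexId": "v2", "_sink": "v3"},
    {"id": "e3", "label": "knows", "_vertexId": "v3", "_sink": "v1"},
    {"id": "e4", "label": "knows", "_vertexId": "v4", "_sink": "v5"},
    {"id": "e5", "label": "knows", "_vertexId": "v5", "_sink": "v6"},
    {"id": "e6", "label": "knows", "_vertexId": "v6", "_sink": "v4"},
]

# Rows without _sink are vertices; take the first age entry's _value.
vertices = [
    {"id": r["id"], "name": r["name"], "age": r["age"][0]["_value"]}
    for r in rows if r.get("_sink") is None
]

# Rows with _sink are edges; rename columns to src, dst, relationship.
edges = [
    {"src": r["_vertexId"], "dst": r["_sink"], "relationship": r["label"]}
    for r in rows if r.get("_sink") is not None
]


def label_propagation(vertex_ids, edge_list, max_iter=5):
    """Simplified synchronous label propagation (illustrative only)."""
    neighbors = defaultdict(set)
    for e in edge_list:
        # Treat edges as undirected when exchanging labels.
        neighbors[e["src"]].add(e["dst"])
        neighbors[e["dst"]].add(e["src"])
    labels = {v: v for v in vertex_ids}  # every vertex starts in its own community
    for _ in range(max_iter):
        updated = {}
        for v in vertex_ids:
            counts = Counter(labels[n] for n in neighbors[v])
            if not counts:
                updated[v] = labels[v]  # isolated vertex keeps its label
                continue
            best = max(counts.values())
            # Tie-break on the smallest label so the result is deterministic.
            updated[v] = min(l for l, c in counts.items() if c == best)
        labels = updated
    return labels


communities = label_propagation([v["id"] for v in vertices], edges)
for v in sorted(vertices, key=lambda v: communities[v["id"]]):
    print(v["id"], v["name"], communities[v["id"]])
```

With these sample rows, the two disconnected triangles end up as two separate communities, which is the kind of grouping the labelPropagation call produces on real data.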

 

Please note that, unlike Synapse SQL serverless, Synapse Spark can take advantage of the linked service.

For more information about Synapse Link and Spark, click here.

Conclusion

Customers can now create powerful graph analytics workloads to unlock BI, insights, and advanced analytics on top of Azure Cosmos DB Gremlin API data. Stay tuned to our blog for more updates about Azure Synapse Link for Cosmos DB, and please contact our team with any questions you may have.

Author

Rodrigo Souza
SR Program Manager

Rodrigo is a Program Manager on Azure Cosmos DB, focusing on Analytics.
