When JSON Met Graph
Part 2 of 4 – The Multimodal Database Series
A note on naming. Throughout this series, when we say “SQL Server 2025” we also mean Azure SQL. The multimodal capabilities we discuss – native JSON, graph, vector, and columnstore – are available across both the on-premises engine and the Azure SQL family.
In Part 1 we described the moment every fast-moving project hits: the agent built your prototype in a morning, and now the product team wants features that do not fit neatly into relational tables. Device fingerprints arrive as nested JSON. The anti-fraud team needs to trace connections between accounts – a graph problem. Your options are to spin up new databases and wire sync pipelines between them… or to ask whether your existing engine already handles these natively.
Today we answer that question for two data models that complement the relational core in different ways – JSON (schema-flexible) and graph (relationship-native).
The First Question: Why Not Just NVARCHAR(MAX)?
If you have worked with JSON in SQL Server before 2025, you probably stored it as NVARCHAR(MAX). It works. So why does a native JSON type matter?
Here is what the engine actually does when your JSON is stored as NVARCHAR(MAX).
Every time you call JSON_VALUE(column, '$.some.path'), the engine has to:
- Read the entire string from the LOB page tree in the worst case
- Parse it from the first character to locate the path
- Extract the value as a string
For a single row, this is negligible. For a table scan across a million rows, you are parsing a million JSON documents from scratch on every query. The engine does not cache intermediate results between calls.
The native JSON type introduced in SQL Server 2025 changes the storage format. Instead of storing the raw text, the engine stores a pre-parsed binary representation. Path lookups become offset calculations rather than string parsing. The practical difference is 30-50% less storage (no repeated key strings) and significantly faster path extraction.
-- SQL Server 2025: Native JSON type
CREATE TABLE Events (
EventID INT IDENTITY PRIMARY KEY,
PersonID INT NULL, -- FK to Person (used in Section 6)
Data JSON NOT NULL, -- Pre-parsed binary storage, validated on insert
CreatedAt DATETIME2 DEFAULT SYSUTCDATETIME()
);
Notice what happens on INSERT: the engine validates the JSON and converts it to the binary format once. Every subsequent read benefits from that upfront cost. If the JSON is malformed, the insert fails – you get format validation (well-formed JSON) at the storage layer, without having to define a relational schema.
This is a genuine architectural shift, not syntactic sugar. It changes the cost model for the query optimizer.
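You can see the validation behavior with a minimal sketch against the Events table above (the second insert is deliberately malformed):

-- Well-formed JSON is validated and converted to the binary format once, on insert
INSERT INTO Events (Data) VALUES ('{"deviceId":"d9"}');
-- Malformed JSON is rejected at the storage layer: this insert fails with a conversion error
INSERT INTO Events (Data) VALUES ('{"deviceId": }');

The failed insert never reaches the table – the same guarantee a typed column gives you for dates or decimals, applied to document syntax.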
OPENJSON and Window Functions: Thinking in Sets
The next query combines document access with analytical computation in a single execution plan.
INSERT INTO Events (PersonID, Data) VALUES
(1, '{"deviceId":"d1","fingerprint":{"browser":"Chrome","os":"Windows","plugins":["AdBlock","LastPass"]}}'),
(2, '{"deviceId":"d2","fingerprint":{"browser":"Firefox","os":"macOS","plugins":["uBlock","Grammarly","DarkReader"]}}'),
(3, '{"deviceId":"d3","fingerprint":{"browser":"Chrome","os":"Linux","plugins":["AdBlock"]}}');
Now suppose we want to know, for each event, which browser was used and how many other events share that same browser. In a document database, you would run an aggregation pipeline, or pull the data client-side and group it. In SQL Server, you write:
SELECT
e.EventID,
JSON_VALUE(e.Data, '$.deviceId') AS deviceId,
JSON_VALUE(e.Data, '$.fingerprint.browser') AS browser,
JSON_VALUE(e.Data, '$.fingerprint.os') AS os,
COUNT(*) OVER (
PARTITION BY JSON_VALUE(e.Data, '$.fingerprint.browser')
) AS BrowserCount
FROM Events e;
EventID | deviceId | browser | os | BrowserCount
--------|----------|---------|---------|-------------
1 | d1 | Chrome | Windows | 2
3 | d3 | Chrome | Linux | 2
2 | d2 | Firefox | macOS | 1
For scalar values – strings, numbers, booleans at a known path – JSON_VALUE is the right tool. It extracts a single value from the binary JSON via an offset lookup (with the native JSON type) and returns it as a typed scalar. The window function COUNT(*) OVER (PARTITION BY ...) operates directly on the extracted value. This is document access and analytical computation fused into a single execution plan. The optimizer does not hand off to a different engine partway through.
But what about the plugins array? JSON_VALUE cannot help here – it extracts one scalar, and an array is not a scalar. This is where OPENJSON earns its keep.
JSON_VALUE vs. OPENJSON: When to Use Which
The distinction is simple and worth internalizing:
| What you need | Use | Why |
|---|---|---|
| A single scalar at a known path | JSON_VALUE | Returns one value, benefits from binary JSON offset lookup |
| A nested object as a fragment | JSON_QUERY | Preserves the JSON structure, does not attempt scalar extraction |
| Rows from an array, or multiple columns projected at once | OPENJSON | Explodes arrays into rowsets, projects documents into relational columns |
JSON_VALUE and OPENJSON are complementary, not competing. Use JSON_VALUE when you know the path and want a scalar. Use OPENJSON when you need to iterate – over array elements, or over multiple paths projected simultaneously.
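The scalar-versus-fragment distinction is easy to see side by side. In the default lax path mode, JSON_VALUE returns NULL when the path points at an object or array, and JSON_QUERY returns NULL when the path points at a scalar:

SELECT
    JSON_QUERY(Data, '$.fingerprint') AS FingerprintFragment, -- whole nested object, returned as JSON
    JSON_VALUE(Data, '$.fingerprint') AS WrongTool,           -- NULL: path points at an object, not a scalar
    JSON_VALUE(Data, '$.fingerprint.browser') AS Browser      -- scalar extraction works
FROM Events;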
Exploding Arrays with OPENJSON
To work with the plugins array – say, to count how many events include each plugin – you use OPENJSON to explode the array into rows, while still using JSON_VALUE for the scalar browser path:
SELECT
JSON_VALUE(e.Data, '$.fingerprint.browser') AS browser,
plugin.value AS PluginName,
COUNT(*) AS EventCount
FROM Events e
CROSS APPLY OPENJSON(e.Data, '$.fingerprint.plugins') plugin
GROUP BY JSON_VALUE(e.Data, '$.fingerprint.browser'), plugin.value;
browser | PluginName | EventCount
--------|------------|-----------
Chrome | AdBlock | 2
Chrome | LastPass | 1
Firefox | uBlock | 1
Firefox | Grammarly | 1
Firefox | DarkReader | 1
JSON_VALUE extracts the scalar browser. OPENJSON with the path '$.fingerprint.plugins' explodes the array directly – each element becomes a row with a value column. The GROUP BY then aggregates across all events. No client-side iteration, no cursor.
Notice the pattern: JSON_VALUE for the scalar property, OPENJSON for the array. This combination is idiomatic. When you have a document with both scalar fields and arrays, use each function for what it does best rather than forcing everything through OPENJSON ... WITH.
If you were doing this across separate databases – a document store for events, a relational database for analytics – you would extract the data from the document store, transform it in your application, load it into the analytical store, and then run the aggregation. That is ETL for a single query.
JSON Indexes: Teaching the Optimizer About Paths
If you query JSON_VALUE(Data, '$.fingerprint.browser') frequently, does the optimizer have to scan every row?
Before SQL Server 2025, the answer was: you create a computed column and index that.
-- The old approach (SQL Server 2022 and earlier)
ALTER TABLE Events
ADD BrowserComputed AS JSON_VALUE(Data, '$.fingerprint.browser') PERSISTED;
CREATE INDEX IX_Events_Browser ON Events(BrowserComputed);
This works, but the proposition is awkward: you are manually telling the engine which JSON paths to materialize. Each new path needs a new computed column – and if you want seeks, a new index on top. You are effectively denormalizing by hand.
SQL Server 2025 introduces CREATE JSON INDEX, which takes a different approach:
CREATE JSON INDEX IX_EventData
ON Events(Data);
When you omit the FOR clause, the default path is $ – the document root. Because paths are indexed recursively, this single declaration indexes every key and value in the document: $.deviceId, $.fingerprint.browser, $.fingerprint.os, $.fingerprint.plugins[*], and any path added by future schema versions. The engine builds an internal structure (think of it as a B-tree keyed by path + value) that the optimizer can use for equality and range predicates on any indexed path.
If your documents are large but your queries only touch a narrow subtree, you can scope the index with an explicit FOR clause:
-- Only index the fingerprint subtree and deviceId
CREATE JSON INDEX IX_EventData_Scoped
ON Events(Data)
FOR ('$.fingerprint', '$.deviceId');
FOR ('$.fingerprint') recursively indexes everything under that path – $.fingerprint.browser, $.fingerprint.os, $.fingerprint.plugins[*] – without you listing each leaf. We add '$.deviceId' separately because it sits outside the fingerprint subtree. One constraint to know: paths cannot overlap. FOR ('$.fingerprint', '$.fingerprint.browser') raises an error because $.fingerprint already includes $.fingerprint.browser.
For our Events table the documents are small and we query paths at multiple levels, so the default $ (index everything) is the right choice. For tables with large documents where you only query a few paths, scoped paths keep the index smaller.
There is also a variant for array searches:
CREATE TABLE Customers_WithPhones (
CustomerID INT IDENTITY PRIMARY KEY,
CustomerInfo JSON NOT NULL
);
CREATE JSON INDEX IX_CustomerJson
ON Customers_WithPhones (CustomerInfo)
WITH (OPTIMIZE_FOR_ARRAY_SEARCH = ON);
-- Now array containment becomes index-seekable
SELECT * FROM Customers_WithPhones
WHERE JSON_CONTAINS(CustomerInfo, '123-456-7890', '$.phone[*]') = 1;
Without the index, JSON_CONTAINS has to scan every row and search every array element. With the index, the engine can seek directly to the index nodes containing the target value. The difference matters when you have millions of rows.
Schema Evolution Without Downtime
JSON’s schema flexibility is a powerful tool – and one that calls for deliberate guardrails.
Consider an events table where the document shape evolves over time:
-- Week 1 deployment
INSERT INTO Events (PersonID, Data) VALUES
(NULL, '{"version":1,"action":"click","target":"button"}');
-- Week 2 deployment: added analytics block
-- OUTPUT clause returns the inserted row immediately
INSERT INTO Events (PersonID, Data)
OUTPUT
inserted.EventID,
JSON_VALUE(inserted.Data, '$.version' RETURNING INT) AS Version,
JSON_VALUE(inserted.Data, '$.action') AS Action
VALUES
(NULL, '{"version":2,"action":"click","target":"button","analytics":{"duration":1.5}}');
The OUTPUT clause returns the newly inserted row without a separate SELECT. Here it also demonstrates the RETURNING clause inside JSON_VALUE: '$.version' RETURNING INT tells the engine to return the value typed as INT rather than the default NVARCHAR(4000). No CAST wrapper needed – the type conversion happens inside the JSON function itself.
Both documents coexist in the same table. A query that accesses $.analytics.duration returns NULL for version 1 rows – the missing path produces NULL, not an error.
SELECT
JSON_VALUE(Data, '$.action') AS Action,
JSON_VALUE(Data, '$.target') AS Target,
JSON_VALUE(Data, '$.analytics.duration' RETURNING FLOAT) AS Duration
FROM Events
WHERE JSON_VALUE(Data, '$.version') IS NOT NULL;
Action | Target | Duration
-------|--------|----------
click | button | NULL -- v1 row: path does not exist, returns NULL
click | button | 1.5 -- v2 row: path exists, returned as FLOAT
The WHERE clause itself demonstrates the point: we are filtering by a path ($.version) that only exists in some rows, and the engine handles it gracefully. Rows without that path simply return NULL for JSON_VALUE, which the IS NOT NULL predicate filters out.
No ALTER TABLE. No migration script. No downtime.
The tradeoff is real, and it is not small: you lose compile-time guarantees about which paths exist, and more importantly, you lose the data integrity enforcement that a relational model provides. A misspelled path, a renamed field, or a deployment that silently drops a property will not raise an error – the query will simply return NULL, and the bug may go undetected until someone notices wrong results downstream. The JSON type validates syntax (well-formed JSON), not shape (expected fields, types, or constraints).
If your data requires enforced schemas and referential integrity, use a proper relational model – that is what it is built for, and no combination of CHECK constraints on a JSON column will replicate foreign keys, unique constraints, and column-level types. JSON columns are the right choice for the parts of your schema that are genuinely semi-structured and change faster than your release cycle: event logs, telemetry, feature flags, device fingerprints, configuration data. For core entities like Customers or Orders, stick with relational columns.
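There is a middle ground worth knowing about – presence checks without a full relational schema. A CHECK constraint with JSON_PATH_EXISTS (available since SQL Server 2022) can require that a path exists. This is a sketch, not a substitute for relational integrity: it validates presence, not type, range, or references. On our demo Events table, which deliberately mixes document shapes, this particular constraint would reject the existing fingerprint-only rows – which is exactly the kind of enforcement decision it forces you to make:

-- Require every event document to carry a deviceId (presence only, not type)
ALTER TABLE Events
ADD CONSTRAINT CK_Events_HasDeviceId
CHECK (JSON_PATH_EXISTS(Data, '$.deviceId') = 1);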
Graph Tables: The MATCH Syntax
Now let us shift to a different data model: graph queries.
SQL Server’s graph implementation uses node and edge tables. If you have never seen this before, here is the mental model: a node table is a regular table with an additional system-generated $node_id column. An edge table has $from_id and $to_id columns pointing at nodes. Underneath, they are still tables – the engine stores them in the same B-trees, on the same pages, with the same transaction log. The graph semantics are a query-language feature, not a separate storage engine.
CREATE TABLE Person (
PersonID INT PRIMARY KEY,
Name NVARCHAR(100),
RiskScore FLOAT
) AS NODE;
CREATE TABLE Account (
AccountID INT PRIMARY KEY,
Bank NVARCHAR(50)
) AS NODE;
CREATE TABLE Owns AS EDGE; -- Person → Account
CREATE TABLE SentMoney AS EDGE; -- Account → Account
CREATE TABLE Knows AS EDGE; -- Person → Person
Now let us populate the graph with sample data – a small fraud-detection scenario:
-- Insert people
INSERT INTO Person (PersonID, Name, RiskScore) VALUES
(1, 'Alice', 0.3),
(2, 'Bob', 0.6),
(3, 'Carlos', 0.9),
(4, 'Diana', 0.85);
-- Insert accounts
INSERT INTO Account (AccountID, Bank) VALUES
(101, 'Contoso Bank'),
(102, 'Fabrikam Bank'),
(103, 'Contoso Bank'),
(104, 'Northwind Bank');
-- Create ownership edges: Person → Account
INSERT INTO Owns ($from_id, $to_id)
SELECT p.$node_id, a.$node_id
FROM Person p, Account a
WHERE (p.PersonID = 1 AND a.AccountID = 101)
OR (p.PersonID = 2 AND a.AccountID = 102)
OR (p.PersonID = 3 AND a.AccountID = 103)
OR (p.PersonID = 4 AND a.AccountID = 104);
-- Create money-transfer edges: Account → Account
INSERT INTO SentMoney ($from_id, $to_id)
SELECT a1.$node_id, a2.$node_id
FROM Account a1, Account a2
WHERE (a1.AccountID = 101 AND a2.AccountID = 102)
OR (a1.AccountID = 102 AND a2.AccountID = 103);
-- Create social edges: Person ↔ Person
INSERT INTO Knows ($from_id, $to_id)
SELECT p1.$node_id, p2.$node_id
FROM Person p1, Person p2
WHERE (p1.PersonID = 1 AND p2.PersonID = 2)
OR (p1.PersonID = 2 AND p2.PersonID = 3)
OR (p1.PersonID = 2 AND p2.PersonID = 4)
OR (p1.PersonID = 3 AND p2.PersonID = 4);
Why does this matter? Because edges are stored as rows in a table, they inherit the transactional guarantees and operational tooling that regular tables have: backup and recovery, point-in-time restore, and the same security model. In a dedicated graph database, you often trade these capabilities for graph-specific optimizations. Here, you keep them.
The MATCH Pattern
The MATCH clause lets you express graph traversals using an arrow syntax:
-- Find people connected through account transfers
SELECT
p1.Name AS Sender,
p1.RiskScore AS SenderRisk,
a1.Bank AS FromBank,
a2.Bank AS ToBank,
p2.Name AS Receiver,
p2.RiskScore AS ReceiverRisk
FROM
Person p1, Owns o1, Account a1,
SentMoney s,
Account a2, Owns o2, Person p2
WHERE MATCH(p1-(o1)->a1-(s)->a2<-(o2)-p2)
AND p1.PersonID <> p2.PersonID;
Here is the MATCH expression: p1-(o1)->a1-(s)->a2<-(o2)-p2.
Reading left to right: Person p1 owns Account a1, which sent money to Account a2, which is owned by Person p2. The arrow directions encode relationship semantics. The <-(o2)- reversal means we are traversing the “Owns” edge backward – from account to person.
Under the hood, the optimizer translates this into joins. p1-(o1)->a1 becomes a join between the Person node table, the Owns edge table (on $from_id), and the Account node table (on $to_id). The execution plan is a series of joins – the same join infrastructure used for relational queries.
This is worth understanding because it tells you when graph syntax helps and when it does not. If your traversal is a fixed number of hops (like the query above), graph syntax is primarily a readability improvement over manual self-joins. You could write the same query with explicit joins on $node_id, $from_id, and $to_id. It would produce the same execution plan. The MATCH syntax is cleaner, but the performance is identical.
Where graph syntax becomes genuinely useful is in variable-length path queries (using SHORTEST_PATH) and in cases where the traversal pattern is complex enough that hand-writing joins becomes tedious and error-prone.
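As a taste of the variable-length case, here is a sketch using SHORTEST_PATH to find everyone reachable from Alice through one or more Knows hops. The FOR PATH keyword marks the tables that participate in the recursive pattern, and aggregates computed along the path use WITHIN GROUP (GRAPH PATH):

-- People reachable from Alice via one or more Knows edges
SELECT
    p1.Name AS StartPerson,
    STRING_AGG(p2.Name, ' -> ') WITHIN GROUP (GRAPH PATH) AS PathToPerson
FROM
    Person p1,
    Knows FOR PATH k,
    Person FOR PATH p2
WHERE MATCH(SHORTEST_PATH(p1(-(k)->p2)+))
  AND p1.PersonID = 1; -- Alice

Writing this as hand-rolled recursive joins would require a recursive CTE with cycle detection – precisely the tedium the graph syntax removes.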
Finding High-Risk Connections
-- People directly connected to high-risk individuals
SELECT
p1.Name AS Person,
p1.RiskScore,
p2.Name AS KnowsHighRisk,
p2.RiskScore AS TheirRisk
FROM Person p1, Knows k, Person p2
WHERE MATCH(p1-(k)->p2)
AND p2.RiskScore > 0.8;
The filter p2.RiskScore > 0.8 is a relational predicate applied after the graph traversal. In the execution plan, you will see an index seek on Person (if you have an index on RiskScore) feeding into the join. The optimizer treats the graph traversal and the relational filter as parts of the same plan.
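If you run this pattern often, the supporting index is one line. Treat the INCLUDE column as an assumption about your workload (it makes the seek covering for the query above), not a requirement:

-- Covering index for risk-score filters on the Person node table
CREATE INDEX IX_Person_RiskScore ON Person(RiskScore) INCLUDE(Name);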
The Combined Query: Where Multimodality Earns Its Keep
Here is the query that brings everything together. First, the supporting Transactions table:
-- Transactions table for the relational subquery
CREATE TABLE Transactions (
TxnID INT IDENTITY PRIMARY KEY,
PersonID INT NOT NULL,
Amount DECIMAL(12,2),
Timestamp DATETIME2 DEFAULT SYSUTCDATETIME()
);
CREATE INDEX IX_Txn_Person_Time
ON Transactions(PersonID, Timestamp) INCLUDE(Amount);
INSERT INTO Transactions (PersonID, Amount, Timestamp) VALUES
(2, 1500.00, DATEADD(DAY, -2, GETUTCDATE())),
(2, 750.00, DATEADD(DAY, -5, GETUTCDATE())),
(3, 9200.00, DATEADD(DAY, -1, GETUTCDATE())),
(4, 3100.00, DATEADD(DAY, -3, GETUTCDATE()));
Now the combined query. One SQL Server constraint to know: MATCH clauses cannot reference node aliases from an outer JOIN or APPLY. So we pre-compute the graph traversal in a CTE and join the result. The optimizer inlines the CTE, producing a single execution plan that spans all three data models:
-- Graph: pre-compute fraud connections per person
WITH FraudConnections AS (
SELECT
p1.PersonID,
COUNT(*) AS FraudConnectionCount
FROM Person p1, Knows k, Person suspect
WHERE MATCH(p1-(k)->suspect)
AND suspect.RiskScore > 0.8
GROUP BY p1.PersonID
)
SELECT
p.Name,
p.RiskScore,
-- JSON: Extract browser from device fingerprint (RETURNING specifies the type inline)
JSON_VALUE(e.Data, '$.fingerprint.browser' RETURNING NVARCHAR(50)) AS Browser,
-- Graph: Fraud connections (from CTE)
ISNULL(fc.FraudConnectionCount, 0) AS FraudConnections,
-- Relational: Recent transaction total
(
SELECT SUM(Amount)
FROM Transactions t
WHERE t.PersonID = p.PersonID
AND t.Timestamp > DATEADD(DAY, -7, GETUTCDATE())
) AS WeeklyVolume
FROM Person p
JOIN Events e ON p.PersonID = e.PersonID
LEFT JOIN FraudConnections fc ON fc.PersonID = p.PersonID
WHERE p.RiskScore > 0.5
ORDER BY FraudConnections DESC;
Name | RiskScore | Browser | FraudConnections | WeeklyVolume
-------|-----------|---------|------------------|-------------
Bob | 0.6 | Firefox | 2 | 2250.00
Carlos | 0.9 | Chrome | 1 | 9200.00
Bob has two fraud connections (Carlos at 0.9 and Diana at 0.85). Carlos has one (Diana at 0.85). Alice is excluded by the WHERE p.RiskScore > 0.5 filter. Diana has no matching Events row, so she does not appear.
Here is what the optimizer does with this.
- CTE / graph traversal: The FraudConnections CTE traverses the Knows edges and filters targets by RiskScore > 0.8, then groups by PersonID. The optimizer inlines this as a derived table – the CTE does not materialize to a temp table. Under the hood, the MATCH compiles to joins on the Knows edge table ($from_id, $to_id) and the Person node table ($node_id).
- Outer scan: The FROM Person p WHERE p.RiskScore > 0.5 becomes an index seek if there is an index on RiskScore, or a table scan with a filter. This produces the driving rowset.
- Join to Events (JSON): The JOIN Events e ON p.PersonID = e.PersonID brings in the JSON data. The JSON_VALUE(e.Data, '$.fingerprint.browser') call extracts the browser from each event's pre-parsed binary JSON. With the native JSON type, this is an offset lookup, not a string parse.
- Left join to CTE (graph): The LEFT JOIN FraudConnections attaches the pre-computed graph counts. ISNULL(..., 0) handles persons with no high-risk connections.
- Correlated subquery (relational): For each person, the subquery scans transactions for the last 7 days. The index on (PersonID, Timestamp) with INCLUDE(Amount) makes this a covering seek.
The critical point: this is one execution plan. The relational joins, the JSON extraction, the graph traversal, and the aggregations all participate in the same cost-based optimization. The optimizer can reorder operations, push predicates, and choose access methods across data model boundaries.
In a polyglot architecture, you would execute these as four separate queries, aggregate the results in your application, and handle the case where one of them is slow or fails. The multimodal version has one failure mode, one latency budget, and one transaction boundary.
When to Use Which Model
Having all these models available creates a new design question: which one should I use for which data? The answer follows access patterns, not features.
| Your dominant access pattern | Use this model | Why |
|---|---|---|
| Fixed-schema CRUD with joins | Relational tables | Data integrity enforcement, strongest optimizer support |
| Schema varies by row, evolves weekly | JSON columns or JSON type | Flexibility without migrations |
| “Find connected entities N hops away” | Graph (node/edge + MATCH) | Readable traversal syntax, SHORTEST_PATH support |
| “Find things similar to X” | Vector (VECTOR type + DiskANN) | Purpose-built ANN indexes |
| Aggregations over millions of rows | Columnstore indexes | Batch-mode processing, 10x+ compression |
The boundaries are not rigid. You will frequently combine models: relational tables with a JSON column for metadata, graph edges between relational entities, vector columns on otherwise relational rows. The multimodal approach does not ask you to choose one model – it asks you to choose the right model per column.
So What?
Remember the moment from Part 1 – you had a working prototype, and then the requirements started stacking up. The product team wants flexible schemas for event data. The anti-fraud team needs to trace connections between accounts. In the old playbook, this is where the architecture fragments: a document database for events, a graph database for relationships, a sync layer to keep them consistent with your relational core.
That fragmentation is a choice, not a requirement.
SQL Server 2025’s native JSON type is not a wrapper around NVARCHAR(MAX) – it is a binary storage format with indexed path access. Graph tables are not a visualization layer – they compile to the same join infrastructure the optimizer has refined for three decades. When you combine relational, JSON, and graph in a single query, you get one execution plan, not three plans stitched together.
For the agentic workflow, this means your agent does not need three different connection strings, three different query languages, and three different error-handling paths. It queries one engine. The optimizer handles the rest.
That eliminates two of the five databases from Part 1’s polyglot stack – and we are not done yet.
What is Next
- Part 3: Vectors, Analytics, and the End of ETL – DiskANN vector indexes, hybrid search that pushes relational filters before vector computation, columnstore mechanics, and HTAP queries on live data. Two more databases eliminated from the polyglot stack.