Vectors, Analytics, and the End of ETL
Part 3 of 4 – The Multimodal Database Series
In Part 1: The Polyglot Tax we laid out the fraud detection scenario: a transaction comes in, and before you approve or deny it you need five checks. Order history (relational). Device fingerprint (JSON). Connections to fraud rings (graph). Transactions that look like this one (vector similarity). Statistical baselines across millions of rows (analytics). Five data access patterns, five databases in the polyglot model, five sets of everything that can break.
In Part 2: When JSON Met Graph we handled the first three. We stored device fingerprints in native JSON columns on the Events table, traced fraud connections through Person, Account, Knows, and SentMoney graph tables, and joined them with relational Transactions data. One execution plan, three data models. Alice, Bob, Carlos, and Diana stayed in one database.
Two checks remain: finding transactions that look like the current one (vector similarity) and computing statistical baselines across the full transaction history (columnstore analytics). These are the two features that most commonly fracture an architecture. Somebody says “we need similarity search” and a vector store appears. Somebody else says “the anti-fraud team wants real-time baselines” and an ETL pipeline to a data warehouse materializes.
Today we bring both inside the same engine, using the same tables and the same fraud scenario.
Why Vector Search Matters for Fraud
The fourth check from our fraud scenario: find transactions that look like the incoming one. This is not an exact-match problem. A fraudulent transaction for $9,200 from a new device in a new city does not match any specific WHERE clause. But it might look a lot like a pattern of fraud you have seen before: similar amount, similar device profile, similar timing relative to account creation.
Traditional queries match on exact values or ranges. Amount BETWEEN 8000 AND 10000 catches some cases, but it is brittle. You end up with dozens of hand-tuned rules that drift out of date. Vector search takes a different approach: convert each transaction’s characteristics into a numerical embedding, where semantic similarity maps to geometric proximity. A new transaction that feels like known fraud lands close to known fraud vectors in that space, even if the exact amounts and merchants differ.
The question is where this search should run. If you put it in a separate vector database, you get the polyglot tax: an extra network hop, a separate auth system, no transactional guarantees with the relational data we already have in the Person, Account, and Transactions tables. If you put it inside the same engine, the fraud check stays in one query plan. But only if the engine can search vectors efficiently.
Brute-force comparison against every vector in the table is O(n). For a million historical transactions, that is a million distance calculations per incoming fraud check. Each distance calculation involves a multiplication across all vector dimensions, so for 1536-dimensional vectors that is 1,536 multiplications per candidate, times a million candidates. That is not something you want running in a transactional database. For vector search to work alongside OLTP, you need to do much, much better than that. That means an approximate nearest neighbor (ANN) index that gets to the right neighborhood without visiting every vector.
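To make the cost concrete, here is a toy sketch (plain Python, 2-dimensional vectors instead of 1536) of what exact brute-force kNN does: every stored vector is compared against the query, and every comparison touches every dimension, so the work is O(n · d).

```python
def sq_euclidean(a, b):
    # One multiply per dimension: this inner loop is the O(d) cost.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_knn(query, vectors, k):
    # O(n * d): every stored vector is scored, no index shortcuts.
    scored = sorted((sq_euclidean(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in scored[:k]]

# Toy history: 2-dimensional "embeddings" instead of 1536.
history = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [5.0, 5.0]]
print(brute_force_knn([1.0, 0.05], history, k=2))  # [0, 1]
```

Scale the toy numbers up to a million 1536-dimensional vectors and the inner loop runs over 1.5 billion multiplications per fraud check, which is exactly the work an ANN index exists to avoid.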
DiskANN: The Index That Makes This Practical
Microsoft SQL uses DiskANN for vector indexing. DiskANN is a published research project from Microsoft Research, designed for billion-point nearest neighbor search on a single node. Understanding why this index structure matters requires a brief look at the alternatives.
Most ANN indexes (HNSW, IVF) are designed to run entirely in memory. They build graph structures that assume random access to any vector at any time. This works for small-to-medium datasets. At a billion vectors with 1536 dimensions each, you need roughly 6 TB of RAM.
DiskANN was designed from the start for disk-resident data rather than memory-resident structures. It builds a graph where each node connects to a bounded number of neighbors, organized so that traversals require far fewer reads than a brute-force scan. The practical result: billion-scale similarity search with millisecond-range latency, running on commodity storage, not terabytes of RAM.
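DiskANN's real construction and search are considerably more involved, but the core idea, greedy best-first traversal over a bounded-degree neighbor graph, can be sketched in a few lines. This is an illustrative Python toy, not the actual algorithm:

```python
import math

def greedy_search(graph, vectors, entry, query):
    """Walk the neighbor graph toward the query, visiting only a
    fraction of the vectors instead of scanning all of them."""
    current, visited = entry, {entry}
    while True:
        candidates = [n for n in graph[current] if n not in visited]
        if not candidates:
            return current, len(visited)
        # Move to whichever unvisited neighbor is closest to the query.
        best = min(candidates, key=lambda n: math.dist(vectors[n], query))
        visited.add(best)
        if math.dist(vectors[best], query) >= math.dist(vectors[current], query):
            return current, len(visited)  # no neighbor improves: stop
        current = best

# Toy dataset on a line; each node links to at most four neighbors.
vectors = [[float(i)] for i in range(10)]
graph = {i: [j for j in (i - 2, i - 1, i + 1, i + 2) if 0 <= j < 10]
         for i in range(10)}
nearest, touched = greedy_search(graph, vectors, entry=0, query=[7.3])
print(nearest, touched)  # finds vector 7 after touching only 6 of 10 vectors
```

The bounded neighbor count is what makes the structure disk-friendly: each hop is a small, predictable read, and the number of hops grows slowly with dataset size rather than linearly.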
Let us add vector capabilities to our fraud detection tables. We defined Transactions earlier with TxnID, PersonID, Amount, and Timestamp. Now we extend it with an embedding column that captures the full transaction signature: amount, time-of-day, merchant category, device fingerprint hash, and account age, all encoded into a single vector by an embedding model.
-- Extend our Transactions table with vector embeddings
-- Each embedding encodes the transaction's behavioral signature:
-- amount, time-of-day, merchant category, device profile, account age
CREATE TABLE TransactionVectors (
    TxnID INT PRIMARY KEY,
    PersonID INT NOT NULL,
    Amount DECIMAL(12,2),
    Timestamp DATETIME2 DEFAULT SYSUTCDATETIME(),
    IsFraud BIT DEFAULT 0,
    TxnEmbedding VECTOR(1536) -- behavioral signature from embedding model
);
-- DiskANN index for approximate nearest neighbor search
CREATE VECTOR INDEX IX_Txn_Embedding
ON TransactionVectors(TxnEmbedding)
WITH (METRIC = 'cosine', TYPE = 'DiskANN');
VECTOR(1536) is a native type, not a VARBINARY blob storing serialized floats. The engine validates dimensionality on insert: put in a 768-dimensional vector and the insert fails. Vectors are stored in an optimized binary format but exposed as JSON arrays for convenience. Each element is stored as a single-precision (4-byte) float, so 1536 dimensions is 6,144 bytes per vector.
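As a mental model only (hypothetical Python, not the engine's implementation), the type behaves like a fixed-dimension container of 4-byte floats that rejects mismatched input:

```python
from array import array

class Vector:
    """Toy model of a fixed-dimension vector column storing float32."""
    def __init__(self, dims, values):
        if len(values) != dims:
            # Mirrors the engine rejecting a 768-dim insert into VECTOR(1536)
            raise ValueError(f"expected {dims} dimensions, got {len(values)}")
        self.data = array("f", values)  # 'f' = 4-byte single precision

    def storage_bytes(self):
        return len(self.data) * self.data.itemsize

v = Vector(1536, [0.0] * 1536)
print(v.storage_bytes())  # 6144 bytes: 1536 dimensions x 4 bytes each
```

The arithmetic is the point: single-precision storage is what keeps a 1536-dimensional vector at 6 KB instead of 12 KB for doubles.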
Creating a vector is straightforward. The first two examples show literal syntax; the third shows how you would generate an embedding from a registered model:
-- Implicit casting from JSON array format
DECLARE @v VECTOR(3) = '[1.0, -0.2, 30]';
-- Or from a JSON_ARRAY expression
DECLARE @v2 VECTOR(3) = JSON_ARRAY(1.0, -0.2, 30);
-- Using the built-in embedding function with a registered model
DECLARE @v3 VECTOR(1536) = AI_GENERATE_EMBEDDINGS(
N'high-value transfer new device international merchant' USE MODEL Ada2Embeddings
);
METRIC = 'cosine' tells the index which distance function to use when building the graph. Cosine distance measures the angle between vectors; two vectors pointing in the same direction have distance 0, regardless of magnitude. This is the standard choice for embeddings. The engine also supports 'dot' (negative dot product) and 'euclidean' distance.
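The magnitude-invariance property is easy to verify numerically. A minimal Python sketch of cosine distance:

```python
import math

def cosine_distance(a, b):
    # 1 - cos(angle): 0 for parallel vectors, 1 for orthogonal ones.
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return 1.0 - dot / (norm(a) * norm(b))

a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction, double the magnitude
c = [4.0, -3.0]  # perpendicular to a
print(cosine_distance(a, b))  # 0.0 -- magnitude does not matter
print(cosine_distance(a, c))  # 1.0 -- orthogonal
```

This is why cosine suits embeddings: two transactions with the same behavioral profile but different overall "intensity" still land at distance zero from each other.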
One constraint worth designing around today: the table must have a clustered key for the vector index. Beyond that, the early preview limitation where tables became read-only after index creation is gone. Tables with DiskANN indexes now support full DML: inserts, updates, deletes, and merges all work without dropping the index. The index is maintained asynchronously in the background. Your DML commits immediately; the DiskANN graph incorporates the changes shortly after. Recently modified rows are queryable right away, even before the background maintenance fully integrates them into the graph.
This matters for fraud detection specifically. Consider what happens when the anti-fraud team confirms a case: the investigator marks the transaction as fraud, and the system re-embeds it into the vector table so future similarity searches can find it. In the old model, you would drop the DiskANN index, perform the writes, and rebuild it, blocking all vector queries in between. Now:
-- Fraud confirmed on TxnID 42. Update the flag - index stays online.
UPDATE TransactionVectors
SET IsFraud = 1
WHERE TxnID = 42;
-- A new suspicious transaction arrives. Insert it with its embedding - no index rebuild.
INSERT INTO TransactionVectors (TxnID, PersonID, Amount, IsFraud, TxnEmbedding)
VALUES (
1001, 3, 9200.00, 0,
AI_GENERATE_EMBEDDINGS(
N'$9200 international wire, new device, account age 3 days' USE MODEL Ada2Embeddings
)
);
-- Both changes are immediately visible to vector search.
-- The DiskANN graph integrates them asynchronously in the background.
The fraud knowledge base is always current. Every confirmed fraud instantly becomes a reference point for future similarity searches. Every new transaction is searchable the moment it lands. No maintenance windows, no index-drop-rebuild cycles, no stale embeddings gap where a known fraud pattern is invisible to the system.
Querying Vectors: Exact vs. Approximate
Microsoft SQL provides two ways to search vectors. The distinction is not syntactic preference. It changes the execution model: exact search computes every distance, approximate search walks the DiskANN graph. Both support relational filters.
Exact Search with VECTOR_DISTANCE
-- An incoming transaction arrives. Compute its embedding.
DECLARE @incomingTxn VECTOR(1536) = AI_GENERATE_EMBEDDINGS(
N'$9200 international wire, new device, account age 3 days' USE MODEL Ada2Embeddings
);
-- Find the 10 most similar historical transactions
SELECT TOP (10)
TxnID,
PersonID,
Amount,
IsFraud,
VECTOR_DISTANCE('cosine', TxnEmbedding, @incomingTxn) AS Distance
FROM TransactionVectors
ORDER BY Distance;
VECTOR_DISTANCE is a scalar function. It computes the distance between two vectors and returns a float. It can appear anywhere a scalar function can: SELECT, WHERE, ORDER BY, HAVING. The optimizer handles it like any other scalar expression in the query plan.
This is an exact nearest neighbor search (kNN). The engine computes the distance for every qualifying row and sorts the results. The Microsoft documentation recommends exact search when you have fewer than 50,000 candidate vectors. The table can contain many more vectors than that, as long as your WHERE predicates reduce the candidate set to 50,000 or fewer.
The critical advantage of VECTOR_DISTANCE is composability. It participates in the standard query plan, which means the optimizer can push relational predicates before computing distances. More on this in a moment.
Approximate Search with VECTOR_SEARCH
DECLARE @incomingTxn VECTOR(1536) = AI_GENERATE_EMBEDDINGS(
N'$9200 international wire, new device, account age 3 days' USE MODEL Ada2Embeddings
);
SELECT TOP (10) WITH APPROXIMATE
t.TxnID,
t.PersonID,
t.Amount,
t.IsFraud,
vs.distance
FROM VECTOR_SEARCH(
TABLE = TransactionVectors AS t,
COLUMN = TxnEmbedding,
SIMILAR_TO = @incomingTxn,
METRIC = 'cosine'
) AS vs
ORDER BY vs.distance;
VECTOR_SEARCH is a table-valued function that uses the DiskANN index. It traverses the graph to find the approximate nearest neighbors and returns them with their distances. The table alias (t) references columns from the source table. The result alias (vs) provides the distance column. The WITH APPROXIMATE clause on SELECT TOP tells the optimizer that approximate results are acceptable; the optimizer then decides whether to use the DiskANN index or fall back to exact kNN based on data size, predicates, and cost.
This is the fast path for large datasets. The engine walks the DiskANN graph, touching a small fraction of the total vectors, and returns results in millisecond-range latency.
VECTOR_SEARCH supports iterative filtering: any WHERE clause you add is evaluated during the graph traversal, not after. The engine checks predicates as it visits candidate nodes, so you get the number of results you asked for, all satisfying your filters. No over-fetching, no unpredictable result counts.
-- Iterative filtering: predicates applied during DiskANN traversal
SELECT TOP (10) WITH APPROXIMATE
t.TxnID,
t.PersonID,
t.Amount,
vs.distance
FROM VECTOR_SEARCH(
TABLE = TransactionVectors AS t,
COLUMN = TxnEmbedding,
SIMILAR_TO = @incomingTxn,
METRIC = 'cosine'
) AS vs
WHERE t.IsFraud = 1
ORDER BY vs.distance;
This means VECTOR_SEARCH now handles both broad similarity search and filtered search over large tables.
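The difference between post-filtering and iterative filtering is worth making concrete. Here is an illustrative Python sketch (toy distances, not the DiskANN implementation) of the two strategies:

```python
def post_filter(ranked, predicate, k):
    """Standalone vector-DB style: fetch top-k by distance first,
    filter afterward. You may end up with fewer than k survivors."""
    return [r for r in ranked[:k] if predicate(r)]

def iterative_filter(ranked, predicate, k):
    """Engine style: apply the predicate during traversal and keep
    going until k qualifying rows are found."""
    out = []
    for r in ranked:
        if predicate(r):
            out.append(r)
            if len(out) == k:
                break
    return out

# (distance, is_fraud) pairs, already sorted by distance.
ranked = [(0.05, False), (0.07, True), (0.10, False),
          (0.12, False), (0.15, True), (0.20, True)]
is_fraud = lambda r: r[1]
print(post_filter(ranked, is_fraud, k=3))       # only 1 fraud row survives
print(iterative_filter(ranked, is_fraud, k=3))  # exactly 3 fraud rows
```

Asking for three fraud matches and getting one back is the over-fetching problem in miniature; iterative filtering guarantees the count you asked for, provided enough qualifying rows exist.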
When to Use Which
| Scenario | Use | Why |
|---|---|---|
| Fewer than 50,000 candidate vectors | VECTOR_DISTANCE | Exact results, fully composable with any T-SQL expression |
| Relational filters narrow candidates below 50,000 | VECTOR_DISTANCE with WHERE | Pre-filter shrinks the search space before distance computation |
| Millions of vectors, broad or filtered search | VECTOR_SEARCH with WITH APPROXIMATE | ANN via DiskANN, iterative filtering, sub-linear performance |
| Not sure which path is best | VECTOR_SEARCH with WITH APPROXIMATE | The optimizer auto-chooses DiskANN or exact kNN based on cost |
The Pre-Filter Pattern: Where the Engine Earns Its Keep
Here is the pattern that makes the multimodal approach categorically different from a standalone vector database. Both VECTOR_DISTANCE and VECTOR_SEARCH now support filtered search, but the mechanism differs. VECTOR_DISTANCE pushes relational predicates first, then computes distances over the reduced set (exact). VECTOR_SEARCH evaluates predicates iteratively during graph traversal (approximate). For small candidate sets, VECTOR_DISTANCE gives exact results. For large tables, VECTOR_SEARCH with iterative filtering is faster. Either way, the filters and the vectors share the same engine.
Back to our fraud check. Carlos just initiated a $9,200 wire transfer. We want to find similar transactions, but only among transactions flagged as fraudulent in the last 90 days. There is no point comparing against millions of legitimate grocery purchases.
DECLARE @incomingTxn VECTOR(1536) = AI_GENERATE_EMBEDDINGS(
N'$9200 international wire, new device, account age 3 days' USE MODEL Ada2Embeddings
);
SELECT TOP (10)
tv.TxnID,
tv.PersonID,
tv.Amount,
VECTOR_DISTANCE('cosine', tv.TxnEmbedding, @incomingTxn) AS Distance
FROM TransactionVectors tv
WHERE
tv.IsFraud = 1 -- only known fraud
AND tv.Timestamp > DATEADD(DAY, -90, GETUTCDATE()) -- last 90 days
AND tv.Amount > 5000 -- high-value only
ORDER BY Distance;
Here is what happens in the execution plan. The relational filters (IsFraud = 1, recent timestamp, Amount > 5000) execute first. If there are indexes on these columns, the engine seeks and filters before any distance computation happens. This reduces the candidate set dramatically.
Suppose you have 10 million historical transactions but only 3,000 are high-value fraud in the last 90 days. The engine computes 3,000 distance calculations instead of 10 million. That is a 3,300x reduction in compute, achieved by pushing relational predicates before the vector math. And because the candidate set is well under the 50,000-vector threshold, exact search via VECTOR_DISTANCE is the recommended approach.
In a pure vector database, you search the full 10 million vectors for nearest neighbors, then filter for fraud status and date range in your application. Many nearest neighbors will be legitimate transactions, so you over-fetch and hope enough survive the filtering. This is wasteful and unpredictable.
The multimodal approach is not marginally better. It is categorically different. The fraud flag, the timestamp range, and the vector distance all live in the same query plan, with the optimizer choosing the most selective predicate first. For a small candidate set like this one, VECTOR_DISTANCE gives exact results. For a larger set, the same WHERE predicates work with VECTOR_SEARCH via iterative filtering. An external vector database cannot do either because it does not have your IsFraud column or your Timestamp index.
Columnstore: Statistical Baselines for Detecting Fraud
The fifth check is statistical baselines: aggregate analytics across millions of rows. The anti-fraud team needs to answer a deceptively simple question: what does “normal” look like for this person, this merchant category, this time of day? Without that context, a $9,200 wire transfer is just a number. With it, the system can flag it as three standard deviations above Carlos’s average.
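The statistical test behind that flag is the classic three-sigma rule. A minimal Python sketch, using the sample standard deviation to match T-SQL's STDEV (the history amounts below are hypothetical, not the series' data):

```python
import statistics

def assess(amount, history):
    """Flag an amount more than three standard deviations above the mean."""
    avg = statistics.mean(history)
    sd = statistics.stdev(history) if len(history) > 1 else 0.0
    return "ANOMALOUS" if amount > avg + 3 * sd else "NORMAL"

history = [750.00, 1125.00, 1500.00]    # hypothetical prior amounts
print(assess(9200.00, history))  # ANOMALOUS: far above mean + 3 * stdev
print(assess(1400.00, history))  # NORMAL: within the baseline band
```

Note the single-transaction edge case: with only one historical row, STDEV returns NULL in T-SQL, which is why the queries later in this post wrap it in ISNULL.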
Row Store vs. Column Store
Our Transactions table works fine for point lookups: “show me Bob’s last 5 transactions.” But the anti-fraud team wants to aggregate across millions of rows: average transaction amount by person, standard deviation by merchant category, daily transaction volume trends. That is a different access pattern, and the performance difference between row-oriented and column-oriented storage is not incremental. “60x faster” sounds like marketing until you understand the physics.
Consider a Transactions table that has grown to hundreds of millions of rows over time:
-- TransactionHistory: the full ledger, growing daily
CREATE TABLE TransactionHistory (
TxnID INT IDENTITY PRIMARY KEY,
PersonID INT NOT NULL,
AccountID INT NOT NULL,
Amount DECIMAL(12,2),
MerchantCategory NVARCHAR(50),
Timestamp DATETIME2 DEFAULT SYSUTCDATETIME(),
Region NVARCHAR(50)
);
-- One statement. Analytical queries now run in batch mode.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_TxnHistory ON TransactionHistory;
A typical baseline query aggregates a few columns across millions of rows:
-- Compute baseline statistics for fraud detection
SELECT
PersonID,
COUNT(*) AS TotalTransactions,
AVG(Amount) AS AvgAmount,
STDEV(Amount) AS StdDevAmount,
MAX(Amount) AS MaxAmount,
COUNT(DISTINCT MerchantCategory) AS UniqueCategories
FROM TransactionHistory
WHERE Timestamp >= DATEADD(DAY, -90, GETUTCDATE())
GROUP BY PersonID;
This query touches 4 of 7 columns. Here is where the physics matter.
In a row store, data is laid out as complete rows on disk. To read Amount, the engine reads the entire row, including AccountID, MerchantCategory, and Region, even when the query does not need all of them. For 500 million rows at roughly 80 bytes per row, that is about 40 GB of I/O.
In a column store, data is organized by column. The engine reads only the columns the query needs. But the advantage compounds from four independent effects:
1. Compression. A column of Region values might contain 10 distinct strings repeated across hundreds of millions of rows. Dictionary encoding replaces each with a small integer. Run-length encoding collapses consecutive identical values. Typical compression ratios: 10-15x. That 40 GB table becomes 3-4 GB on disk.
2. Column elimination. The query uses 4 of 7 columns. The engine never reads AccountID, Region, or the full TxnID sequence from disk.
3. Batch mode execution. Instead of evaluating one row at a time, the engine processes rows in batches of roughly 900 through vectorized operators. The CPU’s SIMD instructions process multiple values in parallel. For SUM, COUNT, AVG, and STDEV, this is on the order of a 100x throughput improvement per core.
4. Segment elimination. Each column segment (typically 1 million rows) stores min/max metadata. If a segment’s Timestamp range is entirely before our 90-day window, the engine skips it entirely. No rows read, no decompression, no CPU work.
These four effects compound:
| Factor | Row Store | Columnstore |
|---|---|---|
| Data read from disk | ~40 GB | ~3 GB (compression + column elimination) |
| CPU processing model | 1 row at a time | ~900-row batches (batch mode) |
| Segments skipped | 0% | Up to 80% (segment elimination) |
| Typical query time (500M rows) | ~20 minutes | ~4 seconds |
The speedup is not a benchmark-tuned outlier. It is the expected outcome when compression, column elimination, batch mode, and segment elimination all apply. For aggregate queries like fraud baselines, they do.
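Two of these effects are mechanical enough to sketch directly. Here is an illustrative Python model of dictionary encoding and min/max segment elimination (toy structures, not the engine's storage format):

```python
def dictionary_encode(column):
    """Replace repeated strings with small integer codes plus a dictionary."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in column]
    return encoded, list(codes)

def build_segments(timestamps, segment_size):
    """Each segment stores min/max metadata alongside its rows."""
    segs = []
    for i in range(0, len(timestamps), segment_size):
        chunk = timestamps[i:i + segment_size]
        segs.append({"rows": chunk, "min": min(chunk), "max": max(chunk)})
    return segs

def scan_with_elimination(segments, cutoff):
    """Skip any segment whose entire range falls before the cutoff:
    no rows read, no decompression, no CPU work."""
    return [s for s in segments if s["max"] >= cutoff]

region = ["EU", "EU", "US", "EU", "US", "US", "EU", "EU"]
encoded, dictionary = dictionary_encode(region)
print(encoded, dictionary)  # [0, 0, 1, 0, 1, 1, 0, 0] ['EU', 'US']

days = list(range(1, 21))            # transaction day numbers 1..20
segments = build_segments(days, 5)   # four segments of five days each
hot = scan_with_elimination(segments, cutoff=16)
print(len(segments), len(hot))       # 4 segments built, only 1 scanned
```

The metadata check costs a comparison per segment; skipping a million-row segment in exchange for two integer comparisons is where the "up to 80% skipped" figure comes from when queries filter on a recent time window.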
HTAP: Baselines on Live Data
Traditional architectures separate OLTP from analytics. An ETL pipeline copies data to a warehouse overnight. Your fraud baselines are always at least a day old. A fraudster who created an account this morning and started moving money this afternoon will not show up in yesterday’s baselines.
Microsoft SQL’s columnstore works on live operational data. You can run the baseline query above while thousands of transactions are being inserted. New rows go into a “delta store” (row-based) and are periodically compressed into columnstore segments by a background process.
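The delta-store mechanics can be modeled simply. This is an illustrative Python sketch; names like `ColumnstoreTable` and the threshold value are mine, not engine internals:

```python
class ColumnstoreTable:
    """Toy HTAP model: inserts land in a row-based delta store, and a
    background step compresses full batches into column segments."""
    def __init__(self, compress_threshold):
        self.delta = []      # recent rows, row-oriented
        self.segments = []   # compressed, column-oriented
        self.threshold = compress_threshold

    def insert(self, row):
        self.delta.append(row)       # the insert commits immediately
        if len(self.delta) >= self.threshold:
            self._compress()         # background "tuple mover" step

    def _compress(self):
        cols = {k: [r[k] for r in self.delta] for k in self.delta[0]}
        self.segments.append(cols)
        self.delta = []

    def scan_amounts(self):
        # Queries read both stores, so results include rows from seconds ago.
        compressed = [a for seg in self.segments for a in seg["Amount"]]
        recent = [r["Amount"] for r in self.delta]
        return compressed + recent

t = ColumnstoreTable(compress_threshold=3)
for amt in [100.0, 250.0, 75.0, 9200.0]:
    t.insert({"Amount": amt})
print(len(t.segments), len(t.delta))  # 1 compressed segment, 1 row in delta
print(sum(t.scan_amounts()))          # 9625.0: the freshest row is included
```

The key property the sketch captures: a scan never waits for compression, so the baseline query below sees every committed row regardless of which store it currently lives in.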
-- Real-time fraud baseline: runs against live data, not last night's ETL
SELECT
p.Name,
p.RiskScore,
baseline.AvgAmount,
baseline.StdDevAmount,
baseline.MaxAmount,
baseline.UniqueCategories,
-- Flag if Carlos's $9,200 is more than 3 standard deviations above his average
CASE WHEN 9200 > baseline.AvgAmount + (3 * ISNULL(baseline.StdDevAmount, 0))
THEN 'ANOMALOUS' ELSE 'NORMAL' END AS AmountAssessment
FROM Person p
INNER JOIN (
SELECT
th.PersonID,
AVG(th.Amount) AS AvgAmount,
STDEV(th.Amount) AS StdDevAmount,
MAX(th.Amount) AS MaxAmount,
COUNT(DISTINCT th.MerchantCategory) AS UniqueCategories
FROM TransactionHistory th
WHERE th.Timestamp >= DATEADD(DAY, -90, GETUTCDATE())
GROUP BY th.PersonID
) baseline ON baseline.PersonID = p.PersonID
WHERE p.PersonID IN (2, 3, 4); -- Bob, Carlos, Diana
Name | RiskScore | AvgAmount | StdDevAmount | MaxAmount | UniqueCategories | AmountAssessment
-------|-----------|-----------|--------------|-----------|------------------|------------------
Bob | 0.6 | 1125.00 | 530.33 | 1500.00 | 3 | ANOMALOUS
Carlos | 0.9 | 9200.00 | NULL | 9200.00 | 1 | NORMAL
Diana | 0.85 | 3100.00 | NULL | 3100.00 | 2 | ANOMALOUS
No ETL pipeline. No overnight sync. No “the warehouse data is 14 hours stale” conversation. The anti-fraud team gets baselines computed against data that includes transactions from the last few seconds.
The point lookup (Person p) runs in row mode against B-tree indexes. The aggregation over TransactionHistory runs in batch mode against the columnstore. The optimizer picks the right execution mode for each part of the plan.
The tradeoff: heavy analytical queries and OLTP do compete for CPU and memory on the same instance. Resource Governor can manage this for many workloads. For truly heavy analytics, read replicas or Microsoft Fabric mirroring (real-time sync, not batch ETL) provide separation without the pipeline maintenance.
The Five-Model Fraud Check
We described five checks at the start of this series and combined three of them in the previous post. Now we have all five. Here is what the complete fraud check looks like as a single stored procedure:
CREATE PROCEDURE dbo.FraudCheck
@PersonID INT,
@IncomingAmount DECIMAL(12,2),
@TxnEmbedding VECTOR(1536)
AS
BEGIN
-- 1. GRAPH: How many known high-risk people is this person connected to?
WITH FraudConnections AS (
SELECT
p1.PersonID,
COUNT(*) AS FraudConnectionCount
FROM Person p1, Knows k, Person suspect
WHERE MATCH(p1-(k)->suspect)
AND suspect.RiskScore > 0.8
GROUP BY p1.PersonID
),
-- 4. VECTOR: How similar is this transaction to known fraud?
SimilarFraud AS (
SELECT TOP (5)
tv.TxnID,
tv.Amount AS FraudAmount,
VECTOR_DISTANCE('cosine', tv.TxnEmbedding, @TxnEmbedding) AS Distance
FROM TransactionVectors tv
WHERE tv.IsFraud = 1
AND tv.Timestamp > DATEADD(DAY, -90, GETUTCDATE())
ORDER BY Distance
)
SELECT
p.Name,
p.RiskScore,
-- 2. JSON: Device fingerprint from the latest event
JSON_VALUE(e.Data, '$.fingerprint.browser') AS Browser,
JSON_VALUE(e.Data, '$.fingerprint.os') AS OS,
JSON_VALUE(e.Data, '$.deviceId') AS DeviceId,
-- 3. GRAPH: Fraud ring connections
ISNULL(fc.FraudConnectionCount, 0) AS FraudConnections,
-- 1. RELATIONAL: Recent transaction total
(
SELECT SUM(Amount)
FROM Transactions t
WHERE t.PersonID = p.PersonID
AND t.Timestamp > DATEADD(DAY, -7, GETUTCDATE())
) AS WeeklyVolume,
-- 4. VECTOR: Closest known fraud distance (lower = more similar)
(SELECT MIN(Distance) FROM SimilarFraud) AS ClosestFraudDistance,
(SELECT COUNT(*) FROM SimilarFraud WHERE Distance < 0.15) AS NearFraudMatches,
-- 5. COLUMNSTORE ANALYTICS: Is this amount anomalous for this person?
baseline.AvgAmount,
baseline.StdDevAmount,
CASE WHEN @IncomingAmount > baseline.AvgAmount + (3 * ISNULL(baseline.StdDevAmount, 0))
THEN 'ANOMALOUS' ELSE 'NORMAL' END AS AmountAssessment
FROM Person p
JOIN Events e ON p.PersonID = e.PersonID
LEFT JOIN FraudConnections fc ON fc.PersonID = p.PersonID
INNER JOIN (
SELECT
th.PersonID,
AVG(th.Amount) AS AvgAmount,
STDEV(th.Amount) AS StdDevAmount
FROM TransactionHistory th
WHERE th.Timestamp >= DATEADD(DAY, -90, GETUTCDATE())
GROUP BY th.PersonID
) baseline ON baseline.PersonID = p.PersonID
WHERE p.PersonID = @PersonID;
END;
Call it:
EXEC dbo.FraudCheck @PersonID = 3, @IncomingAmount = 9200.00, @TxnEmbedding = @carlosVector;
One stored procedure. Five data models. One execution plan. One transaction boundary. One auth check. An agent does not coordinate five services. It calls one procedure and gets a complete fraud assessment.
So What?
With vector similarity and columnstore analytics now running alongside relational, JSON, and graph queries, all five fraud checks live in dbo.FraudCheck: one stored procedure, one execution plan, one transaction boundary. The pre-filter pattern, where relational predicates narrow 10 million transactions to 3,000 candidates before any distance computation, is the kind of optimization that only works when the vector data and the relational metadata share the same engine. Columnstore baselines on live data close the HTAP gap without an ETL pipeline.
For an agent, this consolidation is not just architecturally cleaner. Five databases meant five tool calls, five auth contexts, five failure modes, and cumulative network latency. One stored procedure means one call, one auth check, and a single place where things can go wrong.
But consolidation raises the stakes for governance. All five data models behind one endpoint means one security surface to get right, and getting it wrong exposes everything at once. Who gets to call dbo.FraudCheck? What happens when the caller is an autonomous agent that retries on failure and reasons its way to actions? That is the subject of Part 4.
What is Next
- [Part 4: One API, One Security Model, Zero Excuses] – What makes a database agent-ready, row-level security across all five data models, Data API Builder for MCP/REST/GraphQL, and how to start for free.