Regex support for LOB types in T-SQL—available in Azure SQL & SQL Server 2025

At a glance — Native regular expression (regex) functions in T-SQL now accept varchar(max) and nvarchar(max) inputs of up to 2 MB across all seven regex functions, including the two table-valued functions (REGEXP_MATCHES and REGEXP_SPLIT_TO_TABLE). This capability ships in SQL Server 2025 CU5 and is already available in Azure SQL Database, SQL Database in Fabric and Azure SQL Managed Instance configured with the Always-up-to-date update policy. It will reach Managed Instances on the SQL Server 2025 update policy as part of the CU5 rollout. You no longer need to split log files, HTML documents, or large JSON payloads into 8,000-byte chunks just to run a pattern match.

1. Introduction

Regular expressions have long been a cornerstone of modern data processing — used for validation, parsing, transformation, and extracting structured insights from unstructured text. With SQL Server 2025 and Azure SQL, regex is now a first-class T-SQL capability, removing the historical need to rely on SQLCLR functions or application-tier processing.

While the initial release made native regex broadly available, large-object (LOB) inputs were not yet supported on every function. CU5 closes that gap.

Under the hood, T-SQL regex implements POSIX Extended Regular Expression (ERE) semantics, augmented by a curated set of Perl-style features, and is powered by the RE2 engine. RE2 is a linear-time, non-backtracking implementation, which means it is not susceptible to catastrophic backtracking (a class of denial-of-service issue commonly known as ReDoS). That guarantee becomes far more important when the input is a 1.8 MB log blob than when it is an 8,000-byte string.

Release timeline

Milestone	What shipped
Ignite 2025 — General Availability	Regex went GA in SQL Server 2025 and Azure SQL. LOB inputs were initially supported only on `REGEXP_LIKE`, `REGEXP_COUNT`, and `REGEXP_INSTR`. LOB support on `REGEXP_REPLACE` and `REGEXP_SUBSTR` was deferred, and the two table-valued functions (TVFs) accepted only non-LOB string types.
Azure SQL (post-GA service updates)	LOB inputs enabled across all seven functions.
SQL Server 2025 CU5	LOB inputs up to 2 MB enabled on all seven functions in the SQL Server.

What’s new in CU5

varchar(max) and nvarchar(max) inputs are accepted on every regex function.
The input string is capped at 2 MB per function call. The pattern is still capped at 8,000 bytes, which is far larger than any maintainable regular expression should ever need.
Behavior is consistent between Azure SQL and SQL Server, so code you write today is fully portable.

Note — The 2 MB limit applies to the input passed to a single function call, not to the column or row. A single value in a varchar(max) column can still store up to 2 GB; the constraint is that no single regex evaluation can consume more than 2 MB of that value.

Prerequisites

SQL Server 2025 CU5 or later, or Azure SQL Database, or SQL Database in Fabric or Azure SQL Managed Instance configured with the SQL Server 2025 / Always-up-to-date update policy.
The two table-valued functions (REGEXP_MATCHES and REGEXP_SPLIT_TO_TABLE) require database compatibility level 170, unless the database-scoped configuration ALLOW_BUILTIN_TVF_IN_ALL_COMPAT_LEVELS (preview) is enabled.

Note — On Azure SQL Managed Instance (Always-up-to-date), this capability is rolling out region by region. It is already live in regions where the rollout has completed and will light up in the remaining regions as the deployment finishes. Instances on the SQL Server 2025 update policy will receive it as part of the CU5 rollout — coming soon.

Verify compatibility level (170 required for the TVFs) –

SELECT name, compatibility_level
FROM sys.databases
WHERE name = DB_NAME();

-- If necessary:
-- ALTER DATABASE [<your-database>] SET COMPATIBILITY_LEVEL = 170;

2. Working with LOB Data

This section demonstrates the CU5 capabilities against a realistic LOB data. We build a LogEntries table whose RawPayload column holds multi-KB to multi-MB chunks of web server and application output, plus an HtmlPages table for HTML cleansing examples.

2.1 Create the sample schema and data

IF OBJECT_ID('dbo.LogEntries', 'U') IS NOT NULL DROP TABLE dbo.LogEntries;
IF OBJECT_ID('dbo.HtmlPages',  'U') IS NOT NULL DROP TABLE dbo.HtmlPages;

CREATE TABLE dbo.LogEntries
(
    LogId       BIGINT IDENTITY(1,1) PRIMARY KEY,
    Source      SYSNAME       NOT NULL,
    IngestedAt  DATETIME2(3)  NOT NULL DEFAULT SYSUTCDATETIME(),
    RawPayload  VARCHAR(MAX)  NOT NULL   -- LOB column
);

CREATE TABLE dbo.HtmlPages
(
    PageId      INT IDENTITY(1,1) PRIMARY KEY,
    Url         NVARCHAR(2048) NOT NULL,
    Body        NVARCHAR(MAX)  NOT NULL  -- LOB column (Unicode)
);

Now generate realistically large rows. The REPLICATE(CAST(... AS varchar(max)), n) pattern is required because REPLICATE returns NULL when the result would exceed 8,000 bytes unless its first argument is a max type.

-- Synthetic web access-log payload (~252 KB in row 1, plus a separate ~586 KB row).
DECLARE @logLine VARCHAR(500) =
    '127.0.0.1 - alice [21/May/2026:10:15:32 +0000] "GET /api/orders/42 HTTP/1.1" 200 1532 ' +
    'user-agent="Mozilla/5.0" ip=10.0.0.7 email=alice@contoso.com card=4111-1111-1111-1234' + CHAR(10);

DECLARE @bigLog VARCHAR(MAX) =
    REPLICATE(CAST(@logLine AS VARCHAR(MAX)), 1500)                -- ~252 KB
    + '127.0.0.1 - mallory [21/May/2026:10:16:01 +0000] "POST /login HTTP/1.1" 500 0 ' +
      'ip=203.0.113.99 ssn=123-45-6789' + CHAR(10);

INSERT INTO dbo.LogEntries (Source, RawPayload) VALUES
    ('web-01', @bigLog),                                            -- ~252 KB
    ('web-02', REPLICATE(CAST('OK ' AS VARCHAR(MAX)), 200000));     -- ~586 KB

-- Synthetic HTML page (~775 KB / ~396,000 characters).
DECLARE @htmlChunk NVARCHAR(MAX) =
    N'<div class="row"><p>Hello <b>world</b>! Contact <a href="mailto:bob@contoso.com">bob</a>.</p></div>';

INSERT INTO dbo.HtmlPages (Url, Body) VALUES
    (N'https://contoso.example/page-1',
     N'<html><head><title>Big Page</title></head><body>'
     + REPLICATE(@htmlChunk, 4000)
     + N'</body></html>');

-- Confirm payload sizes in bytes.
SELECT LogId, Source, DATALENGTH(RawPayload) AS PayloadBytes FROM dbo.LogEntries;
SELECT PageId, DATALENGTH(Body) AS BodyBytes, LEN(Body) AS BodyChars FROM dbo.HtmlPages;

Results:

LogId	Source	PayloadBytes
1	web-01	258,110
2	web-02	600,000

PageId	BodyBytes	BodyChars
1	792,124	396,062

Before CU5, feeding any of these payloads into REGEXP_REPLACE, REGEXP_SUBSTR, REGEXP_MATCHES, or REGEXP_SPLIT_TO_TABLE would have failed with a type-mismatch error or required a LEFT(RawPayload, 8000)-style truncation. The same queries now run end-to-end.

2.2 REGEXP_LIKE — Filter rows by LOB content

-- Find logs that contain at least one HTTP 5xx response.
SELECT LogId, Source, DATALENGTH(RawPayload) AS PayloadBytes
FROM   dbo.LogEntries
WHERE  REGEXP_LIKE(RawPayload, '"[A-Z]+\s[^"]+\sHTTP/1\.[01]"\s5[0-9]{2}\s');

REGEXP_LIKE is a Boolean predicate: it evaluates to true when the pattern matches anywhere in the input and false otherwise. Because it returns a Boolean rather than a bit, use it directly in WHERE, CASE WHEN, IIF, or CHECK constraint contexts — do not compare it with = 1 or = 0 (the parser rejects that syntax).

Note — REGEXP_LIKE itself requires database compatibility level 170. The other scalar regex functions (REGEXP_COUNT, REGEXP_INSTR, REGEXP_REPLACE, REGEXP_SUBSTR) are available at all compatibility levels.

Results:

LogId	Source	PayloadBytes
1	web-01	258,110

2.3 REGEXP_COUNT — Counting at scale

-- Per-row tally of GET requests, POST requests, and 5xx responses
-- across the entire LOB payload.
SELECT LogId,
       Source,
       REGEXP_COUNT(RawPayload, '"GET\s')        AS Gets,
       REGEXP_COUNT(RawPayload, '"POST\s')       AS Posts,
       REGEXP_COUNT(RawPayload, '\s5[0-9]{2}\s') AS ServerErrors
FROM   dbo.LogEntries;

Results:

LogId	Source	Gets	Posts	ServerErrors
1	web-01	1,500	1	1
2	web-02	0	0	0

2.4 REGEXP_INSTR — Locate the first error

-- 1-based character position (or 0 if no match) of the FIRST 5xx response in each payload.
SELECT LogId,
       Source,
       REGEXP_INSTR(RawPayload, '\s5[0-9]{2}\s', 1, 1, 0) AS FirstErrorPos
FROM   dbo.LogEntries;

Parameter recap: REGEXP_INSTR(string, pattern, start, occurrence, return_option [, flags [, group ]]). A return_option of 0 returns the starting position of the match; 1 returns the position immediately after the last character of the match.

Results:

LogId	Source	FirstErrorPos
1	web-01	258,072
2	web-02	0

2.5 REGEXP_REPLACE — Redact sensitive data in place

PII redaction over LOB payloads was one of the most-requested CU5 scenarios. Before CU5, it required a custom chunked-replace routine; it is now a single expression.

-- Redact credit-card-shaped tokens, U.S. SSN-shaped tokens, and email addresses
-- across the entire payload.
SELECT LogId,
       REGEXP_REPLACE(
           REGEXP_REPLACE(
               REGEXP_REPLACE(
                   RawPayload,
                   '\b[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}\b',
                   '****-****-****-****'),
               '\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b',
               '***-**-****'),
           '\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b',
           '[redacted-email]'
       ) AS RedactedPayload
FROM   dbo.LogEntries;

Or strip every HTML tag from an nvarchar(max) page in a single call:

SELECT PageId,
       LEN(Body)                                     AS OriginalLen,
       LEN(REGEXP_REPLACE(Body, N'<[^>]+>', N''))    AS TextOnlyLen
FROM   dbo.HtmlPages;

Results — the ~775 KB HTML document collapses from 396,062 to 100,008 characters of plain text in a single call:

PageId	OriginalLen	TextOnlyLen
1	396,062	100,008

2.6 REGEXP_SUBSTR — Extract a single value

-- Pull the first IPv4 address out of each log payload.
SELECT LogId,
       REGEXP_SUBSTR(RawPayload,
                     '\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b',
                     1,    -- start position
                     1,    -- occurrence
                     'c',  -- flags: case-sensitive
                     0     -- group: 0 returns the whole match
                    ) AS FirstIp
FROM   dbo.LogEntries;

To return the contents of a specific capture group instead of the entire match, pass its 1-based group number as the final argument.

Results:

LogId	FirstIp
1	127.0.0.1
2	`NULL`

2.7 REGEXP_MATCHES — Every match, set-based

This is where the combination of TVF and LOB delivers the largest productivity gain: extract every structured value from a megabyte of unstructured text in a single set-based query, with no client round-trips.

REGEXP_MATCHES returns one row per match with these columns:

Column	Type	Description
`match_id`	`bigint`	Sequence number of the match (1-based).
`start_position`	`int`	1-based start index of the match.
`end_position`	`int`	1-based end index of the match.
`match_value`	same type as `string_expression`	The entire matched substring.
`substring_matches`	`json`	JSON array describing each capture group, with the shape `[{"value":"…","start":N,"length":N}, …]`.

-- Every email address in every log payload, alongside its row of origin.
SELECT  l.LogId,
        m.match_id,
        m.match_value AS EmailFound
FROM    dbo.LogEntries AS l
CROSS APPLY REGEXP_MATCHES(
        l.RawPayload,
        '\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b'
) AS m
ORDER BY l.LogId, m.match_id;

Capture groups are even more useful — you can project the parts of every log line as columns by reading from the substring_matches JSON document:

-- Parse Common-Log-Format-ish entries into ip, user, status, and bytes columns.
-- The pattern has four capture groups, accessed below as $[0] through $[3].
SELECT  l.LogId,
        m.match_id,
        JSON_VALUE(m.substring_matches, '$[0].value') AS Ip,
        JSON_VALUE(m.substring_matches, '$[1].value') AS UserName,
        JSON_VALUE(m.substring_matches, '$[2].value') AS Status,
        JSON_VALUE(m.substring_matches, '$[3].value') AS Bytes
FROM    dbo.LogEntries AS l
CROSS APPLY REGEXP_MATCHES(
        l.RawPayload,
        '^([0-9.]+)\s-\s(\S+)\s\[[^\]]+\]\s"[^"]+"\s([0-9]{3})\s([0-9]+)',
        'm'    -- multi-line: ^ and $ anchor to each line, not just the whole input
) AS m
ORDER BY l.LogId, m.match_id;

Important — Without the 'm' flag, the ^ anchor matches only at the start of the entire 250 KB input, so you would receive exactly one match for the first line. The multi-line flag is what unlocks per-line extraction.

Results (first two parsed rows):

LogId	match_id	Ip	UserName	Status	Bytes
1	1	127.0.0.1	alice	200	1532
1	2	127.0.0.1	alice	200	1532

2.8 REGEXP_SPLIT_TO_TABLE — Shred a LOB into rows

-- Project the entire log payload as one row per non-empty line.
SELECT  l.LogId,
        s.ordinal AS [LineNo],
        s.value   AS LineText
FROM    dbo.LogEntries AS l
CROSS APPLY REGEXP_SPLIT_TO_TABLE(l.RawPayload, '\r?\n') AS s
WHERE   l.LogId = 1
  AND   s.value <> ''
ORDER BY s.ordinal;

You now have a tabular projection of a multi-megabyte text blob without leaving the engine. You can feed it into a CTE, aggregate it, join it to dimension tables, or materialize it into a staging table — all set-based.

Results (first three rows):

LogId	ordinal	LineText (first 80 chars)
1	1	`127.0.0.1 - alice [21/May/2026:10:15:32 +0000] "GET /api/orders/42 HTTP/1.1" 200`
1	2	`127.0.0.1 - alice [21/May/2026:10:15:32 +0000] "GET /api/orders/42 HTTP/1.1" 200`
1	3	`127.0.0.1 - alice [21/May/2026:10:15:32 +0000] "GET /api/orders/42 HTTP/1.1" 200`

Tip — composing LOB regex pipelines — CROSS APPLY(and OUTER APPLY when you need to preserve rows that produce no matches) is the primary composition primitive. You can stack REGEXP_SPLIT_TO_TABLE (lines) feeding REGEXP_MATCHES (fields per line) feeding ordinary aggregates, all within a single query plan.

2.9 The 2 MB ceiling — strategies for larger inputs

The 2 MB limit applies to the input string of a single regex call. If the value passed to a regex function exceeds 2 MB, the call raises an error (error number 19311, severity 16) rather than silently truncating. That is the intended behavior — silent truncation would hide correctness bugs.

In practice, 2 MB is a generous ceiling: a single log file or HTML document of that size is already unusual, and most real-world LOB data sit comfortably below it. When individual values do exceed the limit, the most reliable approach is to split them into smaller logical units before they land in the column you want to query — for example, by writing one log line, one document section, or one record per row at ingestion time. Because every regex function (including the two TVFs) shares the same 2 MB ceiling, sharding at query time is not generally feasible; doing it at the load path keeps every regex call well under the limit and avoids per-query workarounds.

Bytes vs. characters — The 2 MB limit is measured in bytes, not characters, and the byte count is based on the UTF-8 encoding of the input regardless of the column’s declared type. ASCII characters take 1 byte each, so plain ASCII text can run to roughly two million characters; non-ASCII characters take 2–4 bytes in UTF-8, so fewer characters fit. Keep in mind that DATALENGTH() reports storage size in the column’s own encoding, which may differ from the UTF-8 byte count used by the limit, and LEN() (which counts characters) is best avoided as a sizing check here.

To measure the UTF-8 byte length that the limit actually checks, cast the value to varchar(max) under a UTF-8 collation and take its DATALENGTH:

SELECT DATALENGTH(
           CONVERT(varchar(max),
                   Body COLLATE Latin1_General_100_CI_AS_SC_UTF8)
       ) AS Utf8Bytes
FROM   dbo.HtmlPages;

Anything above 2 * 1024 * 1024 (2,097,152) bytes will be rejected by a regex call on that value.

Have a scenario that genuinely needs more than 2 MB? If your workload requires regex evaluation on individual values larger than the current 2 MB ceiling, we would like to hear about it. Please share the details — data shape, payload size, pattern, and business need — on the Azure SQL feedback portal. Customer feedback directly informs how we prioritize future limit changes.

2.10 Cleanup

DROP TABLE IF EXISTS dbo.LogEntries;
DROP TABLE IF EXISTS dbo.HtmlPages;

3. Summary

What changed in CU5

Before CU5 — LOB inputs were accepted on REGEXP_LIKE, REGEXP_COUNT, and REGEXP_INSTR. The remaining functions — REGEXP_REPLACE, REGEXP_SUBSTR, and the two TVFs (REGEXP_MATCHES, REGEXP_SPLIT_TO_TABLE) — required non-LOB string inputs, which often meant truncating with LEFT(..., 8000) or chunking in the application tier.
After CU5 (and already in Azure SQL) — All seven functions accept varchar(max) and nvarchar(max) inputs of up to 2 MB. The pattern remains capped at 8,000 bytes.

Quick reference

Function	Returns	LOB input (CU5)	Common use case
`REGEXP_LIKE`	Boolean (predicate)	Yes	Filter rows in `WHERE` / `CASE` / `CHECK` predicates
`REGEXP_COUNT`	`int`	Yes	Count occurrences of a pattern
`REGEXP_INSTR`	`int`	Yes	Position of the nth match
`REGEXP_REPLACE`	string	Yes	Redact, cleanse, or normalize text
`REGEXP_SUBSTR`	string	Yes	Extract a single value
`REGEXP_MATCHES` (TVF)	`(match_id, start_position, end_position, match_value, substring_matches)`	Yes	Extract every match plus capture groups (via JSON), set-based
`REGEXP_SPLIT_TO_TABLE` (TVF)	`(value, ordinal)`	Yes	Split a LOB into rows by a regex delimiter

Regex support for LOB types in T-SQL—available in Azure SQL & SQL Server 2025

1. Introduction

Release timeline

What’s new in CU5

Prerequisites

2. Working with LOB Data

2.1 Create the sample schema and data

2.2 REGEXP_LIKE — Filter rows by LOB content

2.3 REGEXP_COUNT — Counting at scale

2.4 REGEXP_INSTR — Locate the first error

2.5 REGEXP_REPLACE — Redact sensitive data in place

2.6 REGEXP_SUBSTR — Extract a single value

2.7 REGEXP_MATCHES — Every match, set-based

2.8 REGEXP_SPLIT_TO_TABLE — Shred a LOB into rows

2.9 The 2 MB ceiling — strategies for larger inputs

2.10 Cleanup

3. Summary

What changed in CU5

Quick reference

Further reading

Category

Topics

Author

0 comments

Leave a commentCancel reply

Read next

Considering NL2SQL? Should your database really be the prompt? How can SQL MCP Server help?

Microsoft SQL Security Across the MAESTRO Stack: Building Secure Agentic AI with Defense-in-Depth

1. Introduction

Release timeline

What’s new in CU5

Prerequisites

2. Working with LOB Data

2.1 Create the sample schema and data

2.2 REGEXP_LIKE — Filter rows by LOB content

2.3 REGEXP_COUNT — Counting at scale

2.4 REGEXP_INSTR — Locate the first error

2.5 REGEXP_REPLACE — Redact sensitive data in place

2.6 REGEXP_SUBSTR — Extract a single value

2.7 REGEXP_MATCHES — Every match, set-based

2.8 REGEXP_SPLIT_TO_TABLE — Shred a LOB into rows

2.9 The 2 MB ceiling — strategies for larger inputs

2.10 Cleanup

3. Summary

What changed in CU5

Quick reference

Further reading

Category

Topics

Share

Author

0 comments

Leave a commentCancel reply

Read next

Considering NL2SQL? Should your database really be the prompt? How can SQL MCP Server help?

Microsoft SQL Security Across the MAESTRO Stack: Building Secure Agentic AI with Defense-in-Depth

Stay informed