At a glance — Native regular expression (regex) functions in T-SQL now accept varchar(max) and nvarchar(max) inputs of up to 2 MB across all seven regex functions, including the two table-valued functions (REGEXP_MATCHES and REGEXP_SPLIT_TO_TABLE). This capability ships in SQL Server 2025 CU5 and is already available in Azure SQL Database, SQL Database in Fabric and Azure SQL Managed Instance configured with the Always-up-to-date update policy. It will reach Managed Instances on the SQL Server 2025 update policy as part of the CU5 rollout. You no longer need to split log files, HTML documents, or large JSON payloads into 8,000-byte chunks just to run a pattern match.
1. Introduction
Regular expressions have long been a cornerstone of modern data processing — used for validation, parsing, transformation, and extracting structured insights from unstructured text. With SQL Server 2025 and Azure SQL, regex is now a first-class T-SQL capability, removing the historical need to rely on SQLCLR functions or application-tier processing.
While the initial release made native regex broadly available, large-object (LOB) inputs were not yet supported on every function. CU5 closes that gap.
Under the hood, T-SQL regex implements POSIX Extended Regular Expression (ERE) semantics, augmented by a curated set of Perl-style features, and is powered by the RE2 engine. RE2 is a linear-time, non-backtracking implementation, which means it is not susceptible to catastrophic backtracking (a class of denial-of-service issue commonly known as ReDoS). That guarantee becomes far more important when the input is a 1.8 MB log blob than when it is an 8,000-byte string.
Release timeline
| Milestone | What shipped |
| Ignite 2025 — General Availability | Regex went GA in SQL Server 2025 and Azure SQL. LOB inputs were initially supported only on REGEXP_LIKE, REGEXP_COUNT, and REGEXP_INSTR. LOB support on REGEXP_REPLACE and REGEXP_SUBSTR was deferred, and the two table-valued functions (TVFs) accepted only non-LOB string types. |
| Azure SQL (post-GA service updates) | LOB inputs enabled across all seven functions. |
| SQL Server 2025 CU5 | LOB inputs up to 2 MB enabled on all seven functions in the SQL Server. |
What’s new in CU5
varchar(max)andnvarchar(max)inputs are accepted on every regex function.- The input string is capped at 2 MB per function call. The pattern is still capped at 8,000 bytes, which is far larger than any maintainable regular expression should ever need.
- Behavior is consistent between Azure SQL and SQL Server, so code you write today is fully portable.
Note — The 2 MB limit applies to the input passed to a single function call, not to the column or row. A single value in a varchar(max) column can still store up to 2 GB; the constraint is that no single regex evaluation can consume more than 2 MB of that value.
Prerequisites
- SQL Server 2025 CU5 or later, or Azure SQL Database, or SQL Database in Fabric or Azure SQL Managed Instance configured with the SQL Server 2025 / Always-up-to-date update policy.
- The two table-valued functions (
REGEXP_MATCHESandREGEXP_SPLIT_TO_TABLE) require database compatibility level 170, unless the database-scoped configurationALLOW_BUILTIN_TVF_IN_ALL_COMPAT_LEVELS(preview) is enabled.
Note — On Azure SQL Managed Instance (Always-up-to-date), this capability is rolling out region by region. It is already live in regions where the rollout has completed and will light up in the remaining regions as the deployment finishes. Instances on the SQL Server 2025 update policy will receive it as part of the CU5 rollout — coming soon.
Verify compatibility level (170 required for the TVFs) –
SELECT name, compatibility_level
FROM sys.databases
WHERE name = DB_NAME();
-- If necessary:
-- ALTER DATABASE [<your-database>] SET COMPATIBILITY_LEVEL = 170;
2. Working with LOB Data
This section demonstrates the CU5 capabilities against a realistic LOB data. We build a LogEntries table whose RawPayload column holds multi-KB to multi-MB chunks of web server and application output, plus an HtmlPages table for HTML cleansing examples.
2.1 Create the sample schema and data
IF OBJECT_ID('dbo.LogEntries', 'U') IS NOT NULL DROP TABLE dbo.LogEntries;
IF OBJECT_ID('dbo.HtmlPages', 'U') IS NOT NULL DROP TABLE dbo.HtmlPages;
CREATE TABLE dbo.LogEntries
(
LogId BIGINT IDENTITY(1,1) PRIMARY KEY,
Source SYSNAME NOT NULL,
IngestedAt DATETIME2(3) NOT NULL DEFAULT SYSUTCDATETIME(),
RawPayload VARCHAR(MAX) NOT NULL -- LOB column
);
CREATE TABLE dbo.HtmlPages
(
PageId INT IDENTITY(1,1) PRIMARY KEY,
Url NVARCHAR(2048) NOT NULL,
Body NVARCHAR(MAX) NOT NULL -- LOB column (Unicode)
);
Now generate realistically large rows. The REPLICATE(CAST(... AS varchar(max)), n) pattern is required because REPLICATE returns NULL when the result would exceed 8,000 bytes unless its first argument is a max type.
-- Synthetic web access-log payload (~252 KB in row 1, plus a separate ~586 KB row).
DECLARE @logLine VARCHAR(500) =
'127.0.0.1 - alice [21/May/2026:10:15:32 +0000] "GET /api/orders/42 HTTP/1.1" 200 1532 ' +
'user-agent="Mozilla/5.0" ip=10.0.0.7 email=alice@contoso.com card=4111-1111-1111-1234' + CHAR(10);
DECLARE @bigLog VARCHAR(MAX) =
REPLICATE(CAST(@logLine AS VARCHAR(MAX)), 1500) -- ~252 KB
+ '127.0.0.1 - mallory [21/May/2026:10:16:01 +0000] "POST /login HTTP/1.1" 500 0 ' +
'ip=203.0.113.99 ssn=123-45-6789' + CHAR(10);
INSERT INTO dbo.LogEntries (Source, RawPayload) VALUES
('web-01', @bigLog), -- ~252 KB
('web-02', REPLICATE(CAST('OK ' AS VARCHAR(MAX)), 200000)); -- ~586 KB
-- Synthetic HTML page (~775 KB / ~396,000 characters).
DECLARE @htmlChunk NVARCHAR(MAX) =
N'<div class="row"><p>Hello <b>world</b>! Contact <a href="mailto:bob@contoso.com">bob</a>.</p></div>';
INSERT INTO dbo.HtmlPages (Url, Body) VALUES
(N'https://contoso.example/page-1',
N'<html><head><title>Big Page</title></head><body>'
+ REPLICATE(@htmlChunk, 4000)
+ N'</body></html>');
-- Confirm payload sizes in bytes.
SELECT LogId, Source, DATALENGTH(RawPayload) AS PayloadBytes FROM dbo.LogEntries;
SELECT PageId, DATALENGTH(Body) AS BodyBytes, LEN(Body) AS BodyChars FROM dbo.HtmlPages;
Results:
| LogId | Source | PayloadBytes |
| 1 | web-01 | 258,110 |
| 2 | web-02 | 600,000 |
| PageId | BodyBytes | BodyChars |
| 1 | 792,124 | 396,062 |
Before CU5, feeding any of these payloads into REGEXP_REPLACE, REGEXP_SUBSTR, REGEXP_MATCHES, or REGEXP_SPLIT_TO_TABLE would have failed with a type-mismatch error or required a LEFT(RawPayload, 8000)-style truncation. The same queries now run end-to-end.
2.2 REGEXP_LIKE — Filter rows by LOB content
-- Find logs that contain at least one HTTP 5xx response.
SELECT LogId, Source, DATALENGTH(RawPayload) AS PayloadBytes
FROM dbo.LogEntries
WHERE REGEXP_LIKE(RawPayload, '"[A-Z]+\s[^"]+\sHTTP/1\.[01]"\s5[0-9]{2}\s');
REGEXP_LIKE is a Boolean predicate: it evaluates to true when the pattern matches anywhere in the input and false otherwise. Because it returns a Boolean rather than a bit, use it directly in WHERE, CASE WHEN, IIF, or CHECK constraint contexts — do not compare it with = 1 or = 0 (the parser rejects that syntax).
Note — REGEXP_LIKE itself requires database compatibility level 170. The other scalar regex functions (REGEXP_COUNT, REGEXP_INSTR, REGEXP_REPLACE, REGEXP_SUBSTR) are available at all compatibility levels.
Results:
| LogId | Source | PayloadBytes |
| 1 | web-01 | 258,110 |
2.3 REGEXP_COUNT — Counting at scale
-- Per-row tally of GET requests, POST requests, and 5xx responses
-- across the entire LOB payload.
SELECT LogId,
Source,
REGEXP_COUNT(RawPayload, '"GET\s') AS Gets,
REGEXP_COUNT(RawPayload, '"POST\s') AS Posts,
REGEXP_COUNT(RawPayload, '\s5[0-9]{2}\s') AS ServerErrors
FROM dbo.LogEntries;
Results:
| LogId | Source | Gets | Posts | ServerErrors |
| 1 | web-01 | 1,500 | 1 | 1 |
| 2 | web-02 | 0 | 0 | 0 |
2.4 REGEXP_INSTR — Locate the first error
-- 1-based character position (or 0 if no match) of the FIRST 5xx response in each payload.
SELECT LogId,
Source,
REGEXP_INSTR(RawPayload, '\s5[0-9]{2}\s', 1, 1, 0) AS FirstErrorPos
FROM dbo.LogEntries;
Parameter recap: REGEXP_INSTR(string, pattern, start, occurrence, return_option [, flags [, group ]]). A return_option of 0 returns the starting position of the match; 1 returns the position immediately after the last character of the match.
Results:
| LogId | Source | FirstErrorPos |
| 1 | web-01 | 258,072 |
| 2 | web-02 | 0 |
2.5 REGEXP_REPLACE — Redact sensitive data in place
PII redaction over LOB payloads was one of the most-requested CU5 scenarios. Before CU5, it required a custom chunked-replace routine; it is now a single expression.
-- Redact credit-card-shaped tokens, U.S. SSN-shaped tokens, and email addresses
-- across the entire payload.
SELECT LogId,
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
RawPayload,
'\b[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}\b',
'****-****-****-****'),
'\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b',
'***-**-****'),
'\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b',
'[redacted-email]'
) AS RedactedPayload
FROM dbo.LogEntries;
Or strip every HTML tag from an nvarchar(max) page in a single call:
SELECT PageId,
LEN(Body) AS OriginalLen,
LEN(REGEXP_REPLACE(Body, N'<[^>]+>', N'')) AS TextOnlyLen
FROM dbo.HtmlPages;
Results — the ~775 KB HTML document collapses from 396,062 to 100,008 characters of plain text in a single call:
| PageId | OriginalLen | TextOnlyLen |
| 1 | 396,062 | 100,008 |
2.6 REGEXP_SUBSTR — Extract a single value
-- Pull the first IPv4 address out of each log payload.
SELECT LogId,
REGEXP_SUBSTR(RawPayload,
'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b',
1, -- start position
1, -- occurrence
'c', -- flags: case-sensitive
0 -- group: 0 returns the whole match
) AS FirstIp
FROM dbo.LogEntries;
To return the contents of a specific capture group instead of the entire match, pass its 1-based group number as the final argument.
Results:
| LogId | FirstIp |
| 1 | 127.0.0.1 |
| 2 | NULL |
2.7 REGEXP_MATCHES — Every match, set-based
This is where the combination of TVF and LOB delivers the largest productivity gain: extract every structured value from a megabyte of unstructured text in a single set-based query, with no client round-trips.
REGEXP_MATCHES returns one row per match with these columns:
| Column | Type | Description |
match_id |
bigint |
Sequence number of the match (1-based). |
start_position |
int |
1-based start index of the match. |
end_position |
int |
1-based end index of the match. |
match_value |
same type as string_expression |
The entire matched substring. |
substring_matches |
json |
JSON array describing each capture group, with the shape [{"value":"…","start":N,"length":N}, …]. |
-- Every email address in every log payload, alongside its row of origin.
SELECT l.LogId,
m.match_id,
m.match_value AS EmailFound
FROM dbo.LogEntries AS l
CROSS APPLY REGEXP_MATCHES(
l.RawPayload,
'\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b'
) AS m
ORDER BY l.LogId, m.match_id;
Capture groups are even more useful — you can project the parts of every log line as columns by reading from the substring_matches JSON document:
-- Parse Common-Log-Format-ish entries into ip, user, status, and bytes columns.
-- The pattern has four capture groups, accessed below as $[0] through $[3].
SELECT l.LogId,
m.match_id,
JSON_VALUE(m.substring_matches, '$[0].value') AS Ip,
JSON_VALUE(m.substring_matches, '$[1].value') AS UserName,
JSON_VALUE(m.substring_matches, '$[2].value') AS Status,
JSON_VALUE(m.substring_matches, '$[3].value') AS Bytes
FROM dbo.LogEntries AS l
CROSS APPLY REGEXP_MATCHES(
l.RawPayload,
'^([0-9.]+)\s-\s(\S+)\s\[[^\]]+\]\s"[^"]+"\s([0-9]{3})\s([0-9]+)',
'm' -- multi-line: ^ and $ anchor to each line, not just the whole input
) AS m
ORDER BY l.LogId, m.match_id;
Important — Without the 'm' flag, the ^ anchor matches only at the start of the entire 250 KB input, so you would receive exactly one match for the first line. The multi-line flag is what unlocks per-line extraction.
Results (first two parsed rows):
| LogId | match_id | Ip | UserName | Status | Bytes |
| 1 | 1 | 127.0.0.1 | alice | 200 | 1532 |
| 1 | 2 | 127.0.0.1 | alice | 200 | 1532 |
2.8 REGEXP_SPLIT_TO_TABLE — Shred a LOB into rows
-- Project the entire log payload as one row per non-empty line.
SELECT l.LogId,
s.ordinal AS [LineNo],
s.value AS LineText
FROM dbo.LogEntries AS l
CROSS APPLY REGEXP_SPLIT_TO_TABLE(l.RawPayload, '\r?\n') AS s
WHERE l.LogId = 1
AND s.value <> ''
ORDER BY s.ordinal;
You now have a tabular projection of a multi-megabyte text blob without leaving the engine. You can feed it into a CTE, aggregate it, join it to dimension tables, or materialize it into a staging table — all set-based.
Results (first three rows):
| LogId | ordinal | LineText (first 80 chars) |
| 1 | 1 | 127.0.0.1 - alice [21/May/2026:10:15:32 +0000] "GET /api/orders/42 HTTP/1.1" 200 |
| 1 | 2 | 127.0.0.1 - alice [21/May/2026:10:15:32 +0000] "GET /api/orders/42 HTTP/1.1" 200 |
| 1 | 3 | 127.0.0.1 - alice [21/May/2026:10:15:32 +0000] "GET /api/orders/42 HTTP/1.1" 200 |
Tip — composing LOB regex pipelines — CROSS APPLY(and OUTER APPLY when you need to preserve rows that produce no matches) is the primary composition primitive. You can stack REGEXP_SPLIT_TO_TABLE (lines) feeding REGEXP_MATCHES (fields per line) feeding ordinary aggregates, all within a single query plan.
2.9 The 2 MB ceiling — strategies for larger inputs
The 2 MB limit applies to the input string of a single regex call. If the value passed to a regex function exceeds 2 MB, the call raises an error (error number 19311, severity 16) rather than silently truncating. That is the intended behavior — silent truncation would hide correctness bugs.
In practice, 2 MB is a generous ceiling: a single log file or HTML document of that size is already unusual, and most real-world LOB data sit comfortably below it. When individual values do exceed the limit, the most reliable approach is to split them into smaller logical units before they land in the column you want to query — for example, by writing one log line, one document section, or one record per row at ingestion time. Because every regex function (including the two TVFs) shares the same 2 MB ceiling, sharding at query time is not generally feasible; doing it at the load path keeps every regex call well under the limit and avoids per-query workarounds.
Bytes vs. characters — The 2 MB limit is measured in bytes, not characters, and the byte count is based on the UTF-8 encoding of the input regardless of the column’s declared type. ASCII characters take 1 byte each, so plain ASCII text can run to roughly two million characters; non-ASCII characters take 2–4 bytes in UTF-8, so fewer characters fit. Keep in mind that DATALENGTH() reports storage size in the column’s own encoding, which may differ from the UTF-8 byte count used by the limit, and LEN() (which counts characters) is best avoided as a sizing check here.
varchar(max) under a UTF-8 collation and take its DATALENGTH:SELECT DATALENGTH(
CONVERT(varchar(max),
Body COLLATE Latin1_General_100_CI_AS_SC_UTF8)
) AS Utf8Bytes
FROM dbo.HtmlPages;
Anything above 2 * 1024 * 1024 (2,097,152) bytes will be rejected by a regex call on that value.
Have a scenario that genuinely needs more than 2 MB? If your workload requires regex evaluation on individual values larger than the current 2 MB ceiling, we would like to hear about it. Please share the details — data shape, payload size, pattern, and business need — on the Azure SQL feedback portal. Customer feedback directly informs how we prioritize future limit changes.
2.10 Cleanup
DROP TABLE IF EXISTS dbo.LogEntries;
DROP TABLE IF EXISTS dbo.HtmlPages;
3. Summary
What changed in CU5
- Before CU5 — LOB inputs were accepted on
REGEXP_LIKE,REGEXP_COUNT, andREGEXP_INSTR. The remaining functions —REGEXP_REPLACE,REGEXP_SUBSTR, and the two TVFs (REGEXP_MATCHES,REGEXP_SPLIT_TO_TABLE) — required non-LOB string inputs, which often meant truncating withLEFT(..., 8000)or chunking in the application tier. - After CU5 (and already in Azure SQL) — All seven functions accept
varchar(max)andnvarchar(max)inputs of up to 2 MB. The pattern remains capped at 8,000 bytes.
Quick reference
| Function | Returns | LOB input (CU5) | Common use case |
REGEXP_LIKE |
Boolean (predicate) | Yes | Filter rows in WHERE / CASE / CHECK predicates |
REGEXP_COUNT |
int |
Yes | Count occurrences of a pattern |
REGEXP_INSTR |
int |
Yes | Position of the nth match |
REGEXP_REPLACE |
string | Yes | Redact, cleanse, or normalize text |
REGEXP_SUBSTR |
string | Yes | Extract a single value |
REGEXP_MATCHES (TVF) |
(match_id, start_position, end_position, match_value, substring_matches) |
Yes | Extract every match plus capture groups (via JSON), set-based |
REGEXP_SPLIT_TO_TABLE (TVF) |
(value, ordinal) |
Yes | Split a LOB into rows by a regex delimiter |
Further reading
- Official documentation: REGEXP_LIKE, REGEXP_COUNT, REGEXP_INSTR, REGEXP_REPLACE, REGEXP_SUBSTR, REGEXP_MATCHES, REGEXP_SPLIT_TO_TABLE.
- Regular expressions overview.
- SQL Server 2025 CU5 release notes.
Closing thought. Native regex was already a significant quality-of-life improvement when it became generally available. CU5 completes the picture: every function, every input size up to 2 MB, every shape — scalar or table-valued. The next time you are tempted to export a column out of the database in order to grep it, try one of the seven regex functions first.
Happy matching. 🧠
0 comments
Be the first to start the discussion.