May 22nd, 2026
0 reactions

Regex support for LOB types in T-SQL—available in Azure SQL & SQL Server 2025

Senior Product Manager

At a glance — Native regular expression (regex) functions in T-SQL now accept varchar(max) and nvarchar(max) inputs of up to 2 MB across all seven regex functions, including the two table-valued functions (REGEXP_MATCHES and REGEXP_SPLIT_TO_TABLE). This capability ships in SQL Server 2025 CU5 and is already available in Azure SQL Database, SQL Database in Fabric and Azure SQL Managed Instance configured with the Always-up-to-date update policy. It will reach Managed Instances on the SQL Server 2025 update policy as part of the CU5 rollout. You no longer need to split log files, HTML documents, or large JSON payloads into 8,000-byte chunks just to run a pattern match.

1. Introduction

Regular expressions have long been a cornerstone of modern data processing — used for validation, parsing, transformation, and extracting structured insights from unstructured text. With SQL Server 2025 and Azure SQL, regex is now a first-class T-SQL capability, removing the historical need to rely on SQLCLR functions or application-tier processing.

While the initial release made native regex broadly available, large-object (LOB) inputs were not yet supported on every function. CU5 closes that gap.

Under the hood, T-SQL regex implements POSIX Extended Regular Expression (ERE) semantics, augmented by a curated set of Perl-style features, and is powered by the RE2 engine. RE2 is a linear-time, non-backtracking implementation, which means it is not susceptible to catastrophic backtracking (a class of denial-of-service issue commonly known as ReDoS). That guarantee becomes far more important when the input is a 1.8 MB log blob than when it is an 8,000-byte string.

Release timeline

Milestone What shipped
Ignite 2025 — General Availability Regex went GA in SQL Server 2025 and Azure SQL. LOB inputs were initially supported only on REGEXP_LIKE, REGEXP_COUNT, and REGEXP_INSTR. LOB support on REGEXP_REPLACE and REGEXP_SUBSTR was deferred, and the two table-valued functions (TVFs) accepted only non-LOB string types.
Azure SQL (post-GA service updates) LOB inputs enabled across all seven functions.
SQL Server 2025 CU5 LOB inputs up to 2 MB enabled on all seven functions in the SQL Server.

What’s new in CU5

  • varchar(max) and nvarchar(max) inputs are accepted on every regex function.
  • The input string is capped at 2 MB per function call. The pattern is still capped at 8,000 bytes, which is far larger than any maintainable regular expression should ever need.
  • Behavior is consistent between Azure SQL and SQL Server, so code you write today is fully portable.

Note — The 2 MB limit applies to the input passed to a single function call, not to the column or row. A single value in a varchar(max) column can still store up to 2 GB; the constraint is that no single regex evaluation can consume more than 2 MB of that value.

Prerequisites

  • SQL Server 2025 CU5 or later, or Azure SQL Database, or SQL Database in Fabric or Azure SQL Managed Instance configured with the SQL Server 2025 / Always-up-to-date update policy.
  • The two table-valued functions (REGEXP_MATCHES and REGEXP_SPLIT_TO_TABLE) require database compatibility level 170, unless the database-scoped configuration ALLOW_BUILTIN_TVF_IN_ALL_COMPAT_LEVELS (preview) is enabled.

Note — On Azure SQL Managed Instance (Always-up-to-date), this capability is rolling out region by region. It is already live in regions where the rollout has completed and will light up in the remaining regions as the deployment finishes. Instances on the SQL Server 2025 update policy will receive it as part of the CU5 rollout — coming soon.

Verify compatibility level (170 required for the TVFs) –

SELECT name, compatibility_level
FROM sys.databases
WHERE name = DB_NAME();

-- If necessary:
-- ALTER DATABASE [<your-database>] SET COMPATIBILITY_LEVEL = 170;

2. Working with LOB Data

This section demonstrates the CU5 capabilities against a realistic LOB data. We build a LogEntries table whose RawPayload column holds multi-KB to multi-MB chunks of web server and application output, plus an HtmlPages table for HTML cleansing examples.

2.1 Create the sample schema and data

IF OBJECT_ID('dbo.LogEntries', 'U') IS NOT NULL DROP TABLE dbo.LogEntries;
IF OBJECT_ID('dbo.HtmlPages',  'U') IS NOT NULL DROP TABLE dbo.HtmlPages;

CREATE TABLE dbo.LogEntries
(
    LogId       BIGINT IDENTITY(1,1) PRIMARY KEY,
    Source      SYSNAME       NOT NULL,
    IngestedAt  DATETIME2(3)  NOT NULL DEFAULT SYSUTCDATETIME(),
    RawPayload  VARCHAR(MAX)  NOT NULL   -- LOB column
);

CREATE TABLE dbo.HtmlPages
(
    PageId      INT IDENTITY(1,1) PRIMARY KEY,
    Url         NVARCHAR(2048) NOT NULL,
    Body        NVARCHAR(MAX)  NOT NULL  -- LOB column (Unicode)
);

Now generate realistically large rows. The REPLICATE(CAST(... AS varchar(max)), n) pattern is required because REPLICATE returns NULL when the result would exceed 8,000 bytes unless its first argument is a max type.

-- Synthetic web access-log payload (~252 KB in row 1, plus a separate ~586 KB row).
DECLARE @logLine VARCHAR(500) =
    '127.0.0.1 - alice [21/May/2026:10:15:32 +0000] "GET /api/orders/42 HTTP/1.1" 200 1532 ' +
    'user-agent="Mozilla/5.0" ip=10.0.0.7 email=alice@contoso.com card=4111-1111-1111-1234' + CHAR(10);

DECLARE @bigLog VARCHAR(MAX) =
    REPLICATE(CAST(@logLine AS VARCHAR(MAX)), 1500)                -- ~252 KB
    + '127.0.0.1 - mallory [21/May/2026:10:16:01 +0000] "POST /login HTTP/1.1" 500 0 ' +
      'ip=203.0.113.99 ssn=123-45-6789' + CHAR(10);

INSERT INTO dbo.LogEntries (Source, RawPayload) VALUES
    ('web-01', @bigLog),                                            -- ~252 KB
    ('web-02', REPLICATE(CAST('OK ' AS VARCHAR(MAX)), 200000));     -- ~586 KB

-- Synthetic HTML page (~775 KB / ~396,000 characters).
DECLARE @htmlChunk NVARCHAR(MAX) =
    N'<div class="row"><p>Hello <b>world</b>! Contact <a href="mailto:bob@contoso.com">bob</a>.</p></div>';

INSERT INTO dbo.HtmlPages (Url, Body) VALUES
    (N'https://contoso.example/page-1',
     N'<html><head><title>Big Page</title></head><body>'
     + REPLICATE(@htmlChunk, 4000)
     + N'</body></html>');

-- Confirm payload sizes in bytes.
SELECT LogId, Source, DATALENGTH(RawPayload) AS PayloadBytes FROM dbo.LogEntries;
SELECT PageId, DATALENGTH(Body) AS BodyBytes, LEN(Body) AS BodyChars FROM dbo.HtmlPages;

Results:

LogId Source PayloadBytes
1 web-01 258,110
2 web-02 600,000

 

PageId BodyBytes BodyChars
1 792,124 396,062

Before CU5, feeding any of these payloads into REGEXP_REPLACE, REGEXP_SUBSTR, REGEXP_MATCHES, or REGEXP_SPLIT_TO_TABLE would have failed with a type-mismatch error or required a LEFT(RawPayload, 8000)-style truncation. The same queries now run end-to-end.

2.2 REGEXP_LIKE — Filter rows by LOB content

-- Find logs that contain at least one HTTP 5xx response.
SELECT LogId, Source, DATALENGTH(RawPayload) AS PayloadBytes
FROM   dbo.LogEntries
WHERE  REGEXP_LIKE(RawPayload, '"[A-Z]+\s[^"]+\sHTTP/1\.[01]"\s5[0-9]{2}\s');

REGEXP_LIKE is a Boolean predicate: it evaluates to true when the pattern matches anywhere in the input and false otherwise. Because it returns a Boolean rather than a bit, use it directly in WHERE, CASE WHEN, IIF, or CHECK constraint contexts — do not compare it with = 1 or = 0 (the parser rejects that syntax).

NoteREGEXP_LIKE itself requires database compatibility level 170. The other scalar regex functions (REGEXP_COUNT, REGEXP_INSTR, REGEXP_REPLACE, REGEXP_SUBSTR) are available at all compatibility levels.

Results:

LogId Source PayloadBytes
1 web-01 258,110

2.3 REGEXP_COUNT — Counting at scale

-- Per-row tally of GET requests, POST requests, and 5xx responses
-- across the entire LOB payload.
SELECT LogId,
       Source,
       REGEXP_COUNT(RawPayload, '"GET\s')        AS Gets,
       REGEXP_COUNT(RawPayload, '"POST\s')       AS Posts,
       REGEXP_COUNT(RawPayload, '\s5[0-9]{2}\s') AS ServerErrors
FROM   dbo.LogEntries;

Results:

LogId Source Gets Posts ServerErrors
1 web-01 1,500 1 1
2 web-02 0 0 0

2.4 REGEXP_INSTR — Locate the first error

-- 1-based character position (or 0 if no match) of the FIRST 5xx response in each payload.
SELECT LogId,
       Source,
       REGEXP_INSTR(RawPayload, '\s5[0-9]{2}\s', 1, 1, 0) AS FirstErrorPos
FROM   dbo.LogEntries;

Parameter recap: REGEXP_INSTR(string, pattern, start, occurrence, return_option [, flags [, group ]]). A return_option of 0 returns the starting position of the match; 1 returns the position immediately after the last character of the match.

Results:

LogId Source FirstErrorPos
1 web-01 258,072
2 web-02 0

2.5 REGEXP_REPLACE — Redact sensitive data in place

PII redaction over LOB payloads was one of the most-requested CU5 scenarios. Before CU5, it required a custom chunked-replace routine; it is now a single expression.

-- Redact credit-card-shaped tokens, U.S. SSN-shaped tokens, and email addresses
-- across the entire payload.
SELECT LogId,
       REGEXP_REPLACE(
           REGEXP_REPLACE(
               REGEXP_REPLACE(
                   RawPayload,
                   '\b[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}\b',
                   '****-****-****-****'),
               '\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b',
               '***-**-****'),
           '\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b',
           '[redacted-email]'
       ) AS RedactedPayload
FROM   dbo.LogEntries;

Or strip every HTML tag from an nvarchar(max) page in a single call:

SELECT PageId,
       LEN(Body)                                     AS OriginalLen,
       LEN(REGEXP_REPLACE(Body, N'<[^>]+>', N''))    AS TextOnlyLen
FROM   dbo.HtmlPages;

Results — the ~775 KB HTML document collapses from 396,062 to 100,008 characters of plain text in a single call:

PageId OriginalLen TextOnlyLen
1 396,062 100,008

 

2.6 REGEXP_SUBSTR — Extract a single value

-- Pull the first IPv4 address out of each log payload.
SELECT LogId,
       REGEXP_SUBSTR(RawPayload,
                     '\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b',
                     1,    -- start position
                     1,    -- occurrence
                     'c',  -- flags: case-sensitive
                     0     -- group: 0 returns the whole match
                    ) AS FirstIp
FROM   dbo.LogEntries;

To return the contents of a specific capture group instead of the entire match, pass its 1-based group number as the final argument.

Results:

LogId FirstIp
1 127.0.0.1
2 NULL

2.7 REGEXP_MATCHES — Every match, set-based

This is where the combination of TVF and LOB delivers the largest productivity gain: extract every structured value from a megabyte of unstructured text in a single set-based query, with no client round-trips.

REGEXP_MATCHES returns one row per match with these columns:

Column Type Description
match_id bigint Sequence number of the match (1-based).
start_position int 1-based start index of the match.
end_position int 1-based end index of the match.
match_value same type as string_expression The entire matched substring.
substring_matches json JSON array describing each capture group, with the shape [{"value":"…","start":N,"length":N}, …].
-- Every email address in every log payload, alongside its row of origin.
SELECT  l.LogId,
        m.match_id,
        m.match_value AS EmailFound
FROM    dbo.LogEntries AS l
CROSS APPLY REGEXP_MATCHES(
        l.RawPayload,
        '\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b'
) AS m
ORDER BY l.LogId, m.match_id;

Capture groups are even more useful — you can project the parts of every log line as columns by reading from the substring_matches JSON document:

-- Parse Common-Log-Format-ish entries into ip, user, status, and bytes columns.
-- The pattern has four capture groups, accessed below as $[0] through $[3].
SELECT  l.LogId,
        m.match_id,
        JSON_VALUE(m.substring_matches, '$[0].value') AS Ip,
        JSON_VALUE(m.substring_matches, '$[1].value') AS UserName,
        JSON_VALUE(m.substring_matches, '$[2].value') AS Status,
        JSON_VALUE(m.substring_matches, '$[3].value') AS Bytes
FROM    dbo.LogEntries AS l
CROSS APPLY REGEXP_MATCHES(
        l.RawPayload,
        '^([0-9.]+)\s-\s(\S+)\s\[[^\]]+\]\s"[^"]+"\s([0-9]{3})\s([0-9]+)',
        'm'    -- multi-line: ^ and $ anchor to each line, not just the whole input
) AS m
ORDER BY l.LogId, m.match_id;

Important — Without the 'm' flag, the ^ anchor matches only at the start of the entire 250 KB input, so you would receive exactly one match for the first line. The multi-line flag is what unlocks per-line extraction.

Results (first two parsed rows):

LogId match_id Ip UserName Status Bytes
1 1 127.0.0.1 alice 200 1532
1 2 127.0.0.1 alice 200 1532

2.8 REGEXP_SPLIT_TO_TABLE — Shred a LOB into rows

-- Project the entire log payload as one row per non-empty line.
SELECT  l.LogId,
        s.ordinal AS [LineNo],
        s.value   AS LineText
FROM    dbo.LogEntries AS l
CROSS APPLY REGEXP_SPLIT_TO_TABLE(l.RawPayload, '\r?\n') AS s
WHERE   l.LogId = 1
  AND   s.value <> ''
ORDER BY s.ordinal;

You now have a tabular projection of a multi-megabyte text blob without leaving the engine. You can feed it into a CTE, aggregate it, join it to dimension tables, or materialize it into a staging table — all set-based.

Results (first three rows):

LogId ordinal LineText (first 80 chars)
1 1 127.0.0.1 - alice [21/May/2026:10:15:32 +0000] "GET /api/orders/42 HTTP/1.1" 200
1 2 127.0.0.1 - alice [21/May/2026:10:15:32 +0000] "GET /api/orders/42 HTTP/1.1" 200
1 3 127.0.0.1 - alice [21/May/2026:10:15:32 +0000] "GET /api/orders/42 HTTP/1.1" 200

Tip — composing LOB regex pipelinesCROSS APPLY(and OUTER APPLY when you need to preserve rows that produce no matches) is the primary composition primitive. You can stack REGEXP_SPLIT_TO_TABLE (lines) feeding REGEXP_MATCHES (fields per line) feeding ordinary aggregates, all within a single query plan.

2.9 The 2 MB ceiling — strategies for larger inputs

The 2 MB limit applies to the input string of a single regex call. If the value passed to a regex function exceeds 2 MB, the call raises an error (error number 19311, severity 16) rather than silently truncating. That is the intended behavior — silent truncation would hide correctness bugs.

In practice, 2 MB is a generous ceiling: a single log file or HTML document of that size is already unusual, and most real-world LOB data sit comfortably below it. When individual values do exceed the limit, the most reliable approach is to split them into smaller logical units before they land in the column you want to query — for example, by writing one log line, one document section, or one record per row at ingestion time. Because every regex function (including the two TVFs) shares the same 2 MB ceiling, sharding at query time is not generally feasible; doing it at the load path keeps every regex call well under the limit and avoids per-query workarounds.

Bytes vs. characters — The 2 MB limit is measured in bytes, not characters, and the byte count is based on the UTF-8 encoding of the input regardless of the column’s declared type. ASCII characters take 1 byte each, so plain ASCII text can run to roughly two million characters; non-ASCII characters take 2–4 bytes in UTF-8, so fewer characters fit. Keep in mind that DATALENGTH() reports storage size in the column’s own encoding, which may differ from the UTF-8 byte count used by the limit, and LEN() (which counts characters) is best avoided as a sizing check here.

To measure the UTF-8 byte length that the limit actually checks, cast the value to varchar(max) under a UTF-8 collation and take its DATALENGTH:
SELECT DATALENGTH(
           CONVERT(varchar(max),
                   Body COLLATE Latin1_General_100_CI_AS_SC_UTF8)
       ) AS Utf8Bytes
FROM   dbo.HtmlPages;

Anything above 2 * 1024 * 1024 (2,097,152) bytes will be rejected by a regex call on that value.

Have a scenario that genuinely needs more than 2 MB? If your workload requires regex evaluation on individual values larger than the current 2 MB ceiling, we would like to hear about it. Please share the details — data shape, payload size, pattern, and business need — on the Azure SQL feedback portal. Customer feedback directly informs how we prioritize future limit changes.

2.10 Cleanup

DROP TABLE IF EXISTS dbo.LogEntries;
DROP TABLE IF EXISTS dbo.HtmlPages;

3. Summary

What changed in CU5

  • Before CU5 — LOB inputs were accepted on REGEXP_LIKE, REGEXP_COUNT, and REGEXP_INSTR. The remaining functions — REGEXP_REPLACE, REGEXP_SUBSTR, and the two TVFs (REGEXP_MATCHES, REGEXP_SPLIT_TO_TABLE) — required non-LOB string inputs, which often meant truncating with LEFT(..., 8000) or chunking in the application tier.
  • After CU5 (and already in Azure SQL) — All seven functions accept varchar(max) and nvarchar(max) inputs of up to 2 MB. The pattern remains capped at 8,000 bytes.

Quick reference

Function Returns LOB input (CU5) Common use case
REGEXP_LIKE Boolean (predicate) Yes Filter rows in WHERE / CASE / CHECK predicates
REGEXP_COUNT int Yes Count occurrences of a pattern
REGEXP_INSTR int Yes Position of the nth match
REGEXP_REPLACE string Yes Redact, cleanse, or normalize text
REGEXP_SUBSTR string Yes Extract a single value
REGEXP_MATCHES (TVF) (match_id, start_position, end_position, match_value, substring_matches) Yes Extract every match plus capture groups (via JSON), set-based
REGEXP_SPLIT_TO_TABLE (TVF) (value, ordinal) Yes Split a LOB into rows by a regex delimiter

Further reading

Closing thought. Native regex was already a significant quality-of-life improvement when it became generally available. CU5 completes the picture: every function, every input size up to 2 MB, every shape — scalar or table-valued. The next time you are tempted to export a column out of the database in order to grep it, try one of the seven regex functions first.

Happy matching. 🧠

Author

Abhiman Tiwari
Senior Product Manager

Abhiman Tiwari is a Senior Product Manager on the Azure SQL team at Microsoft, primarily focusing on developer experience with Azure SQL. In his previous roles, he has worked as a web developer/ tech-lead, developed several software applications/ products majorly in .NET and related technologies, and done a lot of .NET debugging, reverse engineering, and web application performance tuning.

0 comments