{"id":7135,"date":"2026-05-22T13:28:15","date_gmt":"2026-05-22T20:28:15","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/azure-sql\/?p=7135"},"modified":"2026-05-22T13:28:15","modified_gmt":"2026-05-22T20:28:15","slug":"regex-support-for-lob-types-in-t-sql-available-in-azure-sql-sql-server-2025","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/azure-sql\/regex-support-for-lob-types-in-t-sql-available-in-azure-sql-sql-server-2025\/","title":{"rendered":"Regex support for LOB types in T-SQL\u2014available in Azure SQL &amp; SQL Server 2025"},"content":{"rendered":"<p><strong>At a glance<\/strong> \u2014 Native regular expression (regex) functions in T-SQL now accept <code>varchar(max)<\/code> and <code>nvarchar(max)<\/code> inputs of up to <strong>2 MB<\/strong> across <strong>all seven<\/strong> regex functions, including the two table-valued functions (<code>REGEXP_MATCHES<\/code> and <code>REGEXP_SPLIT_TO_TABLE<\/code>). This capability ships in <a href=\"https:\/\/learn.microsoft.com\/troubleshoot\/sql\/releases\/sqlserver-2025\/cumulativeupdate5\"><strong>SQL Server 2025 CU5<\/strong><\/a> and is already available in <strong>Azure SQL Database, SQL Database in Fabric<\/strong>\u00a0and <strong>Azure SQL Managed Instance<\/strong> configured with the Always-up-to-date update policy. It will reach Managed Instances on the SQL Server 2025 update policy as part of the CU5 rollout. You no longer need to split log files, HTML documents, or large JSON payloads into 8,000-byte chunks just to run a pattern match.<\/p>\n<h2>1. Introduction<\/h2>\n<p>Regular expressions have long been a cornerstone of modern data processing \u2014 used for validation, parsing, transformation, and extracting structured insights from unstructured text. With <strong>SQL Server 2025<\/strong> and <strong>Azure SQL<\/strong>, regex is now a first-class T-SQL capability, removing the historical need to rely on SQLCLR functions or application-tier processing.<\/p>\n<p>While the initial release made native regex broadly available, large-object (LOB) inputs were not yet supported on every function. CU5 closes that gap.<\/p>\n<p>Under the hood, T-SQL regex implements <strong>POSIX <\/strong>Extended Regular Expression<strong> (ERE)<\/strong> semantics, augmented by a curated set of Perl-style features, and is powered by the RE2 engine. RE2 is a <em>linear-time<\/em>, non-backtracking implementation, which means it is not susceptible to catastrophic backtracking (a class of denial-of-service issue commonly known as ReDoS). That guarantee becomes far more important when the input is a 1.8 MB log blob than when it is an 8,000-byte string.<\/p>\n<h3>Release timeline<\/h3>\n<table>\n<thead>\n<tr>\n<td>Milestone<\/td>\n<td>What shipped<\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Ignite 2025 \u2014 General Availability<\/strong><\/td>\n<td>Regex went GA in <strong>SQL Server 2025<\/strong> and <strong>Azure SQL<\/strong>. LOB inputs were initially supported only on <code>REGEXP_LIKE<\/code>, <code>REGEXP_COUNT<\/code>, and <code>REGEXP_INSTR<\/code>. LOB support on <code>REGEXP_REPLACE<\/code> and <code>REGEXP_SUBSTR<\/code> was deferred, and the two table-valued functions (TVFs) accepted only non-LOB string types.<\/td>\n<\/tr>\n<tr>\n<td><strong>Azure SQL (post-GA service updates)<\/strong><\/td>\n<td>LOB inputs enabled across all seven functions.<\/td>\n<\/tr>\n<tr>\n<td><a href=\"https:\/\/learn.microsoft.com\/troubleshoot\/sql\/releases\/sqlserver-2025\/cumulativeupdate5\"><strong>SQL Server 2025 CU5<\/strong><\/a><\/td>\n<td>LOB inputs up to <strong>2 MB<\/strong> enabled on <strong>all seven<\/strong> functions in the SQL Server.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>What&#8217;s new in CU5<\/h3>\n<ul>\n<li><code>varchar(max)<\/code> and <code>nvarchar(max)<\/code> inputs are accepted on <strong>every<\/strong> regex function.<\/li>\n<li>The input string is capped at <strong>2 MB per function call<\/strong>. The pattern is still capped at <strong>8,000 bytes<\/strong>, which is far larger than any maintainable regular expression should ever need.<\/li>\n<li>Behavior is consistent between Azure SQL and SQL Server, so code you write today is fully portable.<\/li>\n<\/ul>\n<p><strong>Note<\/strong> \u2014 The 2 MB limit applies to the <strong>input passed to a single function call<\/strong>, not to the column or row. A single value in a <code>varchar(max)<\/code> column can still store up to 2 GB; the constraint is that no single regex evaluation can consume more than 2 MB of that value.<\/p>\n<h3>Prerequisites<\/h3>\n<ul>\n<li><strong>SQL Server 2025 CU5<\/strong> or later, <strong>or<\/strong> Azure SQL Database, <strong>or<\/strong> SQL Database in Fabric <strong>or<\/strong> Azure SQL Managed Instance configured with the SQL Server 2025 \/ Always-up-to-date <a href=\"https:\/\/learn.microsoft.com\/azure\/azure-sql\/managed-instance\/update-policy\">update policy<\/a>.<\/li>\n<li>The two table-valued functions (<code>REGEXP_MATCHES<\/code> and <code>REGEXP_SPLIT_TO_TABLE<\/code>) require <strong>database compatibility level 170<\/strong>, unless the database-scoped configuration <code>ALLOW_BUILTIN_TVF_IN_ALL_COMPAT_LEVELS<\/code> (preview) is enabled.<\/li>\n<\/ul>\n<p><strong>Note<\/strong> \u2014 On Azure SQL Managed Instance (Always-up-to-date), this capability is rolling out region by region. It is already live in regions where the rollout has completed and will light up in the remaining regions as the deployment finishes. Instances on the <strong>SQL Server 2025<\/strong> update policy will receive it as part of the CU5 rollout \u2014 coming soon.<\/p>\n<p>Verify compatibility level (170 required for the TVFs) &#8211;<\/p>\n<pre class=\"prettyprint language-sql\"><code class=\"language-sql\">SELECT name, compatibility_level\r\nFROM sys.databases\r\nWHERE name = DB_NAME();\r\n\r\n-- If necessary:\r\n-- ALTER DATABASE [&lt;your-database&gt;] SET COMPATIBILITY_LEVEL = 170;<\/code><\/pre>\n<h2>2. Working with LOB Data<\/h2>\n<p>This section demonstrates the CU5 capabilities against a realistic LOB data. We build a <code>LogEntries<\/code> table whose <code>RawPayload<\/code> column holds multi-KB to multi-MB chunks of web server and application output, plus an <code>HtmlPages<\/code> table for HTML cleansing examples.<\/p>\n<h3>2.1 Create the sample schema and data<\/h3>\n<pre class=\"prettyprint language-sql\"><code class=\"language-sql\">IF OBJECT_ID('dbo.LogEntries', 'U') IS NOT NULL DROP TABLE dbo.LogEntries;\r\nIF OBJECT_ID('dbo.HtmlPages',  'U') IS NOT NULL DROP TABLE dbo.HtmlPages;\r\n\r\nCREATE TABLE dbo.LogEntries\r\n(\r\n    LogId       BIGINT IDENTITY(1,1) PRIMARY KEY,\r\n    Source      SYSNAME       NOT NULL,\r\n    IngestedAt  DATETIME2(3)  NOT NULL DEFAULT SYSUTCDATETIME(),\r\n    RawPayload  VARCHAR(MAX)  NOT NULL   -- LOB column\r\n);\r\n\r\nCREATE TABLE dbo.HtmlPages\r\n(\r\n    PageId      INT IDENTITY(1,1) PRIMARY KEY,\r\n    Url         NVARCHAR(2048) NOT NULL,\r\n    Body        NVARCHAR(MAX)  NOT NULL  -- LOB column (Unicode)\r\n);<\/code><\/pre>\n<p>Now generate realistically large rows. The <code>REPLICATE(CAST(... AS varchar(max)), n)<\/code> pattern is required because <code>REPLICATE<\/code> returns <code>NULL<\/code> when the result would exceed 8,000 bytes unless its first argument is a max type.<\/p>\n<pre class=\"prettyprint language-sql\"><code class=\"language-sql\">-- Synthetic web access-log payload (~252 KB in row 1, plus a separate ~586 KB row).\r\nDECLARE @logLine VARCHAR(500) =\r\n    '127.0.0.1 - alice [21\/May\/2026:10:15:32 +0000] \"GET \/api\/orders\/42 HTTP\/1.1\" 200 1532 ' +\r\n    'user-agent=\"Mozilla\/5.0\" ip=10.0.0.7 email=alice@contoso.com card=4111-1111-1111-1234' + CHAR(10);\r\n\r\nDECLARE @bigLog VARCHAR(MAX) =\r\n    REPLICATE(CAST(@logLine AS VARCHAR(MAX)), 1500)                -- ~252 KB\r\n    + '127.0.0.1 - mallory [21\/May\/2026:10:16:01 +0000] \"POST \/login HTTP\/1.1\" 500 0 ' +\r\n      'ip=203.0.113.99 ssn=123-45-6789' + CHAR(10);\r\n\r\nINSERT INTO dbo.LogEntries (Source, RawPayload) VALUES\r\n    ('web-01', @bigLog),                                            -- ~252 KB\r\n    ('web-02', REPLICATE(CAST('OK ' AS VARCHAR(MAX)), 200000));     -- ~586 KB\r\n\r\n-- Synthetic HTML page (~775 KB \/ ~396,000 characters).\r\nDECLARE @htmlChunk NVARCHAR(MAX) =\r\n    N'&lt;div class=\"row\"&gt;&lt;p&gt;Hello &lt;b&gt;world&lt;\/b&gt;! Contact &lt;a href=\"mailto:bob@contoso.com\"&gt;bob&lt;\/a&gt;.&lt;\/p&gt;&lt;\/div&gt;';\r\n\r\nINSERT INTO dbo.HtmlPages (Url, Body) VALUES\r\n    (N'https:\/\/contoso.example\/page-1',\r\n     N'&lt;html&gt;&lt;head&gt;&lt;title&gt;Big Page&lt;\/title&gt;&lt;\/head&gt;&lt;body&gt;'\r\n     + REPLICATE(@htmlChunk, 4000)\r\n     + N'&lt;\/body&gt;&lt;\/html&gt;');\r\n\r\n-- Confirm payload sizes in bytes.\r\nSELECT LogId, Source, DATALENGTH(RawPayload) AS PayloadBytes FROM dbo.LogEntries;\r\nSELECT PageId, DATALENGTH(Body) AS BodyBytes, LEN(Body) AS BodyChars FROM dbo.HtmlPages;<\/code><\/pre>\n<p><strong>Results:<\/strong><\/p>\n<table>\n<thead>\n<tr>\n<td>LogId<\/td>\n<td>Source<\/td>\n<td>PayloadBytes<\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>1<\/td>\n<td>web-01<\/td>\n<td>258,110<\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>web-02<\/td>\n<td>600,000<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<table>\n<thead>\n<tr>\n<td>PageId<\/td>\n<td>BodyBytes<\/td>\n<td>BodyChars<\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>1<\/td>\n<td>792,124<\/td>\n<td>396,062<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Before CU5, feeding any of these payloads into <code>REGEXP_REPLACE<\/code>, <code>REGEXP_SUBSTR<\/code>, <code>REGEXP_MATCHES<\/code>, or <code>REGEXP_SPLIT_TO_TABLE<\/code> would have failed with a type-mismatch error or required a <code>LEFT(RawPayload, 8000)<\/code>-style truncation. The same queries now run end-to-end.<\/p>\n<h3>2.2 REGEXP_LIKE \u2014 Filter rows by LOB content<\/h3>\n<pre class=\"prettyprint language-sql\"><code class=\"language-sql\">-- Find logs that contain at least one HTTP 5xx response.\r\nSELECT LogId, Source, DATALENGTH(RawPayload) AS PayloadBytes\r\nFROM   dbo.LogEntries\r\nWHERE  REGEXP_LIKE(RawPayload, '\"[A-Z]+\\s[^\"]+\\sHTTP\/1\\.[01]\"\\s5[0-9]{2}\\s');<\/code><\/pre>\n<p><code>REGEXP_LIKE<\/code> is a Boolean predicate: it evaluates to <code>true<\/code> when the pattern matches anywhere in the input and <code>false<\/code> otherwise. Because it returns a Boolean rather than a <code>bit<\/code>, use it directly in <code>WHERE<\/code>, <code>CASE WHEN<\/code>, <code>IIF<\/code>, or <code>CHECK<\/code> constraint contexts \u2014 do <strong>not<\/strong> compare it with = 1 or = 0 (the parser rejects that syntax).<\/p>\n<p><strong>Note<\/strong> \u2014 <code>REGEXP_LIKE<\/code> itself requires database compatibility level <strong>170<\/strong>. The other scalar regex functions (<code>REGEXP_COUNT<\/code>, <code>REGEXP_INSTR<\/code>, <code>REGEXP_REPLACE<\/code>, <code>REGEXP_SUBSTR<\/code>) are available at all compatibility levels.<\/p>\n<p><strong>Results:<\/strong><\/p>\n<table>\n<thead>\n<tr>\n<td>LogId<\/td>\n<td>Source<\/td>\n<td>PayloadBytes<\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>1<\/td>\n<td>web-01<\/td>\n<td>258,110<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>2.3 REGEXP_COUNT \u2014 Counting at scale<\/h3>\n<pre class=\"prettyprint language-sql\"><code class=\"language-sql\">-- Per-row tally of GET requests, POST requests, and 5xx responses\r\n-- across the entire LOB payload.\r\nSELECT LogId,\r\n       Source,\r\n       REGEXP_COUNT(RawPayload, '\"GET\\s')        AS Gets,\r\n       REGEXP_COUNT(RawPayload, '\"POST\\s')       AS Posts,\r\n       REGEXP_COUNT(RawPayload, '\\s5[0-9]{2}\\s') AS ServerErrors\r\nFROM   dbo.LogEntries;<\/code><\/pre>\n<p><strong>Results:<\/strong><\/p>\n<table>\n<thead>\n<tr>\n<td>LogId<\/td>\n<td>Source<\/td>\n<td>Gets<\/td>\n<td>Posts<\/td>\n<td>ServerErrors<\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>1<\/td>\n<td>web-01<\/td>\n<td>1,500<\/td>\n<td>1<\/td>\n<td>1<\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>web-02<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>2.4 REGEXP_INSTR \u2014 Locate the first error<\/h3>\n<pre class=\"prettyprint language-sql\"><code class=\"language-sql\">-- 1-based character position (or 0 if no match) of the FIRST 5xx response in each payload.\r\nSELECT LogId,\r\n       Source,\r\n       REGEXP_INSTR(RawPayload, '\\s5[0-9]{2}\\s', 1, 1, 0) AS FirstErrorPos\r\nFROM   dbo.LogEntries;<\/code><\/pre>\n<p>Parameter recap: <code>REGEXP_INSTR(string, pattern, start, occurrence, return_option [, flags [, group ]])<\/code>. A <code>return_option<\/code> of <code>0<\/code> returns the starting position of the match; <code>1<\/code> returns the position immediately after the last character of the match.<\/p>\n<p><strong>Results:<\/strong><\/p>\n<table>\n<thead>\n<tr>\n<td>LogId<\/td>\n<td>Source<\/td>\n<td>FirstErrorPos<\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>1<\/td>\n<td>web-01<\/td>\n<td>258,072<\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>web-02<\/td>\n<td>0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>2.5 REGEXP_REPLACE \u2014 Redact sensitive data in place<\/h3>\n<p>PII redaction over LOB payloads was one of the most-requested CU5 scenarios. Before CU5, it required a custom chunked-replace routine; it is now a single expression.<\/p>\n<pre class=\"prettyprint language-sql\"><code class=\"language-sql\">-- Redact credit-card-shaped tokens, U.S. SSN-shaped tokens, and email addresses\r\n-- across the entire payload.\r\nSELECT LogId,\r\n       REGEXP_REPLACE(\r\n           REGEXP_REPLACE(\r\n               REGEXP_REPLACE(\r\n                   RawPayload,\r\n                   '\\b[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}\\b',\r\n                   '****-****-****-****'),\r\n               '\\b[0-9]{3}-[0-9]{2}-[0-9]{4}\\b',\r\n               '***-**-****'),\r\n           '\\b[A-Za-z0-9._%+\\-]+@[A-Za-z0-9.\\-]+\\.[A-Za-z]{2,}\\b',\r\n           '[redacted-email]'\r\n       ) AS RedactedPayload\r\nFROM   dbo.LogEntries;<\/code><\/pre>\n<p>Or strip every HTML tag from an <code>nvarchar(max)<\/code> page in a single call:<\/p>\n<pre class=\"prettyprint language-sql\"><code class=\"language-sql\">SELECT PageId,\r\n       LEN(Body)                                     AS OriginalLen,\r\n       LEN(REGEXP_REPLACE(Body, N'&lt;[^&gt;]+&gt;', N''))    AS TextOnlyLen\r\nFROM   dbo.HtmlPages;<\/code><\/pre>\n<p><strong>Results<\/strong> \u2014 the ~775 KB HTML document collapses from 396,062 to 100,008 characters of plain text in a single call:<\/p>\n<table>\n<thead>\n<tr>\n<td>PageId<\/td>\n<td>OriginalLen<\/td>\n<td>TextOnlyLen<\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>1<\/td>\n<td>396,062<\/td>\n<td>100,008<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h3>2.6 REGEXP_SUBSTR \u2014 Extract a single value<\/h3>\n<pre class=\"prettyprint language-sql\"><code class=\"language-sql\">-- Pull the first IPv4 address out of each log payload.\r\nSELECT LogId,\r\n       REGEXP_SUBSTR(RawPayload,\r\n                     '\\b(?:[0-9]{1,3}\\.){3}[0-9]{1,3}\\b',\r\n                     1,    -- start position\r\n                     1,    -- occurrence\r\n                     'c',  -- flags: case-sensitive\r\n                     0     -- group: 0 returns the whole match\r\n                    ) AS FirstIp\r\nFROM   dbo.LogEntries;<\/code><\/pre>\n<p>To return the contents of a specific capture group instead of the entire match, pass its 1-based group number as the final argument.<\/p>\n<p><strong>Results:<\/strong><\/p>\n<table>\n<thead>\n<tr>\n<td>LogId<\/td>\n<td>FirstIp<\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>1<\/td>\n<td>127.0.0.1<\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td><code>NULL<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>2.7 REGEXP_MATCHES \u2014 Every match, set-based<\/h3>\n<p>This is where the combination of TVF and LOB delivers the largest productivity gain: extract every structured value from a megabyte of unstructured text in a single set-based query, with no client round-trips.<\/p>\n<p><code>REGEXP_MATCHES<\/code> returns one row per match with these columns:<\/p>\n<table>\n<thead>\n<tr>\n<td>Column<\/td>\n<td>Type<\/td>\n<td>Description<\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>match_id<\/code><\/td>\n<td><code>bigint<\/code><\/td>\n<td>Sequence number of the match (1-based).<\/td>\n<\/tr>\n<tr>\n<td><code>start_position<\/code><\/td>\n<td><code>int<\/code><\/td>\n<td>1-based start index of the match.<\/td>\n<\/tr>\n<tr>\n<td><code>end_position<\/code><\/td>\n<td><code>int<\/code><\/td>\n<td>1-based end index of the match.<\/td>\n<\/tr>\n<tr>\n<td><code>match_value<\/code><\/td>\n<td>same type as <code>string_expression<\/code><\/td>\n<td>The entire matched substring.<\/td>\n<\/tr>\n<tr>\n<td><code>substring_matches<\/code><\/td>\n<td><code>json<\/code><\/td>\n<td>JSON array describing each capture group, with the shape <code>[{\"value\":\"\u2026\",\"start\":N,\"length\":N}, \u2026]<\/code>.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<pre class=\"prettyprint language-sql\"><code class=\"language-sql\">-- Every email address in every log payload, alongside its row of origin.\r\nSELECT  l.LogId,\r\n        m.match_id,\r\n        m.match_value AS EmailFound\r\nFROM    dbo.LogEntries AS l\r\nCROSS APPLY REGEXP_MATCHES(\r\n        l.RawPayload,\r\n        '\\b[A-Za-z0-9._%+\\-]+@[A-Za-z0-9.\\-]+\\.[A-Za-z]{2,}\\b'\r\n) AS m\r\nORDER BY l.LogId, m.match_id;<\/code><\/pre>\n<p>Capture groups are even more useful \u2014 you can project the parts of every log line as columns by reading from the <code>substring_matches<\/code> JSON document:<\/p>\n<pre class=\"prettyprint language-sql\"><code class=\"language-sql\">-- Parse Common-Log-Format-ish entries into ip, user, status, and bytes columns.\r\n-- The pattern has four capture groups, accessed below as $[0] through $[3].\r\nSELECT  l.LogId,\r\n        m.match_id,\r\n        JSON_VALUE(m.substring_matches, '$[0].value') AS Ip,\r\n        JSON_VALUE(m.substring_matches, '$[1].value') AS UserName,\r\n        JSON_VALUE(m.substring_matches, '$[2].value') AS Status,\r\n        JSON_VALUE(m.substring_matches, '$[3].value') AS Bytes\r\nFROM    dbo.LogEntries AS l\r\nCROSS APPLY REGEXP_MATCHES(\r\n        l.RawPayload,\r\n        '^([0-9.]+)\\s-\\s(\\S+)\\s\\[[^\\]]+\\]\\s\"[^\"]+\"\\s([0-9]{3})\\s([0-9]+)',\r\n        'm'    -- multi-line: ^ and $ anchor to each line, not just the whole input\r\n) AS m\r\nORDER BY l.LogId, m.match_id;<\/code><\/pre>\n<p><strong>Important<\/strong> \u2014 Without the <code>'m'<\/code> flag, the <code>^<\/code> anchor matches only at the start of the entire 250 KB input, so you would receive exactly one match for the first line. The multi-line flag is what unlocks per-line extraction.<\/p>\n<p><strong>Results<\/strong> (first two parsed rows):<\/p>\n<table>\n<thead>\n<tr>\n<td>LogId<\/td>\n<td>match_id<\/td>\n<td>Ip<\/td>\n<td>UserName<\/td>\n<td>Status<\/td>\n<td>Bytes<\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>1<\/td>\n<td>1<\/td>\n<td>127.0.0.1<\/td>\n<td>alice<\/td>\n<td>200<\/td>\n<td>1532<\/td>\n<\/tr>\n<tr>\n<td>1<\/td>\n<td>2<\/td>\n<td>127.0.0.1<\/td>\n<td>alice<\/td>\n<td>200<\/td>\n<td>1532<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>2.8 REGEXP_SPLIT_TO_TABLE \u2014 Shred a LOB into rows<\/h3>\n<pre class=\"prettyprint language-sql\"><code class=\"language-sql\">-- Project the entire log payload as one row per non-empty line.\r\nSELECT  l.LogId,\r\n        s.ordinal AS [LineNo],\r\n        s.value   AS LineText\r\nFROM    dbo.LogEntries AS l\r\nCROSS APPLY REGEXP_SPLIT_TO_TABLE(l.RawPayload, '\\r?\\n') AS s\r\nWHERE   l.LogId = 1\r\n  AND   s.value &lt;&gt; ''\r\nORDER BY s.ordinal;<\/code><\/pre>\n<p>You now have a tabular projection of a multi-megabyte text blob without leaving the engine. You can feed it into a CTE, aggregate it, join it to dimension tables, or materialize it into a staging table \u2014 all set-based.<\/p>\n<p><strong>Results<\/strong> (first three rows):<\/p>\n<table>\n<thead>\n<tr>\n<td>LogId<\/td>\n<td>ordinal<\/td>\n<td>LineText (first 80 chars)<\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>1<\/td>\n<td>1<\/td>\n<td><code>127.0.0.1 - alice [21\/May\/2026:10:15:32 +0000] \"GET \/api\/orders\/42 HTTP\/1.1\" 200<\/code><\/td>\n<\/tr>\n<tr>\n<td>1<\/td>\n<td>2<\/td>\n<td><code>127.0.0.1 - alice [21\/May\/2026:10:15:32 +0000] \"GET \/api\/orders\/42 HTTP\/1.1\" 200<\/code><\/td>\n<\/tr>\n<tr>\n<td>1<\/td>\n<td>3<\/td>\n<td><code>127.0.0.1 - alice [21\/May\/2026:10:15:32 +0000] \"GET \/api\/orders\/42 HTTP\/1.1\" 200<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><strong>Tip \u2014 composing LOB regex pipelines<\/strong> \u2014 <code>CROSS APPLY<\/code>(and <code>OUTER APPLY<\/code> when you need to preserve rows that produce no matches)\u00a0is the primary composition primitive. You can stack <code>REGEXP_SPLIT_TO_TABLE<\/code> (lines) feeding <code>REGEXP_MATCHES<\/code> (fields per line) feeding ordinary aggregates, all within a single query plan.<\/p>\n<h3>2.9 The 2 MB ceiling \u2014 strategies for larger inputs<\/h3>\n<p>The 2 MB limit applies to the <strong>input string of a single regex call<\/strong>. If the value passed to a regex function exceeds 2 MB, the call raises an error (<strong>error number <code>19311<\/code><\/strong>, severity 16) rather than silently truncating. That is the intended behavior \u2014 silent truncation would hide correctness bugs.<\/p>\n<p>In practice, 2 MB is a generous ceiling: a single log file or HTML document of that size is already unusual, and most real-world LOB data sit comfortably below it. When individual values do exceed the limit, the most reliable approach is to <strong>split them into smaller logical units before they land in the column you want to query<\/strong> \u2014 for example, by writing one log line, one document section, or one record per row at ingestion time. Because every regex function (including the two TVFs) shares the same 2 MB ceiling, sharding at query time is not generally feasible; doing it at the load path keeps every regex call well under the limit and avoids per-query workarounds.<\/p>\n<p><strong>Bytes vs. characters<\/strong> \u2014 The 2 MB limit is measured in <strong>bytes<\/strong>, not characters, and the byte count is based on the UTF-8 encoding of the input regardless of the column&#8217;s declared type. ASCII characters take 1 byte each, so plain ASCII text can run to roughly two million characters; non-ASCII characters take 2\u20134 bytes in UTF-8, so fewer characters fit. Keep in mind that <code>DATALENGTH()<\/code> reports storage size in the column&#8217;s own encoding, which may differ from the UTF-8 byte count used by the limit, and <code>LEN()<\/code> (which counts characters) is best avoided as a sizing check here.<\/p>\n<div>\n<div>To measure the UTF-8 byte length that the limit actually checks, cast the value to <code>varchar(max)<\/code> under a UTF-8 collation and take its <code>DATALENGTH<\/code>:<\/div>\n<\/div>\n<div>\n<pre class=\"prettyprint language-sql\"><code class=\"language-sql\">SELECT DATALENGTH(\r\n           CONVERT(varchar(max),\r\n                   Body COLLATE Latin1_General_100_CI_AS_SC_UTF8)\r\n       ) AS Utf8Bytes\r\nFROM   dbo.HtmlPages;<\/code><\/pre>\n<p>Anything above\u00a0<code class=\" prettyprinted\"><span class=\"lit\">2<\/span><span class=\"pln\">\u00a0<\/span><span class=\"pun\">*<\/span><span class=\"pln\">\u00a0<\/span><span class=\"lit\">1024<\/span><span class=\"pln\">\u00a0<\/span><span class=\"pun\">*<\/span><span class=\"pln\">\u00a0<\/span><span class=\"lit\">1024<\/span><\/code>\u00a0(2,097,152) bytes will be rejected by a regex call on that value.<\/p>\n<\/div>\n<div><\/div>\n<p><strong>Have a scenario that genuinely needs more than 2 MB?<\/strong> If your workload requires regex evaluation on individual values larger than the current 2 MB ceiling, we would like to hear about it. Please share the details \u2014 data shape, payload size, pattern, and business need \u2014 on the <a href=\"https:\/\/feedback.azure.com\/d365community\/forum\/04fe6ee0-3b25-ec11-b6e6-000d3a4f0da0\">Azure SQL feedback portal<\/a>. Customer feedback directly informs how we prioritize future limit changes.<\/p>\n<h3>2.10 Cleanup<\/h3>\n<pre class=\"prettyprint language-sql\"><code class=\"language-sql\">DROP TABLE IF EXISTS dbo.LogEntries;\r\nDROP TABLE IF EXISTS dbo.HtmlPages;<\/code><\/pre>\n<h2>3. Summary<\/h2>\n<h3>What changed in CU5<\/h3>\n<ul>\n<li><strong>Before CU5<\/strong> \u2014 LOB inputs were accepted on <code>REGEXP_LIKE<\/code>, <code>REGEXP_COUNT<\/code>, and <code>REGEXP_INSTR<\/code>. The remaining functions \u2014 <code>REGEXP_REPLACE<\/code>, <code>REGEXP_SUBSTR<\/code>, and the two TVFs (<code>REGEXP_MATCHES<\/code>, <code>REGEXP_SPLIT_TO_TABLE<\/code>) \u2014 required non-LOB string inputs, which often meant truncating with <code>LEFT(..., 8000)<\/code> or chunking in the application tier.<\/li>\n<li><strong>After CU5 (and already in Azure SQL)<\/strong> \u2014 All seven functions accept <code>varchar(max)<\/code> and <code>nvarchar(max)<\/code> inputs of up to <strong>2 MB<\/strong>. The pattern remains capped at 8,000 bytes.<\/li>\n<\/ul>\n<h3>Quick reference<\/h3>\n<table>\n<thead>\n<tr>\n<td>Function<\/td>\n<td>Returns<\/td>\n<td>LOB input (CU5)<\/td>\n<td>Common use case<\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>REGEXP_LIKE<\/code><\/td>\n<td>Boolean (predicate)<\/td>\n<td>Yes<\/td>\n<td>Filter rows in <code>WHERE<\/code> \/ <code>CASE<\/code> \/ <code>CHECK<\/code> predicates<\/td>\n<\/tr>\n<tr>\n<td><code>REGEXP_COUNT<\/code><\/td>\n<td><code>int<\/code><\/td>\n<td>Yes<\/td>\n<td>Count occurrences of a pattern<\/td>\n<\/tr>\n<tr>\n<td><code>REGEXP_INSTR<\/code><\/td>\n<td><code>int<\/code><\/td>\n<td>Yes<\/td>\n<td>Position of the <em>n<\/em>th match<\/td>\n<\/tr>\n<tr>\n<td><code>REGEXP_REPLACE<\/code><\/td>\n<td>string<\/td>\n<td>Yes<\/td>\n<td>Redact, cleanse, or normalize text<\/td>\n<\/tr>\n<tr>\n<td><code>REGEXP_SUBSTR<\/code><\/td>\n<td>string<\/td>\n<td>Yes<\/td>\n<td>Extract a single value<\/td>\n<\/tr>\n<tr>\n<td><code>REGEXP_MATCHES<\/code> (TVF)<\/td>\n<td><code>(match_id, start_position, end_position, match_value, substring_matches)<\/code><\/td>\n<td>Yes<\/td>\n<td>Extract every match plus capture groups (via JSON), set-based<\/td>\n<\/tr>\n<tr>\n<td><code>REGEXP_SPLIT_TO_TABLE<\/code> (TVF)<\/td>\n<td><code>(value, ordinal)<\/code><\/td>\n<td>Yes<\/td>\n<td>Split a LOB into rows by a regex delimiter<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>Further reading<\/h3>\n<ul>\n<li>Official documentation: <a href=\"https:\/\/learn.microsoft.com\/sql\/t-sql\/functions\/regexp-like-transact-sql\">REGEXP_LIKE<\/a>, <a href=\"https:\/\/learn.microsoft.com\/sql\/t-sql\/functions\/regexp-count-transact-sql\">REGEXP_COUNT<\/a>, <a href=\"https:\/\/learn.microsoft.com\/sql\/t-sql\/functions\/regexp-instr-transact-sql\">REGEXP_INSTR<\/a>, <a href=\"https:\/\/learn.microsoft.com\/sql\/t-sql\/functions\/regexp-replace-transact-sql\">REGEXP_REPLACE<\/a>, <a href=\"https:\/\/learn.microsoft.com\/sql\/t-sql\/functions\/regexp-substr-transact-sql\">REGEXP_SUBSTR<\/a>, <a href=\"https:\/\/learn.microsoft.com\/sql\/t-sql\/functions\/regexp-matches-transact-sql\">REGEXP_MATCHES<\/a>, <a href=\"https:\/\/learn.microsoft.com\/sql\/t-sql\/functions\/regexp-split-to-table-transact-sql\">REGEXP_SPLIT_TO_TABLE<\/a>.<\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/sql\/relational-databases\/regular-expressions\/overview\">Regular expressions overview<\/a>.<\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/troubleshoot\/sql\/releases\/sqlserver-2025\/cumulativeupdate5\">SQL Server 2025 CU5 release notes<\/a>.<\/li>\n<\/ul>\n<p><strong>Closing thought.<\/strong> Native regex was already a significant quality-of-life improvement when it became generally available. CU5 completes the picture: every function, every input size up to 2 MB, every shape \u2014 scalar or table-valued. The next time you are tempted to export a column out of the database in order to grep it, try one of the seven regex functions first.<\/p>\n<p><em>Happy matching.<\/em> \ud83e\udde0<\/p>\n","protected":false},"excerpt":{"rendered":"<p>At a glance \u2014 Native regular expression (regex) functions in T-SQL now accept varchar(max) and nvarchar(max) inputs of up to 2 MB across all seven regex functions, including the two table-valued functions (REGEXP_MATCHES and REGEXP_SPLIT_TO_TABLE). This capability ships in SQL Server 2025 CU5 and is already available in Azure SQL Database, SQL Database in Fabric\u00a0and [&hellip;]<\/p>\n","protected":false},"author":29335,"featured_media":81,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,572,668,672,619],"tags":[435,736,34,700],"class_list":["post-7135","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-azure-sql","category-managed-instance","category-sql-database-in-fabric","category-sql-server-2025","category-t-sql","tag-functions","tag-lob-types","tag-t-sql","tag-t-sql-regex"],"acf":[],"blog_post_summary":"<p>At a glance \u2014 Native regular expression (regex) functions in T-SQL now accept varchar(max) and nvarchar(max) inputs of up to 2 MB across all seven regex functions, including the two table-valued functions (REGEXP_MATCHES and REGEXP_SPLIT_TO_TABLE). This capability ships in SQL Server 2025 CU5 and is already available in Azure SQL Database, SQL Database in Fabric\u00a0and [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/posts\/7135","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/users\/29335"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/comments?post=7135"}],"version-history":[{"count":1,"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/posts\/7135\/revisions"}],"predecessor-version":[{"id":7151,"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/posts\/7135\/revisions\/7151"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/media\/81"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/media?parent=7135"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/categories?post=7135"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/tags?post=7135"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}