August 6th, 2009

The great thing about regular expression engines is that there are so many to choose from

Back in the days before perl ruled the earth, regular expressions were one of those weird niche features, one of those things that everybody reimplements when they need it. If you look at the old unix tools, you’ll see that even then, there were three different regular expression engines with different syntax. You had grep, egrep, and vi. Probably more. The grep regular expression language supported character classes, the dot wildcard, the asterisk operator, the start and end anchors, and grouping. No plus operator, no question mark, no alternation, no repetition counts. The egrep program added support for plus, question mark, and alternation. Meanwhile, somebody went back and added repetition counts to grep but didn’t add them to vi; somebody else added the \< and \> metacharacters to vi but didn’t add them to sed. POSIX added repetition counts to awk but changed the notation from \{n,m\} to {n,m}. And so on. No two programs use the same regular expression language, but they overlap sufficiently that you can often get by with the common subset and not have to worry about which particular flavor you’re up against.

Until you wander into the places where they differ.

From: John Jones
Subject: Problem with regular expression

I’m trying to write a regular expression to match blah blah blah.

From: Jane Smith
Subject: RE: Problem with regular expression

I think this will match what you want: ^Z@1&*B*!34

I just ran my hand randomly over the keyboard to generate that fake regular expression. The scary thing is, at first glance, it is not obviously not a regular expression!

From: Chris Brown
Subject: RE: Problem with regular expression

Try $)(#$C)*#

From: John Smith
Subject: RE: Problem with regular expression

Thanks, everybody, for your suggestions, but I can’t get any of them to work. For example, I can’t get any of them to match against this string: blah blah blah blah.

At this point, people chimed in with other suggestions, confirming that John doubled the backslashes, that sort of thing. John posted his test program, and then the reason was obvious.

From: Jane Smith
Subject: RE: Problem with regular expression

Oh, you’re using CAtlRegExp. In that class, \w doesn’t match a single character; it matches an entire word. You want to use \a instead.

Topics
Code

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

0 comments

Discussion are closed.