October 21st, 2016

PowerShell regex crash course – Part 4 of 5

Doctor Scripto
Scripter

Summary: Thomas Rayner, Microsoft Cloud and Datacenter Management MVP, shows the basics of working with regular expressions in PowerShell.

Hello! I’m Thomas Rayner, a proud Cloud & Datacenter Management Microsoft MVP, filling in for The Scripting Guy! this week. You can find me on Twitter (@MrThomasRayner), or posting on my blog, workingsysadmin.com. This week, I’m presenting a five-part crash course about how to use regular expressions in PowerShell. Regular expressions are sequences of characters that define a search pattern, mainly for use in pattern matching with strings. Regular expressions are extremely useful to extract information from text such as log files or documents. This isn’t meant to be a comprehensive series but rather, just as the name says, a crash course. So, buckle up!

Many people are intimidated by regular expressions, or “regex”. If you see something like ‘(\d{1,3}\.){3}(\d{1,3})’ and your eyes start glazing over, don’t worry. By the end of this series, you’ll have the skills to recognize that this pattern matches IP addresses. For the uninitiated, big strings of seemingly random characters appear indecipherable, but regex is an incredibly powerful tool that any PowerShell pro needs to have a grip on.

From what I’ve seen, lookaheads and lookbehinds are very underused by PowerShell users who write regex. So far, everything we’ve looked at (quantifiers, special characters, character classes, and groups) matches characters in a string. For instance, ‘\w{3}\s{2}’ will match three word characters (letters, digits, or underscores) followed by two whitespace characters. Lookaheads and lookbehinds are different, though. The best way to describe them is to say that lookaheads and lookbehinds match locations between characters, rather than matching characters themselves. Stay with me here.

Say you have a string “domain\username”, and you want to take just the username part. There are a lot of ways to do this in regex. Here are a few.

'domain\username' -replace '\w+\\',''

'domain\username' -replace '.*\\',''

[regex]::matches('domain\username','\w+$').value

[regex]::matches('domain\username','[^\\]+$').value

The first two examples are variations of “look for something that has a backslash after it, and replace it with nothing”. The last two examples look for “word characters (or characters that are not a backslash) all the way to the end of the string”. There is another way to squeeze the juice out of this lemon, though.

[regex]::matches('domain\username','(?<=\\).+$').value

Let’s take a closer look at the regex pattern here. (?<=\\) is a lookbehind. What it means is “match the space between characters where the character on the left is a backslash”. Because “domain\username” has a “\” right before “username”, this part matches the space between the characters “\” and “u”. Then, the pattern .+$ takes every character until the end of the string.
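If you want to convince yourself that the lookbehind only pins down a location, one way is to compare it with a pattern that consumes the backslash. This is just a sketch, and the variable names are mine, made up for illustration.

```powershell
# Compare a lookbehind with a pattern that consumes the backslash.
$text = 'domain\username'

# The lookbehind requires a backslash on the left but does not include it.
$lookbehind = [regex]::Match($text, '(?<=\\).+$')

# Without the lookbehind, the backslash becomes part of the match.
$consuming = [regex]::Match($text, '\\.+$')

$lookbehind.Value   # username
$lookbehind.Index   # 7, the position of "u"
$consuming.Value    # \username
$consuming.Index    # 6, the position of "\"
```

Both patterns find the same place in the string. The only difference is whether the backslash ends up in the result.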

You can also do a lookahead.

[regex]::matches('domain\username','.+?(?=\\)').value

Here I’m just getting the “domain” part of this string. The pattern .+? matches as few characters as it takes to get to the next part of the pattern, which is (?=\\). That pattern is a lookahead which matches the space between the “n” in “domain” and the “\” that comes after.
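The same kind of side-by-side comparison works for the lookahead. Again, this is only a sketch with illustrative variable names.

```powershell
$text = 'domain\username'

# The lookahead stops the match just before the backslash.
$withLookahead = [regex]::Match($text, '.+?(?=\\)').Value

# Consuming the backslash instead leaves it on the end of the match.
$withoutLookahead = [regex]::Match($text, '.+?\\').Value

$withLookahead      # domain
$withoutLookahead   # domain\
```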

You can make a lookahead or lookbehind into a negative lookahead or negative lookbehind by replacing the “=” part with “!”. Consider the following example.

@('something','this one bad') | where { $_ -match 's(?!\s)' }  # returns 'something' only

Hypothetically, we have an array of two items, and I am interested only in items that have an “s” but not where a space comes right after the “s”. Weird, but it’s a simple example just for this crash course.
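To see the “=” and “!” versions next to each other, here is a small sketch that runs both against the same array. The variable names are only for illustration.

```powershell
$items = @('something', 'this one bad')

# Negative lookahead: an "s" that is NOT followed by whitespace.
$negative = $items | Where-Object { $_ -match 's(?!\s)' }

# Positive lookahead: an "s" that IS followed by whitespace.
$positive = $items | Where-Object { $_ -match 's(?=\s)' }

$negative   # something
$positive   # this one bad
```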

Here’s a full table of the different lookahead and lookbehind syntax.

Syntax Meaning – Example
(?=<pattern>) Lookahead. Matches the space between two characters where the text to the right of that space matches <pattern>.

[regex]::matches('something','(?=e)').value

This will match the space between the “m” and “e” in “something” because the pattern searches for a space where the character on the right is an “e”. The match itself is an empty string, because a space between characters contains no characters.

(?!<pattern>) Negative lookahead. Matches the space between two characters where the text to the right of that space does not match <pattern>.

[regex]::matches('something','m(?!q)').value

This will match the “m” in “something” because the pattern searches for an “m” where the following character is not “q”. The lookahead checks the space between the “m” and the “e” but adds nothing to the match.

(?<=<pattern>) Lookbehind. Matches the space between two characters where the text to the left of that space matches <pattern>.

[regex]::matches('something','(?<=m)').value

This will match the space between the “m” and the “e” in “something” because the pattern searches for a space where the character on the left is “m”. As with the lookahead, the match itself is an empty string.

(?<!<pattern>) Negative lookbehind. Matches the space between two characters where the text to the left of that space does not match <pattern>.

[regex]::matches('something','(?<!q)e').value

This will match the “e” in “something” because the pattern searches for an “e” that is not preceded by “q”. The lookbehind checks the space between the “m” and the “e” but adds nothing to the match.
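Since a couple of those matches are empty strings, it can be hard to tell from .value alone that anything matched. One way to check, sketched here with illustrative variable names, is to look at the Index and Length properties of each match.

```powershell
# The four example patterns from the table, all run against 'something'.
$patterns = '(?=e)', 'm(?!q)', '(?<=m)', '(?<!q)e'

$results = foreach ($pattern in $patterns) {
    $m = [regex]::Match('something', $pattern)
    [pscustomobject]@{
        Pattern = $pattern
        Value   = $m.Value
        Index   = $m.Index    # where the match starts
        Length  = $m.Length   # how many characters it contains
    }
}

$results | Format-Table -AutoSize
```

The two patterns made up only of a lookahead or lookbehind report a Length of 0 at Index 3, the space between the “m” and the “e”. The other two report a Length of 1, because only the “m” or the “e” is part of the match.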

I can hear you asking, “What’s the point?” Why would you want to use a lookahead or lookbehind when the syntax can look so confusing? Well, let’s use an example to illustrate the point. Say I have a string, ‘this\is\a string’, where I want the “is\a” part. Let’s also say that the words themselves vary from string to string. It could be ‘why\is\this something’ too. Why? Because it’s an example.

I could do this.

[regex]::matches('this\is\a string','\\[^\\].+?\s').value

That’s pretty good. I’ve got the pattern “a single backslash, then a character that isn’t a backslash, followed by as many characters as it takes to get to a space”. But that includes the leading “\” before “is”, and it’s hard to see here, but there’s a space character following “a”. Okay, let’s strip them out.

[regex]::matches('this\is\a string','\\[^\\].+?\s').value.trim('\ ')

Not bad. That will return the part that we care about, and it works in most cases. .trim() takes an array of characters, and we passed it the backslash and the space character to trim those off the start and end.

Let’s see how I would tackle this with lookaheads and lookbehinds.

[regex]::matches('this\is\a string','(?<=\\).+?(?=\s)').value

The pattern here is “the space between characters where the character on the left is a backslash, followed by as many characters as it takes to get to whitespace”. Neither the backslash nor the whitespace ends up in the match, so there is nothing to trim.
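Here is a quick sketch that runs the same pattern against both of the sample strings mentioned earlier. The variable names are just for illustration.

```powershell
$pattern = '(?<=\\).+?(?=\s)'
$samples = 'this\is\a string', 'why\is\this something'

# Pull the part between the first backslash and the first whitespace.
$extracted = foreach ($sample in $samples) {
    [regex]::Match($sample, $pattern).Value
}

$extracted[0]   # is\a
$extracted[1]   # is\this
```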

The one that performs better will depend on the context in which you use regex. In this particular example, the non-lookahead example with .trim() is actually a couple of ticks faster on average in my tests. In larger files with more complicated string manipulation (maybe splitting, trimming, replacing, then joining is needed), lookaheads and lookbehinds can save you a lot of processing time.
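If you want to run this kind of comparison yourself, Measure-Command is one way to do it. This is only a rough sketch: the iteration count is an arbitrary number I picked for illustration, and the results will vary by machine and PowerShell version.

```powershell
$text = 'this\is\a string'
$iterations = 10000   # arbitrary count, chosen only for illustration

# Time the consuming pattern plus .Trim().
$trimTime = Measure-Command {
    for ($i = 0; $i -lt $iterations; $i++) {
        $null = [regex]::Matches($text, '\\[^\\].+?\s').Value.Trim('\ ')
    }
}

# Time the lookbehind and lookahead pattern.
$lookaroundTime = Measure-Command {
    for ($i = 0; $i -lt $iterations; $i++) {
        $null = [regex]::Matches($text, '(?<=\\).+?(?=\s)').Value
    }
}

'Trim:       {0:N0} ticks' -f $trimTime.Ticks
'Lookaround: {0:N0} ticks' -f $lookaroundTime.Ticks
```

Running each pattern many times in a loop smooths out some of the noise you would get from timing a single call.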

I don’t personally see a ton of people using lookaheads and lookbehinds, but they are a powerful tool and I think the more tools you have in your toolbox, the better equipped you are to handle challenges.

Tune in tomorrow for a bunch of examples!

That was some amazing content, Thomas!  I know for a fact I’m going to be ripping through text files with regular expressions this weekend to see what I can do with them! Thanks!

I invite you to follow the Scripting Guys on Twitter and Facebook. If you have any questions, send email to them at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow.

Until then, always remember that with Great PowerShell comes Great Responsibility.

Sean Kearney Honorary Scripting Guy Cloud and Datacenter Management MVP
