PowerShell regex crash course – Part 4 of 5

Dr Scripto
Summary: Thomas Rayner, Microsoft Cloud and Datacenter Management MVP, shows the basics of working with regular expressions in PowerShell.
Hello! I’m Thomas Rayner, a proud Cloud & Datacenter Management Microsoft MVP, filling in for The Scripting Guy! this week. You can find me on Twitter (@MrThomasRayner), or posting on my blog, workingsysadmin.com. This week, I’m presenting a five-part crash course about how to use regular expressions in PowerShell. Regular expressions are sequences of characters that define a search pattern, mainly for use in pattern matching with strings. Regular expressions are extremely useful to extract information from text such as log files or documents. This isn’t meant to be a comprehensive series but rather, just as the name says, a crash course. So, buckle up!
Many people are intimidated by regular expressions, or “regex”. If you see something like ‘(\d{1,3}\.){3}(\d{1,3})’ and your eyes start glazing over, don’t worry. By the end of this series, you’ll have the skills to recognize that this pattern matches IP addresses. For the uninitiated, big strings of seemingly random characters appear indecipherable, but regex is an incredibly powerful tool that any PowerShell pro needs to have a grip on.
From what I’ve seen, lookaheads and lookbehinds are very underused by PowerShell users who write regex. So far, everything we’ve looked at (quantifiers, special characters, character classes, and groups) matches characters in a string. For instance, ‘\w{3}\s{2}’ will match three word characters (letters, digits, or underscores) followed by two whitespace characters. Lookaheads and lookbehinds are different, though. The best way to describe them is to say that lookaheads and lookbehinds match locations between characters, rather than matching characters themselves. Stay with me here.
Say you have a string “domain\username”, and you want to take just the username part. There are a lot of ways to do this in regex. Here are a few.
'domain\username' -replace '\w+\\',''
'domain\username' -replace '.*\\',''
[regex]::matches('domain\username','\w+$').value
[regex]::matches('domain\username','[^\\]+$').value
The first two examples are variations of “look for something that has a backslash after it, and replace it with nothing”. The last two examples look for “a word character (or not a backslash) until you get to the end of the string”. There is another way to squeeze the juice out of this lemon, though.
[regex]::matches('domain\username','(?<=\\).+$').value
Let’s take a closer look at the regex pattern here. (?<=\\) is a lookbehind. What it means is “match the space between characters where the character on the left is a backslash”. Because the “username” part of “domain\username” has a “\” immediately before it, the lookbehind matches the space between the characters “\” and “u”. Then, the pattern .+$ takes every character until the end of the string.
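To make “the space between characters” a little more concrete, here is a small sketch that runs the lookbehind on its own and inspects the match. The variable names $m and $user are mine, just for illustration.

```powershell
# Match only the lookbehind, with nothing after it.
$m = [regex]::Match('domain\username', '(?<=\\)')

$m.Success   # True
$m.Index     # 7, the position just after the backslash
$m.Length    # 0, because no characters were consumed

# Adding .+$ consumes characters starting from that position.
$user = [regex]::Match('domain\username', '(?<=\\).+$').Value
$user        # username
```

A zero-length match at index 7 is the regex engine’s way of saying “I found a location, not a character”.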
You can also do a lookahead.
[regex]::matches('domain\username','.+?(?=\\)').value
Here I’m just getting the “domain” part of this string. The pattern .+? matches as few characters as it takes to get to the next part of the pattern, which is (?=\\). That pattern is a lookahead, which matches the space between the “n” in “domain” and the “\” that comes after.
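The question mark in .+? matters once there is more than one backslash in the string. Here is a rough sketch of the difference, using a made-up input ('a\b\c') and variable names of my own choosing.

```powershell
$path = 'a\b\c'   # hypothetical input with two backslashes

# Lazy: stops at the first position that is followed by a backslash.
$lazy = [regex]::Match($path, '.+?(?=\\)').Value
$lazy      # a

# Greedy: takes as much as it can, then backs off to the last such position.
$greedy = [regex]::Match($path, '.+(?=\\)').Value
$greedy    # a\b
```

I used [regex]::Match here instead of [regex]::Matches so that only the first match comes back, which keeps the comparison simple.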
You can make a lookahead or lookbehind into a negative lookahead or negative lookbehind by replacing the “=” part with “!”. Consider the following example.
@('something','this one bad') | where { $_ -match 's(?!\s)' } #returns "something" only
Hypothetically, we have an array of two items, and I am interested only in items that have an “s” but not where a space comes right after the “s”. Weird, but it’s a simple example just for this crash course.
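The same idea works in the other direction. As a sketch of a negative lookbehind, assume a made-up list of names where I only want the ones that don’t have a domain prefix. The array contents and the $names and $bare variables are illustrative, not from a real system.

```powershell
# Hypothetical input: one name with a domain prefix, one without.
$names = @('domain\user', 'user')

# (?<!\\) succeeds only where the character on the left is not a backslash.
$bare = $names | Where-Object { $_ -match '(?<!\\)user' }
$bare   # user
```

In 'domain\user' the only “user” sits right after a backslash, so the negative lookbehind rejects it.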
Here’s a full table of the different lookahead and lookbehind syntax.
Syntax | Meaning and example |
---|---|
(?=<pattern>) | Lookahead. Matches a position where the text immediately to the right matches <pattern>. For example, (?=e) matches the space between the “m” and “e” in “something” because the pattern searches for a space where the character on the right is an “e”. |
(?!<pattern>) | Negative lookahead. Matches a position where the text immediately to the right does not match <pattern>. For example, m(?!q) matches the “m” in “something”, and checks the space between the “m” and “e”, because the pattern searches for an “m” where the following character is not “q”. |
(?<=<pattern>) | Lookbehind. Matches a position where the text immediately to the left matches <pattern>. For example, (?<=m) matches the space between the “m” and the “e” in “something” because the pattern searches for a space where the character on the left is “m”. |
(?<!<pattern>) | Negative lookbehind. Matches a position where the text immediately to the left does not match <pattern>. For example, (?<!q)e checks the space between the “m” and the “e” in “something”, and then matches the “e”, because the pattern searches for an “e” that is not preceded by “q”. |
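If you want to check the four rows of the table against the word “something” yourself, something like the following sketch should do it. The four patterns are my assumption of what each row describes, and the $patterns and $results names are just for illustration.

```powershell
$text = 'something'

# Assumed example patterns, one per row of the table.
$patterns = [ordered]@{
    'lookahead'           = '(?=e)'
    'negative lookahead'  = 'm(?!q)'
    'lookbehind'          = '(?<=m)'
    'negative lookbehind' = '(?<!q)e'
}

$results = foreach ($name in $patterns.Keys) {
    $m = [regex]::Match($text, $patterns[$name])
    [pscustomobject]@{
        Kind   = $name
        Index  = $m.Index    # where the match starts
        Length = $m.Length   # 0 means only a position was matched
        Value  = $m.Value
    }
}
$results | Format-Table
```

The two patterns that contain only a lookaround come back with a length of 0, while the two that also contain a literal character come back with a length of 1. The lookaround part never adds to the length.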
I can hear you asking, “What’s the point?” Why would you want to use a lookahead or lookbehind when the syntax can look so confusing? Well, let’s use an example to illustrate the point. Say I have a string, “this\is\a string”, where I want the “is\a” part. Let’s also say that the actual words “this is a string” aren’t consistently those words. It could be “why\is\this something” too. Why? Because it’s an example.
I could do this.
[regex]::matches('this\is\a string','\\[^\\].+?\s').value
That’s pretty good. I’ve got the pattern “a single backslash followed by as many characters as it takes to get to a space”. But that includes the leading “\” before “is”, and it’s hard to see here, but there’s a space character following “a”. Okay, let’s strip them out.
[regex]::matches('this\is\a string','\\[^\\].+?\s').value.trim('\ ')
Not bad. That will return the part that we care about, and it works in most cases. .trim() takes an array of characters, and we passed it the backslash and the space character to trim those off the start and end.
Let’s see how I would tackle this with lookaheads and lookbehinds.
[regex]::matches('this\is\a string','(?<=\\).+?(?=\s)').value
The pattern here is “the space between characters where the character on the left is a backslash, followed by as many characters as it takes to get to whitespace”. The lookahead stops the match right before the whitespace, so neither the backslash nor the space ends up in the result.
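Since the words in the string aren’t supposed to be consistent, it’s worth running both approaches against both sample strings from earlier. This is only a quick sketch; the $samples and $compared names are mine.

```powershell
$samples = @('this\is\a string', 'why\is\this something')

$compared = foreach ($s in $samples) {
    [pscustomobject]@{
        Input      = $s
        # Match the backslash and the space, then trim them off.
        Trimmed    = [regex]::Match($s, '\\[^\\].+?\s').Value.Trim('\ ')
        # Let the lookbehind and lookahead keep them out of the match.
        Lookaround = [regex]::Match($s, '(?<=\\).+?(?=\s)').Value
    }
}
$compared | Format-Table
```

Both columns should agree: “is\a” for the first string and “is\this” for the second.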
The one that performs better will depend on the context in which you use regex. In this particular example, the non-lookahead example with .trim() is actually a couple of ticks faster on average in my tests. In larger files with more complicated string manipulation (maybe splitting, trimming, replacing, then joining is needed), lookaheads and lookbehinds can save you a lot of processing time.
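If you want to run a comparison like this on your own machine, here is one possible sketch using Measure-Command. The iteration count and variable names are arbitrary choices of mine, and the numbers will vary from run to run and machine to machine, so treat the output as a rough indication only.

```powershell
$s = 'this\is\a string'
$iterations = 10000   # arbitrary count chosen for illustration

# Time the match-then-trim approach.
$withTrim = Measure-Command {
    foreach ($i in 1..$iterations) {
        $null = [regex]::Match($s, '\\[^\\].+?\s').Value.Trim('\ ')
    }
}

# Time the lookbehind/lookahead approach.
$withLookaround = Measure-Command {
    foreach ($i in 1..$iterations) {
        $null = [regex]::Match($s, '(?<=\\).+?(?=\s)').Value
    }
}

'trim: {0} ms, lookaround: {1} ms' -f $withTrim.TotalMilliseconds, $withLookaround.TotalMilliseconds
```

Running each approach many times in a loop smooths out some of the noise you would get from timing a single call.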
I don’t personally see a ton of people using lookaheads and lookbehinds, but they are a powerful tool and I think the more tools you have in your toolbox, the better equipped you are to handle challenges.
Tune in tomorrow for a bunch of examples!
That was some amazing content, Thomas! I know for a fact I’m going to be ripping through text files with regular expressions this weekend to see what I can do with them! Thanks!
I invite you to follow the Scripting Guys on Twitter and Facebook. If you have any questions, send email to them at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow.
Until then, always remember that with Great PowerShell comes Great Responsibility.
Sean Kearney Honorary Scripting Guy Cloud and Datacenter Management MVP