Summary: Microsoft MVP, Tome Tanasovski, shows how to use regular expression lookaheads and lookbehinds in Windows PowerShell to format numbers.
Hey, Scripting Guy! How do I use a regular expression to add the appropriate commas to a number?
—TT
Hello TT,
Microsoft Scripting Guy, Ed Wilson, here. Guest Blogger Week continues with Tome Tanasovski.
Tome is a Windows engineer for a market-leading, global financial services firm in New York City. He is the founder and leader of the New York City PowerShell User group, a cofounder of the NYC Techstravaganza, a blogger, a speaker, and a regular contributor to the Windows PowerShell forum at Microsoft. He is currently working on the PowerShell Bible, which is due out in 2011 from Wiley. Tome is also a recipient of the MVP award for Windows PowerShell. Tome will be providing an hour-long deep dive on regular expressions via Live Meeting on March 22, 2011 for the UK PowerShell User group.
Regular expressions are one of the topics that will figure in the 2011 Scripting Games. In addition to the resources mentioned in the 2011 Scripting Games Study Guide, you should review Tome’s blogs today and tomorrow. If you get a chance to attend the March 22, 2011 Live Meeting, that would be helpful as well.
The Scripting Guys already have quite an extensive list of regular expression articles that talk about using -match, -replace, Select-String, and a few other nifty tricks in the Windows PowerShell ninja arsenal for pattern matching and text slicing-and-dicing.
My opinion is that you are hurting yourself if you are an IT Pro and you don’t know the basics of regular expressions, which include:
· Character classes [a-zA-Z]
· Metacharacters \d \w \s
· Quantifiers {2,3} * + ?
· Grouping (this|that)
· Captures $matches[1]
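As a quick refresher, here is a minimal sketch that touches each of the basics listed above in one pattern. The file name is a made-up input chosen only for illustration.

```powershell
# A hypothetical input string, chosen only for illustration.
$name = 'web042.log'

# [a-z]+     character class with the + quantifier: one or more letters
# (\d{2,3})  capture group 1: a metacharacter with a {min,max} quantifier
# \.         an escaped, literal dot
# (log|txt)  grouping with alternation, which is also capture group 2
$isMatch = $name -match '^[a-z]+(\d{2,3})\.(log|txt)$'

$isMatch       # True
$matches[1]    # 042
$matches[2]    # log
```

After a successful -match, the automatic $matches variable holds the captured groups, which is what the last bullet refers to.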
If you are reading this blog and you are not already familiar with these concepts, take a few minutes to read some of the previous blogs.
Although a Windows PowerShell enthusiast can get by with knowing only the basics, there are a few complex concepts worth learning at least once so that you know that they are available when you need to use them. Many people who come up with elegant regular expressions do so with a small bit of reading at the time of creation to remember the exact syntax required for a particular task.
Two concepts that do not naturally stick in most beginner brains are lookarounds and word boundaries. This isn’t because they are complex, but because they are not used as frequently as the other elements that are used every day in regular expressions. These mentally reclusive elements are exactly what are needed to solve the problem of adding commas to a series of numbers.
Let us look at an example of the problem that we are trying to solve: I have a long number like 1234567890. I need to add a comma to break up the number into sets of threes from right to left. With that understanding, I know that a comma belongs between seven and eight, between four and five, and between one and two. The result is 1,234,567,890.
It seems fairly simple to you and me, but how do you explain all of the mental rules involved to a computer in the same brief and simple way? The first thing we have to understand is that we are looking for positions, not characters, in the string. Fortunately, regular expressions have positional elements to help us. Let’s take a look at positional elements.
Understanding positions
It’s not always obvious that there are more than just letters, numbers, symbols, and spaces that make up the patterns that our brain sees when we’re scanning text. The ^ and $ are the most used examples of these intangible elements. In a regular expression, the caret (^) symbol means the beginning of the string, and a dollar sign ($) indicates the end of the string.
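One way to convince yourself that ^ and $ are positions rather than characters is to replace them and notice that no digit disappears. This is only a small sketch; the > and < markers are arbitrary choices.

```powershell
$number = '1234567890'

# ^ matches the position before the first character; nothing is removed.
$atStart = $number -replace '^', '>'

# $ matches the position after the last character.
$atEnd = $number -replace '$', '<'

$atStart   # >1234567890
$atEnd     # 1234567890<
```

In both cases the replacement text is inserted at a position, and every original character survives.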
If you were to look at the bytes within the text, you would not find a byte value for these positions; they are logical positions before and after the bytes. If you could also describe positions between the bytes in the middle of your string, you could tell the computer to do something like replace the position before the 2, 5, and 8 in 1234567890 with commas. Lookaheads, (?=), allow you to do just that. Here is an example of using a lookahead to find a position followed by the number 2.
Note: In Windows PowerShell, regular expressions are accepted in many places. Here I am using the replace operator.
The lookahead expression, (?=)
'1234567890' -replace '(?=2)', ','
1,234567890
An opening parenthesis followed by a question mark and an equal sign, (?=, begins a lookahead, and the corresponding closing parenthesis ends it. In the previous example, (?=2) means any position in the string where the next character is a 2. You can use any regular expression elements after the equal sign. For example, the following pattern matches the word “test” as long as another instance of the word “test” exists later in the string:
'test(?=.*test)'
The same check can be written as 'test.*test', but the nice thing about a lookahead is that it does not consume any more text in the match. This makes sense when you look at an example. The following expression replaces the word “Hello” only if it is followed by the word “World,” and it leaves “World” in place instead of replacing the entire “HelloWorld” with “Goodbye”:
'HelloWorld' -replace 'Hello(?=World)', 'Goodbye'
GoodbyeWorld
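To make the "does not consume" point concrete, here is a small sketch that compares what each of the two “test” patterns actually matches. It uses the .NET [regex]::Match method, and the sample text is hypothetical.

```powershell
$text = 'test one test two'   # hypothetical sample text

# Without a lookahead, everything up to the second "test" is part of the match.
$consumed = [regex]::Match($text, 'test.*test').Value

# With a lookahead, only the first "test" is matched; the rest is merely checked.
$checked = [regex]::Match($text, 'test(?=.*test)').Value

$consumed   # test one test
$checked    # test
```

Both patterns succeed on the same input, but only the lookahead version leaves the text after the first “test” untouched by a replace.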
By using a lookahead, we could solve our explicit example of 1234567890 as follows:
'1234567890' -replace '(?=(2|5|8))', ','
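Here is a short sketch of what that hard-coded pattern produces, both for our number and for a second number that I made up purely for comparison.

```powershell
$pattern = '(?=(2|5|8))'

$original = '1234567890' -replace $pattern, ','
$other    = '2468'       -replace $pattern, ','   # hypothetical second input

$original   # 1,234,567,890
$other      # ,246,8
```

The commas land wherever a 2, 5, or 8 happens to be, which has nothing to do with groups of three digits.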
Although this works fine for our example, the regular expression fails miserably when we throw any other number at it. Instead, we can use a lookahead to find positions that are followed by one or more sets of three digits by using this pattern:
'1234567890' -replace '(?=(\d{3})+)', ','
,1,2,3,4,5,6,7,890
This works nicely for the last comma, but the rest of the matches fall short. The first thing we notice is the comma at the beginning of the string. We can solve this quite easily by saying that not only should the lookahead have a set of digits, but the position should also have a digit before it. We do this with a lookbehind.
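If it is not obvious why so many commas appear, it can help to list the positions where the lookahead succeeds. The sketch below uses the .NET [regex]::Matches method and reads the Index property of each zero-width match.

```powershell
# Collect the index of every position followed by at least three digits.
$positions = [regex]::Matches('1234567890', '(?=(\d{3})+)') |
    ForEach-Object { $_.Index }

$positions -join ' '   # 0 1 2 3 4 5 6 7
```

Every position from 0 through 7 still has three or more digits after it, so each one gets a comma. Only positions 8 and 9 are too close to the end of the string to match.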
The lookbehind expression, (?<=)
'1234567890' -replace '(?<=\d)(?=(\d{3})+)', ','
1,2,3,4,5,6,7,890
Lookbehinds are initiated with (?<=. It’s also worth noting that you can negate a lookahead or lookbehind by using an exclamation mark (!) instead of the equal sign (=). For example, if you wanted to replace only the last instance of the word “script” in a string, you could use a regular expression that finds the word “script,” and then does a negative lookahead to make sure that no other instances of the word follow it. This example replaces the second (and last) “script” with the letter X.
'script script' -replace 'script(?!.*script)', 'X'
script X
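The mirror image is a negative lookbehind, (?<!...). The following sketch places the two side by side. Note that the lookbehind here contains .*, which relies on the .NET regular expression engine accepting variable-length patterns inside a lookbehind; many other engines do not allow this.

```powershell
$text = 'script script'

# Negative lookahead: match "script" only when no other "script" follows it,
# which selects the last instance.
$lastReplaced = $text -replace 'script(?!.*script)', 'X'

# Negative lookbehind: match "script" only when no other "script" precedes it,
# which selects the first instance.
$firstReplaced = $text -replace '(?<!script.*)script', 'X'

$lastReplaced    # script X
$firstReplaced   # X script
```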
Create an anchor
There is one final puzzle to solve in our regular expression. We are matching almost all of the positions between our numbers because our lookahead (?=(\d{3})+) will succeed anywhere that the position is followed by at least one set of three digits. We need to ensure that the sets of three digits finish at the end of the string. Fortunately, we know that we can use the dollar sign ($) to indicate the end of the string, so we can give our number a boundary. The following example solves our specific problem of adding the appropriate number of commas to a string of numbers.
'1234567890' -replace '(?<=\d)(?=(\d{3})+$)', ','
This regular expression is a solution to our problem, but it does not answer every instance of the problem of adding commas to numbers. What if you wanted to do this for all numbers embedded within a string? In the example seen here, two groups of numbers are inside a string with words.
'1234 blah blah 123456789 blah'
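As a quick check, here is a sketch that runs the anchored pattern against numbers of several lengths. The shorter inputs are made up for illustration.

```powershell
$pattern = '(?<=\d)(?=(\d{3})+$)'

# Apply the same replacement to each number in the pipeline.
$results = '123', '1234', '1234567', '1234567890' |
    ForEach-Object { $_ -replace $pattern, ',' }

$results -join ' '   # 123 1,234 1,234,567 1,234,567,890
```

A three-digit number is left alone because no position inside it has both a digit before it and a complete set of three digits running to the end of the string.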
The dollar sign only matches the end of the string, so the regular expression will not work for us here. This is where the word boundary anchor, \b, saves the day.
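A brief sketch confirms the limitation: with the end-of-string anchor, the embedded numbers come back unchanged.

```powershell
$text = '1234 blah blah 123456789 blah'

# No run of digits reaches the end of the string, so nothing matches.
$result = $text -replace '(?<=\d)(?=(\d{3})+$)', ','

$result -eq $text   # True
```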
Understanding the word boundary (\b) anchor
The \b anchor indicates a position between a word character (\w) and a non-word character (\W), or between a word character and the beginning or end of the string. In case you do not remember, \w matches digits too, so the boundary can be used to signify the position between the last digit and the following space in each set of numbers. The result is seen here.
'1234 blah blah 123456789 blah' -replace '(?<=\d)(?=(\d{3})+\b)', ','
1,234 blah blah 123,456,789 blah
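To see where the boundaries fall, the following sketch marks each one with a pipe character. It also shows one limitation that follows from the definition: when digits are glued directly to letters, there is no boundary between them. Both input strings are hypothetical.

```powershell
# Mark every word boundary with a pipe character.
$marked = '123 abc' -replace '\b', '|'

# Digits followed directly by letters: 4 and a are both word characters,
# so no boundary ends the digit run and no comma is added.
$glued = '1234abc' -replace '(?<=\d)(?=(\d{3})+\b)', ','

$marked   # |123| |abc|
$glued    # 1234abc
```

Boundaries appear at the start and end of each run of word characters, which is why the pattern works for numbers separated from the surrounding text by spaces.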
The following image shows the word boundary command and its output as seen in the Windows PowerShell console. As you can see, the command and the output are exactly the same as the text shown earlier.
Although regular expression solutions that use lookarounds and word boundaries are not typical for the IT Pro who is parsing log data or working with text output from a non-Windows PowerShell command, it is important to remember that they exist. If you work with text long enough, I guarantee that you will come across a situation when regular expressions will save the day.
TT, that is all there is to using regular expression lookaheads and lookbehinds to work with adding commas to numbers. I want to thank Tome for writing an excellent tutorial on regular expressions. Guest Blogger Week will continue tomorrow when Tome will be back to continue talking about regular expressions.
I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.
Ed Wilson, Microsoft Scripting Guy