April 13th, 2009

Hey, Scripting Guy! How Can I Create a Phone Directory from Files with Varying Text Formats?

Hey, Scripting Guy! Question

Hey, Scripting Guy! I have a folder full of text files that have user contact information. I think these files were produced by an old database backup. I have to create a phone directory from these files; I thought about trying to read each file and copy just the fourth line from each of the files. Unfortunately, in some of the files the phone number is on one line, but in other files it is on a different line. To make matters worse, some of the files have one phone number, but others have two or even three phone numbers listed. Is there any hope to solve this mess, or would it be easier to set up an internal Web site and send an e-mail message out to everyone to go update their contact information?

– JB

Hey, Scripting Guy! Answer

Hi JB,

We are not sure the internal Web site would be a good idea. First, it is almost impossible to get users to fill out anything. You would end up having to hassle the managers to hassle their workers, which of course would stretch into months and months of headaches and nightmares. You would be better off typing the phone numbers yourself. However, if you are not excited about two weeks’ worth of typing, you may be interested in our topic today—regular expressions. You can also review VBScript versions of today’s article: match a telephone number and find all telephone numbers in a folder.

This week we are focusing on regular expressions. There are some VBScript examples in the Script Center. Here is a good introduction from the 2008 Winter Scripting Games (by the way, in the 2009 Summer Scripting Games, I can pretty much guarantee you will need to be able to do something with regular expressions for one of the events). The Regex .NET Framework class from the System.Text.RegularExpressions namespace is documented on MSDN. This is one of the main classes we use in Windows PowerShell when working with regular expressions. You also will find some information about regular expressions in the Microsoft Press book, Windows PowerShell Scripting Guide. Here is a very good article about regular expression use in VBScript. In this week’s articles, we are using Windows PowerShell for our samples. Please refer to the Windows PowerShell Scripting Hub for more information about this exciting new technology.

Regular expressions are everywhere in Windows PowerShell. There is a -match operator that uses regular expressions, cmdlets that accept regular expressions as parameters, .NET Framework classes, and even a Select-String cmdlet that eats regular expressions for lunch. Of course, it might be good if we look at what a regular expression actually is before we get too far in the discussion.

Regular expressions are used to look inside text and find things that match. To do the matching, regular expressions rely upon highly stylized patterns. The patterns can range from simple letter sequences to complex specialized patterns that form their own language. Let’s take a look at a simple use of regular expressions using the -match operator. In this example, we use an If statement to see if the string “This is a string” has the letter pattern “is” inside it. Of course it does, but is our code correct? If the pattern is found in the source string, we display the word “match”; otherwise, we display the words “no match.” If you need a bit of a refresher on the IF…ELSE construction in Windows PowerShell, refer to this article. The pattern we are using here is two literal characters: the letter i and the letter s back to back. This is shown here:

PS C:\> IF("This is a string" -match "is" ) { "match" } ELSE { "no match" }
match

We use the up arrow inside the Windows PowerShell console, change the match pattern from “is” to “if,” and run the command again. This is shown here (the test string does not contain the letter pattern “if”):

PS C:\> if("This is a string" -match "if" ) { "match" } ELSE { "no match" }
no match

What if we were wondering if the pattern “is a” appears in the string? Our pattern is the letters i and s, a blank space, and the letter a:

PS C:\> if("This is a string" -match "is a" ) { "match" } ELSE { "no match" }
match

That was easy, but what if the space between the letters s and a was two spaces or worse yet a tab? What would we do then? We could mess around until we came up with something that might work, but the moment the text changes or something is inconsistent, the pattern would fail. This is where the regular expression special characters come in handy. As it turns out, there is a pattern “\s” (without the quotation marks) that will match a single white-space character of any kind, such as a space or a tab.

Anyway, here is an example of using the “\s” pattern. I should tell you at this point that these special characters are case sensitive, even though the -match operator itself ignores case when it compares letters. This means that “\S” does not match the same thing that “\s” matches. In fact they are opposites: “\S” matches any character that is not white space. So our pattern looks for the letters is followed by any white-space character:

PS C:\> if("This is a string" -match "is\s" ) { "match" } ELSE { "no match" }
match

To match “is a” we can use the pattern “is\sa”. This is seen here:

PS C:\> if("This is a string" -match "is\sa" ) { "match" } ELSE { "no match" }
match
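To come back to the tab and two-space question, here is a small sketch. The two test strings are hypothetical, and the “+” quantifier (one or more) is not otherwise covered in this article, so treat that line as an aside rather than part of today's lesson:

```powershell
# Hypothetical test strings: one with a tab, one with two spaces.
$tabbed    = "This is`ta string"    # `t is the PowerShell escape for a tab
$twoSpaces = "This is  a string"

$tabbed    -match "is\sa"     # True: \s matches the single tab
$twoSpaces -match "is\sa"     # False: \s matches exactly one white-space character
$twoSpaces -match "is\s+a"    # True: \s+ matches one or more white-space characters
$tabbed    -match "is\Sa"     # False: \S requires a character that is not white space
```

The second line is the one to watch: a single “\s” stands for one white-space character, not for a run of them.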

But what if the string says “is an”? Would the pattern work? As shown here, it does in fact match. This is because we found the pattern “is a” in this exact sequence:

PS C:\> if("This is an string" -match "is\sa" ) { "match" } ELSE { "no match" }
match

That is pretty cool, but what if we want to match “is a” when it is only “is a” and not when it appears as “is an”? What would we need to change? In this case, the letter a is followed by a blank space. The blank space does not appear between the a and the n in the word “an”. We already know the special pattern for white space, so we go ahead and modify the match pattern. As we can see here, it does not match is an but it does match is a:

PS C:\> if("This is an string" -match "is\sa\s" ) { "match" } ELSE `
{ "no match"}
no match
PS C:\> if("This is a string" -match "is\sa\s" ) { "match" } ELSE { "no match" }
match

We can use a pair of curly brackets with a number inside to specify a specific number of consecutive occurrences of the pattern. If we want to check a string and ensure that it has two occurrences of the letter i in a row, our pattern is the letter i followed by the number 2 inside a set of curly brackets. It looks like “i{2}” and is seen in the code here. Note that the first string contains three i characters, but no two of them are adjacent, so it does not match:

PS C:\> if("This is a string" -match "i{2}" ) { "match" } ELSE { "no match" }
no match
PS C:\> if("This iis a string" -match "i{2}" ) { "match" } ELSE { "no match" }
match
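If what you really want is “at least two i characters anywhere in the string,” the curly brackets are the wrong tool. One possible approach, sketched here on the assumption that calling the .NET Regex class directly is acceptable, is to count the individual matches:

```powershell
$text = "This is a string"

# i{2} needs two i characters in a row, so this is False.
$text -match "i{2}"

# Count every i in the string instead of looking for adjacent ones.
$count = [regex]::Matches($text, "i").Count
$count          # 3
$count -ge 2    # True
```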

We can use a range of letters by using a dash between two letters inside a pair of square brackets to indicate a consecutive sequence of letters. The pattern looks like this: “[a-e]”. In this example, we search for a match with any letter in the range of a through e. The -match operator ignores case by default, so this pattern also matches the uppercase letters A through E. If you switch to the case-sensitive -cmatch operator and are interested in both uppercase and lowercase letters, put both ranges inside one set of square brackets: “[a-eA-E]”. (Writing “[a-e][A-E]” would instead look for two characters in a row.) An example of using the range pattern is seen here:

PS C:\> if("This is a string" -match "[a-e]" ) { "match" } ELSE { "no match" }
match
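A quick way to see how case is handled is to compare -match with its case-sensitive sibling, -cmatch. The test strings here are made up for illustration:

```powershell
"ABC" -match  "[a-e]"        # True: -match ignores case by default
"ABC" -cmatch "[a-e]"        # False: -cmatch is case sensitive
"ABC" -cmatch "[a-eA-E]"     # True: one character class covering both cases
"ABC" -cmatch "[a-e][A-E]"   # False: this wants a lowercase letter followed by an uppercase one
```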

Let’s try a different range this time. Suppose we are interested in a range of lowercase letters j through m? We would use the range “[j-m]”. A match is only generated if one of the letters in the range is present in the test string. However, if we use the “*” symbol, we are looking for zero or more matches. Because the test string contains no letters in the range of “j-m”, the first pattern fails. The second generates a match, though, because the zero portion of the zero or more match character was used: the pattern is satisfied by an empty string. This is seen here:

PS C:\> if("This is a string" -match "[j-m]" ) { "match" } ELSE { "no match" }
no match
PS C:\> if("This is a string" -match "[j-m]*" ) { "match" } ELSE { "no match" }
match
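You can see what the “*” pattern actually matched by looking at the automatic $Matches variable, which -match fills in after a successful match. This is only a small sketch; the $result variable is introduced here for illustration:

```powershell
$result = "This is a string" -match "[j-m]*"
$result               # True
$Matches[0].Length    # 0: the match is the empty string at the start of the input
```

In other words, a pattern that can match zero characters will report a match on any string at all, which is rarely what you want from a test.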

The “*” character means zero or more matches, as we just saw, but we can generate the same match pattern by using the numbers inside curly brackets. If we use a zero for the minimum and leave the maximum empty, our pattern is this: “{0,}”. This pattern is the same as the “*” pattern. As shown here, it generates the same results as our previous pattern. (Note that the lines are getting too long for our text format, and so I am using the backtick character (“`”), which indicates line continuation. When I actually typed it in the Windows PowerShell console, I did not use the line continuation because the console is able to wrap.)

PS C:\> if("This is a string" -match "[j-m]{0,}" ) { "match" } ELSE `
{ "no match" }
match

We can also use the ? character, which tells regular expressions to match zero or one instance of the pattern. In this example, the pattern is the lowercase letters j through m and we are looking for zero or one instance of those letters. In our example our pattern is “[j-m]?” and is seen here:

PS C:\> if("This is a string" -match "[j-m]?" ) { "match" } ELSE { "no match" }
match

The previous match pattern, “[j-m]?”, could also be written using the curly brackets and numbers. To find zero or one instance of letters in the range of lowercase j through lowercase m we would use this pattern: “[j-m]{0,1}”. The newly revised pattern and results are shown here:

PS C:\> if("This is a string" -match "[j-m]{0,1}" ) `
{ "match" } ELSE { "no match" }
match
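Because all of these patterns report a match on our test string, the difference between them is easier to see on a string that does contain letters from the range. The string “klm string” below is hypothetical, and $Matches[0] shows the text each pattern picked up:

```powershell
$null = "klm string" -match "[j-m]?"
$Matches[0]    # k: zero or one letter from the range

$null = "klm string" -match "[j-m]{0,1}"
$Matches[0]    # k: same meaning as ?

$null = "klm string" -match "[j-m]*"
$Matches[0]    # klm: zero or more letters from the range
```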

What if we were interested in two or three i characters in a row? The best way to find two or three consecutive instances of a specific character is to use the curly brackets and specify “i{2,3}”. Our usual test string has no adjacent i characters, so we use the string “This iis a string” again. This is seen here:

PS C:\> if("This iis a string" -match "i{2,3}" ) { "match" } ELSE { "no match" }
match

What if we needed to look for three or four i characters in a row? We would continue to modify the same pattern of letter character and curly brackets. This time it would look like “i{3,4}” as seen here:

PS C:\> if("This is a string" -match "i{3,4}" ) { "match" } ELSE { "no match" }
no match
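The minimum and maximum in the curly brackets can be explored with a couple of made-up strings. In this sketch, $Matches[0] shows that the pattern takes as many characters as the maximum allows:

```powershell
$null = "This iiis a string" -match "i{2,3}"
$Matches[0]                              # iii: the quantifier takes all three

"This iiis a string" -match "i{3,4}"     # True: three in a row meets the minimum
"This iis a string"  -match "i{3,4}"     # False: only two in a row
```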

If we want to find a digit, we use the “\d” pattern. If we are interested in finding two digits in a row in the input string, we use our curly bracket trick with the “\d” and our match pattern becomes “\d{2}”:

PS C:\> if("This is a number 22" -match "\d{2}" ) { "match" } ELSE { "no match"}
match
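As a small sketch of what “\d{2}” does and does not accept, here is the same test with two extra strings (both hypothetical), along with $Matches[0] to show the digits that were found:

```powershell
if ("This is a number 22" -match "\d{2}") {
    $Matches[0]    # 22
}

"This is a number 2"   -match "\d{2}"    # False: only one digit
"This is a number 2 2" -match "\d{2}"    # False: the two digits are not adjacent
```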

We now have enough background to get to our original question. We need to find all telephone numbers inside a text file. To do this, we need to read a text file and use a regular expression pattern that will be able to correctly identify a telephone number. We can use the Select-String cmdlet to do both of these tasks. It will read a text file and will look for a regular expression pattern. Telephone numbers in the United States use three numbers for the area code, followed by a group of three numbers, and then a group of four numbers. The three groups of numbers are separated by two dashes, which make the groups easier to remember. The regular expression match pattern is therefore “\d{3}-\d{3}-\d{4}”. When we use Select-String to read the text file seen in the image below, we obtain the results seen here:

PS C:\> Select-String -Pattern "\d{3}-\d{3}-\d{4}" -Path C:\fso\Ken_Myer.txt
fso\Ken_Myer.txt:6:425-555-1212 (office), 425-555-5656 (cell)
fso\Ken_Myer.txt:9:425-555-3434 (home)

Image of the text file being read by Select-String

 

If we want to make sure we have a phone number followed by text such as (office), we need to add to the match pattern. As seen in the previous image, the phone number description text is surrounded by a set of parentheses. We need to tell the regular expression engine to look for the parenthesis character and interpret it literally. To do this, we need to escape the parenthesis. In Windows PowerShell regular expressions, the escape character is a backslash (\). In our particular example, however, the escape character does not appear to make any difference. That is because unescaped parentheses simply group part of the pattern, and the letters inside “(office)” satisfy “[a-z]*” either way:

PS C:\> "(office)" -match "([a-z]*)"
True
PS C:\> "(office)" -match "\([a-z]*\)"
True
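The difference shows up as soon as the input has no parentheses in it. In this sketch the bare string “office” is a made-up test value:

```powershell
"office"   -match "([a-z]*)"      # True: the parentheses only group, so no literal ( is required
"office"   -match "\([a-z]*\)"    # False: the input has no literal parentheses

$null = "(office)" -match "\([a-z]*\)"
$Matches[0]                       # (office): the parentheses are part of the match
```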

Our telephone number and text description pattern is now complete as seen here: “\d{3}-\d{3}-\d{4}\s\([a-zA-Z]*\)”. When we use it with Select-String and have it read the text file, we receive the results seen here:

PS C:\> Select-String -Pattern "\d{3}-\d{3}-\d{4}\s\([a-zA-Z]*\)" `
-Path C:\fso\Ken_Myer.txt
fso\Ken_Myer.txt:6:425-555-1212 (office), 425-555-5656 (cell)
fso\Ken_Myer.txt:9:425-555-3434 (home)
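Select-String reports each matching line once, even when a line holds more than one phone number. If you want the individual numbers, one option is the -AllMatches switch, sketched below. This assumes Windows PowerShell 2.0 or later, where that switch is available, and it pipes in a sample line taken from the output above rather than reading a file:

```powershell
# Sample line taken from the Select-String output shown above.
$line    = "425-555-1212 (office), 425-555-5656 (cell)"
$pattern = "\d{3}-\d{3}-\d{4}\s\([a-zA-Z]*\)"

# -AllMatches records every match on the line, not just the first one.
$found = $line | Select-String -Pattern $pattern -AllMatches |
    ForEach-Object { $_.Matches } |
    ForEach-Object { $_.Value }

$found    # 425-555-1212 (office)
          # 425-555-5656 (cell)
```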

Being able to read a text file and parse it for information in a single line of code is pretty cool. But suppose we have multiple files in a folder that could have telephone information contained within them. Such a folder is seen here:

Image of multiple files in a folder

 

As you can see in that image, there is a mixture of data files in the folder. It includes everything from text files to Access databases. In Windows PowerShell, we could use the Get-ChildItem cmdlet to obtain a listing of text files. We could then pipeline it to the Get-Content cmdlet to read the content of each text file, and pipeline the results to the Select-String cmdlet to find the pattern matches. If we do this, the command would look like this:

PS C:\> Get-ChildItem -Path c:\fso -Include *.txt -Recurse | Get-Content | Select-string -Pattern "\d{3}-\d{3}-\d{4}\s\([a-zA-Z]*\)"
980-555-1212 (office), 980-555-5656 (cell)
980-555-3434 (home)
425-555-1212 (office), 425-555-5656 (cell)
425-555-3434 (home)
513-555-1212 (office), 513-555-5656 (cell)

There is nothing wrong with such a command. Before I learned how to use Select-String, I probably wrote code that looked like that. However, when we test it by using the Measure-Command cmdlet, we see that the command takes nearly two seconds to run:

PS C:\> Measure-command { Get-ChildItem -Path c:\fso -Include *.txt -Recurse | Get-Content | Select-string -Pattern "\d{3}-\d{3}-\d{4}\s\([a-zA-Z]*\)" }

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 1
Milliseconds      : 923
Ticks             : 19233254
TotalDays         : 2.22607106481481E-05
TotalHours        : 0.000534257055555555
TotalMinutes      : 0.0320554233333333
TotalSeconds      : 1.9233254
TotalMilliseconds : 1923.3254

When we use the -Path parameter of Select-String and tell it to look for *.txt files, the results are quite a bit better: about 43 milliseconds compared to nearly two seconds. When you consider that the Select-String command is also easier to type, the choice is clear. The results of Measure-Command are shown here:

PS C:\> Measure-command {Select-String -Pattern `
"\d{3}-\d{3}-\d{4}\s\([a-zA-Z]*\)" -Path C:\fso\*.txt}

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 43
Ticks             : 436937
TotalDays         : 5.0571412037037E-07
TotalHours        : 1.21371388888889E-05
TotalMinutes      : 0.000728228333333333
TotalSeconds      : 0.0436937
TotalMilliseconds : 43.6937

To read a folder and return the information that matches is not much more difficult than figuring out the regular expression pattern in the first place. This is seen here:

PS C:\> Select-String -Pattern "\d{3}-\d{3}-\d{4}\s\([a-zA-Z]*\)" -Path `
C:\fso\*.txt
fso\Jim_Kim.txt:6:980-555-1212 (office), 980-555-5656 (cell)
fso\Jim_Kim.txt:7:980-555-3434 (home)
fso\Ken_Myer.txt:6:425-555-1212 (office), 425-555-5656 (cell)
fso\Ken_Myer.txt:9:425-555-3434 (home)
fso\Yan_Lee.txt:5:513-555-1212 (office), 513-555-5656 (cell)
fso\Yan_Lee.txt:7:606-555-3434 (home)
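To turn that output into something closer to a phone directory, one possible next step is to split each match into its number and its label. The following is only a sketch under several assumptions: it needs Windows PowerShell 2.0 or later for -AllMatches, it builds a throwaway folder in the temp directory instead of touching C:\fso, the file layout is hypothetical (only the phone lines come from the output above), and the File, Number, and Type property names are made up for the example:

```powershell
# Throwaway folder with two sample files; the phone lines come from the output above,
# but the file layout itself is hypothetical.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ("fso_demo_" + [guid]::NewGuid().ToString("N"))
New-Item -ItemType Directory -Path $dir | Out-Null
Set-Content -Path (Join-Path $dir "Ken_Myer.txt") -Value "425-555-1212 (office), 425-555-5656 (cell)", "425-555-3434 (home)"
Set-Content -Path (Join-Path $dir "Jim_Kim.txt") -Value "980-555-3434 (home)"

# Unescaped parentheses capture the number (group 1) and the label (group 2);
# the escaped ones still match the literal parentheses in the text.
$pattern = "(\d{3}-\d{3}-\d{4})\s\(([a-zA-Z]*)\)"

$directory = Select-String -Pattern $pattern -Path (Join-Path $dir "*.txt") -AllMatches |
    ForEach-Object {
        $file = $_.Filename
        foreach ($m in $_.Matches) {
            # One object per phone number found on the line.
            New-Object PSObject -Property @{
                File   = $file
                Number = $m.Groups[1].Value
                Type   = $m.Groups[2].Value
            }
        }
    }

$directory | Sort-Object File, Number | Format-Table File, Number, Type -AutoSize

# Clean up the sample folder.
Remove-Item -Path $dir -Recurse -Force
```

Because the result is a set of objects rather than text, it could just as easily be piped to Export-Csv as to Format-Table.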

Well, JB, we have successfully trolled through a folder and found all the phone numbers. Along the way, we learned a lot about working with regular expressions. Join us tomorrow as Regular Expression Week continues. Until then, peace.

 

Ed Wilson and Craig Liebendorfer, Scripting Guys
