May 17th, 2007

How Can I Search For Phone Numbers in a Text File?


Hey, Scripting Guy! I have a bunch of text files, one for each of our employees. Included in these text files are phone numbers: work phone, home phone, cell phone, etc. How can I write a script that searches for and extracts the phone numbers from these text files?

— MN


Hey, MN. You know, many years ago the Scripting Guy who writes this column read The Hitchhiker’s Guide to the Galaxy. Now, being a Scripting Guy, he can’t help but admire anyone who writes a trilogy that actually contains five books. However, he still feels obligated to point out that the Hitchhiker’s Guide informed everyone that the answer to the ultimate question of life, the universe, and everything (a question that hasn’t even been asked yet) is 42. Sorry, but that’s simply not right. Instead, the correct answer is 1301.

Why 1301? Well, we’re assuming that the ultimate question of life, the universe, and everything is this: if I go to TechEd 2007 in Orlando, how in the world will I ever track down the Scripting Guys? The answer? By remembering the magic number: 42.

Oh, wait; sorry. The magic number is 1301, not 42. Go into the Partners Expo and look for booth 1301, the CMP Media booth. See the guy in a ratty old black hat with the word Japan on it and the woman with a bunch of staples in her head? Well, there you go: you just found the Scripting Guys. We’re interested in talking to as many people as we can (or, to be more accurate, as many people as we can before we can slip out the back door and head for Splash Mountain.) So if you find yourself in Orlando June 4-8 be sure to drop by and say hi, and to give us an idea of what we can do to serve you better.

Remember that number: 1301. The same year that Scripting Guy Peter Costantini was born.

Now that we know the answer to the ultimate question of life it’s time to turn our attention to the next ultimate question of life: how can I extract phone numbers from a text file? According to MN’s email she has a series of text files, each of them looking something like this:

Ken Myer
Accountant
Fabrikam, North American Division
Building 16, Room 129
Redmond, WA 98052
425-555-1212 (office), 425-555-5656 (cell)
Wife’s name: Sarah
Children: Robert (4), Teri (2)
425-555-3434 (home)

What MN needs to do is read through these files and extract the phone numbers, like so:

425-555-1212 (office)
425-555-5656 (cell)
425-555-3434 (home)

That doesn’t sound too hard, does it? Unfortunately though, there’s at least one complication: the text files are not necessarily consistent. For example, in Ken Myer’s text file his home phone appears on line 9. However, suppose Ken wasn’t married and suppose he didn’t have any children. In that case, his text file would look like this, and his home phone would appear on line 7:

Ken Myer
Accountant
Fabrikam, North American Division
Building 16, Room 129
Redmond, WA 98052
425-555-1212 (office), 425-555-5656 (cell)
425-555-3434 (home)

Likewise, while the phone numbers will always be in the same format, Area Code-Prefix-Suffix (phone type), the area codes can (and will) differ, as will the phone types. For example, some people don’t have cell phones. In other words, about all we can do is search for a particular pattern and then report back each instance of that pattern. But how in the world can we do that?

Well, here’s one way:

Const ForReading = 1

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\Scripts\Ken_myer.txt", ForReading)

strSearchString = objFile.ReadAll

objFile.Close

Set objRegEx = CreateObject("VBScript.RegExp")

objRegEx.Global = True
objRegEx.Pattern = "\d{3}-\d{3}-\d{4} \([a-zA-Z]*\)"

Set colMatches = objRegEx.Execute(strSearchString)

If colMatches.Count > 0 Then
    For Each strMatch in colMatches
        Wscript.Echo strMatch.Value
    Next
End If

Before we begin explaining how this script works, take a deep breath and relax: this is nowhere near as complicated as it might look. As you can see, we actually start off in very simple fashion, defining a constant named ForReading and setting the value to 1. (We’ll need this constant when we open the text file.) Speaking of which, our next two steps are to create an instance of the Scripting.FileSystemObject object and then use the OpenTextFile method to open the file C:\Scripts\Ken_myer.txt:

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\Scripts\Ken_myer.txt", ForReading)

As soon as the file is open we use the ReadAll method to read in the entire file and store the contents in a variable named strSearchString. And once we have the contents stashed safely away in strSearchString we go ahead and close the text file.
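One thing the script takes for granted is that the file exists and actually has something in it; ReadAll raises an error if you call it on an empty file. Here’s a minimal sketch of a more cautious way to read the file. The function name ReadTextFile is our own invention for illustration, and returning an empty string for a missing or empty file is just one possible design choice:

```vbscript
Const ForReading = 1

' Hypothetical helper (not part of the original script): returns the
' contents of the file, or "" if the file is missing or empty.
Function ReadTextFile(strPath)
    Dim objFSO, objFile
    ReadTextFile = ""

    Set objFSO = CreateObject("Scripting.FileSystemObject")
    If Not objFSO.FileExists(strPath) Then Exit Function

    Set objFile = objFSO.OpenTextFile(strPath, ForReading)
    ' ReadAll fails on an empty file, so make sure there is something to read.
    If Not objFile.AtEndOfStream Then
        ReadTextFile = objFile.ReadAll
    End If
    objFile.Close
End Function

strSearchString = ReadTextFile("C:\Scripts\Ken_myer.txt")
If strSearchString = "" Then
    Wscript.Echo "Nothing to search."
End If
```

If the function hands back an empty string, there’s nothing to search and the rest of the script can be skipped.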

Now comes the tricky part.

If we were searching for a particular value (e.g., the phone number 425-555-1212) we could simply use VBScript’s InStr function. However, we aren’t interested in any one phone number; as we noted before, we’re really looking for a phone number pattern. Any time you’re searching for a pattern you need to use a regular expression. That’s also true for the Scripting Guys: any time we’re searching for a pattern we need to use a regular expression. Hence the next line of code, which creates an instance of the VBScript.RegExp object:

Set objRegEx = CreateObject("VBScript.RegExp")

Note. If you’re already lost you might want to take a look at String Theory for System Administrators, Scripting Guy Dean Tsaltas’ immortal introduction to regular expressions for regular people.

Once we have a regular expression object we need to configure two key properties of that object. First, we set the Global property to True; this tells the regular expression object that we want to search the entire target string (strSearchString) for all instances of the pattern. If Global is set to False then the regular expression object will find the first instance of the pattern and then stop. In this case, that’s no good: it would find the first phone number and then quit, without looking for any additional phone numbers for Ken Myer.
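If you’d like to see what the Global property does for yourself, here’s a small sketch that runs the same search twice, once with Global set to False and once with it set to True. The sample string is hard-coded rather than read from a file, and the helper function CountMatches is a made-up name used only for this illustration:

```vbscript
' Hypothetical helper: counts how many times strPattern is found in strText.
Function CountMatches(strText, strPattern, blnGlobal)
    Dim objRegEx
    Set objRegEx = CreateObject("VBScript.RegExp")
    objRegEx.Global = blnGlobal
    objRegEx.Pattern = strPattern
    CountMatches = objRegEx.Execute(strText).Count
End Function

' Sample text borrowed from Ken Myer's file; the pattern is the
' digits-only part of the pattern used in the full script.
strSample = "425-555-1212 (office), 425-555-5656 (cell)"
strPattern = "\d{3}-\d{3}-\d{4}"

Wscript.Echo "Global = False: " & CountMatches(strSample, strPattern, False)
Wscript.Echo "Global = True: " & CountMatches(strSample, strPattern, True)
```

With Global set to False the search stops after the first phone number, so the count is 1; with Global set to True both numbers are found and the count is 2.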

And then it’s time to define our search Pattern:

objRegEx.Pattern = "\d{3}-\d{3}-\d{4} \([a-zA-Z]*\)"

Maybe you better sit down; you look a little dizzy. That’s the problem with regular expressions: you take one look at the syntax and your first thought is to panic. Listen, don’t panic (as the Hitchhiker’s Guide repeatedly implored). Regular expression syntax isn’t very pretty, but it can be explained.

To begin with, our phone numbers always start with the area code; that means a phone number will always start with three numbers back-to-back-to-back. Because of that, our regular expression also starts with three numbers; that’s what the \d{3} is for. (The \d means digits, the {3} means there must be three consecutive digits.) After the area code we have a hyphen followed by three more digits (the prefix). And guess what? In our regular expression we also have a hyphen followed by three more digits (again using \d{3}). In the phone number the prefix is followed by another hyphen and four more digits. And in our regular expression – that’s right, we have the exact same thing: \d{4}. (Note that we use the syntax {4} because we’re now looking for four consecutive digits.)
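To try out just the digits-and-hyphens portion of the pattern, you could use something like the following sketch. It relies on the Test method of the regular expression object, which simply reports whether or not the pattern was found; the helper name HasMatch and the sample strings are our own and exist only for this example:

```vbscript
' Hypothetical helper: True if strPattern is found anywhere in strText.
Function HasMatch(strText, strPattern)
    Dim objRegEx
    Set objRegEx = CreateObject("VBScript.RegExp")
    objRegEx.Pattern = strPattern
    HasMatch = objRegEx.Test(strText)
End Function

strPattern = "\d{3}-\d{3}-\d{4}"

Wscript.Echo HasMatch("425-555-1212", strPattern)   ' three digits, three digits, four digits
Wscript.Echo HasMatch("425-55-1212", strPattern)    ' prefix has only two digits
Wscript.Echo HasMatch("4255551212", strPattern)     ' right digits, but no hyphens
```

Only the first string fits the pattern; the other two are reported as non-matches.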

That’s actually all the information we need to find phone numbers. However, each phone number is followed by a parenthetical statement that tells us the type of phone number (e.g., (cell)). We thought it might be nice to grab these parenthetical statements as well. Therefore, we decided to gussy up our pattern a little bit.

So how did we gussy up our pattern? Well, if you look closely you’ll see there’s a blank space immediately following the telephone suffix; that corresponds to the blank space in a phone number like 425-555-1212 (office). Following the blank space we use this construction to represent the opening parenthesis: \(. (Why the \ before the parenthesis? That’s because the open and close parentheses are reserved characters in regular expressions. The \ simply tells the script to look for the actual open parenthesis character; that is, look for the ( character.)

You might note that, at the end of the pattern, we use a similar construction to represent the close parenthesis: \).
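Here’s a quick sketch that shows why the backslashes matter. Without them, the parentheses are treated as grouping characters rather than literal characters, so the pattern (cell) happily matches the plain word cell. The helper name HasMatch and the sample strings are assumptions made purely for illustration:

```vbscript
' Hypothetical helper: True if strPattern is found anywhere in strText.
Function HasMatch(strText, strPattern)
    Dim objRegEx
    Set objRegEx = CreateObject("VBScript.RegExp")
    objRegEx.Pattern = strPattern
    HasMatch = objRegEx.Test(strText)
End Function

' Unescaped: the parentheses only group the letters c-e-l-l.
Wscript.Echo HasMatch("cell", "(cell)")

' Escaped: the pattern now requires real ( and ) characters in the text.
Wscript.Echo HasMatch("cell", "\(cell\)")
Wscript.Echo HasMatch("(cell)", "\(cell\)")
```

The first and third calls find a match; the second does not, because the text contains no actual parentheses.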

OK, so open and close parentheses aren’t too hard. But what about the stuff that goes between the parentheses? At the very least that could be office, cell, or home; how do we search for something that could be almost anything?

To be honest, there are probably all sorts of ways to do that. (Regular expressions are nothing if not flexible.) Because everything within the parentheses is going to be a word (of some kind) we decided to go this route:

[a-zA-Z]*

We’ve actually got several things going on here. To begin with, we’re interested only in letters, either lowercase or uppercase. In a regular expression, you can search for a range of values by enclosing that range within square brackets and by using hyphens as needed. Inside our square brackets we actually have two items: a-z and A-Z. The a-z syntax simply says, “Match any lowercase letter (a through z)”; as you might expect, the A-Z syntax says “Match any uppercase letter (A through Z).” The net effect is that we’re searching for any letter (lowercase or uppercase). If there’s a number or a punctuation mark or anything but the letters A through Z between the parentheses then this won’t be considered a match.

OK, so we’re searching for letters (and only letters) enclosed in parentheses. But how many letters are we searching for? Well, we don’t know. The word office has 6 letters; the word cell has 4. Technically, we’re looking for 0 or more letters; that’s what the asterisk (*) represents. In other words, we’re looking for a phone number, followed by a blank space, followed by a parenthetical statement that contains 0 or more letters (and only letters).
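To get a feel for what “0 or more letters (and only letters)” means in practice, here’s a short sketch that tests the full pattern against a handful of sample strings. The strings are made up for this example, and MatchesPhonePattern is a hypothetical helper name:

```vbscript
' Hypothetical helper: True if strText contains a phone number followed
' by a space and a parenthetical made up of letters only.
Function MatchesPhonePattern(strText)
    Dim objRegEx
    Set objRegEx = CreateObject("VBScript.RegExp")
    objRegEx.Pattern = "\d{3}-\d{3}-\d{4} \([a-zA-Z]*\)"
    MatchesPhonePattern = objRegEx.Test(strText)
End Function

arrSamples = Array("425-555-1212 (office)", _
                   "425-555-1212 ()", _
                   "425-555-1212 (cell 2)", _
                   "425-555-1212 (fax2)")

For Each strSample In arrSamples
    If MatchesPhonePattern(strSample) Then
        Wscript.Echo strSample & " -> match"
    Else
        Wscript.Echo strSample & " -> no match"
    End If
Next
```

The first two samples match (yes, even the empty parentheses, because the asterisk allows zero letters). The last two don’t, because a blank space and a digit are not letters.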

Confused? We don’t blame you. Truthfully, the best way to understand how regular expressions work is to play with them a bit. For example, try removing the a-z section from the pattern and then running the script. The script won’t find any phone numbers. Why not? Because, with a-z removed, all that’s left is A-Z, meaning that we’re looking only for uppercase letters (e.g., OFFICE). Now, go into the text file, make everything uppercase, and try the script again.

Speaking of trying the script, here’s what we get back when we try the script:

425-555-1212 (office)
425-555-5656 (cell)
425-555-3434 (home)

Which is exactly what we wanted to get back.

OK, so now we know how to locate phone numbers in a text file. That’s good, but it doesn’t fully answer MN’s question; after all, she asks how to perform this task on scores of text files. That’s a good question, and it’s one we get asked quite a bit: how do I perform the same task on a whole bunch of text files, all at the same time? That’s something we’ll address tomorrow, as the conclusion to a rare two-part Hey, Scripting Guy! column.

Note. Will that result in these two columns becoming a collector’s item of some kind? Sure; why not?

In conclusion, remember, that’s booth 1301 (the CMP Media booth) in the Partners Expo hall. The Scripting Guys will be there, they’ll have a fun little giveaway for everyone, and you’ll have the opportunity to win your very own Dr. Scripto bobblehead doll.

Actually we should clarify that a bit: Scripting Guy Greg Stemp will be there in booth 1301. However, if her trip to Florida goes anything like her trip to San Diego you’ll need to go to the Orlando Regional Medical Center if you want to talk to Scripting Guy Jean Ross. But be sure and drop by there and say hi. (And maybe give her a pint of blood or something while you’re there.) Editor’s Note: Scripting Guy Jean Ross had the staples removed weeks ago, is perfectly healthy, and will be in booth 1301. Or so she insists.
