Hey, Scripting Guy! How can I remove all the non-alphabetic characters in a string?
— CD
Hey, CD. You know, typically the Scripting Guys don’t play favorites; we treat all our readers and all the questions we receive exactly the same. (OK, so, technically that means that, for the most part, we simply ignore all our readers as well as all the questions we receive. But at least we’re treating everyone the same!)
This time, however, we’re going to make an exception: CD, this is a great question, one of the best we’ve ever received. Thank you so much for asking how you can write a script that removes all the non-alphabetic characters in a string. This is a truly great, great question.
So, anyway, the best way to remove all the – what’s that? Is this a great question because it addresses the concerns of so many system administration scripters? Well, to tell you the truth, we’re not really sure how many people need to remove all the non-alphabetic characters in a string, although we have received questions along similar lines. So then is this a great question because it provides the Scripting Guys an opportunity to explain an important yet little-understood concept about system administration scripting? Well, maybe. But that isn’t why we thought this was a great question. No, we thought this was a great question because we knew we could answer it in just a few minutes, and with just a few lines of code. In turn, that means we can get today’s column out of the way and move on to something that everyone really cares about: tonight’s college football game between the University of Washington Huskies and the Syracuse Orange.
Note. If you go to the official site of Syracuse University Athletics you’ll learn that the favorite fruit of volleyball player Mindy Stanislovaitis is the strawberry, and that her favorite superhero is Spiderman. You’ll also learn that the Orange are apparently playing the Washington Post rather than the Washington Huskies in tonight’s opening game. Of course, if you go 5-18 over the past two years, as Syracuse has, well, then it probably makes sense to open the season against a newspaper rather than another college football team.
At any rate, kickoff is just a few hours away, so we better get started on our answer for today’s question (which, by the way, is a great, great question):
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Global = True
objRegEx.Pattern = "[^A-Za-z]"
strSearchString = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890abcdefghijklmnopqrstuvwxyz!@#$%&*(_+"
strSearchString = objRegEx.Replace(strSearchString, "")
Wscript.Echo strSearchString
See? We told you this was going to be quick. And easy.
Although there are other ways we could accomplish this task, the easiest approach (in our minds, anyway) is to use a regular expression. Therefore, the very first thing we do in this script is execute the following line of code, a line of code that creates an instance of VBScript’s regular expression object (VBScript.RegExp):
Set objRegEx = CreateObject("VBScript.RegExp")
Once we have a regular expression object in hand, we then set two important properties of the object. First, we set the Global property to True; we do this to ensure that, when we execute the Replace method, the regular expression searches the entire string value. What if we didn’t set the Global property to True? In that case, our regular expression would find the first instance of the target text (that is, the first non-alphabetic character), replace that one character, and then stop. Admittedly, there are times when all you need to know is whether or not there is at least one non-alphabetic character in a string. However, this isn’t one of those times.
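If you'd like to see the difference the Global property makes, here's a minimal sketch. It uses the same Pattern as the script above; the variable strSample and its value are made up purely for illustration. The last few lines show the Test method, which is one way to find out whether a string contains at least one match without replacing anything.

```vbscript
' Sketch only: strSample is a hypothetical value, not part of the original script.
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Pattern = "[^A-Za-z]"

strSample = "a1b2c3"

' With Global set to False, Replace stops after the first match.
objRegEx.Global = False
Wscript.Echo objRegEx.Replace(strSample, "")   ' ab2c3

' With Global set to True, Replace handles every match in the string.
objRegEx.Global = True
Wscript.Echo objRegEx.Replace(strSample, "")   ' abc

' Test reports whether the pattern matches at least once.
If objRegEx.Test(strSample) Then
    Wscript.Echo "Found at least one non-alphabetic character."
End If
```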
Next, we need to define the regular expression Pattern, the text we’re going to search for. That’s what this line of code does:
objRegEx.Pattern = "[^A-Za-z]"
The key to working with regular expressions is this: don’t let the cryptic syntax scare you off. Too often people look at a construction like [^A-Za-z] and think, “Well, I’m not sure who that’s intended for, but it’s definitely not intended for me.” As it turns out, however, this is actually a pretty straightforward little statement. With regular expressions, you can search for a range of characters simply by enclosing those characters in square brackets. Want to search for all the characters between A and Z (uppercase letters), inclusive? Then use this syntax: [A-Z]. What about all the lowercase characters, those between a and z, inclusive? No problem: [a-z]. And if you want to search for either uppercase characters or lowercase characters, well, one way to do that is to include both character ranges in a single set of square brackets: [A-Za-z].
Like we said, simple and straightforward.
OK, but what about the little caret symbol (^)? Well, any time the caret appears as the very first character inside the square brackets that caret is equivalent to the word “not.” What does [^A-Za-z] mean? It means this: search for any characters that are not in the range A-Z or a-z. In other words, search for any non-alphabetic characters.
Which, coincidentally enough, is just exactly what we want to search for.
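To make the bracket syntax a bit more concrete, here's a rough sketch that runs one sample string through each of the character ranges just described. The variable strSample and its value are our own invention for this example; the comments show what each Replace call should leave behind.

```vbscript
' Sketch only: strSample is a hypothetical value chosen for illustration.
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Global = True

strSample = "Ab1-Cd2"

' [A-Z] matches uppercase letters, so those get removed.
objRegEx.Pattern = "[A-Z]"
Wscript.Echo objRegEx.Replace(strSample, "")   ' b1-d2

' [a-z] matches lowercase letters.
objRegEx.Pattern = "[a-z]"
Wscript.Echo objRegEx.Replace(strSample, "")   ' A1-C2

' [A-Za-z] matches any letter, uppercase or lowercase.
objRegEx.Pattern = "[A-Za-z]"
Wscript.Echo objRegEx.Replace(strSample, "")   ' 1-2

' [^A-Za-z] matches anything that is NOT a letter.
objRegEx.Pattern = "[^A-Za-z]"
Wscript.Echo objRegEx.Replace(strSample, "")   ' AbCd
```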
Believe it or not, from here on the task gets even easier. To begin with, we assign a value to a variable named strSearchString; note the many non-alphabetic characters in the value:
strSearchString = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890abcdefghijklmnopqrstuvwxyz!@#$%&*(_+"
After that we call the Replace method, passing the method two parameters: the value we want to search (strSearchString) and the replacement text. In this case, of course, we want to replace non-alphabetic characters with absolutely nothing; therefore, we use an empty string (“”) as the replacement text:
strSearchString = objRegEx.Replace(strSearchString, "")
That line of code should remove all the non-alphabetic characters in strSearchString. (That is, it should locate each of these characters and replace them with absolutely nothing.) So does that line of code remove all the non-alphabetic characters in strSearchString? Let’s see what happens when we echo back the value of strSearchString:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Well, what do you know?
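If you find yourself doing this to more than one string, one option is to tuck the same steps into a function. What follows is just a sketch: the function name RemoveNonAlpha and the sample value passed to it are hypothetical, and nothing here goes beyond the regular expression code already shown.

```vbscript
' Hypothetical helper: the name RemoveNonAlpha is ours, not part of the original script.
Function RemoveNonAlpha(strInput)
    Dim objRegEx
    Set objRegEx = CreateObject("VBScript.RegExp")
    objRegEx.Global = True
    objRegEx.Pattern = "[^A-Za-z]"
    ' Return the input with every non-alphabetic character replaced by nothing.
    RemoveNonAlpha = objRegEx.Replace(strInput, "")
End Function

' Sample call with a made-up value.
Wscript.Echo RemoveNonAlpha("Windows Server 2003, SP1")   ' WindowsServerSP
```

Note that spaces are non-alphabetic characters, too, so they disappear right along with the digits and the punctuation.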
In case you’re wondering, yes, you can use this same approach to replace all the non-alphabetic characters in a text file; in fact, we threw together a sample script that does just that. Note that, for this script, we added something to our Pattern: \n\r. This tells the script that we want to exclude the newline character (\n) and the carriage return character (\r) from the search scope. In turn, that means that our script will not remove any line breaks from the text file. If you don’t care about preserving line breaks, well, then just remove the \n\r from the Pattern.
Without any further ado (or explanation), here’s the script:
Const ForReading = 1
Const ForWriting = 2
Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\Scripts\Test.txt", ForReading)
strSearchString = objFile.ReadAll
objFile.Close
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Global = True
objRegEx.Pattern = "[^A-Za-z\n\r]"
strSearchString = objRegEx.Replace(strSearchString, "")
Set objFile = objFSO.OpenTextFile("C:\Scripts\Test.txt", ForWriting)
objFile.WriteLine strSearchString
objFile.Close
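If you'd rather see what the \n\r does before letting a script loose on a real file, here's a small sketch that tries both versions of the Pattern on a two-line string held in memory. The variable strSample and its value are hypothetical; vbCrLf is VBScript's built-in carriage return-linefeed constant.

```vbscript
' Sketch only: strSample is a made-up two-line value used in place of a file.
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Global = True

strSample = "Line 1!" & vbCrLf & "Line 2?"

' With \n\r inside the brackets, line breaks are left alone.
objRegEx.Pattern = "[^A-Za-z\n\r]"
Wscript.Echo objRegEx.Replace(strSample, "")   ' Line and Line, on two separate lines

' Without \n\r, the line break is removed along with everything else.
objRegEx.Pattern = "[^A-Za-z]"
Wscript.Echo objRegEx.Replace(strSample, "")   ' LineLine
```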
Just a little bonus script for everyone. Needless to say, now that college football is upon us, the Scripting Guy who writes this column is happy. For a change.
Speaking of which, we should note that this same Scripting Guy is under no illusions about the upcoming season: as a long-time season ticket holder, he knows UW football about as well as anyone. (He also knows that they play the toughest schedule in the country this season, playing 9 teams that went to bowl games last year.) But that’s the great thing about sports: until they start playing games, everyone is undefeated, and everyone has a chance to be the national champion.
Especially those teams that open the season against the Washington Post.