December 6th, 2014

Weekend Scripter: Remove Non-Alphabetic Characters from String

Doctor Scripto
Scripter

Summary: Microsoft Scripting Guy, Ed Wilson, talks about using Windows PowerShell to remove all non-alphabetic characters from a string.

Microsoft Scripting Guy, Ed Wilson, is here. This morning I am drinking a nice cup of English Breakfast tea and munching on a Biscotti. I know…Biscotti is not a very good breakfast. Oh well. I went to my favorite bakery yesterday looking for some nice scones (which do make a good breakfast). For some reason, all of the scones they had were covered with a half-inch thick gunky sugar icing. I mean, DUDE!

There was not enough time to talk the Scripting Wife into making some nice scones, so here I am munching on Biscotti. The tea is nice anyway, and I am listening to some great Duke Ellington on my Surface Pro 3 while I am catching up on the Hey, Scripting Guy! Blog comments.

One of the really cool things about the Hey, Scripting Guy! Blog is that I continue to get comments on posts that were written a long time ago. I ran across a couple of comments for a post that was written more than seven years ago.

The post was written using VBScript, and it was titled How Can I Remove All the Non-Alphabetic Characters in a String? The post talks about using regular expressions, and the information is still valid in a Windows PowerShell world. From that standpoint, it makes sense to take a quick look at that post before moving forward.

PowerShell makes using regular expressions easy

Lots of Windows PowerShell commands have regular expressions built in to them. That means that I really do not need to do anything special to unleash the power of regular expressions. I do not really need to know regular expressions, but knowing a bit about them does make stuff easier.

At first, it might look like there is a regular expression character class that would do what I want to do here, which is to remove non-alphabetic characters. Unfortunately, that is not the case. There is the \w character class, which matches a word character; but word characters include digits and the underscore as well as letters.

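   Note  Regular expression escapes are case sensitive, and it is important to remember that. Here, the \w
   character class is different from the \W character class (non-word characters). The Windows PowerShell
   -match and -replace operators ignore letter case by default, but that does not change what \w and \W mean.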
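To see why \w is not enough, here is a minimal sketch. The sample string is made up for illustration and is not from the original post.

```powershell
# Made-up sample containing letters, an underscore, digits, and punctuation.
$sample = 'abc_123!XYZ'

# Removing non-word characters (\W) leaves the digits and the underscore behind.
$nonWordRemoved = $sample -replace '\W', ''
$nonWordRemoved      # abc_123XYZ

# The lowercase escape is the opposite: removing word characters (\w) leaves only the punctuation.
$wordRemoved = $sample -replace '\w', ''
$wordRemoved         # !
```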

According to the previously mentioned Hey, Scripting Guy! Blog post, the trick to solving the problem of removing non-alphabetic characters from a string is to create two letter ranges, a-z and A-Z, and then use the caret character in my character group to negate the group—that is, to say that I want any character that IS NOT in my two letter ranges. Here is the pattern I come up with:

[^a-zA-Z]
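A quick way to get a feel for the negated group is to test single characters against it with the -match operator. This is only a sketch; the test characters are my own choices.

```powershell
$pattern = '[^a-zA-Z]'

# Letters are inside the two ranges, so the negated group does not match them.
$lowerMatches = 'a' -match $pattern     # False
$upperMatches = 'Z' -match $pattern     # False

# Anything outside the ranges matches, including digits, punctuation, and spaces.
$digitMatches = '7' -match $pattern     # True
$punctMatches = '!' -match $pattern     # True
$spaceMatches = ' ' -match $pattern     # True

$lowerMatches, $upperMatches, $digitMatches, $punctMatches, $spaceMatches
```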

   Note  When working with regular expressions, I like to put my RegEx pattern into single quotation marks
   (string literal) to avoid any potentially unexpected string expansion issues that could arise from using
   double quotation marks (expanding string).
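The difference between the two kinds of quotation marks can be sketched as follows. The $tail variable is hypothetical and exists only to show what expansion does to a pattern.

```powershell
# Hypothetical variable, defined only to show the difference between the two quoting styles.
$tail = 'xyz'

$literal   = '[^a-z]$tail'   # single quotation marks: stored exactly as typed
$expanding = "[^a-z]$tail"   # double quotation marks: $tail is expanded first

$literal      # [^a-z]$tail
$expanding    # [^a-z]xyz
```

With single quotation marks, the pattern that reaches the RegEx engine is the one I typed.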

How do I do RegEx?

I want to replace non-alphabetic characters in a string. Here is a string I can use to perform my test:

$string = 'abcdefg12345HIJKLMNOP!@#$%qrs)(*&^TUVWXyz'

I also assign my regular expression pattern to a variable. As I noted earlier, I use single quotation marks (like I did in my test string). Here is my RegEx pattern:

$pattern = '[^a-zA-Z]'

I happen to know that there is a Replace method in the .NET Framework System.String class. Here is what it might look like if I call the System.String Replace method:

$string.Replace($pattern,' ')

Unfortunately, this does not work, and all of the non-alphabetic characters are still in the output string. This is because the Replace method from the System.String class replaces straight-out strings, and it does not accept a RegEx pattern.
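The behavior is easy to confirm. In this sketch, the second call uses a literal substring that I picked for illustration, and the variable names for the results are my own.

```powershell
$string  = 'abcdefg12345HIJKLMNOP!@#$%qrs)(*&^TUVWXyz'
$pattern = '[^a-zA-Z]'

# String.Replace searches for the literal text "[^a-zA-Z]", which does not appear in $string.
$unchanged = $string.Replace($pattern, ' ')
$unchanged -ceq $string       # True: nothing was replaced

# The method works fine when the search text really is in the string.
$literalReplace = $string.Replace('12345', ' ')
$literalReplace               # abcdefg HIJKLMNOP!@#$%qrs)(*&^TUVWXyz
```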

Using the Replace operator

Luckily, I can use the -replace operator in Windows PowerShell to do the replacement. I want to replace any non-alphabetic character with a blank space in my output. I start with the string that is stored in the $string variable, and I tell the -replace operator to look for matches of the RegEx pattern that I stored in the $pattern variable and to replace each match with a blank space. Here is the syntax of that command:

$string -replace $pattern, ' '

When I run the code, the following appears in the output pane of my Windows PowerShell ISE:

Image of command output
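In text form, the result can be checked with a short sketch like this one. The $result variable is my own addition. Because every non-alphabetic character becomes exactly one space, each run of five non-letters turns into five spaces, and the string keeps its original length.

```powershell
$string  = 'abcdefg12345HIJKLMNOP!@#$%qrs)(*&^TUVWXyz'
$pattern = '[^a-zA-Z]'

# Capture the output of -replace so that it can be inspected.
$result = $string -replace $pattern, ' '

$result                               # abcdefg     HIJKLMNOP     qrs     TUVWXyz
$result.Length -eq $string.Length     # True: one space per replaced character
```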

The complete script is shown here:

$string = 'abcdefg12345HIJKLMNOP!@#$%qrs)(*&^TUVWXyz'

$pattern = '[^a-zA-Z]'

$string -replace $pattern, ' ' 
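If I need this more than once, the script could be wrapped in a small function. The following is only a sketch: the function name Remove-NonAlphabetic and its parameters are hypothetical, and it adds a + quantifier to the pattern so that each run of non-letters is treated as a single match. It also defaults to an empty replacement, which removes the characters instead of replacing them with spaces.

```powershell
# Hypothetical helper; the name and parameters are assumptions for illustration.
function Remove-NonAlphabetic {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [string]$InputString,

        # Text that replaces each run of non-letters. Empty by default, which removes them.
        [string]$Replacement = ''
    )
    process {
        # The + quantifier makes a whole run of non-letters one match.
        $InputString -replace '[^a-zA-Z]+', $Replacement
    }
}

$string = 'abcdefg12345HIJKLMNOP!@#$%qrs)(*&^TUVWXyz'

$removed = $string | Remove-NonAlphabetic
$removed      # abcdefgHIJKLMNOPqrsTUVWXyz

$spaced = Remove-NonAlphabetic -InputString $string -Replacement ' '
$spaced       # abcdefg HIJKLMNOP qrs TUVWXyz
```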

I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.

Ed Wilson, Microsoft Scripting Guy 

Author

The "Scripting Guys" is a historical title passed from scripter to scripter. The current revision has morphed into our good friend Doctor Scripto who has been with us since the very beginning.

1 comment


  • Danijel James

    Actually, it should read like this:

    $string = 'abcdefg12345HIJKLMNOP!@#$%qrs)(*&^TUVWXyz'

    [regex]$pattern = '[^a-zA-Z]+'

    $string -replace $pattern,''