February 18th, 2011

Speed Up Array Comparisons in PowerShell with a Runtime Regex

Summary: Learn how to speed up array comparisons in Windows PowerShell by using a runtime regular expression.

Hey, Scripting Guy! Question

  Hey, Scripting Guy! I am interested in speeding up comparisons of arrays when I use Windows PowerShell. Can you help me?

—CR

Hey, Scripting Guy! Answer Hello CR,

Microsoft Scripting Guy, Ed Wilson, here. We are still in our Guest Blogger Week, so I will turn your question over to today’s guest blogger, Rob Campbell.


Here is Rob’s description of his scripting background:

I work at a medium-to-large corporate financial institution as an AD and Exchange administrator. I’ve worked in IT for over 35 years, starting as a night operator on IBM System 3 and 360 series mainframes. I never really got the hang of VB, but I love Windows PowerShell. I have done a few large-scale scripts, but most of my scripting work is ad-hoc, on-demand reports and maintenance changes. I don’t know that I can lay claim to any particular area of specialization. The nearest I ever came up with for a good job description was “tactical logician.”

Occasionally, you’ll need to compare the contents of a pair of arrays of strings in Windows PowerShell, and there are some nice built-in tools that help you do that, notably the -contains and -notcontains operators and the Compare-Object cmdlet. Normally I’ll use the -contains and -notcontains operators. Here, we find all the elements of array $b that do and do not appear in array $a:

$a = "red.","blue.","yellow.","green.","orange.","purple."

$b = "blue.","green.","orange.","white.","gray."

$b |? {$a -contains $_}

blue.

green.

orange.

$b |? {$a -notcontains $_}

white.

gray.
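The Compare-Object cmdlet mentioned earlier can answer the same two questions. The following is only a minimal sketch using the same $a and $b; the variable names $common and $onlyInB are made up for illustration, and the switches used are standard parameters of the cmdlet.

```powershell
$a = "red.","blue.","yellow.","green.","orange.","purple."
$b = "blue.","green.","orange.","white.","gray."

# Elements present in both arrays (SideIndicator is "==").
$common = Compare-Object -ReferenceObject $a -DifferenceObject $b -IncludeEqual -ExcludeDifferent |
    ForEach-Object { $_.InputObject }

# Elements present only in $b (SideIndicator is "=>").
$onlyInB = Compare-Object -ReferenceObject $a -DifferenceObject $b |
    Where-Object { $_.SideIndicator -eq "=>" } |
    ForEach-Object { $_.InputObject }
```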

This is great for single instances, repetitive comparisons, or relatively small numbers of arrays. It is concise and intuitive. However, sometimes you need to compare an array to a collection of thousands of arrays. This could be email recipient addresses in message tracking logs, AD group memberships, NTFS access control lists, or virtually any large collection of objects that have multivalued string properties that you need to compare to some other array of string properties. For this kind of scenario, there is another method that provides much better performance—a regular expression (regex) match.

Regular expression matches are most commonly used to match single strings, but here we are talking about array comparisons: comparing one set of multiple values to another set of multiple values. Fortunately, regular expressions can be written to match multiple values at once by using the alternation operator (|).

Some characters are reserved for use as metacharacters in regular expressions, and they must be escaped with a backslash to be interpreted literally in the match. One of these is the period, so for our example, a regular expression to match all the values in our $a array would look like this:

[regex] $a_regex = '^(red\.|blue\.|yellow\.|green\.|orange\.|purple\.)$'
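To see why the escaping matters, here is a small sketch that compares an unescaped and an escaped period. The test string "blueX" and the shortened patterns are made up for illustration.

```powershell
# An unescaped period matches any single character, so "blueX" slips through.
$unescaped = "blueX" -match '^(red.|blue.)$'      # True

# An escaped period matches only a literal period.
$escaped   = "blueX" -match '^(red\.|blue\.)$'    # False
$literal   = "blue." -match '^(red\.|blue\.)$'    # True
```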

All of the values we want to match are grouped in parentheses and separated by a pipe symbol. The following code uses this regex to replace our -contains construct with -match operations.

$b -match $a_regex

blue.

green.

orange.

$b -notmatch $a_regex

white.

gray.
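This works as a replacement for the Where-Object filter because -match and -notmatch behave differently depending on the left-hand operand: with a single string they return a Boolean, and with an array they return the elements that satisfy the test. A brief sketch using the same data (the variable names $isMatch, $common, and $different are only for illustration):

```powershell
$b = "blue.","green.","orange.","white.","gray."
[regex] $a_regex = '^(red\.|blue\.|yellow\.|green\.|orange\.|purple\.)$'

# Single string on the left: the result is a Boolean.
$isMatch = "blue." -match $a_regex

# Array on the left: the result is the subset of elements that match,
# or that do not match in the case of -notmatch.
$common    = $b -match $a_regex
$different = $b -notmatch $a_regex
```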

Both methods produce the same result, but there is a substantial difference in the amount of time it takes for them to do it.

You could use the following code if you had to compare $a to 10,000 $b arrays, finding the common or different elements in each one.

$counter = 1..10000

$test = measure-command {

foreach ($i in $counter){

$b | where {$a -contains $_}

$b | where {$a -notcontains $_}

}

}

$test.totalseconds

9.2625084

Here we run the same test using our regex match 10,000 times.

$counter = 1..10000

$test = measure-command {

foreach ($i in $counter){

$b -match $a_regex

$b -notmatch $a_regex

}

}

$test.totalseconds

0.6718625

The actual test times will vary depending on the processor and background processes, but I have found the alternation regex to consistently perform many times faster than the -contains operators.

Now, how can we use this in a script when what is in $a isn’t known until runtime? Simple. We wait until runtime to create our regular expression, based on what is in $a.

First, we want to make sure that we do a literal match on our incoming array elements. Fortunately, the regex type contains a static method (escape) that will do this for us. So, starting with our $a array, first we escape all the special characters in each element:

$a |foreach {[regex]::escape($_)}

red\.

blue\.

yellow\.

green\.

orange\.

purple\.
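The period is not the only character that needs this treatment. As a sketch with made-up values, here is what the Escape method does with a plus sign and a pipe, and why an unescaped pipe in the data would be a problem for an alternation regex:

```powershell
# Hypothetical values that contain regex metacharacters.
$values = 'a+b', 'sales|marketing', 'v1.5'

$escaped = $values | ForEach-Object { [regex]::Escape($_) }
$escaped
# a\+b
# sales\|marketing
# v1\.5

# Without escaping, the pipe splits one value into two alternatives,
# so the unrelated string "sales" matches.
$rawResult     = 'sales' -match '^(sales|marketing)$'                                # True
$escapedResult = 'sales' -match ('^(' + [regex]::Escape('sales|marketing') + ')$')   # False
```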

All the reserved characters are now escaped for us. Next, we join the elements of our array with the pipe symbol that denotes alternation:

($a |foreach {[regex]::escape($_)}) -join "|"

red\.|blue\.|yellow\.|green\.|orange\.|purple\.

We are almost there. All that is left is to add our grouping parentheses, anchors, and any regex options we want. I’m going to add the (?i) that denotes a case-insensitive regex to this one, and use the beginning-of-line (^) and end-of-line ($) anchors:

'(?i)^(' + (($a |foreach {[regex]::escape($_)}) -join "|") + ')$'

(?i)^(red\.|blue\.|yellow\.|green\.|orange\.|purple\.)$
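The effect of the (?i) option is easiest to see by calling the IsMatch method on a [regex] object directly, because a .NET regular expression is case-sensitive unless told otherwise. A short sketch with a shortened pattern and made-up variable names:

```powershell
$caseSensitive   = [regex] '^(red\.|blue\.)$'
$caseInsensitive = [regex] '(?i)^(red\.|blue\.)$'

$withoutOption = $caseSensitive.IsMatch('RED.')     # False
$withOption    = $caseInsensitive.IsMatch('RED.')   # True
```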

There is our regex. All that is left is to assign it to a variable and cast it to the proper type:

[regex] $a_regex = '(?i)^(' + (($a |foreach {[regex]::escape($_)}) -join "|") + ')$'

$a_regex.tostring()

(?i)^(red\.|blue\.|yellow\.|green\.|orange\.|purple\.)$

Now our test script looks like this:

$a = "red.","blue.","yellow.","green.","orange.","purple."

$b = "blue.","green.","orange.","white.","gray."

[regex] $a_regex = '(?i)^(' + (($a |foreach {[regex]::escape($_)}) -join "|") + ')$'

$b -match $a_regex

blue.

green.

orange.

$b -notmatch $a_regex

white.

gray.
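For the many-arrays scenario described earlier, the regex only needs to be built once and can then be reused for every comparison. The following is one possible way to package that, not a definitive implementation: the function name New-AlternationRegex, the $collection sample data, and the empty-input check are all assumptions added for illustration.

```powershell
# Hypothetical helper: builds an anchored, case-insensitive alternation
# regex from an array of literal strings.
function New-AlternationRegex {
    param([string[]] $Values)

    # An empty alternation would match only the empty string, so refuse it.
    if (-not $Values) {
        throw "At least one value is required."
    }

    $escaped = $Values | ForEach-Object { [regex]::Escape($_) }
    [regex] ('(?i)^(' + ($escaped -join '|') + ')$')
}

$a = "red.","blue.","yellow.","green.","orange.","purple."
$a_regex = New-AlternationRegex -Values $a   # build once, reuse below

# Made-up collection of arrays to compare against $a.
$collection = @("blue.","white."), @("gray.","green.","orange.")

foreach ($b in $collection) {
    $common    = @($b -match $a_regex)
    $different = @($b -notmatch $a_regex)
    "{0} common, {1} different" -f $common.Count, $different.Count
}
```

Wrapping the results in @( ) keeps them as arrays even when only one element matches, so the Count property behaves consistently.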

CR, that is all there is to using regular expressions to speed up comparisons of arrays. Thank you, Rob, for sharing with us today. This brings Guest Blogger Week to a close. Join me tomorrow for Weekend Scripter.

I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.

Ed Wilson, Microsoft Scripting Guy
