Parsing Text with PowerShell (2/3)

Avatar

Steve

This is the second post in a three-part series.

  • Part 1:
    • Useful methods on the String class
    • Introduction to Regular Expressions
    • The Select-String cmdlet
  • Part 2:
    • the -split operator
    • the -match operator
    • the switch statement
    • the Regex class
  • Part 3:
    • a real world, complete and slightly bigger, example of a switch-based parser

The -split operator

The -split operator splits one or more strings into substrings.

The first example is a name-value pattern, which is a common parsing task. Note the usage of the Max-substrings parameter to the -split operator.
We want to ensure that it doesn’t matter if the value contains the character to split on.

$text = "Description=The '=' character is used for assigning values to a variable"
$name, $value = $text -split "=", 2

@"
Name  =  $name
Value =  $value
"@
Name  =  Description
Value =  The '=' character is used for assigning values to a variable

When the line to parse contains fields separated by a well known separator, that is never a part of the field values, we can use the -split operator in combination with multiple assignment to get the fields into variables.

$name, $location, $occupation = "Spider Man,New York,Super Hero" -split ','

If only the location is of interest, the unwanted items can be assigned to $null.

$null, $location, $null = "Spider Man,New York,Super Hero" -split ','

$location
New York

If there are many fields, assigning to null doesn’t scale well. Indexing can be used instead, to get the fields of interest.

$inputText = "x,Staffan,x,x,x,x,x,x,x,x,x,x,Stockholm,x,x,x,x,x,x,x,x,11,x,x,x,x"
$name, $location, $age = ($inputText -split ',')[1,12,21]

$name
$location
$age
Staffan
Stockholm
11

It is almost always a good idea to create an object that gives context to the different parts.

$inputText = "x,Steve,x,x,x,x,x,x,x,x,x,x,Seattle,x,x,x,x,x,x,x,x,22,x,x,x,x"
$name, $location, $age = ($inputText -split ',')[1,12,21]
[PSCustomObject] @{
    Name = $name
    Location = $location
    Age = [int] $age
}
Name  Location Age
----  -------- ---
Steve Seattle   22

Instead of creating a PSCustomObject, we can create a class. It’s a bit more to type, but we can get more help from the engine, for example with tab completion.

The example below also shows an example of type conversion, where the default string to number conversion doesn’t work.
The age field is handled by PowerShell’s built-in type conversion. It is of type [int], and PowerShell will handle the conversion from string to int,
but in some cases we need to help out a bit. The ShoeSize field is also an [int], but the data is hexadecimal,
and without the hex specifier (‘0x’), this conversion fails for some values, and provides incorrect results for the others.

class PowerSheller {
    [string] $Name
    [string] $Location
    [int] $Age
    [int] $ShoeSize
}

$inputText = "x,Staffan,x,x,x,x,x,x,x,x,x,x,Stockholm,x,x,x,x,x,x,x,x,33,x,11d,x,x"
$name, $location, $age, $shoeSize = ($inputText -split ',')[1,12,21,23]
[PowerSheller] @{
    Name = $name
    Location = $location
    Age = $age
    # ShoeSize is expressed in hex, with no '0x' because reasons ๐Ÿ™‚
    # And yes, it's in millimeters.
    ShoeSize = [Convert]::ToInt32($shoeSize, 16)
}
Name    Location  Age ShoeSize
----    --------  --- --------
Staffan Stockholm  33      285

The split operator’s first argument is actually a regex (by default, can be changed with options).
I use this on long command lines in log files (like those given to compilers) where there can be hundreds of options specified. This makes it hard to see if a certain option is specified or not, but when split into their own lines, it becomes trivial.
The pattern below uses a positive lookahead assertion.
It can be very useful to make patterns match only in a given context, like if they are, or are not, preceded or followed by another pattern.

$cmdline = "cl.exe /D Bar=1 /I SomePath /D Foo  /O2 /I SomeOtherPath /Debug a1.cpp a3.cpp a2.cpp"

$cmdline -split "\s+(?=[-/])"
cl.exe
/D Bar=1
/I SomePath
/D Foo
/O2
/I SomeOtherPath
/Debug a1.cpp a2.cpp

Breaking down the regex, by rewriting it with the x option:

(?x)      # ignore whitespace in the pattern, and enable comments after '#'
\s+       # one or more spaces
(?=[-/])  # only match the previous spaces if they are followed by any of '-' or '/'.

Splitting with a scriptblock

The -split operator also comes in another form, where you can pass it a scriptblock instead of a regular expression.
This allows for more complicated logic, that can be hard or impossible to express as a regular expression.

The scriptblock accepts two parameters, the text to split and the current index. $_ is bound to the character at the current index.

function SplitWhitespaceInMiddleOfText {
    param(
        [string]$Text,
        [int] $Index
    )
    if ($Index -lt 10 -or $Index -gt 40){
        return $false
    }
    $_ -match '\s'
}

$inputText = "Some text that only needs splitting in the middle of the text"
$inputText -split $function:SplitWhitespaceInMiddleOfText
Some text that
only
needs
splitting
in
the middle of the text

The $function:SplitWhitespaceInMiddleOfText syntax is a way to get to content (the scriptblock that implements it) of the function, just as $env:UserName gets the content of an item in the env: drive.
It provides a way to document and/or reuse the scriptblock.

The -match operator

The -match operator works in conjunction with the $matches automatic variable. Each time a -match or a -notmatch succeeds, the $matches variable is populated so that each capture group gets its own entry. If the capture group is named, the key will be the name of the group, otherwise it will be the index.

As an example:

if ('a b c' -match '(\w) (?<named>\w) (\w)'){
    $matches
}
Name                           Value
----                           -----
named                          b
2                              c
1                              a
0                              a b c

Notice that the indices only increase on groups without names. I.E. the indices of later groups change when a group is named.

Armed with the regex knowledge from the earlier post, we can write the following:

PS> "    10,Some text" -match '^\s+(\d+),(.+)'
True
PS> $matches
Name                           Value
----                           -----
2                              Some text
1                              10
0                              10,Some text

or with named groups

PS> "    10,Some text" -match '^\s+(?<num>\d+),(?<text>.+)'
True
PS> $matches
Name                           Value
----                           -----
num                            10
text                           Some text
0                              10,Some text

The important thing here is to put parentheses around the parts of the pattern that we want to extract. That is what creates the capture groups that allow us to reference those parts of the matching text, either by name or by index.

Combining this into a function makes it easy to use:

function ParseMyString($text){
    if ($text -match '^\s+(\d+),(.+)') {
        [PSCustomObject] @{
            Number = [int] $matches[1]
            Text    = $matches[2]
        }
    }
    else {
        Write-Warning "ParseMyString: Input `$text` doesn't match pattern"
    }
}

ParseMyString "    10,Some text"

 

Number  Text
------- ----
     10 Some text

Notice the type conversion when assigning the Number property. As long as the number is in range of an integer, this will always succeed, since we have made a successful match in the if statement above. ([long] or [bigint] could be used. In this case I provide the input, and I have promised myself to stick to a range that fits in a 32-bit integer.)
Now we will be able to sort or do numerical operations on the Number property, and it will behave like we want it to – as a number, not as a string.

The switch statement

Now we’re at the big guns ๐Ÿ™‚

The switch statement in PowerShell has been given special functionality for parsing text.
It has two flags that are useful for parsing text and files with text in them. -regex and -file.

When specifying -regex, the match clauses that are strings are treated as regular expressions. The switch statement also sets the $matches automatic variable.

When specifying -file, PowerShell treats the input as a file name, to read input from, rather than as a value statement.

Note the use of a ScriptBlock instead of a string as the match clause to determine if we should skip preamble lines.

class ParsedOutput {
    [int] $Number
    [string] $Text

    [string] ToString() { return "{0} ({1})" -f $this.Text, $this.Number }
}

$inputData =
    "Preamble line",
    "LastLineOfPreamble",
    "    10,Some Text",
    "    Some other text,20"

$inPreamble = $true
switch -regex ($inputData) {

    {$inPreamble -and $_ -eq 'LastLineOfPreamble'} { $inPreamble = $false; continue }

    "^\s+(?<num>\d+),(?<text>.+)" {  # this matches the first line of non-preamble input
        [ParsedOutput] @{
            Number = $matches.num
            Text = $matches.text
        }
        continue
    }

    "^\s+(?<text>[^,]+),(?<num>\d+)" { # this matches the second line of non-preamble input
        [ParsedOutput] @{
            Number = $matches.num
            Text = $matches.text
        }
        continue
    }
}

 

Number  Text
------ ----
    10 Some Text
    20 Some other text

The pattern [^,]+ in the text group in the code above is useful. It means match anything that is not a comma ,. We are using the any-of construct [], and within those brackets, ^ changes meaning from the beginning of the line to anything but.

That is useful when we are matching delimited fields. A requirement is that the delimiter cannot be part of the set of allowed field values.

The regex class

regex is a type accelerator for System.Text.RegularExpressions.Regex. It can be useful when porting code from C#, and sometimes when we want to get more control in situations when we have many matches of a capture group. It also allows us to pre-create the regular expressions which can matter in performance sensitive scenarios, and to specify a timeout.

One instance where the regex class is needed is when you have multiple captures of a group.

Consider the following:

TextPattern
a,b,c,(\w,)+

If the match operator is used, $matches will contain

Name                           Value
----                           -----
1                              c,
0                              a,b,c,

The pattern matched three times, for a,, b, and c,. However, only the last match is preserved in the $matches dictionary.
However, the following will allow us to get to all the captures of the group:

[regex]::match('a,b,c,', '(\w,)+').Groups[1].Captures
Index Length Value
----- ------ -----
    0      2 a,
    2      2 b,
    4      2 c,

Below is an example that uses the members of the Regex class to parse input data

class ParsedOutput {
    [int] $Number
    [string] $Text

    [string] ToString() { return "{0} ({1})" -f $this.Text, $this.Number }
}

$inputData =
    "    10,Some Text",
    "    Some other text,20"  # this text will not match

[regex] $pattern = "^\s+(\d+),(.+)"

foreach($d in $inputData){
    $match = $pattern.Match($d)
    if ($match.Success){
        $number, $text = $match.Groups[1,2].Value
        [ParsedOutput] @{
            Number = $number
            Text = $text
        }
    }
    else {
        Write-Warning "regex: '$d' did not match pattern '$pattern'"
    }
}

 

WARNING: regex: '    Some other text,20' did not match pattern '^\s+(\d+),(.+)'
Number Text
------ ----
    10 Some Text

It may surprise you that the warning appears before the output. PowerShell has a quite complex formatting system at the end of the pipeline, which treats pipeline output different than other streams. Among other things, it buffers output in the beginning of a pipeline to calculate sensible column widths. This works well in practice, but sometimes gives strange reordering of output on different streams.

Summary

In this post we have looked at how the -split operator can be used to split a string in parts, how the -match operator can be used to extract different patterns from some text, and how the powerful switch statement can be used to match against multiple patterns.

We ended by looking at how the regex class, which in some cases provides a bit more control, but at the expense of ease of use. This concludes the second part of this series. Next time, we will look at a complete, real world, example of a switch-based parser.

Thanks to Jason Shirk, Mathias Jessen and Steve Lee for reviews and feedback.

Staffan Gustafsson, @StaffanGson, powercode@github

Staffan works at DICE in Stockholm, Sweden, as a Software Engineer and has been using PowerShell since the first public beta.
He was most seriously pleased when PowerShell was open sourced, and has since contributed bug fixes, new features and performance improvements.
Staffan is a speaker at PSConfEU and is always happy to talk PowerShell.

Avatar
Steve Lee

Follow Steve   

No Comments.