This is the third and final post in a three-part series.
- Part 1:
  - Useful methods on the String class
  - Introduction to Regular Expressions
  - The Select-String cmdlet
- Part 2:
  - the -split operator
  - the -match operator
  - the switch statement
  - the Regex class
- Part 3:
  - a real world, complete and slightly bigger, example of a switch-based parser
    - General structure of a switch-based parser
    - The real world example
In the previous posts, we looked at the different operators that are available to us in PowerShell.
When analyzing crashes at DICE, I noticed that some of the C++ runtime binaries were missing debug symbols. They should be available for download from Microsoft’s public symbol server, and most versions were there. However, due to some process errors at DevDiv, some builds were released publicly without available debug symbols. In some cases, those missing symbols prevented us from debugging those crashes, and in all cases, they triggered my developer OCD.
So, to give actionable feedback to Microsoft, I scripted a debugger (cdb.exe in this case) to give a verbose list of the loaded modules, and parsed the output with PowerShell, which was also later used to group and filter the resulting data set. I sent this data to Microsoft, and 5 days later, the missing symbols were available for download. Mission accomplished!
This post will describe the parser I wrote for this task (it turned out that I had good use for it for other tasks later), and the general structure is applicable to most parsing tasks.
The example will show how a switch-based parser would look when the input data isn’t as tidy as it normally is in examples, but messy, as real world data often is.
General Structure of a switch Based Parser
Depending on the structure of our input, the code must be organized in slightly different ways.
Input may have a record start that differs by indentation or some distinct token, like:
Foo                    <- Record start - No whitespace at the beginning of the line
    Prop1=Staffan      <- Properties for the record - starts with whitespace
    Prop3 =ValueN
Bar
    Prop1=Steve
    Prop2=ValueBar2
If the data to be parsed has an explicit start record, it is a bit easier than if it doesn’t have one. We create a new data object when we get a record start, after writing any previously created object to the pipeline. At the end, we need to check if we have parsed a record that hasn’t been written to the pipeline.
The general structure of such a switch-based parser can be as follows:
$inputData = @"
Foo
    Prop1=Value1
    Prop3=Value3
Bar
    Prop1=ValueBar1
    Prop2=ValueBar2
"@ -split '\r?\n' # This regex is useful to split at line endings, with or without carriage return

class SomeDataClass {
    $ID
    $Name
    $Property2
    $Property3
}

# map to project input property names to the properties on our data class
$propertyNameMap = @{
    Prop1 = "Name"
    Prop2 = "Property2"
    Prop3 = "Property3"
}

$currentObject = $null
switch -regex ($inputData) {
    '^(\S.*)' {
        # record start pattern, in this case a line that doesn't start with whitespace.
        if ($null -ne $currentObject) {
            $currentObject # output to pipeline if we have a previous data object
        }
        $currentObject = [SomeDataClass] @{ # create new object for this record
            Id = $matches.1 # with Id like Foo or Bar
        }
        continue
    }
    # set the properties on the data object
    '^\s+([^=]+)=(.*)' {
        $name, $value = $matches[1, 2]
        # project property names; fall back to the input name if there is no mapping
        $propName = $propertyNameMap[$name]
        if ($null -eq $propName) {
            $propName = $name
        }
        # assign the parsed value to the projected property name
        $currentObject.$propName = $value
        continue
    }
}

if ($currentObject) { # Handle the last object if any
    $currentObject # output to pipeline
}
ID  Name      Property2 Property3
--  ----      --------- ---------
Foo Value1              Value3
Bar ValueBar1 ValueBar2
Alternatively, we may have input where the records are separated by a blank line, but without any obvious record start.
commitId=1234 <- In this case, a commitId is first in a record
description=Update readme.md
<- the blank line separates records
user=Staffan <- For this record, a user property comes first
commitId=1235
description=Fix bug.md
In this case the structure of the code looks a bit different. We create an object at the beginning, but keep track of whether it is dirty or not. If we get to the end with a dirty object, we must output it.
$inputData = @"
commit=1234
desc=Update readme.md

user=Staffan
commit=1235
desc=Bug fix
"@ -split "\r?\n"

class SomeDataClass {
    [int] $CommitId
    [string] $Description
    [string] $User
}

# map to project input property names to the properties on our data class
# we only need to provide the ones that are different. 'User' works fine as it is.
$propertyNameMap = @{
    commit = "CommitId"
    desc   = "Description"
}

$currentObject = [SomeDataClass]::new()
$objectDirty = $false
switch -regex ($inputData) {
    # set the properties on the data object
    '^([^=]+)=(.*)$' {
        # parse a name/value
        $name, $value = $matches[1, 2]
        # project property names
        $propName = $propertyNameMap[$name]
        if ($null -eq $propName) {
            $propName = $name
        }
        # assign the projected property
        $currentObject.$propName = $value
        $objectDirty = $true
        continue
    }
    '^\s*$' {
        # separator pattern, in this case any blank line
        if ($objectDirty) {
            $currentObject                          # output to pipeline
            $currentObject = [SomeDataClass]::new() # create new object
            $objectDirty = $false                   # and mark it as not dirty
        }
    }
    default {
        Write-Warning "Unexpected input: '$_'"
    }
}

if ($objectDirty) { # Handle the last object if any
    $currentObject # output to pipeline
}
CommitId Description      User
-------- -----------      ----
    1234 Update readme.md
    1235 Bug fix          Staffan
The Real World Example
I have adapted this sample slightly so that I get the loaded modules from a running process instead of from my crash dumps. The format of the output from the debugger is the same. The following command launches a command line debugger on notepad, with a script that gives a verbose listing of the loaded modules, and quits:
# we need to muck around with the console output encoding to handle the trademark chars
# imagine no encodings
# it's easy if you try
# no code pages below us
# above us only sky
[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding("iso-8859-1")
$proc = Start-Process notepad -PassThru
Start-Sleep -Seconds 1
$cdbOutput = cdb -y 'srv*c:\symbols*http://msdl.microsoft.com/download/symbols' -c ".reload -f;lmv;q" -p $proc.Id
The output of the command above is here for those who want to follow along but who aren’t running Windows or don’t have cdb.exe installed.
The (abbreviated) output looks like this:
Microsoft (R) Windows Debugger Version 10.0.16299.15 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.
*** wait with pending attach
************* Path validation summary **************
Response Time (ms) Location
Deferred srv*c:\symbols*http://msdl.microsoft.com/download/symbols
Symbol search path is: srv*c:\symbols*http://msdl.microsoft.com/download/symbols
Executable search path is:
ModLoad: 00007ff6`e9da0000 00007ff6`e9de3000 C:\Windows\system32\notepad.exe
...
ModLoad: 00007ffe`97d80000 00007ffe`97db1000 C:\WINDOWS\SYSTEM32\ntmarta.dll
(98bc.40a0): Break instruction exception - code 80000003 (first chance)
ntdll!DbgBreakPoint:
00007ffe`9cd53050 cc int 3
0:007> cdb: Reading initial command '.reload -f;lmv;q'
Reloading current modules
.....................................................
start end module name
00007ff6`e9da0000 00007ff6`e9de3000 notepad (pdb symbols) c:\symbols\notepad.pdb\2352C62CDF448257FDBDDA4081A8F9081\notepad.pdb
Loaded symbol image file: C:\Windows\system32\notepad.exe
Image path: C:\Windows\system32\notepad.exe
Image name: notepad.exe
Image was built with /Brepro flag.
Timestamp: 329A7791 (This is a reproducible build file hash, not a timestamp)
CheckSum: 0004D15F
ImageSize: 00043000
File version: 10.0.17763.1
Product version: 10.0.17763.1
File flags: 0 (Mask 3F)
File OS: 40004 NT Win32
File type: 1.0 App
File date: 00000000.00000000
Translations: 0409.04b0
CompanyName: Microsoft Corporation
ProductName: Microsoft??? Windows??? Operating System
InternalName: Notepad
OriginalFilename: NOTEPAD.EXE
ProductVersion: 10.0.17763.1
FileVersion: 10.0.17763.1 (WinBuild.160101.0800)
FileDescription: Notepad
LegalCopyright: ??? Microsoft Corporation. All rights reserved.
...
00007ffe`9ccb0000 00007ffe`9ce9d000 ntdll (pdb symbols) c:\symbols\ntdll.pdb\B8AD79538F2730FD9BACE36C9F9316A01\ntdll.pdb
Loaded symbol image file: C:\WINDOWS\SYSTEM32\ntdll.dll
Image path: C:\WINDOWS\SYSTEM32\ntdll.dll
Image name: ntdll.dll
Image was built with /Brepro flag.
Timestamp: E8B54827 (This is a reproducible build file hash, not a timestamp)
CheckSum: 001F20D1
ImageSize: 001ED000
File version: 10.0.17763.194
Product version: 10.0.17763.194
File flags: 0 (Mask 3F)
File OS: 40004 NT Win32
File type: 2.0 Dll
File date: 00000000.00000000
Translations: 0409.04b0
CompanyName: Microsoft Corporation
ProductName: Microsoft??? Windows??? Operating System
InternalName: ntdll.dll
OriginalFilename: ntdll.dll
ProductVersion: 10.0.17763.194
FileVersion: 10.0.17763.194 (WinBuild.160101.0800)
FileDescription: NT Layer DLL
LegalCopyright: ??? Microsoft Corporation. All rights reserved.
quit:
The output starts with info that I’m not interested in here. I only want to get the detailed information about the loaded modules. It is not until the line
start end module name
that I care about the output.
Also, at the end there is a line that we need to be aware of:
quit:
that is not part of the module output.
To skip the parts of the debugger output that we don’t care about, we have a boolean flag initially set to true.
If that flag is set, we check if the current line, $_, is the module header, in which case we flip the flag.
$inPreamble = $true
switch -regex ($cdbOutput) {
    { $inPreamble -and $_ -eq "start end module name" } { $inPreamble = $false; continue }
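To see the flag in isolation, here is a minimal, self-contained sketch. The input lines are made up for illustration and only mimic the shape of the debugger output. The extra `{ $inPreamble }` clause that explicitly drops preamble lines, and the `$moduleLines` variable, are additions for this sketch and not part of the parser described in this post.

```powershell
# Hypothetical input: two preamble lines, the header line, then the lines we care about.
$lines = @(
    'Microsoft (R) Windows Debugger'
    'Reloading current modules'
    'start end module name'
    'module line 1'
    'module line 2'
)

$inPreamble = $true
$moduleLines = switch -regex ($lines) {
    # A script block can be used as a clause: its action runs when the block evaluates to $true.
    { $inPreamble -and $_ -eq 'start end module name' } { $inPreamble = $false; continue }

    # Still in the preamble: drop the line (an addition for this sketch).
    { $inPreamble } { continue }

    # Everything after the header line is passed through unchanged.
    default { $_ }
}

$moduleLines
```

Only the two lines after the header are emitted. The header itself is consumed by the first clause, which also flips the flag so that the script block clauses no longer match.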
I have made the parser a separate function that reads its input from the pipeline. This way, I can use the same function to parse module data, regardless of how I got the module data. Maybe it was saved to a file. Or it came from a dump, or a live process. It doesn’t matter, since the parser is decoupled from the data retrieval.
After the sample, there is a breakdown of the more complicated regular expressions used, so don’t despair if you don’t understand them at first. Regular Expressions are notoriously hard to read, so much so that they make Perl look readable in comparison.
# define a class to store the data
class ExecutableModule {
    [string] $Name
    [string] $Start
    [string] $End
    [string] $SymbolStatus
    [string] $PdbPath
    [bool] $Reproducible
    [string] $ImagePath
    [string] $ImageName
    [DateTime] $TimeStamp
    [uint32] $FileHash
    [uint32] $CheckSum
    [uint32] $ImageSize
    [version] $FileVersion
    [version] $ProductVersion
    [string] $FileFlags
    [string] $FileOS
    [string] $FileType
    [string] $FileDate
    [string[]] $Translations
    [string] $CompanyName
    [string] $ProductName
    [string] $InternalName
    [string] $OriginalFilename
    [string] $ProductVersionStr
    [string] $FileVersionStr
    [string] $FileDescription
    [string] $LegalCopyright
    [string] $LegalTrademarks
    [string] $LoadedImageFile
    [string] $PrivateBuild
    [string] $Comments
}

<#
.SYNOPSIS
Runs a debugger on a program to dump its loaded modules
#>
function Get-ExecutableModuleRawData {
    param ([string] $Program)
    $consoleEncoding = [Console]::OutputEncoding
    [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding("iso-8859-1")
    try {
        $proc = Start-Process $Program -PassThru
        Start-Sleep -Seconds 1 # sleep for a while so modules are loaded
        cdb -y 'srv*c:\symbols*http://msdl.microsoft.com/download/symbols' -c ".reload -f;lmv;q" -p $proc.Id
        $proc.Close()
    }
    finally {
        [Console]::OutputEncoding = $consoleEncoding
    }
}

<#
.SYNOPSIS
Converts verbose module data from windows debuggers into ExecutableModule objects.
#>
function ConvertTo-ExecutableModule {
    [OutputType([ExecutableModule])]
    param (
        [Parameter(ValueFromPipeline)]
        [string[]] $ModuleRawData
    )
    begin {
        $currentObject = $null
        $preamble = $true
        $propertyNameMap = @{
            'File flags'      = 'FileFlags'
            'File OS'         = 'FileOS'
            'File type'       = 'FileType'
            'File date'       = 'FileDate'
            'File version'    = 'FileVersion'
            'Product version' = 'ProductVersion'
            'Image path'      = 'ImagePath'
            'Image name'      = 'ImageName'
            'FileVersion'     = 'FileVersionStr'
            'ProductVersion'  = 'ProductVersionStr'
        }
    }
    process {
        switch -regex ($ModuleRawData) {

            # skip lines until we get to our sentinel line
            { $preamble -and $_ -eq "start end module name" } { $preamble = $false; continue }

            #00007ff6`e9da0000 00007ff6`e9de3000 notepad (deferred)
            #00007ffe`9ccb0000 00007ffe`9ce9d000 ntdll (pdb symbols) c:\symbols\ntdll.pdb\B8AD79538F2730FD9BACE36C9F9316A01\ntdll.pdb
            '^([0-9a-f`]{17})\s([0-9a-f`]{17})\s+(\S+)\s+\(([^\)]+)\)\s*(.+)?' {
                # see breakdown of the expression later in the post
                # on record start, output the currentObject, if any is set
                if ($null -ne $currentObject) {
                    $currentObject
                }
                $start, $end, $module, $pdbKind, $pdbPath = $matches[1..5]
                # create an instance of the object that we are adding info from the current record into.
                $currentObject = [ExecutableModule] @{
                    Start        = $start
                    End          = $end
                    Name         = $module
                    SymbolStatus = $pdbKind
                    PdbPath      = $pdbPath
                }
                continue
            }
            '^\s+Image was built with /Brepro flag.' {
                $currentObject.Reproducible = $true
                continue
            }
            '^\s+Timestamp:\s+[^\(]+\((?<timestamp>.{8})\)' {
                # see breakdown of the regular expression later in the post
                # Timestamp: Mon Jan 7 23:42:30 2019 (5C33D5D6)
                $intValue = [Convert]::ToInt32($matches.timestamp, 16)
                $currentObject.TimeStamp = [DateTime]::new(1970, 01, 01, 0, 0, 0, [DateTimeKind]::Utc).AddSeconds($intValue)
                continue
            }
            '^\s+TimeStamp:\s+(?<value>.{8}) \(This' {
                # Timestamp: E78937AC (This is a reproducible build file hash, not a timestamp)
                $currentObject.FileHash = [Convert]::ToUInt32($matches.value, 16)
                continue
            }
            '^\s+Loaded symbol image file: (?<imageFile>[^\)]+)' {
                $currentObject.LoadedImageFile = $matches.imageFile
                continue
            }
            '^\s+Checksum:\s+(?<checksum>\S+)' {
                $currentObject.Checksum = [Convert]::ToUInt32($matches.checksum, 16)
                continue
            }
            '^\s+Translations:\s+(?<value>\S+)' {
                $currentObject.Translations = $matches.value.Split(".")
                continue
            }
            '^\s+ImageSize:\s+(?<imageSize>.{8})' {
                $currentObject.ImageSize = [Convert]::ToUInt32($matches.imageSize, 16)
                continue
            }
            '^\s{4}(?<name>[^:]+):\s+(?<value>.+)' {
                # see breakdown of the regular expression later in the post
                # This part is any 'name: value' pattern
                $name, $value = $matches['name', 'value']
                # project the property name
                $propName = $propertyNameMap[$name]
                $propName = if ($null -eq $propName) { $name } else { $propName }
                # note the dynamic property name in the assignment
                # this will fail if the object doesn't have a member with the specified name
                $currentObject.$propName = $value
                continue
            }
            'quit:' {
                # ignore and exit
                break
            }
            default {
                # When writing the parser, it can be useful to include a line like the one below
                # to see the cases that are not handled by the parser
                # Write-Warning "missing case for '$_'. Unexpected output format from cdb.exe"

                # skip lines that don't match the patterns we are interested in,
                # like the start/end/modulename header and the quit: output
                continue
            }
        }
    }
    end {
        # this is needed to output the last object
        if ($null -ne $currentObject) {
            $currentObject
        }
    }
}

Get-ExecutableModuleRawData Notepad |
    ConvertTo-ExecutableModule |
    Sort-Object ProductVersion, Name |
    Format-Table -Property Name, FileVersionStr, ProductVersion, FileDescription
Name            FileVersionStr                         ProductVersion FileDescription
----            --------------                         -------------- ---------------
PROPSYS         7.0.17763.1 (WinBuild.160101.0800)     7.0.17763.1    Microsoft Property System
ADVAPI32        10.0.17763.1 (WinBuild.160101.0800)    10.0.17763.1   Advanced Windows 32 Base API
bcrypt          10.0.17763.1 (WinBuild.160101.0800)    10.0.17763.1   Windows Cryptographic Primitives Library
...
uxtheme         10.0.17763.1 (WinBuild.160101.0800)    10.0.17763.1   Microsoft UxTheme Library
win32u          10.0.17763.1 (WinBuild.160101.0800)    10.0.17763.1   Win32u
WINSPOOL        10.0.17763.1 (WinBuild.160101.0800)    10.0.17763.1   Windows Spooler Driver
KERNELBASE      10.0.17763.134 (WinBuild.160101.0800)  10.0.17763.134 Windows NT BASE API Client DLL
wintypes        10.0.17763.134 (WinBuild.160101.0800)  10.0.17763.134 Windows Base Types DLL
SHELL32         10.0.17763.168 (WinBuild.160101.0800)  10.0.17763.168 Windows Shell Common Dll
...
windows_storage 10.0.17763.168 (WinBuild.160101.0800)  10.0.17763.168 Microsoft WinRT Storage API
CoreMessaging   10.0.17763.194                         10.0.17763.194 Microsoft CoreMessaging Dll
gdi32full       10.0.17763.194 (WinBuild.160101.0800)  10.0.17763.194 GDI Client DLL
ntdll           10.0.17763.194 (WinBuild.160101.0800)  10.0.17763.194 NT Layer DLL
RMCLIENT        10.0.17763.194 (WinBuild.160101.0800)  10.0.17763.194 Resource Manager Client
RPCRT4          10.0.17763.194 (WinBuild.160101.0800)  10.0.17763.194 Remote Procedure Call Runtime
combase         10.0.17763.253 (WinBuild.160101.0800)  10.0.17763.253 Microsoft COM for Windows
COMCTL32        6.10 (WinBuild.160101.0800)            10.0.17763.253 User Experience Controls Library
urlmon          11.00.17763.168 (WinBuild.160101.0800) 11.0.17763.168 OLE32 Extensions for Win32
iertutil        11.00.17763.253 (WinBuild.160101.0800) 11.0.17763.253 Run time utility for Internet Explorer
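The begin/process/end layout is what lets the parser sit in the middle of a pipeline and keep state between input lines, no matter where the lines come from. A stripped-down sketch of the same shape follows. `ConvertFrom-NameValueRecord`, its input format and the temporary file are hypothetical and only there to show the structure; they have nothing to do with cdb.

```powershell
function ConvertFrom-NameValueRecord {
    param (
        [Parameter(ValueFromPipeline)]
        [string[]] $Line
    )
    begin {
        # state that must survive between pipeline items
        $current = $null
    }
    process {
        switch -regex ($Line) {
            '^(\S.*)' {
                # record start: emit the previous record, if any, and begin a new one
                if ($null -ne $current) { [pscustomobject] $current }
                $current = [ordered] @{ Id = $matches[1] }
                continue
            }
            '^\s+([^=]+)=(.*)' {
                # indented name=value line: add it to the record in progress
                if ($null -ne $current) { $current[$matches[1]] = $matches[2] }
                continue
            }
        }
    }
    end {
        # emit the last record
        if ($null -ne $current) { [pscustomobject] $current }
    }
}

$lines = 'Foo', '  Prop1=Value1', 'Bar', '  Prop1=ValueBar1'

# Source 1: an in-memory array
$fromArray = $lines | ConvertFrom-NameValueRecord

# Source 2: a file written earlier (here a temporary file)
$file = New-TemporaryFile
Set-Content -Path $file.FullName -Value $lines
$fromFile = Get-Content -Path $file.FullName | ConvertFrom-NameValueRecord
Remove-Item -Path $file.FullName
```

Both sources produce the same two objects, because the function only ever sees a stream of strings.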
Regex pattern breakdown
Here is a breakdown of the more complicated patterns, using the ignore pattern whitespace modifier x:
^([0-9a-f`]{17})\s([0-9a-f`]{17})\s+(\S+)\s+\(([^\)]+)\)\s*(.+)?

# example input: 00007ffe`9ccb0000 00007ffe`9ce9d000 ntdll (pdb symbols) c:\symbols\ntdll.pdb\B8AD79538F2730FD9BACE36C9F9316A01\ntdll.pdb

(?x)             # ignore pattern whitespace
^                # the beginning of the line
([0-9a-f`]{17})  # capture expression like 00007ff6`e9da0000 - any hex number or backtick, and exactly 17 of them
\s               # a space
([0-9a-f`]{17})  # capture expression like 00007ff6`e9da0000 - any hex number or backtick, and exactly 17 of them
\s+              # skip one or more spaces
(\S+)            # capture until we get a space - this would match the 'ntdll' part
\s+              # skip one or more spaces
\(               # start parenthesis
([^\)]+)         # capture anything but end parenthesis
\)               # end parenthesis
\s*              # skip zero or more spaces
(.+)?            # optionally capture any symbol file path
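The commented form is not only documentation. Since `(?x)` is an inline option, a pattern written this way can be handed to `-match` or `switch -regex` as it stands. Below is a small sketch that applies it to the sample line; the variable names are chosen for this example, and the single-quoted here-string is used so that PowerShell leaves the backticks and backslashes alone.

```powershell
# The record start pattern, spread over several lines with comments.
# With (?x), whitespace in the pattern is ignored and # starts a comment.
$recordStart = @'
(?x)
^
([0-9a-f`]{17})   # start address
\s
([0-9a-f`]{17})   # end address
\s+
(\S+)             # module name
\s+
\( ([^\)]+) \)    # symbol status, without the parentheses
\s*
(.+)?             # optional symbol file path
'@

$line = '00007ffe`9ccb0000 00007ffe`9ce9d000 ntdll (pdb symbols) c:\symbols\ntdll.pdb\B8AD79538F2730FD9BACE36C9F9316A01\ntdll.pdb'

if ($line -match $recordStart) {
    $start, $end, $module, $pdbKind, $pdbPath = $matches[1..5]
    "$module [$pdbKind] $start-$end"
}
```

The trade-off is that literal spaces in the pattern no longer match anything, so they have to be written as `\s` or `[ ]`.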
Breakdown of the name-value pattern:
^\s+(?<name>[^:]+):\s+(?<value>.+)

# example input: File flags: 0 (Mask 3F)

(?x)             # ignore pattern whitespace
^                # the beginning of the line
\s+              # require one or more spaces
(?<name>[^:]+)   # capture anything that is not a `:` into the named group "name"
:                # require a colon
\s+              # require one or more spaces
(?<value>.+)     # capture everything until the end into the named group "value"
Breakdown of the timestamp pattern:
^\s+Timestamp:\s+[^\(]+\((?<timestamp>.{8})\)

# example input: Timestamp: Mon Jan 7 23:42:30 2019 (5C33D5D6)

(?x)                 # ignore pattern whitespace
^                    # the beginning of the line
\s+                  # require one or more spaces
Timestamp:           # The literal text 'Timestamp:'
\s+                  # require one or more spaces
[^\(]+               # one or more of anything but an open parenthesis
\(                   # a literal '('
(?<timestamp>.{8})   # 8 characters of anything, captured into the group 'timestamp'
\)                   # a literal ')'
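The eight characters captured into the 'timestamp' group are a hexadecimal count of seconds since 1970-01-01 UTC, which is how the parser turns them into a DateTime. A short worked sketch using the example line (the variable names are mine, and the leading whitespace in `$line` is a guess at the original layout):

```powershell
$line = '    Timestamp:        Mon Jan  7 23:42:30 2019 (5C33D5D6)'

if ($line -match '^\s+Timestamp:\s+[^\(]+\((?<timestamp>.{8})\)') {
    # 0x5C33D5D6 = 1546900950 seconds
    $seconds = [Convert]::ToInt32($matches.timestamp, 16)
    $epoch = [DateTime]::new(1970, 1, 1, 0, 0, 0, [DateTimeKind]::Utc)
    $utc = $epoch.AddSeconds($seconds)

    # format with the invariant culture so the result does not depend on regional settings
    $text = $utc.ToString('yyyy-MM-dd HH:mm:ss', [cultureinfo]::InvariantCulture)
    $text   # 2019-01-07 22:42:30
}
```

The result is one hour earlier than the time printed in the example line. That would be consistent with the debugger printing local time on a machine at UTC+1, but that reading is an assumption on my part.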
Gotchas – the Regex Cache
Something that can happen if you are writing a more complicated parser is the following: The parser works well. You have 15 regular expressions in your switch statement and then you get some input you haven’t seen before, so you add a 16th regex. All of a sudden, the performance of your parser tanks. WTF?
The .NET regex implementation has a cache of recently used regexes. You can check the size of it like this:
PS> [regex]::CacheSize
15
# bump it
PS> [regex]::CacheSize = 20
And now your parser is fast(er) again.
Bonus tip
I frequently use PowerShell to write (generate) my code:
Get-ExecutableModuleRawData pwsh |
    Select-String '^\s+([^:]+):' | # this pattern matches the module detail fields
    ForEach-Object { $_.Matches.Groups[1].Value } |
    Select-Object -Unique |
    ForEach-Object -Begin { "class ExecutableModuleData {" } `
        -Process { "    [string] $" + ($_ -replace "\s.", { [char]::ToUpperInvariant($_.Groups[0].Value[1]) }) } `
        -End { "}" }
The output is
class ExecutableModuleData {
[string] $LoadedSymbolImageFile
[string] $ImagePath
[string] $ImageName
[string] $Timestamp
[string] $CheckSum
[string] $ImageSize
[string] $FileVersion
[string] $ProductVersion
[string] $FileFlags
[string] $FileOS
[string] $FileType
[string] $FileDate
[string] $Translations
[string] $CompanyName
[string] $ProductName
[string] $InternalName
[string] $OriginalFilename
[string] $ProductVersion
[string] $FileVersion
[string] $FileDescription
[string] $LegalCopyright
[string] $Comments
[string] $LegalTrademarks
[string] $PrivateBuild
}
It is not complete – I don’t have the fields from the record start, some types are incorrect and when run against some other executables a few other fields may appear. But it is a very good starting point. And way more fun than typing it 🙂
Note that this example is using a new feature of the -replace operator, using a ScriptBlock to determine what to replace with, that was added in PowerShell Core 6.1.
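In isolation, the script block form of -replace looks like this. It is a minimal sketch that needs PowerShell 6.1 or later; the field name is one of those from the debugger output above, and the variable names are just for this example.

```powershell
# Inside the script block, $_ is the [System.Text.RegularExpressions.Match] for each hit.
# '\s.' matches a whitespace character plus the character after it,
# and each match is replaced with that character in upper case.
$fieldName = 'Loaded symbol image file'
$propertyName = $fieldName -replace '\s.', { $_.Value.Trim().ToUpperInvariant() }
$propertyName   # LoadedSymbolImageFile
```

This does the same job as indexing into `$_.Groups[0].Value[1]` in the generator above; `Trim()` is simply another way to drop the leading whitespace character.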
Bonus tip #2
A regular expression construct that I often find useful is non-greedy matching.
The example below shows the effect of the ? modifier, which can be used after * (zero or more) and + (one or more).
# greedy matching - match to the last occurrence of the following character (>)
if("<Tag>Text</Tag>" -match '<(.+)>') { $matches }
Name Value
---- -----
1 Tag>Text</Tag
0 <Tag>Text</Tag>
# non-greedy matching - match to the first occurrence of the following character (>)
if("<Tag>Text</Tag>" -match '<(.+?)>') { $matches }
Name Value
---- -----
1 Tag
0 <Tag>
See Regex Repeat for more info on how to control pattern repetition.
Summary
In this post, we have looked at how the structure of a switch-based parser could look, and how it can be written so that it works as a part of the pipeline. We have also looked at a few slightly more complicated regular expressions in some detail.
As we have seen, PowerShell has a plethora of options for parsing text, and most of them revolve around regular expressions. My personal experience has been that the time I’ve invested in understanding the regex language was well invested.
Hopefully, this gives you a good start with the parsing tasks you have at hand.
Thanks to Jason Shirk, Mathias Jessen and Steve Lee for reviews and feedback.
Staffan Gustafsson, @StaffanGson, github
Staffan works at DICE in Stockholm, Sweden, as a Software Engineer and has been using PowerShell since the first public beta. He was most seriously pleased when PowerShell was open sourced, and has since contributed bug fixes, new features and performance improvements. Staffan is a speaker at PSConfEU and is always happy to talk PowerShell.