{"id":14655,"date":"2019-01-28T09:09:16","date_gmt":"2019-01-28T17:09:16","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/powershell\/?p=14655"},"modified":"2019-02-19T23:34:52","modified_gmt":"2019-02-20T06:34:52","slug":"parsing-text-with-powershell-3-3","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/powershell\/parsing-text-with-powershell-3-3\/","title":{"rendered":"Parsing Text with PowerShell (3\/3)"},"content":{"rendered":"<div class=\"markdown-body\">\n<p>This is the third and final post in a three-part series.<\/p>\n<ul>\n<li><a href=\"https:\/\/blogs.msdn.microsoft.com\/powershell\/2019\/01\/18\/parsing-text-with-powershell-1-3\/\" rel=\"nofollow\">Part 1<\/a>:\n<ul>\n<li>Useful methods on the String class<\/li>\n<li>Introduction to Regular Expressions<\/li>\n<li>The Select-String cmdlet<\/li>\n<\/ul>\n<\/li>\n<li><a href=\"https:\/\/blogs.msdn.microsoft.com\/powershell\/2019\/01\/24\/parsing-text-with-powershell-2-3\/\" rel=\"nofollow\">Part 2<\/a>:\n<ul>\n<li>the -split operator<\/li>\n<li>the -match operator<\/li>\n<li>the switch statement<\/li>\n<li>the Regex class<\/li>\n<\/ul>\n<\/li>\n<li><em>Part 3<\/em>:\n<ul>\n<li>a real world, complete and slightly bigger, example of a switch-based parser\n<ul>\n<li>General structure of a switch-based parser<\/li>\n<li>The real world example<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>In the previous posts, we looked at the different operators that are available to us in PowerShell.<\/p>\n<p>When analyzing crashes at <a href=\"https:\/\/www.dice.se\" rel=\"nofollow\">DICE<\/a>, I noticed that some of the C++ runtime binaries were missing debug symbols. They should be available for download from Microsoft&#8217;s public symbol server, and most versions were there. 
However, due to some process errors at DevDiv, some builds were released publicly without available debug symbols.\nIn some cases, those missing symbols prevented us from debugging those crashes, and in all cases, they triggered my developer OCD.<\/p>\n<p>So, to give actionable feedback to Microsoft, I scripted a debugger (cdb.exe in this case) to give a verbose list of the loaded modules, and parsed the output with PowerShell, which was also later used to group and filter the resulting data set. I sent this data to Microsoft, and 5 days later, the missing symbols were available for download. Mission accomplished!<\/p>\n<p>This post will describe the parser I wrote for this task (it turned out that I had good use for it for other tasks later), and the general structure is applicable to most parsing tasks.<\/p>\n<p>The example will show how a <code>switch<\/code>-based parser would look when the input data isn&#8217;t as tidy as it normally is in examples, but messy &#8211; as the real world data often is.<\/p>\n<h2><a id=\"user-content-general-structure-of-a-switch-based-parser\" class=\"anchor\" href=\"#general-structure-of-a-switch-based-parser\"><\/a>General Structure of a switch Based Parser<\/h2>\n<p>Depending on the structure of our input, the code must be organized in slightly different ways.<\/p>\n<p>Input may have a record start that differs by indentation or some distinct token like<\/p>\n<pre><code>Foo                    &lt;- Record start - No whitespace at the beginning of the line\r\n    Prop1=Staffan      &lt;- Properties for the record - starts with whitespace\r\n    Prop3 =ValueN\r\nBar\r\n    Prop1=Steve\r\n    Prop2=ValueBar2\r\n<\/code><\/pre>\n<p>If the data to be parsed has an explicit start record, it is a bit easier than if it doesn&#8217;t have one.\nWe create a new data object when we get a record start, after writing any previously created object to the pipeline.\nAt the end, we need to check if we have parsed a record that hasn&#8217;t 
been written to the pipeline.<\/p>\n<p>The general structure of such a switch-based parser can be as follows:<\/p>\n<pre class=\"lang:default decode:true\">$inputData = @\"\r\nFoo\r\n    Prop1=Value1\r\n    Prop3=Value3\r\nBar\r\n    Prop1=ValueBar1\r\n    Prop2=ValueBar2\r\n\"@ -split '\\r?\\n'   # This regex is useful to split at line endings, with or without carriage return\r\n\r\nclass SomeDataClass {\r\n    $ID\r\n    $Name\r\n    $Property2\r\n    $Property3\r\n}\r\n\r\n# map to project input property names to the properties on our data class\r\n$propertyNameMap = @{\r\n    Prop1 = \"Name\"\r\n    Prop2 = \"Property2\"\r\n    Prop3 = \"Property3\"\r\n}\r\n\r\n$currentObject = $null\r\nswitch -regex ($inputData) {\r\n\r\n    '^(\\S.*)' {\r\n        # record start pattern, in this case a line that doesn't start with whitespace.\r\n        if ($null -ne $currentObject) {\r\n            $currentObject                   # output to pipeline if we have a previous data object\r\n        }\r\n        $currentObject = [SomeDataClass] @{  # create new object for this record\r\n            Id = $matches.1                  # with Id like Foo or Bar\r\n        }\r\n        continue\r\n    }\r\n\r\n    # set the properties on the data object\r\n    '^\\s+([^=]+)=(.*)' {\r\n        $name, $value = $matches[1, 2]\r\n        # project property names\r\n        $propName = $propertyNameMap[$name]\r\n        if ($null -eq $propName) {\r\n            $propName = $name\r\n        }\r\n        # assign the parsed value to the projected property name\r\n        $currentObject.$propName = $value\r\n        continue\r\n    }\r\n}\r\n\r\nif ($currentObject) {\r\n    # Handle the last object if any\r\n    $currentObject # output to pipeline\r\n}<\/pre>\n<pre><code>ID  Name      Property2 Property3\r\n--  ----      --------- ---------\r\nFoo Value1              Value3\r\nBar ValueBar1 ValueBar2\r\n<\/code><\/pre>\n<p>Alternatively, we may have input where the records are separated by a 
blank line, but without any obvious record start.<\/p>\n<pre><code>commitId=1234                         &lt;- In this case, a commitId is first in a record\r\ndescription=Update readme.md\r\n                                      &lt;- the blank line separates records\r\nuser=Staffan                          &lt;- For this record, a user property comes first\r\ncommitId=1235\r\ndescription=Fix bug.md\r\n<\/code><\/pre>\n<p>In this case the structure of the code looks a bit different. We create an object at the beginning, but keep track of if it&#8217;s dirty or not.\nIf we get to the end with a dirty object, we must output it.<\/p>\n<pre class=\"lang:default decode:true\">$inputData = @\"\r\n\r\ncommit=1234\r\ndesc=Update readme.md\r\n\r\nuser=Staffan\r\ncommit=1235\r\ndesc=Bug fix\r\n\r\n\"@ -split \"\\r?\\n\"\r\n\r\nclass SomeDataClass {\r\n    [int] $CommitId\r\n    [string] $Description\r\n    [string] $User\r\n}\r\n\r\n# map to project input property names to the properties on our data class\r\n# we only need to provide the ones that are different. 
'User' works fine as it is.\r\n$propertyNameMap = @{\r\n    commit = \"CommitId\"\r\n    desc   = \"Description\"\r\n}\r\n\r\n$currentObject = [SomeDataClass]::new()\r\n$objectDirty = $false\r\nswitch -regex ($inputData) {\r\n    # set the properties on the data object\r\n    '^([^=]+)=(.*)$' {\r\n        # parse a name\/value\r\n        $name, $value = $matches[1, 2]\r\n        # project property names\r\n        $propName = $propertyNameMap[$name]\r\n        if ($null -eq $propName) {\r\n            $propName = $name\r\n        }\r\n        # assign the projected property\r\n        $currentObject.$propName = $value\r\n        $objectDirty = $true\r\n        continue\r\n    }\r\n\r\n    '^\\s*$' {\r\n        # separator pattern, in this case any blank line\r\n        if ($objectDirty) {\r\n            $currentObject                           # output to pipeline\r\n            $currentObject = [SomeDataClass]::new()  # create new object\r\n            $objectDirty = $false                    # and mark it as not dirty\r\n        }\r\n    }\r\n    default {\r\n        Write-Warning \"Unexpected input: '$_'\"\r\n    }\r\n}\r\n\r\nif ($objectDirty) {\r\n    # Handle the last object if any\r\n    $currentObject # output to pipeline\r\n}<\/pre>\n<pre><code>CommitId Description      User\r\n-------- -----------      ----\r\n    1234 Update readme.md\r\n    1235 Bug fix          Staffan\r\n<\/code><\/pre>\n<h2><a id=\"user-content-the-real-world-example\" class=\"anchor\" href=\"#the-real-world-example\"><\/a>The Real World Example<\/h2>\n<p>I have adapted this sample slightly so that I get the loaded modules from a running process instead of from my crash dumps. 
The format of the output from the debugger is the same.\nThe following command launches a command line debugger on notepad, with a script that gives a verbose listing of the loaded modules, and quits:<\/p>\n<pre class=\"lang:default decode:true\"># we need to muck around with the console output encoding to handle the trademark chars\r\n# imagine no encodings\r\n# it's easy if you try\r\n# no code pages below us\r\n# above us only sky\r\n[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding(\"iso-8859-1\")\r\n\r\n$proc = Start-Process notepad -passthru\r\nStart-Sleep -seconds 1\r\n$cdbOutput = cdb -y 'srv*c:\\symbols*http:\/\/msdl.microsoft.com\/download\/symbols' -c \".reload -f;lmv;q\" -p $proc.ProcessID<\/pre>\n<p>The output of the command above is <a href=\"https:\/\/msdnshared.blob.core.windows.net\/media\/2019\/01\/notepad_modules.txt\">here<\/a>\u00a0for those who want to follow along but who aren&#8217;t running windows or don&#8217;t have cdb.exe installed.<\/p>\n<p>The (abbreviated) output looks like this:<\/p>\n<pre><code>Microsoft (R) Windows Debugger Version 10.0.16299.15 AMD64\r\nCopyright (c) Microsoft Corporation. 
All rights reserved.\r\n\r\n*** wait with pending attach\r\n\r\n************* Path validation summary **************\r\nResponse                         Time (ms)     Location\r\nDeferred                                       srv*c:\\symbols*http:\/\/msdl.microsoft.com\/download\/symbols\r\nSymbol search path is: srv*c:\\symbols*http:\/\/msdl.microsoft.com\/download\/symbols\r\nExecutable search path is:\r\nModLoad: 00007ff6`e9da0000 00007ff6`e9de3000   C:\\Windows\\system32\\notepad.exe\r\n...\r\nModLoad: 00007ffe`97d80000 00007ffe`97db1000   C:\\WINDOWS\\SYSTEM32\\ntmarta.dll\r\n(98bc.40a0): Break instruction exception - code 80000003 (first chance)\r\nntdll!DbgBreakPoint:\r\n00007ffe`9cd53050 cc              int     3\r\n0:007&gt; cdb: Reading initial command '.reload -f;lmv;q'\r\nReloading current modules\r\n.....................................................\r\nstart             end                 module name\r\n00007ff6`e9da0000 00007ff6`e9de3000   notepad    (pdb symbols)          c:\\symbols\\notepad.pdb\\2352C62CDF448257FDBDDA4081A8F9081\\notepad.pdb\r\n    Loaded symbol image file: C:\\Windows\\system32\\notepad.exe\r\n    Image path: C:\\Windows\\system32\\notepad.exe\r\n    Image name: notepad.exe\r\n    Image was built with \/Brepro flag.\r\n    Timestamp:        329A7791 (This is a reproducible build file hash, not a timestamp)\r\n    CheckSum:         0004D15F\r\n    ImageSize:        00043000\r\n    File version:     10.0.17763.1\r\n    Product version:  10.0.17763.1\r\n    File flags:       0 (Mask 3F)\r\n    File OS:          40004 NT Win32\r\n    File type:        1.0 App\r\n    File date:        00000000.00000000\r\n    Translations:     0409.04b0\r\n    CompanyName:      Microsoft Corporation\r\n    ProductName:      Microsoft??? Windows??? 
Operating System\r\n    InternalName:     Notepad\r\n    OriginalFilename: NOTEPAD.EXE\r\n    ProductVersion:   10.0.17763.1\r\n    FileVersion:      10.0.17763.1 (WinBuild.160101.0800)\r\n    FileDescription:  Notepad\r\n    LegalCopyright:   ??? Microsoft Corporation. All rights reserved.\r\n...\r\n00007ffe`9ccb0000 00007ffe`9ce9d000   ntdll      (pdb symbols)          c:\\symbols\\ntdll.pdb\\B8AD79538F2730FD9BACE36C9F9316A01\\ntdll.pdb\r\n    Loaded symbol image file: C:\\WINDOWS\\SYSTEM32\\ntdll.dll\r\n    Image path: C:\\WINDOWS\\SYSTEM32\\ntdll.dll\r\n    Image name: ntdll.dll\r\n    Image was built with \/Brepro flag.\r\n    Timestamp:        E8B54827 (This is a reproducible build file hash, not a timestamp)\r\n    CheckSum:         001F20D1\r\n    ImageSize:        001ED000\r\n    File version:     10.0.17763.194\r\n    Product version:  10.0.17763.194\r\n    File flags:       0 (Mask 3F)\r\n    File OS:          40004 NT Win32\r\n    File type:        2.0 Dll\r\n    File date:        00000000.00000000\r\n    Translations:     0409.04b0\r\n    CompanyName:      Microsoft Corporation\r\n    ProductName:      Microsoft??? Windows??? Operating System\r\n    InternalName:     ntdll.dll\r\n    OriginalFilename: ntdll.dll\r\n    ProductVersion:   10.0.17763.194\r\n    FileVersion:      10.0.17763.194 (WinBuild.160101.0800)\r\n    FileDescription:  NT Layer DLL\r\n    LegalCopyright:   ??? Microsoft Corporation. All rights reserved.\r\nquit:\r\n\r\n<\/code><\/pre>\n<p>The output starts with info that I&#8217;m not interested in here. I only want to get the detailed information about the loaded modules. 
It is not until the line<\/p>\n<pre><code>start             end                 module name\r\n<\/code><\/pre>\n<p>that I care about the output.<\/p>\n<p>Also, at the end there is a line that we need to be aware of:<\/p>\n<pre><code>quit:\r\n<\/code><\/pre>\n<p>that is not part of the module output.<\/p>\n<p>To skip the parts of the debugger output that we don&#8217;t care about, we have a boolean flag initially set to true.\nIf that flag is set, we check if the current line, <code>$_<\/code>, is the module header in which case we flip the flag.<\/p>\n<pre class=\"lang:default decode:true\">    $inPreamble = $true\r\n    switch -regex ($cdbOutput) {\r\n\r\n        { $inPreamble -and $_ -eq \"start             end                 module name\" } { $inPreamble = $false; continue }<\/pre>\n<p>I have made the parser a separate function that reads its input from the pipeline. This way, I can use the same function to parse module data, regardless of how I got the module data. Maybe it was saved on a file. Or came from a dump, or a live process. 
It doesn&#8217;t matter, since the parser is decoupled from the data retrieval.<\/p>\n<p>After the sample, there is a breakdown of the more complicated regular expressions used, so don&#8217;t despair if you don&#8217;t understand them at first.\nRegular Expressions are notoriously hard to read, so much so that they make Perl look readable in comparison.<\/p>\n<pre class=\"lang:default decode:true\"># define a class to store the data\r\nclass ExecutableModule {\r\n    [string]   $Name\r\n    [string]   $Start\r\n    [string]   $End\r\n    [string]   $SymbolStatus\r\n    [string]   $PdbPath\r\n    [bool]     $Reproducible\r\n    [string]   $ImagePath\r\n    [string]   $ImageName\r\n    [DateTime] $TimeStamp\r\n    [uint32]   $FileHash\r\n    [uint32]   $CheckSum\r\n    [uint32]   $ImageSize\r\n    [version]  $FileVersion\r\n    [version]  $ProductVersion\r\n    [string]   $FileFlags\r\n    [string]   $FileOS\r\n    [string]   $FileType\r\n    [string]   $FileDate\r\n    [string[]] $Translations\r\n    [string]   $CompanyName\r\n    [string]   $ProductName\r\n    [string]   $InternalName\r\n    [string]   $OriginalFilename\r\n    [string]   $ProductVersionStr\r\n    [string]   $FileVersionStr\r\n    [string]   $FileDescription\r\n    [string]   $LegalCopyright\r\n    [string]   $LegalTrademarks\r\n    [string]   $LoadedImageFile\r\n    [string]   $PrivateBuild\r\n    [string]   $Comments\r\n}\r\n\r\n&lt;#\r\n.SYNOPSIS Runs a debugger on a program to dump its loaded modules\r\n#&gt;\r\nfunction Get-ExecutableModuleRawData {\r\n    param ([string] $Program)\r\n    $consoleEncoding = [Console]::OutputEncoding\r\n    [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding(\"iso-8859-1\")\r\n    try {\r\n        $proc = Start-Process $program -PassThru\r\n        Start-Sleep -Seconds 1  # sleep for a while so modules are loaded\r\n        cdb -y srv*c:\\symbols*http:\/\/msdl.microsoft.com\/download\/symbols -c \".reload -f;lmv;q\" -p $proc.Id\r\n        
$proc.Close()\r\n    }\r\n    finally {\r\n        [Console]::OutputEncoding = $consoleEncoding\r\n    }\r\n}\r\n\r\n&lt;#\r\n.SYNOPSIS Converts verbose module data from windows debuggers into ExecutableModule objects.\r\n#&gt;\r\nfunction ConvertTo-ExecutableModule {\r\n    [OutputType([ExecutableModule])]\r\n    param (\r\n        [Parameter(ValueFromPipeline)]\r\n        [string[]] $ModuleRawData\r\n    )\r\n    begin {\r\n        $currentObject = $null\r\n        $preamble = $true\r\n        $propertyNameMap = @{\r\n            'File flags'      = 'FileFlags'\r\n            'File OS'         = 'FileOS'\r\n            'File type'       = 'FileType'\r\n            'File date'       = 'FileDate'\r\n            'File version'    = 'FileVersion'\r\n            'Product version' = 'ProductVersion'\r\n            'Image path'      = 'ImagePath'\r\n            'Image name'      = 'ImageName'\r\n            'FileVersion'     = 'FileVersionStr'\r\n            'ProductVersion'  = 'ProductVersionStr'\r\n        }\r\n    }\r\n    process {\r\n        switch -regex ($ModuleRawData) {\r\n\r\n            # skip lines until we get to our sentinel line\r\n            { $preamble -and $_ -eq \"start             end                 module name\" } { $preamble = $false; continue }\r\n\r\n            #00007ff6`e9da0000 00007ff6`e9de3000   notepad    (deferred)\r\n            #00007ffe`9ccb0000 00007ffe`9ce9d000   ntdll      (pdb symbols)          c:\\symbols\\ntdll.pdb\\B8AD79538F2730FD9BACE36C9F9316A01\\ntdll.pdb\r\n            '^([0-9a-f`]{17})\\s([0-9a-f`]{17})\\s+(\\S+)\\s+\\(([^\\)]+)\\)\\s*(.+)?' 
{\r\n                # see breakdown of the expression later in the post\r\n                # on record start, output the currentObject, if any is set\r\n                if ($null -ne $currentObject) {\r\n                    $currentObject\r\n                }\r\n                $start, $end, $module, $pdbKind, $pdbPath = $matches[1..5]\r\n                # create an instance of the object that we are adding info from the current record into.\r\n                $currentObject = [ExecutableModule] @{\r\n                    Start        = $start\r\n                    End          = $end\r\n                    Name         = $module\r\n                    SymbolStatus = $pdbKind\r\n                    PdbPath      = $pdbPath\r\n                }\r\n                continue\r\n            }\r\n            '^\\s+Image was built with \/Brepro flag.' {\r\n                $currentObject.Reproducible = $true\r\n                continue\r\n            }\r\n            '^\\s+Timestamp:\\s+[^\\(]+\\((?&lt;timestamp&gt;.{8})\\)' {\r\n                # see breakdown of the regular  expression later in the post\r\n                # Timestamp:        Mon Jan  7 23:42:30 2019 (5C33D5D6)\r\n                $intValue = [Convert]::ToInt32($matches.timestamp, 16)\r\n                $currentObject.TimeStamp = [DateTime]::new(1970, 01, 01, 0, 0, 0, [DateTimeKind]::Utc).AddSeconds($intValue)\r\n                continue\r\n            }\r\n            '^\\s+TimeStamp:\\s+(?&lt;value&gt;.{8}) \\(This' {\r\n                # Timestamp:        E78937AC (This is a reproducible build file hash, not a timestamp)\r\n                $currentObject.FileHash = [Convert]::ToUInt32($matches.value, 16)\r\n                continue\r\n            }\r\n            '^\\s+Loaded symbol image file: (?&lt;imageFile&gt;[^\\)]+)' {\r\n                $currentObject.LoadedImageFile = $matches.imageFile\r\n                continue\r\n            }\r\n            '^\\s+Checksum:\\s+(?&lt;checksum&gt;\\S+)' {\r\n  
              $currentObject.Checksum = [Convert]::ToUInt32($matches.checksum, 16)\r\n                continue\r\n            }\r\n            '^\\s+Translations:\\s+(?&lt;value&gt;\\S+)' {\r\n                $currentObject.Translations = $matches.value.Split(\".\")\r\n                continue\r\n            }\r\n            '^\\s+ImageSize:\\s+(?&lt;imageSize&gt;.{8})' {\r\n                $currentObject.ImageSize = [Convert]::ToUInt32($matches.imageSize, 16)\r\n                continue\r\n            }\r\n            '^\\s{4}(?&lt;name&gt;[^:]+):\\s+(?&lt;value&gt;.+)' {\r\n                # see breakdown of the regular expression later in the post\r\n                # This part is any 'name: value' pattern\r\n                $name, $value = $matches['name', 'value']\r\n\r\n                # project the property name\r\n                $propName = $propertyNameMap[$name]\r\n                $propName = if ($null -eq $propName) { $name } else { $propName }\r\n\r\n                # note the dynamic property name in the assignment\r\n                # this will fail if the property doesn't have a member with the specified name\r\n                $currentObject.$propName = $value\r\n                continue\r\n            }\r\n            'quit:' {\r\n                # ignore and exit\r\n                break\r\n            }\r\n            default {\r\n                # When writing the parser, it can be useful to include a line like the one below to see the cases that are not handled by the parser\r\n                # Write-Warning \"missing case for '$_'. 
Unexpected output format from cdb.exe\"\r\n\r\n                continue # skip lines that don't match the patterns we are interested in, like the start\/end\/modulename header and the quit: output\r\n            }\r\n        }\r\n    }\r\n    end {\r\n        # this is needed to output the last object\r\n        if ($null -ne $currentObject) {\r\n            $currentObject\r\n        }\r\n    }\r\n}\r\n\r\n\r\nGet-ExecutableModuleRawData Notepad |\r\n    ConvertTo-ExecutableModule |\r\n    Sort-Object ProductVersion, Name |\r\n    Format-Table -Property Name, FileVersionStr, ProductVersion, FileDescription<\/pre>\n<pre><code>Name               FileVersionStr                             ProductVersion FileDescription\r\n----               --------------                             -------------- ---------------\r\nPROPSYS            7.0.17763.1 (WinBuild.160101.0800)         7.0.17763.1    Microsoft Property System\r\nADVAPI32           10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Advanced Windows 32 Base API\r\nbcrypt             10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Windows Cryptographic Primitives Library\r\n...\r\nuxtheme            10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Microsoft UxTheme Library\r\nwin32u             10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Win32u\r\nWINSPOOL           10.0.17763.1 (WinBuild.160101.0800)        10.0.17763.1   Windows Spooler Driver\r\nKERNELBASE         10.0.17763.134 (WinBuild.160101.0800)      10.0.17763.134 Windows NT BASE API Client DLL\r\nwintypes           10.0.17763.134 (WinBuild.160101.0800)      10.0.17763.134 Windows Base Types DLL\r\nSHELL32            10.0.17763.168 (WinBuild.160101.0800)      10.0.17763.168 Windows Shell Common Dll\r\n...\r\nwindows_storage    10.0.17763.168 (WinBuild.160101.0800)      10.0.17763.168 Microsoft WinRT Storage API\r\nCoreMessaging      10.0.17763.194                             10.0.17763.194 Microsoft CoreMessaging 
Dll\r\ngdi32full          10.0.17763.194 (WinBuild.160101.0800)      10.0.17763.194 GDI Client DLL\r\nntdll              10.0.17763.194 (WinBuild.160101.0800)      10.0.17763.194 NT Layer DLL\r\nRMCLIENT           10.0.17763.194 (WinBuild.160101.0800)      10.0.17763.194 Resource Manager Client\r\nRPCRT4             10.0.17763.194 (WinBuild.160101.0800)      10.0.17763.194 Remote Procedure Call Runtime\r\ncombase            10.0.17763.253 (WinBuild.160101.0800)      10.0.17763.253 Microsoft COM for Windows\r\nCOMCTL32           6.10 (WinBuild.160101.0800)                10.0.17763.253 User Experience Controls Library\r\nurlmon             11.00.17763.168 (WinBuild.160101.0800)     11.0.17763.168 OLE32 Extensions for Win32\r\niertutil           11.00.17763.253 (WinBuild.160101.0800)     11.0.17763.253 Run time utility for Internet Explorer\r\n<\/code><\/pre>\n<h2><a id=\"user-content-regex-pattern-breakdown\" class=\"anchor\" href=\"#regex-pattern-breakdown\"><\/a>Regex pattern breakdown<\/h2>\n<p>Here is a breakdown of the more complicated patterns, using the <code>ignore pattern whitespace<\/code> modifier <code>x<\/code>:<\/p>\n<pre class=\"lang:default decode:true\">([0-9a-f`]{17})\\s([0-9a-f`]{17})\\s+(\\S+)\\s+\\(([^\\)]+)\\)\\s*(.+)?\r\n\r\n# example input: 00007ffe`9ccb0000 00007ffe`9ce9d000   ntdll      (pdb symbols)          c:\\symbols\\ntdll.pdb\\B8AD79538F2730FD9BACE36C9F9316A01\\ntdll.pdb\r\n\r\n(?x)                # ignore pattern whitespace\r\n^                   # the beginning of the line\r\n([0-9a-f`]{17})     # capture expression like 00007ff6`e9da0000 - any hex number or backtick, and exactly 17 of them\r\n\\s                  # a space\r\n([0-9a-f`]{17})     # capture expression like 00007ff6`e9da0000 - any hex number or backtick, and exactly 17 of them\r\n\\s+                 # skip any number of spaces\r\n(\\S+)               # capture until we get a space - this would match the 'ntdll' part\r\n\\s+                 # skip one or more 
spaces\r\n\\(                  # start parenthesis\r\n    ([^\\)]+)        # capture one or more of anything but end parenthesis\r\n\\)                  # end parenthesis\r\n\\s*                 # skip zero or more spaces\r\n(.+)?               # optionally capture any symbol file path<\/pre>\n<p>Breakdown of the name-value pattern:<\/p>\n<pre class=\"lang:default decode:true\">^\\s{4}(?&lt;name&gt;[^:]+):\\s+(?&lt;value&gt;.+)\r\n\r\n# example input:  File flags:       0 (Mask 3F)\r\n\r\n(?x)                # ignore pattern whitespace\r\n^                   # the beginning of the line\r\n\\s{4}               # require exactly four spaces\r\n(?&lt;name&gt;[^:]+)      # capture anything that is not a `:` into the named group \"name\"\r\n:                   # require a colon\r\n\\s+                 # require one or more spaces\r\n(?&lt;value&gt;.+)        # capture everything until the end into the named group \"value\"<\/pre>\n<p>Breakdown of the timestamp pattern:<\/p>\n<pre class=\"lang:default decode:true\">^\\s+Timestamp:\\s+[^\\(]+\\((?&lt;timestamp&gt;.{8})\\)\r\n\r\n#example input:     Timestamp:        Mon Jan  7 23:42:30 2019 (5C33D5D6)\r\n\r\n(?x)                # ignore pattern whitespace\r\n^                   # the beginning of the line\r\n\\s+                 # require one or more spaces\r\nTimestamp:          # The literal text 'Timestamp:'\r\n\\s+                 # require one or more spaces\r\n[^\\(]+              # one or more of anything but an open parenthesis\r\n\\(                  # a literal '('\r\n(?&lt;timestamp&gt;.{8})  # 8 characters of anything, captured into the group 'timestamp'\r\n\\)                  # a literal ')'<\/pre>\n<h2><a id=\"user-content-gotchas---the-regex-cache\" class=\"anchor\" href=\"#gotchas---the-regex-cache\"><\/a>Gotchas &#8211; the Regex Cache<\/h2>\n<p>Something that can happen if you are writing a more complicated parser is the following:\nThe parser works well. 
You have 15 regular expressions in your switch statement and then you get some input you haven&#8217;t seen before, so you add a 16th regex.\nAll of a sudden, the performance of your parser tanks. WTF?<\/p>\n<p>The .net regex implementation has a cache of recently used <code>regex<\/code>s. You can check the size of it like this:<\/p>\n<pre class=\"lang:default decode:true\">PS&gt; [regex]::CacheSize\r\n15\r\n\r\n# bump it\r\n[regex]::CacheSize = 20<\/pre>\n<p>And now your parser is fast(er) again.<\/p>\n<h2><a id=\"user-content-bonus-tip\" class=\"anchor\" href=\"#bonus-tip\"><\/a>Bonus tip<\/h2>\n<p>I frequently use PowerShell to write (generate) my code:<\/p>\n<pre class=\"lang:default decode:true\">Get-ExecutableModuleRawData pwsh |\r\n    Select-String '^\\s+([^:]+):' |       # this pattern matches the module detail fields\r\n    Foreach-Object {$_.matches.groups[1].value} |\r\n    Select-Object -Unique |\r\n    Foreach-Object -Begin   { \"class ExecutableModuleData {\" }`\r\n                   -Process { \"    [string] $\" + ($_ -replace \"\\s.\", {[char]::ToUpperInvariant($_.Groups[0].Value[1])}) }`\r\n                   -End     { \"}\" }<\/pre>\n<p>The output is<\/p>\n<pre><code>class ExecutableModuleData {\r\n    [string] $LoadedSymbolImageFile\r\n    [string] $ImagePath\r\n    [string] $ImageName\r\n    [string] $Timestamp\r\n    [string] $CheckSum\r\n    [string] $ImageSize\r\n    [string] $FileVersion\r\n    [string] $ProductVersion\r\n    [string] $FileFlags\r\n    [string] $FileOS\r\n    [string] $FileType\r\n    [string] $FileDate\r\n    [string] $Translations\r\n    [string] $CompanyName\r\n    [string] $ProductName\r\n    [string] $InternalName\r\n    [string] $OriginalFilename\r\n    [string] $ProductVersion\r\n    [string] $FileVersion\r\n    [string] $FileDescription\r\n    [string] $LegalCopyright\r\n    [string] $Comments\r\n    [string] $LegalTrademarks\r\n    [string] $PrivateBuild\r\n}\r\n<\/code><\/pre>\n<p>It is not complete &#8211; I 
don&#8217;t have the fields from the record start, some types are incorrect, and when run against some other executables a few other fields may appear.\nBut it is a very good starting point. And way more fun than typing it \ud83d\ude42<\/p>\n<p>Note that this example is using a new feature of the <code>-replace<\/code> operator &#8211; to use a ScriptBlock to determine what to replace with &#8211; that was added in PowerShell Core 6.1.<\/p>\n<h2><a id=\"user-content-bonus-tip-2\" class=\"anchor\" href=\"#bonus-tip-2\"><\/a>Bonus tip #2<\/h2>\n<p>A regular expression construct that I often find useful is <code>non-greedy matching<\/code>.\nThe example below shows the effect of the <code>?<\/code> modifier, which can be used after <code>*<\/code> (zero or more) and <code>+<\/code> (one or more).<\/p>\n<pre class=\"lang:default decode:true\"># greedy matching - match to the last occurrence of the following character (&gt;)\r\nif(\"&lt;Tag&gt;Text&lt;\/Tag&gt;\" -match '&lt;(.+)&gt;') { $matches }<\/pre>\n<pre class=\"\"><code>Name                           Value\r\n----                           -----\r\n1                              Tag&gt;Text&lt;\/Tag\r\n0                              &lt;Tag&gt;Text&lt;\/Tag&gt;\r\n<\/code><\/pre>\n<div class=\"highlight highlight-source-powershell\">\n<pre class=\"lang:default decode:true  \"># non-greedy matching - match to the first occurrence of the following character (&gt;)\r\nif(\"&lt;Tag&gt;Text&lt;\/Tag&gt;\" -match '&lt;(.+?)&gt;') { $matches }<\/pre>\n<\/div>\n<pre><code>Name                           Value\r\n----                           -----\r\n1                              Tag\r\n0                              &lt;Tag&gt;\r\n\r\n<\/code><\/pre>\n<p>See <a href=\"https:\/\/www.regular-expressions.info\/repeat.html\" rel=\"nofollow\">Regex Repeat<\/a> for more info on how to control pattern repetition.<\/p>\n<h2><a id=\"user-content-summary\" class=\"anchor\" href=\"#summary\"><\/a>Summary<\/h2>\n<p>In this post, 
we have looked at how the structure of a switch-based parser could look, and how it can be written so that it works as a part of the pipeline.\nWe have also looked at a few slightly more complicated regular expressions in some detail.<\/p>\n<p>As we have seen, PowerShell has a plethora of options for parsing text, and most of them revolve around regular expressions.\nMy personal experience has been that the time I&#8217;ve invested in understanding the regex language was well invested.<\/p>\n<p>Hopefully, this gives you a good start with the parsing tasks you have at hand.<\/p>\n<p>Thanks to Jason Shirk, Mathias Jessen and Steve Lee for reviews and feedback.<\/p>\n<p>Staffan Gustafsson, <a href=\"https:\/\/twitter.com\/staffangson\" rel=\"nofollow\">@StaffanGson<\/a>, <a href=\"https:\/\/github.com\/powercode\/\">github<\/a><\/p>\n<p><em>Staffan works at <a href=\"https:\/\/www.dice.se\" rel=\"nofollow\">DICE<\/a> in Stockholm, Sweden, as a Software Engineer and has been using PowerShell since the first public beta.<\/em>\n<em>He was most seriously pleased when PowerShell was open sourced, and has since contributed bug fixes, new features and performance improvements.<\/em>\n<em>Staffan is a speaker at <a href=\"http:\/\/www.psconf.eu\/\" rel=\"nofollow\">PSConfEU<\/a> and is always happy to talk PowerShell.<\/em><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>This is the third and final post in a three-part series. 
Part 1: Useful methods on the String class Introduction to Regular Expressions The Select-String cmdlet Part 2: the -split operator the -match operator the switch statement the Regex class Part 3: a real world, complete and slightly bigger, example of a switch-based parser General [&hellip;]<\/p>\n","protected":false},"author":685,"featured_media":13641,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-14655","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-powershell"],"acf":[],"blog_post_summary":"<p>This is the third and final post in a three-part series. Part 1: Useful methods on the String class Introduction to Regular Expressions The Select-String cmdlet Part 2: the -split operator the -match operator the switch statement the Regex class Part 3: a real world, complete and slightly bigger, example of a switch-based parser General 
[&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/posts\/14655","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/users\/685"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/comments?post=14655"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/posts\/14655\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/media\/13641"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/media?parent=14655"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/categories?post=14655"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/tags?post=14655"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}