{"id":14625,"date":"2019-01-24T09:23:00","date_gmt":"2019-01-24T17:23:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/powershell\/?p=14625"},"modified":"2022-11-16T12:32:26","modified_gmt":"2022-11-16T20:32:26","slug":"parsing-text-with-powershell-2-3","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/powershell\/parsing-text-with-powershell-2-3\/","title":{"rendered":"Parsing Text with PowerShell (2\/3)"},"content":{"rendered":"<div class=\"markdown-body\">\n<p>This is the second post in a three-part series.<\/p>\n<ul>\n<li><a href=\"https:\/\/blogs.msdn.microsoft.com\/powershell\/2019\/01\/18\/parsing-text-with-powershell-1-3\/\" rel=\"nofollow\">Part 1<\/a>:\n<ul>\n<li>Useful methods on the String class<\/li>\n<li>Introduction to Regular Expressions<\/li>\n<li>The Select-String cmdlet<\/li>\n<\/ul>\n<\/li>\n<li><em>Part 2<\/em>:\n<ul>\n<li>the -split operator<\/li>\n<li>the -match operator<\/li>\n<li>the switch statement<\/li>\n<li>the Regex class<\/li>\n<\/ul>\n<\/li>\n<li><a href=\"https:\/\/devblogs.microsoft.com\/powershell\/parsing-text-with-powershell-3-3\/\">Part 3<\/a>:\n<ul>\n<li>a real world, complete and slightly bigger, example of a switch-based parser<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2><a id=\"user-content-the--split-operator\" class=\"anchor\" href=\"#the--split-operator\"><\/a>The <code>-split<\/code> operator<\/h2>\n<p>The <code>-split<\/code> <a href=\"https:\/\/docs.microsoft.com\/en-us\/powershell\/module\/microsoft.powershell.core\/about\/about_split?view=powershell-6\" rel=\"nofollow\">operator<\/a> splits one or more strings into substrings.<\/p>\n<p>The first example is a name-value pattern, which is a common parsing task. 
Note the use of the <a href=\"https:\/\/docs.microsoft.com\/powershell\/module\/microsoft.powershell.core\/about\/about_split?view=powershell-6#max-substrings\" rel=\"nofollow\">Max-substrings<\/a> parameter to the <code>-split<\/code> operator.\nLimiting the result to two substrings ensures that it doesn&#8217;t matter if the value contains the character we split on.<\/p>\n<pre class=\"lang:default decode:true\">$text = \"Description=The '=' character is used for assigning values to a variable\"\r\n$name, $value = $text -split \"=\", 2\r\n\r\n@\"\r\nName  =  $name\r\nValue =  $value\r\n\"@<\/pre>\n<pre><code>Name  =  Description\r\nValue =  The '=' character is used for assigning values to a variable\r\n<\/code><\/pre>\n<p>When the line to parse contains fields separated by a well-known separator that is never part of the field values, we can use the <code>-split<\/code> operator in combination with multiple assignment to get the fields into variables.<\/p>\n<pre class=\"lang:default decode:true\">$name, $location, $occupation = \"Spider Man,New York,Super Hero\" -split ','<\/pre>\n<p>If only the location is of interest, the unwanted items can be assigned to <code>$null<\/code>.<\/p>\n<pre class=\"lang:default decode:true\">$null, $location, $null = \"Spider Man,New York,Super Hero\" -split ','\r\n\r\n$location<\/pre>\n<pre><code>New York\r\n<\/code><\/pre>\n<p>If there are many fields, assigning to null doesn&#8217;t scale well. 
Indexing can be used instead, to get the fields of interest.<\/p>\n<pre class=\"lang:default decode:true\">$inputText = \"x,Staffan,x,x,x,x,x,x,x,x,x,x,Stockholm,x,x,x,x,x,x,x,x,11,x,x,x,x\"\r\n$name, $location, $age = ($inputText -split ',')[1,12,21]\r\n\r\n$name\r\n$location\r\n$age<\/pre>\n<pre><code>Staffan\r\nStockholm\r\n11\r\n<\/code><\/pre>\n<p>It is almost always a good idea to create an object that gives context to the different parts.<\/p>\n<pre class=\"lang:default decode:true\">$inputText = \"x,Steve,x,x,x,x,x,x,x,x,x,x,Seattle,x,x,x,x,x,x,x,x,22,x,x,x,x\"\r\n$name, $location, $age = ($inputText -split ',')[1,12,21]\r\n[PSCustomObject] @{\r\n    Name = $name\r\n    Location = $location\r\n    Age = [int] $age\r\n}<\/pre>\n<pre><code>Name  Location Age\r\n----  -------- ---\r\nSteve Seattle   22\r\n<\/code><\/pre>\n<p>Instead of creating a PSCustomObject, we can create a class. It&#8217;s a bit more to type, but we can get more help from the engine, for example with tab completion.<\/p>\n<p>The example below also shows an example of type conversion, where the default string to number conversion doesn&#8217;t work.\nThe age field is handled by PowerShell&#8217;s built-in type conversion. It is of type <code>[int]<\/code>, and PowerShell will handle the conversion from <code>string<\/code> to <code>int<\/code>,\nbut in some cases we need to help out a bit. 
The ShoeSize field is also an <code>[int]<\/code>, but the data is hexadecimal,\nand without the hex specifier (&#8216;0x&#8217;), this conversion fails for some values, and provides incorrect results for the others.<\/p>\n<pre class=\"lang:default decode:true\">class PowerSheller {\r\n    [string] $Name\r\n    [string] $Location\r\n    [int] $Age\r\n    [int] $ShoeSize\r\n}\r\n\r\n$inputText = \"x,Staffan,x,x,x,x,x,x,x,x,x,x,Stockholm,x,x,x,x,x,x,x,x,33,x,11d,x,x\"\r\n$name, $location, $age, $shoeSize = ($inputText -split ',')[1,12,21,23]\r\n[PowerSheller] @{\r\n    Name = $name\r\n    Location = $location\r\n    Age = $age\r\n    # ShoeSize is expressed in hex, with no '0x' because reasons :)\r\n    # And yes, it's in millimeters.\r\n    ShoeSize = [Convert]::ToInt32($shoeSize, 16)\r\n}<\/pre>\n<pre><code>Name    Location  Age ShoeSize\r\n----    --------  --- --------\r\nStaffan Stockholm  33      285\r\n<\/code><\/pre>\n<p>The split operator&#8217;s first argument is actually a regex (by default, can be changed with <a href=\"https:\/\/docs.microsoft.com\/powershell\/module\/microsoft.powershell.core\/about\/about_split?view=powershell-6#options\" rel=\"nofollow\">options<\/a>).\nI use this on long command lines in log files (like those given to compilers) where there can be hundreds of options specified. 
This makes it hard to see if a certain option is specified or not, but when the options are split onto their own lines, it becomes trivial.\nThe pattern below uses a <a href=\"https:\/\/www.regular-expressions.info\/lookaround.html\" rel=\"nofollow\"><em>positive lookahead assertion<\/em><\/a>.\nIt can be very useful to make patterns match only in a given context, such as when they are, or are not, preceded or followed by another pattern.<\/p>\n<pre class=\"lang:default decode:true\">$cmdline = \"cl.exe \/D Bar=1 \/I SomePath \/D Foo  \/O2 \/I SomeOtherPath \/Debug a1.cpp a3.cpp a2.cpp\"\r\n\r\n$cmdline -split \"\\s+(?=[-\/])\"<\/pre>\n<pre><code>cl.exe\r\n\/D Bar=1\r\n\/I SomePath\r\n\/D Foo\r\n\/O2\r\n\/I SomeOtherPath\r\n\/Debug a1.cpp a3.cpp a2.cpp\r\n<\/code><\/pre>\n<p>Breaking down the regex, by rewriting it with the <a href=\"https:\/\/docs.microsoft.com\/dotnet\/standard\/base-types\/regular-expression-options\" rel=\"nofollow\">x<\/a> option:<\/p>\n<pre class=\"lang:default decode:true\">(?x)      # ignore whitespace in the pattern, and enable comments after '#'\r\n\\s+       # one or more whitespace characters\r\n(?=[-\/])  # only match the preceding whitespace if it is followed by either '-' or '\/'.<\/pre>\n<h3><a id=\"user-content-splitting-with-a-scriptblock\" class=\"anchor\" href=\"#splitting-with-a-scriptblock\"><\/a>Splitting with a scriptblock<\/h3>\n<p>The <code>-split<\/code> operator also comes in another form, where you can pass it a scriptblock instead of a regular expression.\nThis allows for more complicated logic that can be hard or impossible to express as a regular expression.<\/p>\n<p>The scriptblock accepts two parameters, the text to split and the current index. 
<code>$_<\/code> is bound to the character at the current index.<\/p>\n<pre class=\"lang:default decode:true\">function SplitWhitespaceInMiddleOfText {\r\n    param(\r\n        [string]$Text,\r\n        [int] $Index\r\n    )\r\n    if ($Index -lt 10 -or $Index -gt 40){\r\n        return $false\r\n    }\r\n    $_ -match '\\s'\r\n}\r\n\r\n$inputText = \"Some text that only needs splitting in the middle of the text\"\r\n$inputText -split $function:SplitWhitespaceInMiddleOfText<\/pre>\n<pre><code>Some text that\r\nonly\r\nneeds\r\nsplitting\r\nin\r\nthe middle of the text\r\n<\/code><\/pre>\n<p>The <code>$function:SplitWhitespaceInMiddleOfText<\/code> syntax is a way to get the content of the function (the scriptblock that implements it), just as <code>$env:UserName<\/code> gets the content of an item in the <code>env:<\/code> drive.\nIt provides a way to document and\/or reuse the scriptblock.<\/p>\n<h2><a id=\"user-content-the--match-operator\" class=\"anchor\" href=\"#the--match-operator\"><\/a>The <code>-match<\/code> operator<\/h2>\n<p>The <code>-match<\/code> operator works in conjunction with the <code>$matches<\/code> automatic variable. Each time the pattern of a <code>-match<\/code> or a <code>-notmatch<\/code> operation matches a scalar input (that is, when <code>-match<\/code> returns true or <code>-notmatch<\/code> returns false), the <code>$matches<\/code> variable is populated so that each capture group gets its own entry. If the capture group is named, the key will be the name of the group; otherwise it will be the index.<\/p>\n<p>As an example:<\/p>\n<pre class=\"lang:default decode:true\">if ('a b c' -match '(\\w) (?&lt;named&gt;\\w) (\\w)'){\r\n    $matches\r\n}<\/pre>\n<pre><code>Name                           Value\r\n----                           -----\r\nnamed                          b\r\n2                              c\r\n1                              a\r\n0                              a b c\r\n<\/code><\/pre>\n<p>Notice that the indices only increase on groups without names. That is, 
the indices of later groups change when a group is named.<\/p>\n<p>Armed with the regex knowledge from the earlier <a href=\"https:\/\/blogs.msdn.microsoft.com\/powershell\/2019\/01\/18\/parsing-text-with-powershell-1-3\/\" rel=\"nofollow\">post<\/a>, we can write the following:<\/p>\n<pre class=\"lang:default decode:true\">PS&gt; \"    10,Some text\" -match '^\\s+(\\d+),(.+)'<\/pre>\n<pre><code>True\r\n<\/code><\/pre>\n<div class=\"highlight highlight-source-powershell\">\n<pre class=\"lang:default decode:true\">PS&gt; $matches<\/pre>\n<\/div>\n<pre><code>Name                           Value\r\n----                           -----\r\n2                              Some text\r\n1                              10\r\n0                              10,Some text\r\n<\/code><\/pre>\n<p>or with named groups<\/p>\n<pre class=\"lang:default decode:true\">PS&gt; \"    10,Some text\" -match '^\\s+(?&lt;num&gt;\\d+),(?&lt;text&gt;.+)'<\/pre>\n<pre><code>True\r\n<\/code><\/pre>\n<div class=\"highlight highlight-source-powershell\">\n<pre class=\"lang:default decode:true\">PS&gt; $matches<\/pre>\n<\/div>\n<pre><code>Name                           Value\r\n----                           -----\r\nnum                            10\r\ntext                           Some text\r\n0                              10,Some text\r\n<\/code><\/pre>\n<p>The important thing here is to put parentheses around the parts of the pattern that we want to extract. 
That is what creates the capture groups that allow us to reference those parts of the matching text, either by name or by index.<\/p>\n<p>Combining this into a function makes it easy to use:<\/p>\n<div class=\"highlight highlight-source-powershell\">\n<pre class=\"lang:default decode:true \">function ParseMyString($text){\r\n    if ($text -match '^\\s+(\\d+),(.+)') {\r\n        [PSCustomObject] @{\r\n            Number = [int] $matches[1]\r\n            Text    = $matches[2]\r\n        }\r\n    }\r\n    else {\r\n        Write-Warning \"ParseMyString: Input '$text' doesn't match pattern\"\r\n    }\r\n}\r\n\r\nParseMyString \"    10,Some text\"<\/pre>\n<\/div>\n<pre><code>Number Text\r\n------ ----\r\n    10 Some text\r\n\r\n<\/code><\/pre>\n<p>Notice the type conversion when assigning the <code>Number<\/code> property. As long as the number is in range of an integer, this will always succeed, since we have made a successful match in the if statement above. (<code>[long]<\/code> or <code>[bigint]<\/code> could be used for larger values. In this case I provide the input, and I have promised myself to stick to a range that fits in a 32-bit integer.)\nNow we will be able to sort or do numerical operations on the <code>Number<\/code> property, and it will behave like we want it to &#8211; as a number, not as a string.<\/p>\n<h2><a id=\"user-content-the-switch-statement\" class=\"anchor\" href=\"#the-switch-statement\"><\/a>The <code>switch<\/code> statement<\/h2>\n<p>Now we&#8217;re at the big guns \ud83d\ude42<\/p>\n<p>The <code>switch<\/code> statement in PowerShell has been given special functionality for parsing text.\nIt has two flags that are useful for parsing text and files with text in them: <code>-regex<\/code> and <code>-file<\/code>.<\/p>\n<p>When specifying <code>-regex<\/code>, the match clauses that are strings are treated as regular expressions. 
The switch statement also sets the <code>$matches<\/code> automatic variable.<\/p>\n<p>When specifying <code>-file<\/code>, PowerShell treats the input as the name of a file to read lines from, rather than as a value to test.<\/p>\n<p>Note the use of a ScriptBlock instead of a string as the match clause to determine if we should skip preamble lines.<\/p>\n<div class=\"highlight highlight-source-powershell\">\n<pre class=\"lang:default decode:true \">class ParsedOutput {\r\n    [int] $Number\r\n    [string] $Text\r\n\r\n    [string] ToString() { return \"{0} ({1})\" -f $this.Text, $this.Number }\r\n}\r\n\r\n$inputData =\r\n    \"Preamble line\",\r\n    \"LastLineOfPreamble\",\r\n    \"    10,Some Text\",\r\n    \"    Some other text,20\"\r\n\r\n$inPreamble = $true\r\nswitch -regex ($inputData) {\r\n\r\n    {$inPreamble -and $_ -eq 'LastLineOfPreamble'} { $inPreamble = $false; continue }\r\n\r\n    \"^\\s+(?&lt;num&gt;\\d+),(?&lt;text&gt;.+)\" {  # this matches the first line of non-preamble input\r\n        [ParsedOutput] @{\r\n            Number = $matches.num\r\n            Text = $matches.text\r\n        }\r\n        continue\r\n    }\r\n\r\n    \"^\\s+(?&lt;text&gt;[^,]+),(?&lt;num&gt;\\d+)\" { # this matches the second line of non-preamble input\r\n        [ParsedOutput] @{\r\n            Number = $matches.num\r\n            Text = $matches.text\r\n        }\r\n        continue\r\n    }\r\n}<\/pre>\n<\/div>\n<pre><code>Number Text\r\n------ ----\r\n    10 Some Text\r\n    20 Some other text\r\n<\/code><\/pre>\n<p>The pattern <code>[^,]+<\/code> in the <code>text<\/code> group in the code above is useful. It matches one or more characters that are not a comma <code>,<\/code>. We are using the <code>any-of<\/code> construct <code>[]<\/code>, and as the first character within those brackets, <code>^<\/code> changes meaning from <code>the beginning of the line<\/code> to <code>anything but<\/code>.<\/p>\n<p>That is useful when we are matching delimited fields. 
A requirement is that the delimiter cannot be part of the set of allowed field values.<\/p>\n<h2><a id=\"user-content-the-regex-class\" class=\"anchor\" href=\"#the-regex-class\"><\/a>The <code>regex<\/code> class<\/h2>\n<p><code>regex<\/code> is a type accelerator for <a href=\"https:\/\/docs.microsoft.com\/dotnet\/api\/system.text.regularexpressions.regex?view=netcore-2.2\" rel=\"nofollow\">System.Text.RegularExpressions.Regex<\/a>. It can be useful when porting code from C#, and sometimes when we want to get more control in situations when we have many matches of a capture group. It also allows us to pre-create the regular expressions which can matter in performance sensitive scenarios, and to specify a timeout.<\/p>\n<p>One instance where the <code>regex<\/code> class is needed is when you have multiple captures of a group.<\/p>\n<p>Consider the following:<\/p>\n<table>\n<thead>\n<tr>\n<th>Text<\/th>\n<th>Pattern<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>a,b,c,<\/code><\/td>\n<td><code>(\\w,)+<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>If the match operator is used, <code>$matches<\/code> will contain<\/p>\n<pre><code>Name                           Value\r\n----                           -----\r\n1                              c,\r\n0                              a,b,c,\r\n<\/code><\/pre>\n<p>The pattern matched three times, for <code>a,<\/code>, <code>b,<\/code> and <code>c,<\/code>. 
However, only the last capture is preserved in the <code>$matches<\/code> dictionary.\nThe following expression, in contrast, gives us all the captures of the group:<\/p>\n<div class=\"highlight highlight-source-powershell\">\n<pre class=\"lang:default decode:true\">[regex]::Match('a,b,c,', '(\\w,)+').Groups[1].Captures<\/pre>\n<\/div>\n<pre><code>Index Length Value\r\n----- ------ -----\r\n    0      2 a,\r\n    2      2 b,\r\n    4      2 c,\r\n<\/code><\/pre>\n<p>Below is an example that uses the members of the Regex class to parse input data:<\/p>\n<div class=\"highlight highlight-source-powershell\">\n<pre class=\"lang:default decode:true \">class ParsedOutput {\r\n    [int] $Number\r\n    [string] $Text\r\n\r\n    [string] ToString() { return \"{0} ({1})\" -f $this.Text, $this.Number }\r\n}\r\n\r\n$inputData =\r\n    \"    10,Some Text\",\r\n    \"    Some other text,20\"  # this text will not match\r\n\r\n[regex] $pattern = \"^\\s+(\\d+),(.+)\"\r\n\r\nforeach($d in $inputData){\r\n    $match = $pattern.Match($d)\r\n    if ($match.Success){\r\n        $number, $text = $match.Groups[1,2].Value\r\n        [ParsedOutput] @{\r\n            Number = $number\r\n            Text = $text\r\n        }\r\n    }\r\n    else {\r\n        Write-Warning \"regex: '$d' did not match pattern '$pattern'\"\r\n    }\r\n}<\/pre>\n<\/div>\n<pre><code>WARNING: regex: '    Some other text,20' did not match pattern '^\\s+(\\d+),(.+)'\r\nNumber Text\r\n------ ----\r\n    10 Some Text\r\n<\/code><\/pre>\n<p>It may surprise you that the warning appears before the output. PowerShell has a quite complex formatting system at the end of the pipeline, which treats pipeline output differently from other streams. Among other things, it buffers output at the beginning of a pipeline to calculate sensible column widths. 
This works well in practice, but sometimes gives strange reordering of output on different streams.<\/p>\n<h2><a id=\"user-content-summary\" class=\"anchor\" href=\"#summary\"><\/a>Summary<\/h2>\n<p>In this post we have looked at how the <code>-split<\/code> operator can be used to split a string into parts, how the <code>-match<\/code> operator can be used to extract different patterns from some text, and how the powerful <code>switch<\/code> statement can be used to match against multiple patterns.<\/p>\n<p>We ended by looking at the <code>regex<\/code> class, which in some cases provides a bit more control, but at the expense of ease of use. This concludes the second part of this series. Next time, we will look at a complete, real-world example of a switch-based parser.<\/p>\n<p>Thanks to Jason Shirk, Mathias Jessen and Steve Lee for reviews and feedback.<\/p>\n<p>Staffan Gustafsson, <a href=\"https:\/\/twitter.com\/staffangson\" rel=\"nofollow\">@StaffanGson<\/a>, <a href=\"https:\/\/github.com\/powercode\/\">powercode@github<\/a><\/p>\n<p><em>Staffan works at <a href=\"https:\/\/www.dice.se\" rel=\"nofollow\">DICE<\/a> in Stockholm, Sweden, as a Software Engineer and has been using PowerShell since the first public beta.<\/em>\n<em>He was most seriously pleased when PowerShell was open sourced, and has since contributed bug fixes, new features and performance improvements.<\/em>\n<em>Staffan is a speaker at <a href=\"http:\/\/www.psconf.eu\/\" rel=\"nofollow\">PSConfEU<\/a> and is always happy to talk PowerShell.<\/em><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>This is the second post in a three-part series. 
Part 1: Useful methods on the String class Introduction to Regular Expressions The Select-String cmdlet Part 2: the -split operator the -match operator the switch statement the Regex class Part 3: a real world, complete and slightly bigger, example of a switch-based parser The -split operator [&hellip;]<\/p>\n","protected":false},"author":685,"featured_media":13641,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-14625","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-powershell"],"acf":[],"blog_post_summary":"<p>This is the second post in a three-part series. Part 1: Useful methods on the String class Introduction to Regular Expressions The Select-String cmdlet Part 2: the -split operator the -match operator the switch statement the Regex class Part 3: a real world, complete and slightly bigger, example of a switch-based parser The -split operator 
[&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/posts\/14625","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/users\/685"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/comments?post=14625"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/posts\/14625\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/media\/13641"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/media?parent=14625"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/categories?post=14625"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/tags?post=14625"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}