{"id":14554,"date":"2019-01-18T10:56:48","date_gmt":"2019-01-18T18:56:48","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/powershell\/?p=14515"},"modified":"2022-11-16T12:32:33","modified_gmt":"2022-11-16T20:32:33","slug":"parsing-text-with-powershell-1-3","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/powershell\/parsing-text-with-powershell-1-3\/","title":{"rendered":"Parsing Text with PowerShell (1\/3)"},"content":{"rendered":"<div class=\"markdown-body\">\n<p>This is the first post in a three part series.<\/p>\n<ul>\n<li><em>Part 1<\/em>:\n<ul>\n<li>Useful methods on the String class<\/li>\n<li>Introduction to Regular Expressions<\/li>\n<li>The Select-String cmdlet<\/li>\n<\/ul>\n<\/li>\n<li><a href=\"https:\/\/devblogs.microsoft.com\/powershell\/parsing-text-with-powershell-2-3\/\">Part 2<\/a>:\n<ul>\n<li>The -split operator<\/li>\n<li>The -match operator<\/li>\n<li>The switch statement<\/li>\n<li>The Regex class<\/li>\n<\/ul>\n<\/li>\n<li><a href=\"https:\/\/devblogs.microsoft.com\/powershell\/parsing-text-with-powershell-3-3\/\">Part 3<\/a>:\n<ul>\n<li>A real world, complete and slightly bigger, example of a switch-based parser<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>A task that appears regularly in my workflow is text parsing. It may be about getting a token from a single line of text or about turning the text output of native tools into structured objects so I can leverage the power of PowerShell.<\/p>\n<p>I always strive to create structure as early as I can in the pipeline, so that later on I can reason about the content as properties on objects instead of as text at some offset in a string. This also helps with sorting, since the properties can have their correct type, so that numbers, dates etc. 
are sorted as such and not as text.<\/p>\n<p>There are a number of options available to a PowerShell user, and I&#8217;m giving an overview here of the most common ones.<\/p>\n<p>This is not a text about how to create a high performance parser for a language with a structured EBNF grammar. There are better tools available for that, for example <a href=\"https:\/\/www.antlr.org\/\" rel=\"nofollow\">ANTLR<\/a>.<\/p>\n<h2>.NET methods on the <code>string<\/code> class<\/h2>\n<p>Any treatment of string parsing in PowerShell would be incomplete if it didn&#8217;t mention the methods on the <code>string<\/code> class.\nThere are a few methods that I use more often than others when parsing strings:<\/p>\n<table>\n<thead>\n<tr>\n<th>Name<\/th>\n<th>Description<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>Substring(int startIndex)<\/code><\/td>\n<td>Retrieves a substring from this instance. The substring starts at a specified character position and continues to the end of the string.<\/td>\n<\/tr>\n<tr>\n<td><code>Substring(int startIndex, int length)<\/code><\/td>\n<td>Retrieves a substring from this instance. The substring starts at a specified character position and has a specified length.<\/td>\n<\/tr>\n<tr>\n<td><code>IndexOf(string value)<\/code><\/td>\n<td>Reports the zero-based index of the first occurrence of the specified string in this instance.<\/td>\n<\/tr>\n<tr>\n<td><code>IndexOf(string value, int startIndex)<\/code><\/td>\n<td>Reports the zero-based index of the first occurrence of the specified string in this instance. The search starts at a specified character position.<\/td>\n<\/tr>\n<tr>\n<td><code>LastIndexOf(string value)<\/code><\/td>\n<td>Reports the zero-based index of the last occurrence of the specified string in this instance. 
Often used together with <code>Substring<\/code>.<\/td>\n<\/tr>\n<tr>\n<td><code>LastIndexOf(string value, int startIndex)<\/code><\/td>\n<td>Reports the zero-based index position of the last occurrence of a specified string within this instance. The search starts at a specified character position and proceeds backward toward the beginning of the string.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>This is a small subset of the available methods. It may be well worth your time to read up on the string class since it is so fundamental in PowerShell.\nDocs are found <a href=\"https:\/\/docs.microsoft.com\/dotnet\/api\/system.string?view=netcore-2.2\" rel=\"nofollow\">here<\/a>.<\/p>\n<p>As an example, this can be useful when we have a very large amount of comma-separated input with 15 columns and we are only interested in the third column from the end. If we were to use the <code>-split ','<\/code> operator, we would create 15 new strings and an array for each line. On the other hand, using <code>LastIndexOf<\/code> on the input string a few times and then <code>Substring<\/code> to get the value of interest is faster and results in just one new string.<\/p>\n<pre class=\"lang:default decode:true\">function parseThirdFromEnd([string]$line){\r\n    $i = $line.LastIndexOf(\",\")             # get the last separator\r\n    $i = $line.LastIndexOf(\",\", $i - 1)     # get the second to last separator, also the end of the column we are interested in\r\n    $j = $line.LastIndexOf(\",\", $i - 1)     # get the separator before the column we want\r\n    $j++                                    # move forward past the separator\r\n    $line.Substring($j,$i-$j)               # get the text of the column we are looking for\r\n}<\/pre>\n<p>In this sample, I ignore that <code>IndexOf<\/code> and <code>LastIndexOf<\/code> return -1 if they cannot find the text to search for. 
From experience, I also know that it is easy to mess up the index arithmetic.\nSo while using these methods can improve performance, they are also more error-prone and take a lot more typing. I would only resort to this when I know the input data is very large and performance is an issue. So this is not a recommendation, or a starting point, but something to resort to.<\/p>\n<p>On rare occasions, I write the whole parser in C#. An example of this is in a module wrapping the Perforce version control system, where the command line tool can output Python dictionaries. It is a binary format, and the use case was complicated enough that I was more comfortable with a compiler-checked implementation language.<\/p>\n<h2>Regular Expressions<\/h2>\n<p>Almost all of the parsing options in PowerShell make use of regular expressions, so I will start with a short intro of some regular expression concepts that are used later in these posts.<\/p>\n<p>Regular expressions are very useful to know when writing simple parsers since they allow us to express patterns of interest and to capture text that matches those patterns.<\/p>\n<p>It is a very rich language, but you can get quite a long way by learning a few key parts. I&#8217;ve found <a href=\"https:\/\/www.regular-expressions.info\/\" rel=\"nofollow\">regular-expressions.info<\/a> to be a good online resource for more information. It is not written directly for the .NET regex implementation, but most of the information is valid across the different implementations.<\/p>\n<table>\n<thead>\n<tr>\n<th>Regex<\/th>\n<th>Description<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>*<\/code><\/td>\n<td>Zero or more of the preceding character. <code>a*<\/code> matches the empty string, <code>a<\/code>, <code>aa<\/code>, etc., but not <code>b<\/code>.<\/td>\n<\/tr>\n<tr>\n<td><code>+<\/code><\/td>\n<td>One or more of the preceding character. 
<code>a+<\/code> matches <code>a<\/code>, <code>aa<\/code>, etc., but not the empty string or <code>b<\/code>.<\/td>\n<\/tr>\n<tr>\n<td><code>.<\/code><\/td>\n<td>Matches any character<\/td>\n<\/tr>\n<tr>\n<td><code>[ax1]<\/code><\/td>\n<td>Any of <code>a<\/code>,<code>x<\/code>,<code>1<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>[a-d]<\/code><\/td>\n<td>Any of <code>a<\/code>,<code>b<\/code>,<code>c<\/code>,<code>d<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>\\w<\/code><\/td>\n<td>The <code>\\w<\/code> meta character is used to find a word character. A word character is a character from a-z, A-Z, 0-9, including the _ (underscore) character. It also matches other Unicode letters and digits, such as accented characters.<\/td>\n<\/tr>\n<tr>\n<td><code>\\W<\/code><\/td>\n<td>The inversion of <code>\\w<\/code>. Matches any non-word character<\/td>\n<\/tr>\n<tr>\n<td><code>\\s<\/code><\/td>\n<td>The <code>\\s<\/code> meta character is used to find white space<\/td>\n<\/tr>\n<tr>\n<td><code>\\S<\/code><\/td>\n<td>The inversion of <code>\\s<\/code>. Matches any non-whitespace character<\/td>\n<\/tr>\n<tr>\n<td><code>\\d<\/code><\/td>\n<td>Matches digits<\/td>\n<\/tr>\n<tr>\n<td><code>\\D<\/code><\/td>\n<td>The inversion of <code>\\d<\/code>. Matches non-digits<\/td>\n<\/tr>\n<tr>\n<td><code>\\b<\/code><\/td>\n<td>Matches a word boundary, that is, the position between a word character and a non-word character (such as a space).<\/td>\n<\/tr>\n<tr>\n<td><code>\\B<\/code><\/td>\n<td>The inversion of <code>\\b<\/code>. 
<code>er\\B<\/code> matches the <code>er<\/code> in <code>verb<\/code> but not the <code>er<\/code> in <code>never<\/code>.<\/td>\n<\/tr>\n<tr>\n<td><code>^<\/code><\/td>\n<td>The beginning of a line<\/td>\n<\/tr>\n<tr>\n<td><code>$<\/code><\/td>\n<td>The end of a line<\/td>\n<\/tr>\n<tr>\n<td><code>(&lt;expr&gt;)<\/code><\/td>\n<td>Capture groups<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Combining these, we can create a pattern like the one below to match a text like this:<\/p>\n<table>\n<thead>\n<tr>\n<th>Text<\/th>\n<th>Pattern<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>\"  42,Answer\"<\/code><\/td>\n<td><code>^\\s+\\d+,.+<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The above pattern can be written like this using the <code>x<\/code> (ignore pattern whitespace) modifier.<\/p>\n<p>Starting the regex with <code>(?x)<\/code> ignores whitespace in the pattern (whitespace to match has to be specified explicitly, with <code>\\s<\/code>) and also enables the comment character <code>#<\/code>.<\/p>\n<pre class=\"lang:default decode:true\">(?x)  # this regex ignores whitespace in the pattern. Makes it possible to document a regex with comments.\r\n^     # the start of the line\r\n\\s+   # one or more whitespace characters\r\n\\d+   # one or more digits\r\n,     # a comma\r\n.+    # one or more characters of any kind<\/pre>\n<p>By using capture groups, we make it possible to refer back to specific parts of a matched expression.<\/p>\n<table>\n<thead>\n<tr>\n<th>Text<\/th>\n<th>Pattern<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>\"  42,Answer\"<\/code><\/td>\n<td><code>^\\s+(\\d+),(.+)<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div><\/div>\n<\/div>\n<div class=\"highlight highlight-source-regexp\"><\/div>\n<div>\n<pre class=\"lang:default decode:true \">(?x)  # this regex ignores whitespace in the pattern. 
Makes it possible to document a regex with comments.\r\n^     # the start of the line\r\n\\s+   # one or more whitespace characters\r\n(\\d+) # capture one or more digits in the first group (index 1)\r\n,     # a comma\r\n(.+)  # capture one or more characters of any kind in the second group (index 2)<\/pre>\n<p><span style=\"color: inherit; font-family: inherit; font-size: 1.75rem;\">Naming regular expression groups<\/span><\/p>\n<\/div>\n<div class=\"markdown-body\">\n<p>There is a construct called <a href=\"https:\/\/www.regular-expressions.info\/named.html\" rel=\"nofollow\">named capturing groups<\/a>, <code>(?&lt;group_name&gt;pattern)<\/code>, that will create a capture group with a designated name.<\/p>\n<p>The regex above can be rewritten like this, which allows us to refer to the capture groups by name instead of by index.<\/p>\n<pre class=\"lang:default decode:true\">^\\s+(?&lt;num&gt;\\d+),(?&lt;text&gt;.+)<\/pre>\n<p>Different languages have implementation-specific ways of accessing the values of the captured groups. We will see later on in this series how it is done in PowerShell.<\/p>\n<h2>The Select-String cmdlet<\/h2>\n<p>The <code>Select-String<\/code> command is a workhorse, and is very powerful when you understand the output it produces.\nI use it mainly when searching for text in files, but occasionally also when looking for something in command output and similar.<\/p>\n<p>The key to being efficient with <code>Select-String<\/code> is to know how to get to the matched patterns in the output. 
In its internals, it uses the same <code>regex<\/code> class as the <code>-match<\/code> and <code>-split<\/code> operators, but instead of populating a global variable with the resulting groups, as <code>-match<\/code> does, it writes an object to the pipeline, with a <code>Matches<\/code> property that contains the results of the match.<\/p>\n<pre class=\"lang:default decode:true\">Set-Content twitterData.txt -value @\"\r\nLee, Steve-@Steve_MSFT,2992\r\nLee Holmes-13000 @Lee_Holmes\r\nStaffan Gustafsson-463 @StaffanGson\r\nTribbiani, Joey-@Matt_LeBlanc,463400\r\n\"@\r\n\r\n# extracting captured groups\r\nGet-ChildItem twitterData.txt |\r\n    Select-String -Pattern \"^(\\w+) ([^-]+)-(\\d+) (@\\w+)\" |\r\n    Foreach-Object {\r\n        $first, $last, $followers, $handle = $_.Matches[0].Groups[1..4].Value   # this is a common way of getting the groups of a call to select-string\r\n        [PSCustomObject] @{\r\n            FirstName = $first\r\n            LastName = $last\r\n            Handle = $handle\r\n            TwitterFollowers = [int] $followers\r\n        }\r\n    }<\/pre>\n<pre><code>FirstName LastName   Handle       TwitterFollowers\r\n--------- --------   ------       ----------------\r\nLee       Holmes     @Lee_Holmes             13000\r\nStaffan   Gustafsson @StaffanGson              463\r\n<\/code><\/pre>\n<h3>Support for Multiple Patterns<\/h3>\n<p>As we can see above, only half of the data matched the pattern passed to <code>Select-String<\/code>.<\/p>\n<p>A technique that I find useful is to take advantage of the fact that <code>Select-String<\/code> supports the use of multiple patterns.<\/p>\n<p>The lines of input data in <code>twitterData.txt<\/code> contain the same type of information, but they&#8217;re formatted in slightly different ways.\nUsing multiple patterns in combination with named capture groups makes it a breeze to extract the groups even when the positions of the groups differ.<\/p>\n<pre class=\"lang:default 
decode:true\">$firstLastPattern = \"^(?&lt;first&gt;\\w+) (?&lt;last&gt;[^-]+)-(?&lt;followers&gt;\\d+) (?&lt;handle&gt;@.+)\"\r\n$lastFirstPattern = \"^(?&lt;last&gt;[^\\s,]+),\\s+(?&lt;first&gt;[^-]+)-(?&lt;handle&gt;@[^,]+),(?&lt;followers&gt;\\d+)\"\r\nGet-ChildItem twitterData.txt |\r\n    Select-String -Pattern $firstLastPattern, $lastFirstPattern |\r\n    Foreach-Object {\r\n        # here we access the groups by name instead of by index\r\n        $first, $last, $followers, $handle = $_.Matches[0].Groups['first', 'last', 'followers', 'handle'].Value\r\n        [PSCustomObject] @{\r\n            FirstName = $first\r\n            LastName = $last\r\n            Handle = $handle\r\n            TwitterFollowers = [int] $followers\r\n        }\r\n    }<\/pre>\n<pre><code>FirstName LastName   Handle        TwitterFollowers\r\n--------- --------   ------        ----------------\r\nSteve     Lee        @Steve_MSFT               2992\r\nLee       Holmes     @Lee_Holmes              13000\r\nStaffan   Gustafsson @StaffanGson               463\r\nJoey      Tribbiani  @Matt_LeBlanc           463400\r\n<\/code><\/pre>\n<p>Breaking down the <code>$firstLastPattern<\/code> gives us<\/p>\n<pre class=\"lang:default decode:true\">(?x)                # this regex ignores whitespace in the pattern. Makes it possible to document a regex with comments.\r\n^                   # the start of the line\r\n(?&lt;first&gt;\\w+)       # capture one or more of any word characters into a group named 'first'\r\n\\s                  # a space\r\n(?&lt;last&gt;[^-]+)      # capture one or more of any characters but `-` into a group named 'last'\r\n-                   # a '-'\r\n(?&lt;followers&gt;\\d+)   # capture 1 or more digits into a group named 'followers'\r\n\\s                  # a space\r\n(?&lt;handle&gt;@.+)      # capture a '@' followed by one or more characters into a group named 'handle'<\/pre>\n<p>The second regex is similar, but with the groups in a different order. 
But since we retrieve the groups by name, we don&#8217;t have to care about the positions of the capture groups, and multiple assignment works fine.<\/p>\n<h3>Context around Matches<\/h3>\n<p><code>Select-String<\/code> also has a <code>Context<\/code> parameter which accepts an array of one or two numbers specifying the number of lines before and after a match that should be captured. All text parsing techniques in this post can be used to parse information from the context lines.\nThe result object has a <code>Context<\/code> property that returns an object with <code>PreContext<\/code> and <code>PostContext<\/code> properties, both of the type <code>string[]<\/code>.<\/p>\n<p>This can be used to get the lines immediately before and after a match:<\/p>\n<pre class=\"lang:default decode:true \"># using the context property\r\nGet-ChildItem twitterData.txt |\r\n    Select-String -Pattern \"Staffan\" -Context 2,1 |\r\n    Foreach-Object { $_.Context.PreContext[1], $_.Context.PostContext[0] }<\/pre>\n<pre><code>Lee Holmes-13000 @Lee_Holmes\r\nTribbiani, Joey-@Matt_LeBlanc,463400\r\n<\/code><\/pre>\n<p>To understand the indexing of the Pre- and PostContext arrays, consider the following:<\/p>\n<pre><code>Lee, Steve-@Steve_MSFT,2992                  &lt;- PreContext[0]\r\nLee Holmes-13000 @Lee_Holmes                 &lt;- PreContext[1]\r\nStaffan Gustafsson-463 @StaffanGson          &lt;- Pattern matched this line\r\nTribbiani, Joey-@Matt_LeBlanc,463400         &lt;- PostContext[0]\r\n<\/code><\/pre>\n<p>The pipeline support of <code>Select-String<\/code> makes it different from the other parsing tools available in PowerShell, and makes it the undisputed king of one-liners.<\/p>\n<p>I would like to stress how much more useful <code>Select-String<\/code> becomes once you understand how to get to the parts of the matches.<\/p>\n<h2>Summary<\/h2>\n<p>We have looked at useful methods of the string class, especially how to use <code>Substring<\/code> to get to text at a specific offset. 
We also looked at regular expressions, a language used to describe patterns in text, and at the <code>Select-String<\/code> cmdlet, which makes heavy use of regular expressions.<\/p>\n<p>Next time, we will look at the operators <code>-split<\/code> and <code>-match<\/code>, the switch statement (which is surprisingly useful for text parsing), and the regex class.<\/p>\n<p>Staffan Gustafsson, <a href=\"https:\/\/twitter.com\/staffangson\" rel=\"nofollow\">@StaffanGson<\/a>, <a href=\"https:\/\/github.com\/powercode\/\">github<\/a><\/p>\n<p>Thanks to Jason Shirk, Mathias Jessen and Steve Lee for reviews and feedback.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>This is the first post in a three part series. Part 1: Useful methods on the String class Introduction to Regular Expressions The Select-String cmdlet Part 2: The -split operator The -match operator The switch statement The Regex class Part 3: A real world, complete and slightly bigger, example of a switch-based parser A task [&hellip;]<\/p>\n","protected":false},"author":685,"featured_media":13641,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-14554","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-powershell"],"acf":[],"blog_post_summary":"<p>This is the first post in a three part series. 
Part 1: Useful methods on the String class Introduction to Regular Expressions The Select-String cmdlet Part 2: The -split operator The -match operator The switch statement The Regex class Part 3: A real world, complete and slightly bigger, example of a switch-based parser A task [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/posts\/14554","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/users\/685"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/comments?post=14554"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/posts\/14554\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/media\/13641"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/media?parent=14554"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/categories?post=14554"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/tags?post=14554"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}