Converting string output to objects

Sean

In my previous post, I talked about using Crescendo to create a PowerShell module for the vssadmin.exe command in Windows. As I explained, you have to write Output Handler code that parses the output of the command you are using. But if you have never written a parser like this, where do you start?

In this post I show you how to parse the output from the netstat command. The output of netstat is not very complex and it is basically the same on Windows and Linux systems. The goal here is to talk about parsing strategies that you can use to create a full Crescendo module.

Step 1 – Capture the output

To create a parser you have to capture the output so you can analyze it deeply enough to understand the structure. Capturing the output is easy.

netstat > netstat-output.txt
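
Once the output is in a file, it can be read back as an array of lines for analysis. The following is a minimal sketch: the sample text stands in for real netstat output, and the temp-file location is an assumption made so the example does not depend on a live system.

```powershell
# Stand-in for real command output. On a live system you would run
# `netstat > netstat-output.txt` instead of writing this sample text.
$sample = @(
    ''
    'Active Connections'
    ''
    '  Proto  Local Address          Foreign Address        State'
    '  TCP    0.0.0.0:80             0.0.0.0:0              LISTENING'
)

# Hypothetical capture file in the temp directory.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'netstat-output.txt'
$sample | Set-Content -Path $path

# Get-Content returns one string per line, so the capture can be indexed.
$captured = @(Get-Content -Path $path)
$captured.Count
$captured[3]
```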

Step 2 – Analyze the output

The goal of this analysis is to isolate the important data points. There are several questions you want to answer as you look at the captured output.

  • What are the individual data points being displayed?
  • How is the data labeled?
  • What information needs to be parsed and what can be ignored?
  • What repeating patterns exist in the output?
    • Look for delimiters and labels
  • Does the data format change? What formats must be handled?
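
One way to start answering these questions is to count the fields on each captured line, since repeated counts tend to reveal the rows that share a format. This is only a sketch under assumptions: the sample lines stand in for a real capture, and $report is an illustrative variable name.

```powershell
# Sample lines standing in for captured netstat output.
$lines = @(
    ''
    'Active Connections'
    ''
    '  Proto  Local Address          Foreign Address        State'
    '  TCP    0.0.0.0:80             0.0.0.0:0              LISTENING'
    '  TCP    [::1]:5000             [::1]:61001            ESTABLISHED'
)

# Record the index and the number of space-separated fields for every line.
$report = for ($i = 0; $i -lt $lines.Count; $i++) {
    $fields = @($lines[$i] -split ' ' | Where-Object { $_ })
    [pscustomobject]@{
        Index      = $i
        FieldCount = $fields.Count
        Text       = $lines[$i]
    }
}

$report | Format-Table -AutoSize
```

Lines that share a field count are good candidates for data rows. Lines with a different count, such as titles and headers, need separate handling or can be skipped.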

netstat output

Here are my observations about the output from netstat.

  1. There is only one set of header lines. The output is not divided into multiple sections with different headers. The column headers contain spaces in the column names, which makes them more difficult to parse.
  2. The output is presented as a table. The columns are labeled (Proto, Local Address, Foreign Address, State). Each row of the table is formatted the same with spaces separating the columns.
  3. The Address columns contain a mix of IP Addresses and Host names, both with ports. The ports can be numeric or text.
  4. The IP Addresses can be formatted as IPv4 or IPv6 addresses. The IPv6 addresses are enclosed in brackets ([]).
  5. There are no space characters inside the data columns but there are colon characters. This makes the space character a good delimiter, as opposed to the colon.
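
The last observation can be checked quickly by splitting one row both ways. The row below is a made-up example in the Windows format, not taken from a real capture.

```powershell
# Hypothetical data row containing IPv6 addresses.
$row = '  TCP    [2603:1036:303:3000::2]:61001  [::1]:443   ESTABLISHED'

# Splitting on spaces (and dropping empty fields) yields one field per column.
$bySpace = @($row -split ' ' | Where-Object { $_ })

# Splitting on colons cuts the IPv6 addresses into many fragments.
$byColon = @($row -split ':')

"Fields when splitting on space: $($bySpace.Count)"
"Fields when splitting on colon: $($byColon.Count)"
```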

Now I can start writing code for the parser.

Step 3 – Write the parser

From my analysis, I can tell that I am really only interested in rows of data. I don’t care about the table header because it never changes. So I can just ignore it. The first row of data starts after the table header. There are two ways to get data passed to your parsing function:

  • Streaming data on the pipeline

    If I am streaming data, my parser function must accept input from the pipeline. I must look for the header line, then start parsing the data on the next line.

  • Passing the entire output from the command as the value for a parameter

    If the data is passed in as a single object, then I can just skip to the first line of data to begin parsing. This is the method I am going to use in this example.
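
For comparison, here is a rough sketch of what the streaming approach could look like. It is not the method used in the rest of this post. The function name ConvertFrom-NetstatStream is hypothetical, and the sketch assumes the header line always begins with Proto.

```powershell
function ConvertFrom-NetstatStream {
    [CmdletBinding()]
    param(
        [Parameter(ValueFromPipeline)]
        [AllowEmptyString()]
        [string]$Line
    )
    begin {
        # Becomes $true once the header line has been seen.
        $inTable = $false
    }
    process {
        if (-not $inTable) {
            # Assumption: the table header is the line that starts with "Proto".
            if ($Line -match '^\s*Proto\s') { $inTable = $true }
            return
        }
        $columns = @($Line -split ' ' | Where-Object { $_ })
        # Blank lines and stray text yield fewer fields and are skipped.
        if ($columns.Count -ge 3) {
            [pscustomobject]@{
                Protocol      = $columns[0]
                LocalAddress  = $columns[1]
                RemoteAddress = $columns[2]
                State         = $columns[3]
            }
        }
    }
}

# Sample lines standing in for live output, for example: netstat | ConvertFrom-NetstatStream
$sample = @(
    ''
    'Active Connections'
    ''
    '  Proto  Local Address          Foreign Address        State'
    '  TCP    0.0.0.0:80             0.0.0.0:0              LISTENING'
    '  TCP    [::1]:5000             [::1]:61001            ESTABLISHED'
)
$rows = @($sample | ConvertFrom-NetstatStream)
$rows
```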

Getting the output of netstat into a variable is simple. In this example, it returns an array of 440 lines of text. We know from our analysis that the table header is on the fourth line (third line for Linux). Because arrays are zero-based, that is $lines[3] on Windows, and the data starts on the next line.

PS> $lines = netstat -a
PS> $lines.count
440
PS> $lines[3]
  Proto  Local Address          Foreign Address        State
PS> $lines[4]
  TCP    0.0.0.0:80             0.0.0.0:0              LISTENING
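
Rather than relying on a fixed position, the parser could also search for the header. The following is a minimal sketch, assuming the header line always begins with Proto; the sample lines stand in for real netstat output.

```powershell
# Sample lines standing in for: $lines = netstat -a
$lines = @(
    ''
    'Active Connections'
    ''
    '  Proto  Local Address          Foreign Address        State'
    '  TCP    0.0.0.0:80             0.0.0.0:0              LISTENING'
    '  TCP    [::1]:5000             [::1]:61001            ESTABLISHED'
)

# Find the index of the first line that looks like the table header.
$headerIndex = -1
for ($i = 0; $i -lt $lines.Count; $i++) {
    if ($lines[$i] -match '^\s*Proto\s') {
        $headerIndex = $i
        break
    }
}

# Everything after the header is treated as data.
$dataLines = @()
if ($headerIndex -ge 0) {
    $dataLines = @($lines | Select-Object -Skip ($headerIndex + 1))
}

$headerIndex
$dataLines
```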

To parse the rows into the individual columns of data we need to use the space character as a delimiter to split the line. Since the number of spaces between columns is variable, the split operation creates empty fields between the data. We can filter those empty fields out with a Where-Object. For example:

$columns = ($lines[4] -split ' ').Trim() | Where-Object {$_ }
$columns
TCP
0.0.0.0:80
0.0.0.0:0
LISTENING

In this example, the Trim() method trims off leading and trailing spaces. This ensures that the fields between the columns become empty strings.
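
There is more than one way to get the same four fields. The sketch below compares the approach above with two alternatives that I believe are equivalent for this kind of row: a regex split on runs of whitespace, and the unary form of -split, which splits on whitespace and ignores leading and trailing whitespace. The variable names are illustrative only.

```powershell
$row = '  TCP    0.0.0.0:80             0.0.0.0:0              LISTENING'

# Approach used in this post: split on single spaces, then drop empty fields.
$a = @(($row -split ' ').Trim() | Where-Object { $_ })

# Trim the ends first, then split on runs of whitespace.
$b = @($row.Trim() -split '\s+')

# Unary -split: whitespace-delimited, no empty leading or trailing fields.
$c = @(-split $row)

$a -join ','
$b -join ','
$c -join ','
```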

Step 4 – Output the object

The only thing left to do now is to create a PowerShell object that contains the parsed data. Let’s put this all together.

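function parseNetstat {
    param([object[]]$Lines)

    # Number of lines to skip before the first data row (see the analysis above).
    # $IsWindows is only defined in PowerShell 6 and later, so also check for
    # the 'Desktop' edition reported by Windows PowerShell 5.1.
    if ($IsWindows -or $PSVersionTable.PSEdition -eq 'Desktop') {
        $skip = 4
    } else {
        $skip = 3
    }

    $Lines | Select-Object -Skip $skip | ForEach-Object {
        # Split on spaces, then drop the empty fields created by runs of spaces.
        $columns = ($_ -split ' ').Trim() | Where-Object {$_ }
        [pscustomobject]@{
            Protocol = $columns[0]
            LocalAddress = $columns[1]
            RemoteAddress = $columns[2]
            State = $columns[3]
        }
    }
}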

parseNetstat (netstat) | Select-Object -Last 10

For this example, I limit the output to the last 10 rows.

Protocol LocalAddress                                   RemoteAddress                 State
-------- ------------                                   -------------                 -----
TCP      [2600:6c56:7e00:78d:e1e8:756c:d2be:42da]:61001 [2603:1036:303:3000::2]:https TIME_WAIT
TCP      [2600:6c56:7e00:78d:e1e8:756c:d2be:42da]:61018 [2603:1030:408::401]:https    ESTABLISHED
TCP      [2600:6c56:7e00:78d:e1e8:756c:d2be:42da]:61293 [2603:1036:303:3000::2]:https ESTABLISHED
TCP      [2600:6c56:7e00:78d:e1e8:756c:d2be:42da]:62640 [2603:1036:303:3c33::2]:https ESTABLISHED
TCP      [2600:6c56:7e00:78d:e1e8:756c:d2be:42da]:62643 [2603:1036:303:3c04::2]:https ESTABLISHED
TCP      [2600:6c56:7e00:78d:e1e8:756c:d2be:42da]:62659 [2603:1036:303:3050::2]:https TIME_WAIT
TCP      [2600:6c56:7e00:78d:e1e8:756c:d2be:42da]:64886 ord37s36-in-x0d:https         ESTABLISHED
TCP      [2600:6c56:7e00:78d:e1e8:756c:d2be:42da]:64887 [2603:1036:404:8e::2]:https   TIME_WAIT
TCP      [2600:6c56:7e00:78d:e1e8:756c:d2be:42da]:64901 [2620:1ec:21::18]:https       ESTABLISHED
TCP      [2600:6c56:7e00:78d:e1e8:756c:d2be:42da]:65492 ord30s21-in-x0e:https         TIME_WAIT

Success! I have now converted the text output to PowerShell objects. At this point, this is enough to become an Output Handler for a Crescendo module.

If we want to get fancier, we can parse the address columns into the IP Address and the Port. That data is in $columns[1] and $columns[2]. To separate the Port from the IP Address we have to determine if the address is IPv4 or IPv6. The following code handles this:

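# No bracket: IPv4 address or host name, so split at the colon.
# Bracket: IPv6 address, so split at ']:' and trim the leading '['.
# Note: .Split(']:') splits on the two-character string in PowerShell 7+.
# Windows PowerShell 5.1 treats the argument as the separate characters ']' and ':'.
if ($columns[1].IndexOf('[') -lt 0) {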
    $laddr = $columns[1].Split(':')[0]
    $lport = $columns[1].Split(':')[1]
} else {
    $laddr = $columns[1].Split(']:')[0].Trim('[')
    $lport = $columns[1].Split(']:')[1]
}
if ($columns[2].IndexOf('[') -lt 0) {
    $raddr = $columns[2].Split(':')[0]
    $rport = $columns[2].Split(':')[1]
} else {
    $raddr = $columns[2].Split(']:')[0].Trim('[')
    $rport = $columns[2].Split(']:')[1]
}

First I check whether the column contains an opening bracket character ([). If it doesn’t, I can split the string at the colon character (:). Otherwise, I need to split it at the string ']:' and also trim off the opening bracket.
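
The same logic appears twice, once per address column, so it could be moved into a small helper. The sketch below is an alternative, not the code used in this post: it splits at the last colon, which handles both the bracketed and unbracketed forms in one code path. The function name Split-HostPort is hypothetical.

```powershell
function Split-HostPort {
    param([string]$Endpoint)

    # The port always follows the last colon, even for IPv6 addresses.
    $i = $Endpoint.LastIndexOf(':')
    if ($i -lt 0) {
        # No colon found: return the text unchanged with no port.
        return [pscustomobject]@{ Address = $Endpoint; Port = $null }
    }

    [pscustomobject]@{
        # Remove the brackets that surround IPv6 addresses.
        Address = $Endpoint.Substring(0, $i).TrimStart('[').TrimEnd(']')
        Port    = $Endpoint.Substring($i + 1)
    }
}

$v4   = Split-HostPort '0.0.0.0:80'
$v6   = Split-HostPort '[2603:1036:303:3000::2]:https'
$name = Split-HostPort 'ord37s36-in-x0d:https'
$v4, $v6, $name
```

Inside the function, the helper would be called once for $columns[1] and once for $columns[2].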

After adding that code to the function I can now filter on the port information. For example:

parseNetstat (netstat) |
    Where-Object {$_.RemotePort -eq 'https' -and $_.State -eq 'ESTABLISHED'} |
    Select-Object LocalAddress, LocalPort, RemoteAddress, RemotePort -Last 10

LocalAddress                           LocalPort RemoteAddress         RemotePort
------------                           --------- -------------         ----------
2600:6c56:7e00:78d:e1e8:756c:d2be:42da 55643     2603:1036:303:3c1d::2 https
2600:6c56:7e00:78d:e1e8:756c:d2be:42da 59703     2620:1ec:21::18       https
2600:6c56:7e00:78d:e1e8:756c:d2be:42da 59708     2603:1036:303:3c1d::2 https
2600:6c56:7e00:78d:e1e8:756c:d2be:42da 60834     2620:1ec:42::132      https
2600:6c56:7e00:78d:e1e8:756c:d2be:42da 60835     2603:1036:303:3c1c::2 https
2600:6c56:7e00:78d:e1e8:756c:d2be:42da 61018     2603:1030:408::401    https
2600:6c56:7e00:78d:e1e8:756c:d2be:42da 61293     2603:1036:303:3000::2 https
2600:6c56:7e00:78d:e1e8:756c:d2be:42da 61399     ord30s21-in-x03       https
2600:6c56:7e00:78d:e1e8:756c:d2be:42da 65025     2603:1036:303:3c0a::2 https
2600:6c56:7e00:78d:e1e8:756c:d2be:42da 65053     2603:1036:303:3c0c::2 https
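
Because ports can be numeric or text, the parsed port values are strings. If you want to compare them as numbers, one option is the -as operator, which returns $null when the conversion fails. This is a small sketch with made-up rows, not output from the function above.

```powershell
# Hypothetical parsed rows; the port values are strings, as produced by Split().
$rows = @(
    [pscustomobject]@{ LocalPort = '61018'; RemotePort = 'https' }
    [pscustomobject]@{ LocalPort = '80';    RemotePort = '0' }
    [pscustomobject]@{ LocalPort = '5000';  RemotePort = '443' }
)

# Keep rows whose local port is a number of 1024 or higher.
$highPorts = @($rows | Where-Object { ($_.LocalPort -as [int]) -ge 1024 })

# Keep rows whose remote port is a service name rather than a number.
$namedPorts = @($rows | Where-Object { $null -eq ($_.RemotePort -as [int]) })

$highPorts
$namedPorts
```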

Conclusion

Writing the output parser is the hardest part of wrapping a native command, whether you are using Crescendo or not. In this post I have used a few simple techniques for extracting data from the strings. In my next blog post I will take a closer look at a more complex parsing example that I wrote for my VssAdmin module.

If you are interested in the final version of the script in this post, you can find it in this GitHub Gist.

5 comments

  • Luis Palacio

    Wouldn’t ConvertFrom-String be more apt for this task? You can even give it a template so PS can learn the data types/formatting.

    Not criticizing your post, just wondering if that wouldn’t be easier.

    • Sean Wheeler (Microsoft employee)

      Personally, I find the creation of that template to be more work. Another consideration is that ConvertFrom-String is only available in PowerShell 5.1, not PowerShell 7+. But you are not wrong, it could do that job. There are a lot of tools available for parsing strings. I use some of these in my VssAdmin module. And I will talk about some examples in more detail in my next blog post.

      Here are some of the other ways you can work with strings.

      String class members

      • Length
      • IndexOf / LastIndexOf
      • Remove
      • Replace
      • Split
      • StartsWith
      • Substring
      • Conversion methods – To*()
      • Trimming methods – Trim*()

      Available cmdlets

      • Join-String
      • Select-String
      • Out-String
      • ConvertFrom-StringData
      • Convert-String (5.1 only)
      • ConvertFrom-String (5.1 only)

      Operators

      • -like
      • -match
      • -eq
      • -split
      • -join
      • -replace

      Using the switch statement

      • simple match
      • wildcard match
      • regex match
  • Guy Leech

    I’d use regex with matching groups to tokenize into $Matches. Despite the urban myths, regex for this kind of thing is fairly straightforward.

  • Joakim Svendsen

    Useful post for many people, I’m sure. Thanks.

    You can simplify this line:
    $columns = ($_ -split ' ').Trim() | Where-Object {$_ }
    into:
    $Columns = $_ -split ' +'

    “+” there is a regex quantifier for the preceding space. The “+” quantifier means “one or more instances of the preceding character/group/element, more is better” (“more is better” is often referred to as “greedy matching” – unless the + is followed by a question mark, which makes it match as few as possible while still achieving an overall regex match while including the remaining regex parts, if a match is possible). It saves you the Where-Object filter afterwards and is more elegant (in my opinion 🙂).

    The Trim() call is unneeded on my Ubuntu Linux computer with PowerShell 7. It could arguably be a best practice giving robust code, but not strictly necessary for this case based on my testing.

    About Guy Leech’s comment about regex: I agree. In this case, the regex could be written as ^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$. Anchoring with ^ and $ provides further validation of the data structure, because regular expressions in .NET/PowerShell match partially by default (a complete match is not required). But then you might realize that this is logically equivalent to a -split on (white)space. You can use -split '\s+' to support tabs as well as spaces, or you could use the character class [ \t] followed by a + ("[ \t]+") to split on mixed tabs and spaces. The character class “\s” also matches newlines (\n), \f, \v, \r, and maybe some others I don’t know about or have forgotten.

    The “\s*” instances, at the beginning and end, in the basic, example regex use the asterisk quantifier, which is similar to +, except it means zero or more, instead of one or more. This means it always matches (even if there is no “\s” there). These two regex elements trim leading and trailing whitespace like the Trim() call in Sean’s post – if you were to use that approach.

    I am not familiar with this Crescendo module. It looks worth checking out. I wrote a lot of code to parse (un)structured data into objects over the last decade. Wrote quite a bit about it on my blog http://www.powershelladmin.com.

    Including some links since the content is quite relevant to this “parse strings into objects” topic.

    This is quite elegant:
    https://www.powershelladmin.com/wiki/Get_Linux_disk_space_report_in_PowerShell.php

    This one is quite nice:
    https://www.powershelladmin.com/wiki/Parse_schtasks.exe_Output_with_PowerShell.php

    https://www.powershelladmin.com/wiki/Parse_PsLoggedOn.exe_Output_with_PowerShell.php

    https://www.powershelladmin.com/wiki/Parse_openssl_certificate_date_output_into_.NET_DateTime_objects.php

    By the way, my netstat output on Ubuntu 20.04 as of 2021-11-18 is different. The headers are:

    netstat | select -skip 2 -first 10 | %{$t = $_ -split ' +'; [pscustomobject]@{Protocol = $t[0]; ReceiveQueue = $t[1]; SendQueue = $t[2]; LocalAddress = $t[3]; RemoteAddress = $t[4]; State = $t[5]} } | ft -a

    Protocol ReceiveQueue SendQueue LocalAddress       RemoteAddress       State
    -------- ------------ --------- ------------       -------------       -----
    tcp      0            0         joakim-buntu:58626 server-xxxxx-:https TIME_WAIT
    tcp      0            0         joakim-buntu:34792 li1.members.l:ssh   ESTABLISHED
    tcp      0            0         joakim-buntu:35456 lb-140-xxx6-:https  ESTABLISHED

    Be well. 🙂