{"id":861,"date":"2023-01-30T12:23:34","date_gmt":"2023-01-30T20:23:34","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/powershell-community\/?p=861"},"modified":"2023-03-30T09:04:21","modified_gmt":"2023-03-30T16:04:21","slug":"mastering-the-steppable-pipeline","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/powershell-community\/mastering-the-steppable-pipeline\/","title":{"rendered":"Mastering the (steppable) pipeline"},"content":{"rendered":"<h2>Mastering the (steppable) pipeline<\/h2>\n<p>Before stepping into the <em>steppable<\/em> pipeline, it is essential that you have a good understanding of how <em>and when<\/em> exactly items are processed by a <a href=\"https:\/\/learn.microsoft.com\/powershell\/scripting\/developer\/cmdlet\/cmdlet-overview\">cmdlet<\/a> in the pipeline. The PowerShell pipeline might just look like syntactical sugar but it is a lot more than that. In fact, it really <em>acts<\/em> like a pipeline where each item flows through and is handled by each cmdlet one-at-a-time. In comparison to the pipes in CMD, PowerShell streams <em>objects<\/em> through the pipeline rather than plain text.<\/p>\n<h2>One-at-a-time <code>process<\/code><\/h2>\n<p>The following explanation describes the <strong>one-at-a-time processing<\/strong> section of the <a href=\"https:\/\/learn.microsoft.com\/powershell\/module\/microsoft.powershell.core\/about\/about_pipelines#one-at-a-time-processing\">About pipelines<\/a> document. A good analogy of the pipeline is a physical assembly line where each consecutive station on the line could be compared with a PowerShell cmdlet. At a specific station and time, some something is done to one item while the next item is prepared at the prior station. For example, at station 2 a component is soldered to the assembly while the next item is being unpacked at station 1. 
Items iterate through the pipeline like:<\/p>\n<p><strong>Iteration: <code>n<\/code><\/strong><\/p>\n<pre><code> item 3  --&gt; item 2  --&gt; item 1\nStation 1 | Station 2 | Station 3\n<\/code><\/pre>\n<p><strong>Iteration: <code>n + 1<\/code><\/strong><\/p>\n<pre><code> item 4  --&gt; item 3  --&gt; item 2\nStation 1 | Station 2 | Station 3\n<\/code><\/pre>\n<p>Cmdlets act like stations in the assembly line. Take a simple example:<\/p>\n<pre><code class=\"language-PowerShell\">Get-Content .\\Input.txt | Foreach-Object { $_ } | Set-Content .\\Output.txt<\/code><\/pre>\n<p>In this example the <code>Foreach-Object { $_ }<\/code> cmdlet does nothing more than:<\/p>\n<ul>\n<li>picking up each item from the pipeline that has been output by the prior cmdlet <code>Get-Content .\\Input.txt<\/code><\/li>\n<li>placing it back on the pipeline as an input for the next cmdlet <code>Set-Content .\\Output.txt<\/code>.<\/li>\n<\/ul>\n<p>To visualize the order of the items that go through the <code>Foreach-Object { $_ }<\/code> cmdlet you might use the <code>Trace-Command<\/code> cmdlet, but that might overwhelm you with data. Instead, using two simple <code>ForEach-Object<\/code> (alias <code>%<\/code>) test commands shows you exactly where your measurement points are and what goes into and comes out of the specific cmdlet in between.<\/p>\n<ul>\n<li><code>%{Write-Host 'In:' $_; $_ }<\/code><\/li>\n<li><code>%{Write-Host 'Out:' $_; $_ }<\/code><\/li>\n<\/ul>\n<p>Notice that <code>...; $_ }<\/code> at the end of the command will place the current item back on the pipeline. 
In the following example, the cmdlet at the start of the pipeline (<code>Get-Content .\\Input.txt<\/code>) has been replaced with 4 hardcoded input items (<code>1,2,3,4<\/code>) and the cmdlet at the end of the pipeline (<code>Set-Content .\\Output.txt<\/code>) with <code>Out-Null<\/code>, which simply discards the actual output of the pipeline so that only the two test commands produce output.<\/p>\n<pre><code class=\"language-PowerShell\">1,2,3,4 | %{Write-Host 'In:' $_; $_ } |\n    Foreach-Object { $_ } |\n    %{Write-Host 'Out:' $_; $_ } |\n    Out-Null<\/code><\/pre>\n<p>This shows the following output:<\/p>\n<pre><code class=\"language-Console\">In: 1\nOut: 1\nIn: 2\nOut: 2\nIn: 3\nOut: 3\nIn: 4\nOut: 4<\/code><\/pre>\n<p>This proves that each item flows out of the pipeline (<code>Out: 1<\/code>) before the next item (<code>In: 2<\/code>) is injected into it. As you can imagine, this conserves memory as there are only a few items in the pipeline at any time.<\/p>\n<h2>Choking the pipeline<\/h2>\n<p>The previous section explains how a cmdlet would perform if correctly implemented for the middle of a pipeline, but there are a few statements that might &#8220;<strong>choke<\/strong>&#8221; the pipeline, meaning that the items are no longer processed <strong>one-at-a-time<\/strong> but piled up in memory and eventually processed <strong>all-at-once<\/strong>. 
This happens for:<\/p>\n<ul>\n<li>\n<p><strong>Assigning the pipeline to a variable<\/strong>:<\/p>\n<pre><code class=\"language-PowerShell\">$Content = Get-Content .\\Input.txt | Foreach-Object { $_ }\n$Content | Set-Content .\\Output.txt<\/code><\/pre>\n<\/li>\n<li>\n<p><strong>Using parentheses<\/strong>:<\/p>\n<pre><code class=\"language-PowerShell\">(Get-Content .\\Data.txt | Foreach-Object { $_ }) | Set-Content .\\Data.txt<\/code><\/pre>\n<\/li>\n<li>\n<p><strong>Some cmdlets might choke the pipeline by design:<\/strong><\/p>\n<p>In general, a well defined cmdlet should write single records to the pipeline. See the <a href=\"https:\/\/learn.microsoft.com\/powershell\/scripting\/developer\/cmdlet\/strongly-encouraged-development-guidelines\">Strongly Encouraged Development Guidelines<\/a> article.<\/p>\n<p>Yet this is not always possible. Take, for example, the <code>Sort-Object<\/code> cmdlet, which is supposed to sort an object collection. This might result in a new list where the last item ends up first. To determine what item comes first, you must collect all items before they can be sorted. This is visible from the simple test commands used before:<\/p>\n<pre><code class=\"language-PowerShell\">1,2,3,4 | %{Write-Host 'In:' $_; $_ } | Sort-Object | %{Write-Host 'Out:' $_; $_ } | Out-Null<\/code><\/pre>\n<p>This shows the following output:<\/p>\n<pre><code class=\"language-Console\">In: 1\nIn: 2\nIn: 3\nIn: 4\nOut: 1\nOut: 2\nOut: 3\nOut: 4<\/code><\/pre>\n<\/li>\n<\/ul>\n<p>In general, you should avoid choking the pipeline, but there are a few exceptions where it might be required. For example, where you want to read and write back to the same file, as in the previous &#8220;using parentheses&#8221; example.<\/p>\n<p>In a smooth pipeline, each item is processed one-at-a-time, meaning that <code>Get-Content<\/code> and <code>Set-Content<\/code> are concurrently processing items in the pipeline. 
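For example, running the earlier &#8220;using parentheses&#8221; command <em>without<\/em> the parentheses makes both cmdlets touch <code>.\\Data.txt<\/code> at the same time:<\/p>\n<pre><code class=\"language-PowerShell\">Get-Content .\\Data.txt | Foreach-Object { $_ } | Set-Content .\\Data.txt<\/code><\/pre>\n<p>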
This causes the following error:<\/p>\n<blockquote>\n<p>The process cannot access the file &#8216;.\\Data.txt&#8217; because it is being used by another process.<\/p>\n<\/blockquote>\n<p>In this situation, choking the pipeline and reading the complete file first avoids the error.<\/p>\n<h3>Heavy objects<\/h3>\n<p>Objects in the PowerShell pipeline contain more than just the value of the item. They also include metadata, such as the name and type of the item and of each of its properties. Take, for example, the .NET <code>DataTable<\/code> object. The header of a <code>DataTable<\/code> object contains the column (property) names and types where each row in the <code>DataTable<\/code> only contains the value of each column. If you convert a <code>DataTable<\/code> into a list of PowerShell objects, like:<\/p>\n<pre><code class=\"language-PowerShell\">$Data = $DataTable | Foreach-Object { $_ }<\/code><\/pre>\n<p>PowerShell converts each row into a new object, duplicating the header information for each row. Memory usage increases considerably, even if each value is just a few bytes. This extra overhead shouldn&#8217;t be an issue if you stream the objects through the pipeline because there will only be a few objects in the pipeline at any time.<\/p>\n<h3>Missing properties<\/h3>\n<p>Nevertheless, there is a pitfall in using the pipeline. Consider the following two objects being output to a table:<\/p>\n<pre><code class=\"language-PowerShell\">$a = [pscustomobject]@{ name='John'; address='home'}\n$b = [pscustomobject]@{ name='Jane'; phone='123'}\n$a, $b | Format-Table<\/code><\/pre>\n<p>Results:<\/p>\n<pre><code class=\"language-Console\">name address\n---- -------\nJohn home\nJane<\/code><\/pre>\n<p>Notice that there is no <code>phone<\/code> column, meaning that the <code>phone='123'<\/code> property is missing from the results. This is due to the one-at-a-time processing. 
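If you need all columns regardless of the input order, one possible workaround (a sketch, assuming the full property set is known up front) is to select the properties explicitly so that every object carries all of them:<\/p>\n<pre><code class=\"language-PowerShell\">$a, $b | Select-Object name, address, phone | Format-Table<\/code><\/pre>\n<p>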
At the moment the <code>Format-Table<\/code> cmdlet receives object <code>$a<\/code>, it is supposed to process it immediately by writing it to the console and release it so that it can process the next item. The issue is that the <code>Format-Table<\/code> cmdlet is unaware of the next object <code>$b<\/code> because it hasn&#8217;t entered the pipeline yet. The initial output, based on <code>$a<\/code>, has already been written to the console. In other words, a cmdlet written for one-at-a-time processing bases its output on the first object received from the pipeline. This also implies that if you change the order of the items in the pipeline (for example, <code>$a, $b | Sort-Object | Format-Table<\/code>) a different set of properties might appear.<\/p>\n<h3>Processing blocks<\/h3>\n<p>As you might have noticed, some actions, like outputting a header, are only required once. As in the analogy with the assembly line, heating up a soldering gun is only required once, when the pipeline is started. Cleaning up the station is only required when the pipeline is completed. Similarly, time-consuming or expensive actions, such as opening and closing a file, could be required for a cmdlet. These actions are respectively defined in the <code>Begin<\/code> and <code>End<\/code> blocks of a cmdlet. The actual processing of items is defined in the <code>Process<\/code> block of the cmdlet. 
A well defined pipeline PowerShell cmdlet might look like this:<\/p>\n<pre><code class=\"language-PowerShell\">function MyCmdlet {\n    [CmdletBinding()] param(\n        [Parameter(ValueFromPipeline = $True)] [String] $InputString\n    )\n    Begin {\n        $Stream = [System.IO.StreamWriter]::new(\"$($Env:Temp)\\My.Log\")\n    }\n    Process {\n        $Stream.WriteLine($InputString)\n    }\n    End {\n        $Stream.Close()\n    }\n}<\/code><\/pre>\n<p>When running this example cmdlet in a pipeline like <code>1..9 | MyCmdlet<\/code>, the log file is <em>only opened once<\/em> at the start, then each item in the pipeline is processed one-at-a-time, and the log file is closed (<em>once<\/em>) at the end. Note that when there are no <code>Begin<\/code>, <code>Process<\/code> and <code>End<\/code> processing blocks defined in a function, the content of the function is assigned to the <code>End<\/code> block. See also: <a href=\"https:\/\/learn.microsoft.com\/powershell\/module\/microsoft.powershell.core\/about\/about_functions_advanced_methods\">about Functions Advanced Methods<\/a>.<\/p>\n<p>A similar pipeline can be created with the common <a href=\"https:\/\/learn.microsoft.com\/powershell\/module\/microsoft.powershell.core\/foreach-object\"><code>Foreach-Object<\/code><\/a> cmdlet using the <code>-Begin<\/code>, <code>-Process<\/code> and <code>-End<\/code> parameters to define the corresponding process blocks:<\/p>\n<pre><code class=\"language-PowerShell\">1..9 | Foreach-Object -Begin {\n    $Stream = [System.IO.StreamWriter]::new(\"$($Env:Temp)\\My.Log\")\n} -Process {\n    $Stream.WriteLine($_)\n} -End {\n    $Stream.Close()\n}<\/code><\/pre>\n<h3>Performance<\/h3>\n<p>With this understanding of the pipeline, you can see why you shouldn&#8217;t wrap a cmdlet pipeline inside another pipeline, like:<\/p>\n<pre><code class=\"language-PowerShell\">1..9 | ForEach-Object {\n    $_ | MyCmdlet\n}<\/code><\/pre>\n<p>Wrapping a cmdlet pipeline into another 
(<code>ForEach-Object<\/code>) pipeline is very expensive because you&#8217;re also invoking the <code>Begin<\/code> and <code>End<\/code> blocks of <code>MyCmdlet<\/code> for each item. This will open and close the log file for each item instead of only once at the beginning and the end of the pipeline. The performance degradation can happen with any cmdlet that takes pipeline input. See also <a href=\"https:\/\/learn.microsoft.com\/powershell\/scripting\/dev-cross-plat\/performance\/script-authoring-considerations#avoid-wrapping-cmdlet-pipelines\">PowerShell scripting performance considerations<\/a>.<\/p>\n<h2>The steppable pipeline<\/h2>\n<p>Unfortunately, it is not always possible to create a single syntactic pipeline. For example, you might need different branches for different parameter values or output paths. Consider a very large <code>csv<\/code> file that you want to cut into smaller files. The obvious approach is to split it into files with a maximum number of lines:<\/p>\n<pre><code class=\"language-PowerShell\">$BatchSize = 10000\nImport-Csv .\\MyLarge.csv |\n    ForEach-Object -Begin {\n        $Index = 0\n    } -Process {\n        $BatchNr = [math]::Floor($Index++\/$BatchSize)\n        $_ | Export-Csv -Append .\\Batch$BatchNr.csv\n    }<\/code><\/pre>\n<p>But as stated before, this will open and close each output file (<code>.\\Batch$BatchNr.csv<\/code>) 10,000 times where it only needs to be opened and closed once per output file. 
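A steppable pipeline is created by calling the <code>GetSteppablePipeline()<\/code> method on a script block; it lets you invoke the <code>Begin<\/code>, <code>Process<\/code> and <code>End<\/code> blocks of the contained command yourself. A minimal sketch, reusing the <code>MyCmdlet<\/code> function defined earlier:<\/p>\n<pre><code class=\"language-PowerShell\">$Pipeline = { MyCmdlet }.GetSteppablePipeline()\n$Pipeline.Begin($True) # invokes the Begin block once (opens the log file)\n1..9 | Foreach-Object { $Pipeline.Process($_) } # invokes the Process block per item\n$Pipeline.End()        # invokes the End block once (closes the log file)<\/code><\/pre>\n<p>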
So, the solution here is to use a steppable pipeline, which lets you independently define the processing blocks for the required output stream:<\/p>\n<pre><code class=\"language-PowerShell\">$BatchSize = 10000\nImport-Csv .\\MyLarge.csv |\n    ForEach-Object -Begin {\n        $Index = 0\n    } -Process {\n        if ($Index % $BatchSize -eq 0) {\n            $BatchNr = [math]::Floor($Index\/$BatchSize)\n            $Pipeline = { Export-Csv -notype -Path .\\Batch$BatchNr.csv }.GetSteppablePipeline()\n            $Pipeline.Begin($True)\n        }\n        $Pipeline.Process($_)\n        if (++$Index % $BatchSize -eq 0) { $Pipeline.End() }\n    } -End {\n        $Pipeline.End()\n    }<\/code><\/pre>\n<p>Every 10,000 (<code>$BatchSize<\/code>) entries, the modulus (<code>%<\/code>) is zero and a new pipeline is created for the expression <code>{ Export-Csv -notype -Path .\\Batch$BatchNr.csv }<\/code>.<\/p>\n<ul>\n<li>The <code>$Pipeline.Begin($True)<\/code> invokes the <code>Begin<\/code> block of the steppable pipeline, which opens a new <code>csv<\/code> file named <code>.\\Batch$BatchNr.csv<\/code> and writes the headers to the file.<\/li>\n<li>The <code>$Pipeline.Process($_)<\/code> invokes the <code>Process<\/code> block of the steppable pipeline using the current item (<code>$_<\/code>), which is appended to the <code>csv<\/code> file.<\/li>\n<li>The <code>$Pipeline.End()<\/code> invokes the <code>End<\/code> block of the steppable pipeline, which closes the <code>csv<\/code> file named <code>.\\Batch$BatchNr.csv<\/code>. 
This file holds a total of 10,000 entries.<\/li>\n<\/ul>\n<p>(Note that it is important to end the pipeline but there is no harm in invoking <code>$Pipeline.End()<\/code> multiple times.)<\/p>\n<p>It is a little more code, but if you measure the results you will see that in this situation the latter script is more than 50 times faster than the one with the wrapped cmdlet pipeline.<\/p>\n<h3>Multiple output pipelines<\/h3>\n<p>With the steppable pipeline technique, you might even have multiple output pipelines open at once. Consider that for the very large <code>csv<\/code> file in the previous example, you do not want batches of 10,000 entries but want to divide the entries over 26 files based on the first letter of the <code>LastName<\/code> property:<\/p>\n<pre><code class=\"language-PowerShell\">$Pipeline = @{}\nImport-Csv .\\MyLarge.csv |\n    ForEach-Object -Process {\n        $Letter = $_.LastName[0].ToString().ToUpper()\n        if (!$Pipeline.Contains($Letter)) {\n            $Pipeline[$Letter] = { Export-CSV -notype -Path .\\$Letter.csv }.GetSteppablePipeline()\n            $Pipeline[$Letter].Begin($True)\n        }\n        $Pipeline[$Letter].Process($_)\n    } -End {\n        foreach ($Key in $Pipeline.Keys) { $Pipeline[$Key].End() }\n    }<\/code><\/pre>\n<p><strong>Explanation:<\/strong><\/p>\n<ul>\n<li>\n<p><code>Import-Csv .\\MyLarge.csv | ForEach-Object -Process {<\/code><\/p>\n<p>processes each item of the <code>csv<\/code> file (one-at-a-time)<\/p>\n<\/li>\n<li>\n<p><code>$Letter = $_.LastName[0].ToString().ToUpper()<\/code><\/p>\n<p>Takes the first character of the <code>LastName<\/code> property and converts it to upper case.<\/p>\n<\/li>\n<li>\n<p><code>if (!$Pipeline.Contains($Letter)) {<\/code><\/p>\n<p>If the pipeline for the specific character doesn&#8217;t yet exist:<\/p>\n<ul>\n<li>Open a new steppable pipeline for the specific letter: <code>{ Export-CSV -notype -Path .\\$Letter.csv }.GetSteppablePipeline()<\/code><\/li>\n<li>And invoke the 
<code>Begin<\/code> block: <code>.Begin($True)<\/code>, which creates a new <code>csv<\/code> file with the corresponding headers<\/li>\n<\/ul>\n<\/li>\n<li>\n<p><code>foreach ($Key in $Pipeline.Keys) { $Pipeline[$Key].End() }<\/code><\/p>\n<p>Closes all the existing steppable pipelines (that is, the <code>csv<\/code> files)<\/p>\n<\/li>\n<\/ul>\n<h3>See also<\/h3>\n<ul>\n<li><a href=\"https:\/\/learn.microsoft.com\/powershell\/module\/microsoft.powershell.core\/about\/about_pipelines\">About pipelines<\/a><\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/powershell\/scripting\/developer\/cmdlet\/cmdlet-overview\">Cmdlet Overview<\/a><\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/powershell\/scripting\/developer\/cmdlet\/strongly-encouraged-development-guidelines\">Strongly Encouraged Development Guidelines<\/a><\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/powershell\/module\/microsoft.powershell.core\/about\/about_functions_advanced_methods\">about Functions Advanced Methods<\/a><\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/powershell\/scripting\/dev-cross-plat\/performance\/script-authoring-considerations\">PowerShell scripting performance considerations<\/a><\/li>\n<\/ul>\n<p><!-- link references --><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The PowerShell pipeline explained from the beginning to the end.<\/p>\n","protected":false},"author":109964,"featured_media":77,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[13],"tags":[80,3,81],"class_list":["post-861","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-powershell","tag-pipeline","tag-powershell","tag-steppable"],"acf":[],"blog_post_summary":"<p>The PowerShell pipeline explained from the beginning to the 
end.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/powershell-community\/wp-json\/wp\/v2\/posts\/861","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/powershell-community\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/powershell-community\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell-community\/wp-json\/wp\/v2\/users\/109964"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell-community\/wp-json\/wp\/v2\/comments?post=861"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/powershell-community\/wp-json\/wp\/v2\/posts\/861\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell-community\/wp-json\/wp\/v2\/media\/77"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/powershell-community\/wp-json\/wp\/v2\/media?parent=861"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell-community\/wp-json\/wp\/v2\/categories?post=861"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell-community\/wp-json\/wp\/v2\/tags?post=861"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}