{"id":17812,"date":"2008-12-24T07:25:30","date_gmt":"2008-12-24T15:25:30","guid":{"rendered":"http:\/\/devblogs.microsoft.com\/powershell\/?p=17812"},"modified":"2019-06-07T07:27:03","modified_gmt":"2019-06-07T15:27:03","slug":"get-markuptag","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/powershell\/get-markuptag\/","title":{"rendered":"Get-MarkupTag"},"content":{"rendered":"<p class=\"PostHeader\">On my personal blog (<a href=\"http:\/\/blogs.msdn.com\/mediaandmicrocode\/archive\/2008\/12\/08\/microcode-powershell-scripting-tricks-scripting-the-web-part-2-get-markuptag.aspx\">Media And Microcode<\/a>), I&#8217;ve been posting a series called &quot;Scripting the Web&quot;, which introduced a function called Get-MarkupTag. Get-MarkupTag is a very handy little function that coerces individual tag elements of a web page into XML, so you can scrape data from a webpage.<\/p>\n<p class=\"PostHeader\">I&#8217;ve updated Get-MarkupTag a tiny bit for CTP3, marking the tag parameter as ValueFromPipeline so I can pipe in several tag names and get multiple tag types from a single document. I&#8217;m posting it again here so that a wider audience can make use of it, and so that I can use it in some later blog posts. 
Since I&#8217;ve already got inline help for the function, I simply used <a href=\"http:\/\/blogs.msdn.com\/powershell\/archive\/2008\/12\/24\/write-commandblogpost.aspx\">Write-CommandBlogPost<\/a> to output its documentation and post it again here.<\/p>\n<p class=\"PostHeader\">Enjoy!<\/p>\n<p>  <\/p>\n<p class=\"CmdletSynopsis\"><b>Synopsis:<\/b>     <\/p>\n<blockquote><p>Extracts out a markup language (HTML or XML) tag from within a document <\/p><\/blockquote>\n<p>  <\/p>\n<p class=\"CmdletSyntax\"><b>Syntax:<\/b>     <\/p>\n<blockquote><p>Get-MarkupTag [[-tag] [&lt;Object&gt;]] [[-html] [&lt;String&gt;]] [-Verbose] [-Debug] [-ErrorAction [&lt;ActionPreference&gt;]] [-WarningAction [&lt;ActionPreference&gt;]] [-ErrorVariable [&lt;String&gt;]] [-WarningVariable [&lt;String&gt;]] [-OutVariable [&lt;String&gt;]] [-OutBuffer [&lt;Int32&gt;]] [&lt;CommonParameters&gt;] <\/p><\/blockquote>\n<p>  <\/p>\n<p class=\"CmdletDescription\"><b>Detailed Description:<\/b>     <\/p>\n<blockquote><p>Extracts out a markup language (HTML or XML) tag from within a document.    
<br \/>Returns the tag, the text within the tag, and, if possible, the tag converted     <br \/>to XML <\/p><\/blockquote>\n<p>  <\/p>\n<p class=\"Examples\">Examples: <\/p>\n<blockquote>\n<pre class=\"CmdletExample\">    -------------------------- EXAMPLE 1 --------------------------\r\n\r\n\r\n\r\n\r\n\r\n# Download the Microsoft front page and extract out links and div tags\r\n$microsoft = (New-Object Net.Webclient).DownloadString(&quot;http:\/\/www.microsoft.com\/&quot;)\r\n&quot;a&quot;, &quot;div&quot; | Get-MarkupTag -html $microsoft\r\n    <\/pre>\n<\/blockquote>\n<blockquote>\n<pre class=\"CmdletExample\">    -------------------------- EXAMPLE 2 --------------------------\r\n\r\n\r\n\r\n\r\n\r\n# Extract the rows from ConvertTo-HTML\r\n$text = Get-ChildItem | Select Name, LastWriteTime | ConvertTo-HTML | Out-String \r\nGet-MarkupTag &quot;tr&quot; $text\r\n    <\/pre>\n<\/blockquote>\n<p><\/p>\n<p class=\"CmdletParameters\">Command Parameters: <\/p>\n<blockquote>\n<table>\n<colgroup>\n<col \/>\n<col \/><\/colgroup>\n<tbody>\n<tr>\n<th>Name<\/th>\n<th>Description<\/th>\n<\/tr>\n<tr>\n<td>tag<\/td>\n<td>The tag to extract, e.g. &quot;a&quot;, &quot;div&quot;<\/td>\n<\/tr>\n<tr>\n<td>html<\/td>\n<td>The text to extract the tag from<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/blockquote>\n<p><\/p>\n<p>Here&#8217;s Get-MarkupTag: <i><\/i><\/p>\n<blockquote>\n<pre class=\"CmdletDefinition\">function Get-MarkupTag {\r\n           \r\n    #.Synopsis\r\n    #   Extracts out a markup language (HTML or XML) tag from within a document\r\n    #.Description\r\n    #   Extracts out a markup language (HTML or XML) tag from within a document.\r\n    #   Returns the tag, the text within the tag, and, if possible, the tag converted\r\n    #   to XML\r\n    #.Parameter tag \r\n    #   The tag to extract, e.g. 
&quot;a&quot;, &quot;div&quot;\r\n    #.Parameter html\r\n    #   The text to extract the tag from\r\n    #.Example\r\n    #   # Download the Microsoft front page and extract out links and div tags\r\n    #   $microsoft = (New-Object Net.Webclient).DownloadString(&quot;http:\/\/www.microsoft.com\/&quot;)\r\n    #   &quot;a&quot;, &quot;div&quot; | Get-MarkupTag -html $microsoft\r\n    #.Example\r\n    #   # Extract the rows from ConvertTo-HTML\r\n    #   $text = Get-ChildItem | Select Name, LastWriteTime | ConvertTo-HTML | Out-String \r\n    #   Get-MarkupTag &quot;tr&quot; $text\r\n    param(\r\n        [Parameter(ValueFromPipeline=$true,Position=0)]$tag,\r\n        [Parameter(Position=1)][string]$html)\r\nbegin {\r\n    \r\n        $replacements = @{\r\n            &quot;&lt;BR&gt;&quot; = &quot;&lt;BR \/&gt;&quot;\r\n            &quot;&lt;HR&gt;&quot; = &quot;&lt;HR \/&gt;&quot;\r\n            &quot;&amp;nbsp;&quot; = &quot; &quot;\r\n            '&amp;macr;'='&#175;'\r\n            '&amp;ETH;'='&#208;'\r\n            '&amp;para;'='&#182;'\r\n            '&amp;yen;'='&#165;'\r\n            '&amp;ordm;'='&#186;'\r\n            '&amp;sup1;'='&#185;'\r\n            '&amp;ordf;'='&#170;'\r\n            '&amp;shy;'='&#173;'\r\n            '&amp;sup2;'='&#178;'\r\n            '&amp;Ccedil;'='&#199;'\r\n            '&amp;Icirc;'='&#206;'\r\n            '&amp;curren;'='&#164;'\r\n            '&amp;frac12;'='&#189;'\r\n            '&amp;sect;'='&#167;'\r\n            '&amp;Acirc;'='&#194;'\r\n            '&amp;Ucirc;'='&#219;'\r\n            '&amp;plusmn;'='&#177;'\r\n            '&amp;reg;'='&#174;'\r\n            '&amp;acute;'='&#180;'\r\n            '&amp;Otilde;'='&#213;'\r\n            '&amp;brvbar;'='&#166;'\r\n            '&amp;pound;'='&#163;'\r\n            '&amp;Iacute;'='&#205;'\r\n            '&amp;middot;'='&#183;'\r\n            '&amp;Ocirc;'='&#212;'\r\n            '&amp;frac14;'='&#188;'\r\n            '&amp;uml;'='&#168;'\r\n            
'&amp;Oacute;'='&#211;'\r\n            '&amp;deg;'='&#176;'\r\n            '&amp;Yacute;'='&#221;'\r\n            '&amp;Agrave;'='&#192;'\r\n            '&amp;Ouml;'='&#214;'\r\n            '&amp;quot;'='&quot;'\r\n            '&amp;Atilde;'='&#195;'\r\n            '&amp;THORN;'='&#222;'\r\n            '&amp;frac34;'='&#190;'\r\n            '&amp;iquest;'='&#191;'\r\n            '&amp;times;'='&#215;'\r\n            '&amp;Oslash;'='&#216;'\r\n            '&amp;divide;'='&#247;'\r\n            '&amp;iexcl;'='&#161;'\r\n            '&amp;sup3;'='&#179;'\r\n            '&amp;Iuml;'='&#207;'\r\n            '&amp;cent;'='&#162;'\r\n            '&amp;copy;'='&#169;'\r\n            '&amp;Auml;'='&#196;'\r\n            '&amp;Ograve;'='&#210;'\r\n            '&amp;Aring;'='&#197;'\r\n            '&amp;Egrave;'='&#200;'\r\n            '&amp;Uuml;'='&#220;'\r\n            '&amp;Aacute;'='&#193;'\r\n            '&amp;Igrave;'='&#204;'\r\n            '&amp;Ntilde;'='&#209;'\r\n            '&amp;Ecirc;'='&#202;'\r\n            '&amp;cedil;'='&#184;'\r\n            '&amp;Ugrave;'='&#217;'\r\n            '&amp;szlig;'='&#223;'\r\n            '&amp;raquo;'='&#187;'\r\n            '&amp;euml;'='&#235;'\r\n            '&amp;Eacute;'='&#201;'\r\n            '&amp;micro;'='&#181;'\r\n            '&amp;not;'='&#172;'\r\n            '&amp;Uacute;'='&#218;'\r\n            '&amp;AElig;'='&#198;'\r\n            '&amp;euro;'= &quot;&#8364;&quot;        \r\n        }\r\n    \r\n}\r\nprocess {\r\n\r\n        foreach ($r in $replacements.GetEnumerator()) {\r\n            $l = 0 \r\n            do {\r\n                $l = $html.IndexOf($r.Key, $l, [StringComparison]&quot;CurrentCultureIgnoreCase&quot;)\r\n                if ($l -ne -1) {\r\n                    $html = $html.Remove($l, $r.Key.Length)\r\n                    $html = $html.Insert($l, $r.Value)\r\n                }\r\n            } while ($l -ne -1)         \r\n        }\r\n     \r\n        $r = New-Object 
Text.RegularExpressions.Regex ('&lt;\/' + $tag + '&gt;'), (&quot;Singleline&quot;, &quot;IgnoreCase&quot;)\r\n        $endTags = @($r.Matches($html))\r\n        $r = New-Object Text.RegularExpressions.Regex ('&lt;' + $tag + '[^&gt;]*&gt;'), (&quot;Singleline&quot;, &quot;IgnoreCase&quot;)\r\n        $startTags = @($r.Matches($html))\r\n        $tagText = @()\r\n        if ($startTags.Count -eq $endTags.Count) {\r\n            $allTags = $startTags + $endTags | Sort-Object Index   \r\n            $startTags = New-Object Collections.Stack\r\n            foreach($t in $allTags) {\r\n                if (-not $t) { continue } \r\n                if ($t.Value -like &quot;&lt;$tag*&quot;) {\r\n                    $startTags.Push($t)\r\n                } else {\r\n                    $start = $startTags.Pop()\r\n                    $tagText+=($html.Substring($start.Index, $t.Index + $t.Length - $start.Index))\r\n                }\r\n            }\r\n        } else {\r\n            # Unbalanced document, use start tags only and make sure that the tag is self-enclosed\r\n            $startTags | Foreach-Object {\r\n                $t = &quot;$($_.Value)&quot;\r\n                if ($t -notlike &quot;*\/&gt;&quot;) {\r\n                    $t = $t.Insert($t.Length - 1, &quot;\/&quot;)\r\n                }\r\n                $tagText+=$t\r\n            } \r\n        }\r\n        foreach ($t in $tagText) {\r\n            if (-not $t) {continue }\r\n            # Correct HTML which doesn't quote the attributes so it can be coerced into XML\r\n            $inTag = $false\r\n            for ($i = 0; $i -lt $t.Length; $i++) {\r\n                if ($t[$i] -eq &quot;&lt;&quot;) {\r\n                    $inTag = $true\r\n                } else {\r\n                    if ($t[$i] -eq &quot;&gt;&quot;) {\r\n                        $inTag = $false\r\n                    }\r\n                }\r\n                if ($inTag -and ($t[$i] -eq &quot;=&quot;)) {\r\n                    if 
($t[$i + 1] -notmatch '[''&quot;]') {\r\n                        $endQuoteSpot = $t.IndexOfAny(&quot; &gt;&quot;, $i + 1)\r\n                        # Find the end of the attribute, then quote\r\n                        $t = $t.Insert($i + 1, &quot;'&quot;)\r\n                        $t = $t.Insert($endQuoteSpot + 1, &quot;'&quot;)                    \r\n                        $i = $endQuoteSpot\r\n                    } else {\r\n                        # Make sure the quotes are correctly formatted, otherwise,\r\n                        # end the quotes manually\r\n                        $whichQuote = &quot;$($Matches.Values)&quot;\r\n                        $endQuoteSpot = $t.IndexOf($whichQuote, $i + 2)\r\n                        $i = $endQuoteSpot\r\n                    }\r\n                }\r\n            }        \r\n            $t | Select-Object @{\r\n                Name='Tag'\r\n                Expression={$_}\r\n            }, @{\r\n                Name='Xml'\r\n                Expression= {\r\n                    ([xml]$t).$tag      \r\n                    trap {\r\n                        Write-Verbose ($_ | Out-String) \r\n                        continue\r\n                    }\r\n                }\r\n            }    \r\n        }\r\n    \r\n}\r\n\r\n}\r\n    <\/pre>\n<\/blockquote>\n<\/p>\n<p><\/p>\n<p class=\"Postfooter\">Hope this helps,<\/p>\n<p class=\"Postfooter\">James Brundage [MSFT]<\/p>\n<p style=\"font-size: xx-small\">Automatically generated with <a href=\"http:\/\/blogs.msdn.com\/powershell\/archive\/tags\/Write-CommandBlogPost\/default.aspx\">Write-CommandBlogPost<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>On my personal blog (Media And Microcode), I&#8217;ve been posting a series called &quot;Scripting the Web&quot;, which introduced a function called Get-MarkupTag. Get-MarkupTag is a very handy little function that coerces individual tag elements of a web page into XML, so you can scrape data from a webpage. 
I&#8217;ve updated Get-MarkupTag a tiny bit for [&hellip;]<\/p>\n","protected":false},"author":600,"featured_media":13641,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[97,137,175],"class_list":["post-17812","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-powershell","tag-advanced-functions","tag-ctp3","tag-get-markuptag"],"acf":[],"blog_post_summary":"<p>On my personal blog (Media And Microcode), I&#8217;ve been posting a series called &quot;Scripting the Web&quot;, which introduced a function called Get-MarkupTag. Get-MarkupTag is a very handy little function that coerces individual tag elements of a web page into XML, so you can scrape data from a webpage. I&#8217;ve updated Get-MarkupTag a tiny bit for [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/posts\/17812","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/users\/600"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/comments?post=17812"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/posts\/17812\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/media\/13641"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/media?parent=17812"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/categories?post=17812"},{"taxonomy":"post_ta
g","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/tags?post=17812"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}