On my personal blog (Media And Microcode), I’ve been posting a series called "Scripting the Web", which introduced a function called Get-MarkupTag. Get-MarkupTag is a handy little function that extracts individual tag elements from a web page and coerces them into XML, so you can scrape data from the page.
I’ve updated Get-MarkupTag a tiny bit for CTP3, marking the tag parameter as ValueFromPipeline so I can get multiple tag types from a single document. I’m posting it again here so that a wider audience can make use of it, and so that I can use it in some later blog posts. Since I’ve already got inline help for the function, I simply used Write-CommandBlogPost to output its documentation and post it again here.
Enjoy!
Synopsis:
Extracts out a markup language (HTML or XML) tag from within a document
Syntax:
Get-MarkupTag [[-tag] [<Object>]] [[-html] [<String>]] [-Verbose] [-Debug] [-ErrorAction [<ActionPreference>]] [-WarningAction [<ActionPreference>]] [-ErrorVariable [<String>]] [-WarningVariable [<String>]] [-OutVariable [<String>]] [-OutBuffer [<Int32>]] [<CommonParameters>]
Detailed Description:
Extracts out a markup language (HTML or XML) tag from within a document.
Returns the tag, the text within the tag, and, if possible, the tag converted
to XML
Examples:
-------------------------- EXAMPLE 1 --------------------------
# Download the Microsoft front page and extract out links and div tags
$microsoft = (New-Object Net.Webclient).DownloadString("http://www.microsoft.com/")
"a", "div" | Get-MarkupTag -html $microsoft

-------------------------- EXAMPLE 2 --------------------------
# Extract the rows from ConvertTo-HTML
$text = Get-ChildItem | Select Name, LastWriteTime | ConvertTo-HTML | Out-String
Get-MarkupTag "tr" $text
Command Parameters:
Name    Description
tag     The tag to extract, e.g. "a", "div"
html    The text to extract the tag from
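To make the description above concrete (a tag returned both as text and as XML), here is a minimal sketch of the underlying idea. It is not Get-MarkupTag itself: the sample HTML, the regular expression, and the Href and Text property names are assumptions made up for illustration, and it only handles well-formed, non-nested tags.

```powershell
# Hypothetical sample document (not from the original post)
$html = '<html><body><a href="http://example.com/one">One</a>' +
        '<a href="http://example.com/two">Two</a></body></html>'

# Match each complete <a ...>...</a> element; .*? keeps the match non-greedy
$pattern = '<a\b[^>]*>.*?</a>'
$found = [regex]::Matches($html, $pattern, 'Singleline, IgnoreCase')

$links = foreach ($m in $found) {
    # Casting to [xml] only works because the sample tags are well-formed
    $xml = [xml]($m.Value)
    New-Object PSObject -Property @{
        Tag  = $m.Value           # the raw tag text
        Href = $xml.a.href        # an attribute, read through the XML adapter
        Text = $xml.a.InnerText   # the text inside the tag
    }
}

$links
```

Once a tag has been coerced into XML, its attributes can be read as ordinary properties, which is what makes this approach convenient for scraping.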
Here’s Get-MarkupTag:
function Get-MarkupTag {
    #.Synopsis
    #    Extracts out a markup language (HTML or XML) tag from within a document
    #.Description
    #    Extracts out a markup language (HTML or XML) tag from within a document.
    #    Returns the tag, the text within the tag, and, if possible, the tag converted
    #    to XML
    #.Parameter tag
    #    The tag to extract, e.g. "a", "div"
    #.Parameter html
    #    The text to extract the tag from
    #.Example
    #    # Download the Microsoft front page and extract out links and div tags
    #    $microsoft = (New-Object Net.Webclient).DownloadString("http://www.microsoft.com/")
    #    "a", "div" | Get-MarkupTag -html $microsoft
    #.Example
    #    # Extract the rows from ConvertTo-HTML
    #    $text = Get-ChildItem | Select Name, LastWriteTime | ConvertTo-HTML | Out-String
    #    Get-MarkupTag "tr" $text
    param(
        [Parameter(ValueFromPipeline=$true,Position=0)]
        $tag,
        [Parameter(Position=1)]
        [string]$html
    )
    begin {
        # Tags and HTML entities that are not valid XML, mapped to XML-safe text
        $replacements = @{
            "<BR>" = "<BR />"
            "<HR>" = "<HR />"
            "&nbsp;" = " "
            '&macr;'='¯'; '&ETH;'='Ð'; '&para;'='¶'; '&yen;'='¥'
            '&ordm;'='º'; '&sup1;'='¹'; '&ordf;'='ª'; '&shy;'=''
            '&sup2;'='²'; '&Ccedil;'='Ç'; '&Icirc;'='Î'; '&curren;'='¤'
            '&frac12;'='½'; '&sect;'='§'; '&Acirc;'='â'; '&Ucirc;'='Û'
            '&plusmn;'='±'; '&reg;'='®'; '&acute;'='´'; '&Otilde;'='Õ'
            '&brvbar;'='¦'; '&pound;'='£'; '&Iacute;'='Í'; '&middot;'='·'
            '&Ocirc;'='Ô'; '&frac14;'='¼'; '&uml;'='¨'; '&Oacute;'='Ó'
            '&deg;'='°'; '&Yacute;'='Ý'; '&Agrave;'='À'; '&Ouml;'='Ö'
            '&quot;'='"'; '&Atilde;'='Ã'; '&THORN;'='Þ'; '&frac34;'='¾'
            '&iquest;'='¿'; '&times;'='×'; '&Oslash;'='Ø'; '&divide;'='÷'
            '&iexcl;'='¡'; '&sup3;'='³'; '&Iuml;'='Ï'; '&cent;'='¢'
            '&copy;'='©'; '&Auml;'='Ä'; '&Ograve;'='Ò'; '&Aring;'='Å'
            '&Egrave;'='È'; '&Uuml;'='Ü'; '&Aacute;'='Á'; '&Igrave;'='Ì'
            '&Ntilde;'='Ñ'; '&Ecirc;'='Ê'; '&cedil;'='¸'; '&Ugrave;'='Ù'
            '&szlig;'='ß'; '&raquo;'='»'; '&euml;'='ë'; '&Eacute;'='É'
            '&micro;'='µ'; '&not;'='¬'; '&Uacute;'='Ú'; '&AElig;'='Æ'
            '&euro;'= "€"
        }
    }
    process {
        # Replace every occurrence of each key, ignoring case
        foreach ($r in $replacements.GetEnumerator()) {
            $l = 0
            do {
                $l = $html.IndexOf($r.Key, $l, [StringComparison]"CurrentCultureIgnoreCase")
                if ($l -ne -1) {
                    $html = $html.Remove($l, $r.Key.Length)
                    $html = $html.Insert($l, $r.Value)
                }
            } while ($l -ne -1)
        }
        # Find every end tag and every start tag for the requested tag name
        $r = New-Object Text.RegularExpressions.Regex ('</' + $tag + '>'), ("Singleline", "IgnoreCase")
        $endTags = @($r.Matches($html))
        $r = New-Object Text.RegularExpressions.Regex ('<' + $tag + '[^>]*>'), ("Singleline", "IgnoreCase")
        $startTags = @($r.Matches($html))
        $tagText = @()
        if ($startTags.Count -eq $endTags.Count) {
            # Balanced document: pair each end tag with the most recent unmatched start tag
            $allTags = $startTags + $endTags | Sort-Object Index
            $startTags = New-Object Collections.Stack
            foreach($t in $allTags) {
                if (-not $t) { continue }
                if ($t.Value -like "<$tag*") {
                    $startTags.Push($t)
                } else {
                    $start = $startTags.Pop()
                    $tagText+=($html.Substring($start.Index, $t.Index + $t.Length - $start.Index))
                }
            }
        } else {
            # Unbalanced document, use start tags only and make sure that the tag is self-enclosed
            $startTags | Foreach-Object {
                $t = "$($_.Value)"
                if ($t -notlike "*/>") {
                    $t = $t.Insert($t.Length - 1, "/")
                }
                $tagText+=$t
            }
        }
        foreach ($t in $tagText) {
            if (-not $t) {continue }
            # Correct HTML which doesn't quote the attributes so it can be coerced into XML
            $inTag = $false
            for ($i = 0; $i -lt $t.Length; $i++) {
                if ($t[$i] -eq "<") {
                    $inTag = $true
                } else {
                    if ($t[$i] -eq ">") {
                        $inTag = $false
                    }
                }
                if ($inTag -and ($t[$i] -eq "=")) {
                    if ($t[$i + 1] -notmatch '[''|"]') {
                        $endQuoteSpot = $t.IndexOfAny(" >", $i + 1)
                        # Find the end of the attribute, then quote
                        $t = $t.Insert($i + 1, "'")
                        $t = $t.Insert($endQuoteSpot + 1, "'")
                        $i = $endQuoteSpot
                    } else {
                        # Make sure the quotes are correctly formatted, otherwise,
                        # end the quotes manually
                        $whichQuote = "$($Matches.Values)"
                        $endQuoteSpot = $t.IndexOf($whichQuote, $i + 2)
                        $i = $endQuoteSpot
                    }
                }
            }
            $t | Select-Object @{
                Name='Tag'
                Expression={$_}
            }, @{
                Name='Xml'
                Expression= {
                    ([xml]$t).$tag
                    trap {
                        Write-Verbose ($_ | Out-String)
                        continue
                    }
                }
            }
        }
    }
}
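The part of the function that pairs start tags with end tags is easier to follow in isolation. The sketch below reproduces only that stack-based pairing on a small nested document. The sample HTML and the variable names ($opens, $closes, $stack, $elements) are hypothetical, chosen for illustration, and the sketch assumes the document is balanced.

```powershell
# Hypothetical nested sample (not from the original post)
$html = '<div id="outer">a<div id="inner">b</div>c</div>'
$tag  = 'div'

# Collect start and end tags separately, then merge them in document order
$opens  = @([regex]::Matches($html, ('<' + $tag + '[^>]*>'), 'IgnoreCase'))
$closes = @([regex]::Matches($html, ('</' + $tag + '>'), 'IgnoreCase'))

$elements = @()
if ($opens.Count -eq $closes.Count) {
    $all = $opens + $closes | Sort-Object Index
    $stack = New-Object System.Collections.Stack
    foreach ($m in $all) {
        if ($m.Value -like '</*') {
            # An end tag closes the most recently opened start tag
            $start = $stack.Pop()
            $elements += $html.Substring($start.Index, $m.Index + $m.Length - $start.Index)
        } else {
            $stack.Push($m)
        }
    }
}

# Inner elements come out first, because they close first
$elements
```

Because a stack hands back the most recent start tag, nested tags are matched from the inside out, so the inner div is emitted before the outer one.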
Automatically generated with Write-CommandBlogPost