On my personal blog (Media And Microcode), I’ve been posting a series called "Scripting the Web", which introduced a function called Get-MarkupTag. Get-MarkupTag is a handy little function that coerces individual tags from a web page into XML, so you can scrape data from the page.
I’ve updated Get-MarkupTag a tiny bit for CTP3, marking the tag parameter as ValueFromPipeline so I can get multiple tag types from a single document. I’m posting it again here so that a wider audience can make use of it, and so that I can use it in some later blog posts. Since I’ve already got inline help for the function, I simply used Write-CommandBlogPost to output its documentation and post it again here.
Enjoy!
Synopsis:
Extracts out a markup language (HTML or XML) tag from within a document
Syntax:
Get-MarkupTag [[-tag] [<Object>]] [[-html] [<String>]] [-Verbose] [-Debug] [-ErrorAction [<ActionPreference>]] [-WarningAction [<ActionPreference>]] [-ErrorVariable [<String>]] [-WarningVariable [<String>]] [-OutVariable [<String>]] [-OutBuffer [<Int32>]] [<CommonParameters>]
Detailed Description:
Extracts out a markup language (HTML or XML) tag from within a document.
Returns the tag, the text within the tag, and, if possible, the tag converted
to XML
Examples:
-------------------------- EXAMPLE 1 --------------------------
# Download the Microsoft front page and extract out links and div tags
$microsoft = (New-Object Net.Webclient).DownloadString("http://www.microsoft.com/")
"a", "div" | Get-MarkupTag -html $microsoft
-------------------------- EXAMPLE 2 --------------------------
# Extract the rows from ConvertTo-HTML
$text = Get-ChildItem | Select Name, LastWriteTime | ConvertTo-HTML | Out-String
Get-MarkupTag "tr" $text
Command Parameters:
Name    Description
tag     The tag to extract, e.g. "a", "div"
html    The text to extract the tag from
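The "if possible" in the description comes down to whether a fragment is well-formed XML. As a rough, standalone illustration (the sample fragment is made up, and this sketch does not call Get-MarkupTag itself), the [xml] cast fails on an HTML-only entity and works once that entity has been replaced:

```powershell
# Hypothetical sample fragment; &nbsp; is an HTML entity that XML does not define.
$fragment = '<td class="size">10&nbsp;MB</td>'

# Casting the raw fragment throws, because the XML parser does not know &nbsp;.
$rawParsed = $true
try { $null = [xml]$fragment } catch { $rawParsed = $false }

# After swapping the entity for a plain space, the same cast works.
$clean = $fragment.Replace('&nbsp;', ' ')
$td = ([xml]$clean).td

$rawParsed      # False
$td.class       # size
$td.InnerText   # 10 MB
```

This is the reason the function carries a table of entity replacements and reports XML conversion failures only through Write-Verbose.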
Here’s Get-MarkupTag:
function Get-MarkupTag {
#.Synopsis
# Extracts out a markup language (HTML or XML) tag from within a document
#.Description
# Extracts out a markup language (HTML or XML) tag from within a document.
# Returns the tag, the text within the tag, and, if possible, the tag converted
# to XML
#.Parameter tag
# The tag to extract, e.g. "a", "div"
#.Parameter html
# The text to extract the tag from
#.Example
# # Download the Microsoft front page and extract out links and div tags
# $microsoft = (New-Object Net.Webclient).DownloadString("http://www.microsoft.com/")
# "a", "div" | Get-MarkupTag -html $microsoft
#.Example
# # Extract the rows from ConvertTo-HTML
# $text = Get-ChildItem | Select Name, LastWriteTime | ConvertTo-HTML | Out-String
# Get-MarkupTag "tr" $text
param(
[Parameter(ValueFromPipeline=$true,Position=0)]$tag,
[Parameter(Position=1)][string]$html)
begin {
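# HTML constructs that XML does not understand, mapped to XML-safe text
$replacements = @{
"<BR>" = "<BR />"
"<HR>" = "<HR />"
"&nbsp;" = " "
'&macr;'='¯'
'&ETH;'='Ð'
'&para;'='¶'
'&yen;'='¥'
'&ordm;'='º'
'&sup1;'='¹'
'&ordf;'='ª'
'&shy;'=''
'&sup2;'='²'
'&Ccedil;'='Ç'
'&Icirc;'='Î'
'&curren;'='¤'
'&frac12;'='½'
'&sect;'='§'
'&Acirc;'='Â'
'&Ucirc;'='Û'
'&plusmn;'='±'
'&reg;'='®'
'&acute;'='´'
'&Otilde;'='Õ'
'&brvbar;'='¦'
'&pound;'='£'
'&Iacute;'='Í'
'&middot;'='·'
'&Ocirc;'='Ô'
'&frac14;'='¼'
'&uml;'='¨'
'&Oacute;'='Ó'
'&deg;'='°'
'&Yacute;'='Ý'
'&Agrave;'='À'
'&Ouml;'='Ö'
'&quot;'='"'
'&Atilde;'='Ã'
'&THORN;'='Þ'
'&frac34;'='¾'
'&iquest;'='¿'
'&times;'='×'
'&Oslash;'='Ø'
'&divide;'='÷'
'&iexcl;'='¡'
'&sup3;'='³'
'&Iuml;'='Ï'
'&cent;'='¢'
'&copy;'='©'
'&Auml;'='Ä'
'&Ograve;'='Ò'
'&Aring;'='Å'
'&Egrave;'='È'
'&Uuml;'='Ü'
'&Aacute;'='Á'
'&Igrave;'='Ì'
'&Ntilde;'='Ñ'
'&Ecirc;'='Ê'
'&cedil;'='¸'
'&Ugrave;'='Ù'
'&szlig;'='ß'
'&raquo;'='»'
'&euml;'='ë'
'&Eacute;'='É'
'&micro;'='µ'
'&not;'='¬'
'&Uacute;'='Ú'
'&AElig;'='Æ'
'&euro;'= "€"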
}
}
process {
foreach ($r in $replacements.GetEnumerator()) {
$l = 0
do {
$l = $html.IndexOf($r.Key, $l, [StringComparison]"CurrentCultureIgnoreCase")
if ($l -ne -1) {
$html = $html.Remove($l, $r.Key.Length)
$html = $html.Insert($l, $r.Value)
}
} while ($l -ne -1)
}
$r = New-Object Text.RegularExpressions.Regex ('</' + $tag + '>'), ("Singleline", "IgnoreCase")
$endTags = @($r.Matches($html))
# \b stops a short tag name such as "a" from also matching longer names like <abbr> or <area>
$r = New-Object Text.RegularExpressions.Regex ('<' + $tag + '\b[^>]*>'), ("Singleline", "IgnoreCase")
$startTags = @($r.Matches($html))
$tagText = @()
if ($startTags.Count -eq $endTags.Count) {
$allTags = $startTags + $endTags | Sort-Object Index
$startTags = New-Object Collections.Stack
foreach($t in $allTags) {
if (-not $t) { continue }
if ($t.Value -like "<$tag*") {
$startTags.Push($t)
} else {
$start = $startTags.Pop()
$tagText+=($html.Substring($start.Index, $t.Index + $t.Length - $start.Index))
}
}
} else {
# Unbalanced document, use start tags only and make sure that the tag is self-enclosed
$startTags | Foreach-Object {
$t = "$($_.Value)"
if ($t -notlike "*/>") {
$t = $t.Insert($t.Length - 1, "/")
}
$tagText+=$t
}
}
foreach ($t in $tagText) {
if (-not $t) {continue }
# Correct HTML which doesn't quote the attributes so it can be coerced into XML
$inTag = $false
for ($i = 0; $i -lt $t.Length; $i++) {
if ($t[$i] -eq "<") {
$inTag = $true
} else {
if ($t[$i] -eq ">") {
$inTag = $false
}
}
if ($inTag -and ($t[$i] -eq "=")) {
if ($t[$i + 1] -notmatch '[''|"]') {
$endQuoteSpot = $t.IndexOfAny(" >", $i + 1)
# Find the end of the attribute, then quote
$t = $t.Insert($i + 1, "'")
$t = $t.Insert($endQuoteSpot + 1, "'")
$i = $endQuoteSpot
} else {
# Make sure the quotes are correctly formatted, otherwise,
# end the quotes manually
$whichQuote = "$($Matches.Values)"
$endQuoteSpot = $t.IndexOf($whichQuote, $i + 2)
$i = $endQuoteSpot
}
}
}
$t | Select-Object @{
Name='Tag'
Expression={$_}
}, @{
Name='Xml'
Expression= {
([xml]$t).$tag
trap {
Write-Verbose ($_ | Out-String)
continue
}
}
}
}
}
}
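The balanced-document branch above pairs each closing tag with the most recent unmatched opening tag by walking all matches in document order with a stack. A minimal standalone sketch of that idea, assuming a made-up sample string and illustrative variable names ($open, $close, $found) that are not part of the function:

```powershell
# Hypothetical sample with one div nested inside another.
$html = '<div id="outer">a<div id="inner">b</div>c</div>'
$tag  = 'div'

# Collect opening and closing tags separately.
$options = [Text.RegularExpressions.RegexOptions]'Singleline, IgnoreCase'
$open  = [regex]::Matches($html, ('<' + $tag + '[^>]*>'), $options)
$close = [regex]::Matches($html, ('</' + $tag + '>'), $options)

# Walk every match in document order.
$all = @($open) + @($close) | Sort-Object Index

$stack = New-Object Collections.Stack
$found = @(foreach ($m in $all) {
    if ($m.Value -like '</*') {
        # A closing tag pairs with the most recent unmatched opening tag.
        $start = $stack.Pop()
        $html.Substring($start.Index, $m.Index + $m.Length - $start.Index)
    } else {
        $stack.Push($m)
    }
})

$found
# <div id="inner">b</div>
# <div id="outer">a<div id="inner">b</div>c</div>
```

Inner elements come out before the elements that contain them, because a closing tag is what triggers the output.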
Automatically generated with Write-CommandBlogPost