{"id":76403,"date":"2016-02-02T00:01:00","date_gmt":"2016-02-02T00:01:00","guid":{"rendered":"https:\/\/blogs.technet.microsoft.com\/heyscriptingguy\/2016\/02\/02\/convert-a-web-page-into-objects-for-easy-scraping-with-powershell\/"},"modified":"2019-02-18T09:20:02","modified_gmt":"2019-02-18T16:20:02","slug":"convert-a-web-page-into-objects-for-easy-scraping-with-powershell","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/scripting\/convert-a-web-page-into-objects-for-easy-scraping-with-powershell\/","title":{"rendered":"Convert a web page into objects for easy scraping with PowerShell"},"content":{"rendered":"<p><b>Summary<\/b>: Learn how to use Windows PowerShell 5.0 to scrape a web page so that you can easily return parsable objects.<\/p>\n<p>Good morning. Ed Wilson here, and today I have a guest blog post by Doug Finke&#8230;<\/p>\n<p>When surfing the <a href=\"https:\/\/www.powershellgallery.com\/\" target=\"_blank\">PowerShell Gallery<\/a>, you&#039;ll find that each module has a web page with a version history, for example:<\/p>\n<p style=\"margin-left:30px\"><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/29\/2019\/02\/0246.p1.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/29\/2019\/02\/0246.p1.png\" alt=\"Image of list\" title=\"Image of list\" \/><\/a><\/p>\n<p>Wouldn&#039;t it be great if you could get this information at the command line? 
<a href=\"https:\/\/raw.githubusercontent.com\/dfinke\/GifCam\/master\/GetPSGalleryInfo.gif\" target=\"_blank\">Click here for a 20 second video<\/a> that shows the code to do it.<\/p>\n<p style=\"margin-left:30px\"><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/29\/2019\/02\/p2.gif\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/29\/2019\/02\/p2.gif\" border=\"0\" alt=\" \" \/><\/a><\/p>\n<h2>How to do web scraping<\/h2>\n<p>This approach will only work in Windows PowerShell&nbsp;5.0, because it uses the new <b>ConvertFrom-String<\/b> cmdlet to convert the parsed HTML text into objects.<\/p>\n<p>It&#039;s a simple approach. First, use <b>Invoke-WebRequest<\/b> to get the HTML back from the web page. Then, the <b>AllElements<\/b> property of the response returns a list of objects that you pipe to <b>Where<\/b>, keeping the elements whose class matches <b>versionTableRow<\/b>. You grab the <b>InnerText<\/b> property and pipe all of this to <b>ConvertFrom-String<\/b> using the contents of <b>$t<\/b> as the template to convert the text to objects with the property names <b>Name<\/b>, <b>Version<\/b>, <b>Downloads<\/b>, and <b>PublishDate<\/b>.<\/p>\n<p style=\"margin-left:30px\">function Get-PSGalleryInfo {<\/p>\n<p style=\"margin-left:30px\">&nbsp;&nbsp;&nbsp; param(<\/p>\n<p style=\"margin-left:30px\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; [Parameter(Mandatory=$true,ValueFromPipelineByPropertyName=$true)]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/p>\n<p style=\"margin-left:30px\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $Name<\/p>\n<p style=\"margin-left:30px\">&nbsp;&nbsp;&nbsp; )<\/p>\n<p style=\"margin-left:30px\">&nbsp;&nbsp;&nbsp; Begin {<\/p>\n<p style=\"margin-left:30px\">$t = @&quot;<\/p>\n<p style=\"margin-left:30px\">{Name*:PowerShellISE-preview} {[version]Version:5.1.0.1} (this version) {[double]Downloads:885} {[DateTime]PublishDate:Wednesday, January 27 2016}<\/p>\n<p style=\"margin-left:30px\">{Name*:ImportExcel} 1.97&nbsp; 
{Downloads:106} Monday, January 18 2016<\/p>\n<p style=\"margin-left:30px\">&quot;@<\/p>\n<p style=\"margin-left:30px\">&nbsp;&nbsp;&nbsp; }<\/p>\n<p>&nbsp;<\/p>\n<p style=\"margin-left:30px\">&nbsp;&nbsp;&nbsp; Process {<\/p>\n<p style=\"margin-left:30px\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $url =&quot;https:\/\/www.powershellgallery.com\/packages\/$Name\/&quot;<\/p>\n<p style=\"margin-left:30px\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/p>\n<p style=\"margin-left:30px\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $r=Invoke-WebRequest $url<\/p>\n<p style=\"margin-left:30px\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ($r.AllElements | Where {$_.class -match &#039;versionTableRow&#039;}).innerText |<\/p>\n<p style=\"margin-left:30px\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ConvertFrom-String -TemplateContent $t<\/p>\n<p style=\"margin-left:30px\">&nbsp;&nbsp;&nbsp; }<\/p>\n<p style=\"margin-left:30px\">}<\/p>\n<h2>How to figure out the content of class<\/h2>\n<p>Launch your browser and navigate to the <a href=\"https:\/\/www.powershellgallery.com\/packages\/ImportExcel\/\" target=\"_blank\">ImportExcel 1.98 module<\/a>. You should be able to right-click the page and find an option called <b>View page source<\/b>. When you click it, you&#039;ll get another tab in your browser, which shows you the underlying HTML. Scroll down (or use <b>Search<\/b>) for text that looks familiar in the rendered page.<\/p>\n<p>Here you can see an HTML class attribute that contains <b>versionTableRow<\/b>. For other pages you want to scrape, you need to examine the HTML to figure out what uniquely identifies what you want to extract. 
Sometimes it&#039;s as easy as this:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/29\/2019\/02\/7737.p3.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/29\/2019\/02\/7737.p3.png\" alt=\"Image of code\" title=\"Image of code\" \/><\/a><\/p>\n<p>Next, you can see the text returned with this PowerShell snippet:<\/p>\n<p style=\"margin-left:30px\">($r.AllElements | Where {$_.class -match &#039;versionTableRow&#039;}).innerText<\/p>\n<p>Use that to create the <b>TemplateContent<\/b> for <b>ConvertFrom-String<\/b>, which transforms the text to objects.<\/p>\n<p>For a great write-up on how to work with <b>ConvertFrom-String<\/b>, check out this post on the Windows PowerShell blog:&nbsp;<a href=\"http:\/\/blogs.msdn.com\/b\/powershell\/archive\/2014\/10\/31\/convertfrom-string-example-based-text-parsing.aspx\" target=\"_blank\">ConvertFrom-String: Example-based text parsing<\/a>.<\/p>\n<p>~Doug<\/p>\n<p>Thank you, Doug, for that way cool post. Join me tomorrow for more cool Windows PowerShell stuff.<\/p>\n<p>I invite you to follow me on <a href=\"http:\/\/bit.ly\/scriptingguystwitter\" target=\"_blank\">Twitter<\/a> and <a href=\"http:\/\/bit.ly\/scriptingguysfacebook\" target=\"_blank\">Facebook<\/a>. If you have any questions, send email to me at <a href=\"mailto:scripter@microsoft.com\" target=\"_blank\">scripter@microsoft.com<\/a>, or post your questions on the <a href=\"http:\/\/bit.ly\/scriptingforum\" target=\"_blank\">Official Scripting Guys Forum<\/a>. Also check out my <a href=\"https:\/\/blogs.technet.microsoft.com\/msoms\/\" target=\"_blank\">Microsoft Operations Management Suite Blog<\/a>. See you tomorrow. Until then, peace.<\/p>\n<p><b>Ed Wilson, Microsoft Scripting Guy<\/b>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Summary: Learn how to use Windows PowerShell 5.0 to scrape a web page so that you can easily return parsable objects. Good morning. 
Ed Wilson here, and today I have a guest blog post by Doug Finke&#8230; When surfing the PowerShell Gallery, you&#039;ll find that each module has a web page with a version history, [&hellip;]<\/p>\n","protected":false},"author":596,"featured_media":87096,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[78,56,600,3,167,45],"class_list":["post-76403","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-scripting","tag-doug-finke","tag-guest-blogger","tag-powershell-5-0","tag-scripting-guy","tag-using-the-internet","tag-windows-powershell"],"acf":[],"blog_post_summary":"<p>Summary: Learn how to use Windows PowerShell 5.0 to scrape a web page so that you can easily return parsable objects. Good morning. Ed Wilson here, and today I have a guest blog post by Doug Finke&#8230; When surfing the PowerShell Gallery, you&#039;ll find that each module has a web page with a version history, 
[&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/posts\/76403","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/users\/596"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/comments?post=76403"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/posts\/76403\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/media\/87096"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/media?parent=76403"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/categories?post=76403"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/tags?post=76403"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}