{"id":3113,"date":"2009-02-20T22:23:00","date_gmt":"2009-02-20T22:23:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/vbteam\/2009\/02\/20\/web-scraping-using-visual-basics-xml-support\/"},"modified":"2024-07-05T13:33:42","modified_gmt":"2024-07-05T20:33:42","slug":"web-scraping-using-visual-basics-xml-support","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/vbteam\/web-scraping-using-visual-basics-xml-support\/","title":{"rendered":"Web-scraping using Visual Basic&#039;s XML support"},"content":{"rendered":"<p>We&#8217;ve just had a <a href=\"http:\/\/blogs.msdn.com\/vbteam\/archive\/2009\/02\/16\/channel-9-interview-asp-net-mvc-using-visual-basic-xml-literals-beth-massi.aspx\">fascinating article<\/a> about using VB&#8217;s XML literals to produce web pages.<\/p>\n<p>I&#8217;ve been interested in the other side of the process: using VB&#8217;s XML support to <em>scrape<\/em> web-pages for data. Here&#8217;s the essential bit of code, which looks for something like &lt;div class=&quot;content&quot;&gt;&lt;h2&gt;Title&lt;\/h2&gt; &lt;table&gt;&lt;tr&gt;&lt;td&gt;&lt;p&gt;Data1&lt;\/p&gt;&lt;\/td&gt;&lt;\/tr&gt;&lt;\/table&gt;&lt;\/div&gt;, and extracts the title and the data.<\/p>\n<p class=\"MsoNormal\"><span>Dim<\/span><span> fn = <span>Fetch(url)<\/span><\/span><\/p>\n<p class=\"MsoNormal\"><span>Dim<\/span><span> xml = <span>XElement.Load(fn)<\/span><\/span><\/p>\n<p class=\"MsoNormal\"><span>Dim<\/span><span> yui = (<span>From<\/span> i <span>In<\/span> <span>xml...<span>&lt;<\/span>xhtml:div<span>&gt;<\/span><\/span> <span>Where<\/span> i.GetAttr(<span>&#8220;class&#8221;<\/span>) = <span>&#8220;content&#8221;<\/span>).FirstOrDefault<\/span><\/p>\n<p class=\"MsoNormal\"><span>Dim<\/span><span> title = <span>yui.<span>&lt;<\/span>xhtml:h2<span>&gt;<\/span><\/span>.Value.ToString.Replace(<span>&#8220;,&#8221;<\/span>, <span>&#8220; &#8221;<\/span>)<\/span><\/p>\n<p 
class=\"MsoNormal\"><span>Dim<\/span><span> data = <span>From<\/span> i <span>In<\/span> <span>yui...<span>&lt;<\/span>xhtml:td<span>&gt;<\/span><\/span><\/span><\/p>\n<p class=\"MsoNormal\"><span>For<\/span><span> <span>Each<\/span> td <span>In<\/span> data<\/span><\/p>\n<p class=\"MsoNormal\"><span><span>&nbsp;&nbsp;&nbsp; <\/span><span>Dim<\/span> text = <span>td.<span>&lt;<\/span>xhtml:p<span>&gt;<\/span><\/span>.FirstOrDefault<\/span><\/p>\n<p class=\"MsoNormal\"><span><span>&nbsp;&nbsp;&nbsp; <\/span>&#8230;<\/span><\/p>\n<p class=\"MsoNormal\"><span>Next<\/span><\/p>\n<p>The only tricky bit was that XElement.Load() requires well-formed XML\/XHTML, while most web-pages are written in poorly formed HTML. So my function &#8220;Fetch&#8221; retrieves the page and then turns the sloppy HTML into clean XHTML. It does this through the awesome open-source utility &#8220;<a href=\"http:\/\/tidy.sourceforge.net\/\">tidy.exe<\/a>&#8221;.<\/p>\n<p>For more information, and complete source code, see here: <a href=\"http:\/\/blogs.msdn.com\/lucian\/archive\/2009\/02\/21\/web-scraping-with-vb-s-xml-support.aspx\">http:\/\/blogs.msdn.com\/lucian\/archive\/2009\/02\/21\/web-scraping-with-vb-s-xml-support.aspx<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>We&#8217;ve just had a fascinating article about using VB&#8217;s XML literals to produce web pages. I&#8217;ve been interested in the other side of the process: using VB&#8217;s XML support to scrape web-pages for data. Here&#8217;s the essential bit of code, which looks for&nbsp;something like&nbsp;&lt;div class=&#8221;content&#8221;&gt;&lt;h2&gt;Title&lt;\/h2&gt; &lt;table&gt;&lt;tr&gt;&lt;td&gt;&lt;p&gt;Data1&lt;\/p&gt;&lt;\/td&gt;&lt;\/tr&gt;&lt;\/table&gt;&lt;\/div&gt;, and extracts the title and the data. 
Dim fn [&hellip;]<\/p>\n","protected":false},"author":260,"featured_media":8818,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[192,195],"tags":[99],"class_list":["post-3113","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-featured","category-visual-basic","tag-lucian-wischik"],"acf":[],"blog_post_summary":"<p>We&#8217;ve just had a fascinating article about using VB&#8217;s XML literals to produce web pages. I&#8217;ve been interested in the other side of the process: using VB&#8217;s XML support to scrape web-pages for data. Here&#8217;s the essential bit of code, which looks for&nbsp;something like&nbsp;&lt;div class=&#8221;content&#8221;&gt;&lt;h2&gt;Title&lt;\/h2&gt; &lt;table&gt;&lt;tr&gt;&lt;td&gt;&lt;p&gt;Data1&lt;\/p&gt;&lt;\/td&gt;&lt;\/tr&gt;&lt;\/table&gt;&lt;\/div&gt;, and extracts the title and the data. Dim fn [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/vbteam\/wp-json\/wp\/v2\/posts\/3113","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/vbteam\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/vbteam\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/vbteam\/wp-json\/wp\/v2\/users\/260"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/vbteam\/wp-json\/wp\/v2\/comments?post=3113"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/vbteam\/wp-json\/wp\/v2\/posts\/3113\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/vbteam\/wp-json\/wp\/v2\/media\/8818"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/vbteam\/wp-json\/wp\/v2\/media?parent=3113"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/vb
team\/wp-json\/wp\/v2\/categories?post=3113"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/vbteam\/wp-json\/wp\/v2\/tags?post=3113"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}