Web-scraping using Visual Basic's XML support


We’ve just had a fascinating article about using VB’s XML literals to produce web pages.

I’ve been interested in the other side of the process: using VB’s XML support to scrape web pages for data. Here’s the essential bit of code, which looks for something like <div class="content"><h2>Title</h2> <table><tr><td><p>Data1</p></td></tr></table></div>, and extracts the title and the data.

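' The xhtml prefix must be declared at the top of the file:
' Imports <xmlns:xhtml="http://www.w3.org/1999/xhtml">
Dim fn = Fetch(url)   ' Fetch downloads the page and tidies it into XHTML
Dim xml = XElement.Load(fn)
' GetAttr is not a built-in XElement member; presumably a helper from the complete source
Dim yui = (From i In xml...<xhtml:div> Where i.GetAttr("class") = "content").FirstOrDefault
Dim title = yui.<xhtml:h2>.Value.ToString.Replace(",", " ")
Dim data = From i In yui...<xhtml:td>
For Each td In data
    Dim text = td.<xhtml:p>.FirstOrDefault
    ' ... (the rest of the loop body is in the complete source)
Next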
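To try the query without downloading anything, here is a minimal self-contained sketch that runs the same axis expressions against an inline XHTML literal. The module name ScrapeSketch and the body of GetAttr are my own assumptions (a hypothetical stand-in for whatever helper the complete source defines), not the original implementation.

```vb
Imports System
Imports System.Linq
Imports System.Runtime.CompilerServices
Imports System.Xml.Linq
Imports <xmlns:xhtml="http://www.w3.org/1999/xhtml">

Module ScrapeSketch
    ' Hypothetical stand-in for the GetAttr helper used in the snippet:
    ' returns the attribute's value, or "" when the attribute is absent.
    <Extension()>
    Function GetAttr(e As XElement, name As String) As String
        Dim a = e.Attribute(name)
        Return If(a Is Nothing, "", a.Value)
    End Function

    Sub Main()
        ' Inline XHTML standing in for the tidied page.
        Dim xml = <html xmlns="http://www.w3.org/1999/xhtml">
                      <body>
                          <div class="content">
                              <h2>Title</h2>
                              <table><tr><td><p>Data1</p></td></tr></table>
                          </div>
                      </body>
                  </html>

        ' Descendant axis (...) finds every div; keep the one with class="content".
        Dim yui = (From i In xml...<xhtml:div> Where i.GetAttr("class") = "content").FirstOrDefault
        If yui Is Nothing Then Return ' no matching div on this page

        Console.WriteLine(yui.<xhtml:h2>.Value)
        For Each td In yui...<xhtml:td>
            Dim text = td.<xhtml:p>.FirstOrDefault
            If text IsNot Nothing Then Console.WriteLine(text.Value)
        Next
    End Sub
End Module
```

The built-in attribute axis would do the same job without a helper: i.@class returns the attribute value as a String, or Nothing when it is missing. The Nothing checks matter on real pages, where the div or the p may simply not be there.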



The only tricky bit was that XElement.Load() requires well-formed XML/XHTML, while most web pages are poorly-formed HTML. So my function “Fetch” retrieves the page and then turns the sloppy HTML into clean XHTML. It does this through the awesome open source utility “tidy.exe”.
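A rough sketch of what such a Fetch function might look like follows. This is not the code from the complete source: the temp-file handling, the use of WebClient, and the particular tidy options are all my assumptions, and it assumes tidy.exe is on the PATH.

```vb
Imports System
Imports System.Diagnostics
Imports System.IO
Imports System.Net
Imports System.Xml.Linq

Module FetchSketch
    ' Downloads url, runs tidy.exe over the result, and returns the path
    ' of the cleaned-up XHTML file. Assumes tidy.exe is on the PATH.
    Function Fetch(url As String) As String
        Dim rawFile = Path.GetTempFileName()
        Dim cleanFile = Path.GetTempFileName()

        ' WebClient is marked obsolete in recent .NET versions but still works.
        Using client As New WebClient()
            client.DownloadFile(url, rawFile)
        End Using

        ' -asxhtml: emit XHTML; -numeric: numeric rather than named entities;
        ' --force-output yes: write output even when tidy finds errors.
        Dim psi As New ProcessStartInfo("tidy.exe") With {
            .Arguments = String.Format(
                "-asxhtml -numeric -quiet --force-output yes -o ""{0}"" ""{1}""",
                cleanFile, rawFile),
            .UseShellExecute = False,
            .CreateNoWindow = True
        }

        Using p = Process.Start(psi)
            p.WaitForExit()
            ' tidy exits with 0 (clean), 1 (warnings) or 2 (errors).
            If p.ExitCode > 1 Then
                Console.Error.WriteLine("tidy reported errors for " & url)
            End If
        End Using

        Return cleanFile
    End Function

    Sub Main(args As String())
        If args.Length = 0 Then Return
        ' The tidied file can be handed straight to XElement.Load.
        Dim xml = XElement.Load(Fetch(args(0)))
        Console.WriteLine(xml.Name)
    End Sub
End Module
```

Asking tidy for numeric entities is a deliberate choice here: named entities such as &nbsp; are only defined by the XHTML DTD, so the XML parser would otherwise have to resolve that DTD to read them.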

For more information, and complete source code, see here: http://blogs.msdn.com/lucian/archive/2009/02/21/web-scraping-with-vb-s-xml-support.aspx

