Web-scraping using Visual Basic's XML support

We’ve just had a fascinating article about using VB’s XML literals to produce web pages.

I’ve been interested in the other side of the process: using VB’s XML support to scrape web pages for data. Here’s the essential bit of code, which looks for something like <div class="content"><h2>Title</h2><table><tr><td><p>Data1</p></td></tr></table></div> and extracts the title and the data.

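' Assumes Imports <xmlns:xhtml="http://www.w3.org/1999/xhtml"> at the top of the file
Dim fn = Fetch(url)
Dim xml = XElement.Load(fn)
' First <div class="content"> anywhere in the page (Nothing if there is none)
Dim yui = (From i In xml...<xhtml:div> Where i.GetAttr("class") = "content").FirstOrDefault
Dim title = yui.<xhtml:h2>.Value.ToString.Replace(",", " ")
Dim data = From i In yui...<xhtml:td>
For Each td In data
    Dim text = td.<xhtml:p>.FirstOrDefault
    …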
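To try the query on its own, here is a minimal self-contained sketch that runs it against an inline XML literal instead of a downloaded page. The GetAttr helper shown here is a guess at what such an extension method might look like (the real one is in the complete source linked from this post), and the module name and sample markup are made up for illustration.

```vb
Option Infer On
Imports System
Imports System.Linq
Imports System.Runtime.CompilerServices
Imports System.Xml.Linq
Imports <xmlns:xhtml="http://www.w3.org/1999/xhtml">

Module ScrapeSketch
    ' Hypothetical helper: returns the attribute's value, or "" if it is absent.
    <Extension()> _
    Function GetAttr(ByVal e As XElement, ByVal name As String) As String
        Dim a = e.Attribute(name)
        Return If(a Is Nothing, "", a.Value)
    End Function

    Sub Main()
        ' Stand-in for the tidied page; the real code loads it with XElement.Load.
        Dim xml = <html xmlns="http://www.w3.org/1999/xhtml">
                      <body>
                          <div class="content">
                              <h2>Title</h2>
                              <table><tr><td><p>Data1</p></td></tr></table>
                          </div>
                      </body>
                  </html>

        ' Descendant axis (...) finds the div at any depth; the child axis (.) only looks one level down.
        Dim content = (From i In xml...<xhtml:div> Where i.GetAttr("class") = "content").FirstOrDefault
        If content Is Nothing Then Return

        Console.WriteLine(content.<xhtml:h2>.Value)
        For Each td In content...<xhtml:td>
            Dim p = td.<xhtml:p>.FirstOrDefault
            If p IsNot Nothing Then Console.WriteLine(p.Value)
        Next
    End Sub
End Module
```

The Nothing checks matter in practice: FirstOrDefault returns Nothing when the page has no matching element, and scraped pages change layout without warning.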

The only tricky bit was that XElement.Load() requires well-formed XML/XHTML, while most web pages are poorly formed HTML. So my function “Fetch” retrieves the page and then turns the sloppy HTML into clean XHTML. It does this through the awesome open-source utility “tidy.exe”.
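As a rough idea of what such a function might look like, here is a sketch of a Fetch that downloads the page to a temporary file, runs tidy.exe over it, and returns the path of the cleaned file. This is not the actual implementation from the linked post: the temp-file handling, the choice of tidy options, and the error check are all assumptions, and it expects tidy.exe to be on the PATH.

```vb
Option Infer On
Imports System
Imports System.Diagnostics
Imports System.IO
Imports System.Net

Module FetchSketch
    ' Hypothetical reconstruction: returns the path of a file containing the tidied XHTML.
    Function Fetch(ByVal url As String) As String
        Dim rawFile = Path.GetTempFileName()
        Dim cleanFile = Path.GetTempFileName()

        ' Download the page exactly as the server sends it.
        Using client As New WebClient()
            client.DownloadFile(url, rawFile)
        End Using

        ' -asxhtml converts to well-formed XHTML; -numeric writes numeric rather than
        ' named entities; --force-output yes writes a result even if tidy reports errors.
        Dim psi As New ProcessStartInfo("tidy.exe")
        psi.Arguments = String.Format("-asxhtml -numeric -quiet --force-output yes -o ""{0}"" ""{1}""", cleanFile, rawFile)
        psi.UseShellExecute = False
        psi.CreateNoWindow = True

        Using p = Process.Start(psi)
            p.WaitForExit()
        End Using
        File.Delete(rawFile)

        ' Tidy exits non-zero for mere warnings, so test for output rather than the exit code.
        If New FileInfo(cleanFile).Length = 0 Then
            Throw New InvalidOperationException("tidy.exe produced no output for " & url)
        End If
        Return cleanFile
    End Function
End Module
```

Returning a file path keeps the call site as in the snippet above, since XElement.Load accepts a file name. The caller is then responsible for deleting the temporary file.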

For more information, and complete source code, see here: http://blogs.msdn.com/lucian/archive/2009/02/21/web-scraping-with-vb-s-xml-support.aspx
