February 20th, 2009

Web-scraping using Visual Basic's XML support

We’ve just had a fascinating article about using VB’s XML literals to produce web pages.

I’ve been interested in the other side of the process: using VB’s XML support to scrape web pages for data. Here’s the essential bit of code, which looks for something like <div class="content"><h2>Title</h2> <table><tr><td><p>Data1</p></td></tr></table></div> and extracts the title and the data.

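' Assumes this at the top of the file: Imports <xmlns:xhtml="http://www.w3.org/1999/xhtml">
Dim fn = Fetch(url)
Dim xml = XElement.Load(fn)
' GetAttr is not a built-in XElement method; presumably it is a helper from the complete source
Dim yui = (From i In xml...<xhtml:div> Where i.GetAttr("class") = "content").FirstOrDefault
Dim title = yui.<xhtml:h2>.Value.Replace(",", " ")
Dim data = From i In yui...<xhtml:td>
For Each td In data
    Dim text = td.<xhtml:p>.FirstOrDefault

Next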
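To try the query without fetching anything, here is a minimal self-contained sketch that runs the same kind of lookup against an inline XML literal standing in for a tidied page. The names ScrapeSketch and ReadContent are made up for illustration, the XHTML namespace import is an assumption about what the xhtml: prefix refers to, and the built-in attribute axis (.@class) is swapped in for GetAttr, which is not shown in this post.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Xml.Linq
Imports <xmlns:xhtml="http://www.w3.org/1999/xhtml">

Module ScrapeSketch
    ' Hypothetical helper: returns the title followed by the text of each data cell.
    Function ReadContent(ByVal page As XElement) As List(Of String)
        Dim results As New List(Of String)

        ' First <div class="content"> anywhere in the page, or Nothing if absent.
        Dim content = (From d In page...<xhtml:div> Where d.@class = "content").FirstOrDefault
        If content Is Nothing Then Return results

        results.Add(content.<xhtml:h2>.Value)
        For Each td In content...<xhtml:td>
            Dim p = td.<xhtml:p>.FirstOrDefault
            If p IsNot Nothing Then results.Add(p.Value)
        Next
        Return results
    End Function

    Sub Main()
        ' Stand-in for a tidied page; the elements live in the XHTML namespace.
        Dim page = <html xmlns="http://www.w3.org/1999/xhtml">
                       <body>
                           <div class="content">
                               <h2>Title</h2>
                               <table><tr><td><p>Data1</p></td></tr></table>
                           </div>
                       </body>
                   </html>

        For Each item In ReadContent(page)
            Console.WriteLine(item)
        Next
    End Sub
End Module
```

The two Nothing checks are the main difference from the fragment above: FirstOrDefault returns Nothing when no matching element exists, so a page without the expected div or p would otherwise fail with a NullReferenceException.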

The only tricky bit was that XElement.Load() requires well-formed XML/XHTML, while most web pages are poorly formed HTML. So my function “Fetch” retrieves the page and then turns the sloppy HTML into clean XHTML. It does this through the awesome open-source utility “tidy.exe”.
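For a rough idea of what such a function could look like, here is a sketch; it is not the actual Fetch from the complete source. It assumes tidy.exe is on the PATH, that Fetch returns the path of a temporary file, and that these tidy options suit the page: -asxhtml to emit XHTML, -numeric to write entities as numbers, --doctype omit so the XML parser has no DTD to resolve, and --force-output yes to write a result even when tidy reports errors. The names FetchSketch and BuildTidyArgs are made up for illustration.

```vb
Imports System
Imports System.Diagnostics
Imports System.IO
Imports System.Net

Module FetchSketch
    ' Hypothetical helper: builds the tidy.exe command line for one conversion.
    Function BuildTidyArgs(ByVal rawFile As String, ByVal cleanFile As String) As String
        Return String.Format(
            "-asxhtml -numeric -quiet --doctype omit --force-output yes -o ""{0}"" ""{1}""",
            cleanFile, rawFile)
    End Function

    ' Sketch of Fetch: downloads the page, tidies it, returns the XHTML file's path.
    Function Fetch(ByVal url As String) As String
        Dim rawFile = Path.GetTempFileName()
        Dim cleanFile = Path.GetTempFileName()

        Using client As New WebClient()
            client.DownloadFile(url, rawFile)
        End Using

        Dim info As New ProcessStartInfo("tidy.exe", BuildTidyArgs(rawFile, cleanFile))
        info.UseShellExecute = False
        info.CreateNoWindow = True
        Using tidy = Process.Start(info)
            ' tidy exits non-zero when it had warnings or errors; the output
            ' file is still written because of --force-output.
            tidy.WaitForExit()
        End Using

        File.Delete(rawFile)
        Return cleanFile
    End Function
End Module
```

The caller is left to delete the returned file once XElement.Load has read it. Character encoding is not handled here; a page that is not in tidy's default input encoding would need an extra option.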

For more information, and complete source code, see here: http://blogs.msdn.com/lucian/archive/2009/02/21/web-scraping-with-vb-s-xml-support.aspx
