Web-scraping using Visual Basic's XML support
We’ve just had a fascinating article about using VB’s XML literals to produce web pages.
I’ve been interested in the other side of the process: using VB’s XML support to scrape web pages for data. Here’s the essential bit of code, which looks for something like <div class="content"><h2>Title</h2> <table><tr><td><p>Data1</p></td></tr></table></div>, and extracts the title and the data.
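' At the top of the file: bind the xhtml: prefix to the XHTML namespace
Imports <xmlns:xhtml="http://www.w3.org/1999/xhtml">

' Fetch retrieves the page and tidies it into XHTML that XElement.Load can read
Dim fn = Fetch(url)
Dim xml = XElement.Load(fn)
' GetAttr is a helper rather than a built-in XElement method
Dim yui = (From i In xml...<xhtml:div> Where i.GetAttr("class") = "content").FirstOrDefault
Dim title = yui.<xhtml:h2>.Value.Replace(",", " ")
Dim data = From i In yui...<xhtml:td>
For Each td In data
    Dim text = td.<xhtml:p>.FirstOrDefault
Next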
The only tricky bit is that XElement.Load() requires well-formed XML/XHTML, while most web pages are poorly formed HTML. So my function “Fetch” retrieves the page and then turns the sloppy HTML into clean XHTML. It does this through the awesome open-source utility “tidy.exe”.
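To make that step concrete, here is a minimal sketch of what a function like Fetch could look like. It is not the original implementation, and several things in it are assumptions: that tidy.exe is on the PATH, that the page is downloaded with WebClient, that the function hands back the filename of the tidied copy, and the module name FetchSketch. The switches -asxhtml, -numeric, -quiet and -output are standard tidy command-line options; --doctype omit is included on the assumption that you don't want XElement.Load to chase the XHTML DTD.

```vb
Imports System.Diagnostics
Imports System.IO
Imports System.Net

Module FetchSketch
    ' Hypothetical sketch: download a page, run tidy.exe over it,
    ' and return the path of the resulting XHTML file.
    Function Fetch(ByVal url As String) As String
        Dim rawFile As String = Path.GetTempFileName()
        Dim cleanFile As String = Path.ChangeExtension(rawFile, ".xhtml")

        ' Save the raw (possibly sloppy) HTML to a temporary file.
        Using client As New WebClient()
            client.DownloadFile(url, rawFile)
        End Using

        ' Ask tidy to write well-formed XHTML with numeric entities,
        ' so the XML parser does not stumble over names like &nbsp;.
        Dim psi As New ProcessStartInfo("tidy.exe")
        psi.Arguments = String.Format(
            "-asxhtml -numeric -quiet --doctype omit -output ""{0}"" ""{1}""",
            cleanFile, rawFile)
        psi.UseShellExecute = False
        psi.CreateNoWindow = True

        Using tidy As Process = Process.Start(psi)
            tidy.WaitForExit()
            ' tidy exits with 0 for success, 1 for warnings, 2 for errors.
            If tidy.ExitCode > 1 Then
                Throw New InvalidDataException("tidy could not repair " & url)
            End If
        End Using

        File.Delete(rawFile)
        Return cleanFile
    End Function
End Module
```

Going through a temporary file keeps the sketch simple and matches the way the snippet above passes the result straight to XElement.Load. Piping the HTML through tidy's standard input and output would avoid the files, at the cost of a bit more plumbing.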
For more information, and complete source code, see here: http://blogs.msdn.com/lucian/archive/2009/02/21/web-scraping-with-vb-s-xml-support.aspx