VB XML Cookbook, Recipe 7: Enumerating Large XML Files (Doug Rothaus)
It’s been a while since I wrote one of these XML cookbook entries. Here’s some info on a common problem: really big XML files.
I’m going to show you two things in this recipe. The first is a tip on reading very large XML files while still being able to use XML Axis Properties. The second is how to make the contents of the file available to LINQ queries by exposing them as IEnumerable.
Reading a Large XML File
If you’re new to working with XML, there’s something important that you need to know. That is, when you load an XML file into an in-memory document, the entire file gets loaded into memory. When working with small XML files, which is often the case, this is no big deal. In fact, it’s rather convenient. However, if you are working with an extremely large XML file, this is a big problem. I recently wrote some code to read through a bunch of XML files, not realizing that one of them was over half a gigabyte! My code loaded the entire file into memory using the XDocument.Load() method—well, it tried to at least. Needless to say, when I hit the huge file, my app did not perform well.
How do you read an enormous XML file then? You use the XmlReader class, which has been around since the first release of the .NET Framework. It reads through an XML file, but simply places a pointer on the current XML element or attribute as you go through the file. As you read through the file with the XmlReader object, you can examine the current XML, decide if you are interested in it, process it, discard it, and move on to the next part of the file. The important thing is that you can minimize how much memory is utilized at any one time in your app.
Take heed: you still need to be aware of how you are reading through the file. If you open an XmlReader, read to the root element, and then load that entire element into memory, you haven’t solved anything.
Now you may be saying that if you use an XmlReader object, you don’t get all of the cool functionality of XML Axis Properties. That’s true, and that’s why there’s the XNode.ReadFrom method, which reads the XML at the XmlReader’s current position into an XNode. You can then cast that XNode to an XElement object and make use of all of the VB XML juicy goodness. Using an XmlReader and the ReadFrom method together ensures that you only use as much memory as the largest XML element that you load.
Let’s look at an example. The app that I was working on was reading through XML files that contained reflection information from .NET assemblies. For each member of a particular class, there was an <api> element. Within that <api> element there was a bunch of information about that member, and my app needed to grab some of the info for use in summary counts. Here’s an abbreviated XML sample of what the data looked like.
<reflection>
  <apis>
    <api id="M:Microsoft.VisualBasic.Strings.Mid(System.String,System.Int32)">
      <apidata name="Mid" group="member" subgroup="method" />
      <containers>
        <namespace api="N:Microsoft.VisualBasic" />
        <type api="T:Microsoft.VisualBasic.Strings" ref="true" />
      </containers>
    </api>
    <api id="M:Microsoft.VisualBasic.Strings.Mid(System.String,System.Int32,System.Int32)">
      <apidata name="Mid" group="member" subgroup="method" />
      <containers>
        <namespace api="N:Microsoft.VisualBasic" />
        <type api="T:Microsoft.VisualBasic.Strings" ref="true" />
      </containers>
    </api>
  </apis>
</reflection>
Here’s some code to read through each <api> element, one at a time, with an XmlReader object. Once I have loaded the <api> element into an XElement object, I can use XML Axis properties to get values from the XML contained in the element. The most memory that I use is determined by the largest <api> element rather than the entire file.
Dim reader = Xml.XmlReader.Create("….reflectionData.xml")
reader.MoveToContent()
While reader.ReadToFollowing("api")
    Dim api = TryCast(XElement.ReadFrom(reader), XElement)
    If api Is Nothing Then Continue While
    ' Get information from the <api> element.
    Dim ns = api.<containers>.<namespace>.@api
    Dim containingType = api.<containers>.<type>.@api
End While
reader.Close()
Now this code simply reads to the first <api> element and then reads all of its sibling <api> elements. If one of the <api> elements has a child <api> element, that child element gets loaded in the call to the ReadFrom method and the XmlReader object’s pointer moves past it. This works fine for my app because none of the <api> elements have child <api> elements. You may have different requirements and need to adjust your code.
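One adjustment you may need, offered as a sketch rather than as part of the original app: ReadFrom leaves the reader positioned on the node just after the element it consumed, and ReadToFollowing advances before it tests a node. In an indented file like the sample, whitespace sits between the elements, so nothing is lost. In a file with no whitespace between adjacent <api> elements, the loop can step over every other one. The loop below checks the current node before advancing. The ReadApiIds function and the module name are hypothetical, and the code reads from a string only to keep the example self-contained.

```vb
Imports System.Collections.Generic
Imports System.IO
Imports System.Xml
Imports System.Xml.Linq

Module AdjacentElementsSketch
    ' Hypothetical helper: returns the id attribute of every <api> element,
    ' loading one <api> element at a time.
    Function ReadApiIds(ByVal xmlText As String) As List(Of String)
        Dim ids As New List(Of String)
        Using reader As XmlReader = XmlReader.Create(New StringReader(xmlText))
            reader.MoveToContent()
            While Not reader.EOF
                If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "api" Then
                    ' ReadFrom consumes the element and leaves the reader on the
                    ' node after it, so there is no extra Read call in this branch.
                    Dim api As XElement = DirectCast(XNode.ReadFrom(reader), XElement)
                    ids.Add(api.@id)
                Else
                    ' Not an <api> start tag: advance one node.
                    reader.Read()
                End If
            End While
        End Using
        Return ids
    End Function
End Module
```

As in the loop above, a nested <api> element is still loaded as part of its parent and is not reported separately.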
I ran this code on a 5MB file with a little less than 30,000 <api> elements. Loading the entire file into memory consumed over 120MB. Using the XmlReader, the code consumed less than 1MB. I gathered memory stats using the GetTotalMemory method.
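For anyone who wants to take similar measurements, here is a rough sketch built on GC.GetTotalMemory. The MeasureBytes helper and the module name are hypothetical, and this is not necessarily how the numbers above were gathered. Treat the result as an approximation of managed heap growth, not a precise profile.

```vb
Imports System

Module MemoryStatsSketch
    ' Hypothetical helper: runs the supplied work and returns the approximate
    ' change in managed heap size, in bytes.
    Function MeasureBytes(ByVal work As Action) As Long
        ' Passing True asks the runtime to collect garbage first, so the
        ' starting figure is as clean as possible.
        Dim before As Long = GC.GetTotalMemory(True)
        work()
        ' Passing False reads the current figure without forcing a collection,
        ' so it includes everything the work allocated and has not yet freed.
        Dim after As Long = GC.GetTotalMemory(False)
        Return after - before
    End Function
End Module
```

You would pass the code under test as the work argument, once for the XDocument.Load version and once for the XmlReader version, and compare the two results.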
One last thing to note in this section is that you can also run into memory issues when writing to a file. If you create a large XDocument in memory, and then write it to a file, you end up consuming the memory required to create the document, which is likely unnecessary. You have a couple of choices to minimize your memory footprint while writing an XML file. Similar to using the XmlReader class and the ReadFrom method, you can use the XmlWriter class and the WriteTo method. As another option, you can use the XStreamingElement class to write a single element at a time from an enumerable source, such as a LINQ query.
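As a minimal sketch of the XStreamingElement option: the content below is a LINQ query, which is not enumerated until the element is written, so each <api> element is created, written, and released in turn. The WriteApis procedure, the module name, and the element layout are assumptions made for illustration.

```vb
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports System.Xml
Imports System.Xml.Linq

Module StreamingWriteSketch
    ' Hypothetical helper: writes one <api> element per id without first
    ' building the whole <apis> tree in memory.
    Sub WriteApis(ByVal ids As IEnumerable(Of String), ByVal output As TextWriter)
        ' The query is stored, not run, when the XStreamingElement is created.
        Dim apis As New XStreamingElement("apis", _
            From id In ids Select New XElement("api", New XAttribute("id", id)))
        Using writer As XmlWriter = XmlWriter.Create(output)
            ' WriteTo enumerates the query and writes each element as it goes.
            apis.WriteTo(writer)
        End Using
    End Sub
End Module
```

Passing a StreamWriter opened on a file as the output argument sends the XML straight to disk.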
What about LINQ Queries?
In addition to having a small memory footprint, I also wanted to be able to use LINQ to query a large XML file. This can be achieved by creating a class that implements the IEnumerable interface. By fitting the code that I would have used to loop through the XML file into a class that implements IEnumerable(Of XElement), I can use an instance of that class as the source of any number of LINQ queries.
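To make the goal concrete before building the classes, here is a sketch of the kind of query this enables. It uses an Iterator function, which requires a version of Visual Basic that supports Iterator and Yield (Visual Basic 11 or later), as a compact stand-in for the classes described in the rest of this recipe. StreamElements, MethodIds, and the module name are hypothetical.

```vb
Imports System.Collections.Generic
Imports System.Linq
Imports System.Xml
Imports System.Xml.Linq

Module IteratorSketch
    ' Hypothetical stand-in for the recipe's classes: yields each element
    ' named elementName from the file, one at a time.
    Iterator Function StreamElements(ByVal filePath As String, _
            ByVal elementName As String) As IEnumerable(Of XElement)
        Using reader As XmlReader = XmlReader.Create(filePath)
            reader.MoveToContent()
            While Not reader.EOF
                If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = elementName Then
                    ' ReadFrom loads only this element and moves the reader past it.
                    Yield DirectCast(XNode.ReadFrom(reader), XElement)
                Else
                    reader.Read()
                End If
            End While
        End Using
    End Function

    ' Example query: ids of the <api> elements whose <apidata> subgroup is "method".
    Function MethodIds(ByVal filePath As String) As List(Of String)
        Return (From api In StreamElements(filePath, "api") _
                Where api.<apidata>.@subgroup = "method" _
                Select api.@id).ToList()
    End Function
End Module
```

Each enumeration opens the file again, so the same source can feed more than one query, at the cost of one pass over the file per query.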
What I’ve created for this step is almost exactly the same as the class created by this walkthrough: Walkthrough: Implementing IEnumerable(Of T) in Visual Basic. The walkthrough shows you how to implement IEnumerable(Of String) to expose the contents of a text file one line at a time. We’ll do the same with an XML file.
When you implement IEnumerable, you actually need to implement both IEnumerable and IEnumerator. The bulk of your code goes into the IEnumerator implementation. You could create one class that implements both, but I like to split them into two classes.
I’ve called the class that implements IEnumerable(Of XElement) XmlReaderEnumerable. Following that naming convention I’ve called the class that implements IEnumerator(Of XElement) XmlReaderEnumerator. The behavior is the same as the earlier XmlReader example. The XmlReaderEnumerator class finds the first instance of a particular element, and then finds all of its sibling elements of the same name. As a result, I’ve added a constructor that takes both the path to the XML file, and the name of the XML element to search for. Note that the name is case sensitive as XML is case sensitive.
The XmlReaderEnumerable class doesn’t do much. All it does is return a reference to an instance of the XmlReaderEnumerator class. Here’s the code.
Public Class XmlReaderEnumerable
    Implements IEnumerable(Of XElement)

    Private _filePath As String
    Private _elementName As String

    Public Sub New(ByVal filePath As String, ByVal elementName As String)
        _filePath = filePath
        _elementName = elementName
    End Sub

    Public Function GetEnumerator() As IEnumerator(Of XElement) _
        Implements IEnumerable(Of XElement).GetEnumerator
        Return New XmlReaderEnumerator(_filePath, _elementName)
    End Function

    Private Function GetEnumerator1() As IEnumerator _
        Implements IEnumerable.GetEnumerator
        Return Me.GetEnumerator()
    End Function
End Class
The XmlReaderEnumerator class is where the code resides to read through the XML file. In the constructor, it opens the file and moves to the start of the XML content. In the MoveNext method, it reads to the element of the supplied name (for example, “api”). In the Dispose method, it closes the reader. That’s it. It looks like a lot of code, but it really isn’t.
Public Class XmlReaderEnumerator
    Implements IEnumerator(Of XElement)

    Private _xmlReader As Xml.XmlReader
    Private _elementName As String
    Private _filePath As String

    Public Sub New(ByVal filePath As String, ByVal elementName As String)
        _filePath = filePath
        _elementName = elementName
        _xmlReader = Xml.XmlReader.Create(_filePath)
        _xmlReader.MoveToContent()
    End Sub

    Private _current As XElement

    Public ReadOnly Property Current() As XElement