June 7th, 2010

VB XML Cookbook, Recipe 7: Enumerating Large XML Files (Doug Rothaus)

It’s been a while since I wrote one of these XML cookbook entries. Here’s some info on a common problem: Really big XML files.

I’m going to show you two things in this recipe. The first is a tip on reading very large XML files while still being able to use XML Axis Properties. The second is how to make the contents of such a file available to LINQ queries by exposing them as IEnumerable.

Reading a Large XML File

If you’re new to working with XML, there’s something important that you need to know. That is, when you load an XML file into an in-memory document, the entire file gets loaded into memory. When working with small XML files, which is often the case, this is no big deal. In fact, it’s rather convenient. However, if you are working with an extremely large XML file, this is a big problem. I recently wrote some code to read through a bunch of XML files, not realizing that one of them was over half a gigabyte! My code loaded the entire file into memory using the XDocument.Load() method—well, it tried to at least. Needless to say, when I hit the huge file, my app did not perform well.

How do you read an enormous XML file then? You use the XmlReader class, which has been around since the first release of the .NET Framework. It reads through an XML file, but simply places a pointer on the current XML element or attribute as you go through the file. As you read through the file with the XmlReader object, you can examine the current XML, decide if you are interested in it, process it, discard it, and move on to the next part of the file. The important thing is that you can minimize how much memory is utilized at any one time in your app.

Take heed: you still need to be aware of how you are reading through the file. If you open an XmlReader, read to the root element, and then load that entire element into memory, you haven’t solved anything.

Now you may be saying that if you use an XmlReader object, you don’t get all of the cool functionality of XML Axis Properties. That’s true, and that’s why there’s the XNode.ReadFrom method, which reads the XML at the XmlReader’s current position into an XNode. You can then cast that XNode to an XElement object and make use of all of the VB XML juicy goodness. Using an XmlReader and the ReadFrom method together ensures that you only use as much memory as the largest XML element that you load.

 Let’s look at an example. The app that I was working on was reading through XML files that contained reflection information from .NET assemblies. For each member of a particular class, there was an <api> element. Within that <api> element there was a bunch of information about that member, and my app needed to grab some of the info for use in summary counts. Here’s an abbreviated XML sample of what the data looked like.

<reflection>
  <apis>
    <api id="M:Microsoft.VisualBasic.Strings.Mid(System.String,System.Int32)">
      <apidata name="Mid" group="member" subgroup="method" />
      <containers>
        <namespace api="N:Microsoft.VisualBasic" />
        <type api="T:Microsoft.VisualBasic.Strings" ref="true" />
      </containers>
    </api>
    <api id="M:Microsoft.VisualBasic.Strings.Mid(System.String,System.Int32,System.Int32)">
      <apidata name="Mid" group="member" subgroup="method" />
      <containers>
        <namespace api="N:Microsoft.VisualBasic" />
        <type api="T:Microsoft.VisualBasic.Strings" ref="true" />
      </containers>
    </api>
  </apis>
</reflection>

 

Here’s some code to read through each <api> element, one at a time, with an XmlReader object. Once I have loaded an <api> element into an XElement object, I can use XML Axis Properties to get values from the XML contained in the element. The most memory that I use is determined by the largest <api> element rather than by the entire file.

 

        Dim reader = Xml.XmlReader.Create("….reflectionData.xml")
        reader.MoveToContent()

        While reader.ReadToFollowing("api")
            Dim api = TryCast(XElement.ReadFrom(reader), XElement)

            If api Is Nothing Then Continue While

            ' Get information from the <api> element.
            Dim ns = api.<containers>.<namespace>.@api
            Dim containingType = api.<containers>.<type>.@api
        End While

        reader.Close()

 

 

 

Now this code simply reads to the first <api> element and then reads all of its sibling <api> elements. If one of the <api> elements has a child <api> element, that child element gets loaded in the call to the ReadFrom method and the XmlReader object’s pointer moves past it. This works fine for my app because none of the <api> elements have child <api> elements. You may have different requirements and need to adjust your code.
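One case worth checking against your own data: ReadFrom leaves the reader positioned on the node that follows the element it just consumed, and ReadToFollowing always advances before it tests for a match. In an indented file the following node is whitespace, so nothing is lost. If two <api> elements are written back to back with no whitespace between them, however, the second one is skipped. The following is a minimal sketch of a loop that tests the current node before advancing. The module name and the file path are placeholders, not part of the original app.

```vb
Option Infer On
Imports System
Imports System.Xml
Imports System.Xml.Linq

Module ReadAdjacentElements
    Sub Main()
        ' Placeholder path; substitute your own file.
        Using reader = XmlReader.Create("reflectionData.xml")
            reader.MoveToContent()

            Do While Not reader.EOF
                If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "api" Then
                    ' ReadFrom consumes the whole element and leaves the reader
                    ' on the next node, so do not call Read() in this branch.
                    Dim api = DirectCast(XNode.ReadFrom(reader), XElement)
                    Console.WriteLine(api.<containers>.<namespace>.@api)
                Else
                    ' Not an <api> start tag: advance one node.
                    reader.Read()
                End If
            Loop
        End Using
    End Sub
End Module
```

As with the original loop, a nested <api> element is loaded as part of its parent rather than visited separately.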

 

I ran this code on a 5 MB file with a little less than 30,000 <api> elements. Loading the entire file into memory consumed over 120 MB. Using the XmlReader, the code consumed less than 1 MB. I gathered memory stats using the GC.GetTotalMemory method.
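For anyone who wants to repeat the comparison, here is a rough sketch of how such a measurement might be taken with GC.GetTotalMemory. It is not the code used for the numbers above; the file path is a placeholder, and the streaming figure is only an approximation because GetTotalMemory(False) also counts garbage that has not been collected yet.

```vb
Option Infer On
Imports System
Imports System.Xml
Imports System.Xml.Linq

Module MemoryCheck
    Sub Main()
        ' Placeholder path; substitute a real file.
        Dim filePath = "reflectionData.xml"

        ' Approach 1: load the whole document.
        Dim before = GC.GetTotalMemory(True)
        Dim doc As XDocument = XDocument.Load(filePath)
        Dim afterLoad = GC.GetTotalMemory(True)
        Console.WriteLine("XDocument.Load: {0:N0} bytes", afterLoad - before)
        GC.KeepAlive(doc)   ' Keep the document alive until after the measurement.
        doc = Nothing

        ' Approach 2: stream one <api> element at a time and track the peak.
        Dim baseline = GC.GetTotalMemory(True)
        Dim peak = baseline
        Using reader = XmlReader.Create(filePath)
            reader.MoveToContent()
            While reader.ReadToFollowing("api")
                Dim api = TryCast(XNode.ReadFrom(reader), XElement)
                peak = Math.Max(peak, GC.GetTotalMemory(False))
            End While
        End Using
        Console.WriteLine("XmlReader loop: {0:N0} bytes above baseline", peak - baseline)
    End Sub
End Module
```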

 

One last thing to note in this section is that you can also run into memory issues when writing to a file. If you create a large XDocument in memory, and then write it to a file, you end up consuming the memory required to create the document, which is likely unnecessary. You have a couple of choices to minimize your memory footprint while writing an XML file. Similar to using the XmlReader class and the ReadFrom method, you can use the XmlWriter class and the WriteTo method. As another option, you can use the XStreamingElement class to write a single element at a time from an enumerable source, such as a LINQ query. 
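As a small illustration of the XStreamingElement option, the sketch below writes a file from a LINQ query without first building the whole tree in memory. The source data, element names, and output path are made up for the example.

```vb
Option Infer On
Imports System.Linq
Imports System.Xml.Linq

Module StreamingWrite
    Sub Main()
        ' Hypothetical source data: any lazily evaluated sequence works here.
        Dim ids = Enumerable.Range(1, 100000)

        ' The query is not executed when the XStreamingElement is created.
        Dim root = New XStreamingElement("apis",
            From i In ids
            Select New XElement("api",
                       New XAttribute("id", "M:Example.Member" & i.ToString())))

        ' The query runs while Save writes, so each child element is created,
        ' written, and released in turn. "output.xml" is a placeholder path.
        root.Save("output.xml")
    End Sub
End Module
```

The trade-off is that an XStreamingElement cannot be queried or modified like an XElement; it exists only to be written out.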

 

What about LINQ Queries?

In addition to having a small memory footprint, I also wanted to be able to use LINQ to query a large XML file. This can be achieved by creating a class that implements the IEnumerable interface. By fitting the code that I would have used to loop through the XML file into a class that implements IEnumerable(Of XElement), I can use an instance of that class as the source of any number of LINQ queries.

 

What I’ve created for this step is almost exactly the same as the class created in this walkthrough: Walkthrough: Implementing IEnumerable(Of T) in Visual Basic. The walkthrough shows you how to implement IEnumerable(Of String) to expose the contents of a text file one line at a time. We’ll do the same with an XML file, one element at a time.

 

When you implement IEnumerable, you actually need to implement both IEnumerable and IEnumerator. The bulk of your code goes into the IEnumerator implementation. You could create one class that implements both, but I like to split them into two classes.
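As an aside, Visual Basic 11 (Visual Studio 2012) and later can generate both interfaces for you through Iterator functions and the Yield statement. That is a different technique from the two-class approach this recipe walks through, but it produces the same kind of lazily evaluated sequence. Here is a minimal sketch; the module name, the StreamElements function, and the file path are all hypothetical.

```vb
Option Infer On
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Xml
Imports System.Xml.Linq

Module StreamingXml
    ' Yields each element named elementName, one at a time.
    Public Iterator Function StreamElements(ByVal filePath As String,
                                            ByVal elementName As String) As IEnumerable(Of XElement)
        Using reader = XmlReader.Create(filePath)
            reader.MoveToContent()
            Do While Not reader.EOF
                If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = elementName Then
                    ' ReadFrom advances the reader past the element it returns.
                    Yield DirectCast(XNode.ReadFrom(reader), XElement)
                Else
                    reader.Read()
                End If
            Loop
        End Using
    End Function

    Sub Main()
        ' Count methods per containing type; only one <api> element is held at a time.
        Dim methodCounts = From api In StreamElements("reflectionData.xml", "api")
                           Where api.<apidata>.@subgroup = "method"
                           Group By typeName = api.<containers>.<type>.@api Into memberCount = Count()

        For Each item In methodCounts
            Console.WriteLine("{0}: {1}", item.typeName, item.memberCount)
        Next
    End Sub
End Module
```

The file is opened when the query starts enumerating and closed by the Using block when enumeration finishes or the enumerator is disposed.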

 

I’ve called the class that implements IEnumerable(Of XElement) XmlReaderEnumerable. Following that naming convention, I’ve called the class that implements IEnumerator(Of XElement) XmlReaderEnumerator. The behavior is the same as the earlier XmlReader example. The XmlReaderEnumerator class finds the first instance of a particular element, and then finds all of its sibling elements of the same name. As a result, I’ve added a constructor that takes both the path to the XML file and the name of the XML element to search for. Note that the element name is case sensitive, because XML is case sensitive.

 

The XmlReaderEnumerable class doesn’t do much. All it does is return a reference to an instance of the XmlReaderEnumerator class. Here’s the code.

 

Public Class XmlReaderEnumerable
    Implements IEnumerable(Of XElement)

    Private _filePath As String
    Private _elementName As String

    Public Sub New(ByVal filePath As String, ByVal elementName As String)
        _filePath = filePath
        _elementName = elementName
    End Sub

    Public Function GetEnumerator() As IEnumerator(Of XElement) _
        Implements IEnumerable(Of XElement).GetEnumerator

        Return New XmlReaderEnumerator(_filePath, _elementName)
    End Function

    Private Function GetEnumerator1() As IEnumerator _
        Implements IEnumerable.GetEnumerator

        Return Me.GetEnumerator()
    End Function
End Class

 

The XmlReaderEnumerator class is where the code resides to read through the XML file. In the constructor, it opens the file and moves to the start of the XML content. In the MoveNext method, it reads to the element of the supplied name (for example, “api”). In the Dispose method, it closes the reader. That’s it. It looks like a lot of code, but it really isn’t.

 

Public Class XmlReaderEnumerator
    Implements IEnumerator(Of XElement)

    Private _xmlReader As Xml.XmlReader
    Private _elementName As String
    Private _filePath As String

    Public Sub New(ByVal filePath As String, ByVal elementName As String)
        _filePath = filePath
        _elementName = elementName

        _xmlReader = Xml.XmlReader.Create(_filePath)
        _xmlReader.MoveToContent()
    End Sub

    Private _current As XElement

    Public ReadOnly Property Current() As XElement
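
The listing above breaks off at the Current property. What follows is a sketch of how the complete enumerator could look, reconstructed from the description earlier in this section: the constructor opens the file, MoveNext reads to the named element, and Dispose closes the reader. The bodies of Current, MoveNext, Reset, and Dispose are assumptions, not the original code, and Current1 is just a placeholder name for the non-generic property.

```vb
Imports System
Imports System.Collections
Imports System.Collections.Generic
Imports System.Xml
Imports System.Xml.Linq

Public Class XmlReaderEnumerator
    Implements IEnumerator(Of XElement)

    Private _xmlReader As XmlReader
    Private _elementName As String
    Private _filePath As String
    Private _current As XElement

    Public Sub New(ByVal filePath As String, ByVal elementName As String)
        _filePath = filePath
        _elementName = elementName

        _xmlReader = XmlReader.Create(_filePath)
        _xmlReader.MoveToContent()
    End Sub

    Public ReadOnly Property Current() As XElement _
        Implements IEnumerator(Of XElement).Current
        Get
            Return _current
        End Get
    End Property

    ' Non-generic IEnumerator.Current returns the same element as Object.
    Private ReadOnly Property Current1() As Object _
        Implements IEnumerator.Current
        Get
            Return _current
        End Get
    End Property

    Public Function MoveNext() As Boolean Implements IEnumerator.MoveNext
        ' ReadFrom leaves the reader on the node after the element it consumed,
        ' so test the current node before advancing (assumed design).
        Do While Not _xmlReader.EOF
            If _xmlReader.NodeType = XmlNodeType.Element AndAlso _xmlReader.Name = _elementName Then
                _current = DirectCast(XNode.ReadFrom(_xmlReader), XElement)
                Return True
            End If
            _xmlReader.Read()
        Loop

        _current = Nothing
        Return False
    End Function

    Public Sub Reset() Implements IEnumerator.Reset
        ' XmlReader is forward-only, so start over by reopening the file.
        _xmlReader.Close()
        _xmlReader = XmlReader.Create(_filePath)
        _xmlReader.MoveToContent()
        _current = Nothing
    End Sub

    Public Sub Dispose() Implements IDisposable.Dispose
        _xmlReader.Close()
    End Sub
End Class
```

Because XmlReaderEnumerable returns a new XmlReaderEnumerator from every GetEnumerator call, each LINQ query or For Each loop over the same XmlReaderEnumerable instance opens the file again and reads it from the start.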
