i-think Twenty-Two

Now with more coherency.

XML Made Easy with LINQ to XML


XML is a fantastic way to structure information. Here are the two things I like most about XML.

  1. Its fundamental concepts are simple, making many XML files readable by regular humans.
  2. The formalised structure enables re-use of a more generalised XML parser.

Projects can certainly suffer from too much XML, or from XML being used where a better option exists. Once your XML files become too difficult to read in a text editor, it may be better to look at another option (or to design your XML schema better).

A lightning fast introduction to XML

Skip this section if you already know XML, but take time to look at this XML sample as it will be used throughout the article.

<?xml version="1.0" encoding="UTF-8"?>
<Blog>
   <Entries>
      <Entry Archived="false">
         <Title>My First Post</Title>
         <Body>I love LINQ. It's the best</Body>
         <Comments>
            <!-- TODO: Shouldn't comments have authors? -->
            <Comment>I love LINQ more</Comment>
            <Comment>LINQ is the way of the future.</Comment>
         </Comments>
      </Entry>                  
   </Entries>
</Blog>

Above is an example of a simple XML file. XML files follow a structured pattern called a schema. The schema defines the rules for what is allowed where and generally defines the structure of your file. Fortunately you don’t need to write a formal schema to get started with XML. Instead you can just start laying out your data using element names of your own choosing. That’s where the “X” in XML comes from: it is eXtensible.

The sample XML above is being used to store the contents of a simple blog. XML isn’t the best way to do this, but a blog is a simple, well-understood concept. If you read my article on LINQ to SQL you might notice that this is very similar to the database example I used there.

Every XML document should start with what is known as an XML declaration. It’s the first line of the XML and defines the version of XML as well as the encoding of the file. If you are using Notepad you can select the encoding when you save the file. The topic of encodings is outside the scope of this article.

The next important element that all XML files need is a root node. In this example our root node is called “Blog” and it holds all of our other elements. There can only be one root node in an XML document so if we wanted another blog we would have to put it in another XML file or redesign our XML to have a new root node (such as BlogCollection).

From there we can see that our XML document is made up of two key parts: elements and attributes. An element starts with a tag (a name in angle brackets) and continues until it is closed with a matching closing tag. Closing tags differ from opening tags in that they have a forward slash (/) before the name of the tag. We will use the term element to describe everything from the opening tag to the closing tag, and tag for the bit with the angle brackets.

There is also a special kind of tag called a self-closing tag that is both an opening tag and a closing tag. These tags have a forward slash before the closing angle bracket. For example:

<SelfClosingTag />

The space before the forward slash is optional (and stems back to compatibility with HTML). Personally I like keeping the space there, but your project may have different rules.
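If you want to convince yourself that the different spellings describe the same element, here is a minimal sketch. It borrows the XElement class that is introduced later in this article, and it is written as C# top-level statements, so it assumes C# 9 or later. The variable names are my own.

```csharp
using System;
using System.Xml.Linq;

// Three ways of writing an element with no content.
XElement withSpace = XElement.Parse("<SelfClosingTag />");
XElement withoutSpace = XElement.Parse("<SelfClosingTag/>");
XElement openAndClose = XElement.Parse("<SelfClosingTag></SelfClosingTag>");

// All three have the same name and an empty value.
Console.WriteLine(withSpace.Name == withoutSpace.Name);
Console.WriteLine(withSpace.Value == openAndClose.Value);

// IsEmpty remembers whether the self-closing form was used.
Console.WriteLine(withSpace.IsEmpty);    // True
Console.WriteLine(openAndClose.IsEmpty); // False
```

For most parsing code the difference doesn't matter, since Value is an empty string either way.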

The other important concept is attributes. Attributes go inside the opening tag to provide more information about an element. A given attribute name can only be used once per element (but one element can have multiple different attributes). In the example above, we have given the Entry element the Archived attribute.

Sometimes it can be difficult to determine whether data should be expressed as an attribute or as a child element (an element inside another element). Typically the rule of thumb is that an attribute should be describing metadata, that is extra information about the element itself and how it might be interpreted. Occasionally this doesn’t clear things up at all. If you are still confused, consider the complexity of the data and whether multiple instances of the data will be required. Complex and repeating data is a sure sign that you want to use an element.
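As a rough illustration of that rule of thumb, the sketch below reads a cut-down copy of the sample entry. The Archived flag is a single simple value about the entry, so it sits comfortably in an attribute, while the comments repeat and so are child elements. It uses the XElement class covered later in the article and assumes C# 9 or later for the top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

XElement entry = XElement.Parse(@"<Entry Archived=""false"">
   <Title>My First Post</Title>
   <Comments>
      <Comment>I love LINQ more</Comment>
      <Comment>LINQ is the way of the future.</Comment>
   </Comments>
</Entry>");

// Metadata: one simple value, read from an attribute and converted to a bool.
bool archived = (bool)entry.Attribute("Archived");

// Repeating data: one child element per comment.
List<string> comments = entry.Element("Comments")
                             .Elements("Comment")
                             .Select(c => c.Value)
                             .ToList();

Console.WriteLine($"Archived: {archived}");
Console.WriteLine($"Comments: {comments.Count}");
```

If comments ever needed authors or dates, each Comment element could grow its own attributes or children, which is something a single attribute value could not do.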

Importantly, elements can contain other elements, which can in turn contain more elements (and so on). XML follows a very strict hierarchy (which makes it easy to navigate), so an element must be closed inside the element that it was opened in. This means that any element (except the root node of course) has one and only one parent element. If you are modelling structured data it is unlikely you’ll run into trouble.
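The rules covered so far (one root node, each attribute used once per element, tags closed in the right order) are all enforced by the parser. Here is a small sketch that checks them with XElement.Parse, which is introduced properly later in the article. IsWellFormed is a hypothetical helper of my own, not part of LINQ to XML, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper: true if the text parses as one well-formed element.
static bool IsWellFormed(string xml)
{
    try
    {
        XElement.Parse(xml);
        return true;
    }
    catch (XmlException)
    {
        // The parser rejects anything that breaks the rules of XML.
        return false;
    }
}

bool nested = IsWellFormed("<Blog><Entries /></Blog>");
bool twoRoots = IsWellFormed("<Blog /><Blog />");
bool repeatedAttribute = IsWellFormed("<Entry Archived=\"false\" Archived=\"true\" />");
bool wrongOrder = IsWellFormed("<Entry><Title>Post</Entry></Title>");

Console.WriteLine($"Properly nested: {nested}");
Console.WriteLine($"Two root nodes: {twoRoots}");
Console.WriteLine($"Repeated attribute: {repeatedAttribute}");
Console.WriteLine($"Closed in the wrong order: {wrongOrder}");
```

Only the first document should be accepted; the other three each break one of the rules above.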

Finally, I’ve also added a comment to remind me to add authors to the comments. We won’t actually be doing this; it is merely there to demonstrate how you can include comments in your XML documents. Comments should be ignored when parsing an XML file as they are unrelated to the data. Comments begin with <!-- and end with -->.
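A quick sketch of what "ignored" looks like in practice, using the LINQ to XML classes covered later in the article: the element queries skip the comment, but it is still kept in the tree as an XComment node if you go looking for it. The variable names are mine, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

XElement comments = XElement.Parse(@"<Comments>
   <!-- TODO: Shouldn't comments have authors? -->
   <Comment>I love LINQ more</Comment>
   <Comment>LINQ is the way of the future.</Comment>
</Comments>");

// Elements() only yields child elements, so the XML comment is skipped.
int elementCount = comments.Elements().Count();

// The XML comment is still available as an XComment node.
XComment todo = comments.Nodes().OfType<XComment>().First();

Console.WriteLine(elementCount);      // 2
Console.WriteLine(todo.Value.Trim()); // TODO: Shouldn't comments have authors?
```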

Ok, so by now you should know enough about XML to understand how we can parse this XML file and pull out the necessary elements.

Now for the exciting stuff

LINQ to XML is a set of classes designed to work well with LINQ. It provides a very simple API that allows XML to be read and written with ease.

The centre of your LINQ to XML world is XElement. Through XElement we can access all of the important information in the sample above. Let’s start by writing a query that can help us get the Blog entries to display on the front page. We’ll assume I’ve loaded the XML as a string into a variable called blogXml.

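using System.Linq;
using System.Xml.Linq;

// Parse the XML text into an in-memory tree rooted at the Blog element.
var blog = XElement.Parse(blogXml);

// Find every Entry element, at any depth, that has not been archived.
var frontPage = from e in blog.Descendants("Entry")
                where e.Attribute("Archived").Value == "false"
                select e;

// The WriteBlog* methods stand in for whatever displays the page.
foreach (var entry in frontPage)
{
   WriteBlogTitle(entry.Element("Title").Value);
   WriteBlogBody(entry.Element("Body").Value);
   WriteBlogCommentCount(entry.Descendants("Comment").Count());
}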

This example does absolutely no error checking (something you’ll definitely want to do if you are working with real XML) but demonstrates how simple it is to find particular elements inside XML. Additionally you can use XElement objects to pass XML fragments around your application. We could have made our LINQ query return an anonymous type that pulled out the Title, Body and Comment count for each entry, but instead we just pulled out the XElement itself. From there we were able to count the comments inside our loop.
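For comparison, here is a sketch of the anonymous type version mentioned above. The property names (Title, Body, CommentCount) are my own choice, Console.WriteLine stands in for the page-writing methods, and the top-level statements assume C# 9 or later. Like the original, it does no error checking.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

string blogXml = @"<Blog>
   <Entries>
      <Entry Archived=""false"">
         <Title>My First Post</Title>
         <Body>I love LINQ. It's the best</Body>
         <Comments>
            <Comment>I love LINQ more</Comment>
            <Comment>LINQ is the way of the future.</Comment>
         </Comments>
      </Entry>
   </Entries>
</Blog>";

var blog = XElement.Parse(blogXml);

// Project each front page entry into an anonymous type instead of an XElement.
var frontPage = (from e in blog.Descendants("Entry")
                 where e.Attribute("Archived").Value == "false"
                 select new
                 {
                     Title = e.Element("Title").Value,
                     Body = e.Element("Body").Value,
                     CommentCount = e.Descendants("Comment").Count()
                 }).ToList();

foreach (var entry in frontPage)
{
    Console.WriteLine($"{entry.Title}: {entry.Body} ({entry.CommentCount} comments)");
}
```

The trade-off is that the rest of the application no longer needs to know about XML, but it also loses access to anything in the entry that the query didn't pull out.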

There is nothing preventing you from using these fantastic classes without LINQ queries. In fact, most of the XML parsing code I’ve written lately doesn’t use LINQ queries at all to find elements, just the methods of the XElement class. Let’s look at the ones you’ll likely use most. Don’t worry that these methods take an XName as their parameter; strings are implicitly converted to an XName. You’ll need to work with XName directly if you are dealing with namespaces (which I’ll discuss in a future post).

  • Element(XName name) returns the first immediate child element with the given name. If the element does not exist it returns null.
  • Elements() returns an IEnumerable<XElement> of all the immediate child elements. So against Blog the enumeration would yield a single “Entries” XElement. If there are no child elements the enumeration will be empty.
  • Elements(XName name) returns an IEnumerable<XElement> of all the immediate child elements with the given name. If no elements with the name exist it will return an empty enumeration.
  • Attribute(XName name) returns an XAttribute that is the attribute with the specified name. If the attribute does not exist it returns null.
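Here is a short sketch of those four methods run against the sample document, with no LINQ query syntax at all. The variable names are mine, "Author" is just an attribute name I know isn't in the sample, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

var blog = XElement.Parse(@"<Blog>
   <Entries>
      <Entry Archived=""false"">
         <Title>My First Post</Title>
         <Body>I love LINQ. It's the best</Body>
      </Entry>
   </Entries>
</Blog>");

// Element(name): the first immediate child with that name, or null.
XElement entries = blog.Element("Entries");
XElement notAChild = blog.Element("Entry"); // null: Entry is a grandchild of Blog

// Elements(): every immediate child element.
int childCount = blog.Elements().Count();

// Elements(name): immediate children with that name (possibly none).
XElement firstEntry = entries.Elements("Entry").First();
int titlesUnderBlog = blog.Elements("Title").Count();

// Attribute(name): the attribute with that name, or null.
XAttribute archived = firstEntry.Attribute("Archived");
XAttribute author = firstEntry.Attribute("Author"); // null: no such attribute

Console.WriteLine(entries.Name);              // Entries
Console.WriteLine(notAChild == null);         // True
Console.WriteLine(childCount);                // 1
Console.WriteLine(titlesUnderBlog);           // 0
Console.WriteLine(archived.Value);            // false
Console.WriteLine(author == null);            // True
```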

To match the Elements() methods there is also a pair of Descendants() methods, one with a name and one without (there is no singular Descendant() counterpart to Element()). These work in the same way as Elements() except that they return matching elements at any depth under the node, not just the immediate children. We used Descendants() when we were finding the Entry elements as we didn’t care about the rest of the document’s hierarchy.
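The difference is easiest to see side by side. This sketch asks the root Blog element for Comment elements both ways, using a compact copy of the sample (the top-level statements assume C# 9 or later).

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

var blog = XElement.Parse(
    "<Blog><Entries><Entry Archived=\"false\">" +
    "<Title>My First Post</Title>" +
    "<Comments>" +
    "<Comment>I love LINQ more</Comment>" +
    "<Comment>LINQ is the way of the future.</Comment>" +
    "</Comments>" +
    "</Entry></Entries></Blog>");

// Elements only looks one level down, and Blog's only child is Entries.
int immediateComments = blog.Elements("Comment").Count();

// Descendants searches the whole tree under Blog.
int allComments = blog.Descendants("Comment").Count();

// Without a name, Descendants yields every element under Blog.
int everything = blog.Descendants().Count();

Console.WriteLine(immediateComments); // 0
Console.WriteLine(allComments);       // 2
Console.WriteLine(everything);        // 6
```

Descendants is convenient, but it will also happily match elements you didn't intend if the same name is reused elsewhere in the document, so Elements is the safer choice when the hierarchy matters.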

Because Element() and Attribute() return null if the element (or attribute) is not found, it is important to check that the result is not null before using it. The methods that return an IEnumerable<T> don’t need this check, as they return an empty enumeration instead.
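Here is one way that checking might look, as a sketch rather than a recommendation. It shows a plain null check, and the explicit conversions that XElement and XAttribute define, which give back null for a missing node instead of throwing. The entry below deliberately has no Body element and no Archived attribute, the "(no body)" fallback is my own, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

XElement entry = XElement.Parse("<Entry><Title>My First Post</Title></Entry>");

// entry.Element("Body").Value would throw a NullReferenceException here,
// so check the result before using it.
XElement body = entry.Element("Body");
string bodyText = body != null ? body.Value : "(no body)";

// The explicit conversions return null for a missing element or attribute.
string bodyOrNull = (string)entry.Element("Body");
bool? archived = (bool?)entry.Attribute("Archived");

// Methods returning IEnumerable<XElement> yield an empty sequence, never null.
int commentCount = entry.Descendants("Comment").Count();

Console.WriteLine(bodyText);            // (no body)
Console.WriteLine(bodyOrNull == null);  // True
Console.WriteLine(archived.HasValue);   // False
Console.WriteLine(commentCount);        // 0
```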

Where to from here?

You now know all the important classes needed to parse XML files (perhaps to load up some strongly typed objects). In my next post I’ll be discussing how you can use this same class to build complex XML structures. In the meantime, check out the MSDN documentation for XElement.
