Hello, my name is Aaron Patterson. I work for AT&T Interactive, I live in Seattle, and I'm a member of Seattle.rb. Online I go by tenderlove, which is also my website. My talk is called Journey to the Pointy Forest in the State of XML-ville. But I don't really like this title, because it makes me think of a journey, or possibly a unicorn or something like that. So I decided to change the name of the talk to You Suck at XML. Full disclosure: I am the author of Nokogiri, and I also maintain Mechanize, which unfortunately makes me an expert in parsing terrible HTML. Warning: this talk is a downer. It is a downer, but the good news is it will be done in 30 minutes.

It comes in four parts. We're going to talk about XML processing, HTML processing, data extraction, and HTML correction. And we're going to talk about a few different libraries that you can use to accomplish these tasks.

First up is a 30-second XML refresher. XML has nodes that look like this. They're well balanced. They do not look like this; this is not well balanced, and we all know unbalanced cats fall. Nodes have attributes: for example, Jeremy is awesome. Documents are a collection of nodes, and they look like this. An important thing here is that I want you to know that everything in here is a node. The cat tag is a node. The text is a node. Even the spaces between tags are nodes. Attributes are nodes. Everything in here is a node. But I shouldn't keep this up too long, because if you look at XML too long, you'll shoot your eye out.

So, XML processing. We're going to talk about a few different XML parsing styles: SAX, push, pull, and DOM parsing. When you're selecting a style of parser, you need to take a few things into account: the number of documents you're going to be parsing, memory and speed constraints, how you want to extract data from the XML you're parsing, and programmer time, obviously. And once you've mastered these parsing styles, you can become an XML ninja.

So first up are the SAX parsers. SAX parsers are event-based parsers. Basically, there's a bunch of different event types for which you can register callbacks, like starting an element, ending an element, when characters are encountered, starting a document, ending a document, etc. So you instantiate a parser, hook in the events that you care about, and then parse the document. The parser will then send events out to your callbacks. Current libraries which support this are REXML, libxml-ruby, and Nokogiri.

Now REXML, the syntax looks like this. You create a new document class. You include a module which contains all of the defaults for the callbacks. And then you implement the callbacks which you're interested in. You instantiate a new SAX parser, tell it about your document handler, and then parse the document. libxml-ruby also has the same sort of style. You create a new document class and include a bunch of callbacks. The only kind of strange differences here are that all of these callbacks start with on_, and that you have to tell libxml-ruby what type of thing it is which you're parsing. You can't just pass it a thing. You have to say, well, this is a string, or this is an IO, or whatever. Then you set your callbacks and call parse. Nokogiri: the only main difference here is that rather than including a module, you inherit from a class that contains all of your default callbacks. You instantiate a new parser, give it your document handler, and then parse the XML. I want to show a little bit of example SAX output.
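Roughly, a handler in that style might look like this in Nokogiri (a minimal sketch; the TagPrinter name and the cats XML are my own stand-ins, not the slide's actual code):

```ruby
require 'nokogiri'

# Prints a line for every open and close tag the parser encounters.
class TagPrinter < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    puts "open:  #{name}"
  end

  def end_element(name)
    puts "close: #{name}"
  end
end

parser = Nokogiri::XML::SAX::Parser.new(TagPrinter.new)
parser.parse('<cats><cat name="Jeremy">meow</cat></cats>')
```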
Given the XML that we looked at earlier, one might write a SAX handler like the sketch above. Right here, all we're doing is printing when we encounter an open tag, and then printing when we encounter a close tag. You can see the parser is basically just moving through the document, calling your callbacks. Advantages of SAX parsers: they're very fast. SAX parsers are used inside of SOAP4R, for example. Disadvantages: searching is hard, document handlers are verbose, and programmer expense is high. When you're implementing these document handlers, you're going to end up with a state machine eventually, because you need to keep track of where you are within the document.

The next style is push parsing. The push parsing interface works the same way as SAX: you give it a bunch of callbacks. The only main difference between a push parser and the previous SAX parser is that the program controls the document IO. So rather than passing an IO object into the parser, you actually feed the data into the parser. This is useful for things like XMPP or Jabber clients, where you're interacting with an infinite-length document, so you don't necessarily want to hand that socket off to the parser. You want to be able to feed the data into the parser yourself. Nokogiri is the only library that currently supports this. The document handler looks exactly the same as a SAX document handler, because it is one. In this example, I just wanted to illustrate that we can feed data into the parser: I'm splitting the XML at every character and feeding it character by character into the parser. So the programmer has fine-grained control over the IO that goes into the parser. And the callbacks are called just like with the previous SAX parser we looked at. The advantages of this are low memory consumption, it's quite fast (though not quite as fast as the previous parser we were looking at), and it gives you very fine-grained control over IO. Disadvantages: same problems as a SAX parser. Your document classes are going to end up looking like state machines.
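A hedged sketch of that push style, reusing the same kind of handler as above (the sample XML is again a stand-in):

```ruby
require 'nokogiri'

# The same kind of handler as in the SAX sketch above.
class TagPrinter < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    puts "open:  #{name}"
  end

  def end_element(name)
    puts "close: #{name}"
  end
end

parser = Nokogiri::XML::SAX::PushParser.new(TagPrinter.new)

# We control the IO: feed the document in one character at a time.
'<cats><cat>meow</cat></cats>'.each_char { |c| parser << c }
parser.finish # signal the end of the document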
Pull parsers are a different style of parser that we're going to talk about. A pull parser is handed XML and yields node objects, but a node object is only yielded when the programmer actually pulls it from the parser, hence the name pull parser. The current Ruby libraries supporting this style of parser are REXML, libxml-ruby, and Nokogiri.

REXML's API looks something like this. You instantiate a new pull parser, ask it if it has an event, and then pull the event out of the pull parser. These work like cursors moving through your document. So you have a cursor moving through the document, encountering events, and you can pull each event as you see fit. The parse event has information about the current node that the cursor is on, so you can get the node name or attributes or whatever. libxml-ruby looks like this. You'll notice that reader variable; a couple things about it. Again, you have to tell it what type of thing you're parsing. And the reader actually contains the cursor, so as you move it through the document, you can access things on the reader, like the name of the node which you are on. Nokogiri's interface looks like this. We use the each method instead, so every time the block is executed, another parsed node is yielded to the block.

Advantages of this: low memory consumption, and they're extremely fast. According to the libxml2 folks, the reader is the fastest XML parsing interface available in libxml2. Disadvantages: it's a cursor, and it's not so programmer friendly. What I mean by this is that as you're moving through the document looking for data, you only get one pass. So if you don't get the data which you need in that one pass, then you need to pass through the entire document again. So you need to make sure that you get it on the first pass.
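A minimal sketch of the pull style using Nokogiri's reader (the cats XML is a stand-in):

```ruby
require 'nokogiri'

reader = Nokogiri::XML::Reader('<cats><cat name="Jeremy">meow</cat></cats>')

# Each iteration pulls the cursor forward one node; nothing is
# handed to us until we ask for it.
reader.each do |node|
  puts "#{node.depth} #{node.name}"
end
```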
Now, so far, these SAX-ish interfaces are to me like a poke in the eye. So I'm glad we're talking about DOM interfaces next. These are my favorite, and probably the ones most of you are familiar with. Given some XML, they build an in-memory tree, which is then easily searchable via XPath. So we can pass through the document as much as we want, looking for the data which we need, using a rich search language called XPath. Current Ruby libraries which handle this are REXML, libxml-ruby, Hpricot, and Nokogiri. So we have more options here.

REXML looks like this. Creating an in-memory document is very easy: just pass your XML to REXML::Document.new and you have an in-memory tree. For searching with XPath, you use the REXML::XPath object. libxml-ruby makes it slightly harder to create an in-memory tree. You have to instantiate a parser first (so again, you tell it what type of thing you're giving the parser), and then you actually have to call parse to get your document back. Once you get the document back, then you can call the find method on it to search through it. Hpricot's interface is much easier: one call to Hpricot.XML and you get your tree back. You can search through it with XPath and you can search through it with CSS; neither libxml-ruby nor REXML have this functionality. One thing I want you to take note of here is that in order to search via XPath or CSS, you use the same method, search. I'll talk about that in a minute. And Nokogiri: very similar to Hpricot, one call to get your document back. But something different here is that you have to specify whether you're using XPath or CSS: you call xpath to search via XPath and css to search via CSS.

The advantages of DOM-style parsers are easy data extraction, and they're very programmer friendly. Disadvantages are high memory consumption and paying a speed penalty. Now, when I say high memory consumption, I'm just comparing these to SAX-style parsers. SAX-style parsers don't really need to keep anything in memory; these DOM-style parsers will keep the entire document in memory.

So, part two: HTML processing. We have pretty much the same styles of parsers for HTML as we do for XML, but I'm only going to talk about DOM, because the other two are exactly the same as with XML: you're just feeding in HTML rather than XML. Available Ruby libraries for this are NARF, libxml-ruby, Nokogiri, and Hpricot.

NARF is interesting. I'm guessing probably none of you have heard of NARF. NARF is the first HTML parser that Mechanize used. It sits on top of REXML and corrects broken HTML before it gets into the REXML parser. So basically you instantiate it like this: you create a new HTML tree parser and feed your HTML in, but we actually get a REXML document back. So we can treat that REXML document just like any other REXML document.

libxml-ruby has the same style of interface as its XML parser, except that we call HTMLParser.string, then parser.parse, and then we get our document back. Nokogiri is easier: one call, and you get your HTML DOM back. Hpricot: even easier. You get your HTML DOM back.
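A rough sketch of that one-call style; this broken snippet is my own, not one of the talk's examples:

```ruby
require 'nokogiri'

html = '<html><body><p>Hello</body></html>' # note the missing </p>

doc = Nokogiri::HTML(html) # one call: parse and correct in one step
puts doc.to_html           # prints the corrected document
```

The Hpricot equivalent is just Hpricot(html).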
So, part three: data extraction. We have a couple of techniques for this: CSS selectors and XPath queries. In these next slides, I'm really only going to talk about Nokogiri and Hpricot. The reason is that those are the only two which provide CSS selectors; they provide both CSS selectors and XPath queries. The other libraries only provide XPath queries, so the XPath techniques apply to all of them just as easily. I'll show a combined example at the end of this section.

CSS selectors: very easy. In Hpricot, you call search and pass it your CSS selector. You get a list of nodes back, and you can deal with those. Nokogiri is very similar, except you use the css method, and you get back a list of nodes. That's pretty much all there is to it: you call css, plug in your selector, and you're good to go.

XPath queries are a little bit harder than CSS. I don't think as many people are familiar with XPath queries, so I'm going to go over a little bit of XPath basics. //foo means: find all foo tags in my tree, starting at the root of the tree. That first slash always means start at the root: I want to search this tree, and I want to start at the root of the document, always. The query .//foo means: find all foo tags starting at the current reference node. That dot means that I'm starting from somewhere else in the document. I can be at any node in this tree, and I want to find all foo tags which are descendants of my current node, starting at the dot. foo[bar] means: find all foo tags with a child bar tag. This matches XML that looks like this: we have a foo tag with a child bar tag, so we match that foo tag. Not to be confused with foo[@bar], which means: find all foo tags that have an attribute named bar. This is simply saying that the foo tag has an attribute called bar which exists; we're merely testing for existence. So this query will match the first foo tag, the one which has a bar attribute.

Now: is this query XPath or CSS? Does anyone know? If you plugged this into a browser, would the right thing be colored red; would it work? The answer is yes, it will work, and this text would be highlighted red in your browser. Now, the interesting thing is that it's kind of a trick question: this is actually both valid XPath and valid CSS. And that means ambiguity arises. For example, since Hpricot has one entry point for searching, it can't tell whether you're asking for CSS or XPath, and you might actually be surprised when it treats a query as XPath where you were expecting CSS. This is why, in Nokogiri, you're forced to choose whether you want to search via XPath or via CSS: to eliminate those ambiguities.

So I want to talk a little bit about XML namespaces, which are something you inevitably run into when you are parsing XML. We have here an XML file showing ford.com's inventory. They have a couple of tires. Great. And we have another XML document here showing Schwinn's inventory. They have a couple of tires as well. But we have a problem: if we're searching these documents, we can't tell the difference between those tires. We just get back a bunch of tire names. We can't tell that one came from Ford and one came from Schwinn. So that's where namespaces come in. We have a couple of options.

First, we have explicit namespaces. The anti-ambiguity squad came in and fixed up these XML documents. They declared a namespace called car in the ford.com document and one called bike in the Schwinn document. And something that I want to make sure you understand is that these names, car and bike, are completely arbitrary. What is important here is the URLs. These URLs must be unique; the names are arbitrary. So we know in both these documents which tires are which: in the first one they're associated with Ford, and in the second one they're associated with Schwinn. So when you're finding these in Nokogiri, you do something like this: you register the URL, and then you can find your bike tire or your car tire. The important thing here, too, is that these prefixes are completely arbitrary. I could have picked something besides bike or something besides car. The important thing is the URL.

Our second option is implicit namespaces. The anti-typing-too-much squad came in and said, well, every one of the tags in this document is actually associated with Ford, and every tag in that one with Schwinn. So rather than writing the explicit car:tire or bike:tire, we can set an implicit default namespace. So all of these tags actually have a namespace associated with that URL. The tag names say just tire, but they're still associated with the URL, so we can disambiguate them. Luckily for us, searching works exactly the same.

Then there's no namespaces at all. We have to be able to tell the difference between tags that have no namespace and tags that do have a namespace. For example, in this document, our first inventory has tires with namespaces and the second inventory has tires without namespaces. And that looks like this: if we don't add a namespace to our query, then we get back the tags without namespaces. Libraries supporting namespaces are libxml-ruby and Nokogiri. And I want to talk about Hpricot. Hpricot only works with explicit namespaces, like this, and the namespace prefixes are actually parsed as part of the tag names, which is bad, because those prefixes are arbitrary strings. Both of these documents could contain the prefix car, but one would still be associated with Schwinn and one with Ford, and you couldn't disambiguate them; you can't rely on the prefix. Implicit namespaces are not supported. Namespaces are as important as tag names. Remember that when you're searching.
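Here's a minimal sketch pulling the extraction and namespace pieces together in Nokogiri; the inventory document, the URL, and the c prefix are all stand-ins of mine:

```ruby
require 'nokogiri'

doc = Nokogiri::XML(<<-XML)
  <inventory xmlns:car="http://ford.example.com/">
    <car:tire>Michelin</car:tire>
    <tire>no namespace</tire>
  </inventory>
XML

# CSS selector: matches only the un-namespaced tire tag.
doc.css('tire').each { |t| puts t.text }

# XPath with a registered namespace. The prefix 'c' is arbitrary;
# only the URL has to match the one declared in the document.
doc.xpath('//c:tire', 'c' => 'http://ford.example.com/').each do |t|
  puts t.text
end
```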
Part four: HTML correction. Why is document correction important to us? We want to deal with real-world HTML the way the browser deals with it; that would be the least surprising functionality. I want to be able to take a CSS selector that I'm using on my web page and find that same data when I'm searching my document. I want to be able to pull up Firebug, select an element inside of my page, and use that same CSS selector to find the data that I'm looking for. Both Nokogiri and Hpricot will correct non-valid HTML; Nokogiri sits on top of libxml2, which actually does the HTML correction. These two libraries have different correction schemes, but how can we tell which is correct? We can detect it from the DOM differences. DOM parsers store the document in memory as a tree. For example, we have this reference HTML here, and it's going to be stored as something like this. We want to be able to take differences of these trees, so I wrote a library called tree_diff, which you can find here. It takes these in-memory tree representations from the parsers and diffs them. And I want to say that tree differences are interesting.

The only trees which we're interested in are trees which are different from each other. The reason is that if two trees are different, then given one CSS selector, the two trees will return different results. So we really only want to examine the trees which will return different results to us. Also, different trees indicate that the correction schemes differ: we have differences in our correction algorithms, and we want to examine the differences in those correction algorithms. I got 461 random HTML files and found that 336 of those parse trees were different. So what I did was look through those parse trees and examine the differences, and I want to share with you five of my favorite differences.

The first one is called Encyclopedia Brown and the Case of the Missing </td> Tag. Now, I've reduced these HTML examples down to the minimum amount of HTML which produces two different in-memory trees. The way the browser corrects this is by adding two closing td tags, after Hello and after World. Hpricot's correction looks like this; the red tags are the ones that Hpricot has added. Nokogiri corrects it like this, the same way that the browser does. In the diagram (it's terribly green), the blue boxes indicate nodes that Nokogiri has that Hpricot does not, the red boxes indicate nodes that Hpricot has that Nokogiri does not, and the gray boxes are nodes that they have in common. The consequence of this correction: if you search Hpricot for td tags, you get Hello World back as one cell's inner text, which is not correct, because we have two td's and it's been corrected incorrectly. In Nokogiri, we search for a td tag and get back the right tags.

The next case is actually valid HTML; well, if you look at the p tag, the p tag's align attribute is missing its two quotes. The way the browser corrects this, it actually just adds the quotes, so, great, it's pretty much exactly the same. Hpricot adds a closing center tag; Nokogiri adds the quotes. This results in a strange in-memory graph: Hpricot actually moves the p tag underneath the center tag, so we end up with a strange in-memory tree, whereas Nokogiri keeps the p tag where it was.

Example number four: font tags. Here we have some excellent HTML, which I took from your web page, Jim, and reduced a little bit. After correction, obviously something has happened here: Hpricot threw out the font tag, and you can see the differences in the trees. It's difficult to read, but you can see that these subtrees have actually moved.

The next one is missing an equals sign in the body tag. The browser corrects it like this. It looks kind of strange, with closing table tags, but there's actually a problem going on here that's a little bit more subtle, which I'll explain in a bit. This is actually a printout of the tree, what the tree looks like when it is printed out; there's a more subtle problem going on. Nokogiri's correction removes the text that's not valid for an attribute name, so we're missing a little bit of data, but the tree structure is still in place. And it turns out that our in-memory trees are completely different. Completely different. You'll notice here, and it's kind of hard to read, that Hpricot is actually missing the body tag. The body tag is gone: if we search for the body tag, we get zero back. But where did the body go? We saw it when we printed out the document. It's actually a comment. So if we go through and navigate the tree ourselves, we can find it as a comment node. I'd give an example, but it's not fun anymore.

So: 71% of the trees had differences. I examined 15 of them, and in every example, libxml2's corrections mirrored the browser. The interesting thing about comparing these trees is that it could have found bugs in either one of them; this wasn't biased towards one or another. If the trees were different, that means that they're correcting differently, and one of them must be wrong. In every case, libxml2 mirrored the browser, and not once did Hpricot.
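As a hedged illustration of the kind of difference being described, here's my own reduced snippet (not one of the actual 461 test files) showing libxml2's correction through Nokogiri:

```ruby
require 'nokogiri'

# Two cells, neither td closed.
broken = '<html><body><table><tr><td>Hello<td>World</table></body></html>'

doc = Nokogiri::HTML(broken)
puts doc.to_html # inspect the corrected markup: both cells get closed

# Like a browser, we get two separate td nodes back.
doc.search('//td').each { |td| puts td.inner_text }
```

One could parse the same string with Hpricot and compare the two trees by hand to see a correction difference.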
Conclusion: use the best tool for the job. And it's soapbox time for me. I believe that Nokogiri is better than REXML and Hpricot because it is built on a more widely used XML and HTML parser. People who use libxml2 don't just write Ruby; they write Perl, C, C++, Objective-C, Python; the list goes on. I believe it is better than libxml-ruby because it has a more idiomatic Ruby interface, and it also includes CSS selector support. REXML has one test. In 1.9 it has two tests, so maybe they'll backport that to 1.8, and 1.8 might get to have two tests. libxml2 was released in April 2000, and libxml2 has always had an HTML parser, ever since the first release. That HTML parser is nine years old. I have one second left. If you're interested in the code, you can find it here. I will tweet the slides when I post them. And that is the end.