Taming HTML Parsing with libxml (2)

No frills, just the latest news from hundreds of the best sources.

Taming HTML Parsing with libxml (2)

Jan 20, 2012 9:54 am
The parsing of HTML necessary for my DTCoreText open source project is done entirely with the sophisticated use of NSScanner. But it has been long on my list to rewrite all of this parsing with a the industry standard libxml2 which comes preinstalled on all iOS devices. Not only is this potentially much faster in dealing with large chunks of HTML. It probably is also more intelligent in correcting structural errors of the HTML code you throw at it. In part 1 of this series I showed you how to add the libxml2 library to your project and explained the basic concepts of using libxml2 for parsing any odd HTML snippet into a DOM. In this here part 2 we will create an Objective-C based wrapper around libml2 so that we can use it just like NSXMLParser, only for HTML. Label Buy an ad here Readability There are basically two methods – approaches – to handling XML/HTML: DOM or SAX. DOM (short for Document Object Model) refers to building a tree of nodes and leaves which you then can query for information based on their path (with XPath). SAX (short for Simple API for XML) basically walks through the the document and sends you events as it discovers things. For example the begin of an element, the end of it, when a comment was found, etc. NSXMLParser is a SAX parser for well-formed XML. That means you can use it only on XML, it cannot deal with the nature of HTML where some elements might not require that you close them. But if you look at the delegate protocol and compare it to the SAX interface of libxml2 you find an eerie resemblance. It might as well be that NSXMLParser is just a wrapper around the XML SAX interface of libxml2. As we have previously learned we know that libxml2 also possesses a HTML parser which reuses most of the same internal data structures as its XML cousin. So we shall endeavor to create DTHTMLParser as a SAX parser for HTML. Don’t worry about piecing together all the source code from this article, I will put that online in my DTFoundation project. Instead I want you to follow my lead and understand the pieces. The libxml2 SAX Interface The SAX interface works such that you register a handler function – in C – for each SAX event that you are interested in. This is done via a C-structure of type htmlSAXHandler which is nothing more than an aggregate of multiple C-function pointers. In the libxml headers you often see some “html…” data structure typedef’d into one with “xml” as prefix. The reason for this is that it reuses whatever it can from the XML parser. But this also adds confusion – as I found through my experimenting – because some of the events that make sense for parsing XML don’t fire for HTML, like for example the one that reports CDATA elements. HTML does not have those, but the function pointer is still present in the htmlSAXHandler struct. By CMD+Clicking on a type we can swim upstream to find the original definition of htmlSAXHandler: htmlSAXHandler = xmlSAXHandler = _xmlSAXHandler = struct _xmlSAXHandler { internalSubsetSAXFunc internalSubset; isStandaloneSAXFunc isStandalone; hasInternalSubsetSAXFunc hasInternalSubset; hasExternalSubsetSAXFunc hasExternalSubset; resolveEntitySAXFunc resolveEntity; getEntitySAXFunc getEntity; entityDeclSAXFunc entityDecl; notationDeclSAXFunc notationDecl; attributeDeclSAXFunc attributeDecl; elementDeclSAXFunc elementDecl; unparsedEntityDeclSAXFunc unparsedEntityDecl; setDocumentLocatorSAXFunc setDocumentLocator; startDocumentSAXFunc startDocument; endDocumentSAXFunc endDocument; startElementSAXFunc startElement; endElementSAXFunc endElement; referenceSAXFunc reference; charactersSAXFunc characters; ignorableWhitespaceSAXFunc ignorableWhitespace; processingInstructionSAXFunc processingInstruction; commentSAXFunc comment; warningSAXFunc warning; errorSAXFunc error; fatalErrorSAXFunc fatalError; /* unused error() get all the errors */ getParameterEntitySAXFunc getParameterEntity; cdataBlockSAXFunc cdataBlock; externalSubsetSAXFunc externalSubset; unsigned int initialized; /* The following fields are extensions available only on version 2 */ void *_private; startElementNsSAX2Func startElementNs; endElementNsSAX2Func endElementNs; xmlStructuredErrorFunc serror; }; All these function pointer types are aptly named SomethingFunc and we’ll get to how these need to look like in a minute. The problem for us to solve how to bridge from our Objective-C paradigm (using self, properties, IVARs) into the C-function paradigm (no access to our parser object). This problem is compounded by the fact that we are building this with ARC so that we don’t have to deal with memory management any more. Our resulting class will run on iOS 4 and above. Somehow we need to pass enough information to each event-function so that we can communicate with the parser object. There are two reasonable possibilities: either we pass a reference to the parser instance or we pass a pointer to a struct that contains multiple values that we might need. I decided for the first because it seems cleaner to me. libxml2 provides a concept of a user_data pointer that is passed to the event-functions together with other relevant information. We can use (abuse?) this for passing a reference to self. ARC requires this to be be cast to [...]
Health News Zone
Auto News Today
Celebrity Gossip Feed