DTCoreText – New Formula!
Jan 25, 2012 9:02 am
I chose this article’s title to try and grab your attention. The product is still the same and does the same thing; the only difference is under the hood. As such it is your job, should you choose to accept it, to marvel at the benefits that the new old parsing engine brings us.

Ever since my friends at Scribd showed me a first prototype for HTML parsing based on libxml2 I was, admittedly, jealous. This prototype basically worked by having libxml2 parse out the individual paragraphs of an article and then display each paragraph in its own cell of a UITableView.

Back then I brushed off the suggestion and went with something I understood: NSScanner. All the C code necessary to deal with libxml2 seemed overly daunting, so I went with a simple basic structure for the parsing, something like this (pseudocode):

    while (there_is_more)
    {
        if (we_scanned_a_tag)
        {
            if (is_tag_a)
            {
                // a specific code
            }
            else if (is_tag_b)
            {
                // b specific code
            }
            else if ...
        }
        else
        {
            // we must be inside a tag
            // deal with skipping over a tag that is incomplete, i.e. crap
            if (we_scanned_some_text_before_the_next_tag_open)
            {
                // append the text with correct formatting to the output
            }
        }
    }

This structure had grown organically as I added support for additional tags, and it went into the initWithHTML category method because that seemed to be the logical place. I did not think far enough ahead to see that this approach would preclude having an event-based parser do callbacks into my code, because that would mean adding these event-handling methods to NSAttributedString.

Even the above pseudocode is long, so you can imagine how much of a Spaghetti the original code became. Much of the problem didn’t actually come from all these tags; those were simply a big if-statement. The complexity came from having to deal with all the special cases where HTML might not be well-formed.
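To make the contrast with the scanning loop concrete, here is a minimal sketch in plain C of what "event-based" means: the parser walks the input and calls back into your code for each tag start, tag end and run of text. All names here (EventHandlers, parse_events, trace_events) are hypothetical and made up for illustration; this is neither libxml2's API nor DTCoreText code, and the toy tokenizer only understands bare `<name>` and `</name>` tags.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback set; illustrative names, not a library API. */
typedef struct {
    void (*tag_start)(void *ctx, const char *name, size_t len);
    void (*tag_end)(void *ctx, const char *name, size_t len);
    void (*characters)(void *ctx, const char *text, size_t len);
} EventHandlers;

/* Walks the input once and reports what it finds through callbacks.
   Toy tokenizer: no attributes, comments or entities. */
static void parse_events(const char *html, const EventHandlers *h, void *ctx)
{
    assert(html && h);
    const char *p = html;
    while (*p) {
        if (*p == '<') {
            const char *close = strchr(p, '>');
            if (!close) break;                 /* incomplete tag: stop here */
            int is_end = (p[1] == '/');
            const char *name = p + 1 + is_end; /* skip '<' and optional '/' */
            size_t len = (size_t)(close - name);
            if (is_end) h->tag_end(ctx, name, len);
            else        h->tag_start(ctx, name, len);
            p = close + 1;
        } else {
            const char *next = strchr(p, '<');
            size_t len = next ? (size_t)(next - p) : strlen(p);
            h->characters(ctx, p, len);
            p += len;
        }
    }
}

/* A consumer that just records the events it receives, in order. */
typedef struct { char log[256]; } EventLog;

static void log_append(EventLog *l, const char *prefix, const char *s, size_t len)
{
    size_t used = strlen(l->log);
    snprintf(l->log + used, sizeof l->log - used, "%s%.*s;", prefix, (int)len, s);
}

static void on_start(void *ctx, const char *n, size_t len) { log_append(ctx, "start:", n, len); }
static void on_end(void *ctx, const char *n, size_t len)   { log_append(ctx, "end:", n, len); }
static void on_text(void *ctx, const char *t, size_t len)  { log_append(ctx, "text:", t, len); }

/* Convenience wrapper: parse the input and fill the log with its events. */
static void trace_events(const char *html, EventLog *log)
{
    EventHandlers h = { on_start, on_end, on_text };
    log->log[0] = '\0';
    parse_events(html, &h, log);
}
```

Calling `trace_events("<p>Hi <b>there</b></p>", &log)` leaves the sequence of events in `log.log`. The point is that the tag-specific logic lives in the handlers, outside the parsing loop, which is what a category method on NSAttributedString made awkward.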
In my Open Source genstrings2 I saw how much faster pure C code performs than NSScanner, which also served to rekindle my wish to switch to libxml2, because it too is written in low-level, highly optimized C. I approached libxml2 in several steps:

1. Jealousy: “Boy wouldn’t it be great, but I’m afraid that this is out of my league”
2. Announce Intention: I wrote an issue on GitHub hoping for somebody to step forward.
3. Do an Experiment and Document: I googled a bit and put together Part 1 of my libxml HTML tutorial.
4. Write a Wrapper: For Part 2 of my libxml HTML tutorial I wrote an Objective-C wrapper for libxml2.
5. Astonishment: The feeling you get when you find that you begin to understand the C code needed.
6. Benchmark: I removed all string building code and compared the raw parsing performance of both approaches.
7. Transform the Pasta: I moved the code for building the attributed string into an aptly named class and have this driven from events generated by the new HTML parser.

I named the final step this way because if you break up Spaghetti code into several logical pieces and then stack these pieces in several layers, you get a different kind of Pasta: Lasagne.

For step 6, the comparison, I moved the NSScanner parsing loop into its own class and directly compared the running time on my iPhone 4S, resulting in this tweet:

    War&Peace HTML (3.4 MB), NSScanner: 4.264s, libxml2: 1.398s = 3x as fast on single thread, plus latter fixes HTML structure

At this stage it was clear to me that I needed no further convincing, so I went to work in a branch of the project. Most of the work was simply copy/pasting the attributed string building code into the right place: the tag start, tag end or characters found event method. This also allowed for omitting some workarounds that were needed to deal with non-well-formed HTML.
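The split into tag start, tag end and characters found handlers can be sketched roughly like this. It is not the actual DTCoreText builder: Builder, Run and the handler names are hypothetical, the only formatting tracked is bold, and the handler shapes only loosely mirror libxml2's SAX callbacks (startElement, endElement, characters), using plain `char` instead of `xmlChar` so the sketch compiles without libxml2.

```c
#include <assert.h>
#include <string.h>

#define MAX_TEXT 256
#define MAX_RUNS 16

/* One stretch of text sharing the same formatting (here: bold or not). */
typedef struct { size_t start, length; int bold; } Run;

/* Hypothetical builder state; stands in for the attributed string. */
typedef struct {
    char   text[MAX_TEXT];
    size_t text_len;
    Run    runs[MAX_RUNS];
    size_t run_count;
    int    bold_depth;   /* how many <b>/<strong> elements are open */
} Builder;

static void builder_init(Builder *b) { memset(b, 0, sizeof *b); }

static int is_bold_tag(const char *name)
{
    return strcmp(name, "b") == 0 || strcmp(name, "strong") == 0;
}

/* Characters found: append the text as a run with the current formatting. */
static void on_characters(void *ctx, const char *ch, int len)
{
    Builder *b = ctx;
    assert(len >= 0);
    size_t n = (size_t)len;
    /* Sketch only: silently drop input that would overflow the buffers. */
    if (b->run_count == MAX_RUNS || b->text_len + n >= MAX_TEXT) return;
    memcpy(b->text + b->text_len, ch, n);
    b->runs[b->run_count++] = (Run){ b->text_len, n, b->bold_depth > 0 };
    b->text_len += n;
    b->text[b->text_len] = '\0';
}

/* Tag start: only changes state; attributes are ignored in this sketch. */
static void on_start_element(void *ctx, const char *name, const char **atts)
{
    Builder *b = ctx;
    (void)atts;
    if (is_bold_tag(name)) b->bold_depth++;
}

/* Tag end: undo the state change, and end paragraphs with a newline. */
static void on_end_element(void *ctx, const char *name)
{
    Builder *b = ctx;
    if (is_bold_tag(name)) {
        if (b->bold_depth > 0) b->bold_depth--;
    } else if (strcmp(name, "p") == 0) {
        on_characters(ctx, "\n", 1);
    }
}
```

Because the parser guarantees that every start event gets a matching end event, the handlers can rely on simple counters like `bold_depth` instead of the recovery logic the scanning loop needed for malformed input.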
The second big BIG advantage of libxml2 is that its HTML parser fixes up the structure for you and also adds a missing html and body tag, so that you end up with a perfect structure. It even adds a </br> right after a <br>. Even though this is completely unnecessary, it still makes the HTML look like perfect XML. Much nicer to work with.

While I was doing the migration, new issues and pull requests with fixes started to come in, putting me a bit under stress because I needed to include the fixes in both branches. That is why I decided to merge the branch back into master at the earliest possible time.

The initWithHTML method has shrunk to a much more manageable size:

    - (id)initWithHTML:(NSData *)data options:(NSDictionary *)options documentAttributes:(NSDictionary **)dict
    {
        // only with valid data
        if (![data length]) [...]