Category Archives: xml

Insecure Cat

Do you see the cat?

Balisage 2009 trip report

Balisage (once Extreme markup) seems to be the Summer conference of choice for schema architects, developers, visionaries and markup philosophers: in short, those laboring on the underpinnings of XML and related technologies. Conference literature calls us “markup geeks.” Mere users of these technologies might find this conference heavy sledding (it is quite technical), and marketers should be aware that this is not a trade show: there are few opportunities for promoting goods and services. However, idea-pushers will find a welcoming audience at Balisage.

This year, I attended Balisage for the first time, and I found it a highly rewarding experience. I came away full of ideas for new ways to work, new tools to use, and new directions in which to push our XML-based publishing practice. In addition I made a passel of new friends: The community that has grown up around this conference is very welcoming, and surprisingly so, given how close-knit it seems to be. The conference is well-organized and run by folks from Mulberry Technologies and friends, with a wide array of reviewers vetting papers and presentations. My one complaint is that I was unable to remain for the full schedule and missed some very exciting-looking presentations on Thursday and Friday.

Disclaimer: I was unable to see every talk, so this report represents merely some highlights, as I saw them. This report is biased and unfair, but honest and without the intention of brutality.

Day 1

On Monday, Michael Kay of Saxonica chaired the Symposium on Processing XML Efficiently, a separate but related series of presentations that shared the same logistical backbone with the main Balisage conference.

The surprise hit of this session was Rob Cameron’s presentation on Parallel Bit Stream Technology. At first, I had no clue what was going on in this talk. The basic idea seemed clever: rotate your bytes sideways, packing all the first bits into a wide (64-bit or wider) word, all the second bits into the next word, and so on. But what was the point? When Cameron demonstrated that some simple addition and shifting operations led to a bit pattern that encoded the positions of all the valid XML numeric entities, the “aah” of dawning comprehension rippled through the audience. He went on to show how this could be used to jump-start a tokenizer and accelerate XML parsing, resulting in impressive speedups in a core technology. There are some remaining challenges, though: to make these results available via Java will require some tricky bridgework.

I was curious to hear about the internals of our competitor, Highwire’s, much-ballyhooed H2O platform, presented by James Robinson. Given the context of the conference, it was no big surprise that the workhorse of H2O (Firenze) is an XSLT-based rendering pipeline with built-in caching features. What was most interesting was what it’s not built on top of: a database. The core storage mechanism of the public-facing content delivery system is essentially a hierarchical file-system store. Search features are provided by FAST, (acquired last year by Microsoft). Some interesting observations:

  1. The use of XML data and processing models in every step introduced some overhead (for example, performing HEAD requests that check the cache went through an inefficient XSLT and consumed 30% of overall execution time). Resolving these performance bottlenecks requires some special attention and custom optimizations. However, Robinson felt such costs to be outweighed by the benefit of the more convenient programming model.
  2. The LRU cache was distorted by search engine crawlers, which tend to perform broader-spectrum requests than typical users. Robinson plans to optimize for human users by introducing an ARC cache, which will take into account the frequency of requests for a given resource when aging cache entries.

David Lee presented a performance comparison of various scripting techniques for running multi-step XML pipelines. The resulting comparison of xmlsh and xproc, which run xml processes internally, with shell scripts which execute xslt and xquery externally, was largely a cautionary tale about the cost of repeatedly starting and stopping the Java VM. Although this wasn’t news to many attendees, it is the kind of implementation mistake that is easy to make even if one’s theoretically aware of the issue. The talk was also valuable as an introduction to xmlsh and xproc, two scripting technologies that ought to be in every XML-hacker’s toolbelt. Lee developed xmlsh, an XML scripting language with syntax based on Unix shell; XProc is a nascent W3C standard with a few available implementations, including Norm Walsh’s Calabash, which was tested as part of this work.

Balisage, Day 1

After some introductions, Tommie Usdin launched the conference with a dose of tart wisdom. She issued cautions about blind adherence to standards, urging a deeper understanding of requirements. The rest of the conference was peppered with mild disagreement with her plenary point: Standards considered harmful.

Alex Milowski reminded us that Netscape has had sophisticated XML support in the browser since 1999, although the lack of comparable support in Internet Explorer was treated as a minor, if unfortunate, side note. Milowski made a plea for a core set of XML vocabularies including HTML, SVG and MathML. He then demonstrated some pioneering work with audio-enabled eBooks (in the DAISY format) using Firefox extensions, and indicated a similar model is also available in the mobile space (using iPhone and Android). What I learned: you can associate MIME types with Firefox plugins so as to trigger custom behavior for your content type.

Somehow I managed to miss David Birnbaum’s talk on writing optimal XPath and XQuery for eXist’s evaluation engine, which I had intended to see. I guess I’ll have to read his paper when it’s posted. I also sadly missed seeing Michael Sperberg-McQueen’s talks, which, to judge from the questions with which he prodded the speakers, would have been fascinating.

Uche Ogbuji presented his application development platform, Akara, which integrates XML-native operations with Python in clever ways. Python developers working with XML will want to stay abreast of this work from the former developer of the 4Suite toolset, currently available in Beta.

To wrap up the first day, we had to choose between a fascinating-sounding flight into abstraction from Wendell Piez, and a survey of XML visualization tools from Mohammed Zergaoui. In the end, my practical orientation won out, and I was rewarded with a number of links to follow. In particular I am intrigued by Xopus, an in-browser WYSIWYG XML editor that looks very promising. I got a demo at the conference (from Betty Harvey), and will definitely follow up since this is something we need to embed in our solutions. I still wonder what a “nomic game” is, though.

Day 2

Fabio Vitali presented a new approach to handling overlapping markup. I learned that this is a major area of theoretical interest in markup that has been addressed by numerous proposed schemes. XML’s element tree can’t represent overlap directly in a natural way, and markup such as bold, italic is inherently ambiguous (which tag should be the outer one?). Although new to me, this seemed to be well-trodden ground for many at the conference, and the new contribution was the idea of using RDF and OWL as representation meta-languages.

Norman Walsh shared some data gathered from the “phone home” feature of XML Calabash, his XProc implementation. Every time Calabash runs, it sends some basic profiling information to, unless the user explicitly opts out. Some insights: it appears that many (most) pipelines are streamable, at least from the viewpoint of XProc, although it was less clear what the benefit of streaming might be in these cases. One odd feature of Walsh’s data was that a surpriing number of pipelines have only a single step in them; he attributed this to the convenience of executing other XML processes via XProc, or possibly to a pattern of tire-kicking behavior.

The final regularly-scheduled presentation was Michael Kay’s discussion of the polarity of streaming processes. This included a fascinating discursion on the merits of co-routines for programming inherently multi-processing programs; their performance benefits relative to multi-threading; the difficulty of implementation in single-stack languages such as Java. Remarkably, there was a dearth of questions after Kay’s presentation. This was not a case of lack of interest; rather it seemed as if all the bases had been covered, and in the end there was just nothing more to be said once the master had spoken.

After a deeper dive into eBook formats with Alex Milowski, and a presentation on REST and XML that (according to him) came to Kurt Cagle during a sleepless night in the hotel room, it was time for me to drive home. Perhaps next year I’ll be able to stay for the full week, and not miss the XML games that were scheduled.

(note: cross-posted at the ifactory blog)