The two most common Solr performance blunders, and a rant about the dumbification of computer programming

Recently I’ve seen friends at work fall into a couple of well-worn traps, and I wondered: why do these same simple but devastating problems keep turning up again and again? The answer lies in deceptive software APIs, and the solution, I think, may come from video games. But before explaining all that, I want to describe these two problems in a little detail.

Many programmers learn to deal with databases. Fewer work with full-text search indexes like Lucene. I think newcomers to Lucene often bring with them the mental models acquired from databases, since the two share many similarities. Both implement an “atomic” transactional model in which writes don’t become visible to readers until after a commit point is passed. This is critical in a transaction-processing environment to ensure that two things succeed or fail together: for example, the concert ticket doesn’t get allocated to you unless Ticketmaster confirms your credit card, and your card doesn’t get charged unless you get the ticket.

Lucene implements this model, but it isn’t designed to support transaction processing of that sort: it’s mostly optimized for batch updates and super-fast querying across large numbers of documents (yes, Lucene experts, I know about soft commits and near-real-time search, but that’s a story for another evening). This design criterion led to tradeoffs that tend to make committing expensive – far more expensive than it is in a typical database.

Programmers writing their “Hello World” Lucene program don’t particularly need to worry about performance problems, but as soon as they start using Lucene for its intended purpose – indexing and querying large amounts of text – they do need to worry. Too often, though, we fall into the trap of committing after every insert, causing a dramatic fall-off in indexing (and querying) performance, even to the extent of making a search service non-responsive. A very common version of this problem manifests itself in Solr, where you can see the dreaded “PERFORMANCE WARNING: Overlapping onDeckSearchers” message in the logs.
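To make the trap concrete, here is a minimal sketch using pysolr, the Python client that Haystack itself builds on. The core URL and document fields are made up for illustration; what matters is where the commit happens. Note that pysolr’s `add()` commits by default, so the naive loop commits once per document.

```python
import pysolr

# Illustrative core URL; substitute your own.
solr = pysolr.Solr('http://localhost:8983/solr/mycore', timeout=10)

docs = [{'id': str(i), 'title': 'Document %d' % i} for i in range(10000)]

# The trap: pysolr's add() defaults to commit=True, so this issues a
# costly Solr commit for every single document. Don't do this.
for doc in docs:
    solr.add([doc])

# Better: index in batches and commit once at the end.
BATCH = 1000
for start in range(0, len(docs), BATCH):
    solr.add(docs[start:start + BATCH], commit=False)
solr.commit()
```

Better still, configure `autoCommit` in `solrconfig.xml` and never commit from the client at all; then no amount of naive client code can trigger this pathology.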

There are pounds and pounds of curative blog posts, wiki pages and Stack Overflow answers explaining this problem, why it arises, and what to do about it. And thanks to search technology, it’s pretty easy to find them if you go looking. But an ounce of prevention would save a lot of headaches here.

My number-two most common Solr performance blunder is `fl=*`. Solr search results include the values of the stored fields associated with each result document. Typical search applications show a title, with a link to a full version of the document, and possibly an associated contextual ‘snippet’. A few other fields like date, publisher or book title might be included too. Such applications must store the full text of every document in the index to make it available to the snippeting component (the highlighter, as it’s called). However, if that text is *retrieved* as part of the search result, and the documents are not tiny, this practice vastly increases the amount of data that needs to be transferred, often leading to a 10–100x slowdown in search performance. Programmers tend to do this because it’s just easier to retrieve all the fields (that’s what `fl=*` does) than to list explicitly the fields required to display the result.
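The fix, happily, is a one-liner. Continuing the pysolr sketch above (field names are again illustrative), just pass an explicit `fl` instead of letting it default to every stored field:

```python
import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/mycore')

# The trap: omitting fl is equivalent to fl=*, so every stored field
# (including the full text stored for highlighting) comes back over the wire.
slow = solr.search('oscilloscope')

# Better: request only the fields the result page actually displays.
fast = solr.search('oscilloscope', fl='id,title,date,publisher', rows=10)
for hit in fast:
    print(hit['title'])
```

The same applies to raw HTTP queries: append `&fl=id,title,date,publisher` to the select URL.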

No decent Solr tutorial will lead a programmer to do this, and again, there is plenty of good information explaining how to select fields, but the Solr default is to select all fields, so this is a very easy trap to fall into. And it becomes even easier not to notice that this is happening when your queries are mediated by a middle tier.

I maintain an API for searching across multiple backend data storage and indexing systems, and in that API we once defaulted to returning full document results. I believe our thinking was that beginners would have everything they need and wouldn’t stumble over having to learn too much of a complex API just to get started. But I got tired of leaning over people’s chairs with a knowing grin and pointing out their n00b mistake. It really wasn’t their fault, anyway (it was mine). They were just doing the natural thing, following the most straightforward path the API made available. So I changed the default, and I think the people I work with must be really smart, since they seemed able to figure out how to get the missing field values when they really did need them. Even if they had to ask about that, it was much better to field a question like “how do I get the full text of a document in the search result?” than one like “why do my search results sometimes take 10 seconds to come back?” A problem like the latter is often intermittent, arising only when really large documents make it into the search results, and it can persist for a long time, with merely mediocre performance, before anybody takes note of the egregious outliers.

Apparently I’m not the only one making this mistake. [Safari Flow](http://safariflow.com) uses [Haystack](http://haystacksearch.org) as its internal search API. Its welcoming words – I just stumbled on them while writing this sentence – are “Search doesn’t have to be hard.” We recently found out that Haystack, by way of its defaults, encourages users to make exactly these mistakes. The Solr connector automatically commits after every insert, and I couldn’t even find any way to limit the set of stored fields returned. In both cases we had to fix these serious performance problems by editing Haystack’s Solr “backend” connector source code (in spite of its promise that “\[Haystack\] plays nicely with third-party apps without needing to modify the source…”).
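For what it’s worth, the shape of the commit fix looks roughly like the sketch below. This is illustrative, not our actual patch: it assumes a Haystack 2.x `SolrSearchBackend` whose `update()` method accepts a `commit` keyword; if your version differs, treat the names here as hypothetical and check the source.

```python
# A sketch of deferring commits in Haystack, assuming its 2.x Solr
# backend's update() accepts a commit keyword argument.
from haystack.backends.solr_backend import SolrEngine, SolrSearchBackend


class DeferredCommitSolrBackend(SolrSearchBackend):
    def update(self, index, iterable, commit=False):
        # Flip the default: let Solr's autoCommit (solrconfig.xml) decide
        # when to commit, instead of committing on every update call.
        super(DeferredCommitSolrBackend, self).update(index, iterable, commit=commit)


class DeferredCommitSolrEngine(SolrEngine):
    backend = DeferredCommitSolrBackend
```

Pointing the `ENGINE` entry in Django’s `HAYSTACK_CONNECTIONS` setting at the subclassed engine wires it in without forking the whole library.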

OK, I’m a bit peeved about Haystack right now, but I truly hope the maintainers will read this and take it as constructive criticism, because their library really does provide a lot of convenience to Django programmers grappling with search. Here’s my advice.

There is a notion gaining currency that programming computers is becoming easier. Sites like Codecademy teach JavaScript using a glossy, game-like, spoon-fed interface. “Learn Python the Hard Way” presents Python (and other languages, in spinoff titles) using a baby-step, scaffolded teaching approach (the only hard thing about it for me was sticking with it – to be fair, it acknowledges on page 1 that it wasn’t designed for impatient smartypants). There are many other “learn to code in 5 minutes” websites and courses that offer an easy path to software mastery.

This conceit that programming can be easy is partially fueled by the development of software languages and tools. It *is* easier to incorporate other people’s code now using readily available libraries and frameworks, and to make use of existing systems, so not every program has to start as a *tabula rasa*. It is *not* necessary to understand computer architecture in a deep way in order to write much useful and/or entertaining code now. In some ways, things have gotten easier.

There is also a cultural component to this new easy-going attitude: it’s a deliberate effort to be more inclusive, to shed the high-priest hacker snobbery that has been the stock-in-trade of software gurus for thirty years and more. “RTFM,” with its veiled obscenity, was always a little rude, even when uttered in jest, while its moral successor, “lmgtfy,” is simply peevish; both reflect the same unpleasant underlying attitude of condescension. I’m glad to see some reflection on that negative side of hacker culture and the corresponding openness to newcomers.

The positive side of the “learn to program” movement is that there are numerous ways to contribute without being a master. More than ever, it is possible to go very far with very little. Silicon Valley startups no longer sweat hardware: they just rent space from Amazon. This is healthy: it means that the culture as a whole is able to learn and grow, to stand on the shoulders of the previous generation rather than on their faces.

I’m sure you saw this coming: yes, Virginia, there is a dark side. The thing is, the obnoxious attitude grows out of a hard reality. Expert programming requires knowledge. Mental nimbleness and a problem-solving bent count for a lot, but true mastery of any craft, including programming, is available only to those willing and able to devote years to study, trial, error and correction. And there are still problems to solve that demand mastery, where beginners should be cautioned to tread lightly.

So let’s stop saying that search can be easy. We do learners a real disservice by pretending that things are going to be easier than they are. There are complex problems in search; getting them wrong can kill performance (and, with it, your web site), and our role as guides should be to offer paths to learning that have the right degree of steepness, and to warn about potential pitfalls. If we take you on a mountain-climbing journey and just tell you everything will be taken care of and there’s nothing to worry about, we’re leading you into a potentially dangerous situation without any preparation: in that setting, this kind of attitude would be criminal. Tell people to bring their helmets, and teach them how to self-arrest! Wat? Metaphor getting out of hand…

At the same time, we don’t want to scare people away. There’s no call to go all high-priest-in-the-inner-sanctum, with acolytes admitted only after years of fasting and prayer. Here’s where I think we can take a cue from video games. I read [this post](http://robotinvader.com/blog/?p=402) about luring gamers into playing the video game Devil May Cry, which is notoriously difficult but also offers an easy way out. The interesting thing is that it challenges the player to try the hard way first, and warns them that there will be no way back if the easy path is chosen. NetHack, an insanely arcane game, does a similar thing with its wizard mode: players can use it to try out all kinds of stuff without dying, but it comes with a caveat: this is not for real, and your scores won’t be reported.