It may have been the cute title that attracted me to this paper, "A funny thing happened on the way to a billion..." (PS), but it turns out to be a good read, full of "hard-won lessons" about supporting "queries on billions of documents" on the IBM Semantic Super Computing platform.
Some of the lessons will be familiar to anyone working on a large-scale cluster. For example, in Section 5.1, they talk about "what can go wrong when a query is sent out to 256 nodes", including that "some of the nodes will almost certainly not be online", "a few might be either hung or responding very slowly", "the network ... may be come overwhelmed", and errors or performance issues related to configuration problems on the boxes.
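To see why "almost certainly" is not an exaggeration, here is a quick back-of-the-envelope sketch. The 99% per-node availability figure and the assumption that nodes fail independently are mine, not the paper's, and `prob_all_online` is just a name I made up for illustration.

```python
def prob_all_online(nodes: int, per_node_availability: float) -> float:
    """Probability that every node is up, assuming independent failures.

    Both the availability figure and the independence assumption are
    illustrative; neither comes from the paper.
    """
    return per_node_availability ** nodes


# 256 nodes as in Section 5.1; 0.99 is an assumed availability per node.
p_all_up = prob_all_online(256, 0.99)
p_some_down = 1.0 - p_all_up

print(f"P(all 256 nodes online)   = {p_all_up:.3f}")
print(f"P(at least one node down) = {p_some_down:.3f}")
```

Even with each box up 99% of the time, the chance that all 256 answer a given query comes out well under 10%, so a query layer at this scale has to treat missing nodes as the normal case.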
Section 2.3 on sampling the data offers a useful suggestion. The IBM system indexed on a 128-bit random unique identifier for the documents, allowing them to quickly answer some types of aggregate queries after examining only a small sample of the documents. The authors write that "using an index that already provides a uniform and random distribution" allows "orders of magnitude" fewer disk seeks than alternative approaches that would sample from a sorted index. It is a good point that ordering data by a random UID before sampling turns what would otherwise be random disk accesses into sequential reads, substantially reducing disk seeks.
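A toy version of the idea, as I understand it, might look like the following. This is only a sketch: the in-memory list stands in for an on-disk index, and `build_uid_index`, `estimate_count`, and the `lang` field are hypothetical names of my own rather than anything from the IBM system.

```python
import random


def build_uid_index(docs, rng):
    """Give each document a random 128-bit UID and sort by it.

    The sorted list stands in for an on-disk index ordered by UID.
    """
    pairs = [(rng.getrandbits(128), doc) for doc in docs]
    return sorted(pairs, key=lambda pair: pair[0])


def estimate_count(index, predicate, sample_size):
    """Estimate how many documents match `predicate`.

    Only the first `sample_size` index entries are read. Because the
    UIDs are random, that contiguous prefix is a uniform random sample,
    so one sequential scan replaces many scattered lookups.
    """
    sample = index[:sample_size]
    hits = sum(1 for _, doc in sample if predicate(doc))
    return hits * len(index) / len(sample)


rng = random.Random(42)  # fixed seed so the run is repeatable
docs = [{"id": i, "lang": "de" if i % 4 == 0 else "en"} for i in range(100_000)]
index = build_uid_index(docs, rng)

estimate = estimate_count(index, lambda d: d["lang"] == "de", 2_000)
true_count = sum(1 for d in docs if d["lang"] == "de")

print(f"estimated: {estimate:.0f}, actual: {true_count}")
```

The estimate comes from reading 2% of the entries in one contiguous run. Sampling the same 2,000 documents from an index sorted on anything else would mean jumping to 2,000 scattered positions, which is where the extra seeks come from.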
Other lessons focus more on the user experience of search. The authors write that "discovery is more than search" and argue for tools to help explore data and documents. They also write that "people are not going to learn something new unless they have to" and suggest that the tools should be easy to use at first, but provide power underneath that makes "complex things ... possible."
The paper covers several other lessons as well, ranging from disk I/O issues to testing and cluster management.