Specifically, they found that, instead of testing two search rankers in a normal A/B test (e.g. 50% of users see ranker A, 50% see ranker B), showing all searchers an interleaved combination of the two possible search result orderings makes it much easier to see which ranker people prefer. The primary explanation the authors give for this is that interleaving the results gives searchers the easier task of expressing a relative preference between the two rankers.
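To make the idea of interleaving concrete, here is a minimal sketch of one well-known scheme, team-draft interleaving. It is an illustration under stated assumptions, not necessarily the exact procedure from the paper: the function names (`team_draft_interleave`, `credit_clicks`) and the document IDs are hypothetical. The two rankers take turns "drafting" their best result not yet shown, and each click is then credited to whichever ranker contributed the clicked result.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings; return (interleaved list, doc -> 'A' or 'B')."""
    rng = rng or random.Random()
    sources = {"A": ranking_a, "B": ranking_b}
    all_docs = set(ranking_a) | set(ranking_b)
    interleaved, team = [], {}
    picks = {"A": 0, "B": 0}

    while len(interleaved) < len(all_docs):
        # The ranker with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] != picks["B"]:
            picker = "A" if picks["A"] < picks["B"] else "B"
        else:
            picker = rng.choice("AB")

        # Take the picker's highest-ranked document not already shown.
        doc = next((d for d in sources[picker] if d not in team), None)
        if doc is None:
            # This ranker has nothing new left, so the other one picks.
            picker = "B" if picker == "A" else "A"
            doc = next(d for d in sources[picker] if d not in team)

        interleaved.append(doc)
        team[doc] = picker
        picks[picker] += 1

    return interleaved, team


def credit_clicks(team, clicked_docs):
    """Count clicks per ranker for one query impression."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team:
            wins[team[doc]] += 1
    return wins


# Hypothetical rankings from two rankers for the same query.
shown, team = team_draft_interleave(
    ["d1", "d2", "d3"], ["d3", "d1", "d4"], rng=random.Random(0)
)
wins = credit_clicks(team, clicked_docs=[shown[0]])
```

The ranker with more credited clicks wins that impression, so every searcher contributes a relative preference rather than an absolute measurement.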
Some excerpts from the paper:
Unlike expert judgments, usage data ... such as clicks, query reformulations, and response times ... can be collected at essentially zero cost, is available in real time, and reflects the values of the users, not those of judges far removed from the users' context. The key problem with retrieval evaluation based on usage data is its proper interpretation.
We explored and contrasted two possible approaches to retrieval evaluation based on implicit feedback, namely absolute metrics and paired comparison tests ... None of the absolute metrics gave reliable results for the sample size collected in our study. In contrast, both paired comparison algorithms ... gave consistent and mostly significant results.
Paired comparison tests are one of the central experiment designs used in sensory analysis. When testing a perceptual quality of an item (e.g. taste, sound) ... absolute (Likert scale) evaluations are difficult to make. Instead, subjects are presented with two or more alternatives and asked ... which of the two they prefer.
This work proposes a method for presenting the results from two [rankers] so that clicks indicate a user's preference between the two. [Unlike] absolute metrics ... paired comparison tests do not assume that observable user behavior changes with retrieval quality on some absolute scale, but merely that users can identify the preferred alternative in direct comparison.
Please see also my older post, "Actively learning to rank", which summarizes some earlier very interesting work by Filip and Thorsten.
Update: Filip published a nice follow-up to this work in a SIGIR 2010 paper, "Comparing the Sensitivity of Information Retrieval Metrics". Particularly notable is that detecting small changes in relevance requires only about 10 times as many clicks as explicit judgments. Since click data is much easier and cheaper to acquire than explicit relevance judgments, this is another point in favor of using online measures of relevance rather than the older technique of asking judges (often a lot of judges) to compare the results.
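As a rough illustration of what "detecting" a difference from clicks can mean, here is a minimal sketch of a two-sided sign test over per-query outcomes of a paired comparison. The choice of test and the function name `sign_test_p_value` are assumptions made for this example, not necessarily what either paper used. Each query impression where one ranker earned more clicks counts as a win for that ranker, and ties are dropped.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact sign test: could a fair coin produce a split this lopsided?"""
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no untied comparisons, so no evidence either way
    k = max(wins_a, wins_b)
    # Probability of k or more wins out of n under a 50/50 null hypothesis.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical tallies: ranker A wins 9 untied impressions, ranker B wins 1.
p = sign_test_p_value(9, 1)
```

A small p-value suggests the observed preference is unlikely to be noise. The more subtle the difference between the rankers, the more impressions are needed before the test can tell them apart.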