[When] learning rankings of documents from search engine logs .... all previous work has only used logs collected passively, simply using the recorded interactions that take place anyway. We instead propose techniques to guide users so as to provide more useful training data for a learning search engine.
The strategy they propose is to come up with some rough estimate of the cost of ranking incorrectly, then twiddle with the search results in such a way that the data produced will help us minimize that cost.
[With] passively collected data ... users very rarely evaluate results beyond the first page, so the data obtained is strongly biased toward documents already ranked highly. Highly relevant results that are not initially ranked highly may never be observed and evaluated.
One possibility would be to intentionally present unevaluated results in the top few positions, aiming to collect more feedback on them. However, such an ad-hoc approach is unlikely to be useful in the long run and would hurt user satisfaction.
We instead introduce ... changes ... [designed to] not substantially reduce the quality of the ranking shown to users, produce much more informative training data and quickly lead to higher quality rankings being shown to users.
The paper raises a bunch of questions that could use further discussion: Is the proposed loss function a good one, particularly in how it deals with a lack of data? How do other loss functions perform on real data? How much computation does the proposed method require to determine which experiment to run? Are there simpler strategies that require less online computation (while the searcher is waiting) and perform nearly as well on real data?
But, such quibbles are beside the point. The interesting thing about this paper is the suggestion of learning from clickstream data, not just passively from what people do, but also actively by changing what people see depending on what we need to learn. The system should explore the data, constantly looking for whether what it believes to be true actually is true, constantly looking for improvements.
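To make the idea concrete, here is a minimal sketch of active exploration. It is not the method from the paper, which chooses its experiments by minimizing an estimated loss. This is a much simpler stand-in: the class `PairwiseExplorer`, the `explore_rate` parameter, and the Beta-posterior uncertainty measure are all my own assumptions for illustration. It occasionally swaps the one adjacent pair of results whose relative order we are least sure about, so the ranking shown changes only slightly while the feedback collected lands where we know the least.

```python
import random
from collections import defaultdict


class PairwiseExplorer:
    """Hypothetical sketch: learn pairwise preferences by occasionally
    swapping the adjacent pair whose order is most uncertain."""

    def __init__(self, explore_rate=0.1, seed=None):
        # Fraction of queries on which we perturb the ranking (assumed knob).
        self.explore_rate = explore_rate
        self.rng = random.Random(seed)
        # wins[(a, b)] counts how often a was preferred over b.
        self.wins = defaultdict(int)

    def _uncertainty(self, a, b):
        # Variance of a Beta(1 + wins_a, 1 + wins_b) posterior on
        # "a is preferred over b"; it shrinks as evidence accumulates.
        alpha = 1 + self.wins[(a, b)]
        beta = 1 + self.wins[(b, a)]
        n = alpha + beta
        return (alpha * beta) / (n * n * (n + 1))

    def present(self, ranking):
        """Return the ranking to show: usually unchanged, sometimes with
        the single most uncertain adjacent pair swapped."""
        shown = list(ranking)
        if len(shown) < 2 or self.rng.random() >= self.explore_rate:
            return shown
        # Pick the adjacent pair with the highest posterior variance
        # (ties go to the pair nearest the top).
        i = max(
            range(len(shown) - 1),
            key=lambda k: self._uncertainty(shown[k], shown[k + 1]),
        )
        shown[i], shown[i + 1] = shown[i + 1], shown[i]
        return shown

    def record_preference(self, preferred, other):
        """Record feedback (e.g. inferred from clicks) that one result
        was preferred over another."""
        self.wins[(preferred, other)] += 1
```

Restricting the change to one adjacent swap is the design choice that keeps the cost to the searcher small. How clicks get turned into `record_preference` calls is left out here, and that step is where much of the real difficulty lies.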
On a broader point, this paper appears to be part of an ongoing trend in search relevance ranking, away from link and text analysis and toward analysis of searcher behavior. Rather than trying to get computers to understand the content and whether it is useful, we watch people who read the content and look at whether they found it useful.
People are great at reading web pages and figuring out which ones are useful to them. Computers are bad at that. But, people do not have time to compile all the pages they found useful and share that information with billions of others. Computers are great at that. Let computers be computers and people be people. Crowds find the wisdom on the web. Computers surface that wisdom.
See also my June 2007 post, "The perils of tweaking Google by hand", where I discussed treating every search query as an experiment where results are frequently twiddled, predictions are made about the impact of those changes, and unexpected outcomes lead to new optimizations.