Friday, May 30, 2008

Machines versus humans at Google

A curious revelation from Googler Peter Norvig appears in a recent post by Anand Rajaraman:

[To execute a web search] a subset of documents is identified based on the presence of the user's keywords. Then, these documents are ranked by a very fast algorithm that combines ... 200 [pre-computed] signals in-memory using a proprietary formula.
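Concretely, the two-stage pipeline Anand describes might look something like the toy sketch below. Every document, signal name, and weight here is invented for illustration; the real formula and signals are, of course, proprietary.

```python
# A toy, invented illustration of the two-stage process quoted above:
# keyword filtering selects candidate documents, then each candidate is
# scored by a fixed, hand-tuned weighted combination of precomputed signals.
from typing import Dict, List, Set

# Hypothetical index records: terms plus precomputed signal values.
INDEX: Dict[str, dict] = {
    "doc1": {"terms": {"python", "tutorial"}, "signals": [0.9, 0.2, 0.7]},
    "doc2": {"terms": {"python", "snake"},    "signals": [0.4, 0.8, 0.1]},
    "doc3": {"terms": {"java", "tutorial"},   "signals": [0.7, 0.5, 0.3]},
}

# Stand-in for the "proprietary formula": a hand-tuned linear combination.
WEIGHTS = [0.5, 0.3, 0.2]

def search(query_terms: Set[str]) -> List[str]:
    # Stage 1: keep only documents containing all of the user's keywords.
    candidates = [d for d, rec in INDEX.items() if query_terms <= rec["terms"]]
    # Stage 2: rank candidates by the weighted sum of their signals.
    def score(d: str) -> float:
        return sum(w * s for w, s in zip(WEIGHTS, INDEX[d]["signals"]))
    return sorted(candidates, key=score, reverse=True)

print(search({"python"}))  # ['doc1', 'doc2']
```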
[This] appears to be made-to-order for machine learning algorithms. Tons of training data (both from usage and from the armies of "raters" employed by Google), and a manageable number of signals (200) -- these fit the supervised learning paradigm well, bringing into play an array of ML algorithms from simple regression methods to Support Vector Machines.
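As a rough sketch of what that supervised setup might look like, the snippet below fits a simple regression model and an SVM to synthetic stand-ins for the signal vectors and rater labels (none of Google's actual data or models are public):

```python
# A rough sketch of the supervised-learning framing described above,
# on synthetic data: each row holds ~200 precomputed signals for one
# query-document pair; each label stands in for a human rater's score.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_examples, n_signals = 2_000, 200

X = rng.random((n_examples, n_signals))           # fake signal vectors
true_w = rng.random(n_signals)                    # unknown "ideal" weighting
y = X @ true_w + rng.normal(0, 0.1, n_examples)   # fake rater scores

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Anything from simple regression to an SVM fits this mold.
for model in (Ridge(alpha=1.0), SVR(kernel="rbf", C=10.0)):
    model.fit(X_tr, y_tr)
    print(f"{type(model).__name__}: held-out R^2 = {model.score(X_te, y_te):.3f}")
```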
And indeed, Google has tried methods such as these. Peter tells me that their best machine-learned model is now as good as, and sometimes better than, the hand-tuned formula on the results quality metrics that Google uses.
The big surprise is that Google still uses the manually crafted formula for its search results. They haven't cut over to the machine-learned model yet.
Peter suggests two reasons for this. The first is hubris: the human experts who created the algorithm believe they can do better than a machine-learned model. The second reason is more interesting. Google's search team worries that machine-learned models may be susceptible to catastrophic errors on searches that look very different from the training data. They believe the manually crafted model is less susceptible to such catastrophic errors on unforeseen query types.
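That worry is easy to reproduce in miniature. In the invented example below, a flexible learned model and a clipped hand-tuned formula agree on in-range inputs, but the learned model can return a wildly different score when a signal falls far outside anything seen in training:

```python
# An invented miniature of the "catastrophic error" concern: a flexible
# model fit on signals in [0, 1] may extrapolate wildly on an unforeseen
# input, while a bounded hand-tuned formula degrades gracefully.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Training data: one relevance signal in [0, 1], near-linear labels.
X_train = rng.uniform(0.0, 1.0, (200, 1))
y_train = 0.8 * X_train[:, 0] + rng.normal(0.0, 0.05, 200)

learned = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
learned.fit(X_train, y_train)

def hand_tuned(x: np.ndarray) -> np.ndarray:
    # Clipping bakes in the expert's knowledge of the signal's valid range.
    return 0.8 * np.clip(x, 0.0, 1.0)

x_unforeseen = np.array([[10.0]])      # far outside the training range
print(learned.predict(x_unforeseen))   # can be off by orders of magnitude
print(hand_tuned(x_unforeseen[:, 0]))  # stays capped at 0.8
```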
Update: Anand writes a follow-up post, "How Google Measures Search Quality".
Posted by Greg Linden at 10:35 AM