When it comes to the search engine — which has many thousands of interlocking equations — [Google] has to double-check the engineers' independent work with objective, quantitative rigor to ensure that new formulas don't do more harm than good.

I found this surprising. Google manually comes up with tweaks to its search engine that only apply to a small percentage of queries, tests the tweaks, and then tosses them into the relevance rank?
Recently, a search for "French Revolution" returned too many sites about the recent French presidential election campaign -- in which candidates opined on various policy revolutions -- rather than the ouster of King Louis XVI. A search-engine tweak gave more weight to pages with phrases like "French Revolution" rather than pages that simply had both words.
Typing the phrase "teak patio Palo Alto" didn't return a local store called the Teak Patio .... Mr. Singhal's group [wrote] a new mathematical formula to handle queries for hometown shops.
Is it better to provide new information or to display pages that have stood the test of time and are more likely to be of higher quality? Until now, Google has preferred pages old enough to attract others to link to them ... [Singhal's] team's solution: a mathematical model that tries to determine when users want new information and when they don't. (And yes, like all Google initiatives, it had a name: QDF, for "query deserves freshness.")
The problem with these manual tweaks is that they rapidly become unwieldy. As you add hundreds or thousands of hand-coded rules, they start to interact in unpredictable ways. When evaluating a new rule, it becomes unclear whether performance might be improved by tweaking that rule, tweaking other rules, or removing older rules that the new one has subsumed.
It appears Google has hit this problem head on. The "many thousands of interlocking equations" require a "closely guarded internal [program] called Debug" that attempts to explain whether the rules are doing "more harm than good."
Frankly, I thought Google was beyond this. Rather than piling hack upon hack, I thought Google's relevance rank was a giant, self-optimizing system, constantly learning and testing to determine automatically what works best.
How would this work? With a search engine the size of Google's, every search query can be treated as an experiment, every interaction as an opportunity to learn and adapt.
On each query, a few of the results would be perturbed. Each time, the search engine makes a prediction about the impact of the change, usually an anticipated slight negative. Wrong predictions are surprises, opportunities to learn, and are grouped with other wrong predictions until the engine can generalize and attempt a broader tweak to the algorithm. Those broader tweaks are automatically tested, integrated if they work, and the cycle repeats.
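A rough sketch of what such an experiment loop might look like. Everything here is hypothetical -- the swap-based perturbation, the rate and margin parameters, and the stand-in `predict_ctr`/`observe_ctr` functions are my illustrative assumptions, not anything Google has described:

```python
import random
from collections import defaultdict

# Hypothetical sketch of the loop above: perturb a small fraction of
# queries by swapping two adjacent results, predict the clickthrough
# impact, and log the cases where the prediction was wrong (surprises).

surprises = defaultdict(list)  # query -> surprising (url, ctr delta) pairs

def run_experiment(query, results, predict_ctr, observe_ctr,
                   swap_rate=0.01, surprise_margin=0.1):
    """predict_ctr and observe_ctr are stand-ins for the engine's
    impact model and its click logs; swap_rate and surprise_margin
    are assumed tuning knobs, not real parameters."""
    if random.random() > swap_rate or len(results) < 2:
        return results                          # control group: no change
    i = random.randrange(len(results) - 1)
    perturbed = list(results)
    perturbed[i], perturbed[i + 1] = perturbed[i + 1], perturbed[i]
    predicted = predict_ctr(query, perturbed)   # usually slightly negative
    observed = observe_ctr(query, perturbed)
    if observed - predicted > surprise_margin:  # did better than expected
        surprises[query].append((perturbed[i], observed - predicted))
    return perturbed
```

Later stages would then cluster the accumulated entries in `surprises` and try to generalize from them.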
For example, on the query [teak patio palo alto], experiments may show the Teak Patio store in Palo Alto is getting unexpectedly high clickthrough. Another query, [garden stone seattle], is showing similar surprises in experiments. In clustering, both queries are classified as local. Both show modest purchase intent. The clickthrough URLs have been classified as local businesses. On the known data, the most general rule consistent with this result, one boosting local businesses when a query is classified as local with purchase intent, appears to give a lift. The rule is tested live on a percentage of users, results match predictions, and the system adds the new rule for all users. The process repeats.
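The generalization and acceptance steps in that example could be sketched as below. The classifier labels, lift numbers, and tolerance are illustrative assumptions, not Google's actual pipeline:

```python
# Hypothetical sketch: take the queries that produced surprises, find the
# labels they all share (the most general condition the evidence supports),
# and ship the resulting rule only if a live test matches predictions.

def propose_rule(surprising_queries, classify):
    """classify(query) returns a set of labels such as {"local",
    "purchase_intent"} from an assumed query classifier."""
    label_sets = [classify(q) for q in surprising_queries]
    shared = set.intersection(*label_sets)
    return shared or None  # no shared labels -> nothing to generalize

def accept_rule(rule, predicted_lift, live_test, tolerance=0.2):
    """Keep the rule when the live A/B result lands within an (assumed)
    20% tolerance of the predicted lift."""
    observed_lift = live_test(rule)
    return abs(observed_lift - predicted_lift) <= tolerance * abs(predicted_lift)
```

On the two example queries, `propose_rule` would intersect their label sets down to something like {local, purchase intent}, which is exactly the condition of the rule described above.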
Similarly, a group of queries that match known names (e.g. celebrities) may show that links classified as sites are appearing too low in rankings. In automated testing, a general version of this rule performs poorly unless another rule with strong overlap is removed; a more specific rule performs well but applies less frequently. The older rule is removed, the newer general rule put in place.
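That kind of rule interaction could, in a sketch, be resolved by scoring a few candidate rule sets and keeping the best; `score()` stands in for whatever offline relevance metric the engine uses, and the whole thing is an illustrative assumption:

```python
# Hypothetical sketch: when a new rule overlaps an old one, evaluate the
# status quo, adding the new rule alongside the old, and swapping the new
# rule in for the old one, then keep whichever rule set scores best.

def best_rule_set(current_rules, new_rule, overlapping_rule, score):
    options = [
        set(current_rules),                                      # status quo
        current_rules | {new_rule},                              # add only
        (current_rules - {overlapping_rule}) | {new_rule},       # swap
    ]
    return max(options, key=score)
```

In the celebrity-queries example, the "add only" option would score poorly because the two overlapping rules interfere, while the swap option wins, so the older rule is dropped.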
While this process does result in specific tweaks to the engine like the manual approach, it does not rely on someone manually finding the rule. Unexpected tweaks to relevance rank may arise from the data. Moreover, a self-optimizing relevance rank does not rely on someone manually coming back to rules to maintain or debug them over time.
This approach would require massive computational power -- a huge infrastructure classifying and clustering queries and URLs into hierarchies, a framework for testing billions of changes and tweaks simultaneously and generalizing from the results -- but I thought Google had that power already.
Perhaps this merely shows how much further there is to go in search. As Larry Page recently said, "We're probably only 5 per cent of the way there."
Update: Five weeks later, there are a few more tidbits about Google's experiments and its tweaking of its relevance rank in an interview of Udi Manber by Eric Enge. Some excerpts:
We run literally thousands of experiments a year and pick the ones that score well.
We have projects that their sole purpose is to reduce complexity. A team may go and work for two months on a new simpler sub-algorithm. If it performs the same as the previous algorithm, but it's simpler, that will be a big win.
Overall, we have to be very careful that the complexity of the algorithm does not exceed what we can maintain.