The paper, "Entropy of Search Logs: How Hard is Search? With Personalization? With Backoff?" (ACM page) covers much of the talk and adds more detail. An excerpt:
Although there are lots of pages out there, there are not that many pages that people actually go to ... [We] estimate the entropy of URLs (and queries and IP addresses) based on logs from Microsoft's www.live.com. We find that it takes just 22 bits to guess the next URL (or next query or next IP address). That is ... millions, not billions.So, this is one of the dramatic, thought-provoking, outlandish claims. Qiaozhu is arguing that analysis how many pages people actually go to in Live Search seems to indicate that a relatively small cache of pages (millions) would be sufficient to capture almost all searchers. During the talk, he took this even further, saying that there might be some possibility of fitting a search index of all the pages you might care about on a flash drive.
Large investments in clusters in the cloud could be wiped out if someone found a way to capture much of the value of billions [of indexed pages] in a small cache of millions.
On the one hand, there is little doubt there is a long tail of infrequently executed queries and that many (and perhaps even most) queries can be satisfied from a reasonably large cache if it is well structured.
On the other hand, there appear to be large logical leaps in using entropy to make claims about the difficulty of the problems in search. Information entropy provides a lower bound on how much space is needed to encode information, the maximum compression possible for the information. The analyses in the paper, such as that all the URLs accessed through Live Search in the last two years could be encoded in 22.44 bits, provide the minimum number of bits required to encode that data. To take that minimum and then re-state it as the actual amount of effort required (e.g. "It takes around 22 bits to guess the next url") seems like it requires a fairly large leap of faith.
Practical issues also get in the way here -- such as new or updated pages and responding immediately to changes in behavior such as a major news event -- that make it seem unlikely that "large investments in expensive computer systems in the cloud to index billions of pages ... could ... be wiped out [because] ... a small cache of a few million pages ... [captures] much of the value."
And, other looks at similar data have said there is little potential to capture all of the value of a search engine with a large cache. For example, in SIGIR 2007, Baeza-Yates et al. wrote that:
In the query log we used, 88% of the unique queries are singleton queries, and 44% are singleton queries out of the whole volume. Thus, out of all queries in the stream composing the query log, the upper threshold on hit ratio is 56%.Yet there remains much to consider here. While there may be large numbers of singleton queries, do those queries matter? Are there clicks on them? What would be the impact of dropping many of them (either by mapping them to similar queries that are in the cache or by returning no results)? The measures of entropy given by Qioazhu imply that many of these queries do not matter to users, part of why I found the paper so thought-provoking. Looking at cache size with respect to user experience -- satisfying queries that matter -- seems like fertile ground for further investigation.
This is because only 56% of all the queries comprise queries that have multiple occurrences. It is important to observe, however, that not all queries in this 56% can be cache hits because of compulsory misses. This is different from capacity misses, which happen due to space constraints on the amount of memory the cache uses. If we consider a cache with infinite memory, then the hit ratio is 50%. Note that for an infinite cache there are no capacity misses.
The WSDM paper also had an interesting discussion of how much personalization could help with search, where "personalization" here is limited to looking at how users from the same IP range (e.g. grouping everyone in the 156.111.*.* block together). The authors found that the variation (as measured by entropy) of the URLs clicked for a query for an IP range was much less than the entropy overall. In giving an example of how MSG can be disambiguated to mean either "Madison Square Garden" or "Mono-sodium Glutamate", they imply that this is primarily due to the implicit information about location contained in IP addresses.
In this personalization/localization analysis and some of the other analyses in the paper, I did wonder if the drops in entropy are at least partially explained by reducing the data size. That is, when I limit to queries from a small IP range, there may be only a few people who did that query and clicked results, which would give me low entropy, but only because I have just a few examples. Even so, the conclusion makes intuitive sense, that adding information about IP address to queries can help disambiguate some queries, especially local searches.
Finally, the paper also includes brief but fun discussions of the difficulty of query suggestions, the variation in what people click on in the search results, and time of day and time of week effects. Definitely a fun and thought-provoking read.
Update: Video of Qiaozhu's talk is now available.
Update: Please see also my earlier post on a WWW 2007 paper by Dou et al. where they use "click entropy ... as a simple measurement on whether the query should be personalized."