The paper is a summary of the challenges in building a web search engine that can scale to handle the expected growth of the web in the next 3-5 years.
In the ocean of Web data, Web search engines are the primary way to access content. As the data is on the order of petabytes, current search engines are very large centralized systems based on replicated clusters. The paper goes on to discuss the problems that arise for crawling, indexing, and query processing when, instead of all sitting in the same data center, clusters are distributed across a WAN. It is a great overview and a worthwhile read.
The number of Web sites continues to grow rapidly and there are currently more than 20 billion indexed pages. In the near future, centralized systems are likely to become ineffective against such a load, thus suggesting the need of fully distributed search engines.
In this paper we discuss the main issues with the design of a distributed Web retrieval system, including discussions on the state of the art.
I wanted to highlight a few small but interesting points in the paper that I suspect would be easy to overlook.
First, I really like the idea mentioned on page 5 of the paper of optimizing the distributed index by "using query logs to partition the document collection and to route queries." What would be really fun to see is this optimization also being done continuously, replicating and moving index shards in real-time in response to access patterns.
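To make the idea concrete, here is a minimal sketch (in Python, with entirely hypothetical names; the paper does not give an algorithm) of one way query logs could drive routing: count which shards have historically served results for each query term, then send new queries only to the top-voted shards.

```python
from collections import Counter, defaultdict

# Hypothetical sketch: route queries to index shards based on
# query-log evidence of which shards answered similar queries before.
class LogBasedRouter:
    def __init__(self):
        # term -> Counter of shard ids that served results for that term
        self.term_shard_hits = defaultdict(Counter)

    def record(self, query_terms, shard_id):
        """Log that shard_id served results for a past query."""
        for term in query_terms:
            self.term_shard_hits[term][shard_id] += 1

    def route(self, query_terms, top_k=2):
        """Pick the top_k shards most likely to hold relevant documents."""
        votes = Counter()
        for term in query_terms:
            votes.update(self.term_shard_hits.get(term, Counter()))
        if not votes:
            return None  # no log evidence: fall back to querying all shards
        return [shard for shard, _ in votes.most_common(top_k)]

router = LogBasedRouter()
router.record(["distributed", "search"], shard_id=3)
router.record(["distributed", "index"], shard_id=3)
router.record(["cat", "videos"], shard_id=7)
print(router.route(["distributed", "search"]))  # -> [3]
```

The continuous version I'd like to see would go a step further: when the vote counts for a term shift over time, the system would replicate or migrate the corresponding index shards so hot documents move closer to the queries that want them.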
Second, the paper very briefly mentions the idea of a peer-to-peer model for a web search engine. On the one hand, the paper makes this idea sound intriguing by talking about all the problems of doing distributed web search and the number of machines required (possibly 1.5M in 2010). On the other hand, there is a big difference between 30+ clusters on a WAN with 50k+ fast-interconnect, trusted machines in each cluster and using millions of untrusted machines on a P2P network.
The Baeza-Yates paper does not explore P2P web search further, but it does cite "P2P Content Search: Give the Web Back to the People" (PDF) which, in turn, cites "On the Feasibility of Peer-to-Peer Web Indexing and Search" (PS). This second paper argues that P2P web search, at least in a naive implementation, would consume most of the bandwidth on the internet and therefore is not feasible. Worth a peek if you, like me, think this might be an intriguing idea to explore.
Finally, the Baeza-Yates paper mentions personalized search. I think that is interesting both because Yahoo researchers are thinking about personalized search and, like folks at Microsoft Research, they seem to be leaning toward a client-side implementation:
When query processing involves personalization of results, additional information from a user profile is necessary at search time, in order to adapt the search results according to the interests of the user. However, a client-side approach, as the MSR researchers say, has the "disadvantage of ... [no] direct access to details of the Web corpus" (which limits the extent of the personalization) and, as the Yahoo researchers point out, "limits the user to always using the same terminal."
An additional challenge related to personalization in Web search engines is that each user profile is mutable state: the engine must always see the latest version of it, kept consistent across replicas.
Alternatively, a system can implement personalization as a thin layer on the client side. This last approach is attractive because it sidesteps the privacy issues that come with centrally storing information about users and their behavior.
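A minimal sketch of what such a thin layer might look like (my own illustration, not from the paper; all names are hypothetical): the engine returns generic scored results, and the client reranks them against an interest profile that never leaves the user's machine.

```python
# Hypothetical sketch of thin client-side personalization:
# the server returns generic results; the client reranks them with a
# locally stored interest profile that is never sent to the server.
def rerank(results, profile, weight=0.5):
    """Blend the engine's score with overlap against local interests.

    results: list of (url, engine_score, terms) from the search engine
    profile: dict mapping interest term -> strength, kept client-side
    weight:  how much the local profile influences the final order
    """
    def personal_score(terms):
        return sum(profile.get(t, 0.0) for t in terms)

    return sorted(
        results,
        key=lambda r: (1 - weight) * r[1] + weight * personal_score(r[2]),
        reverse=True,
    )

profile = {"python": 1.0, "search": 0.5}  # private, stays on this terminal
results = [
    ("example.com/cooking", 0.9, ["recipes", "cooking"]),
    ("example.com/python-search", 0.7, ["python", "search"]),
]
print([url for url, _, _ in rerank(results, profile)])
```

This also makes the trade-offs in the quotes above tangible: the client can only rerank what the server sends (no "direct access to details of the Web corpus"), and because the profile lives on one machine, the personalization is tied to that terminal.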