Wednesday, November 30, 2011

Browsing behavior for web crawling

A recent paper out of Yahoo, "Discovering URLs through User Feedback" (ACM), describes the value of using the pages people browse to and click on, as recorded in Yahoo's toolbar logs, to tell the web crawler about new pages to crawl and index.

From the paper:
Major commercial search engines provide a toolbar software that can be deployed on users' Web browsers. These toolbars provide additional functionality to users, such as quick search option, shortcuts to popular sites, and malware detection. However, from the perspective of the search engine companies, their main use is on branding and collecting marketing statistics. A typical toolbar tracks some of the actions that the user performs on the browser (e.g., typing a URL, clicking on a link) and reports these actions to the search engine, where they are stored in a log file.

A Web crawler continuously discovers new URLs and fetches their content ... to build an inverted index to serve [search] queries. Even though the basic mechanism of a crawler is simple, crawling efficiently and effectively is a difficult problem ... The crawler not only has to continuously enlarge its repository by expanding its frontier, but also needs to refresh previously fetched pages to incorporate in its index the changes on those pages. In practice, crawlers prioritize the pages to be fetched, taking into account various constraints: available network bandwidth, peak processing capacity of the backend system, and politeness constraints of Web servers ... The delay to discover a Web page can be quite long after its creation and some Web sites may be only partially crawled. Another important challenge is the discovery of hidden Web content ... often ... backed by a database.

Our work is the first to evaluate the benefits of using the URLs collected from a Web browser toolbar as a form of user feedback to the crawling process .... On average, URLs accessed by the users are more important than those found ... [by] the crawler ... The crawler has a significant delay in discovering URLs that are first accessed by the users ... Finally, we [show] that URL discovery via toolbar [has a] positive impact on search result quality, especially for queries seeking recently created content and tail content.
The paper goes on to quantify the surprisingly large number of URLs found by the toolbar that are useful, not private, and not excluded by robots.txt. Importantly, many of these are deep web pages, only visible by doing a query on a database, and hard to ferret out of that database any way other than by looking at the pages people actually look at.
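To make the filtering idea concrete, here is a minimal sketch of how toolbar URLs might be narrowed down to crawl candidates. It is not the paper's method. The function names, the `example-crawler` user agent, and the keyword list used to guess at private URLs are all assumptions made up for illustration; only the robots.txt check uses a real standard-library parser.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical heuristic: query strings containing these words are treated
# as private. A real system would need something far more careful.
PRIVATE_HINTS = ("session", "token", "password", "auth")


def looks_private(url):
    """Guess whether a URL carries credentials or session state."""
    parts = urlsplit(url)
    if parts.username or parts.password:
        return True
    query = parts.query.lower()
    return any(hint in query for hint in PRIVATE_HINTS)


def new_crawl_candidates(toolbar_urls, known_urls, robots_by_host,
                         agent="example-crawler"):
    """Return toolbar URLs that are new, not private-looking, and allowed.

    robots_by_host maps a host name to the text of its robots.txt
    (already fetched elsewhere); hosts without an entry are not filtered.
    """
    candidates = []
    seen = set(known_urls)
    for url in toolbar_urls:
        if url in seen or looks_private(url):
            continue
        seen.add(url)  # also drops duplicates within the toolbar log
        robots_txt = robots_by_host.get(urlsplit(url).netloc)
        if robots_txt is not None:
            parser = RobotFileParser()
            parser.parse(robots_txt.splitlines())
            if not parser.can_fetch(agent, url):
                continue
        candidates.append(url)
    return candidates
```

Whatever survives these checks is only a candidate list; deciding which of those pages are actually worth crawling is a separate problem.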

Also interesting are the metrics on pages the toolbar data finds first. People often send links to new web pages by e-mail or text message. Those links might eventually appear on the web, but "eventually" can be a long time: many of the URLs found first in the toolbar data ("more than 60%") are found long before the crawler manages to discover them ("at least 90 days earlier than the crawler").
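A lead-time figure like that could be computed roughly as follows. This is a sketch under assumptions, not the paper's actual procedure: the function name and the two input dictionaries (URL to first-seen date for the toolbar and for the crawler) are hypothetical.

```python
from datetime import date


def share_with_long_lead(toolbar_first_seen, crawler_first_seen,
                         min_lead_days=90):
    """Among URLs the toolbar saw before the crawler, return the fraction
    that the toolbar saw at least min_lead_days earlier."""
    leads = []
    for url, toolbar_day in toolbar_first_seen.items():
        crawler_day = crawler_first_seen.get(url)
        # Skip URLs the crawler never found or found first (or same day).
        if crawler_day is None or crawler_day <= toolbar_day:
            continue
        leads.append((crawler_day - toolbar_day).days)
    if not leads:
        return 0.0
    return sum(1 for days in leads if days >= min_lead_days) / len(leads)
```

Note that URLs the crawler never finds at all are left out of this calculation, so it measures delay only for pages both sources eventually see.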

Great paper out of Yahoo Research and a great example of how useful behavior data can be. It is using big data to help people help others find what they found.

5 comments:

Anonymous said...

Google has been doing this for 7 years.

Greg Linden said...

Yes, Google doesn't talk about it much, but they appear to have been using toolbar data for a long time. See, for example, my 2008 post, "Google Toolbar data and the actual surfer model". The enormous value of this toolbar data explains why Google pushed so hard on toolbar installations, making expensive deals with, for example, Adobe to get it installed when Flash is installed and on every Dell computer.

I think there are two reasons why this Yahoo paper is important. First, most people who use toolbars probably are not aware of how their data is being used, so there is a press story here (perhaps a positive story on how the Google brain learns from everyone who uses Google, or perhaps a negative story on privacy). Second, many in the search industry, including researchers and executives I've talked to in the past, have doubted the value of browsing behavior data for crawling and relevance rank, and this article might help convince more of those who need convincing.

Srikanta Bedathur said...

A variant of this approach was also discussed in our paper "EverLast: A Distributed Architecture for Preserving the Web" [JCDL 2009]. I think Yahoo's paper quantifies what everyone has always believed -- seed-based crawlers have limited use when it comes to reaching interesting parts of the highly dynamic Web!

Greg Linden said...

Thanks, Srikanta, I realize this is not the first attempt to use browsing data from proxies or other sources, but the scale of the effort is very different here. The most important part of the Yahoo paper is the scale of the browsing data they have. The Yahoo toolbar is widely installed. Yahoo (and a few others) have an enormous amount of data on what people do on the Web, and their paper quantifies how useful that kind of big data on browsing behavior is for web crawling.

Tillirix said...

I wonder whether toolbar penetration is rising or falling. Chrome is gaining share and emphasizes simplicity, which arguably means no toolbars. Naturally, valuable insights can still be drawn even if penetration is low or falling, if you consider that the sample might be biased toward visitors who like toolbars. My hunch is that less sophisticated users may adopt toolbars more heavily, but I could be wrong. Do you have any insights on penetration and biases?