The work has several surprising conclusions. First, and most importantly, the authors argue that "PageRank is quite a poor predictor of traffic ranks for the most popular portion of the Web." They say this is because the basic assumptions of PageRank simply are not true.
Specifically, PageRank assumes every link from a page is followed with equal probability, but their data shows that "a few [links] carry a disproportionate amount of traffic while most carry very little traffic." When they attempted to compensate for this with a version of PageRank they called Weighted PageRank (where the links are weighted based on click traffic), they found it helped only a little.
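To make the distinction concrete, here is a minimal sketch of a click-weighted PageRank in the spirit the post describes: instead of splitting a page's rank equally over its out-links, each link receives rank in proportion to the clicks it carried. This is my own illustration, not the paper's implementation, and all names are hypothetical.

```python
from collections import defaultdict

def weighted_pagerank(click_counts, damping=0.85, iters=50):
    # click_counts: {(source_page, target_page): observed clicks}.
    # Classic PageRank would split a page's rank equally among its
    # out-links; here each link gets rank proportional to its clicks.
    nodes = set()
    out_weight = defaultdict(float)
    for (src, dst), clicks in click_counts.items():
        nodes.update((src, dst))
        out_weight[src] += clicks
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        # Pages with no observed out-clicks spread their rank uniformly.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0.0)
        for v in nodes:
            new[v] += damping * dangling / n
        for (src, dst), clicks in click_counts.items():
            new[dst] += damping * rank[src] * clicks / out_weight[src]
        rank = new
    return rank
```

With `{("a", "b"): 9, ("a", "c"): 1, ...}`, page b ends up ranked well above page c even though both receive exactly one in-link, which is the behavioral skew the paper's data shows and uniform PageRank misses.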
This led the authors to conclude that the other two assumptions of PageRank -- that people are equally likely to end their session on (jump from) any particular page and equally likely to start a new session on (jump to) any particular page -- are also false and problematic. From the paper:
People are much more likely to jump to a few very popular sites than to the great majority of other sites.

People follow many more links from a few very popular hubs than from the great majority of less popular sites.

Some sites are much more likely to be the starting or ending points of surfing sessions.

Finally, the authors suggest that relevance rank should be informed by click data, but note that "such steps are likely to amplify the search bias toward already popular sites." In the talk, an audience member also noted that such steps may be susceptible to click spam, which is even easier to produce than link spam for those wanting to manipulate search results.
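One way to model the skewed session-start behavior described above is to replace PageRank's uniform teleportation vector with a distribution estimated from where sessions actually begin in the click data. A minimal sketch, with names and structure entirely my own rather than the paper's:

```python
def pagerank_with_teleport(links, teleport, damping=0.85, iters=50):
    # links: list of (source, target) pairs.
    # teleport: a probability distribution over pages giving where
    # random jumps land. Classic PageRank uses a uniform vector here;
    # the paper's data suggests session starts are heavily skewed
    # toward a few popular sites.
    nodes = sorted(teleport)
    out = {v: [dst for src, dst in links if src == v] for v in nodes}
    rank = dict(teleport)
    for _ in range(iters):
        new = {v: (1.0 - damping) * teleport[v] for v in nodes}
        for v in nodes:
            if out[v]:
                share = damping * rank[v] / len(out[v])
                for dst in out[v]:
                    new[dst] += share
            else:
                # Dangling pages jump via the teleport vector too.
                for u in nodes:
                    new[u] += damping * rank[v] * teleport[u]
        rank = new
    return rank
```

On a tiny cycle a → b → c → a with a teleport vector heavily favoring a, page a's rank ends up well above uniform, illustrating how a skewed jump-to distribution reshapes the results.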
It is worth pointing out, however, as I have done before, that naive PageRank has been under assault by spammers for many years and is almost certainly no longer used by any of the search engines in its original form, at least not without layers upon layers of effort to eliminate link spam and ferret out whatever meaning remains in the chaos that the link graph has become. As compelling as this paper's conclusions are, it could be that their version of PageRank naively followed links so manipulated by link spammers that the relevance rank was thoroughly confused, producing the poor correlation between PageRank and highly trafficked sites that they saw.
In addition to the thoughts on PageRank, the paper had several other very interesting results. They noted that only 5% of traffic originated from search hosts, "a surprisingly small fraction." They noted that 54% of traffic "does not have a referrer page, meaning that users type the URL directly, click on a bookmark, or click on a link in e-mail", a much higher rate than one might expect. Finally, they noted strong recency and 24-hour trends in the traffic data, saying that "47% of the clicks at any given time are predicted by the clicks from the previous day at the same time" and that, though the clicks from the previous three hours are a strong predictor of clicks for the current hour, after four hours, "the requests from the previous day yield higher precision and recall."
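As a rough illustration of how such a prediction might be scored, one can treat one set of clicked URLs as a prediction of another and compute precision and recall over the overlap. This is a hypothetical sketch of mine; the paper's exact definitions over click streams may well differ.

```python
def precision_recall(predicted_clicks, actual_clicks):
    # Treat one set of clicked URLs as a prediction of another, e.g.
    # yesterday's clicks at this hour predicting the current hour's.
    predicted, actual = set(predicted_clicks), set(actual_clicks)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall
```

For example, if yesterday's clicks were `["a", "b", "c", "d"]` and today's are `["a", "b", "e", "f"]`, both precision and recall come out to 0.5.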
In all, an excellent paper, probably my favorite of the conference. Do not miss it. It is well worth reading.
For more on the thoughts in the paper on using click data for relevance rank, please see also my earlier post, "Actively learning to rank", that discusses an excellent KDD 2007 paper by Filip Radlinski and Thorsten Joachims.
Update: Mark's talk is now available online.