Ho John Lee pointed to a long but truly excellent survey paper on PageRank, "Deeper Inside PageRank" (PDF) by Langville and Meyer.
The 46 page paper not only describes PageRank and twiddles of PageRank in detail, but also it talks about research on optimizing the PageRank computation and generating personalized versions of PageRank. It's a thick, dense paper, a lot of work to plow through, but I found a lot of juicy food for thought in there.
I ended the paper buzzing with questions, primarily around the probabilities of link transitions and the personalization (aka teleportation) vector. If you've got a good understanding of PageRank, I'd appreciate it if you could comment on my thoughts below and let me know if I've gone astray.
On the probabilities of transitioning across a link in the link graph, the paper's example on pp. 338 assumes that surfers are equally likely to click on links anywhere in the page, clearly a questionable assumption. However, at the end of that page, they briefly state that "any suitable probability distribution" can be used instead including one derived from "web usage logs".
Similarly, section 6.2 describes the personalization vector -- the probabilities of jumping to an unconnected page in the graph rather than following a link -- and briefly suggests that this personalization vector could be determined from actual usage data.
In fact, at least to my reading, the paper seems to imply that it would be ideal for both of these -- the probability of following a link and the personalization vector's probability of jumping to a page -- to be based on actual usage data. They seem to suggest that this would yield a PageRank that would be the best estimate of searcher interest in a page.
But, if I have enough usage data to do this, can't I calculate the equivalent PageRank directly? Let's say I have a toolbar (like Alexa, Yahoo, or Google) or ISP logs (like MSN or AOL) that gives me data on everything people visit. Instead of weighting the links in the link graph using the usage data, can I ignore the link graph and rank pages by their likelihood of being visited?
What's the difference between these two calculations? In one, I'm summing over the probability that surfers come over inbound links to find the probability that people will visit the page. In the other, I'm computing that probability directly from who actually visited the page. Using the link graph would seem to be only something to use if you don't have the usage data, an indirect estimate of the relevance of a page that you could calculate directly given enough data.
Now, I am assuming here that the value of a link is entirely determined by how much that link is used. If an unused link on a page does have meaning and should influence the relevance of the linked page, then this all falls apart.
But is this otherwise accurate? With enough data on what pages people visit, could the calculation over the link graph be eliminated or at least reduced?
By the way, don't miss the interesting discussion in the paper on using the personalization vector for personalized search by using a different vector for different groups of people. There are severe scaling issues with this method of search personalization, which can be partially addressed by some of the ideas from the Google Kaltix folks. For more on that, see my previous post, "More on Google personalized search".
[Ho John Lee post via Brian Dennis]
Update: Ho John Lee responded with some comments. Definitely worth reading.