Friday, May 30, 2008

Machines versus humans at Google

A curious revelation from Googler Peter Norvig appears in a recent post by Anand Rajaraman:
[To execute a web search] a subset of documents is identified based on the presence of the user's keywords. Then, these documents are ranked by a very fast algorithm that combines ... 200 [pre-computed] signals in-memory using a proprietary formula.

[This] appears to be made-to-order for machine learning algorithms. Tons of training data (both from usage and from the armies of "raters" employed by Google), and a manageable number of signals (200) -- these fit the supervised learning paradigm well, bringing into play an array of ML algorithms from simple regression methods to Support Vector Machines.

And indeed, Google has tried methods such as these. Peter tells me that their best machine-learned model is now as good as, and sometimes better than, the hand-tuned formula on the results quality metrics that Google uses.

The big surprise is that Google still uses the manually-crafted formula for its search results. They haven't cut over to the machine learned model yet.

Peter suggests two reasons for this. The first is hubris: the human experts who created the algorithm believe they can do better than a machine-learned model. The second reason is more interesting. Google's search team worries that machine-learned models may be susceptible to catastrophic errors on searches that look very different from the training data. They believe the manually crafted model is less susceptible to such catastrophic errors on unforeseen query types.
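The supervised setup described above -- a couple hundred precomputed signals combined into one score -- is easy to sketch. The toy example below is purely illustrative (made-up signals, labels, and a plain least-squares fit, nothing like Google's actual formula): it learns the weights of a linear combination of signals from labeled examples, the "simple regression methods" end of the spectrum.

```python
import random

random.seed(0)

NUM_SIGNALS = 3  # Google reportedly combines ~200; three keeps the sketch small

def relevance_label(signals):
    # Stand-in for a rater judgment or click-derived relevance label.
    return 0.5 * signals[0] + 0.3 * signals[1] + 0.2 * signals[2]

# Training data: (signal vector, label) pairs.
examples = []
for _ in range(500):
    s = [random.random() for _ in range(NUM_SIGNALS)]
    examples.append((s, relevance_label(s)))

# Least-squares fit by stochastic gradient descent.
weights = [0.0] * NUM_SIGNALS
for _ in range(200):
    for s, label in examples:
        err = sum(w * x for w, x in zip(weights, s)) - label
        weights = [w - 0.05 * err * x for w, x in zip(weights, s)]

# weights now approximately recover [0.5, 0.3, 0.2]
```

With noiseless labels the fit is nearly exact; the hard part in practice is not the fitting but, as Peter notes, how the model behaves on queries that look nothing like the training data.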
Update: Anand writes a follow-up post, "How Google Measures Search Quality".

Collective intelligence requires more than voting

Giles Bowkett has an insightful point on voting schemes at sites like Digg:
When you build a system where you get points for the number of people who agree with you, you are building a popularity contest for ideas.

However, your popularity contest for ideas will not be dominated by the people with the best ideas, but the people with the most time to spend on your web site.

Votes appear to be free, like contribution is with Wikipedia, but in reality you have to register to vote, and you have to be there frequently for your votes to make much difference. So the votes aren't really free - they cost time.

If your popularity contest for ideas inherently, by its structure, favors people who waste their own time, then ... the most popular ideas will not be the best ideas ... The people who have the best ideas, and the ability to recognize them, also have better things to do and better places to be.
Let's compare these voting schemes to prediction markets. To function properly, prediction markets must provide incentives both for telling the truth and for gathering knowledge. In prediction markets, even ones using play money, you have to put something at risk to have a chance at a gain, and people usually only take that risk if they think they are likely to succeed.

In most of these voting schemes, you have an unlimited number of votes. There is no incentive to do research -- it takes more time to do that than to vote -- and no cost to voting on something and being wrong.

Simple voting schemes do not create the proper incentives to get good outcomes. As Giles points out, those with knowledge have no more incentive to participate than those without it. In fact, the people with the strongest incentive to participate are most likely those seeking to manipulate the system for some external gain, a problem we see on Digg.
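A toy expected-value calculation makes the difference in incentives concrete. All the numbers here are made up; the point is only that putting something at risk changes who finds it worthwhile to participate:

```python
def expected_payoff(p_correct, reward, stake):
    # Expected gain from one vote or bet: win `reward` with probability
    # p_correct, lose `stake` otherwise.
    return p_correct * reward - (1 - p_correct) * stake

# Digg-style voting: nothing at risk, so even someone voting at random
# loses nothing by piling on votes all day.
free_vote = expected_payoff(0.5, reward=1, stake=0)

# Market-style: a stake makes uninformed participation a losing
# proposition, while knowledgeable participants still profit.
uninformed_bet = expected_payoff(0.5, reward=1, stake=2)
informed_bet = expected_payoff(0.8, reward=1, stake=2)
```

Here the free vote has positive expected value for everyone, the uninformed bet is negative, and only the informed bet pays, which is exactly the filtering that simple voting lacks.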

Please see also my previous post, "Summing collective ignorance".

[Giles post found via Dare Obasanjo]

Marissa on personalized search

Google VP Marissa Mayer had a brief tidbit on personalized search at the Google I/O Conference:
There will be a bigger personalization piece [in core search] -- looking at where users are and [what] they last searched on.

Personalization will be really important in the future.
This reminds me of Marissa's keynote at SES 2007 where she said "personalization is the future" and "search sites will understand more about searchers, where they are located and what their personal preferences are."

Please see also my May 2007 post, "Personalized search yields dramatically better search results", which quotes Marissa as saying, "[Personalization is] one of the biggest relevance advances in the past few years ... [and it] makes results dramatically better."

For more details on Google's personalization efforts, please see also my Dec 2007 post, "Eric Enge interviews Sep Kamvar" and the links from that post.

Thursday, May 29, 2008

Udi on Google search quality

Google VP Udi Manber offers a high level description of what goes into Google's relevance rank in his recent post, "Introduction to Google Search Quality".

Some excerpts:
Ranking is hard ... We need to be able to understand all web pages, written by anyone, for any reason ... We also need to understand the queries people pose [and their needs], which are on average fewer than three words, and map them to our understanding of all documents ... And we have to do all of that in a few milliseconds.

PageRank is still in use today, but it is now a part of a much larger system. Other parts include language models (the ability to handle phrases, synonyms, diacritics, spelling mistakes, and so on), query models (it's not just the language, it's how people use it today), time models (some queries are best answered with a 30-minutes old page, and some are better answered with a page that stood the test of time), and personalized models (not all people want the same thing).

In 2007, we launched more than 450 new improvements, about 9 per week on the average. Some of these improvements are simple and obvious -- for example, we fixed the way Hebrew acronym queries are handled (in Hebrew an acronym is denoted by a (") next to the last character, so IBM will be IB"M), and some are very complicated -- for example, we made significant changes to the PageRank algorithm in January.
Please see also Barry Schwartz's post, "A Deeper Look At Google's Search Quality Efforts", which provides some additional commentary on Udi's post.

Please see also my earlier post, "The perils of tweaking Google by hand", which asks whether these thousands of twiddles to the search engine, and variations of them, should be tested constantly rather than evaluated only once at the time they are created.

Yahoo builds two petabyte PostgreSQL database

James Hamilton writes about Yahoo's "over 2 petabyte repository of user click stream and context data with an update rate of 24 billion events per day".

It apparently is built on top of a modified version of PostgreSQL and runs on about 1k machines. In his post, James speculates on the details of the internals. Very interesting.

Please see also Eric Lai's article in ComputerWorld, "Size matters: Yahoo claims 2-petabyte database is world's biggest, busiest". On that, note that the Google Bigtable paper from 2006 says Bigtable handles "petabytes of data", so the Yahoo claim may depend on what you consider a database.

Wednesday, May 14, 2008

Yahoo, Hadoop, and Pig Latin

Chris Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins from Yahoo have an upcoming paper at SIGMOD 2008, "Pig Latin: A Not-So-Foreign Language for Data Processing" (PDF), that details Yahoo's work to build a powerful parallel processing language on top of Hadoop.

Some excerpts:
We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce.

At a growing number of organizations, innovation revolves around the collection and analysis of enormous data sets such as web crawls, search logs, and click streams ... For example, the engineers who develop search engine ranking algorithms spend much of their time analyzing search logs looking for exploitable trends.

A Pig Latin program is a sequence of steps ... each of which carries out a single data transformation ... Writing a Pig Latin program is similar to specifying a query execution plan ... This method is much more appealing than encoding [a] task as an SQL query, and then coercing the system to choose the desired plan through optimizer hints.

Pig ... is fully implemented and available as ... open-source. [Pig is executed] on Hadoop, an open-source, scalable map-reduce implementation. Pig has an active and growing user base inside Yahoo! and ... [is] beginning to attract users in the broader community.
The paper provides a number of examples of what Pig code looks like and how it executes across a cluster. The related work section of the paper is excellent and should not be missed; it compares Pig, Bigtable, MapReduce, map-reduce-merge, Dryad, and Sawzall.
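To get a feel for that one-transformation-per-step style, here is a loose Python analogue of the paper's running example (filter pages by pagerank, group by category, aggregate). The data and field names are made up, and real Pig compiles these steps into map-reduce jobs on Hadoop rather than running them in memory:

```python
# Hypothetical records; in Pig these would be LOADed from files on HDFS.
records = [
    {"url": "a.com", "category": "news",   "pagerank": 0.9},
    {"url": "b.com", "category": "news",   "pagerank": 0.1},
    {"url": "c.com", "category": "sports", "pagerank": 0.5},
]

# Pig: good = FILTER records BY pagerank > 0.2;
good = [r for r in records if r["pagerank"] > 0.2]

# Pig: groups = GROUP good BY category;
groups = {}
for r in good:
    groups.setdefault(r["category"], []).append(r)

# Pig: FOREACH groups GENERATE category, AVG(good.pagerank);
avg_rank = {cat: sum(r["pagerank"] for r in rs) / len(rs)
            for cat, rs in groups.items()}
```

Each statement is one explicit data transformation, which is the "query execution plan you write yourself" appeal the authors describe.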

Please see also my previous posts on Pig and Yahoo's Hadoop clusters, "Yahoo Pig and Google Sawzall", "Hadoop Summit notes", and "Yahoo deploys large scale Hadoop cluster".

Advertising, search, and drive-by malware

Googlers Niels Provos, Panayiotis Mavrommatis, Moheeb Rajab, and Fabian Monrose have an amusingly titled 2008 tech report, "All Your iFrames Point to Us" (PDF), with frightening results on the sophistication and ubiquity of malware distribution networks.

Some excerpts:
Various browser vulnerabilities [can] automatically download and run -- i.e. unknowingly to the visitor -- [a malware] binary upon visiting a website ... The potential victim base from these so-called drive-by downloads can be far greater than other forms of exploitation because traditional defenses (e.g. firewalls, dynamic addressing, proxies) pose no barriers to infection.

Malware serving networks are composed of tree-like structures ... [that] deliver the malware to the victim after a number of indirection steps ... to lure users .... Our results reveal that ad serving networks are increasingly being used as hops in the malware distribution chain.

About 0.6% of the top million URLs that appeared most frequently in Google's search results led to exposure to malicious activity at some point.

In a set of 2,000 well known advertising networks ... 2% of the landing sites were delivering malware via advertisements ... [often with] more than 6 redirection steps.

1.3% of the incoming search queries to Google's search engine return at least one link to a malicious site [either in the results or in an ad].
The paper goes on to describe the honeypot they used to detect malware, some of the properties of the malware, more detail and examples on the malware distribution networks and how they obscure the malicious final landing site, and how antivirus products fail to detect much of the malware.

Please see also "Ghost turns Zombie: Exploring the Life Cycle of Web-based Malware" (PDF), a fascinating USENIX 2008 paper by some of the same authors that details the "post-infection network behavior of web-based malware". Some of the botnets "demonstrated surprising sophistication" including sending "memory dumps of failed installations" back to the malware developers.

Please see also Niels Provos' post on the Official Google Security Blog and a post by Bruce Schneier, both on the first paper.

Friday, May 09, 2008

Starting Findory: Funding

[This is a continuation of the posts in my Starting Findory series describing my experiences building my first company, Findory.]

I had the wrong strategy with trying to get funding for Findory. I pursued VCs instead of angels.

I should have realized that VCs would never fund Findory. From their point of view, there was just too much risk.

Findory was led by a first-time entrepreneur and was missing critical expertise in building a company. Findory lacked a full team; in particular, there was no finance, marketing, or business development talent. And, Findory had an unproven technology -- a specific technique for personalizing news, search, and advertising -- that was difficult for a VC firm to evaluate.

Had any one of these three been different, there might have been a better chance. A proven founder may be able to get funding with nothing more than an idea on a napkin. A strong team ready to go is an attractive investment for a VC. A proven technology that just needs the wind of capital behind it is an easy bet. But, with all three missing, there was never much of a chance.

Perhaps in the frothier SF Bay Area, this might have been different. Perhaps someone might have taken a gamble. In Seattle, there are only half a dozen VCs doing substantial internet investing; most firms from outside Seattle will not fund without a local firm joining the round.

I did pitch dozens of venture capitalists in and outside of Seattle on Findory. It was a fascinating experience, educational, exciting, and often humbling. But, ultimately, I have to say it was a wasteful distraction from building things to help Findory readers.

I should have focused on angels. Not only would angels have brought the right amount of cash for short-term needs, but also the addition of experienced angels to the board would have provided valuable advice and connections. It would have been just what Findory needed.

I had thought that, having bootstrapped Findory as far as I had, I could skip a step and jump directly to VCs. That was foolish. Findory remained a long way from the point where the momentum would be so obvious that it would overwhelm the other concerns.

Instead of looking to VCs, I should have looked for an experienced angel, someone passionate about personalizing information, who would have given the startup the advice it needed while enjoying the experience of joining with us to grow the company.

Please see also Paul Graham's article, "A Hacker's Guide to Investors".

Please see also Paul Graham's article, "Why There Aren't More Googles", especially his discussion of VCs as tending to be surprisingly "conservative" and "driven by consensus".

Please also see the other posts in the Starting Findory series.

Update: Anand Rajaraman mentions this post and offers some good advice in his Inc Technology article, "Angel, Venture Capital, or Bootstrap?".

Scaling Facebook's databases

Impressive numbers from Facebook on their architecture: 1,800 MySQL servers (900 pairs of master/slave) holding a heavily partitioned data set managed by just 2 DBAs.

Failures are handled by promoting a slave to master and then putting in a new slave quickly. It all must be heavily scripted and automated to be managed by just two people. Very nice.
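As a rough sketch of what that scripting might look like, here is a hypothetical helper that emits the MySQL statements involved in promoting a slave. This is my guess at the general shape, not Facebook's actual tooling; a production script would also verify the slave had caught up on replication, repoint the application, and provision a replacement slave:

```python
def promote_slave(new_master_host):
    # Statements to run on the surviving slave to make it the new master.
    on_new_master = [
        "STOP SLAVE;",
        "RESET MASTER;",
        "SET GLOBAL read_only = 0;",
    ]
    # Statements for any remaining slaves to repoint at the new master.
    on_other_slaves = [
        "CHANGE MASTER TO MASTER_HOST='%s';" % new_master_host,
        "START SLAVE;",
    ]
    return on_new_master, on_other_slaves
```

With 900 master/slave pairs, it is this kind of mechanical, repeatable procedure -- not heroics -- that lets two DBAs keep up.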

Please see also the garbled but tolerable video from the panel, "Scaling MySQL -- Up or Out?", from the 2008 MySQL Conference where these numbers were disclosed.

I have to say, YouTube looked pretty bad sitting up there on the panel, refusing to provide any numbers at the same time Facebook was being so forthcoming. But, Cuong Do's talk last year, "YouTube Scalability", from the Google Scalability Conference has some nice details on YouTube's architecture and lessons learned, though no specific numbers.

Please see also comments on the numbers from James Hamilton and Om Malik.

Netflix-KDD Workshop

The "Second KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition" will be on August 24 in Las Vegas.

Last year's workshop was a gold mine of work on large scale recommender systems using the Netflix data set. It promises to be interesting again this year.

Learning to love customers like you

Michael Schrage at MIT Technology Review writes a fun article, "Recommendation Nation: Learning to love customers like you".

Some excerpts:
Recommendations are everywhere on the Internet.

Pop-ups and context-sensitive advertisements have been supplemented by this low, seductive whisper of automated suggestion. The truth is that I now get more good recommendations about more things, more often, from Bayesian algorithms than from my best friends.

Perhaps this should make me wistful, but it doesn't. Better technology doesn't mean worse friends.

Unlike human recommenders, [these algorithms] never insult me by implying that I spend my time, money, or attention on the wrong things. They're simply making relevant--and occasionally novel--recommendations based on my past choices and the things I attend to in real time.

The focus of digital personalization has shifted from what I am interested in now to what I might be interested in next. All the choices I make in the moment are absorbed into a sphere of suggestion where, after they have been statistically weighted, they are reborn as offers and advice.
The article goes on to talk about several ways recommendations could be improved so they would be more helpful.

I am briefly quoted in the article talking about the early motivation behind building Amazon.com's recommendations.

Wednesday, May 07, 2008

Crawling is harder than it looks

The best paper award at WWW 2008 went to a paper on large-scale crawling titled "IRLbot: Scaling to 6 Billion Pages and Beyond" (PDF) by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov.

I have never been that interested in crawling, so I almost missed this talk. But, I am glad I didn't.

The paper is a fascinating look at the difficulty of doing very large scale web crawling, focusing on avoiding web spam and pathological link structures that could trap a crawler, making sure to not overwhelm crawled webservers, and using high performance techniques for duplicate detection.

Some extended excerpts:
The web has changed significantly since the days of early crawlers, mostly in the area of dynamically generated pages and web spam. With server-side scripts that can create infinite loops, high-density link farms, and unlimited number of hostnames, the task of web crawling has changed from simply doing a BFS scan of the WWW to deciding in real time which sites contain useful information and giving them higher priority as the crawl progresses.

The first performance bottleneck we faced was caused by the complexity of verifying uniqueness of URLs and their compliance with robots.txt. As N scales into many billions ... [prior] algorithms ... no longer keep up with the rate at which new URLs are produced by our crawler (i.e., up to 184K per second) ... [A] new technique called Disk Repository with Update Management (DRUM) ... can store large volumes of arbitrary hashed data on disk and implement very fast check, update, and check+update operations using bucket sort ... DRUM can be thousands of times faster than prior disk-based methods.

In order to determine the legitimacy of a given domain, we use a very simple algorithm based on the number of incoming links from assets that spammers cannot grow to infinity. Our algorithm, which we call Spam Tracking and Avoidance through Reputation (STAR), dynamically allocates the budget of allowable pages for each domain and all of its subdomains in proportion to the number of in-degree links from other domains.

We ran IRLbot on a [single] quad-CPU AMD Opteron 2.6 GHz server (16 GB RAM, 24-disk RAID-5) attached to a 1-gb/s link ... [for a] total active crawling span of 41.27 days. During this time, IRLbot attempted 7,606,109,371 connections and received 7,437,281,300 valid HTTP replies ... IRLbot ended up with N = 6,380,051,942 responses with status code 200 and content-type text/html.

The average download rate during this crawl was 319 mb/s (1,789 pages/s) with the peak 10-minute average rate of 470 mb/s (3,134 pages/s). The crawler received 143 TB of data, out of which 254 GB were robots.txt files, and transmitted 1.8 TB of HTTP requests. The parser processed 161 TB of HTML code.
At very large scale and from a single box, those are pretty remarkable crawling rates, 2k pages/second. The details of the bottlenecks they encountered and how they overcame them made this paper a quite enjoyable read.
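The proportional budgeting behind STAR is simple to sketch. The few lines below convey only the core idea -- a per-domain crawl budget proportional to in-degree from other domains -- not the authors' full algorithm, and the domains and numbers are invented:

```python
def allocate_budgets(in_links, total_budget):
    # in_links: domain -> count of incoming links from *other* domains,
    # the asset the paper argues spammers cannot grow without bound.
    # Returns a page budget per domain, proportional to that in-degree.
    total = sum(in_links.values())
    return {domain: total_budget * n / total
            for domain, n in in_links.items()}

budgets = allocate_budgets({"cnn.com": 90, "spamfarm.biz": 10}, 1000)
```

A link farm can mint unlimited pages and hostnames within its own domains, but it cannot easily mint links from reputable outside domains, so its budget stays small no matter how large its site graph grows.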

I did come away with a question about part of the work. The authors say (in 7.1 and Figure 6) that the probability of finding a unique page never dropped below .11 and, earlier (in 1.4), say that "we believe a good fraction of the 35B URLs not crawled in this experiment [contain] useful content." However, they define a unique page by its URL, so it seems it could easily be the case that those 35B URLs seen but not crawled have content that duplicates or nearly duplicates the 6B pages crawled.

Update: One topic the IRLbot paper ignored was how frequently we should recrawl a page. Another WWW 2008 paper, "Recrawl Scheduling Based on Information Longevity" (PDF) by Chris Olston and Sandeep Pandey, has a nice overview of that issue and then extends the prior work by focusing on the persistence of a web page. The authors end up arguing that it is not worth frequently recrawling pages that change very rapidly because the index can never match the actual content of such pages.

Update: On the question of what content is useful, both in terms of crawling and recrawling, I always find myself wondering how much we should care about pages that never get visited by anyone. Shouldn't our focus in broadening a crawl beyond 6B pages and in recrawling pages in our index depend on how much users seem to be interested in the new content?

Finding the location of interests and objects from search logs

A paper at WWW 2008, "Spatial Variation in Search Engine Queries" (PDF), by Lars Backstrom, Jon Kleinberg, Ravi Kumar, and Jasmine Novak offered many clever examples of using where people are when they do a web search both to determine when interest in a topic is geographically isolated and to estimate the physical location of objects.

The most dramatic example of the latter in the paper was estimating the location of a hurricane over time from queries for the hurricane's name. It worked surprisingly well.
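The simplest version of locating an object from search logs is a count-weighted centroid of where the queries were issued. The paper's actual model is richer -- it fits a center plus a power-law decay of interest with distance -- but a centroid conveys the idea. The coordinates and counts here are invented:

```python
def query_center(observations):
    # observations: list of (lat, lon, query_count) tuples giving where
    # a query was issued and how often. Returns the count-weighted
    # centroid, a crude stand-in for the paper's fitted center.
    total = sum(count for _, _, count in observations)
    lat = sum(la * count for la, _, count in observations) / total
    lon = sum(lo * count for _, lo, count in observations) / total
    return lat, lon

# Queries for a hurricane's name, clustered near the Gulf coast:
center = query_center([(30.0, -90.0, 3), (32.0, -88.0, 1)])
```

Recomputing the center over successive time windows of the query log is what lets a moving object, like a hurricane, be tracked.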

I intended to write much more about this paper, but Danny Sullivan already covered much of what I planned to say in his excellent post, "Yahoo Paper: Finding The Local 'Center' Of Search Queries".

Danny excerpts many of the pictures from the article and comes to the same conclusion that I did: it is a fascinating, motivating paper, but it leaves open the question of how to apply this to improve local search.

Please see also the papers at the LocWeb Workshop from WWW 2008, several of which also looked at using IP addresses in search logs to determine local intent.

Thursday, May 01, 2008

Random walks of the click graph

I recently have been coming across several interesting efforts that use random walks of a click graph for relevance or keyword suggestions in search and advertising.

The basic idea is to build a bipartite graph with queries on one side, documents or ads on the other, and weighted links representing clicks after making the query on the documents or ads. Then, we start at one point of the graph, randomly choose a link to walk across, and keep doing that for several steps, seeing where we end up most frequently.
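Here is a minimal sketch of that walk on a made-up click graph. The queries, ads, and click counts are invented, and real systems walk far larger graphs (and often compute the walk's behavior analytically rather than by sampling), but it shows how two queries that share no words get connected through a shared ad:

```python
import random

random.seed(0)

# Bipartite click graph: query -> {ad: click count}, plus the reverse map.
clicks = {
    "mp3":           {"ipod_ad": 8, "zune_ad": 2},
    "i-tunes":       {"ipod_ad": 9},
    "running shoes": {"nike_ad": 5},
}
reverse = {}
for query, ads in clicks.items():
    for ad, count in ads.items():
        reverse.setdefault(ad, {})[query] = count

def weighted_choice(weights):
    # Pick a neighbor with probability proportional to its click count.
    r = random.uniform(0, sum(weights.values()))
    for node, w in weights.items():
        r -= w
        if r <= 0:
            return node
    return node  # guard against floating-point rounding

def random_walk(start_query, steps=4):
    # Alternate query -> ad -> query along click edges; with an even
    # number of steps the walk ends on a query.
    node, side = start_query, "query"
    for _ in range(steps):
        node = weighted_choice(clicks[node] if side == "query" else reverse[node])
        side = "ad" if side == "query" else "query"
    return node

# Walks from "mp3" frequently pass through the shared ipod_ad and land
# on "i-tunes"; they can never reach the disconnected "running shoes".
endings = [random_walk("mp3") for _ in range(1000)]
```

Counting where the walks end up gives a ranked list of related queries, which is essentially the query-rewriting signal these papers exploit.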

One nice example of this is the WWW 2008 short paper "Simrank++: Query Rewriting through Link Analysis of the Click Graph" (PDF + tech report PDF) by Ioannis Antonellis, Hector Garcia-Molina, and Chi-Chao Chang.

An excerpt from the longer tech report:
We focus on query rewrites based on the recent history of ads displayed and clicked on.

The back-end generates a historical click graph that records the clicks that were generated by ads when a user inputs a given query. The click graph is a weighted bi-partite graph, with queries on one side and ads on the other.

Our techniques identify not only queries that are directly connected by an ad (e.g., users that submit either "mp3" or "i-tunes" click on an ad for "iPod") but also queries that are more indirectly related.

Notice that our query rewriting problem is a type of collaborative filtering (CF) problem. We can view the queries as "users" who are recommending "ads" by clicking on them. When we identify similar queries, we are finding queries that have similar recommendations, just like in CF, where one finds users that have similar tastes.
Another short paper at WWW 2008, this one by Martin Szummer and Nick Craswell, "Behavioral Classification on the Click Graph" (PDF), looked at using random walks of the click graph for classifying adult websites.

An excerpt:
We present a method for exploiting user browsing activity to classify web pages ... as adult (inappropriate for minors) or not.

When a user types a query and then clicks a search result, they create a query-URL association. By logging a large number of click events, the search engine can amass a large number of query-URL pairs. These can be viewed as a bipartite graph, where each query is adjacent to one or more URLs and each URL is adjacent to one or more queries.

Our method exploits the structure of this graph. We implicitly look for clusters of nodes (representing queries or URLs) that share clicks to/from each other. We assume that nodes that cluster well are likely to belong to the same underlying category.

The implicit clustering is done via a Markov random walk on the graph. The walk captures the transitivity of class similarity on the graph: if A is co-clicked with B and B is co-clicked with C, then A is also likely to be related to C.
To me, this had the feel of a click graph-based version of TrustRank (PDF), propagating reputation across a click graph instead of a link graph.

Please see also Nick and Martin's SIGIR 2007 paper, "Random Walks on the Click Graph" (PDF), which looks at using click graph random walks for relevance rank in image search.

Please see also yet another WWW 2008 paper, "Using the Wisdom of the Crowds for Keyword Generation" (PDF), by Ariel Fuxman, Panayiotis Tsaparas, Kannan Achan, and Rakesh Agrawal, which also looks at using random walks of the click graph in order to suggest keywords for advertising.

Size matters? Or simplicity?

An amusing WWW 2008 poster by Joshua Blumenstock, "Size Matters: Word Count as a Measure of Quality on Wikipedia" (PDF), found that a very simple measure of the "quality" of a Wikipedia article, the number of words in it, performed nearly as well as much more complicated classifiers.

Word count with a simple threshold achieved 96.5% accuracy in the classification task. The other more complicated techniques ranged from 81-98% accuracy.
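The resulting "classifier" really is this small. The 2,000-word threshold below is illustrative; the poster fits the actual threshold from labeled featured and non-featured articles:

```python
def classify(article_text, threshold=2000):
    # Blumenstock's finding: a bare word-count threshold separates
    # high-quality ("featured") Wikipedia articles from the rest nearly
    # as well as far fancier classifiers. Threshold here is a stand-in.
    words = len(article_text.split())
    return "featured" if words >= threshold else "ordinary"
```

For example, `classify("word " * 3000)` comes back "featured" while a two-word stub comes back "ordinary"; the whole model is one count and one comparison.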

Joshua is careful in the paper to not "exaggerate the importance of this metric", but I believe there is a lesson here: Try the simple things first. It can often be surprising how well something simple works.