Thursday, February 17, 2011

What I have been reading lately

Here are a few of the articles that have caught my attention recently:
  • Googler and AI guru Peter Norvig on what artificial intelligence has been able to do and what was a surprise ([1])

  • Watson's performance in Jeopardy was impressive, but even Google and Bing get the right answers to Jeopardy questions more often than the average human ([1])

  • The Google Translate app is downright amazing. Almost a babel fish in my ear, just remarkable. ([1])

  • Kicking off a project to rewrite code is a really bad idea, but people want to do it all the time ([1] [2])

  • Excellent answer on why Dropbox succeeded ([1])

  • Dilbert on the morality of using web click data. Ouch! ([1])

  • Old but good Slate article on why we don't have ubiquitous free Wifi. Makes me sad. ([1])

  • Enjoyable rant against stealth mode startups ([1])

  • Internet investment funds and unprofitable companies going public? The dot-com bubble is back, baby! ([1] [2])
More in my shared items microblogging stream or, better yet, if you use Google Reader, search for me and follow my shared items there.

Wednesday, February 16, 2011

Comparing Google Megastore

A small pile of Googlers recently presented a paper, "Megastore: Providing Scalable, Highly Available Storage for Interactive Services" (PDF) that details the architecture of a major distributed database used at Google.

Megastore "has been widely deployed within Google for several years ... handles more than three billion write and 20 billion read transitions daily ... [and] stores nearly a petabyte ... across many datacenters."

Others have already summarized the paper ([1] [2]), so I'll focus on my reaction to it. What I found surprising about Megastore, especially compared to other large scale databases, is that it favors consistency over performance.

For consistency, Megastore provides "full ACID semantics within partitions", "supports two-phase commit across entity groups", guarantees that reads always see the last write, uses Paxos to reach consensus among replicas, and uses distributed locks between "coordinator processes" as part of detecting failures. This is all unusually strong compared to the more casual eventual consistency offered by databases like Amazon's Dynamo, Yahoo's PNUTS, and Facebook's Cassandra.

The problem with providing Megastore's level of consistency is performance. The paper mostly describes Megastore's performance in sunny terms, but, when you look at the details, it does not compare favorably with other databases. Megastore has "average read latencies of tens of milliseconds" and "average write latencies of 100-400 milliseconds". In addition, Megastore has a limit of "a few writes per second per entity group" because higher write rates will cause conflicts, retries, and even worse performance.
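
To make that write limit concrete, here is a toy sketch in Python (my own illustration, not code from the paper) of optimistic writes against a single entity group, where every commit has to claim the next slot in the group's log and conflicting writers retry:

    import random
    import time

    class EntityGroup:
        """Toy model of a Megastore-style entity group: every write must claim
        the next position in the group's single write-ahead log."""
        def __init__(self):
            self.log_position = 0

        def try_commit(self, expected_position):
            # Stand-in for winning the consensus round for the next log slot:
            # the commit succeeds only if no other write landed in the meantime.
            if self.log_position == expected_position:
                self.log_position += 1
                return True
            return False

    def write_with_retries(group, max_attempts=5):
        """Optimistic write: snapshot the log position, do the (slow) commit,
        and start over if another writer got there first."""
        for attempt in range(1, max_attempts + 1):
            snapshot = group.log_position
            time.sleep(random.uniform(0.1, 0.4))  # ~100-400 ms average write latency
            if group.try_commit(snapshot):
                return attempt
        raise RuntimeError("write conflicted too many times")

With each commit taking 100-400 milliseconds and every writer funneled through the same log, a few writes per second per entity group is about all a scheme like this can sustain before conflicts and retries start to dominate.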

By comparison, Facebook expects 4ms reads and 5ms writes on their database, so Megastore's tens-of-millisecond reads are roughly an order of magnitude slower, and its 100-400 millisecond writes one to two orders of magnitude slower, than what Facebook developers appear to be willing to tolerate.

Google application developers seem to find the latency to be a hassle as well. The paper says that Googlers find the latency "tolerable" but often have to "hide write latency from users" and "choose entity groups carefully".
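
As a rough illustration of what "hiding write latency from users" might look like in practice (my own sketch, not anything described in the paper), an application can acknowledge the user immediately and let the slow commit finish in the background:

    import time
    from concurrent.futures import ThreadPoolExecutor

    executor = ThreadPoolExecutor(max_workers=8)

    def slow_commit(entity_group, mutation):
        time.sleep(0.3)  # stand-in for a 100-400 ms Megastore-style commit
        return "committed"

    def handle_user_action(entity_group, mutation):
        # Acknowledge the user right away; the durable write completes later,
        # and any failure has to be retried or surfaced asynchronously.
        future = executor.submit(slow_commit, entity_group, mutation)
        return {"status": "accepted", "pending_write": future}

The catch is that the application now owns the problem of writes that fail after the user has already been told everything worked.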

This makes me wonder if Google has made the right tradeoff here. Is it really easier for application developers to deal with high latency all of the time than to almost always have low latency but have to worry more about consistency issues? Most large scale databases have made the choice the other way. Quite surprising.

Update: Googler DeWitt Clinton writes with a good point: "We build on top of Megastore when we require some of those characteristics (availability, consistency), and to Bigtable directly when we require low latency and high throughput instead. So it's up to the application to decide what tradeoffs to make, definitely not one-size-fits-all."

Thursday, February 03, 2011

Google, Bing, and web browsing data

I suppose I should comment, as everyone else on the planet has, on Google's claim that Bing is copying their results.

My reaction is mostly one of surprise. I am surprised that Google wants this issue discussed in the press. I am surprised that Google wants this aired in the court of public opinion.

Google is trying to draw a line on what use of behavior data is acceptable. Google clearly thinks they are on the right side of that line, and I do too, but I'm not sure the average searcher would agree. And that is why Google is playing a dangerous game here, one that could backfire on them badly.

Let's take a look at what Google Fellow Amit Singhal said:
This experiment confirms our suspicion that Bing is using some combination of:
  • Internet Explorer 8, which can send data to Microsoft via its Suggested Sites feature
  • the Bing Toolbar, which can send data via Microsoft’s Customer Experience Improvement Program
or possibly some other means to send data to Bing on what people search for on Google and the Google search results they click.
Of course, what Amit does not mention here is that the widely installed Google Toolbar sends very similar data back to Google, data about every page someone visits and every click they make. Moreover, Google tracks almost every web search and every click after a web search made by web users around the world, since almost every web search is done on Google.

By raising this issue so publicly, Google is trying to draw a particular line on how toolbar and web browsing data should be used, and that may be a dangerous thing for Google to do. The average searcher may want that line drawn somewhere other than where Google expects: perhaps at not using any toolbar or browser data for any purpose, or even at not using any kind of behavior data at all. If the line ends up somewhere other than where Google wants it, Google could be hurt badly. That is why I am surprised that Google is coming out so strong here.

As for the particular issue of whether this is copying or not, I don't have much to say, but the most thought-provoking piece I have seen on that question is John Langford's post, "User preferences for search engines". John argues that searchers own their browsing behavior and can reveal what they do across the web to whomever they want. Whether or not you agree with that, it is worth reading John's thoughts and considering what you think the alternative might be.

Update: In the comments, Peter Kasting, a Google engineer working on Chrome, denies that Chrome sends clickstream data (the URLs you visit) back to Google. A check of the Google Chrome privacy policy ([1] [2]) appears to confirm that. I apologize for getting that wrong and have corrected the post above.

Update: A few days later, search industry watcher Danny Sullivan writes, "In short, Google doesn’t occupy any higher ground than Microsoft, from what I can see, when it comes to using data gathered from browser add-ons to improve its own services, including its search engine." Whether you think Danny is right or not, his article demonstrates that Google was wrong in thinking they would easily win if they took this fight to the press.

Tuesday, February 01, 2011

YouTube uses Amazon's recommendation algorithm

In a paper at the recent RecSys 2010 conference, "The YouTube Video Recommendation System" (ACM), eleven Googlers describe the system behind YouTube's recommendations and personalization in detail.

The most interesting disclosure in the paper is that YouTube has switched from their old recommendation algorithm based on random walks to a new one based on item-to-item collaborative filtering. Item-to-item collaborative filtering is the algorithm Amazon developed back in 1998. Over a decade later, it appears YouTube found a variation of Amazon's algorithm to be the best for their video recommendations.
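
For readers who have not seen the technique, here is a minimal sketch in Python of the item-to-item collaborative filtering idea (a simplified illustration, not Amazon's or YouTube's production code): build item-to-item similarities from co-occurrence in user histories, then score candidates against what a user has already watched.

    from collections import defaultdict
    from itertools import combinations

    def item_to_item_similarity(user_histories):
        """Count how often pairs of items co-occur in the same user's history,
        then normalize the counts into a cosine-like similarity score."""
        co_counts = defaultdict(lambda: defaultdict(int))
        item_counts = defaultdict(int)
        for items in user_histories.values():
            unique = set(items)
            for item in unique:
                item_counts[item] += 1
            for a, b in combinations(unique, 2):
                co_counts[a][b] += 1
                co_counts[b][a] += 1
        return {
            a: {b: count / (item_counts[a] * item_counts[b]) ** 0.5
                for b, count in neighbors.items()}
            for a, neighbors in co_counts.items()
        }

    def recommend(similarity, watched, top_n=10):
        """Score unseen items by summing their similarity to everything watched."""
        scores = defaultdict(float)
        for item in watched:
            for candidate, sim in similarity.get(item, {}).items():
                if candidate not in watched:
                    scores[candidate] += sim
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    histories = {"u1": ["cats", "dogs"], "u2": ["cats", "dogs", "birds"], "u3": ["dogs", "birds"]}
    print(recommend(item_to_item_similarity(histories), watched={"cats"}))  # ['dogs', 'birds']

YouTube's variation, described in the excerpts below, layers video-specific concerns on top of this core idea: noisy watch signals, freshness, and diversity in the final ranking.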

Other notable tidbits in the paper are what the Googlers have to do to deal with noisy information (noisy video metadata and user preference data), the importance of freshness on videos (much like news), that they primarily used online measures of user satisfaction (like CTR and session length) when comparing different recommendation algorithms against each other and tuning each algorithm, and the overall improvement (about 3x) that recommendations achieved over simply showing popular content.

Some excerpts from the paper:
Recommending interesting and personally relevant videos to [YouTube] users [is] a unique challenge: Videos as they are uploaded by users often have no or very poor metadata. The video corpus size is roughly on the same order of magnitude as the number of active users. Furthermore, videos on YouTube are mostly short form (under 10 minutes in length). User interactions are thus relatively short and noisy ... [unlike] Netflix or Amazon where renting a movie or purchasing an item are very clear declarations of intent. In addition, many of the interesting videos on YouTube have a short life cycle going from upload to viral in the order of days requiring constant freshness of recommendation.

To compute personalized recommendations we combine the related videos association rules with a user's personal activity on the site: This can include both videos that were watched (potentially beyond a certain threshold), as well as videos that were explicitly favorited, “liked”, rated, or added to playlists ... Recommendations ... [are the] related videos ... for each video ... [the user has watched or liked after they are] ranked by ... video quality ... user's unique taste and preferences ... [and filtered] to further increase diversity.

To evaluate recommendation quality we use a combination of different metrics. The primary metrics we consider include click through rate (CTR), long CTR (only counting clicks that led to watches of a substantial fraction of the video), session length, time until first long watch, and recommendation coverage (the fraction of logged in users with recommendations). We use these metrics to both track performance of the system at an ongoing basis as well as for evaluating system changes on live traffic.

Recommendations account for about 60% of all video clicks from the home page ... Co-visitation based recommendation performs at 207% of the baseline Most Viewed page ... [and more than 207% better than] Top Favorited and Top Rated [videos].
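
Picking up on the metrics excerpt above, here is a hedged sketch of how a "long CTR" style metric could be computed from logged recommendation impressions. The 0.5 watch-fraction threshold is my own assumed value; the paper only says "a substantial fraction of the video".

    def long_ctr(impressions, watch_fraction_threshold=0.5):
        """Fraction of impressions whose click led to watching a substantial
        fraction of the video (threshold is an assumed example value)."""
        if not impressions:
            return 0.0
        long_clicks = sum(
            1 for imp in impressions
            if imp["clicked"] and imp["watch_fraction"] >= watch_fraction_threshold
        )
        return long_clicks / len(impressions)

    impressions = [
        {"clicked": True,  "watch_fraction": 0.9},   # counts toward long CTR
        {"clicked": True,  "watch_fraction": 0.1},   # a click, but a short watch
        {"clicked": False, "watch_fraction": 0.0},
    ]
    print(long_ctr(impressions))  # 0.333...

Plain CTR would credit both clicks above; long CTR only credits the one that led to a real watch, which is the sort of metric you would want when tuning recommendations for satisfaction rather than just clicks.
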
For more on the general topic of recommendations and personalization on YouTube, please see my 2009 post, "YouTube needs to entertain".

By the way, it would have been nice if the Googlers had cited the Amazon paper on item-to-item collaborative filtering. Seems like a rather silly mistake in an otherwise excellent paper.

Update: To be clear, this was not intended as an attack on Google in any way. Googlers built on previous work, as they should. What is notable here is that, despite another decade of research on recommender systems, despite all the work in the Netflix Prize, YouTube found that a variant of the old item-to-item collaborative filtering algorithm beat out all others for recommending YouTube videos. That is a very interesting result and one that validates the strengths of that old algorithm.