Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB). Guess which team did better?
Team B got much better results, close to the best results on the Netflix leaderboard! I'm really happy for them, and they're going to tune their algorithm and take a crack at the grand prize.
But the bigger point is, adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set. I'm often surprised that many people in the business, and even in academia, don't realize this.
I do suspect that adding more data from IMDb, Amazon, or some other source will be necessary for anyone to win the Netflix Prize. As I said after working on the data a bit last year, "In my analyses, data simply seemed too sparse in some areas to make any predictions, and supplementing with another data set seemed like the most promising way to fill in the gaps."
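To give a flavor of what "a very simple algorithm plus outside genre data" could look like, here is a toy Python sketch: predict a rating as the movie's average, shifted by how far this user's past ratings in that movie's genre sit above or below those movies' averages. This is my own illustration, not either team's actual method, and the data layout (a list of (user, movie, rating) triples plus a movie-to-genre mapping) is an assumption.

```python
# Toy illustration (not either team's actual method) of a simple predictor
# augmented with outside genre data. Assumes ratings is a list of
# (user, movie, rating) and movie_genre maps movie -> genre.
from collections import defaultdict

def train(ratings, movie_genre):
    movie_sum, movie_cnt = defaultdict(float), defaultdict(int)
    for _, m, r in ratings:
        movie_sum[m] += r
        movie_cnt[m] += 1
    movie_mean = {m: movie_sum[m] / movie_cnt[m] for m in movie_sum}

    # Per (user, genre): how far the user tends to rate above or below the movie mean.
    off_sum, off_cnt = defaultdict(float), defaultdict(int)
    for u, m, r in ratings:
        key = (u, movie_genre.get(m))
        off_sum[key] += r - movie_mean[m]
        off_cnt[key] += 1
    user_genre_offset = {k: off_sum[k] / off_cnt[k] for k in off_sum}
    return movie_mean, user_genre_offset

def predict(user, movie, movie_mean, user_genre_offset, movie_genre, global_mean=3.6):
    # 3.6 is only a rough stand-in prior for movies never seen in training.
    base = movie_mean.get(movie, global_mean)
    return base + user_genre_offset.get((user, movie_genre.get(movie)), 0.0)
```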
Please see also my previous posts, "The advantages of big data and big clusters", "Better understanding through big data", "Netflix Prize enabling recommender research", and "The Netflix Prize and big data".
Update: Yehuda Koren from the currently top-ranked BellKor team swung by and commented:
Our experience with the Netflix data is different.
IMDB data and the likes gives more information only about the movies, not about the users ... The test set is dominated by heavily rated movies (but by sparsely rating users), thus we don't really need this extra information about the movies.
Our experiments clearly show that once you have strong CF models, such extra data is redundant and cannot improve accuracy on the Netflix dataset.
Update: Anand has a follow-up post. Some choice excerpts:
Why not have more data and better algorithms?
Scalable algorithms involve only a fixed number of sequential scans and sorts of data (since large data sets must necessarily reside on disk and not RAM). Most algorithms that require random access to data or take time greater than O(N log N) are not scalable to large data sets. For example, they cannot easily be implemented using methods such as MapReduce.
Thus, choosing a more complex algorithm can close the door to using large data sets, at least at reasonable budgets that don't involve terabytes of RAM.
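As a rough illustration of the scan-and-sort point, here is a small Python sketch that computes per-movie average ratings in a single sequential pass over a hypothetical "user,movie,rating" file, using memory proportional to the number of movies rather than the number of ratings. The file name and format are assumptions; the shape of the computation is the kind that maps cleanly onto MapReduce-style frameworks.

```python
# Minimal sketch of the "fixed number of sequential scans" idea, assuming a
# hypothetical ratings file with one "user,movie,rating" triple per line.
# One pass, memory proportional to the number of movies, data stays on disk.
from collections import defaultdict

def movie_averages(path):
    totals = defaultdict(float)   # movie -> sum of ratings seen so far
    counts = defaultdict(int)     # movie -> number of ratings seen so far
    with open(path) as f:         # single sequential scan over the whole file
        for line in f:
            user, movie, rating = line.rstrip("\n").split(",")
            totals[movie] += float(rating)
            counts[movie] += 1
    return {m: totals[m] / counts[m] for m in totals}
```

By contrast, an algorithm that needs arbitrary random access to the ratings while it iterates either pulls the whole data set into RAM or pays a disk seek per lookup, which is exactly the scalability wall described above.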
In the case of web search, Google made a big leap by adding links and anchor text, which are independent data sets from the text of web pages. In the case of AdWords, the CTR data was an independent data set from the bid data. And Google ... became a domain registrar so they could add even more data about domain ownership and transfers into their ranking scheme. Google has consistently believed in and bet on more data, even while trumpeting the power of their algorithms.
You might think that it won't help much to add more of the same data, because diminishing returns would set in ... In many important cases, adding more of the same data makes a bigger difference than you'd think ... [such as when] the application sees the data embedded in a very-high-dimensional space.
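A back-of-the-envelope simulation hints at why that can be true for ratings data. With a catalog of roughly 18,000 movies (about the size of the Netflix Prize catalog), two users who each rate a few hundred random titles barely overlap, and under this simple random model the overlap that neighbor-based methods depend on grows with the square of the ratings per user, so more of the same data pays off faster than a linear intuition suggests. The numbers below are purely illustrative.

```python
# Purely illustrative simulation of sparsity in a high-dimensional space:
# with ~18,000 movies, two users who each rate k random titles overlap on
# roughly k*k/18000 movies, so overlap grows quadratically in data per user.
import random

def expected_overlap(catalog_size, ratings_per_user, trials=2000):
    total = 0
    for _ in range(trials):
        a = set(random.sample(range(catalog_size), ratings_per_user))
        b = set(random.sample(range(catalog_size), ratings_per_user))
        total += len(a & b)
    return total / trials

for k in (100, 200, 400, 800):
    print(k, "ratings per user ->", round(expected_overlap(18000, k), 1), "movies in common")
```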