I particularly enjoyed Peter's thoughts on the advantages of big data and big clusters. Near the beginning of the talk, Peter said:
"Rather than argue about whether this algorithm is better than that algorithm, all you have to do is get ten times more training data. And now all of a sudden, the worst algorithm ... is performing better than the best algorithm on less training data.
"Worry about the data first before you worry about the algorithm."
Later, near the end of the talk, Peter extended this point:
"Is it just that Google has more data and more machines? But it couldn't be just the more data because [in some competitions] ... everyone got the same data.
"So, I think that having more machines is a very important part because it allows us to turn around the experiments much faster than the other guys. So, it's not the online performance where you are actually doing the search that matters, but it's the -- gee, I have an idea, I think we should change this -- and we can get the answer in two hours which I think is a big advantage over someone else who takes two days. And, I think it also helps that we took an engineering approach of -- well, we'll try anything."
Amazon was similar in many ways. While Amazon did not have Google's mighty parallel processing tools and massive cluster, Amazon did have big data (transactional and log data), which it used extensively for website-visible features like personalization and search query refinements and for backend work like supply chain optimizations. In addition, Amazon was very early, if not the first, to do website A/B tests, a framework for rapidly testing new algorithms and designs live on Amazon.com, which encouraged behavior like Google's "try anything" engineering approach.
I find Peter's words particularly interesting when thinking about the Netflix contest. Netflix may be demonstrating how to do a Google-like experimental effort if you do not have Google-scale resources. The Netflix contest uses other people's machine resources and the power of many minds "trying anything" to attempt to find improvements to the Netflix recommender system.
See also notes by Brian Mingus on what appears to be the same talk a few weeks later at U of Colorado at Boulder.
[Ionut post found via Philipp Lenssen]