Googler Peter Norvig gave a talk at industry day at CIKM 2008 that, despite my fascination with all things Peter Norvig, almost frightened me off by including the phrase "the Ultimate Agile Development Tool" in its title.
The talk redeemed itself in the first couple of minutes, citing Steve Yegge's "Good Agile, Bad Agile" and making it clear that Peter meant being agile more than Agile.
His core point was that "code is a liability". Relying on data over code as much as possible allows simpler code that is more flexible, adaptive, and robust.
In one of several examples, Peter put up a slide showing an excerpt from a rule-based spelling corrector. The snippet of code, which was just part of a much larger program, contained a set of case and if statements, nearly impossible to understand let alone verify, that represented rules for spelling correction in English. He then put up a slide containing a Python program of just a few lines for statistical spelling correction that, given a large data file of documents, learns the likelihood of seeing words and corrects misspellings to their most likely alternative. This version, he said, not only has the benefit of being simple, but also can easily be used in different languages.
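To give a feel for how little code the statistical approach needs, here is a minimal sketch in that spirit. It is not the code from Peter's slide. The tiny inline `CORPUS`, the helper names (`words`, `edits1`, `known`, `correct`), and the choice to consider only candidates one edit away are all my own assumptions for illustration; a real version would learn its counts from a large file of documents.

```python
import re
from collections import Counter

# Hypothetical stand-in corpus; a real corrector would read a large text file.
CORPUS = """
the quick brown fox jumps over the lazy dog
the spelling corrector learns word counts from the text
a corrector that learns from data needs no spelling rules
"""

def words(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

# All of the "knowledge" lives in this table of word counts.
COUNTS = Counter(words(CORPUS))
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """Every string one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def known(candidates):
    """Keep only the candidates that were seen in the corpus."""
    return {w for w in candidates if w in COUNTS}

def correct(word):
    """Return the most frequent known word at edit distance 0 or 1, else the word itself."""
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: COUNTS[w])

print(correct("speling"))  # a known word one insert away
print(correct("teh"))      # a known word one transposition away
```

Swapping in a corpus from another language changes `COUNTS` and nothing else (apart from widening `LETTERS` and the word pattern for non-English alphabets), which is the portability point Peter was making.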
For another example, Peter pulled from Jing et al., "Canonical Image Selection from the Web" (ACM), which uses a clever representation of the features of an image, a huge image database, and clustering of images with similar features to find the most representative image of, for example, the Mona Lisa on a search for [mona lisa].
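The general idea of picking the image most similar to everything else in the result set can be sketched in a few lines. This is only a toy under heavy assumptions, not the method from the paper: the three-number feature vectors, the image names, the use of cosine similarity, and the `most_representative` helper are all made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_representative(features):
    """Return the name whose vector has the highest total similarity to all the others."""
    def total_similarity(name):
        return sum(
            cosine(features[name], vec)
            for other, vec in features.items()
            if other != name
        )
    return max(features, key=total_similarity)

# Hypothetical feature vectors for the results of one image search.
# Three near-duplicates of the painting, plus two outliers.
IMAGES = {
    "painting_a": [0.9, 0.1, 0.0],
    "painting_b": [0.8, 0.2, 0.1],
    "painting_c": [0.85, 0.15, 0.05],
    "parody": [0.1, 0.9, 0.3],
    "logo": [0.0, 0.2, 0.9],
}

print(most_representative(IMAGES))
```

Because the near-duplicates reinforce each other, the winner comes from the dense group of paintings rather than from the parody or the logo. No rule anywhere says what the Mona Lisa looks like; the data decides.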
Peter went on to say that more data seems to help in many problems more than complicated algorithms. More data can hit diminishing returns at some point, but the point seems to be fairly far out for many problems, so keeping it simple while processing as much data as possible often seems to work best. Google's work in statistical machine translation works this way, he said, primarily using the correlations discovered between the words in different languages in a training set of 18B documents.
The talk was one of the ones recorded by videolectures.net and should appear there in a week. If you cannot wait, the CIKM talk was similar to Peter's startup school talk from a few months ago, so you could use that as a substitute.