Googler Tushar Chandra recently gave a talk at the LADIS 2010 workshop on "Sibyl: A system for large scale machine learning" (slides available as PDF), which discusses a system Google has built to do large-scale machine learning on top of MapReduce and GFS.
This system probably is just one of many inside the secretive search giant, but we don't often get these kinds of peeks inside. What I found most interesting was the frank discussion of the problems the Google team encountered and how they overcame them.
In particular, the Googlers talk about how building on top of a system intended for batch log processing caused some difficulties, which they overcame by using a lot of local memory and being careful about how they arranged and moved data. Even so, the last few slides mention how they kept causing localized network and GFS master brownouts, impacting other work on the cluster.
That last problem seems to come up again and again in cloud computing systems. That pesky network is a scarce, shared resource, and it often takes a network brownout to remind us that virtual machines are not all it takes to get everyone playing nice.
On a related topic, please see my earlier post, "GFS and its evolution", and its discussion of the pain Google hit when trying to put other interactive workloads on top of GFS. And, if you're interested in Google's work here, you might also be interested in the open source Mahout, which is a suite of machine learning algorithms mostly intended for running on top of Hadoop clusters.