The details of setting this up are available on the "AmazonEC2" page of the Lucene-Hadoop Wiki at Apache.org.
While looking for more on this, I noticed that Powerset, the hyped but not yet launched natural language search engine, appears to be leading the charge on using Hadoop on EC2. From the Hadoop mailing list:
From: Gian Lorenzo Thione <thi...@powerset.com>
Date: Fri, 25 Aug 2006 23:04:16 GMT
At Powerset we have used EC2 and Hadoop with a large number of nodes, successfully running Map/Reduce computations and HDFS. Pretty much like you describe, we use HDFS for intermediate results and caching, and periodically extract data to our local network. We are not really using S3 at the moment for persistent storage.
A nice feature of Hadoop as measured against our use of EC2 has been the capability of fluidly changing the number of instances that are part of the cluster. Our instances are set up to join the cluster and the DFS as soon as they are activated and when - for any reason - we lose those machines, the overall process doesn't suffer. We have been quite happy with this, even at significant number of instances.
That is an interesting detail on the recent announcement that Powerset is a heavy user of Amazon's EC2.
I am not sure I have an immediate use for Hadoop on EC2, but it is nice to see. Developers may now be able to rapidly bring up hundreds of servers, run a massively parallel computation on them using Hadoop's MapReduce implementation, and then shut down all the instances, all with low effort and at low cost. Very cool.
[Wiki node found via John Krystynak]
Update: Eight months later, Tom White posts a tutorial, "Running Hadoop MapReduce on Amazon EC2 and Amazon S3". [Found via Todd Huff]