Amazon recently launched Elastic MapReduce, a web service that lets people run MapReduce jobs on Amazon's cluster.
Elastic MapReduce appears to handle almost all the details for you. You upload data to S3, then run a MapReduce job. All the work of firing up EC2 instances, getting Hadoop on them, getting the data out of S3, and putting the results back in S3 appears to be done for you. Pretty cool.
Even so, I have a big problem with this new service: the pricing. MapReduce jobs are batch jobs that could run at idle times on the cluster, but there appears to be no effort to run them during idle times, nor is there any discount on the pricing. In fact, you actually pay a premium for MapReduce jobs above the cost of the EC2 instances used during the job.
It is a huge missed opportunity. Smoothing out peaks and troughs in cluster load improves efficiency. Using the idle time of machines in Amazon's EC2 cluster should be essentially free. The hardware and infrastructure costs are all sunk. At non-peak times, the only true cost is the marginal cost of the additional electricity a busy box uses over an idle box.
What Amazon should be doing is offering a steep discount on EC2 pricing for interruptible batch jobs like MapReduce jobs, then running those jobs only in the idle capacity of non-peak times. This would allow Amazon to smooth the load on their cluster and improve utilization while passing on the savings to others.
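To make the idea concrete, here is a minimal sketch of filling idle capacity with interruptible batch work. Everything in it is hypothetical: the function name schedule_batch, the capacity of 100 instances, and the demand numbers are made up for illustration and say nothing about how Amazon actually schedules anything.

```python
def schedule_batch(demand, capacity, batch_hours):
    """Greedily place batch work into whatever capacity is idle each hour.

    demand      -- instances used by regular (non-batch) customers, per hour
    capacity    -- total instances in the (hypothetical) cluster
    batch_hours -- instance-hours of interruptible batch work waiting to run
    Returns (per-hour batch allocation, instance-hours still waiting).
    """
    allocation = []
    remaining = batch_hours
    for used in demand:
        idle = max(capacity - used, 0)   # batch work never displaces regular load
        take = min(idle, remaining)      # run only as much as is still waiting
        allocation.append(take)
        remaining -= take
    return allocation, remaining


# Made-up numbers: a 100-instance cluster over four hours.
demand = [90, 40, 20, 70]
allocation, remaining = schedule_batch(demand, capacity=100, batch_hours=150)
```

In this toy case the batch work lands almost entirely in the two quiet hours, and the peak hour is barely touched. A real scheduler would also have to handle preemption when regular demand spikes, which is why the discount would apply to interruptible jobs.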
For more on this topic, please see also my Jan 2007 post, "I want differential pricing for Amazon EC2".
Please see also Amazon VP James Hamilton's recent post, "Resource Consumption Shaping", which also talks about smoothing load on a cluster. Note that James argues that the marginal cost of making an idle box busy is near zero because of the way power and network use are billed (at the 95th percentile).
For some history on past efforts to run Hadoop on EC2, please see my Nov 2006 post, "Hadoop on Amazon EC2".
Update: Eight months later, Amazon launches differential pricing for EC2.