As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:
1. A giant step backward in the programming paradigm for large-scale data intensive applications
2. A sub-optimal implementation, in that it uses brute force instead of indexing
3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
4. Missing most of the features that are routinely included in current DBMS
5. Incompatible with all of the tools DBMS users have come to depend on
The comments on the post are enjoyable and useful. Many rightfully point out that it might not be fair to compare a system like MapReduce to a full database. DeWitt and Stonebraker do partially address this, though, by not just limiting their criticism to GFS, but also going after BigTable.
The most compelling part of the post, for me, is their argument that some algorithms require random access to data, which GFS does not support well, and that it is not always easy or efficient to restructure those algorithms to rely primarily on sequential scans.
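To make that contrast concrete, here is a minimal sketch, with invented data and names, of the difference between an indexed point lookup and the full sequential scan that a scan-oriented system forces the same question into:

```python
# Hypothetical illustration of the trade-off described above; the data and
# names are invented, not drawn from Google's systems.
records = [("alice", 3), ("bob", 7), ("carol", 2)]  # stand-in for billions of rows on disk

# With an index, as a DBMS would provide, a point lookup touches one entry:
index = {key: value for key, value in records}
count_for_bob = index["bob"]

# Without an index, the same question becomes a full sequential scan,
# which is the access pattern a MapReduce job over GFS is built around:
count_for_bob_by_scan = next(value for key, value in records if key == "bob")
```

For a one-off analytical pass over all the data, the scan is exactly what you want; the complaint is about algorithms that need many such lookups.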
However, as the slides from one of Google's talks on MapReduce say, "MapReduce isn't the greatest at iterated computation, but still helps run the 'heavy lifting'" (slide 33, lecture 5). And those slides in both lectures 4 and 5 give examples of how iterated algorithms like clustering and PageRank can be implemented in a MapReduce framework.
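I have not reproduced the slides' code here, but a rough sketch of how a single PageRank iteration can be expressed as map and reduce steps might look like the following; the record layout (page, rank, outgoing links) and all names are my assumptions, not Google's:

```python
# Rough sketch of one PageRank iteration in map/reduce form. The input format
# (page, rank, out_links) and the simplified damping term are assumptions for
# illustration, not taken from Google's slides or code.
DAMPING = 0.85

def pagerank_map(page, rank, out_links):
    # Re-emit the link structure so the next iteration still has it,
    # and send a share of this page's rank to every page it links to.
    yield page, ("links", out_links)
    if out_links:
        share = rank / len(out_links)
        for target in out_links:
            yield target, ("share", share)

def pagerank_reduce(page, values):
    # Sum the incoming rank shares, apply damping, and pass the links through.
    out_links, total = [], 0.0
    for kind, payload in values:
        if kind == "links":
            out_links = payload
        else:
            total += payload
    yield page, ((1 - DAMPING) + DAMPING * total, out_links)
```

Each pass over the link graph is one MapReduce job; the "iterated computation" the slides are talking about is just running that job repeatedly until the ranks stop changing much.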
Moreover, given the heavy usage of MapReduce inside Google, it is clear that the class of computations that can be done reasonably efficiently in MapReduce (in both programmer time and computation) is quite substantial. It seems hard to argue that MapReduce is not supporting many of Google's needs for large-scale computation.
If the argument instead is that Google's needs are specialized, and that others may find their computations more difficult to implement in a MapReduce framework, then DeWitt and Stonebraker have a stronger point.
[Found via François Schiettecatte]
Update: Good rebuttal from Googler Mark Chu-Carroll. [Found via Dare Obasanjo]
Update: I finally got around to reading Stonebraker et al., "The End of an Architectural Era" (PDF) from VLDB 2007. The article is mostly about Stonebraker's H-store database, but I cannot help but notice that many of the points the authors make against RDBMSs would seem to undermine the new claim that MapReduce is "a giant step backward."
For example, Stonebraker et al. write, "RDBMSs can be beaten handily ... [by] a specialized engine." They predict that "the next decade will bring domination by shared-nothing computer systems ... [and] DBMS should be optimized for this configuration .... We are heading toward a world ... [of] specialized engines and the death of the 'one size fits all' legacy systems." Finally, they write "SQL is not the answer" and argue for a system where the computation is done on the nodes near the data.
While the authors also make some more specific statements that do not apply quite as well to GFS/MapReduce/BigTable -- for example, that we will have "a grid of systems with main memory storage, built-in high availability, no user stalls, and useful transaction work in under 1 millisecond" -- I do not understand why they do not see MapReduce as a wonderful example of a specialized data store that runs over a massive shared-nothing system and keeps computation near the data.
Update: Stonebraker and DeWitt follow up in a new post, "MapReduce II". The new post can be summarized by this line from it: "We believe it is possible to build a version of MapReduce with more functionality and better performance." That almost certainly is true, but no one has built one yet, at least not at the scale at which Google is operating.