Hadoop is the open-source counterpart of Google's GFS and MapReduce. Yahoo is pushing a lot of the development of Hadoop.
The notes cover talks on Hadoop, Yahoo's Pig, and Microsoft's Dryad, among others.
James' notes are excellent, long, and detailed. I found several noteworthy tidbits in there.
Most of the Hadoop clusters are small, 2k nodes or fewer, often just tens of nodes. Yahoo has plans to expand to 5k.
They have hit problems with scheduling jobs -- "FIFO scheduling doesn't scale for large, diverse user bases", "We're investing heavily in scheduling to handle more concurrent jobs" -- but an attempt to deal with this by breaking one cluster into many virtual clusters called Hadoop on Demand is not working well -- "HoD scheduling implementation has hit the wall ... HoD was a good short term solution but not adequate for current usage levels. It's not able to handle the large concurrent job traffic Yahoo! is currently experiencing."
Joins are "hard to write in Hadoop" and even harder to optimize, which is part of the motivation for the higher level Pig language built on top of Hadoop. While the Pig team argues that their language is simpler and easier to use than SQL, they do have plans to write an SQL-like processing layer on top of Pig.
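To see why joins are awkward to hand-code over MapReduce, here is a minimal sketch of a reduce-side equi-join. The tables, fields, and tags are my own illustrative assumptions, not examples from the talks; the point is that the tagging, grouping, and crossing are all manual work that a higher-level language like Pig can generate for you.

```python
from collections import defaultdict

# Two toy input tables (illustrative, not from the talks).
users = [(1, "alice"), (2, "bob")]              # (user_id, name)
orders = [(1, "book"), (1, "pen"), (2, "mug")]  # (user_id, item)

# Map phase: tag each record with its source table so the
# reducer can tell the two sides apart.
mapped = [(uid, ("U", name)) for uid, name in users] + \
         [(uid, ("O", item)) for uid, item in orders]

# Shuffle phase: group all tagged records by the join key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: for each key, cross the user side with the order side.
joined = []
for uid, values in groups.items():
    names = [v for tag, v in values if tag == "U"]
    items = [v for tag, v in values if tag == "O"]
    joined.extend((uid, n, i) for n in names for i in items)

print(sorted(joined))
# → [(1, 'alice', 'book'), (1, 'alice', 'pen'), (2, 'bob', 'mug')]
```

Every step that a SQL engine would plan automatically is spelled out by hand here, and optimizations (e.g., a map-side join when one table fits in memory) require rewriting the whole job.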
On the topic of joins, there apparently was a brief discussion comparing Hadoop/MapReduce data processing to more traditional databases. The primary difference mentioned was that databases have indexes, where Hadoop and MapReduce have to create the equivalent of those indexes ad hoc for each job. While that is not a new point, considering it again makes me wonder if there might be a middle ground here where we retain older extracts and other intermediate results and reuse them for similar computations over the same data. That would have a similar effect to an index, but with the advantages of being built on demand and targeted to the current workloads.
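The middle ground I have in mind might look something like the sketch below: cache a derived extract keyed by the source data's version and the extraction logic, so the first job pays the full scan and later jobs over the same data reuse the result much like an index. Everything here, the names, the cache, and the versioning scheme, is a hypothetical illustration of the idea, not an actual Hadoop feature.

```python
import hashlib

# Hypothetical in-memory cache of intermediate extracts, keyed by
# (data version, extract name). In a real system this would be a
# persistent store shared across jobs.
cache = {}

def cache_key(data_version: str, extract_name: str) -> str:
    return hashlib.sha1(f"{data_version}:{extract_name}".encode()).hexdigest()

def extract(records, data_version, extract_name, fn):
    """Return the named extract, computing it only on a cache miss."""
    key = cache_key(data_version, extract_name)
    if key not in cache:          # first job pays the scan cost
        cache[key] = fn(records)  # later jobs reuse it like an index
    return cache[key]

# Toy request log: (url, http_status).
logs = [("/home", 200), ("/cart", 500), ("/home", 200)]

errors = extract(logs, "v1", "errors",
                 lambda rs: [r for r in rs if r[1] >= 500])
errors_again = extract(logs, "v1", "errors",
                       lambda rs: [r for r in rs if r[1] >= 500])

print(errors)                  # → [('/cart', 500)]
print(errors is errors_again)  # → True (second call hit the cache)
```

Unlike a conventional index, nothing is built up front: extracts appear only for computations that actually run, and bumping the data version naturally invalidates stale results.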
If there are other tidbits you found of interest in the notes (or you attended the conference and found other things to be of interest), please add a comment to this post!
Update: James adds a summary post with a few more thoughts.
Update: As Doug Cutting and Chad Walters point out in the comments, a 2k node cluster easily could have 10k cores, so I struck my statement that only having 2k nodes seems to be in conflict with Yahoo's earlier announcement.