A small pile of Googlers recently presented a paper, "Megastore: Providing Scalable, Highly Available Storage for Interactive Services" (PDF) that details the architecture of a major distributed database used at Google.
Megastore "has been widely deployed within Google for several years ... handles more than three billion write and 20 billion read transitions daily ... [and] stores nearly a petabyte ... across many datacenters."
Others have already summarized the paper ( ), so I'll focus on my reaction to it. What I found surprising about Megastore, especially when comparing to other large scale databases, is that it favors consistency over performance.
For consistency, Megastore provides "full ACID semantics within partitions", "supports two-phase commit across entity groups", guarantees that reads always see the last write, uses Paxos for confirming consensus among replicas, and uses distributed locks between "coordinator processes" as part of detecting failures. This is all unusually strong compared to the more casual eventual consistency offered by databases like Amazon's Dynamo, Yahoo's HBase, and Facebook's Cassandra.
The problem with providing Megastore's level of consistency is performance. The paper mostly describes Megastore's performance in sunny terms, but, when you look at the details, it does not compare favorably with other databases. Megastore has "average read latencies of tens of milliseconds" and "average write latencies of 100-400 milliseconds". In addition, Megastore has a limit of "a few writes per second per entity group" because higher write rates will cause conflicts, retries, and even worse performance.
By comparison, Facebook expects 4ms reads and 5ms writes on their database, so Megastore is an order of magnitude or two slower than what Facebook developers appear to be willing to tolerate.
Google application developers seem to find the latency to be a hassle as well. The paper says that Googlers find the latency "tolerable" but often have to "hide write latency from users" and "choose entity groups carefully".
This makes me wonder if Google has made the right tradeoff here. Is it really easier for application developers to deal with high latency all of the time than to almost always have low latency but have to worry more about consistency issues? Most large scale databases have made the choice the other way. Quite surprising.
Update: Googler DeWitt Clinton writes with a good point: "We build on top of Megastore when we require some of those characteristics (availability, consistency), and to Bigtable directly when we require low latency and high throughput instead. So it's up to the application to decide what tradeoffs to make, definitely not one-size-fits-all."