Saturday, July 07, 2007

Stonebraker on fast databases

Very interesting interview with database guru Michael Stonebraker in the May ACM Queue. Some excerpts:
Data warehouses are getting positively gigantic. It's very hard to run ad hoc queries against 20 terabytes of data and get an answer back anytime soon. The data warehouse market is one where we can get between one- and two-orders-of-magnitude performance improvements.

Stream processing ... a feed comes out of the wall and you run it through a workflow to normalize the symbols, clean up the data, discard the outliers, and then compute some sort of secret sauce .... This is a fire hose of data ... A specialized architecture can just clobber the relational elephants in this market.

None of the big text vendors, such as Google and Yahoo, use databases; they never have. They didn't start there, because the relational databases were too slow from the get-go. Those guys have all written their own engines.

In scientific and intelligence databases ... if you have array data and use special-purpose technology that knows about arrays, you can clobber a system in which tables are used to simulate arrays.

[For] Wall Street .... it's basically a latency arms race. If your infrastructure was built with one-second latency, it's just impossible to continue, because if the people arbitraging against you have less latency than you do, you lose. A lot of the legacy infrastructures weren't built for sub-millisecond latency.

Let's say you have an architecture where you process the data from the wire and then use your favorite messaging middleware to send it to the next machine, where you clean the data. People just line up software architectures with a bunch of steps, often on separate machines, and often on separate processes. And they just get clobbered by latency.

Is the relational model going to make it? In semi-structured data, it's already obvious that it's not ... Data warehouses ... are better modeled as entity relationships rather than in a relational model.

Both the programming language interface and the data model can be thrown up in the air. We aren't in 1970. It's 37 years later, and we should rethink what we're trying to accomplish and what are the right paradigms to do it.
[Interview found via Werner Vogels]
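
To make the latency point concrete, here is a rough back-of-the-envelope sketch of a multi-step pipeline where every step sits on its own machine. The stage names, the 50 microseconds of work per stage, and the 500 microseconds per middleware hop are all made-up assumptions for illustration, not numbers from the interview.

```python
# Hypothetical figures for illustration only; none come from the interview.
STAGES = ["read_wire", "normalize", "clean", "compute"]
WORK_US = 50    # assumed processing time per stage, in microseconds
HOP_US = 500    # assumed cost of one messaging-middleware hop between machines


def pipeline_latency_us(stages, work_us, hop_us):
    """Latency for one message: per-stage work plus one hop between adjacent stages."""
    hops = max(len(stages) - 1, 0)
    return len(stages) * work_us + hops * hop_us


# Each stage on its own machine, connected by middleware.
separate_machines = pipeline_latency_us(STAGES, WORK_US, HOP_US)

# The same stages run back to back in one process, so no hops.
single_process = pipeline_latency_us(STAGES, WORK_US, 0)

print(f"separate machines: {separate_machines} us")
print(f"single process:    {single_process} us")
```

Under these assumed numbers the hops, not the actual work, account for most of the end-to-end latency, which is the kind of overhead Stonebraker is describing.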

5 comments:

Unknown said...

Hi!

While I agree with much of what he is saying, I think he hasn't looked much at Yahoo and Google. Both use a lot of MySQL.

I also read a lot of "come buy Vertica" in that message :)

Cheers,
-Brian

Steve Jenson said...

Hi Brian,

Stonebraker specifically points to Yahoo and Google as examples of companies that don't use relational databases.

If Google thought so highly of MySQL, they probably wouldn't have bothered building BigTable.

Many of their SQL-based systems were moved to BigTable, including Blogger (which I worked on). I wouldn't be surprised if there were no MySQL at Google within a few years.

Toby DiPasquale said...

Brian: Both use a lot of MySQL for things that MySQL is good for, i.e. transaction processing. They don't, however, use MySQL for anything related to search, spellcheck, language translation, personalization, etc.

As far as I understand it, MySQL is used for basically two things: prototyping and real-money transaction processing. Everything else is GFS, MapReduce, BigTable, etc. Yahoo! has 19 people full-time on Hadoop so they can get some of the same cost advantages that GOOG has w/r/t said infrastructure components. And Microsoft is hard at work trying to NIH their way into the same thing with Boxwood and Dryad (already in use by the adCenter Labs guys, I hear).

Of course, I'm sure he would also not mind you using Vertica, but at $25K/TB of data, it's a little on the expensive side compared with MySQL. Or Hadoop, for that matter...

Anonymous said...

That interview was a great find. Some food for thought on embedded databases.

Siva Gudavalli said...

That was great stuff. The point you are making is absolutely true. In the future, relational databases are not going to meet Business Intelligence requirements. We should think outside the box and look for alternatives.