Data warehouses are getting positively gigantic. It's very hard to run ad hoc queries against 20 terabytes of data and get an answer back anytime soon. The data warehouse market is one where we can get between one- and two-orders-of-magnitude performance improvements.[Interview found via Werner Vogels]
Stream processing ... a feed comes out of the wall and you run it through a workflow to normalize the symbols, clean up the data, discard the outliers, and then compute some sort of secret sauce .... This is a fire hose of data ... A specialized architecture can just clobber the relational elephants in this market.
None of the big text vendors, such as Google and Yahoo, use databases; they never have. They didn't start there, because the relational databases were too slow from the get-go. Those guys have all written their own engines.
In scientific and intelligence databases ... if you have array data and use special-purpose technology that knows about arrays, you can clobber a system in which tables are used to simulate arrays.
[For] Wall Street .... it's basically a latency arms race. If your infrastructure was built with one-second latency, it's just impossible to continue, because if the people arbitraging against you have less latency than you do, you lose. A lot of the legacy infrastructures weren't built for sub-millisecond latency.
Let's say you have an architecture where you process the data from the wire and then use your favorite messaging middleware to send it to the next machine, where you clean the data. People just line up software architectures with a bunch of steps, often on separate machines, and often on separate processes. And they just get clobbered by latency.
Is the relational model going to make it? In semi-structured data, it's already obvious that it's not ... Data warehouses ... are better modeled as entity relationships rather than in a relational model.
Both the programming language interface and the data model can be thrown up in the air. We aren't in 1970. It's 37 years later, and we should rethink what we're trying to accomplish and what are the right paradigms to do it.
Saturday, July 07, 2007
Stonebraker on fast databases
Very interesting interview with database guru Michael Stonebraker in the May ACM Queue. Some excerpts:
Posted by Greg Linden at 8:12 AM