Friday, August 22, 2008

Column versus row stores

Daniel Abadi, Samuel Madden, and Nabil Hachem had a paper at SIGMOD 2008, "Column-Stores vs. Row-Stores: How Different Are They Really?" (PDF), with a fascinating discussion of what optimizations could be implemented in a traditional row store database to make it behave like a column store.

The paper attempts to answer the question of whether there "is something fundamental about the way column-oriented DBMSs are internally architected" that yields performance gains or whether the column-oriented optimizations can be mimicked in a row-store design.

Specifically, the authors look at taking a row-store and vertically partitioning the rows, indexing all the columns, creating materialized views of the data (just the needed columns for a query already precomputed), compressing the data, and a couple variations on late materialization when joining columns that significantly impact performance.

Despite strong statements in the abstract and introduction that "it is not possible for a row-store to achieve some of the performance advantages of a column-store" and that "there is in fact something fundamental about the design of column-store systems that makes them better suited to data-warehousing workloads", the discussion and conclusion further into the paper are far murkier. For example, the conclusion states that it "is not that simulating a column-store in a row-store is impossible", just that attempting to fully simulate a column-store is difficult in "today's row store systems".

The murkiness seems to come from the difficulty the authors had forcing a row-store to do some of the query plan optimizations a column-store does once they had jiggered all the underlying data to simulate a column-store.

For example, when discussing "index-only plans" (which basically means throwing indexes on all the columns of a row-store), the row-store used slow hash-joins to combine indexed columns and they "couldn't find a way to force [the system] to defer these joins until later in the plan, which would have made the performance of this approach closer to vertical partitioning".

Later, the authors say that they couldn't "trick" the row-store into storing columns "in sorted order and then using a merge join", so it instead did "relatively expensive hash joins". They say the problems are not fundamental and that "there is no reason why a row-store cannot store tuple headers separately, use virtual record-ids to join data, and maintain heap files in guaranteed position order", just that they were not able to force the row-store to do these things in their simulation.

In the end, the paper only concludes that:
A successful column-oriented simulation [in a row-store] will require some important system improvements, such as virtual record-ids, reduced tuple overhead, fast merge joins of sorted data, run-length encoding across multiple tuples ... operating directly on compressed data ... and late materialization ... some of ... [which] have been implemented or proposed to be implemented in various row-stores.
That seems like a much weaker conclusion that what the introduction promised. The conclusion appears to be that a row-store can perform as well as a column-store on data warehouse workloads, just that it looks awkward and difficult to make all the optimizations necessary for it to do so.

Please see also two blog posts, "Debunking a Myth: Column-Stores vs. Indexes" and "Debunking Another Myth: Column-Stores vs. Vertical Partitioning", by the first author, Daniel Abadi, where he expands on some parts of the paper.

If you are interested in this topic, you might also enjoy two of my older posts, "MapReduce a step backwards?" and "C-Store and Google BigTable".

1 comment:

Martin Probst said...

The authors of this paper (who are also the authors the database column blog) clearly have a commercial agenda. I've been reading their blog for some time, but there were too many posts that obviously contained huge holes or very biased argumentation to take it as a credible source of information.

Which is somewhat sad, as these guys are clearly clever and could share a lot of knowledge about database architectures.