Thursday, December 06, 2012

Google and the right database for the job

I finally got a chance to read "Processing a Trillion Cells per Mouse Click", a paper out of Google presented at the recent VLDB 2012 conference.

It describes the rather cool PowerDrill column-oriented database at Google, which is optimized for speed: 10 to 100 times faster than other column-oriented databases, and several orders of magnitude faster than MapReduce/Hadoop. But, of course, there are tradeoffs to get those speed gains, and the tradeoff PowerDrill makes is that it keeps a lot in memory, so it can only hold a fraction of the data the other systems can.

What is so interesting about this, and what other companies need to learn from this, is the way Google builds so many databases to analyze its massive log data. The goal is to let people find stuff in the logs as fast as possible. That means you need many tools, the right tool for the job.

Hadoop and similar systems allow you to scan massive amounts of log data but, c'mon, all of us know that the vast majority of Hadoop jobs ignore almost all of the data. Every one of these jobs starts by selecting out a couple of the columns, the same columns almost everyone else wants, and dropping everything else. Fire up your job, waste hours of time waiting for almost all the data from a full table scan to be thrown out, and finally you get the result.

Dremel and other column-oriented databases help a lot with this. If almost all log processing jobs only want a couple columns, a column-oriented database is designed to pull out just a few columns quickly, and it's going to be a lot faster.
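To make the difference concrete, here is a minimal Python sketch of the two layouts. The log fields (`timestamp`, `country`, `latency_ms`, `url`) and the function names are invented for illustration; nothing here reflects how Dremel actually stores data.

```python
# Toy log records. The field names are assumptions made up for this example.
ROWS = [
    {"timestamp": 1, "country": "US", "latency_ms": 120, "url": "/search"},
    {"timestamp": 2, "country": "DE", "latency_ms": 80, "url": "/mail"},
    {"timestamp": 3, "country": "US", "latency_ms": 200, "url": "/maps"},
]


def row_scan_sum(rows, column):
    """Row layout: every whole record is visited to use a single field."""
    return sum(record[column] for record in rows)


def to_columns(rows):
    """Pivot row records into one list per field (a toy column layout)."""
    columns = {}
    for record in rows:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def column_scan_sum(columns, column):
    """Column layout: only the one requested list is read."""
    return sum(columns[column])


if __name__ == "__main__":
    cols = to_columns(ROWS)
    print(row_scan_sum(ROWS, "latency_ms"), column_scan_sum(cols, "latency_ms"))
```

Both functions return the same answer. The point is what gets read: on disk, the row version drags every field of every record through the scan, while the column version reads only the one column the query asked for.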

PowerDrill goes a step further. If almost all log processing jobs only want the most recent logs and only a few of the columns, just create a database with only the most recent logs and a few of the columns. Add in a lot of carefully designed compression, sharding across a medium-sized cluster, and the ability to skip over much of the data when it isn't needed (instead of doing full table scans all the time), and you got yourself the ability to answer most questions people ask of the logs in seconds, not hours.
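Here is a toy Python sketch of what skipping data can look like. It assumes a design where a column is split into chunks and each chunk carries a small dictionary of the values it contains, so a filter can rule out whole chunks without scanning them. The chunk size, data, and function names are hypothetical, and this is a simplification for illustration, not PowerDrill's actual implementation.

```python
def build_chunks(values, chunk_size):
    """Split a column into chunks, dictionary-encoding each chunk."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        part = values[start:start + chunk_size]
        dictionary = sorted(set(part))
        index = {value: code for code, value in enumerate(dictionary)}
        chunks.append({
            "index": index,                      # value -> small integer code
            "codes": [index[v] for v in part],   # the column, stored as codes
        })
    return chunks


def count_equal(chunks, target):
    """Count rows equal to target, skipping chunks that cannot match.

    Returns (matching_rows, chunks_actually_scanned).
    """
    matches = 0
    scanned = 0
    for chunk in chunks:
        code = chunk["index"].get(target)
        if code is None:
            continue  # target is not in this chunk's dictionary: skip it
        scanned += 1
        matches += sum(1 for c in chunk["codes"] if c == code)
    return matches, scanned


if __name__ == "__main__":
    country = ["US"] * 4 + ["DE"] * 4 + ["US", "JP", "JP", "JP"]
    chunks = build_chunks(country, chunk_size=4)
    print(count_equal(chunks, "JP"))
```

Skipping pays off only when similar values end up in the same chunks. If every chunk contained every country, the dictionaries would rule nothing out and every query would be back to a full scan.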

And that's the point. Build a system that can answer 90% of the questions people ask of the logs in seconds. Build another that can answer 90% of the remaining, harder questions people ask of the logs in minutes. Then have a system that primarily archives all the logs, but also can answer, given enough time and power, much more complicated questions people very rarely ask.
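The arithmetic behind that tiering is worth spelling out. A small Python sketch, taking the 90% figures above as given (they are rules of thumb, not measurements), with a function name made up for this example:

```python
from fractions import Fraction


def cumulative_coverage(tier_hit_rates):
    """Fraction of all questions answered by the end of each tier.

    Each rate is the share of the questions still unanswered that the
    tier can handle.
    """
    remaining = Fraction(1)
    covered = []
    for rate in tier_hit_rates:
        remaining *= 1 - rate
        covered.append(1 - remaining)
    return covered


if __name__ == "__main__":
    ninety = Fraction(9, 10)
    for tier, share in enumerate(cumulative_coverage([ninety, ninety]), start=1):
        print(f"after tier {tier}: {float(share):.0%} of questions answered")
```

With two 90% tiers, only 1 question in 100 ever has to fall through to the slow archive system.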

Those Google guys have many databases for asking questions of their logs. Maybe you should too.

Some excerpts from the PowerDrill paper:
The column-store developed as part of PowerDrill is tailored to support a few selected datasets and tuned for speed ... Our column-store relies on having as much data in memory as possible ... PowerDrill can run interactive single queries over more rows than Dremel, however the total amount of data it can serve is much smaller.

Consider a typical use case such as triggering 20 SQL queries with a single click in the UI. In our production system on average these queries process 782 billion cells in 30-40 seconds (under 2 seconds per query) .... Each month it is used by more than 800 users sending out about 4 million SQL queries ... scanning [the equivalent of] 525 trillion cells .... One of our top users ... [in] 6 hours ... [executed about] 12 thousand queries .... Our production system is running on well over 1000 machines, the distributed servers altogether using over 4T of main memory.

[PowerDrill] pushes the "interactivity limit" out significantly ... The majority of queries are fairly discriminative, similar, and uniform ... The store has only a few but often explored tables (as opposed to many tables that are not used very often) ... [For many common queries] our techniques push the limit of interactivity out by one or two orders of magnitude.
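A quick back-of-the-envelope pass over the figures quoted in the excerpts, as a Python sketch. The inputs are the paper's rounded numbers ("more than 800 users", "about 4 million" queries), so the outputs are rough; the constant and function names are my own.

```python
# Figures as quoted in the excerpts above (rounded in the source).
CELLS_PER_CLICK = 782 * 10**9      # cells processed by one UI click
QUERIES_PER_CLICK = 20
CLICK_SECONDS = (30, 40)           # quoted range for one click
MONTHLY_USERS = 800                # "more than 800"
MONTHLY_QUERIES = 4 * 10**6        # "about 4 million"
MONTHLY_CELLS = 525 * 10**12       # "[the equivalent of] 525 trillion"
TOP_USER_QUERIES = 12_000          # "about 12 thousand"
TOP_USER_HOURS = 6


def per_click_stats():
    """Per-query latency and aggregate scan rate for a single click."""
    fast, slow = CLICK_SECONDS
    return {
        "seconds_per_query": (fast / QUERIES_PER_CLICK, slow / QUERIES_PER_CLICK),
        "cells_per_second": (CELLS_PER_CLICK / slow, CELLS_PER_CLICK / fast),
    }


def monthly_stats():
    """Rough averages implied by the monthly totals."""
    return {
        "queries_per_user": MONTHLY_QUERIES // MONTHLY_USERS,
        "cells_per_query": MONTHLY_CELLS // MONTHLY_QUERIES,
        "top_user_seconds_between_queries": TOP_USER_HOURS * 3600 / TOP_USER_QUERIES,
    }


if __name__ == "__main__":
    print(per_click_stats())
    print(monthly_stats())
```

One click works out to roughly 20 to 26 billion cells per second across the cluster, and the top user was averaging a query every couple of seconds for six hours straight, which presumably reflects a UI that fires many queries per click.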

Saturday, December 01, 2012

Quick links

More of what caught my attention recently:
  • Android now has 72.4% of the mobile market, up from 52.5% a year ago ([1])

  • Google's new Nexus 4 smartphone is in high demand and for good reason: "The idea that a Nexus quad-core smartphone is hitting the market ... [at] $300 is simply stunning. Even more so is that it's available without any contract or carrier locks, which means you can use it virtually anywhere in the world. ... The price of freedom has never been more reasonable." ([1] [2] [3])

  • Google and Amazon aim to destroy Apple's high margin business model, selling hardware at cost and making money off content instead ([1])

  • "Amazon is a black hole threatening to devour corporate America" ([1])

  • "The ground is shifting beneath ... tech titans because of a major force: the rise of mobile devices" ([1])

  • Mobile/tablets are being used for about 16% of online sales, but sales from referrals out of Twitter and Facebook are near 0% ([1])

  • Google expects that 50% of traffic to Google.com will come from mobile in 2013. I wonder what that implies for Google, since it almost certainly does not mean 50% of revenue comes from mobile in 2013. ([1])

  • Google's latest Chromebook laptop and Nexus 7 tablet are both in high demand, and Google is "massively ramping production". Meanwhile, Microsoft is cutting production of its Surface hybrid tablet because of low demand. ([1] [2] [3])

  • Tablets mostly are used in the evening and for games and entertainment ([1] [2] [3])

  • Surprising data (at least to me) on browser market share. I thought IE was falling rapidly, but no. The data says IE is steady, Chrome's growth has stalled, and Firefox is no longer falling and is actually climbing slightly. ([1])

  • "Giving users the choice to view (or not view) may actually increase this advertising effectiveness" ([1])

  • Experimental data is poised to kill off a big chunk of the last three decades of work in theoretical physics ([1])

  • Good overview of the current state of autonomous flying robots. Lots of breakthroughs recently. ([1])

  • "It's actually more natural for humans to think logarithmically than linearly" ([1])

  • If you don't need the actual location right away, it's three orders of magnitude cheaper (in energy use) to collect raw GPS data and process it later (in the cloud) than it is to process it immediately on the mobile device ([1])

  • Startups would love to get their hands on Google Fiber (especially the upload speeds) but can't. Cities should be thinking about encouraging Google Fiber (or similar) as a way to encourage startups. ([1])

  • Key question is: "Do patents, in fact, provide a net incentive for innovation in the software industry?" ([1])

  • Crazy data about the incredibly low cost of renting botnets, paying for someone to take out websites with DDoS attacks, sending spam, and buying various types of trojans ([1])

  • "We can't be afraid to let them actually take charge and ship" ([1])

  • "Only a handful of startups that are big successes. What happens along the way that causes such failure? It's like there's a tunnel full of monsters that kill them along the way. I'm going to tell you what these monsters are so you know to avoid them." ([1])

  • "By far the most common mistake startups make is to solve problems no one has" ([1])

  • Dilbert summarizes the advice from most business books ([1])

  • "People with lots of authority tend to behave like neurological patients with a damaged orbito-frontal lobe, a brain area that's crucial for empathy and decision-making" ([1])

  • "Studies of the human brain demonstrate that .... some people seem to think about their future selves in the same way that they think about complete strangers" ([1])

  • On why PC sales are flat: "Norvig's Law: Any technology that surpasses 50% penetration will never double again (in any number of months)." ([1])

  • "To the surprise of pundits, numbers continue to be best system for determining which of two things is larger" ([1])