HBase: On designing schemas for column-oriented data-stores

At Runa, we made an early decision to optimize. 🙂

We decided not go down the scaling a traditional database system (the one in question at the time was MySQL) route and instead to use a column-oriented data-store, that was built for scaling. After a cursory evaluation (which was done by me – and it involved a few hours of browsing the web (pseudo research),  checking email (not research),  and instant-messaging with a couple of buddies (definitely not research) – we picked HBase.

OK, so the real reason was that  we knew a few companies are using it, and that there seemed to be a little bit more of a community around it. In fact, there are a couple of HBase user-groups out here in the bay area. Someone recently asked me about CouchDB, and I’ve noticed a lot more buzz about it now that I’m watching for it… I’ve no reason why we didn’t pick it instead. They’re both Apache projects… maybe we’ll have use of CouchDB someday.

HBase schemas:

So anyway, we picked HBase. Now, having done so, and also having spent all my life using relational databases, I had no idea how to begin using HBase – the very first question sort of stumped me – what should the schema look like?

Here’s what I figured out – and I’m sure people will flame me for this – you don’t really need to design one. You just take your objects, extract their data using something like protocol buffers, yaml/json, or even (ugh!) XML – and then stick those into a single column in an HBase table.

Our simplistic HBase mapper:

We use JSON – but we’re doing something slightly different. Think of our persistable object represented by a fairly shallow JSON object – like so:

:painting => {
  :trees => [ "cedar", "maple", "oak"],
  :houses => 4,
  :cars => [
    {:make => 'honda', :model => 'fit', :license => 'ah12001'},
    {:make => 'toyota', :model => 'yaris', :license => 'xb34544'}],
  :road => {:name => '101N', :speed => 65}

OK, bizarre example. Still – the way you would persist this in HBase with the common approach would be to create an HBase table that had a simple schema – a single column family with one column – and you’d store the entire JSON message as text in each row. Maybe you would compress it. The row-id could be something that makes sense in your domain – I’ve seen user-ids, other object-ids, even time-stamps used. We use time-stamps for one of our core tables.

The variation we’re using is instead of storing the whole thing as a ‘blob’, we’re splitting it into columns – so that the tree-like structure of this JSON message is represented as a flat set of key-value pairs. Then, each value can be stored under a column-family:column-name.

The example above would translate to the following flat structure –

  "road:name" => "101N",
  "road:speed" => 65,
  "houses:" => 4,
  "trees:cedar" => "cedar",
  "trees:maple" => "maple",
  "trees:oak" => "oak",
  "cars_model:ah12001" => "fit",
  "cars_model:xb34544" => "yaris",
  "cars_make:ah12001" => "honda",
  "cars_make:xb34544" => "toyota"

Now it is ready to be inserted into a single row in HBase. The table we’re inserting into has the following column-families –


This can then be read back and converted into the original object easily enough.

Column-names as data:

Something to note here is that there are now column-family:column-name pairs (that together constitute the ‘column-name’) contain data themselves. An example is ‘cars_model:ah12001’ which is the name of the column, whose value is ‘fit’.

Why do we do this? Because we want the entire object-graph flattened into one row, and this allows us to do that.(what is the primary key?)

The thing to remember here is that in HBase (and others like it), each row can have any number of columns (constrained only by the column-families defined) and rows can populate values for different columns, leaving others blank. Nulls are stored for free. Coupled with the fact that HBase is optimized for millions of columns, this pattern of data-modeling becomes feasible. In fact you could store tons of data in any row in this manner.

Final comments:

If you’re always going to operate on the full object graph, then you don’t really need to split things up this way – you could use one of the options described above (xml, json, or protocol buffers). If different clients of this data typically need only a subset of the object graph (say only the car models or only the speed limits of roads, or some such combination), then with this data-splitting approach, they could only load up the required columns.

This idea of using columns as data take a little getting used to. I’m sure there are better ways of using such powerful data-stores – but this is the approach we’re taking right now, and it seems to be working for us so far.

Clojure HBase helper:

I’ve written some Clojure code that helps this transformation back and forth: hash-tables into/out of HBase. It is open-source – and once I clean it up, I will write about it.

Hope this stuff helps – and if I’ve described something stupid, then please leave a correction. Thanks!

P. S. – I met a Googler today who said BigTable (and by inference HBase) is not a column-oriented database. I think that is incorrect – at least according to wikipedia. I read it on the Internet, I must be right 🙂

32 thoughts on “HBase: On designing schemas for column-oriented data-stores

  1. How do You find Yur storage model suitable for searching for some data? For example, if, i Your example, You should collect all informations to all cars of the type “honda yaris”. Is there any way how to make it without going through all of columns in cars_model and cars_make ?

  2. Just desire to say your article is as amazing. The clearness for your put up is just cool and that
    i could assume you’re an expert in this subject.
    Fine with your permission allow me to grab
    your feed to keep up to date with coming near near post.
    Thank you a million and please carry on the rewarding work.

  3. Strictly speaking the Googler was correct. I think databases like Vertical or even KDB (under the covers) are column oriented. HBase is column family.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s