s-expressions

Amit Rathore blogs about software development

Archive for March, 2009

Capjure: a simple HBase persistence layer

Posted by Amit Rathore on March 31, 2009

Or how to get some free help putting stuff into HBase. For Clojure programmers.

OK, so maybe this one can even be called simplistic. Still, it works for my needs right now – so I thought it might help others. I wrote about my use-case in a previous post about HBase schema design – and this is the little helper utility I mentioned towards the end of it.

How to use it

Download here. There are two vars that need appropriate bindings for capjure to work – *hbase-master* and *primary-keys-config*.

*hbase-master*

This one is obvious. It must be bound to an hbase name-node or a master server. For example, I might bind it to hbase.test-site.net:60000. Watch out for the fact that Amazon’s EC2 instances need lots of configuration to open such ports to the world. This has no real relevance to this post, just thought I’d mention it.

*primary-keys-config*

This one is a bit more involved – and I’m sure I’ve done a bad job of simplifying this usage pattern. Still, let’s consider the example from the previous post. When you have an array of several hashes as a value in your JSON object that is being persisted (e.g. for the :cars key) -

  :cars => [
    {:make => 'honda', :model => 'fit', :license => 'ah12001'},
    {:make => 'toyota', :model => 'yaris', :license => 'xb34544'}],

it will be converted into

{
  "cars_model:ah12001" => "fit",
  "cars_make:ah12001" => "honda",
  "cars_model:xb34544" => "yaris",
  "cars_make:xb34544" => "toyota"
}

To make this happen, capjure needs to know what to use as a primary-key. Or something like that :) Here, we have decided upon the :license attribute of each hash. Capjure then removes that property from the child-hashes being saved, and sticks the value into the key part of the flattened data-structure as shown above.
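To make the flattening step concrete, here is a minimal sketch of it in plain Clojure. The helper name flatten-by-key is made up for illustration and this is not capjure's actual code; it only reproduces the transformation described above, assuming each child hash carries the primary-key attribute.

```clojure
;; Illustrative sketch only: `flatten-by-key` is a hypothetical helper,
;; not capjure's actual implementation.
(defn flatten-by-key
  "Flattens a seq of maps stored under `parent` into one map of
  \"parent_attribute:pk-value\" => attribute value, where `pk` names the
  attribute used as the primary key. The pk attribute itself is dropped
  from each child map and moves into the column name."
  [parent pk rows]
  (into {}
        (for [row rows
              [attr value] (dissoc row pk)]
          [(str (name parent) "_" (name attr) ":" (get row pk)) value])))

(def cars
  [{:make "honda" :model "fit" :license "ah12001"}
   {:make "toyota" :model "yaris" :license "xb34544"}])

;; Produces the four "cars_make:..." / "cars_model:..." entries shown above.
(flatten-by-key :cars :license cars)
```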

This is accomplished by -

(def encoders (config-keys
  (config-for :cars :license (fn [car-map]
                               (car-map :license)))))

Similarly, other primary-keys can be configured. And because the actual value used is the value returned by the function defined (as above), it can be as complex as needed. For example, if the values have spaces, you can encode them using some scheme.
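As a sketch of what such an encoding function might look like, the example below replaces whitespace with underscores. The scheme and the name encode-license are assumptions for illustration only; the only thing taken from the example above is that the function receives the child hash and returns the string to use in the column name.

```clojure
(require '[clojure.string :as string])

;; Hypothetical encoder: the underscore scheme is just one possible choice.
(defn encode-license
  "Returns the :license value of `car-map` with each run of whitespace
  replaced by a single underscore."
  [car-map]
  (string/replace (:license car-map) #"\s+" "_"))

(encode-license {:make "honda" :model "fit" :license "ah 12001"})
```

A function like this could be passed in place of the anonymous function in the config-for call above. Note that this particular scheme is lossy if license values can already contain underscores, so the matching decoder could not tell the two apart.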

A similar configuration is needed for this process to be reversed when reading out of HBase -

(def decoders (config-keys
  (config-for :cars :license (fn [value]
                               value))))

In this case, we just use an identity function because the reverse mapping is straight-forward (in other words, we didn’t do anything fancy during the previous flattening operation). During re-hydration, the function is called with the value that was stored in the key part of the flattened structure, and a key-value pair is added to the child object being rebuilt: the key is the one specified (:license) and the value is whatever the function returns.
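The reverse direction can be sketched the same way. The helper hydrate-by-key below is hypothetical and is not how capjure is actually written; it assumes column names of the form "parent_attribute:pk-value" and takes the decoder as a plain function.

```clojure
(require '[clojure.string :as string])

;; Illustrative sketch only: `hydrate-by-key` is a hypothetical helper,
;; not capjure's actual implementation.
(defn hydrate-by-key
  "Rebuilds the child maps for `parent` from a flattened map. `decode` turns
  the part of the column name after the colon back into the value stored
  under `pk`. Children come back ordered by that column-name part."
  [parent pk decode flat]
  (let [prefix (str (name parent) "_")]
    (->> flat
         ;; keep only the columns that belong to this parent
         (filter (fn [[column _]] (string/starts-with? column prefix)))
         ;; group attributes by primary-key value, re-adding the pk attribute
         (reduce (fn [acc [column value]]
                   (let [[family qualifier] (string/split column #":" 2)
                         attr (keyword (subs family (count prefix)))]
                     (-> acc
                         (assoc-in [qualifier pk] (decode qualifier))
                         (assoc-in [qualifier attr] value))))
                 (sorted-map))
         vals
         vec)))

(def flat-row
  {"cars_model:ah12001" "fit"
   "cars_make:ah12001" "honda"
   "cars_model:xb34544" "yaris"
   "cars_make:xb34544" "toyota"
   "road:name" "101N"})

;; With an identity decoder this gives back the two car hashes,
;; each with its :license restored.
(hydrate-by-key :cars :license identity flat-row)
```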

Similarly, other configuration parameters can be added for other sub-objects that have primary-keys.

Together, the encoders and decoders form the *primary-keys-config*. Thus, if you do the following -

 
(def keys-config {:encode encoders :decode decoders})

then keys-config should be used as the value that *primary-keys-config* gets bound to.

Methods of interest

Once this is done, objects can be pushed into and out of HBase quite trivially -

 
(binding [*hbase-master* "hbase.test-site.net:60000" *primary-keys-config* keys-config]
	(capjure-insert some-json-object "hbase_table_name" "some-row-id"))

and -

 
(binding [*hbase-master* "hbase.test-site.net:60000" *primary-keys-config* keys-config]
	(read-as-hydrated "hbase_table_name" "some-row-id"))
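If you haven't used binding much, the sketch below shows the mechanism on its own, with stand-in vars defined locally rather than capjure's real ones. The helper current-settings is made up for the example. One thing worth knowing: on Clojure 1.3 and later, a var has to be declared ^:dynamic before binding can rebind it.

```clojure
;; Stand-in vars for illustration; capjure defines its own *hbase-master*
;; and *primary-keys-config*. On Clojure 1.3+ a var must be marked
;; ^:dynamic, otherwise `binding` throws an IllegalStateException.
(def ^:dynamic *hbase-master* nil)
(def ^:dynamic *primary-keys-config* nil)

(defn current-settings
  "Hypothetical helper that reads whatever is bound at call time."
  []
  {:master *hbase-master* :keys-config *primary-keys-config*})

;; Inside the form, any code called on this thread sees the bound values.
(def inside
  (binding [*hbase-master* "hbase.test-site.net:60000"
            *primary-keys-config* {:encode {} :decode {}}]
    (current-settings)))

;; Outside the form, both vars are back to their root value (nil).
(def outside (current-settings))
```

The bindings are per-thread and only last for the dynamic extent of the form, which is why each call above is wrapped in its own binding.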

Other convenience methods

Capjure provides other convenience methods like -

 
row-exists? [hbase-table-name row-id-string]
cell-value-as-string [row column-name]
read-all-versions-as-strings [hbase-table-name row-id-string number-of-versions column-family-as-string]
read-cell [hbase-table-name row-id column-name]
rowcount [hbase-table-name & columns]
delete-all [hbase-table-name & row-ids-as-strings]

and others. Everything is built on top of the HBase client API. Thanks to Dan Larkin for clojure-json.

Limitations

I’m no expert in persistence systems – and I’m sure this one has plenty of issues. The main limitation is that the object that capjure can persist can only be so deep. Specifically, the object should be a hash that contains symbols (or strings) as keys, and whose values can be strings (or other primitives), arrays of such primitives, a hash with one level of key-values, or an array of hashes that are one level deep.
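The shape restriction can be written down as a predicate. This is only a sketch of the rule as stated above, not something capjure ships; the function names are made up, and what counts as a "primitive" here (strings, numbers, keywords, booleans) is my assumption.

```clojure
;; Hypothetical helpers; the definition of "primitive" is an assumption.
(defn primitive? [x]
  (or (string? x) (number? x) (keyword? x) (true? x) (false? x)))

(defn flat-map?
  "True for a hash whose values are all primitives (one level deep)."
  [x]
  (and (map? x) (every? primitive? (vals x))))

(defn persistable-value? [v]
  (or (primitive? v)
      (flat-map? v)
      (and (sequential? v)
           (or (every? primitive? v)
               (every? flat-map? v)))))

(defn persistable?
  "True when `obj` is shallow enough to be flattened into a single row."
  [obj]
  (and (map? obj)
       (every? #(or (keyword? %) (string? %) (symbol? %)) (keys obj))
       (every? persistable-value? (vals obj))))

(def painting
  {:trees ["cedar" "maple" "oak"]
   :houses 4
   :cars [{:make "honda" :model "fit" :license "ah12001"}
          {:make "toyota" :model "yaris" :license "xb34544"}]
   :road {:name "101N" :speed 65}})

(persistable? painting)                          ; shallow enough
(persistable? {:road {:name {:short "101N"}}})   ; nested one level too deep
```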

Feedback welcome

Please contact me if you have suggestions and stuff. Again, the code is on github.


HBase: On designing schemas for column-oriented data-stores

Posted by Amit Rathore on March 8, 2009

At Runa, we made an early decision to optimize. :)

We decided not to go down the route of scaling a traditional database system (the one in question at the time was MySQL) and instead to use a column-oriented data-store that was built for scaling. After a cursory evaluation (which was done by me, and which involved a few hours of browsing the web (pseudo research), checking email (not research), and instant-messaging with a couple of buddies (definitely not research)) – we picked HBase.

OK, so the real reason was that we knew a few companies were using it, and that there seemed to be a little bit more of a community around it. In fact, there are a couple of HBase user-groups out here in the Bay Area. Someone recently asked me about CouchDB, and I’ve noticed a lot more buzz about it now that I’m watching for it… I have no good reason for why we didn’t pick it instead. They’re both Apache projects… maybe we’ll have use for CouchDB someday.

HBase schemas:

So anyway, we picked HBase. Now, having done so, and also having spent all my life using relational databases, I had no idea how to begin using HBase – the very first question sort of stumped me – what should the schema look like?

Here’s what I figured out – and I’m sure people will flame me for this – you don’t really need to design one. You just take your objects, extract their data using something like protocol buffers, yaml/json, or even (ugh!) XML – and then stick those into a single column in an HBase table.

Our simplistic HBase mapper:

We use JSON – but we’re doing something slightly different. Think of our persistable object represented by a fairly shallow JSON object – like so:

:painting => {
  :trees => [ "cedar", "maple", "oak"],
  :houses => 4,
  :cars => [
    {:make => 'honda', :model => 'fit', :license => 'ah12001'},
    {:make => 'toyota', :model => 'yaris', :license => 'xb34544'}],
  :road => {:name => '101N', :speed => 65}
}

OK, bizarre example. Still – the way you would persist this in HBase with the common approach would be to create an HBase table that had a simple schema – a single column family with one column – and you’d store the entire JSON message as text in each row. Maybe you would compress it. The row-id could be something that makes sense in your domain – I’ve seen user-ids, other object-ids, even time-stamps used. We use time-stamps for one of our core tables.

The variation we’re using is that instead of storing the whole thing as a ‘blob’, we’re splitting it into columns – so that the tree-like structure of this JSON message is represented as a flat set of key-value pairs. Then, each value can be stored under a column-family:column-name.

The example above would translate to the following flat structure -

{
  "road:name" => "101N",
  "road:speed" => 65,
  "houses:" => 4,
  "trees:cedar" => "cedar",
  "trees:maple" => "maple",
  "trees:oak" => "oak",
  "cars_model:ah12001" => "fit",
  "cars_model:xb34544" => "yaris",
  "cars_make:ah12001" => "honda",
  "cars_make:xb34544" => "toyota"
}

Now it is ready to be inserted into a single row in HBase. The table we’re inserting into has the following column-families -

"road"
"houses"
"trees"
"cars_model"
"cars_make"

This can then be read back and converted into the original object easily enough.
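Here is a rough sketch of the whole flattening step for the painting example, in plain Clojure. It is not the code we actually run; the function names and the primary-keys argument (a map from parent key to the attribute used as primary key) are made up for illustration. It handles the four shapes in the example: scalars, arrays of primitives, one-level hashes, and arrays of one-level hashes.

```clojure
(require '[clojure.string :as string])

;; Illustrative sketch only. `primary-keys` is a hypothetical map of
;; parent key => attribute used as the primary key for its child hashes.
(defn flatten-entry
  "Turns one top-level [key value] pair into a seq of [column value] pairs."
  [primary-keys [k v]]
  (let [family (name k)]
    (cond
      ;; array of hashes: "family_attr:pk-value" => attribute value
      (and (sequential? v) (every? map? v))
      (let [pk (primary-keys k)]
        (for [row v
              [attr value] (dissoc row pk)]
          [(str family "_" (name attr) ":" (get row pk)) value]))

      ;; array of primitives: the element is both column name and value
      (sequential? v)
      (for [item v] [(str family ":" item) item])

      ;; one-level hash: "family:attr" => value
      (map? v)
      (for [[attr value] v] [(str family ":" (name attr)) value])

      ;; scalar: empty column name within the family, e.g. "houses:"
      :else
      [[(str family ":") v]])))

(defn flatten-object [primary-keys obj]
  (into {} (mapcat #(flatten-entry primary-keys %) obj)))

(defn column-families
  "The set of column families a flattened row needs in its table."
  [flat-row]
  (set (map #(first (string/split % #":")) (keys flat-row))))

(def painting
  {:trees ["cedar" "maple" "oak"]
   :houses 4
   :cars [{:make "honda" :model "fit" :license "ah12001"}
          {:make "toyota" :model "yaris" :license "xb34544"}]
   :road {:name "101N" :speed 65}})

(def flat-painting (flatten-object {:cars :license} painting))
(column-families flat-painting)
```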

Column-names as data:

Something to note here is that the column-family:column-name pairs (that together constitute the ‘column-name’) now contain data themselves. An example is ‘cars_model:ah12001’, which is the name of the column whose value is ‘fit’.

Why do we do this? Because we want the entire object-graph flattened into one row, and this allows us to do that. (What plays the role of the primary key here? The license value of each car.)

The thing to remember here is that in HBase (and others like it), each row can have any number of columns (constrained only by the column-families defined) and rows can populate values for different columns, leaving others blank. Nulls are stored for free. Coupled with the fact that HBase is optimized for millions of columns, this pattern of data-modeling becomes feasible. In fact you could store tons of data in any row in this manner.

Final comments:

If you’re always going to operate on the full object graph, then you don’t really need to split things up this way – you could use one of the options described above (xml, json, or protocol buffers). If different clients of this data typically need only a subset of the object graph (say only the car models or only the speed limits of roads, or some such combination), then with this data-splitting approach, they could load up only the required columns.
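To illustrate the subset idea without a running cluster, the sketch below filters an in-memory flattened row by column family. The name select-families is made up, and the map is only a stand-in for a row: against a real HBase table you would ask for just those columns in the read request rather than filtering after the fact.

```clojure
(require '[clojure.string :as string])

;; In-memory stand-in for a row; `select-families` is a hypothetical helper.
(defn select-families
  "Keeps only the columns of `flat-row` whose column family is in the
  set `families`."
  [flat-row families]
  (into {}
        (filter (fn [[column _]]
                  (contains? families (first (string/split column #":" 2))))
                flat-row)))

(def row
  {"road:name" "101N"
   "road:speed" 65
   "houses:" 4
   "cars_model:ah12001" "fit"
   "cars_model:xb34544" "yaris"
   "cars_make:ah12001" "honda"
   "cars_make:xb34544" "toyota"})

;; A client that only cares about car models and roads:
(select-families row #{"cars_model" "road"})
```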

This idea of using columns as data takes a little getting used to. I’m sure there are better ways of using such powerful data-stores – but this is the approach we’re taking right now, and it seems to be working for us so far.

Clojure HBase helper:

I’ve written some Clojure code that helps this transformation back and forth: hash-tables into/out of HBase. It is open-source – and once I clean it up, I will write about it.

Hope this stuff helps – and if I’ve described something stupid, then please leave a correction. Thanks!

P. S. – I met a Googler today who said BigTable (and by inference HBase) is not a column-oriented database. I think that is incorrect – at least according to wikipedia. I read it on the Internet, I must be right :)


 