Capjure: a simple HBase persistence layer

Or how to get some free help putting stuff into HBase. For Clojure programmers.

OK, so maybe this one can even be called simplistic. Still, it works for my needs right now – so I thought it might help others. I wrote about my use-case in a previous post about HBase schema design – and this is the little helper utility I mentioned towards the end of it.

How to use it

Download here. There are two vars that need appropriate bindings for capjure to work – *hbase-master* and *primary-keys-config*.

*hbase-master*

This one is obvious. It must be bound to an hbase name-node or a master server. For example, I might bind it to hbase.test-site.net:60000. Watch out for the fact that Amazon’s EC2 instances need lots of configuration to open such ports to the world. This has no real relevance to this post, just thought I’d mention it.

*primary-keys-config*

This one is a bit more involved – and I’m sure I’ve done a bad job of simplifying this usage pattern. Still, let’s consider the example from the previous post. When you have an array of several hashes as a value in your JSON object that is being persisted (e.g. for the :cars key) –

  :cars => [
    {:make => 'honda', :model => 'fit', :license => 'ah12001'},
    {:make => 'toyota', :model => 'yaris', :license => 'xb34544'}],

it will be converted into

{
  "cars_model:ah12001" => "fit",
  "cars_make:ah12001" => "honda",
  "cars_model:xb34544" => "yaris",
  "cars_make:xb34544" => "toyota"
}

To make this happen, capjure needs to know what to use as a primary-key. Or something like that 🙂 Here, we have decided upon the :license attribute of each hash. Capjure then removes that property from the child-hashes being saved, and sticks the value into the key part of the flattened data-structure as shown above.
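To make the transformation concrete, here is a minimal sketch of that flattening step in plain Clojure. It is not capjure's actual code: flatten-children is a hypothetical name, and the sketch only assumes the key format shown above.

```clojure
;; Hypothetical helper, not part of capjure: flattens a vector of child
;; maps into {"<prefix>_<attr>:<pk-value>" value} pairs.
(defn flatten-children
  [prefix pk children]
  (into {}
        (for [child children
              :let [pk-value (get child pk)]
              ;; the primary-key attribute is removed from the child
              ;; and used as the column qualifier instead
              [attr value] (dissoc child pk)]
          [(str (name prefix) "_" (name attr) ":" pk-value) value])))

;; Flattening the :cars example from above.
(def flat-cars
  (flatten-children :cars :license
                    [{:make "honda" :model "fit" :license "ah12001"}
                     {:make "toyota" :model "yaris" :license "xb34544"}]))
```

Here flat-cars holds the same four key-value pairs as the flattened structure shown earlier.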

This is accomplished by –

(def encoders (config-keys
  (config-for :cars :license (fn [car-map]
                               (car-map :license)))))

Similarly, other primary-keys can be configured. And because the actual value used is the value returned by the function defined (as above), it can be as complex as needed. For example, if the values have spaces, you can encode them using some scheme.

A similar configuration is needed to reverse this process when reading out of HBase –

(def decoders (config-keys
  (config-for :cars :license (fn [value]
                               value))))

In this case, we just use an identity function because the reverse mapping is straight-forward (in other words, we didn’t do anything fancy during the previous flattening operation). What happens is that a key-value pair is added to each object being re-hydrated: the key is the one specified (:license), and the value is whatever the function returns.
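The reverse direction can be sketched in the same spirit. Again, this is only an illustration under assumptions, not capjure's implementation: rehydrate-children is a hypothetical name, and decode stands in for the decoder function configured above.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, not part of capjure: rebuilds child maps from
;; flat "<prefix>_<attr>:<pk-value>" columns. decode turns the column
;; qualifier back into the primary-key value.
(defn rehydrate-children
  [prefix pk decode flat]
  (let [family-prefix (str (name prefix) "_")]
    (->> flat
         ;; keep only the columns that belong to this sub-object
         (filter (fn [[column _]] (str/starts-with? column family-prefix)))
         ;; group attributes by qualifier, adding the primary key back in
         (reduce (fn [acc [column value]]
                   (let [[family qualifier] (str/split column #":" 2)
                         attr (keyword (subs family (count family-prefix)))]
                     (-> acc
                         (assoc-in [qualifier pk] (decode qualifier))
                         (assoc-in [qualifier attr] value))))
                 {})
         vals
         vec)))

(def cars
  (rehydrate-children :cars :license identity
                      {"cars_model:ah12001" "fit"
                       "cars_make:ah12001" "honda"
                       "cars_model:xb34544" "yaris"
                       "cars_make:xb34544" "toyota"
                       "road:name" "101N"}))
```

With identity as the decoder, cars contains the two original car hashes, each with its :license restored. The order of the children is not guaranteed in this sketch.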

Similarly, other configuration parameters can be added for other sub-objects that have primary-keys.

Together, the encoders and decoders form the *primary-keys-config*. Thus, if you do the following –

 
(def keys-config {:encode encoders :decode decoders})

then keys-config should be used as the value that *primary-keys-config* gets bound to.

Methods of interest

Once this is done, objects can be pushed into and out of HBase quite trivially –

 
(binding [*hbase-master* "hbase.test-site.net:60000" *primary-keys-config* keys-config]
	(capjure-insert some-json-object "hbase_table_name" "some-row-id"))

and –

 
(binding [*hbase-master* "hbase.test-site.net:60000" *primary-keys-config* keys-config]
	(read-as-hydrated "hbase_table_name" "some-row-id"))

Other convenience methods

Capjure provides other convenience methods like –

 
row-exists? [hbase-table-name row-id-string]
cell-value-as-string [row column-name]
read-all-versions-as-strings [hbase-table-name row-id-string number-of-versions column-family-as-string]
read-cell [hbase-table-name row-id column-name]
rowcount [hbase-table-name & columns]
delete-all [hbase-table-name & row-ids-as-strings]

and others. Everything is built on top of the HBase client API. Thanks to Dan Larkin for clojure-json.

Limitations

I’m no expert in persistence systems – and I’m sure this one has plenty of issues. The main limitation is that the object that capjure can persist can only be so deep. Specifically, the object should be a hash that contains symbols (or strings) as keys, and the values can be strings (or other primitives), arrays of such primitives, a hash with one level of key-values, or an array of hashes that are one level deep.

Feedback welcome

Please contact me if you have suggestions and stuff. Again, the code is on github.

Adopting lean software development: What is a user story?

I recently got an email from a reader (Dmitry Lobanov) who had some questions about the process stuff I’ve written about here. With his permission, I’ve reproduced the contents of his email and have responded to his queries. It may help other folks that are in early stages of adopting lean/agile methods –

Hello Amit. You keep interesting blog with a lot of useful information about project management. To my opinion the most attractive idea in your blog is user stories decoupling.

Let me tell how I found it interesting and then I’ll ask a few questions, Ok?

In our project we use scrum with some XP techniques. I found link to your blog in an overview of Mingle at Thoughtworks website. While reading I found your idea of decoupling stories into 2 story points tasks very interesting. And I think, that it makes sense. We tried to compare two approaches (yours and card poker) while estimating amount of time required for 1 specific user story. Well, it is not 100% your approach in fact, because we still tried to use estimation and ideal days, but we took your idea concerning 2 story points tasks. But we are not ready to forget about estimation yet 🙂

First we estimated using card poker, and then repeated estimation using your approach. And we found the following advantages of your approach:
– your approach gave us complicated tree of concrete task. That means that we could just take them and start working on this user story, we know what and how should be done. In other words we have got a plan! Card poker didn’t force us to decouple user story into atomic tasks and just gave us list of generalized tasks (which themselves can be user stories).
– card poker estimation gave us some number of story points (we used ideal days according to scrum guidelines), let it be X ideal days. And then we used your approach for estimation: number of atomic tasks multiplied by 2 story points (we still use ideal days 🙂 ), let it be Y ideal days. And we have been amazed, that Y was 10 times greater than X! 10 times! That’s too much.

This experiment shows us that standard estimation lacks of accuracy a lot… well, we suspected it, but only after experiment we saw it clearly. And using card poker (i.e. bad decoupling) we constantly have problems with developers got stuck in development process, and as a result they do 5-days task for 2 weeks or even more. We have one user story, which had been estimated as 8 days, but two developers work on it for 2 sprints already. It’s no good at all.
So in the near sprint we’ll try decoupling user stories into 2 story points tasks and watch the result.

I have a couple of questions about decoupling and sprint planning, and it would be nice if you provide your point of view concerning them.

In your post about requirements management (http://epistemologic.com/2007/04/08/requirements-management-user-stories-mind-maps-and-story-trees/) you touched only initial planning. But what to do with bugs and technical tasks? I’ll explain. Look, our project has been in development for 2 years, we develop payment acceptance system. And our system has been in use for a bit less than 2 years (it took approximately 3 months to issue first release). So it changes constantly, we issue new release every 2-3 weeks. But when the development started, team suffers a lot from a lack of project management experience. I came to project few months ago (I’m not a project manager, just developer, who wants to improve project). Now situation seems to improve. But we have a LOT of bugs. As you understand, these bugs can’t be attached to specific user story, moreover we hadn’t user stories before, in fact we introduced scrum only 2-3 months ago.

For example we need to fix one bug (one of many) concerning working with hardware. And we clearly understand, that we need to refactor part of our system, which works with hardware. It could take a lot of time, we understand that we need to add hardware manager, change hardware interfaces and so forth in order to unify working with hardware. And it should be done, because if we just fix bug not changing anything, then literally tomorrow we’ll have to fix similar bug, and then another one, etc. How this task should be registered? It is not a user story, because product owner (or stake-holders) wouldn’t see the result, they wouldn’t be able to “touch” it, it can’t be demonstrated to them. And it is not a defect for particular user story, because it is the behaviour of legacy code, we haven’t corresponding user story. What should we do?

I see only one solution. We should wait till stakeholders give us user story concerned this functionality. For example, stakeholders want us to add support for a new type of device. Hence we can include refactoring task to a bunch of tasks under this user story.

But our system has been in development for a long time, and we added support for a lot of devices already. So it can take a lot of time till suitable user story will appear. And we should just wait and fix bugs?

And how to register bugs if we don’t have corresponding user story?

Another one situation we faced yesterday. If we have user story, which involves a lot of changes in architecture. That is user story contains a lot of tasks, but these tasks can’t be accomplished separately, and they are not user stories themselves, because they are technical tasks (refactoring and researches). It’s clear that we can’t show them to stakeholders, but we can’t accomplish user story not accomplishing these tasks. And we think, that we won’t be able to accomplish all these tasks in one sprint. And I don’t know how we can decouple 1 user story into several user stories. What should be done?

Correct me please if I am wrong in my interpretation of user stories. I think that user story is “something” useful to stakeholder. Stakeholders can “touch” user story, they can take a look at it, they can play with new functionality and so forth. Hence anything that can’t be demonstrated to stakeholders is not a user story, is it so? Stakeholders are not interested in tasks, they want to know only 2 things: What and When, they don’t want to know How.

And I have another one question concerning terminology. What does “epic story” mean in terms of user stories? How does it compare to user stories?

Looking forward to your reply.

Best regards,
Dmitry Lobanov

OK – so here’s my take on the questions raised:

User stories: I’ve had to deal with the question of what a user story is quite a few times over the years. There are many theoretical definitions – and I don’t care about most of them. In my opinion, the only thing that matters is this – a user-story should add business value. That raises the obvious question – shouldn’t everything you do while working on a project satisfy this criterion? Yes, it should – and this is usually something that business stake-holders understand quite well.

There are several kinds of stories. The first is the obvious “feature” stories that have a GUI etc. This is the category of stories that can be “touched” and is “tangible”. Justifying this one is easy – indeed, these are the ones most commonly requested by the business.

Then there are the “technical” stories that don’t have a UI – e.g. – “store the uploaded data in a compressed format because the space used up by the system right now is costing us too much money”. Again, the value is obvious.

The next category is bugs – fixing these delivers clear value, and they are no trouble to manage as such. See below for more information.

One other type of story is the spike. These are easy to deal with since it is obvious what value they deliver – namely an experiment to determine if an approach might work.

The final variation is stories that deal with paying down technical debt. These are harder to justify because the business is usually not technical enough to understand them. There is no easy answer to how to sell these – but there is usually a very tangible benefit that can be attained by playing these cards. Usually this benefit is deferred – allowing the team to build new features faster, or reducing the number of bugs being discovered in a subsystem. When put in these terms the business is usually quick to understand – and then it boils down to a question of ROI. The cost of implementing these cards should be less than the savings they represent. It is important for everyone on the team to see the whole – and make these decisions together. The key is to sell the business on the value these stories deliver as opposed to the technical details. And by the way, this should be done whenever the debt starts to weigh more heavily than it should. A clean code-base is a happy code-base.

How to deal with bugs: The short answer is I usually treat bugs as stories. They don’t need to be associated with a specific user story. A bug is a bug – it doesn’t matter how it came to be – (of course, it is often related to a story – and most often the same developer ends up fixing such bugs) – and I track them in the same system used for stories. I also let the same prioritization process determine which ones get fixed and when. Further, some bugs are so tiny that they may even get fixed without there being any record of them at all. Some are so large that they might actually result in multiple stories – especially if a redesign is required. Keeping in mind that it is cheaper to fix bugs sooner than later, bugs can be managed just like any other story.

So – to recap, I don’t care for the theoretical definition of user-stories – that they should be tangible and what not. Instead, I recommend that folks do what makes sense for their situation. If I had to give only one tip around stories, it would be this – keep them small. This mnemonic might help.

Finally – about epic stories. An epic is merely a feature (or some technical aspect of the system) that is way too large to be completed as a single user-story. It is epic, because it is a large story 🙂 I always keep playable stories down to less than 2 days in length – as far as possible. Epic stories, therefore, must be managed by breaking them down into incremental chunks of functionality. This is possible about 95% of the time – sometimes however when some spike is involved or some hairy technical refactoring is needed – longer stories can be played.

I hope these thoughts answer some of the questions. I’m sure other folks have different experiences and different solutions to these issues – and that’s fine too, since agile is all about adapting.

HBase: On designing schemas for column-oriented data-stores

At Runa, we made an early decision to optimize. 🙂

We decided not to go down the route of scaling a traditional database system (the one in question at the time was MySQL) and instead to use a column-oriented data-store that was built for scaling. After a cursory evaluation (which was done by me – and it involved a few hours of browsing the web (pseudo research), checking email (not research), and instant-messaging with a couple of buddies (definitely not research)) – we picked HBase.

OK, so the real reason was that we knew a few companies were using it, and that there seemed to be a little bit more of a community around it. In fact, there are a couple of HBase user-groups out here in the bay area. Someone recently asked me about CouchDB, and I’ve noticed a lot more buzz about it now that I’m watching for it… I’ve no good reason why we didn’t pick it instead. They’re both Apache projects… maybe we’ll have use of CouchDB someday.

HBase schemas:

So anyway, we picked HBase. Now, having done so, and also having spent all my life using relational databases, I had no idea how to begin using HBase – the very first question sort of stumped me – what should the schema look like?

Here’s what I figured out – and I’m sure people will flame me for this – you don’t really need to design one. You just take your objects, extract their data using something like protocol buffers, yaml/json, or even (ugh!) XML – and then stick those into a single column in an HBase table.

Our simplistic HBase mapper:

We use JSON – but we’re doing something slightly different. Think of our persistable object represented by a fairly shallow JSON object – like so:

:painting => {
  :trees => [ "cedar", "maple", "oak"],
  :houses => 4,
  :cars => [
    {:make => 'honda', :model => 'fit', :license => 'ah12001'},
    {:make => 'toyota', :model => 'yaris', :license => 'xb34544'}],
  :road => {:name => '101N', :speed => 65}
}

OK, bizarre example. Still – the way you would persist this in HBase with the common approach would be to create an HBase table that had a simple schema – a single column family with one column – and you’d store the entire JSON message as text in each row. Maybe you would compress it. The row-id could be something that makes sense in your domain – I’ve seen user-ids, other object-ids, even time-stamps used. We use time-stamps for one of our core tables.

The variation we’re using is instead of storing the whole thing as a ‘blob’, we’re splitting it into columns – so that the tree-like structure of this JSON message is represented as a flat set of key-value pairs. Then, each value can be stored under a column-family:column-name.

The example above would translate to the following flat structure –

{
  "road:name" => "101N",
  "road:speed" => 65,
  "houses:" => 4,
  "trees:cedar" => "cedar",
  "trees:maple" => "maple",
  "trees:oak" => "oak",
  "cars_model:ah12001" => "fit",
  "cars_model:xb34544" => "yaris",
  "cars_make:ah12001" => "honda",
  "cars_make:xb34544" => "toyota"
}

Now it is ready to be inserted into a single row in HBase. The table we’re inserting into has the following column-families –

"road"
"houses"
"trees"
"cars_model"
"cars_make"

This can then be read back and converted into the original object easily enough.
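The whole flattening step can be sketched in a few lines of Clojure. This is an illustration under stated assumptions rather than our actual mapper: flatten-object and column-families are hypothetical names, and pk-config is an assumed map from a key such as :cars to the attribute that identifies each child (here :license).

```clojure
(require '[clojure.string :as str])

;; Hypothetical sketch: flattens a shallow object into
;; "column-family:column-name" => value pairs.
(defn flatten-object
  [object pk-config]
  (into {}
        (mapcat
         (fn [[k v]]
           (let [family (name k)]
             (cond
               ;; a hash one level deep, e.g. :road
               (map? v)
               (for [[attr value] v]
                 [(str family ":" (name attr)) value])

               ;; an array of hashes, e.g. :cars
               (and (sequential? v) (every? map? v))
               (let [pk (get pk-config k)]
                 (for [child v
                       [attr value] (dissoc child pk)]
                   [(str family "_" (name attr) ":" (get child pk)) value]))

               ;; an array of primitives, e.g. :trees
               (sequential? v)
               (for [item v] [(str family ":" item) item])

               ;; a primitive, e.g. :houses
               :else
               [[(str family ":") v]])))
         object)))

;; The column-families a table would need for a given flat row.
(defn column-families
  [flat]
  (set (map #(first (str/split % #":")) (keys flat))))

(def painting
  {:trees ["cedar" "maple" "oak"]
   :houses 4
   :cars [{:make "honda" :model "fit" :license "ah12001"}
          {:make "toyota" :model "yaris" :license "xb34544"}]
   :road {:name "101N" :speed 65}})

(def flat (flatten-object painting {:cars :license}))
```

Calling (column-families flat) then gives the five family names listed above, which is a handy check that the table definition and the data agree.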

Column-names as data:

Something to note here is that there are now column-family:column-name pairs (that together constitute the ‘column-name’) that contain data themselves. An example is ‘cars_model:ah12001’, which is the name of the column whose value is ‘fit’.

Why do we do this? Because we want the entire object-graph flattened into one row, and this allows us to do that. (What is the primary key?)

The thing to remember here is that in HBase (and others like it), each row can have any number of columns (constrained only by the column-families defined) and rows can populate values for different columns, leaving others blank. Nulls are stored for free. Coupled with the fact that HBase is optimized for millions of columns, this pattern of data-modeling becomes feasible. In fact you could store tons of data in any row in this manner.

Final comments:

If you’re always going to operate on the full object graph, then you don’t really need to split things up this way – you could use one of the options described above (xml, json, or protocol buffers). If different clients of this data typically need only a subset of the object graph (say only the car models or only the speed limits of roads, or some such combination), then with this data-splitting approach, they could load up only the required columns.
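As a rough illustration of that last point, here is a sketch that picks out only some column-families from a flattened row. It works on an in-memory map purely to show the idea; select-families is a hypothetical name, and with a real HBase client you would instead ask the server for just those columns so the rest never gets loaded.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: keeps only the columns whose family is wanted.
(defn select-families
  [flat families]
  (let [wanted (set families)]
    (into {}
          (filter (fn [[column _]]
                    ;; the family is everything before the first colon
                    (contains? wanted (first (str/split column #":" 2)))))
          flat)))

(def flat-row
  {"road:name" "101N"
   "road:speed" 65
   "houses:" 4
   "cars_model:ah12001" "fit"
   "cars_model:xb34544" "yaris"
   "cars_make:ah12001" "honda"})

;; A client interested only in roads and car models.
(def subset (select-families flat-row ["road" "cars_model"]))
```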

This idea of using columns as data takes a little getting used to. I’m sure there are better ways of using such powerful data-stores – but this is the approach we’re taking right now, and it seems to be working for us so far.

Clojure HBase helper:

I’ve written some Clojure code that helps this transformation back and forth: hash-tables into/out of HBase. It is open-source – and once I clean it up, I will write about it.

Hope this stuff helps – and if I’ve described something stupid, then please leave a correction. Thanks!

P. S. – I met a Googler today who said BigTable (and by inference HBase) is not a column-oriented database. I think that is incorrect – at least according to wikipedia. I read it on the Internet, I must be right 🙂