Why Datomic?

Cross-posted from Zololabs.

Many of you know we’re using Datomic for all our storage needs for Zolodeck. It’s an extremely new database (not even version 1.0 yet), and is not open-source. So why would we want to base our startup on something like it, especially when we have to pay for it? I’ve been asked this question a number of times, so I figured I’d blog about my reasons:

  • I’m an unabashed fan of Clojure and Rich Hickey
  • I’ve always believed that databases (and the insane number of optimization options) could be simpler
  • We get basically unlimited read scalability (by upping read throughput in Amazon DynamoDB)
  • Automatic built-in caching (no more code to use memcached (makes DB effectively local))
  • Datalog-as-query language (declarative logic programming (and no explicit joins))
  • Datalog is extensible through user-defined functions
  • Full-text search (via Lucene) is built right in
  • Query engine on client-side, so no danger from long-running or computation-heavy queries
  • Immutable data – all versions of everything are kept automatically (a built-in audit trail)
  • “As of” queries and “time-window” queries are possible (see the sketch after this list)
  • Minimal schema (think RDF triples, except Datomic tuples also include the notion of time)
  • Supports cardinality out of the box (has-many or has-one)
  • These reference relationships are bi-directional, so you can traverse the relationship graph in either direction
  • Transactions are first-class (can be queried or “subscribed to” (for db-event-driven designs))
  • Transactions can be annotated (with custom meta-data) 
  • Elastic 
  • Write scaling without sharding (hundreds of thousands of facts (tuples) per second)
  • Supports “speculative” transactions that don’t actually persist to datastore
  • Out of the box support for in-memory version (great for unit-testing)
  • All this, and not even v1.0
  • It’s a particularly good fit with Clojure (and with Storm)
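
To give a flavor of a couple of those points – Datalog as the query language, and “as of” queries – here’s a minimal sketch. The connection conn, the point-in-time t, and the :person/name attribute are all hypothetical, not code from Zolodeck:

(require '[datomic.api :as d])

;; Datalog: find every value asserted under the (hypothetical) :person/name attribute
(d/q '[:find ?name
       :where [?e :person/name ?name]]
     (d/db conn))

;; "As of": run the same query against the database value as it existed at time (or tx) t
(d/q '[:find ?name
       :where [?e :person/name ?name]]
     (d/as-of (d/db conn) t))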

This is a long list, but perhaps begins to explain why Datomic is such an amazing step forward. Ping me with questions if you have ’em! And as far as the last point goes, I’ve talked about our technology choices and how they fit in with each other at the Strange Loop conference last year. Here’s a video of that talk.

Capjure: a simple HBase persistence layer

Or how to get some free help putting stuff into HBase. For Clojure programmers.

OK, so maybe this one can even be called simplistic. Still, it works for my needs right now – so I thought it might help others. I wrote about my use-case in a previous post about HBase schema design – and this is the little helper utility I mentioned towards the end of it.

How to use it

Download here. There are two vars that need appropriate bindings for capjure to work – *hbase-master* and *primary-keys-config*.

*hbase-master*

This one is obvious. It must be bound to the address of an HBase master server. For example, I might bind it to hbase.test-site.net:60000. Watch out for the fact that Amazon’s EC2 instances need lots of configuration to open such ports to the world. This has no real relevance to this post, just thought I’d mention it.

*primary-keys-config*

This one is a bit more involved – and I’m sure I’ve done a bad job of simplifying this usage pattern. Still, let’s consider the example from the previous post. When you have an array of several hashes as a value in your JSON object that is being persisted (e.g. for the :cars key) –

  :cars => [
    {:make => 'honda', :model => 'fit', :license => 'ah12001'},
    {:make => 'toyota', :model => 'yaris', :license => 'xb34544'}],

it will be converted into

{
  "cars_model:ah12001" => "fit",
  "cars_make:ah12001" => "honda",
  "cars_model:xb34544" => "yaris",
  "cars_make:xb34544" => "toyota"
}

To make this happen, capjure needs to know what to use as a primary-key. Or something like that 🙂 Here, we have decided upon the :license attribute of each hash. Capjure then removes that property from the child-hashes being saved, and sticks the value into the key part of the flattened data-structure as shown above.

This is accomplished by –

(def encoders (config-keys
                (config-for :cars :license (fn [car-map]
                                             (car-map :license)))))

Similarly, other primary-keys can be configured. And because the actual value used is whatever the configured function returns (as above), it can be as complex as needed. For example, if the values have spaces, you can encode them using some scheme.

A similar configuration is needed for this process to be reversed when reading out of HBase. The decoders look like this –

(def decoders (config-keys
                (config-for :cars :license (fn [value]
                                             value))))

In this case, we just use an identity function because the reverse mapping is straight-forward (in other words, we didn’t do anything fancy during the previous flattening operation). What happens is that a key-value pair – the key being the one specified (:license), and the value being whatever the function returns – is added to the object being re-hydrated.

Similarly, other configuration parameters can be added for other sub-objects that have primary-keys.

Together, the encoders and decoders form the *primary-keys-config*. Thus, if you do the following –

(def keys-config {:encode encoders :decode decoders})

then keys-config should be used as the value that *primary-keys-config* gets bound to.

Methods of interest

Once this is done, objects can be pushed into and out of HBase quite trivially –

(binding [*hbase-master* "hbase.test-site.net:60000" *primary-keys-config* keys-config]
	(capjure-insert some-json-object "hbase_table_name" "some-row-id"))

and –

(binding [*hbase-master* "hbase.test-site.net:60000" *primary-keys-config* keys-config]
	(read-as-hydrated "hbase_table_name" "some-row-id"))

Other convenience methods

Capjure provides other convenience methods like –

row-exists? [hbase-table-name row-id-string]
cell-value-as-string [row column-name]
read-all-versions-as-strings [hbase-table-name row-id-string number-of-versions column-family-as-string]
read-cell [hbase-table-name row-id column-name]
rowcount [hbase-table-name & columns]
delete-all [hbase-table-name & row-ids-as-strings]

and others. Everything is built on top of the HBase client API underneath. Thanks to Dan Larkin for clojure-json.
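
By the way, here’s a quick sketch of a couple of these convenience methods in use – same made-up table name, row id, and bindings as before:

(binding [*hbase-master* "hbase.test-site.net:60000" *primary-keys-config* keys-config]
  (when (row-exists? "hbase_table_name" "some-row-id")
    (read-cell "hbase_table_name" "some-row-id" "cars_make:ah12001")))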

Limitations

I’m no expert in persistence systems – and I’m sure this one has plenty of issues. The main limitation is that the object capjure can persist can only be so deep. Specifically, the object should be a hash that contains symbols (or strings) as keys, and the values can be strings (or other primitives), arrays of such primitives, a hash with one level of key-values, or an array of hashes that are one level deep.
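
To make that concrete, here is the kind of shape I mean (illustrative data only, reusing the fields from the previous post):

{:houses 4                                                  ;; a primitive value
 :trees  ["cedar" "maple" "oak"]                            ;; an array of primitives
 :road   {:name "101N" :speed 65}                           ;; a hash one level deep
 :cars   [{:make "honda" :model "fit" :license "ah12001"}]} ;; an array of one-level hashes

;; Anything deeper – e.g. {:road {:signs {:stop 3}}} – is beyond what capjure handles.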

Feedback welcome

Please contact me if you have suggestions and stuff. Again, the code is on github.

HBase: On designing schemas for column-oriented data-stores

At Runa, we made an early decision to optimize. 🙂

We decided not to go down the route of scaling a traditional database system (the one in question at the time was MySQL), and instead to use a column-oriented data-store that was built for scaling. After a cursory evaluation (which was done by me, and involved a few hours of browsing the web (pseudo research), checking email (not research), and instant-messaging with a couple of buddies (definitely not research)), we picked HBase.

OK, so the real reason was that we knew a few companies were using it, and that there seemed to be a little bit more of a community around it. In fact, there are a couple of HBase user-groups out here in the bay area. Someone recently asked me about CouchDB, and I’ve noticed a lot more buzz about it now that I’m watching for it… I have no good answer for why we didn’t pick it instead. They’re both Apache projects… maybe we’ll have a use for CouchDB someday.

HBase schemas:

So anyway, we picked HBase. Now, having done so, and also having spent all my life using relational databases, I had no idea how to begin using HBase – the very first question sort of stumped me – what should the schema look like?

Here’s what I figured out – and I’m sure people will flame me for this – you don’t really need to design one. You just take your objects, extract their data using something like protocol buffers, yaml/json, or even (ugh!) XML – and then stick those into a single column in an HBase table.

Our simplistic HBase mapper:

We use JSON – but we’re doing something slightly different. Think of our persistable object as being represented by a fairly shallow JSON object – like so:

:painting => {
  :trees => [ "cedar", "maple", "oak"],
  :houses => 4,
  :cars => [
    {:make => 'honda', :model => 'fit', :license => 'ah12001'},
    {:make => 'toyota', :model => 'yaris', :license => 'xb34544'}],
  :road => {:name => '101N', :speed => 65}
}

OK, bizarre example. Still – the way you would persist this in HBase with the common approach would be to create an HBase table that had a simple schema – a single column family with one column – and you’d store the entire JSON message as text in each row. Maybe you would compress it. The row-id could be something that makes sense in your domain – I’ve seen user-ids, other object-ids, even time-stamps used. We use time-stamps for one of our core tables.

The variation we’re using is that instead of storing the whole thing as a ‘blob’, we split it into columns – so that the tree-like structure of this JSON message is represented as a flat set of key-value pairs. Each value can then be stored under a column-family:column-name.

The example above would translate to the following flat structure –

{
  "road:name" => "101N",
  "road:speed" => 65,
  "houses:" => 4,
  "trees:cedar" => "cedar",
  "trees:maple" => "maple",
  "trees:oak" => "oak",
  "cars_model:ah12001" => "fit",
  "cars_model:xb34544" => "yaris",
  "cars_make:ah12001" => "honda",
  "cars_make:xb34544" => "toyota"
}
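
For concreteness, here’s a rough, illustrative sketch of that flattening in Clojure (this is not the helper mentioned at the end of the post – just the shape of the transformation, assuming :license is the chosen key for the :cars entries):

(defn flatten-painting [{:keys [trees houses cars road]}]
  (merge
   {"road:name"  (:name road)
    "road:speed" (:speed road)
    "houses:"    houses}
   (into {} (for [t trees] [(str "trees:" t) t]))
   (into {} (for [{:keys [make model license]} cars]
              {(str "cars_make:" license)  make
               (str "cars_model:" license) model}))))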

Now it is ready to be inserted into a single row in HBase. The table we’re inserting into has the following column-families –

"road"
"houses"
"trees"
"cars_model"
"cars_make"

This can then be read back and converted into the original object easily enough.
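
Reading back means parsing the data out of the column names again – roughly along these lines for the cars (again, just a sketch, not the actual helper):

(require '[clojure.string :as str])

(defn hydrate-cars [flat-map]
  (vec (for [[column-name v] flat-map
             :let [[family license] (str/split column-name #":")]
             :when (= family "cars_make")]
         {:make    v
          :model   (flat-map (str "cars_model:" license))
          :license license})))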

Column-names as data:

Something to note here is that the column-family:column-name pairs (which together constitute the ‘column name’) now contain data themselves. An example is ‘cars_model:ah12001’, which is the name of the column whose value is ‘fit’.

Why do we do this? Because we want the entire object-graph flattened into one row, and this allows us to do that (the row-id remains the key that identifies the whole object).

The thing to remember here is that in HBase (and others like it), each row can have any number of columns (constrained only by the column-families defined) and rows can populate values for different columns, leaving others blank. Nulls are stored for free. Coupled with the fact that HBase is optimized for millions of columns, this pattern of data-modeling becomes feasible. In fact you could store tons of data in any row in this manner.

Final comments:

If you’re always going to operate on the full object graph, then you don’t really need to split things up this way – you could use one of the options described above (xml, json, or protocol buffers). If different clients of this data typically need only a subset of the object graph (say only the car models, or only the speed limits of roads, or some such combination), then with this data-splitting approach they can load up only the required columns.

This idea of using columns as data takes a little getting used to. I’m sure there are better ways of using such powerful data-stores – but this is the approach we’re taking right now, and it seems to be working for us so far.

Clojure HBase helper:

I’ve written some Clojure code that helps this transformation back and forth: hash-tables into/out of HBase. It is open-source – and once I clean it up, I will write about it.

Hope this stuff helps – and if I’ve described something stupid, then please leave a correction. Thanks!

P. S. – I met a Googler today who said BigTable (and by inference HBase) is not a column-oriented database. I think that is incorrect – at least according to wikipedia. I read it on the Internet, I must be right 🙂

The mysql gem and Intel macs

I tried getting rails up and running on my (fairly) new iMac a few days back, and although I installed the mysql gem correctly, the code was unable to connect to the mysql server. In fact, it threw an error –

mysql.c:2015: error: ulong undeclared

It befuddled me for a bit, because when the mysql gem is installed (gem list shows it), it compiles the native extensions – and that had succeeded. Or so I thought. Long story short – there appears to be a problem when trying to compile this gem on Intel Macs – and here’s the fix.

1. Go to the gems directory – it will be somewhere under your ruby directory. Edit mysql.c, and add “#define ulong unsigned long” to it.
2. Then do ruby extconf.rb install mysql -- --with-mysql-dir=yourmysqldirectory/
3. make
4. make install

This fixed things, and now, mysql can be used satisfactorily.

SQLite with RubyOnRails

That was nightmarish!

All I wanted was SQLite working with Rails. That’s all. And it was quick and easy – on my Ubuntu workstation at home, that is. And then began 2 hours of gruesome agony – as I tried desperately to get the damn thing to work on my RHEL staging box.

I only have shell access to this machine – so no fancy graphical package manager. If I had a package manager, I would have selected libsqlite0-dev (the missing item) as well as libsqlite3-dev just to be safe, and would have saved myself the aforementioned two hours of hair-pulling.

Here are the steps to get SQLite working with Rails –

* Download sqlite from http://www.sqlite.org/download.html. Compile and install the usual way. Or just get the compiled versions.
* Install the sqlite dev libraries (both libsqlite3-dev AND libsqlite0-dev -> AAARGH!)
* Download, compile and install swig -> http://www.swig.org/download.html
* Install the sqlite3-ruby gem
* Download ruby-dbi and follow the instructions on http://ruby-dbi.rubyforge.org/

Everything should work now! I tested this with Typo 4.0.3 and it works. Phew!