Startup logbook – v0.2 – distributed Clojure system in production
Posted by Amit Rathore on May 2, 2009
This past weekend, we pushed another major release into production. We’ve been working on several things and have made a few pushes since the last time I wrote about this – but this release has a bunch of interesting Clojure related stuff.
The main thing of note is that the majority of our back-end is now written in Clojure. You might recall that our online-merchant customers send us a lot of data, and we run a ton of analytics on that data. Our initial plans involved Ruby, but as we started using Clojure, it turned out that it is very well suited for this job as well (long running, message-driven processes that crunch numbers).
The raw data sits in HBase, and every night a “master” process starts up which kicks-off the processing of the previous day’s worth of data. The job of this master is only to coordinate the work (it doesn’t actually do any real work), it does this by breaking work into chunks and dispatching messages that each assign work to any worker process that picks it up. The master is single threaded for simplicity, but failure tolerant – it checkpoints everything in a local MySQL database, and if it crashes, it is automatically re-spawned and it recovers from where it left off.
An elastic cloud of worker processes run in anticipation of the master handing out this work. The worker processes use the MySQL database to keep track of their progress as well. The rest is rather domain-specific. We use intermediate representations of the raw data, which is also stored in HBase, before finally storing the summarized version again in HBase.
We use an in-house distributed-programming framework called Swarmiji to make such distributed programs very easy to write and run. Swarmiji implements a flavor of staged event-driven architecture (SEDA) to allow server processes that exhibit scalable, predictable throughput. This is especially true in the face of over-load, which we can certainly expect in our environment.
The reason I wrote this framework was that I wanted to create distributed, parallel programs which exploited large numbers of machines (like in a data-center) – without being limited by clojure’s in-JVM-threads-based model. So each worker process in Swarmiji gets deployed as a shared-nothing JVM process.
I will write up a post introducing Swarmiji in the next few weeks – once its a bit more battle-tested, and I’ve added a few more features (mainly around process management).