Startup logbook – v0.1 – Clojure in production

Late last night, around 3 AM to be exact, we made one of our usual releases to production (unfortunately, we’re still in semi-stealth mode, so I’m not talking a lot about our product yet – I will in a few weeks). There was nothing particularly remarkable about the release from a business point of view. It had a couple of enhancements to functionality, and a couple of bug-fixes.

What was rather interesting, at least to the geek in me, was that it was something of a technology milestone for us. It was the first production release of a new architecture that I’d been working on over the past 2-3 weeks. Our service contains several pieces, and until this release we had a rather traditional architecture – we were using ruby on rails for all our back-end logic (eg. real-time analytics and pricing calculations) as well as the user-facing website. And we were using mysql for our storage requirements.

This release has seen our production system undergo some changes. Going forward, the rails portion will continue to do what it was originally designed for – supporting a user-facing web-UI. The run-time service is now a combination of rails and a cluster of clojure processes.

When data needs to be collected (for further analytics down the line), the rails application simply drops JSON messages on a queue (we’re using the excellent erlang-based RabbitMQ), and one of a cluster of clojure processes picks it up, processes it, and stores it in an HBase store. Since each message can result in several actions that need to be performed (and these are mostly independent), clojure’s safe concurrency helps a lot. And since its a lisp, the code is just so much shorter than equivalent ruby could ever be.

Currently, all business rules, analytics, and pricing calculations are still being handled by the ruby/rails code. Over the next few releases we’re looking to move away from this – to instead let the clojure processes do most of the heavy lifting.

We’re hoping we can continue to do this in a highly incremental fashion, as the risk of trying to get this perfect the first time is too high. We absolutely need to get the feedback that only production can give us – so we’re more sure that we’re building the thing right.

The last few days have been the most fun I’ve had in any job so far. Besides learning clojure, and hadoop/ hbase pretty much at the same time (and getting paid for doing that!), it has also been a great opportunity to do this as incrementally as possible. I strongly believe in set-based engineering methods, and this is the approach I took with this as well – currently, we haven’t turned off the ruby/rails/mysql system – it is doing essentially the same thing that the new clojure/hbase system is doing. We’re looking to build the rest of the system out (incrementally), ensure it works (and repeat until it does) – before turning off the (nearly legacy) ruby system.

I’ll keep posting as I progress on this front. Overall, we’re very excited at the possibilities that using clojure represents – and hey, if it turns out to be a mistake – we’ll throw it out instead.

Design for throwability

Build one to throw away, you will anyhow. — Fred Brooks

Fred Brooks said what he did many, many years ago but it is probably just as true today. How many times have you and your team gotten a few months into a project only to realize all the design mistakes you made? Ask any engineer, and they’ll tell you they would build it right the second time.

This is just reality, the nature of discovering the complexity of the domain or the technology or the usage pattern or whatever else you didn’t know about when you started.

On the other hand, there’s this [what Joel Spolsky says about rewriting software] –

It is the single worst strategic mistake that any software company can make — Joel Spolsky

So… what gives? The answer, IMHO, is basically two things –

1. Understand and internalize the idea of the strangler application

2. Architect your system in such a way to support strangling it later

In essence, this means that because it would be a bad idea to rewrite the entire system from scratch, it must be built in a way so as to enable swapping out components of it as they are rewritten (or perhaps heavily refactored).

The architecture must draw from an approach called  concurrent set-based engineering (CSBE) – and indeed, sometimes each logical component would have more than one implementation. At Runa, two components of our system actually have two implementations each. And in each case, they’re both running in production – in parallel.

The way we accomplish this is through very loose-coupling. Additionally, because we take a very release-driven approach to our software process, our architecture evolves according to our current needs… and we refactor and extend things as new requirements are prioritized. At all times, despite our super-short release-cycles, our goal is to always have a version of the system in production. Whenever our pipeline tells us that a peice of the existing design may not work in the long term, we start to work on the replacements – more than one, and in parallel.

We run them in what I’ve been calling shadow-mode. This implies that its not quite part of the official system, but is running in order to prove some design hypothesis. Once everyone involved is satisfied with the results, we pick the most suitable sub-system and decommission  the other contenders (including the old one). At Runa, we achieve much of our inter-component loose-coupling via messaging (our current choice is RabbitMQ).

To summarize – we design everything  with one over-arching goal in mind  – the thing will be thrown away someday, and be replaced with another. As I said before, this enforces a few things –

1. Loose coupling

2. Clear interfaces between components

3. Good automated system testing!

About that last point – because we have many moving parts, functional testing becomes even more important. We currently use Selenium for true functional testing (Runa is a web-based service) – and a variety of other home-grown tools for custom systems testing. Not only do automated system tests tell us that the collaborating set of components are working right, but they also allow us to change things with impunity – knowing that we’ll know if things break.

This thinking is what I’ve been jokingly calling Design For Throwability – and it’s been working rather nicely. It’s essentially a design philosophy that embraces CSBE – and is especially useful for small startups where everything is changing quickly – almost by definition.

Lean Software Development For Startups

Here’s what I will be talking about at the Lean & Kanban 2009 conference in Miami this year.

Lean Software Development For Startups

(Or Why Agile Isnʼt Enough And How To Do More With Less)

Abstract:

If youʼre in a startup, then you know that statistically, the odds are heavily against you. Pretty much the only inherent characteristic of a startup that can be counted upon to help, is that of its small size. If the company can be nimble and agile, then it can hope to
gain some traction against its larger rivals.

In such an environment, using an Agile methodology is a given. Without some form of a hyper-iterative software process, it is impossible for a startup to create a successful product. Or even to determine what that product is!

In todayʼs climate, exacerbated as it is due to competition, lower capital requirements for software companies, the compression of Internet time, and the recessionary economic conditions, it is no longer enough to just use an Agile method. To stay
competitive, indeed to just survive, something more is required.

Lean Thinking provides just such an advantage.

A startup needs to ground its philosophy in Lean Thinking, Theory of Constraints, Critical Chain, Queueing Theory, Systems Thinking, and the like. It will obviously gain from the long-term focus, throughput-based accounting, and value-based constructs that these provide.

This presentation is about how Lean ideas when applied to standard Agile processes can make an organization super-productive even in the extreme short-term. Specifically, it draws on my experiences from having run multiple projects using this philosophy during my consulting career at ThoughtWorks, and more recently as a founding member of an Internet startup called Cinch.

—-

So, that’s what the talk will be about. I have a 45 minute slot – so I’m looking to finish speaking in about 20-25 minutes and the rest could be for a discussion. Hope to see you there!