Our load also grows in large step-functions. Each time we get a new customer, our services get called by another set of users (as all our customers’ customers get added on). We need our services to keep up with all this demand.
Finally, let’s take an example of one of the things we provide via these services – dynamic pricing for products. Obviously, the response to such calls needs to be in real-time – since the price has to be shown next to the product being browsed.
So we have a load as well as a response-time concern – as most typical web-scale services do.
Our approach has been to favor simplicity – very early on we introduced a messaging backbone.
Despite the fact that this picture looks a bit more complex than it would without the RabbitMQ portion, this has allowed us to do a few things –
- For those service calls that don’t need immediate responses (for instance, when our client websites send us data that we analyze later, or when we need to send an email notification), we just drop these onto an appropriate queue. An asynchronous processor picks up the message and does whatever is needed.
- For those services that need immediate responses, the call is handled synchronously by one of the application servers.
- For those services that are heavier in terms of the computation required, we split the request into pieces and have them run on separate machines. A simple federation model is used to coordinate the responses and they’re combined to return the result to the requester.
- With the above in place, and by ensuring that each partial service handler is completely stateless, we can scale by simply adding more machines. Transparently. The same is true for all the asynchronous processors.
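The split-and-combine idea from the list above can be sketched roughly as follows. This is only an illustration, not how Swarmiji or our production code works: the function names (`split_request`, `price_chunk`, `federated_price`) and the toy pricing formula are made up for the example, and a local thread pool stands in for the separate machines that would normally be reached through the message queue.

```python
from concurrent.futures import ThreadPoolExecutor


def split_request(product_ids, pieces):
    """Split a list of product ids into at most `pieces` chunks."""
    size = max(1, -(-len(product_ids) // pieces))  # ceiling division
    return [product_ids[i:i + size] for i in range(0, len(product_ids), size)]


def price_chunk(chunk):
    """Hypothetical stateless partial handler.

    The result depends only on the input, so any worker (or machine)
    can process any chunk. The formula is a placeholder, not real pricing.
    """
    return {pid: round(10.0 + (pid % 7) * 0.5, 2) for pid in chunk}


def federated_price(product_ids, workers=4):
    """Fan the chunks out to workers, then merge the partial results."""
    chunks = split_request(product_ids, workers)
    combined = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Threads stand in for remote machines in this sketch.
        for partial in pool.map(price_chunk, chunks):
            combined.update(partial)
    return combined


if __name__ == "__main__":
    print(federated_price(list(range(10)), workers=3))
```

Because `price_chunk` keeps no state between calls, raising `workers` (or, in the real setup, adding machines) changes only how the work is divided, not the combined result.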
As an aside – I’ve written a mini-framework to help with those last couple of bits. It is called Swarmiji – and once it works (well) in production, I will open-source it. It may be useful for other folks who want a sort of message-passing based parallel programming system in Clojure.
So anyway, with this messaging system in place, we can do a lot of things with individual services. Importantly, we can try different approaches to implementing them – including trying different technologies and even languages.
IMHO, you can’t really have a conversation about scalability without the context of how much load you’re talking about. When you’re just getting started with the system, this is moot – you don’t want to optimize early at all – and you can (mostly) ignore the issue altogether.
When you get to the next level – asynchronicity can get you far – and this side-steps the whole discussion of whether the language you’re using is efficient or not. Ruby is as good a choice as any in this scenario – Python or Java or most other languages will leave you in the same order of magnitude of capability. The key here is high-level design, and not doing too many obviously stupid things in the code.
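To make the asynchronicity point concrete, here is a minimal sketch of the "drop it on a queue and return" pattern. It assumes an in-process `queue.Queue` and a background thread as stand-ins for a real broker such as RabbitMQ and a separate processor; `start_processor` and `record_event` are hypothetical names invented for the example.

```python
import queue
import threading


def start_processor(q, handle):
    """Start a background worker that handles messages taken from `q`.

    Putting None on the queue acts as a stop signal (a convention
    chosen for this sketch).
    """
    def run():
        while True:
            message = q.get()
            try:
                if message is None:
                    return
                handle(message)
            finally:
                q.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


def record_event(q, event):
    """Request path: enqueue the event and return without waiting."""
    q.put(event)
    return "accepted"


if __name__ == "__main__":
    events = queue.Queue()
    processed = []
    worker = start_processor(events, processed.append)
    print(record_event(events, {"type": "page_view", "product": 42}))
    events.put(None)   # ask the worker to stop
    events.join()      # wait until everything queued has been handled
    print(processed)
```

The caller's response time here depends only on the cost of `q.put`, not on how slow the handler is, which is why the implementation language of the handler matters much less at this stage.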
When you do get crazy large (like Google or whatever), then you can start looking at ways to squeeze more out of each box – and here it may be possible that using the right language can be an important issue. I know Google even has a bunch of people working on compilers – purely to squeeze more out of the generated code. When you have tens of thousands of computers in your server farms, a 2% increase is probably worth a lot of money.
Still, this choice-of-language issue should be treated with caution. I’m personally of the opinion that programmer productivity is more important than raw language efficiency. That is one reason why we’re writing most of our system in a Lisp (Clojure) this time around. The other thing is that these days runtimes are not what they used to be – Ruby code (and Python and Clojure and Scala) can all run on the JVM. So you can get the benefits of all those years of R&D basically for free.
Finally, a word on our messaging system. We’re using RabbitMQ – and it is *fast*, reliable and scalable. It truly is the backbone of our system – allowing us to pick and choose technology, language, and approach. It’s also a good way to minimize risk – a sub-system can be replaced with no impact to the rest of the pieces.
Anyway – take all of this with many grains of salt – after all, we’re not the size of Google (or even Twitter) – so what do I know?