Lean Software Development: How to find bottlenecks – Metrics that Matter

In the past, I’ve written about why measuring actuals in the way a lot of people do, isn’t particularly effective. I’ve also written that I think the only way to truly improve the quality of a development team is by reducing its cycle-time. Which begs the question – what metrics ought one to collect in order to reduce cycle-time? This post focuses itself on one set of numbers that can be useful in order to pursue this reduction.

Queue-sizes as symptoms of constraints

Typical software teams have multiple roles – customers, analysts, developers, testers, and so on. To develop the best possible software, these folks should work together in a collaborative way. Despite this mode of interaction, there often are localized work products that each function works on – story cards and narratives, the code that implements the desired functionality, the manual, ad-hoc, and exploratory testing that needs to get done outside of what automation covers.

If we consider the entire team to be a system comprising of the various functional components, then the work of creating working software flows through it in the form of a story card that changes state from requirement-analysis to elaboration to development to testing. This shouldn’t be considered as a waterfall process – this is just how the work flows. An agile team would iterate through these phases very quickly and in a highly collaborative manner.

To be able to determine the health of such a system, which by the way can also be thought of as a set of queues that feed into each other, one needs to know how much work-in-progress is actually inside it at any given moment of time. To tell if the system is running at steady-state, the dashboard of the project team must include number of story-cards in each queue.

Let’s take an example –

queues.png

So the moment we see that the number of cards in a particular area begin to grow in an out-of-proportion manner, we know there is a bottle-neck, or in other words – we have found a constraint. By resolving it – maybe by adding people (always a good idea, right?), or reducing the batch-size (the number of stories picked up in an iteration), or by reducing the arrival rate (stretching out the iteration), or removing other waste that might be the cause (found through root-cause analysis) – one can improve the throughput of the software development team (system).

A Finger Chart

Analyzing carefully collected data is the only way to reliably diagnose a problem in a system – probably something familiar to all developers that were ever tasked with improving the performance of a piece of software. After collecting data, visualizing it helps even more. In the case of trying to discover a bottleneck through the data above, a finger chart can help. It looks like this –

finger_chart.png

The reason it’s called a finger chart is because it is supposed to look like the lines formed between stacked fingers. The width of each color represents the size of the queue. Now, finding the bottleneck is as simple as looking for a widening band. There’s your trouble area, the one you need to optimize.

Systems thinking

This technique may be useful in your team situation. However, it never does to miss the forest for the trees. In many cases, much more bang for the buck can be got by simply looking at an extended value-stream, as opposed to a localized one. In other words, the largest constraint may lie outside the team boundary – for example from having multiple stake-holders that often set conflicting goals for the team. Or that several critical team-members are involved in multiple projects and are therefore spread too thin to be productive on any. It behooves the team members, and indeed the management, to try and resolve issues at all levels by taking a systems approach to improvement.

In any case, despite the fact that queueing theory teaches us (among much else) that optimizing locally at various points ultimately results in lowered throughput of the complete system, it is useful at times to fix what one can. There are certainly several areas where things can be improved, and several ways of handling each. The finger chart above helps in one such area, and by mounting it on a team wall, it becomes an information radiator that lets everyone in on the health of the system.