Zero (known) bugs in production

Sep 22, 2023

Bugs are an inevitability in software. Either due to unforeseen complexity, or poorly translated user requirements, or emergent behaviour between new features and old - or heck, due to plain old human error - bugs are going to happen.

I feel like the way most organisations deal with bugs is broken. The never-ending backlog of bugs reminding everyone of the deficiencies of the system they manage. The endless prioritisation and reprioritisation meetings. The need to manage duplicates. The customer dissatisfaction with huge delays in fixing issues they felt strongly enough to report. This isn’t a fulfilling or enjoyable state to be in.

There’s also the problem with the cost of fixing bugs. The longer it takes to fix a bug after it has been introduced, the greater the cost to fix it in terms of engineering time and effort, not to mention the cost of churned customers, and lost sales.

An alternative approach that I recommend to every team I talk to is to have a policy of zero known bugs in production. Or, if that’s too hard, a practice of zero known bugs on release. This seems like an idealistic utopia, but it’s actually pretty simple and straightforward to implement.

When a bug gets reported, it becomes the next highest priority item to work on. Team members are allowed to finish the work they are currently on, but then they must pick up bugs until there are no more before moving on to new feature work.

This obviously works best when a team is already working in small batches (a topic for another post), but essentially it means that no new work can be started while known bug reports exist.

When a team member picks up a bug report they make a determination - fix now, or fix later. If the answer is fix later, be honest and delete it. Fix-later issues are the ones that end up languishing at the end of a never-ending list of things. You’re better off acknowledging that you will never get to it and treating it as such.

Determining what is classified this way is up to the individual team. Some teams may decide that every bug, no matter how small, is worthy of a fix. Others may decide that the odd typo that only appears in an error message on the 29th of February, and only on Android phones in portrait orientation, isn’t worth anyone’s time. My current team uses the measure of “would it surprise or confuse a user?” when considering borderline issues. In most cases we just end up fixing them.

In essence, this elevates bug reports to the same level as any outage that degrades system functionality.

This approach has many benefits over the traditional approach to managing bugs. First, it eliminates all triage and prioritisation meetings. Second, it fixes issues as they are surfaced - usually very close to when the issue was introduced, but at least while the person impacted still remembers the specific details needed to replicate it. Third, it brings the pain of bugs forward - bugs can no longer hide in a backlog, which means they carry a consequence.

That consequence is that when a bug is reported, it has to be fixed. Initially, this can be frustrating - engineers loathe context switching. So this acts as a subtle forcing function to change the way software is built in order to reduce the rate at which bugs enter the system. Solutions I’ve seen teams pick up include:

More focus on automated tests, to assist with building quality into the system
Greater care in deployment
Support for faster (and perhaps entirely automated) deployments and rollbacks
A focus on whether any of the underlying architectural decisions are the root cause of issues.

There is also an overall improvement to morale. Fewer meetings focused on discussing the failings of a system, combined with the absence of a never-ending list of bugs, does wonders for the team’s focus and morale. Instead of the constant reminder of all the things that are wrong, they can focus on adding value to the system.

I’ve also seen that it increases trust with stakeholders and users. Decisions on bugs are made quickly, and feedback is usually immediate. One leader in my current workplace is always very happy that her bug reports are immediately fixed. Now, we don’t give any users special treatment, but it does make your users feel pretty special when an issue they have reported is fixed quickly, often on the same day.

Implementing zero known bugs in 5 steps

So you’ve got an existing backlog of bugs. Obviously you can’t maintain zero bugs in production until you get to zero bugs. This can seem daunting. I’ve worked with teams that have thousands, or even tens of thousands of bugs sitting in their backlog.

Here’s a fairly reliable recipe for getting started.

Step 1 - Start measuring

One of the first things that I recommend getting to grips with is the size of the problem. There are 2 measures I use to do this:

First, track mean time to close bugs on a weekly basis. This is a measure of work hours from the time of reporting to a fix being deployed and running in production. This number will probably initially terrify you, but that’s fine. In the early days, this number is useful to shine a spotlight on how long users are waiting for a remedy to their issues.
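As a rough sketch of this first measure, the Python below groups closed bugs by the ISO week in which the fix was deployed and averages the time to close. The function name and the `(reported_at, fix_deployed_at)` input shape are assumptions for illustration, and it counts elapsed hours rather than work hours, which would need a working-calendar rule layered on top.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def weekly_mean_time_to_close(bugs):
    """Return {(iso_year, iso_week): mean hours to close} for closed bugs.

    Each bug is a (reported_at, fix_deployed_at) pair of datetimes.
    Elapsed hours are used as a simplification of "work hours".
    """
    hours_by_week = defaultdict(list)
    for reported_at, fix_deployed_at in bugs:
        iso = fix_deployed_at.isocalendar()
        hours = (fix_deployed_at - reported_at).total_seconds() / 3600
        # Bucket each bug by the ISO year and week in which its fix shipped.
        hours_by_week[(iso[0], iso[1])].append(hours)
    return {week: mean(hours) for week, hours in sorted(hours_by_week.items())}
```

Bucketing by the week of deployment (rather than the week of reporting) means a long-lived bug shows up in the week it finally gets fixed, which is usually when you want the spotlight on it.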

Secondly, track the rate of incoming bugs into the system each day on average. To do this, take the total number of bugs reported in the current month, and divide by the number of elapsed days. This gives you visibility into the rate at which your bug backlog is growing.
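The arithmetic for this second measure is small enough to sketch in a few lines of Python. The function name and the list-of-dates input are hypothetical, and counting today as an elapsed day is an assumption you may want to change.

```python
from datetime import date


def daily_bug_rate(reported_dates, today):
    """Average bugs reported per day so far in the month containing `today`."""
    this_month = [
        d for d in reported_dates
        if (d.year, d.month) == (today.year, today.month)
    ]
    elapsed_days = today.day  # assumption: today counts as an elapsed day
    return len(this_month) / elapsed_days
```

For example, five bugs reported by the 10th of the month gives a rate of 0.5 bugs per day.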

Step 2 - Clear the decks

Get brutal with the bug backlog. Anything that hasn’t been actively discussed in the last 3 months, delete it. Anything that is currently in the lowest priority state, delete it. Chances are these are not really important issues. If they are, they will get reported again!

It’s worth at this point resetting any automatic bug reporting tools (Bugsnag, Rollbar etc) so that if an issue recurs it will be logged again automatically.
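If your tracker can export the backlog, the two pruning rules from this step can be applied as a simple filter. This is only a sketch: the `Bug` fields, the `"lowest"` priority label, and treating 3 months as 90 days are all assumptions, and nothing here talks to a real tracker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Bug:
    title: str
    priority: str              # hypothetical labels, e.g. "high", "lowest"
    last_discussed: datetime   # last time anyone actively discussed it


def clear_the_decks(backlog, now, stale_after=timedelta(days=90),
                    lowest_priority="lowest"):
    """Split the backlog into (keep, delete) lists.

    A bug is deleted if it has not been discussed within `stale_after`
    (90 days approximates "3 months") or sits at the lowest priority.
    """
    keep, delete = [], []
    for bug in backlog:
        stale = now - bug.last_discussed > stale_after
        if stale or bug.priority == lowest_priority:
            delete.append(bug)
        else:
            keep.append(bug)
    return keep, delete
```

Returning the delete list rather than deleting in place lets the team skim it once before committing.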

Step 3 - Commit to not making the problem any worse

Once you know the rate at which bugs are entering your system, commit to closing bugs at the same rate. If your defect rate is 2 per day, the team commits to closing 2 bugs per day. They don’t have to be the ones that just got reported. The goal is to stop the backlog getting bigger. Your current backlog size becomes your first baseline.

Step 4 - Set a new baseline

Next, the team should commit to a new maximum bug backlog size. The level at which you set the baseline will depend on the team’s goals and constraints. It’s important that the baseline is achievable and takes into account the rate at which bugs are entering the system.

Once your new baseline is achieved, reset the baseline lower, and repeat until you get to zero.
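One way to sanity check whether a baseline is achievable is to estimate how long it would take to reach, given the arrival rate and the rate at which the team closes bugs. The sketch below is a simple linear estimate under the assumption that both rates stay constant; the function and parameter names are made up for illustration.

```python
import math


def days_to_reach(backlog_size, target_size, arrivals_per_day, closes_per_day):
    """Estimate the days until the backlog shrinks to target_size.

    Returns None when bugs are not being closed faster than they arrive,
    because the backlog will never shrink at that pace.
    """
    if backlog_size <= target_size:
        return 0
    net_per_day = closes_per_day - arrivals_per_day
    if net_per_day <= 0:
        return None
    return math.ceil((backlog_size - target_size) / net_per_day)
```

With a backlog of 200, a target of 150, 2 bugs arriving per day and 3 closed per day, the estimate is 50 days. If the team only closes 2 per day, the function returns None, which is the Step 3 situation: the backlog holds steady but never reaches the new baseline.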

Step 5 - Defect days

Carve out a regular day in your schedule dedicated to crushing the bug backlog. Make a day of it - put the whole team in a shared space or offsite, get some coffees and pizzas. Motivate the team to close as many bugs as possible in a single work day. At the end of the defect day, set the baseline to the current bug backlog size.

Final thoughts

Once you are at zero bugs, the team can start focusing on other areas of improvement. A high defect rate is still going to carve into time that could be spent adding value, so a focus on assessing and improving the system to reduce the rate at which bugs arrive might be the next step.

The team may also look at reporting their mean time to close, and make improving responsiveness and reducing severity the next focus.

It’s been my experience that the biggest benefits of an approach like this are developer happiness and customer satisfaction. Nobody likes working on a system with thousands of known issues sitting on a list somewhere. Zero known bugs gives a team a sense of pride in their product, and gives customers confidence that their concerns will be addressed promptly.

Give it a try and let me know your experience!
