This series deals with the implementation of a unit testing process in a team or across multiple teams in an organization. Posts in the series include: Goals, Outcomes, Leading Indicators I, Leading Indicators II, Leading Indicators III, Leadership I, Leadership II, and The Plan.
In the last post we talked about the trend of failing builds as an indicator of how well the implementation is going.
The final metric we’ll look at, one that can indicate how our process will go, is also related to broken builds. It is the “time until broken builds are fixed,” or TUBBF. OK, I made up that acronym. But you can use it anyway.
If we want to make sure the process is implemented effectively, knowing that builds are broken is not enough. When builds break, they need to be fixed.
No surprise there.
Remember that the long-term goal is to have working code, and for that we need people to be attentive and responsive, fixing broken builds quickly. Tracking the TUBBF can help us achieve that goal. We can infer how well people understand the importance of working code by looking at how they treat broken builds.
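To make this concrete, here is a minimal sketch of how TUBBF could be computed from a CI server's build history. It assumes you can pull a chronological list of builds with a finish time and a pass/fail flag; the exact API and field names will differ per CI tool and are placeholders here.

```python
from datetime import datetime

def time_until_broken_builds_fixed(builds):
    """Return one 'time to fix' duration per red streak.

    `builds` is assumed to be a chronological list of (finished_at, passed)
    pairs pulled from the CI server -- a placeholder format, not any
    specific CI product's API.
    """
    fix_times = []
    broken_since = None
    for finished_at, passed in builds:
        if not passed and broken_since is None:
            broken_since = finished_at                    # build just went red
        elif passed and broken_since is not None:
            fix_times.append(finished_at - broken_since)  # back to green
            broken_since = None
    return fix_times

# Example: the build broke at 09:00 and was green again at 11:30.
history = [
    (datetime(2017, 6, 20, 8, 0), True),
    (datetime(2017, 6, 20, 9, 0), False),
    (datetime(2017, 6, 20, 10, 0), False),
    (datetime(2017, 6, 20, 11, 30), True),
]
print(time_until_broken_builds_fixed(history))  # [datetime.timedelta(seconds=9000)]
```

The average and the trend of these durations over time are the numbers worth watching.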
Sharing is caring
One of eXtreme Programming’s principles is shared code ownership, where no single person is the caretaker of a single piece of code. When our process succeeds, we want to see the corollary: everyone is responsible for every piece of code.
With small teams it’s easier to achieve. Alas, with scale it becomes harder. Usually teams specialize in bits of code, and conjure the old demon of ownership. With ownership comes blame and the traditional passing of the buck.
After all, our CI log says it right there: They broke the build by committing code that doesn’t have any resemblance or relation to our code. We can’t and won’t fix it. They broke it. They should fix it.
Then comes the next logical conclusion: if we didn’t break the build, we can continue to safely commit our code. After all, we know our code works; we wrote it.
And so every team blames the other, keeps committing unchecked changes, and the build remains red.
(By the way, maybe they did commit the last change before the build broke, but that doesn’t mean their changes were at fault. If a build takes a long time, changes usually accumulate until it starts, and the CI server flags only the last commit, even though that last one may be innocent.)
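Given that, a better starting point is to treat every commit in the batch as a suspect. Here is a small sketch that lists the commits between the revision the last green build ran against and the revision the red build ran against; the SHAs in the example are placeholders.

```python
import subprocess

def suspects(last_green_sha, red_sha, repo="."):
    """List every commit that entered the batched build between the last
    green build and the red one. Any of them may be the real culprit."""
    out = subprocess.run(
        ["git", "log", "--oneline", f"{last_green_sha}..{red_sha}"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

# Placeholder SHAs; in practice they come from the two builds' metadata.
for commit in suspects("a1b2c3d", "e4f5a6b"):
    print(commit)
```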
Everybody’s Mr. Fix-It
One of the drastic measures we can take is to lock the SCM system when the build breaks. That’ll teach them collective ownership.
But that doesn’t always work. People just continue to work on local copies, believing that somebody else is working relentlessly, even as we speak, on fixing the build.
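For completeness, here is roughly what such a lock could look like, sketched as a server-side pre-receive hook. The /var/ci/build_status file is a placeholder; the sketch assumes some other job records whether the main build is currently green or red.

```python
#!/usr/bin/env python3
# Sketch of a pre-receive hook that rejects pushes while the main build is red.
import sys
from pathlib import Path

STATUS_FILE = Path("/var/ci/build_status")  # assumed to be kept current by the CI server

def main():
    status = STATUS_FILE.read_text().strip() if STATUS_FILE.exists() else "green"
    if status == "red":
        sys.stderr.write("Push rejected: the build is red. Help fix it first.\n")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```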
Another option is to put the training wheels on. Train each team to keep its build green, without interference from other teams, by having it develop on a team-owned branch. We track the team’s behavior on their branch, encouraging them to fix the build. They are responsible for keeping the build on their own branch green. Only when the branch builds are stable and green is it OK to merge them to trunk.
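One way to make “stable and green” concrete is a small merge gate: check that the last few builds on the team’s branch all passed before allowing the merge. The fetch_branch_builds callback and the threshold of five builds below are assumptions for the sake of the sketch, not a rule.

```python
def ready_to_merge(branch, fetch_branch_builds, required_green=5):
    """True if the branch's last `required_green` builds all passed.

    `fetch_branch_builds` is a placeholder for whatever your CI server's
    API exposes; here it returns a chronological list of pass/fail flags.
    """
    recent = fetch_branch_builds(branch)[-required_green:]
    return len(recent) == required_green and all(recent)

# Example with a fake in-memory build history.
history = {"team-a": [True, True, False, True, True, True, True, True]}
print(ready_to_merge("team-a", lambda branch: history[branch]))  # True
```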
The worst option, and I’ve seen it many times, is having someone else be the bad cop.
Imagine an IT/DevOps/CI master who starts each day checking all the bad news from the night, tracking down the culprits, and making them (but mostly begging them) to make amends. Apart from not making the teams responsible for their own code, this doesn’t stop others from committing on top of a broken build, because the process keeps malfunctioning.
As long as we can track the TUBBF in some manner, we can redirect the programmers’ behavior toward a stable build and teach the responsibility of keeping it green. As we do this, we focus on the importance of shared responsibility and collect a bonus: working, sometimes even shippable, code.
2 Comments
David V. Corbin · June 20, 2017 at 4:30 pm
A simple solution to a complex problem: “Gated Builds”, then “broken code” (that which does not compile and pass all relevant tests) *never* gets checked into the repository in the first place.
Mark Waite · June 25, 2017 at 2:51 am
I don’t understand David Corbin’s claim that gated builds are a simple solution to a complex problem. The process that is being described provides a form of gating already. The gating is visible to all and causes finger-pointing and blame assignment. I believe he’s proposing to add another layer of safety checks (“gated builds”) which will detect more problems before they reach a destination where they can do harm.
I think the author is trying to lead us towards an organizational and behavioral change that will cause each person to act when a component is broken in a way that affects others. Gated builds may help with that organizational and behavioral change, but they (of themselves) seem unlikely to create that change.
David, can you explain further how gated builds are related to the organizational and behavioral issues which the author noted?
As an additional challenge, many problems are not discovered until they have already passed one or more gates. In that case, “gated builds” may need to be more like “a series of gated builds for each of the stages in the deployment lifecycle”. If that is the case, then isn’t the ability to “undo” the breaking change (whether that is a code change, an update to a new component, or an operating system update) a critical part of supporting the team as they make the organizational and behavioral changes?
Are there facilities that can support easier, more reliable rollback of changes? Or, maybe the preferred pattern is “roll forward, but with the breaking change reverted”.