Tuesday, January 08, 2019

nightly builds are too fast and too slow

I love build systems. I love working on them because it's such a unique blend between development and operations. A great build system is a great foundation for delivering value to users.

One term I often hear is "nightly build". As in, "Where can I download the nightly build?" or "Let's set up a nightly build."

"Nightlies" is a concept from the time where you'd set up a cron job to build your code from source control. You just poll CVS or Subversion every 24 hours, and build whatever's there. Tack a datestamp on the end of the build artifact and you're good to go.

In this post I want to talk about how "nightly" is almost always the wrong concept. They are too frequent, or else they are not frequent enough. Or if you're writing a catchy blog post title, they're too fast and too slow.

Nightlies are too slow

When you write code and test it, you want that feedback loop to be as tight as possible. Write code, save, compile, test - the faster these things happen, the faster your brain stays engaged.

If you have to sit and wait a few minutes to get information back about whether your code is correct or your build process succeeded, you're going to context switch to something else and lose time when you forget to switch back.

When we reach build processes that take hours, now we're in the "Meh, I'll check it when I'm back from lunch" territory. At that rate, you're probably only going to be running that process three or four times a day, max. Your workday is only eight hours, after all. The thoroughput for your changes drops through the floor.

Now imagine extending that feedback loop even further, to a full 24 hours.  You've just arrived at the "nightly build".

When that nightly build breaks, you have eight working hours to fix it and then you get to wait again for tomorrow morning when you find out the new problem.

After a few days of this, you no longer arrive at work with the same positive mental energy. Your morning email inbox experience becomes a thing where you discover what has gone wrong during the night, because you never saw it go right during the daytime.

Operational tempo slides further, because it feels like "everything takes so long around here." Teams lower their optimistic expectations that anything should ever happen quickly.

I've seen several odd knock-on effects here.

Sometimes what happens then is that you have multiple "nightlies" for a single day. One is the first broken nightly that ran in cron, and the others are multiple attempts where someone ran the script by hand trying to get it to pass. The "nightly" is no longer nightly. Odds are that those manual runs did not do everything exactly like the full cron job did. More confusion ensues across the organization.

When we only run a big ugly task once at midnight, then we don't care strongly about how long it takes. We've removed a big incentive to pay down the tech debt and work on shortening the long tasks, because they always happen while we're asleep. The big ugly tasks get progressively longer and longer, until an important emergency happens, and we have to run the task during working hours and we're unable to deliver in a timely way.

Another common papercut: someone will increase the frequency of the cron task so that it runs hourly, or every 20 minutes, instead of 24 hours. This is better, but unfortunately 20 minutes is still quite slow, and users will frequently multi-task away and forget to see the failure until hours or days have gone past.  There is also something maddeningly unclear about this type of every-couple-of-minutes scheduling. Is that cron job going to kick off at the top of the hour, or some other time? Did I just miss it and I have to wait the full 20 minute period, or will it happen sooner? Should I bother someone if nothing appears to be happening, or did I just do my clock math wrong? This user experience is particularly demoralizing.

Increasing the cron task model's frequency also leads to the next problem, which is:

Nightlies are too fast

If you have a project with code that changes daily, then yep, you want to build it at least daily. But does your project change literally every day 365 days of every year? For most projects, the answer is no. Did any code really change on Saturday? Or Sunday? Not just one weekend, but every weekend?

If we simply build every day (or even every weekday), this only works for projects that always have one or more changes every 24 hours, on to infinity. In the case where nothing has changed in the last 24 hours, then we are needlessly rebuilding for no reason. If your artifacts are multiple gigabytes, stored on highly available storage, that is a lot of duplicated disk space.

There is also an impact to the rest of the pipeline here. If the QE team thinks they have to test every build, they may be wasting human effort and compute costs.

The typical improvement in this case is to build some kind of polling in, like "Poll this GitHub repository every day and run a build only if there are changes from last time". Jenkins in particular has really helped spread this model, because it can do this out of the box.

For small projects, it's usually trivial to answer "did anything change here"?  For example, it's really easy to run "git fetch" and see if there are any new changes, and then build those.

Sometimes your build process depends on many other things besides that single Git repository. For example, if you build a container that includes artifacts from several locations, then you will need to poll all of them to know if anything has changed. Many times those those locations are not trivial to poll with a one-liner.

Now you are in a poll-the-world model, asking yourself how to poll, what is a reasonable frequency to poll, and how annoyed will those administrators be if I hit their systems every 60 seconds?

These questions lead to spending more engineering effort or taking shortcuts which the QE team must pay for later.

What should we do instead?

Instead of talking about "nightly builds", let's talk about "CI builds".

Instead of a poll-the-world model, make the build systems event-driven.

This requires having a really solid grasp of your inputs and outputs. For example: my Jenkins job should fire if the code changes in Git *or* if one of the input binaries change versions, *and* it should feed its pass/fail status into these other three systems."

If you don't know the input events for your process, research more about the system that is upstream of you, instead of simply configuring your system to poll it.

Set the expectation that all the build pipeline processes for which you are responsible will happen immediately, without polling. This implicitly sets other expectations for all your other teams, particularly those upstream and downstream to you.

For the dev teams feeding into your build system, they should expect actions to happen immediately. If a developer does not see the build system immediately respond to their changes, their first mental response should be "that's broken and we can fix or escalate it" instead of "it's just slow" or "it's just me".

For QE teams that take output from your build system, you're communicating two things with an event-driven model. Firstly, when QE talk directly to a developer (skipping your role in the pipeline), and the developer says they've pushed some code, QE should immediately be able to see that the new code is building and is coming towards them. They should be checking the health of the pipeline as well, with positive expectations that they do not need to do complicated polling or involve you. Secondly, the fact that builds can arrive *at any time* means QE should set up their own automated triggers on your events, rather than polling you every 24 hours.

Technical implementations

Making all your tools event-driven is a long process, and in large systems it can take years. It's a culture shift as well as a technical shift.

You can definitely go a long way by using GitHub's webhooks and running everything in a single Jenkins instance.

When that no longer scales, you can run a message bus like RabbitMQ or ActiveMQ. At my current employer we have a company-wide message bus, and almost all the build and release tooing feeds into this bus. This lets each engineering team build operational pipelines that are loosely coupled from each other. There is a upward spiral effect: the more tools use the unified message bus, the more other tool authors want to use it. The messagebus has strong management support because it is the backbone of our CI efforts.

When all the automated stuff like webhooks or a messagebus are great, of course it is a good idea to build fallback support for polling as well in the off-chance that the messages do not get through. But polling should be the fallback position to get you past an emergency, not the norm.

Conclusion

We already have to wait for many things with our computers. Don't make "wall clock time" one of those things.

Don't build nightly. Build continuously on each change event.