Tuesday, January 08, 2019

nightly builds are too fast and too slow

I love build systems. I love working on them because it's such a unique blend between development and operations. A great build system is a great foundation for delivering value to users.

One term I often hear is "nightly build". As in, "Where can I download the nightly build?" or "Let's set up a nightly build."

"Nightlies" is a concept from the time where you'd set up a cron job to build your code from source control. You just poll CVS or Subversion every 24 hours, and build whatever's there. Tack a datestamp on the end of the build artifact and you're good to go.

In this post I want to talk about how "nightly" is almost always the wrong concept. They are too frequent, or else they are not frequent enough. Or if you're writing a catchy blog post title, they're too fast and too slow.

Nightlies are too slow

When you write code and test it, you want that feedback loop to be as tight as possible. Write code, save, compile, test - the faster these things happen, the faster your brain stays engaged.

If you have to sit and wait a few minutes to get information back about whether your code is correct or your build process succeeded, you're going to context switch to something else and lose time when you forget to switch back.

When we reach build processes that take hours, now we're in the "Meh, I'll check it when I'm back from lunch" territory. At that rate, you're probably only going to be running that process three or four times a day, max. Your workday is only eight hours, after all. The thoroughput for your changes drops through the floor.

Now imagine extending that feedback loop even further, to a full 24 hours.  You've just arrived at the "nightly build".

When that nightly build breaks, you have eight working hours to fix it and then you get to wait again for tomorrow morning when you find out the new problem.

After a few days of this, you no longer arrive at work with the same positive mental energy. Your morning email inbox experience becomes a thing where you discover what has gone wrong during the night, because you never saw it go right during the daytime.

Operational tempo slides further, because it feels like "everything takes so long around here." Teams lower their optimistic expectations that anything should ever happen quickly.

I've seen several odd knock-on effects here.

Sometimes what happens then is that you have multiple "nightlies" for a single day. One is the first broken nightly that ran in cron, and the others are multiple attempts where someone ran the script by hand trying to get it to pass. The "nightly" is no longer nightly. Odds are that those manual runs did not do everything exactly like the full cron job did. More confusion ensues across the organization.

When we only run a big ugly task once at midnight, then we don't care strongly about how long it takes. We've removed a big incentive to pay down the tech debt and work on shortening the long tasks, because they always happen while we're asleep. The big ugly tasks get progressively longer and longer, until an important emergency happens, and we have to run the task during working hours and we're unable to deliver in a timely way.

Another common papercut: someone will increase the frequency of the cron task so that it runs hourly, or every 20 minutes, instead of 24 hours. This is better, but unfortunately 20 minutes is still quite slow, and users will frequently multi-task away and forget to see the failure until hours or days have gone past.  There is also something maddeningly unclear about this type of every-couple-of-minutes scheduling. Is that cron job going to kick off at the top of the hour, or some other time? Did I just miss it and I have to wait the full 20 minute period, or will it happen sooner? Should I bother someone if nothing appears to be happening, or did I just do my clock math wrong? This user experience is particularly demoralizing.

Increasing the cron task model's frequency also leads to the next problem, which is:

Nightlies are too fast

If you have a project with code that changes daily, then yep, you want to build it at least daily. But does your project change literally every day 365 days of every year? For most projects, the answer is no. Did any code really change on Saturday? Or Sunday? Not just one weekend, but every weekend?

If we simply build every day (or even every weekday), this only works for projects that always have one or more changes every 24 hours, on to infinity. In the case where nothing has changed in the last 24 hours, then we are needlessly rebuilding for no reason. If your artifacts are multiple gigabytes, stored on highly available storage, that is a lot of duplicated disk space.

There is also an impact to the rest of the pipeline here. If the QE team thinks they have to test every build, they may be wasting human effort and compute costs.

The typical improvement in this case is to build some kind of polling in, like "Poll this GitHub repository every day and run a build only if there are changes from last time". Jenkins in particular has really helped spread this model, because it can do this out of the box.

For small projects, it's usually trivial to answer "did anything change here"?  For example, it's really easy to run "git fetch" and see if there are any new changes, and then build those.

Sometimes your build process depends on many other things besides that single Git repository. For example, if you build a container that includes artifacts from several locations, then you will need to poll all of them to know if anything has changed. Many times those those locations are not trivial to poll with a one-liner.

Now you are in a poll-the-world model, asking yourself how to poll, what is a reasonable frequency to poll, and how annoyed will those administrators be if I hit their systems every 60 seconds?

These questions lead to spending more engineering effort or taking shortcuts which the QE team must pay for later.

What should we do instead?

Instead of talking about "nightly builds", let's talk about "CI builds".

Instead of a poll-the-world model, make the build systems event-driven.

This requires having a really solid grasp of your inputs and outputs. For example: my Jenkins job should fire if the code changes in Git *or* if one of the input binaries change versions, *and* it should feed its pass/fail status into these other three systems."

If you don't know the input events for your process, research more about the system that is upstream of you, instead of simply configuring your system to poll it.

Set the expectation that all the build pipeline processes for which you are responsible will happen immediately, without polling. This implicitly sets other expectations for all your other teams, particularly those upstream and downstream to you.

For the dev teams feeding into your build system, they should expect actions to happen immediately. If a developer does not see the build system immediately respond to their changes, their first mental response should be "that's broken and we can fix or escalate it" instead of "it's just slow" or "it's just me".

For QE teams that take output from your build system, you're communicating two things with an event-driven model. Firstly, when QE talk directly to a developer (skipping your role in the pipeline), and the developer says they've pushed some code, QE should immediately be able to see that the new code is building and is coming towards them. They should be checking the health of the pipeline as well, with positive expectations that they do not need to do complicated polling or involve you. Secondly, the fact that builds can arrive *at any time* means QE should set up their own automated triggers on your events, rather than polling you every 24 hours.

Technical implementations

Making all your tools event-driven is a long process, and in large systems it can take years. It's a culture shift as well as a technical shift.

You can definitely go a long way by using GitHub's webhooks and running everything in a single Jenkins instance.

When that no longer scales, you can run a message bus like RabbitMQ or ActiveMQ. At my current employer we have a company-wide message bus, and almost all the build and release tooing feeds into this bus. This lets each engineering team build operational pipelines that are loosely coupled from each other. There is a upward spiral effect: the more tools use the unified message bus, the more other tool authors want to use it. The messagebus has strong management support because it is the backbone of our CI efforts.

When all the automated stuff like webhooks or a messagebus are great, of course it is a good idea to build fallback support for polling as well in the off-chance that the messages do not get through. But polling should be the fallback position to get you past an emergency, not the norm.

Conclusion

We already have to wait for many things with our computers. Don't make "wall clock time" one of those things.

Don't build nightly. Build continuously on each change event.

Wednesday, July 18, 2018

"What problem are we trying to solve?"

I was in a meeting with some folks recently where the leader opened the floor for Suggestions.

At this point the meeting got very quiet. Not everyone in the group is an introvert, but everyone felt the awkwardness and no one wanted to speak up and sound dumb.

I'm not sure what was happening in everyone else's minds at that point, but for myself, I wasn't even sure what we were talking about. We were all staring at a blank whiteboard, going to write down "ideas" about ... something.

I finally asked:

"Man I'm sorry, I'm just not following here. What problem are we trying to solve?"

The leader next listed out three problems that he saw as important. We wrote them down and it kicked the conversation into gear. People started engaging with the list, asking, "Is that a big problem?" or  "Here's how I see that problem manifesting."

Sometimes brainstorming sessions or conversation-starters can be too open. We want to not leave anybody out, and seek ideas from everyone, so we cast such a wide fishing net that it's awkward. The conscious people are wracking their brains trying to figure out what the leader is asking for on this fishing expedition.

Next time you're in a meeting that's really hard to follow, and several folks are falling silent, don't assume it's "just you". Throw yourself under the bus in a whimsical way and ask aloud:

"Sorry, I'm lost. What problem are we trying to solve here?"

The next 20 minutes will be a lot more interesting!

Friday, June 08, 2018

Hope is my strategy

One of my favorite tech books is Google's Site Reliability Engineering book. They open with this tongue-in-cheek quote:

  "Hope is not a strategy" --  Traditional SRE saying

This attitude reminds me of the common idea in system administration that anything that can go wrong will go wrong. Murphy's law, "never trust a happy system administrator", and so on.

When your goal as a system admin is "100% service uptime", there's simply no way to meet that goal. You can only fail.

The authors of the SRE book look at the business costs and value to uptime metrics. They propose a different strategy, focusing on an "error budget" instead of a pure 100% uptime unachievable goal. Exposing this error budget allows the business decision makers to align with the inherent constraints engineers face when making speed vs safety decisions.

"Hope is not a strategy" means making our decisions data-driven rather than wishful-thinking-driven.

Operating like this means gathering a *lot* of data. Lots of monitoring, A/B testing, phased rollouts, and so on.

How much data is enough, though?

If you're Google or another big corporation, you can spend a lot of resources on monitoring and benchmarking. There's always more to measure, tweak and improve.

In my own life I can see the effects of wanting more and more data before making big personal decisions. It often means I delay beyond what's reasonable and miss out on opportunities because I'm risk-averse.

There's two sides to this problem:

- Bad: Wishful thinking, blind optimism, recklessness,.
- Also bad: Analysis paralysis, perfectionism, fear.

In 2018 I've faced some hard decisions in my personal life, where I have to make choices every week for how I'm going to live and what I'm going to do. These choices affect others around me as well.

At some point I have to stop gathering data. I don't have the resources to do the exhaustive research I daydream about for every decision. And even if I did, it's pure fantasy to think I can avoid pain and suffering in this life.

This prayer about serenity has really helped me this year:

  God grant me the serenity
  to accept the things I cannot change;
  courage to change the things I can;
  and wisdom to know the difference.

Of course I have to dig in and do the hard work - that's the "courage to change" part. But when decisions are murky, things are unclear, that is where serenity and wisdom come in. That is where hope is my strategy.

Where does your hope lie?

Tuesday, February 06, 2018

What is the place for one-on-one communication in open-source?

One of Red Hat's mantras is "develop in the open". There is an entire website, opensource.com, with tons of articles about this idea (this article in particular is great).

One aspect of "develop in the open" means keeping conversations as public as possible. Don't email or IM a developer directly; instead, email a development mailing list (possibly using the To: and CC: fields for your intended developer) or public IRC channel instead. It's hard to overstate the community benefits of this, and again opensource.com explains the benefits in more detail than you could ever want.

Sometimes people send me direct instant messages seeking information, instead of asking in a public channel. I think there is a fear of "spamming the channel" or fear of looking foolish. I can respect people's desire to avoid looking foolish. I've even done the same, and some wise people called me out on it. I suggest that you will not look nearly as foolish as you expected, though. Let's face it, if this topic was so obvious, you would have found some documentation on it already, right :) Maybe things are just hard to figure out! Maybe many other people would benefit from this probably-under-documented information!

In these conversations, I try to steer back into an IRC channel, replying "That's a great question. Would you be ok if we continued the conversation in #channel-that-relates-to-what-we-are-talking-about?" Then I tab over to that channel and say "so-and-so: we were just discussing <my rephrasing of their question>" to give some context to everyone else in the room. Then I answer the question so everyone can see it.

I've been thinking about a corollary to this concept this week: There is also a time for one-on-one IM conversations, and that is when you have to bring up a sensitive topic and you need to build some relational credibility.

Let's say I've noticed a mistake in some code or process. Let's also imagine I do not have positive relational credibility with the person responsible. Maybe this person is a different personality type than me, and we both drive each other nuts. Maybe it's been a pressure-cooker situation for any number of other reasons. If I bring up this person's mistake in a public IRC channel, we just go deeper on the negative spiral, and my behavior can look threatening. I've found it's more effective to bring up mistakes as privately as possible.

Of course we want to default to open and develop in the open. On the other hand, sometimes there is a greater good, where we a trade bit of openness for relational credibility. Once the relationship is there, maybe we'll get a chance to discuss future problems more openly without fear.

Wednesday, June 28, 2017

Forwarding gpg-agent to a container

I use Fedora on my main laptop, but sometimes I need to GPG-sign something in an Ubuntu environment.

I store my GPG key on my Yubikey and access the device with gpg-agent. Here are instructions for forwarding my gpg-agent connection into a Docker container.

This will only work on with a ubuntu:xenial image and newer, because Trusty has GPG 2.0, and this needs 2.1. Earlier versions of GPG 2 failed because they still need access to the data in secing.gpg. See https://www.gnupg.org/faq/whats-new-in-2.1.html#nosecring for more information.

On the host, bind-mount the gpg-agent socket when running the container:

docker run --volume /home/kdreyer/.gnupg/S.gpg-agent-extra:/gpg-agent --env GPG_AGENT_INFO=/gpg-agent:0:1 -ti ubuntu:xenial

Within the container: Xenial's gpg2 looks for the socket in ~/.gnupg, ignoring GPG_AGENT_INFO, so we have to link it in:

mkdir -p ~/.gnupg && chmod 700 ~/.gnupg
ln -s /gpg-agent ~/.gnupg/S.gpg-agent

Trust the kdreyer@redhat.com key:

gpg2 --keyserver keys.fedoraproject.org --recv 478A947F782096AC
echo -e "trust\n5\ny\n" | gpg2 --command-fd 0 --edit-key kdreyer@redhat.com

Test a signature operation:
echo hi | gpg2 -as -u kdreyer@redhat.com --use-agent 

Now we can use GPG with other tools, for example debsign:
debsign -p gpg2 tambo_0.4.0-0ubuntu0.16.04.1_source.changes

Note there's a bug in dput that it hardcodes the use of /usr/bin/gpg when verifying sigs, so you'll have to import your key again into the gpg1 key store:
gpg --keyserver keys.fedoraproject.org --recv 478A947F782096AC

And then you can upload to a Launchpad PPA:
dput ppa:kdreyer-redhat/ceph-medic tambo_0.4.0-0ubuntu0.16.04.1_source.changes

Wednesday, October 29, 2014

Sigal packaging and CentOS


My home server was running CentOS 6, and this was getting a bit long in the tooth:

  • The libwww-perl version that ships in CentOS 6 does not handle HTTPS certificates in a secure way. This was only fixed in LWP version 6. There's almost no chance of LWP getting rebased, since that module is part of Perl core, and it's so late in the RHEL 6 lifecycle.
  • The Python version (2.6) is so old that many Python apps no longer support it. The one I was particularly interested in was Sigal to generate my own photo gallery for my family.

I tried using the Python 3.3 software collections, and this worked well to get Sigal running in a Python 3.3 virtualenv.

For Perl, I didn't want to deal with SCLs, because my application has a long dependency chain, and I would need to rebuild a lot of SCL-style RPMs to get my app to work. I could just use the "cpan" tool (similar to virtualenv/pip), but I wanted to avoid the security and stability issues associated with using an essentially random snapshot in time of modules that I grabbed from upstream. I like the fact that Bugzilla is a central place to track CVEs, and I like the waiting period in epel-testing and the possibility for community collaboration there, etc.

The idea of using multiple SCLs and lots of non-packaged upstream modules was what pushed me to just bite the bullet and update to CentOS 7. CentOS base + CentOS extras + EPEL 7 already had all the deps for Sigal, except python-pilkit. I buckled down and learned just enough Python packaging techniques in order to package python-pilkit and python-sigal. And the best part is that the packages actually work on my new EL7 system (knock on wood).

sigal bundles some Javascript bits, and I'm not sure about the JS guidelines for EPEL. But otherwise I think the packages are close to being ready to submit to Fedora.