Tuesday, July 09, 2024

fail faster: better Python SSH timeouts

I use Python's Paramiko to automate many things at work. Recently I wrote an SFTP client library. It has a helper method like this:

 import getpass
 import paramiko

 def connect(username):
     password = getpass.getpass()
     # Passing a (host, port) tuple makes Transport open the socket itself.
     transport = paramiko.Transport((HOST, PORT))
     transport.connect(username=username, password=password)
     return paramiko.SFTPClient.from_transport(transport)

Callers can do "sftp = connect('kdreyer@example.com')", then do SFTP operations like uploading files, listing directories, etc.

Recently I hit a problem where the SSH server ("HOST") was not available over the network. A firewall was silently dropping TCP packets instead of rejecting them with a RST. As a result, when we called this connect() method in a CI job, we had to wait a long time for this code to time out.

As in all good networking stories, this did not happen every time, only sometimes, and especially at important times when we really needed the connection to work. This CI job was already large and long, the perfect opportunity for losing a human operator's focus and costing even more time when it dies slowly like this.

I began researching how to make Paramiko time out faster.

To begin, I set up another network environment that drops SSH packets in a similar way so that I could reproduce the problem on my laptop. I found Python was hanging in socket.connect().

It turns out that I can initialize Paramiko's Transport() class two ways. The first way is to pass in my own socket object. This is powerful but too advanced for most Paramiko users, so Paramiko also allows me to initialize with a hostname + port tuple (as I've done above). Transport operates on sockets, so the constructor will create the socket and call socket.connect() for me. Critically, it calls connect() with no timeout. There is no way to pass in a timeout to the Transport() constructor.
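For completeness, here is a minimal sketch of that first option: opening the socket with a timeout and handing it to Transport. The open_socket() helper name and the 3-second default are assumptions for illustration only. socket.create_connection() is from the standard library.

```python
import socket


def open_socket(host, port, timeout=3.0):
    """Open a TCP connection, giving up after `timeout` seconds."""
    # create_connection() applies the timeout to the connect() call and
    # leaves it set on the returned socket.
    return socket.create_connection((host, port), timeout=timeout)


# Assumed usage with the HOST and PORT constants from the helper above:
#   sock = open_socket(HOST, PORT)
#   transport = paramiko.Transport(sock)
```

This works, but it means owning the socket handling myself, which is the "too advanced" part.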

At this point I began looking at the broader Paramiko API. It turns out there is a much better way to initialize an SFTP client. Instead of creating my own paramiko.Transport, I can start with paramiko.SSHClient(). Its connect() method has several important features (like pre-selecting allowed authentication mechanisms), but the most important is the "timeout" argument! Paramiko sets this timeout on the socket object so we get TimeoutError sooner. I wrote a patch and the new connect() method looks like this:

 def connect(username):
     password = getpass.getpass()
     client = paramiko.SSHClient()
     client.connect(HOST, PORT, username, password, timeout=3.0,
                    allow_agent=False, look_for_keys=False)
     return client.open_sftp()

In terms of solving the bigger issue, usually I'd reach for the excellent backoff module in cases where infrastructure is unreliable, but this particular network failure is not one where we can simply catch Paramiko's TimeoutError and retry - unfortunately today it requires human involvement to fix. Eventually I will move the client to a more stable network, but that will require some broader architecture work. This Paramiko fix incrementally improves our operations in the meantime.

Wednesday, December 22, 2021

Intro to Koji video

(Cross-posting here from koji-devel)

This week I created an Introduction to Koji video. Koji is the build system we use for the Fedora Project and Red Hat products.

This video answers the following questions:

  • What is Koji, and how does it work?
  • What is the history of Koji in Fedora?
  • What are the unique design features of Koji?


 Direct link here: https://youtu.be/B0xfsJ1G4v0

Wednesday, October 07, 2020

How to succeed at Red Hat

I met someone recently who just graduated college and was on the job search.  I linked him to my favorite podcast episode on this topic. After listening to the episode, this person then asked me:

"What kind of person would succeed as a company such as Red Hat?"

I love this question. I asked around, and one colleague gave the following answer:

"I'd say someone that is passionate about enterprise technology and loves to learn in a fast moving environment."

I love this answer, and I'm going to break it into pieces to explain why this applies.


"Passionate"

This means that you understand more than just code. It means you understand the business (who our customers are, how we make them happy, and why people would pay us money). It also means that you care about your work: you understand who your users are, and you think about how to improve the experience of those users consuming the work that you produce.

"Enterprise technology"

What's Enterprise technology? It's technology that is stable, secure, and used in a wide variety of places. You will be a good fit if:

  • You care about making things work "out of the box" so users can get going quickly without stumbling blocks or a lot of other busy work.
  • You like writing clear documentation to help people who are not complete experts or Linux ninjas.
  • You care about making things stable and well-tested (understanding things like "backwards compatibility" and the different types of testing).
  • You care about security (understanding why things like authentication, encryption and signing are important).

"Learning in a fast-paced environment"

In Enterprise software, there is a natural pull to move slowly and stagnate. However, open-source is gigantic and moves incredibly fast. (In fact, the core of Red Hat's value proposition is that Red Hat drives innovation in "upstream" projects and then distills that down into stable products that customers can trust.)

I've been at Red Hat for six years, and the product and technology landscape has changed a lot. During that time, Red Hat acquired Ansible and we've integrated so many things with that now. Ceph used to just integrate with OpenStack, and now we've integrated Ceph with NFS, iSCSI and OpenShift as well. Every year I work on different things and find new ways to integrate new and different projects together.

When I joined Red Hat in 2014, I did not know Python or C++. One of the best things about accepting the job was that I knew I would be surrounding myself with people who knew these languages incredibly well. I picked up so much by osmosis, working alongside some senior developers on the team who could patiently explain things to me.

Technology aside, a few years ago I found the Manager Tools podcast and learned so much about how to communicate effectively (really important at Red Hat, where so much work is remote work). I also found a coach and learned how to get really clear on what I want, how to set boundaries effectively, how to finish projects on time, and how to handle personal challenges at work.

"Being a passionate learner" does not mean the often-ridiculed caricature of the programmer who spends every spare night and weekend away from family slaving away on open-source for free. It means someone who spends their time effectively, someone who knows how to balance the critical fire-fighting of the moment with the time to make long-term investments (whether that's technical research or organizational relationship-building). In fact, one of the things I'm passionate about is quitting work on time (or even early) and encouraging folks on my teams to do the same. Quitting on time is how we keep coming back tomorrow.

Saturday, May 02, 2020

in defense of code coverage

"How much do I care about code coverage?"

"100%!" (Kidding.)

SQLite's article on testing is helpful in this discussion. SQLite famously has 100% code coverage, but developers still find and fix bugs in SQLite. Why? Because code coverage is just one part of a testing strategy.

When I find a software project that publishes code coverage metrics, it tells me some things:

  • The developers know what a unit test is, and they care about writing tests. This increases my confidence in the stability of the project and my respect for the developers.
  • If I contribute to this project, I will need to learn enough to contribute a test for my feature as well.

Yes, it's helpful to know the percentage of code coverage for a project. "100%" doesn't mean the project is bug-free. It means that the developers care about shipping quality software, and they have an (imperfect) metric to help achieve that outcome.

Who are my expected users?

This question informs how much effort I put into code coverage.

A) I'm writing a library or API that will be used in many different ways by different teams that do not communicate with me or each other: code coverage tools are very important.

B) I'm writing a specialized tool that will only be used in one well-understood case by my immediate team of developers: coverage is nice to have, but I don't prioritize this as much.

C) I'm writing a one-off script for which I will be the sole user: This depends on a lot of factors of course, but normally I'm not going to put much effort into writing tests.

Of course the answer to "who are my expected users on this project?" can change.  One time I wrote a project for myself that ended up drawing a lot of unexpected users. It's important to periodically reevaluate the nature of a project's user base. Example: "Given the last 12 months, have my users' expectations for stability and quality changed?"

How are my users changing?

On one project I wrote, my first user base was very small. It consisted of developers and hackers who were very involved with feedback, design, and testing. Those original users moved on to other responsibilities, and new users who are unfamiliar with the code have replaced them. They have very different expectations and want the project to "just work".

This is an entirely different scenario. Documentation and regression testing are critical to sustaining the growth of the project. The new user base does not share the initial users' tolerance for breaking changes.

Bug reports will continue to come in. On one recent bug report, once I had identified the root cause and the exact function that was buggy, the next question I asked was "Do the unit tests cover this method?" Code coverage tools can quickly answer this question. When the answer is yes, I can modify the method with more confidence, because the existing tests are likely to catch a regression if I introduce one.
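As a rough illustration of what a coverage tool answers here, this sketch records which functions a test run actually enters, using the standard library's sys.settrace(). A real project would use a dedicated coverage tool instead. The function names (called_functions, parse_ok, parse_buggy, test_suite) are hypothetical.

```python
import sys


def called_functions(run_tests):
    """Run `run_tests` and return the names of every Python function it entered."""
    seen = set()

    def tracer(frame, event, arg):
        if event == "call":
            seen.add(frame.f_code.co_name)
        return None  # no per-line tracing needed for this sketch

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        run_tests()
    finally:
        sys.settrace(previous)
    return seen


# Hypothetical code under test.
def parse_ok(text):
    return text.strip()


def parse_buggy(text):
    return text.upper()


def test_suite():
    assert parse_ok(" a ") == "a"


covered = called_functions(test_suite)
print("parse_buggy" in covered)  # prints False: no test touches the buggy function
```

If the buggy function is missing from the covered set, the first step is to write a test for it before changing anything.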


Unit tests and code coverage are important, and the question is "how important?".

The best way to evaluate the importance of unit tests and code coverage is asking the question "What is today's cost of introducing and fixing a regression?" On a personal project, it's very low. On a popular core library with many users, the cost is very high. To answer that question accurately, you must understand your users.

Tuesday, January 08, 2019

nightly builds are too fast and too slow

I love build systems. I love working on them because it's such a unique blend between development and operations. A great build system is a great foundation for delivering value to users.

One term I often hear is "nightly build". As in, "Where can I download the nightly build?" or "Let's set up a nightly build."

"Nightlies" is a concept from the time when you'd set up a cron job to build your code from source control. You just poll CVS or Subversion every 24 hours, and build whatever's there. Tack a datestamp on the end of the build artifact and you're good to go.

In this post I want to talk about how "nightly" is almost always the wrong concept. Nightly builds are too frequent, or else they are not frequent enough. Or if you're writing a catchy blog post title, they're too fast and too slow.

Nightlies are too slow

When you write code and test it, you want that feedback loop to be as tight as possible. Write code, save, compile, test - the faster these things happen, the more your brain stays engaged.

If you have to sit and wait a few minutes to get information back about whether your code is correct or your build process succeeded, you're going to context switch to something else and lose time when you forget to switch back.

When we reach build processes that take hours, now we're in the "Meh, I'll check it when I'm back from lunch" territory. At that rate, you're probably only going to be running that process three or four times a day, max. Your workday is only eight hours, after all. The throughput for your changes drops through the floor.

Now imagine extending that feedback loop even further, to a full 24 hours.  You've just arrived at the "nightly build".

When that nightly build breaks, you have eight working hours to fix it and then you get to wait again for tomorrow morning when you find out the new problem.

After a few days of this, you no longer arrive at work with the same positive mental energy. Your morning email inbox experience becomes a thing where you discover what has gone wrong during the night, because you never saw it go right during the daytime.

Operational tempo slides further, because it feels like "everything takes so long around here." Teams lower their optimistic expectations that anything should ever happen quickly.

I've seen several odd knock-on effects here.

Sometimes what happens then is that you have multiple "nightlies" for a single day. One is the first broken nightly that ran in cron, and the others are multiple attempts where someone ran the script by hand trying to get it to pass. The "nightly" is no longer nightly. Odds are that those manual runs did not do everything exactly like the full cron job did. More confusion ensues across the organization.

When we only run a big ugly task once at midnight, then we don't care strongly about how long it takes. We've removed a big incentive to pay down the tech debt and work on shortening the long tasks, because they always happen while we're asleep. The big ugly tasks get progressively longer and longer, until an important emergency happens, and we have to run the task during working hours and we're unable to deliver in a timely way.

Another common papercut: someone will increase the frequency of the cron task so that it runs hourly, or every 20 minutes, instead of every 24 hours. This is better, but unfortunately 20 minutes is still quite slow, and users will frequently multi-task away and forget to see the failure until hours or days have gone past. There is also something maddeningly unclear about this type of every-couple-of-minutes scheduling. Is that cron job going to kick off at the top of the hour, or some other time? Did I just miss it and I have to wait the full 20 minute period, or will it happen sooner? Should I bother someone if nothing appears to be happening, or did I just do my clock math wrong? This user experience is particularly demoralizing.

Increasing the cron task model's frequency also leads to the next problem, which is:

Nightlies are too fast

If you have a project with code that changes daily, then yep, you want to build it at least daily. But does your project change literally every day 365 days of every year? For most projects, the answer is no. Did any code really change on Saturday? Or Sunday? Not just one weekend, but every weekend?

If we simply build every day (or even every weekday), this only works for projects that always have one or more changes every 24 hours, on to infinity. In the case where nothing has changed in the last 24 hours, we are rebuilding for no reason. If your artifacts are multiple gigabytes, stored on highly available storage, that is a lot of duplicated disk space.

There is also an impact to the rest of the pipeline here. If the QE team thinks they have to test every build, they may be wasting human effort and compute costs.

The typical improvement in this case is to build some kind of polling in, like "Poll this GitHub repository every day and run a build only if there are changes from last time". Jenkins in particular has really helped spread this model, because it can do this out of the box.

For small projects, it's usually trivial to answer "did anything change here?" For example, it's really easy to run "git fetch" and see if there are any new changes, and then build those.
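A minimal sketch of that single-repository check might look like the following. It uses "git ls-remote" rather than "git fetch" so that it needs no local clone. The function names, the repo_url parameter, and the default branch ref are assumptions for illustration.

```python
import subprocess


def remote_head(repo_url, ref="refs/heads/main"):
    """Return the commit hash that `ref` points to on the remote, or None."""
    # `git ls-remote <url> <ref>` prints one "<sha> <ref>" line per matching ref.
    out = subprocess.run(
        ["git", "ls-remote", repo_url, ref],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()[0] if out.strip() else None


def needs_build(current_head, last_built):
    """Build only when the remote has moved since the last build."""
    return current_head is not None and current_head != last_built
```

A cron job would call remote_head(), compare the result against the hash it stored after the previous build, and skip the build when needs_build() returns False.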

Sometimes your build process depends on many other things besides that single Git repository. For example, if you build a container that includes artifacts from several locations, then you will need to poll all of them to know if anything has changed. Many times those locations are not trivial to poll with a one-liner.

Now you are in a poll-the-world model, asking yourself how to poll, what is a reasonable frequency to poll, and how annoyed will those administrators be if I hit their systems every 60 seconds?

These questions lead to spending more engineering effort or taking shortcuts which the QE team must pay for later.

What should we do instead?

Instead of talking about "nightly builds", let's talk about "CI builds".

Instead of a poll-the-world model, make the build systems event-driven.

This requires having a really solid grasp of your inputs and outputs. For example: "my Jenkins job should fire if the code changes in Git *or* if one of the input binaries changes versions, *and* it should feed its pass/fail status into these other three systems."
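To make the inputs-and-outputs idea concrete, here is a toy in-process sketch. The EventBus class, the topic names, and the placeholder build step are all hypothetical. In practice the events would arrive through webhooks or a message bus rather than a Python object.

```python
from collections import defaultdict


class EventBus:
    """A toy in-process stand-in for webhooks or a message bus."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in list(self._handlers[topic]):
            handler(payload)


def wire_build_job(bus, build):
    """Run `build` on either input event and publish its pass/fail status."""
    def on_input_changed(payload):
        passed = build(payload)
        bus.publish("build.finished", {"input": payload, "passed": passed})

    # Inputs: either event triggers a build.
    bus.subscribe("git.push", on_input_changed)
    bus.subscribe("binary.new_version", on_input_changed)


bus = EventBus()
results = []
wire_build_job(bus, build=lambda payload: True)  # placeholder build step
bus.subscribe("build.finished", results.append)  # one downstream consumer
bus.publish("git.push", {"commit": "abc123"})
bus.publish("binary.new_version", {"name": "libfoo", "version": "1.2"})
print(len(results))  # prints 2: one build per input event
```

Nothing here waits on a clock. The build runs because an input changed, and downstream systems hear about the result the same way.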

If you don't know the input events for your process, research more about the system that is upstream of you, instead of simply configuring your system to poll it.

Set the expectation that all the build pipeline processes for which you are responsible will happen immediately, without polling. This implicitly sets other expectations for all your other teams, particularly those upstream and downstream to you.

For the dev teams feeding into your build system, they should expect actions to happen immediately. If a developer does not see the build system immediately respond to their changes, their first mental response should be "that's broken and we can fix or escalate it" instead of "it's just slow" or "it's just me".

For QE teams that take output from your build system, you're communicating two things with an event-driven model. Firstly, when QE talk directly to a developer (skipping your role in the pipeline), and the developer says they've pushed some code, QE should immediately be able to see that the new code is building and is coming towards them. They should be checking the health of the pipeline as well, with positive expectations that they do not need to do complicated polling or involve you. Secondly, the fact that builds can arrive *at any time* means QE should set up their own automated triggers on your events, rather than polling you every 24 hours.

Technical implementations

Making all your tools event-driven is a long process, and in large systems it can take years. It's a culture shift as well as a technical shift.

You can definitely go a long way by using GitHub's webhooks and running everything in a single Jenkins instance.
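If you end up receiving GitHub webhooks in your own code rather than through a Jenkins plugin, the core of the receiver is small. This sketch assumes a webhook configured with a shared secret, which GitHub uses to sign each delivery in the X-Hub-Signature-256 header. The function names are hypothetical, and the HTTP server around them is left out.

```python
import hashlib
import hmac
import json


def signature_is_valid(secret, body, signature_header):
    """Check the X-Hub-Signature-256 header against the raw request body.

    `secret` and `body` are bytes; `signature_header` is the header value.
    """
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, signature_header or "")


def build_request_from_push(body):
    """Pull the branch ref and new commit hash out of a push event payload."""
    event = json.loads(body)
    return {"ref": event["ref"], "commit": event["after"]}
```

A receiver would reject any delivery that fails signature_is_valid() and queue a build for everything else.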

When that no longer scales, you can run a message bus like RabbitMQ or ActiveMQ. At my current employer we have a company-wide message bus, and almost all the build and release tooling feeds into this bus. This lets each engineering team build operational pipelines that are loosely coupled from each other. There is an upward spiral effect: the more tools use the unified message bus, the more other tool authors want to use it. The message bus has strong management support because it is the backbone of our CI efforts.

While automated mechanisms like webhooks or a message bus are great, it is still a good idea to build fallback support for polling, on the off chance that the messages do not get through. But polling should be the fallback position to get you past an emergency, not the norm.
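One way to sketch "events first, polling as the fallback" is a consumer that waits on a queue of incoming messages and only polls when nothing arrives within a deadline. The next_trigger() name, the poll callable, and the use of an in-process queue.Queue in place of a real bus client are assumptions for illustration.

```python
import queue


def next_trigger(messages, poll, timeout):
    """Wait for a bus message; fall back to polling if none arrives in time."""
    try:
        return ("event", messages.get(timeout=timeout))
    except queue.Empty:
        # Fallback path: the bus was quiet (or broken), so check directly.
        return ("poll", poll())
```

With a long timeout, the polling branch only runs when something has gone wrong upstream, which is also a useful signal to alert on.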


We already have to wait for many things with our computers. Don't make "wall clock time" one of those things.

Don't build nightly. Build continuously on each change event.

Wednesday, July 18, 2018

"What problem are we trying to solve?"

I was in a meeting with some folks recently where the leader opened the floor for suggestions.

At this point the meeting got very quiet. Not everyone in the group is an introvert, but everyone felt the awkwardness and no one wanted to speak up and sound dumb.

I'm not sure what was happening in everyone else's minds at that point, but for myself, I wasn't even sure what we were talking about. We were all staring at a blank whiteboard, going to write down "ideas" about ... something.

I finally asked:

"Man I'm sorry, I'm just not following here. What problem are we trying to solve?"

The leader next listed out three problems that he saw as important. We wrote them down and it kicked the conversation into gear. People started engaging with the list, asking, "Is that a big problem?" or  "Here's how I see that problem manifesting."

Sometimes brainstorming sessions or conversation-starters can be too open. We want to not leave anybody out, and seek ideas from everyone, so we cast such a wide fishing net that it's awkward. The conscientious people are wracking their brains trying to figure out what the leader is asking for on this fishing expedition.

Next time you're in a meeting that's really hard to follow, and several folks are falling silent, don't assume it's "just you". Throw yourself under the bus in a whimsical way and ask aloud:

"Sorry, I'm lost. What problem are we trying to solve here?"

The next 20 minutes will be a lot more interesting!

Friday, June 08, 2018

Hope is my strategy

One of my favorite tech books is Google's Site Reliability Engineering book. They open with this tongue-in-cheek quote:

  "Hope is not a strategy" --  Traditional SRE saying

This attitude reminds me of the common idea in system administration that anything that can go wrong will go wrong. Murphy's law, "never trust a happy system administrator", and so on.

When your goal as a system admin is "100% service uptime", there's simply no way to meet that goal. You can only fail.

The authors of the SRE book look at the business costs and value of uptime metrics. They propose a different strategy, focusing on an "error budget" instead of the unachievable goal of pure 100% uptime. Exposing this error budget allows the business decision makers to align with the inherent constraints engineers face when making speed vs safety decisions.
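The arithmetic behind an error budget is simple enough to sketch. The 30-day window and the availability targets below are illustrative assumptions, not figures taken from the book.

```python
def error_budget_minutes(slo, days=30):
    """Minutes of allowed downtime in a window, given an availability target."""
    total_minutes = days * 24 * 60
    return (1 - slo) * total_minutes


print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes per 30 days
print(round(error_budget_minutes(0.9999), 1))  # 4.3 minutes per 30 days
```

A 100% target leaves a budget of zero, which is another way of saying that you can only fail.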

"Hope is not a strategy" means making our decisions data-driven rather than wishful-thinking-driven.

Operating like this means gathering a *lot* of data. Lots of monitoring, A/B testing, phased rollouts, and so on.

How much data is enough, though?

If you're Google or another big corporation, you can spend a lot of resources on monitoring and benchmarking. There's always more to measure, tweak and improve.

In my own life I can see the effects of wanting more and more data before making big personal decisions. It often means I delay beyond what's reasonable and miss out on opportunities because I'm risk-averse.

There are two sides to this problem:

- Bad: Wishful thinking, blind optimism, recklessness.
- Also bad: Analysis paralysis, perfectionism, fear.

In 2018 I've faced some hard decisions in my personal life, where I have to make choices every week for how I'm going to live and what I'm going to do. These choices affect others around me as well.

At some point I have to stop gathering data. I don't have the resources to do the exhaustive research I daydream about for every decision. And even if I did, it's pure fantasy to think I can avoid pain and suffering in this life.

This prayer about serenity has really helped me this year:

  God grant me the serenity
  to accept the things I cannot change;
  courage to change the things I can;
  and wisdom to know the difference.

Of course I have to dig in and do the hard work - that's the "courage to change" part. But when decisions are murky, things are unclear, that is where serenity and wisdom come in. That is where hope is my strategy.

Where does your hope lie?