Tuesday, July 09, 2024

fail faster: better Python SSH timeouts

I use Python's Paramiko to automate many things at work. Recently I wrote an SFTP client library. It has a helper method like this:

 import getpass
 import paramiko

 # HOST and PORT are module-level constants defined elsewhere.
 def connect(username):
     password = getpass.getpass()
     transport = paramiko.Transport((HOST, PORT))
     transport.connect(username=username, password=password)
     return paramiko.SFTPClient.from_transport(transport)


Callers can do "sftp = connect('kdreyer@example.com')", then do SFTP operations like uploading files, listing directories, etc.

Recently I hit a problem where the SSH server ("HOST") was not available over the network. A firewall was silently dropping TCP packets instead of rejecting them with a RST. As a result, when we called this connect() method in a CI job, we had to wait a long time for this code to time out.

As in all good networking stories, this did not happen every time, only sometimes, and especially at important times when we really needed the connection to work. This CI job was already large and long, so a slow death like this was the perfect opportunity to lose a human operator's focus and cost even more time.

I began researching how to make Paramiko time out faster.

To begin, I set up another network environment that drops SSH packets in a similar way so that I could reproduce the problem on my laptop. I found Python was hanging in socket.connect().

It turns out that I can initialize Paramiko's Transport() class two ways. The first way is to pass in my own socket object. This is powerful but too advanced for most Paramiko users, so Paramiko also allows me to initialize with a hostname + port tuple (as I've done above). Transport operates on sockets, so the constructor will create the socket and call socket.connect() for me. Critically, it calls connect() with no timeout. There is no way to pass in a timeout to the Transport() constructor.
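
For comparison, the first option (passing in my own socket) can also carry a timeout. Here is a minimal sketch, assuming the standard library's socket.create_connection(). The helper names open_socket() and connect_via_transport() and the 3-second default are made up for illustration; they are not part of Paramiko or of the library described here:

```python
import socket


def open_socket(host, port, timeout=3.0):
    """Open a TCP connection, giving up after `timeout` seconds."""
    # create_connection() sets the timeout on the socket before connecting,
    # so silently dropped packets raise a timeout error instead of hanging.
    return socket.create_connection((host, port), timeout=timeout)


def connect_via_transport(host, port, username, password, timeout=3.0):
    """Hypothetical helper: hand a pre-connected socket to Transport."""
    # Imported here only so that open_socket() can be tried without Paramiko.
    import paramiko

    sock = open_socket(host, port, timeout=timeout)
    transport = paramiko.Transport(sock)
    transport.connect(username=username, password=password)
    return paramiko.SFTPClient.from_transport(transport)
```

This sketch is only meant to bound the TCP connect; it does not try to cover the later SSH negotiation or authentication steps.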

At this point I began looking at the broader Paramiko API. It turns out there is a much better way to initialize an SFTP client. Instead of creating my own paramiko.Transport, I can start with paramiko.SSHClient(). Its connect() method has several important features (like pre-selecting allowed authentication mechanisms), but the most important is the "timeout" argument! Paramiko sets this timeout on the socket object so we get TimeoutError sooner. I wrote a patch and the new connect() method looks like this:

 def connect(username):
     password = getpass.getpass()
     client = paramiko.SSHClient()
     client.load_system_host_keys()
     # timeout is in seconds and applies to the TCP connect
     client.connect(HOST, PORT, username, password, timeout=3.0,
                    allow_agent=False, look_for_keys=False)
     return client.open_sftp()
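
Once the timeout fires quickly, a CI job can turn it into a short, readable failure instead of a traceback. The following is a rough sketch of that idea, not code from my library: connect_or_exit() is a hypothetical wrapper, and it takes the connect function as a parameter so that it makes no assumptions about how the connection is made.

```python
import socket
import sys


def connect_or_exit(connect, *args, **kwargs):
    """Call connect(); exit with a one-line message if it times out."""
    try:
        return connect(*args, **kwargs)
    # On Python 3.10+ socket.timeout is an alias of TimeoutError; on older
    # versions it is a separate exception, so both are listed here.
    except (TimeoutError, socket.timeout) as exc:
        # sys.exit() with a string prints it to stderr and exits with status 1.
        sys.exit(f"SSH connection timed out: {exc}")
```

A caller would then write something like "sftp = connect_or_exit(connect, 'kdreyer@example.com')".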


In terms of solving the bigger issue, usually I'd reach for the excellent backoff module in cases where infrastructure is unreliable. This particular network failure, however, is not one where we can simply catch the TimeoutError and retry; unfortunately, today it requires human involvement to fix. Eventually I will move the client to a more stable network, but that will require some broader architecture work. This Paramiko fix incrementally improves our operations in the meantime.
