Lately I've been dealing with an asynchronous TCP client app which sends
messages to a remote server. Some of these messages are important, and cannot
get lost. Because the connection may drop at any time, I had to implement a
mechanism to resend the message once the client reconnects. As such, I needed
a way to identify what constitutes a connection error.
Python provides a builtin ConnectionError exception precisely for this
purpose, but it turns out it's not enough. After observing logs in production,
I found some errors that were not related to the socket connection per se, but
rather to the system connectivity, like ENETUNREACH
("network unreachable") or ENETDOWN ("network down"). It's interesting to
note how this distinction is reflected in the UNIX errno code prefixes:
ECONN* (connection errors) vs. ENET* (network errors). I've noticed
ENET* errors usually occur on a DHCP renewal, or more in general when the
Wi-Fi signal is weak or absent. Because this code runs on a cleaning robot
which constantly moves around the house, connection can become unstable when
the robot gets far from the Wi-Fi Access Point, so it's pretty common to bump
into errors like these:
File "/usr/lib/python3.7/ssl.py", line 934, in send
return self._sslobj.write(data)
OSError: [Errno 101] Network is unreachable
File "/usr/lib/python3.7/socket.py", line 222, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution
File "/usr/lib/python3.7/ssl.py", line 934, in send
return self._sslobj.write(data)
BrokenPipeError: [Errno 32] Broken pipe
File "/usr/lib/python3.7/ssl.py", line 934, in send
return self._sslobj.write(data)
socket.timeout: The write operation timed out
Production logs also revealed a considerable amount of SSL-related errors. I
was uncertain what to do about those. The app is supposed to gracefully handle
them, so theoretically they should represent a bug. Still, they are
unequivocally related to the connection stream, and represent a failed attempt
to send data, so we want to retry it. Example of logs I found:
File "/usr/lib/python3.7/ssl.py", line 934, in send
return self._sslobj.write(data)
ssl.SSLZeroReturnError: TLS/SSL connection has been closed (EOF)
File "/usr/lib/python3.7/ssl.py", line 934, in send
return self._sslobj.write(data)
ssl.SSLError: [SSL: BAD_LENGTH] bad length
Looking at production logs revealed what sort of brutal, rough and tumble place
the Internet is, and how a network app must be ready to handle all sorts of
unexpected error conditions which hardly show up during testing. To handle all
of these cases I came up with this solution which I think is worth sharing, as
it's generic enough to be reused in similar situations. If needed, this can be
easily extended to include specific exceptions of third party libraries, like
requests.exceptions.ConnectionError.
import errno, socket, ssl
# Network errors, usually related to DHCP or wpa_supplicant (Wi-Fi).
NETWORK_ERRNOS = frozenset((
errno.ENETUNREACH, # "Network is unreachable"
errno.ENETDOWN, # "Network is down"
errno.ENETRESET, # "Network dropped connection on reset"
errno.ENONET, # "Machine is not on the network"
))
def is_connection_err(exc):
"""Return True if an exception is connection-related."""
if isinstance(exc, ConnectionError):
# https://docs.python.org/3/library/exceptions.html#ConnectionError
# ConnectionError includes:
# * BrokenPipeError (EPIPE, ESHUTDOWN)
# * ConnectionAbortedError (ECONNABORTED)
# * ConnectionRefusedError (ECONNREFUSED)
# * ConnectionResetError (ECONNRESET)
return True
if isinstance(exc, socket.gaierror):
# failed DNS resolution on connect()
return True
if isinstance(exc, (socket.timeout, TimeoutError)):
# timeout on connect(), recv(), send()
return True
if isinstance(exc, OSError):
# ENOTCONN == "Transport endpoint is not connected"
return (exc.errno in NETWORK_ERRNOS) or (exc.errno == errno.ENOTCONN)
if isinstance(exc, ssl.SSLError):
# Let's consider any SSL error a connection error. Usually this is:
# * ssl.SSLZeroReturnError: "TLS/SSL connection has been closed"
# * ssl.SSLError: [SSL: BAD_LENGTH]
return True
return False
To use it:
try:
sock.sendall(b"hello there")
except Exception as err:
if is_connection_err(err):
schedule_on_reconnect(lambda: sock.sendall(b"hello there"))
raise