From Python 3.3 to today: ending 15 years of subprocess polling

One of the less fun aspects of process management on POSIX systems is waiting for a process to terminate. The standard library's subprocess module has relied on a busy-loop polling approach since the timeout parameter was added to Popen.wait() in Python 3.3, around 15 years ago (see source). And psutil's Process.wait() method uses exactly the same technique (see source).

The logic is straightforward: check whether the process has exited using non-blocking waitpid(WNOHANG), sleep briefly, check again, sleep a bit longer, and so on.

import os, time

def wait_busy(pid, timeout):
    end = time.monotonic() + timeout
    interval = 0.0001
    while time.monotonic() < end:
        # non-blocking check: returns (0, 0) while the child is still running
        pid_done, _ = os.waitpid(pid, os.WNOHANG)
        if pid_done:
            return
        time.sleep(interval)
        # exponential backoff, capped at 40ms
        interval = min(interval * 2, 0.04)
    raise TimeoutError

In this blog post I'll show how I finally addressed this long-standing inefficiency, first in psutil, and most excitingly, directly in CPython's standard library subprocess module.

The problem with busy-polling

  • CPU wake-ups: even with exponential backoff (starting at 0.1ms, capping at 40ms), the system constantly wakes up to check process status, wasting CPU cycles and draining batteries.
  • Latency: there's always a gap between when a process actually terminates and when you detect it.
  • Scalability: monitoring many processes simultaneously magnifies all of the above.
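
To put a rough number on the first point, here is a small sketch that counts how many times the backoff loop shown earlier would wake up during a given timeout. It only simulates time and ignores the cost of each waitpid() call, which is an assumption; count_wakeups() is a hypothetical helper, not part of psutil or subprocess.

```python
def count_wakeups(timeout, start=0.0001, cap=0.04):
    """Count the sleeps performed by the backoff loop over `timeout` seconds.

    Simulated time only: the duration of each waitpid() call is ignored
    (an assumption), so real numbers will differ slightly.
    """
    elapsed = 0.0
    interval = start
    wakeups = 0
    while elapsed < timeout:
        elapsed += interval  # stands in for time.sleep(interval)
        wakeups += 1
        interval = min(interval * 2, cap)
    return wakeups


print(count_wakeups(10))  # prints 258
```

The backoff reaches the 40ms cap after only 9 sleeps (about 51ms in total), so for any non-trivial timeout the loop settles at roughly 25 wake-ups per second for each process being waited on.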

Event-driven waiting

All POSIX systems provide at least one mechanism to be notified when a file descriptor becomes ready: the select(), poll(), epoll() (Linux) and kqueue() (BSD / macOS) system calls. Until recently, I believed they could only be used with file descriptors referencing sockets, pipes, etc., but it turns out they can also be used to wait for events on processes!

Linux

In 2019, Linux 5.3 introduced a new syscall, pidfd_open(), which was added to the os module in Python 3.9. It returns a file descriptor referring to a process. The interesting thing is that this file descriptor can be used in conjunction with select(), poll() or epoll() to effectively wait until the process exits. For example, using poll():

import os, select

def wait_pidfd(pid, timeout):
    pidfd = os.pidfd_open(pid)
    try:
        poller = select.poll()
        poller.register(pidfd, select.POLLIN)
        # block until process exits or timeout occurs (poll() takes milliseconds)
        events = poller.poll(timeout * 1000)
    finally:
        # always release the file descriptor
        os.close(pidfd)
    if events:
        return
    raise TimeoutError

This approach has zero busy-looping. The kernel wakes us up exactly when the process terminates or when the timeout expires if the PID is still alive.

I chose poll() over select() because select() has a historical file descriptor limit (FD_SETSIZE), which typically caps it at 1024 file descriptors per process (this reminded me of BPO-1685000).

I chose poll() over epoll() because it does not require creating an additional file descriptor. It also needs only a single syscall, which should make it a bit more efficient when monitoring a single FD rather than many.

macOS and BSD

BSD-derived systems (including macOS) provide the kqueue() syscall. It's conceptually similar to select(), poll() and epoll(), but more powerful (e.g. it can also handle regular files). kqueue() can be passed a PID directly, and it will return once the PID disappears or the timeout expires:

import select

def wait_kqueue(pid, timeout):
    kq = select.kqueue()
    try:
        kev = select.kevent(
            pid,
            filter=select.KQ_FILTER_PROC,
            flags=select.KQ_EV_ADD | select.KQ_EV_ONESHOT,
            fflags=select.KQ_NOTE_EXIT,
        )
        # block until process exits or timeout occurs (timeout is in seconds)
        events = kq.control([kev], 1, timeout)
    finally:
        # always release the kqueue file descriptor
        kq.close()
    if events:
        return
    raise TimeoutError

Windows

On Windows, neither psutil nor the subprocess module busy-loops, thanks to WaitForSingleObject. This means Windows has effectively had event-driven process waiting from the start, so there is nothing to do on that front.

Graceful fallbacks

Both pidfd_open() and kqueue() can fail for various reasons. For example, with EMFILE if the process runs out of file descriptors (the per-process limit is usually 1024), or with EACCES / EPERM if the syscall was explicitly blocked at the system level by the sysadmin (e.g. via seccomp). In all these cases, psutil silently falls back to the traditional busy-loop polling approach rather than raising an exception.
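
The fast-path-with-fallback idea can be sketched roughly as follows. This is a simplified illustration and not psutil's actual code: it only handles child processes (it relies on waitpid()), it only covers the Linux fast path, and wait_child(), _wait_pidfd() and _wait_busy() are hypothetical names.

```python
import os
import select
import sys
import time


def _wait_busy(pid, timeout):
    """Fallback: poll waitpid(WNOHANG) with exponential backoff."""
    end = time.monotonic() + timeout
    interval = 0.0001
    while True:
        pid_done, status = os.waitpid(pid, os.WNOHANG)
        if pid_done:
            return os.waitstatus_to_exitcode(status)
        if time.monotonic() >= end:
            raise TimeoutError
        time.sleep(interval)
        interval = min(interval * 2, 0.04)


def _wait_pidfd(pid, timeout):
    """Fast path (Linux >= 5.3, Python >= 3.9): True if the process exited."""
    pidfd = os.pidfd_open(pid)  # AttributeError / OSError if unavailable
    try:
        poller = select.poll()
        poller.register(pidfd, select.POLLIN)
        return bool(poller.poll(int(timeout * 1000)))
    finally:
        os.close(pidfd)


def wait_child(pid, timeout):
    """Wait for child `pid`; return its exit code or raise TimeoutError."""
    try:
        exited = _wait_pidfd(pid, timeout)
    except (AttributeError, OSError):
        # pidfd_open() is missing or failed (EMFILE, EPERM, ENOSYS, ...):
        # silently fall back to busy-polling
        return _wait_busy(pid, timeout)
    if not exited:
        raise TimeoutError
    # the process has exited: reap it (returns immediately)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)


if __name__ == "__main__":
    args = [sys.executable, "-c", "raise SystemExit(3)"]
    child = os.spawnv(os.P_NOWAIT, sys.executable, args)
    print(wait_child(child, timeout=10))  # prints 3
```

The caller never sees which path was taken: on systems without pidfd_open(), or when the call fails, the result is the same, only less efficient.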

This fast-path-with-fallback approach is similar in spirit to BPO-33671, where I sped up shutil.copyfile() by using zero-copy system calls back in 2018. There, the more efficient os.sendfile() is attempted first, and if it fails (e.g. on network filesystems) we fall back to the traditional read() / write() approach to copy regular files.

Measurement

As a quick experiment, here's a small program which waits on itself for 10 seconds without terminating:

# test.py
import psutil, os
try:
    psutil.Process(os.getpid()).wait(timeout=10)
except psutil.TimeoutExpired:
    pass

We can measure the number of context switches using /usr/bin/time -v. Before the patch (the busy-loop):

$ /usr/bin/time -v python3 test.py 2>&1 | grep context
    Voluntary context switches: 258
    Involuntary context switches: 4

After the patch (the event-driven approach):

$ /usr/bin/time -v python3 test.py 2>&1 | grep context
    Voluntary context switches: 2
    Involuntary context switches: 1

This shows that instead of spinning in userspace, the process blocks in poll() / kqueue(), and is woken up only when the kernel notifies it, resulting in just a few CPU context switches.
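
The same kind of counter can also be read from inside Python through resource.getrusage(), whose ru_nvcsw field reports voluntary context switches. The sketch below is only an illustration of the effect, assuming a POSIX system: it compares many short sleeps (a stand-in for busy-polling) with a single long sleep (a stand-in for blocking in the kernel). voluntary_switches() is a hypothetical helper, and the exact numbers depend on the machine.

```python
import resource
import time


def voluntary_switches(func):
    """Run func() and return how many voluntary context switches it caused."""
    before = resource.getrusage(resource.RUSAGE_SELF).ru_nvcsw
    func()
    after = resource.getrusage(resource.RUSAGE_SELF).ru_nvcsw
    return after - before


def many_short_sleeps():
    # stand-in for busy-polling: wake up 100 times
    for _ in range(100):
        time.sleep(0.001)


def one_long_sleep():
    # stand-in for a single blocking call
    time.sleep(0.1)


print("100 short sleeps:", voluntary_switches(many_short_sleeps))
print("1 long sleep:    ", voluntary_switches(one_long_sleep))
```

On a typical Linux box the first number should be close to the number of sleeps and the second close to one, mirroring the before / after figures above.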

Sleeping state

It's also interesting to note that waiting via poll() (or kqueue()) puts the process into the exact same sleeping state as a plain time.sleep() call. From the kernel's perspective, both are interruptible sleeps: the process is de-scheduled, consumes zero CPU, and sits quietly in kernel space.

In the "S+" state shown below by ps, "S" means interruptible sleep and "+" means the process is in the foreground process group.

  • time.sleep():
$ (python3 -c 'import time; time.sleep(10)' & pid=$!; sleep 0.3; ps -o pid,stat,comm -p $pid) && fg &>/dev/null
    PID STAT COMMAND
 491573 S+   python3
  • poll():
$ (python3 -c 'import os,select; fd = os.pidfd_open(os.getpid(),0); p = select.poll(); p.register(fd,select.POLLIN); p.poll(10_000)' & pid=$!; sleep 0.3; ps -o pid,stat,comm -p $pid) && fg &>/dev/null
    PID STAT COMMAND
 491748 S+   python3

CPython contribution

After landing the psutil implementation (psutil/PR-2706), I took the extra step and submitted a matching pull request for CPython's subprocess module: cpython/PR-144047.

I'm especially proud of this one: this is the second time in psutil's 17+ year history that a feature developed in psutil made its way upstream into the Python standard library. The first was back in 2011, when psutil.disk_usage() inspired shutil.disk_usage() (see python-ideas ML proposal).

Funny thing: 15 years ago, Python 3.3 added the timeout parameter to subprocess.Popen.wait() (see commit). That's probably where I took inspiration when I first added the timeout parameter to psutil's Process.wait() around the same time (see commit). Now, 15 years later, I'm contributing back a similar improvement for that very same timeout parameter. The circle is complete.

Links

Topics related to this:

  • psutil/#2712: proposal to extend this to multiple PIDs (psutil.wait_procs()).
  • psutil/#2703: proposal for asynchronous psutil.Process.wait() integration with asyncio.
  • cpython/#144211: proposal to extend the selectors module to enable asyncio optimization on BSD / macOS via kqueue().
