Blog posts for tags/performance

  1. From Python 3.3 to today: ending 15 years of subprocess polling

    One of the less fun aspects of process management on POSIX systems is waiting for a process to terminate. The standard library's subprocess module has relied on a busy-loop polling approach since the timeout parameter was added to subprocess.Popen.wait() in Python 3.3, around 15 years ago (see source). And psutil's Process.wait() method uses exactly the same technique (see source).

    The logic is straightforward: check whether the process has exited using non-blocking waitpid(WNOHANG), sleep briefly, check again, sleep a bit longer, and so on.

    import os, time
    
    def wait_busy(pid, timeout):
        end = time.monotonic() + timeout
        interval = 0.0001  # start polling at 0.1 ms
        while time.monotonic() < end:
            # non-blocking check: waitpid() returns (0, 0) if still running
            pid_done, _ = os.waitpid(pid, os.WNOHANG)
            if pid_done:
                return
            time.sleep(interval)
            interval = min(interval * 2, 0.04)  # exponential backoff, capped at 40 ms
        raise TimeoutError
    

    In this blog post I'll show how I finally addressed this long-standing inefficiency, first in psutil and then, most excitingly, directly in CPython's subprocess module.

    The problem with busy-polling

    • CPU wake-ups: even with exponential backoff (starting at 0.1ms, capping at 40ms), the system constantly wakes up to check process status, wasting CPU cycles and draining batteries.
    • Latency: there's always a gap between when a process actually terminates and when you detect it.
    • Scalability: monitoring many processes simultaneously magnifies all of the above.

    Event-driven waiting

    All POSIX systems provide at least one mechanism to be notified when a file descriptor becomes ready. These are the select(), poll(), epoll() (Linux) and kqueue() (BSD / macOS) system calls. Until recently, I believed they could only be used with file descriptors referencing sockets, pipes, etc., but it turns out they can also be used to wait for events on process PIDs!

    Linux

    In 2019, Linux 5.3 introduced a new syscall, pidfd_open(), exposed in Python 3.9 as os.pidfd_open(). It returns a file descriptor referencing a process PID. The interesting thing is that this file descriptor can be used in conjunction with select(), poll() or epoll() to effectively wait until the process exits. E.g. by using poll():

    import os, select
    
    def wait_pidfd(pid, timeout):
        pidfd = os.pidfd_open(pid)
        try:
            poller = select.poll()
            poller.register(pidfd, select.POLLIN)
            # block until process exits or timeout occurs
            events = poller.poll(timeout * 1000)
            if events:
                return
            raise TimeoutError
        finally:
            os.close(pidfd)  # don't leak the file descriptor
    

    This approach has zero busy-looping. The kernel wakes us up exactly when the process terminates or when the timeout expires if the PID is still alive.

    I chose poll() over select() because select() has a historical file descriptor limit (FD_SETSIZE), which typically caps it at 1024 file descriptors per process (this reminded me of BPO-1685000).

    I chose poll() over epoll() because it does not require creating an additional file descriptor. It also needs only a single syscall, which should make it a bit more efficient when monitoring a single FD rather than many.

    macOS and BSD

    BSD-derived systems (including macOS) provide the kqueue() syscall. It's conceptually similar to select(), poll() and epoll(), but more powerful (e.g. it can also handle regular files). kqueue() can be passed a PID directly, and it will return once the PID disappears or the timeout expires:

    import select
    
    def wait_kqueue(pid, timeout):
        kq = select.kqueue()
        try:
            kev = select.kevent(
                pid,
                filter=select.KQ_FILTER_PROC,
                flags=select.KQ_EV_ADD | select.KQ_EV_ONESHOT,
                fflags=select.KQ_NOTE_EXIT,
            )
            # block until process exits or timeout occurs
            events = kq.control([kev], 1, timeout)
            if events:
                return
            raise TimeoutError
        finally:
            kq.close()  # don't leak the kqueue file descriptor
    

    Windows

    On Windows, neither psutil nor the subprocess module busy-loops: both rely on WaitForSingleObject, which means Windows has effectively had event-driven process waiting from the start. So there was nothing to do on that front.
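
    For illustration, this is roughly what an event-driven wait looks like via ctypes (a minimal sketch with simplified error handling; the real subprocess implementation goes through the internal _winapi module instead):

    import ctypes

    SYNCHRONIZE = 0x00100000   # access right required to wait on a process
    WAIT_TIMEOUT = 0x00000102  # WaitForSingleObject() return value on timeout

    def wait_windows(pid, timeout):
        kernel32 = ctypes.windll.kernel32
        handle = kernel32.OpenProcess(SYNCHRONIZE, False, pid)
        if not handle:
            raise OSError("OpenProcess failed")
        try:
            # block until the process exits or the timeout expires
            ret = kernel32.WaitForSingleObject(handle, int(timeout * 1000))
            if ret == WAIT_TIMEOUT:
                raise TimeoutError
        finally:
            kernel32.CloseHandle(handle)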

    Graceful fallbacks

    Both pidfd_open() and kqueue() can fail for different reasons: with EMFILE if the process runs out of file descriptors (the default limit is usually 1024), or with EACCES / EPERM if the syscall is explicitly blocked at the system level by the sysadmin (e.g. via seccomp). In all such cases, psutil silently falls back to the traditional busy-loop polling approach rather than raising an exception.
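
    In code, the fallback logic looks roughly like this (a sketch reusing the wait_pidfd() and wait_busy() helpers from above; the actual psutil code differs in the details). Note that TimeoutError is an OSError subclass, hence the explicit re-raise:

    def wait_process(pid, timeout):
        try:
            # fast path: event-driven wait (wait_kqueue() on BSD / macOS)
            return wait_pidfd(pid, timeout)
        except TimeoutError:
            raise  # the wait itself timed out: don't retry
        except OSError:
            # pidfd_open() failed (EMFILE, EACCES, EPERM, ...):
            # fall back to busy-loop polling
            return wait_busy(pid, timeout)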

    This fast-path-with-fallback approach is similar in spirit to BPO-33671, where I sped up shutil.copyfile() by using zero-copy system calls back in 2018. There, the more efficient os.sendfile() is attempted first; if it fails (e.g. on network filesystems), we fall back to the traditional read() / write() approach to copy regular files.

    Measurement

    As an experiment, here's a simple program which waits on itself for 10 seconds without terminating:

    # test.py
    import psutil, os
    try:
        psutil.Process(os.getpid()).wait(timeout=10)
    except psutil.TimeoutExpired:
        pass
    

    We can measure the CPU context switching using /usr/bin/time -v. Before the patch (the busy-loop):

    $ /usr/bin/time -v python3 test.py 2>&1 | grep context
        Voluntary context switches: 258
        Involuntary context switches: 4
    
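    As a sanity check, that number lines up with what the backoff schedule predicts: for a 10-second wait, wait_busy() sleeps about 258 times, i.e. one voluntary context switch per wake-up:

    # estimate the number of sleeps in a 10 s wait_busy() call
    n, elapsed, interval = 0, 0.0, 0.0001
    while elapsed < 10:
        elapsed += interval
        interval = min(interval * 2, 0.04)
        n += 1
    print(n)  # 258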

    After the patch (the event-driven approach):

    $ /usr/bin/time -v python3 test.py 2>&1 | grep context
        Voluntary context switches: 2
        Involuntary context switches: 1
    

    This shows that instead of spinning in userspace, the process blocks in poll() / kqueue(), and is woken up only when the kernel notifies it, resulting in just a few CPU context switches.

    Sleeping state

    It's also interesting to note that waiting via poll() (or kqueue()) puts the process into the exact same sleeping state as a plain time.sleep() call. From the kernel's perspective, both are interruptible sleeps: the process is de-scheduled, consumes zero CPU, and sits quietly in kernel space.

    The "S+" state shown below by ps means that the process is "sleeping in the foreground".

    • time.sleep:
    $ (python3 -c 'import time; time.sleep(10)' & pid=$!; sleep 0.3; ps -o pid,stat,comm -p $pid) && fg &>/dev/null
        PID STAT COMMAND
     491573 S+   python3
    
    • select.poll:
    $ (python3 -c 'import os,select; fd = os.pidfd_open(os.getpid(),0); p = select.poll(); p.register(fd,select.POLLIN); p.poll(10_000)' & pid=$!; sleep 0.3; ps -o pid,stat,comm -p $pid) && fg &>/dev/null
        PID STAT COMMAND
     491748 S+   python3
    

    CPython contribution

    After landing the psutil implementation (PR-2706), I took the extra step and submitted a matching pull request for CPython's subprocess module: cpython/PR-144047.

    I'm especially proud of this one: this is the third time in psutil's 17+ year history that a feature developed in psutil made its way upstream into the Python standard library.

    • The first was back in 2010, when Process.nice() inspired os.getpriority() and os.setpriority(), see BPO-10784. Landed in Python 3.3.
    • The second was back in 2011, when psutil.disk_usage() inspired shutil.disk_usage(), see python-ideas ML proposal. Landed in Python 3.3.

    Funny thing: 15 years ago, Python 3.3 added the timeout parameter to subprocess.Popen.wait (see commit). That's probably where I took inspiration when I first added the timeout parameter to psutil's Process.wait() around the same time (see commit). Now, 15 years later, I'm contributing back a similar improvement for that very same timeout parameter. The circle is complete.

  2. Speeding up pytest startup

    Preface: the migration to pytest

    Last year, after 17 years of unittest, I started adopting pytest in psutil (see #2446). The two advantages I cared about most were plain assert statements and pytest-xdist's free parallelism. Tests still inherit from unittest.TestCase and there's no conftest.py or fixture use (rationale in PR-2456).

    What I want to focus on here is one of pytest's most frustrating aspects: slow startup.

    pytest invocation is slow

    To measure pytest's startup time, let's run a very simple test where execution time itself is negligible:

    $ time python3 -m pytest psutil/tests/test_misc.py::TestMisc::test_version
    1 passed in 0.05s
    real    0m0,427s
    

    Almost half a second, which is excessive for something I run repeatedly during development. For comparison, the same test under unittest:

    $ time python3 -m unittest psutil.tests.test_misc.TestMisc.test_version
    Ran 1 test in 0.000s
    real    0m0,204s
    

    Roughly twice as fast. Why?

    Where is time being spent?

    A significant portion of pytest's overhead comes from import time, and there's not much one can do about it:

    $ time python3 -c "import pytest"
    real    0m0,151s
    
    $ time python3 -c "import unittest"
    real    0m0,065s
    
    $ time python3 -c "import psutil"
    real    0m0,056s
    

    Disable plugin auto-loading

    After some research, I discovered that pytest automatically loads all plugins installed on the system, even if they aren't used. Here's how to list them (output is cut):

    $ pytest --trace-config --collect-only
    ...
    active plugins:
        ...
        setupplan           : ~/.local/lib/python3.12/site-packages/_pytest/setupplan.py
        stepwise            : ~/.local/lib/python3.12/site-packages/_pytest/stepwise.py
        warnings            : ~/.local/lib/python3.12/site-packages/_pytest/warnings.py
        logging             : ~/.local/lib/python3.12/site-packages/_pytest/logging.py
        reports             : ~/.local/lib/python3.12/site-packages/_pytest/reports.py
        python_path         : ~/.local/lib/python3.12/site-packages/_pytest/python_path.py
        unraisableexception : ~/.local/lib/python3.12/site-packages/_pytest/unraisableexception.py
        threadexception     : ~/.local/lib/python3.12/site-packages/_pytest/threadexception.py
        faulthandler        : ~/.local/lib/python3.12/site-packages/_pytest/faulthandler.py
        instafail           : ~/.local/lib/python3.12/site-packages/pytest_instafail.py
        anyio               : ~/.local/lib/python3.12/site-packages/anyio/pytest_plugin.py
        pytest_cov          : ~/.local/lib/python3.12/site-packages/pytest_cov/plugin.py
        subtests            : ~/.local/lib/python3.12/site-packages/pytest_subtests/plugin.py
        xdist               : ~/.local/lib/python3.12/site-packages/xdist/plugin.py
        xdist.looponfail    : ~/.local/lib/python3.12/site-packages/xdist/looponfail.py
        ...
    

    It turns out the PYTEST_DISABLE_PLUGIN_AUTOLOAD environment variable can be used to disable them. By running PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest --trace-config --collect-only again, I can see that the following plugins disappeared:

    anyio
    pytest_cov
    pytest_instafail
    pytest_subtests
    xdist
    xdist.looponfail
    

    Now let's run the test again with PYTEST_DISABLE_PLUGIN_AUTOLOAD:

    $ time PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 python3 -m pytest psutil/tests/test_misc.py::TestMisc::test_version
    1 passed in 0.05s
    real    0m0,285s
    

    We went from 0.427s to 0.285s, a ~33% improvement. Not bad. We now need to selectively enable only the plugins we actually use, via the -p option. psutil uses pytest-instafail and pytest-subtests (we'll deal with pytest-xdist later):

    $ time PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 python3 -m pytest -p instafail -p subtests ...
    real    0m0,320s
    

    Time went back up to 0.320s. Quite a slowdown, but still better than the original 0.427s. Adding pytest-xdist:

    $ time PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 python3 -m pytest -p instafail -p subtests -p xdist ...
    real    0m0,369s
    

    0.369s. Not much, but still a pity to pay the price when NOT running tests in parallel.

    Handling pytest-xdist

    If we disable pytest-xdist, psutil tests still run, but we get a warning:

    psutil/tests/test_testutils.py:367
      ~/svn/psutil/psutil/tests/test_testutils.py:367: PytestUnknownMarkWarning: Unknown pytest.mark.xdist_group - is this a typo?  You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html
        @pytest.mark.xdist_group(name="serial")
    

    This warning appears for tests that are intended to run serially, those decorated with @pytest.mark.xdist_group(name="serial"). However, since pytest-xdist is now disabled, the mark is no longer registered. To address this, I implemented the following workaround in psutil/tests/__init__.py:

    import functools, os, pytest

    PYTEST_PARALLEL = "PYTEST_XDIST_WORKER" in os.environ  # True if running parallel tests

    if not PYTEST_PARALLEL:
        def fake_xdist_group(*_args, **_kwargs):
            """Mimics the `@pytest.mark.xdist_group` decorator. No-op: it
            just calls the test method, or returns the decorated class."""
            def wrapper(obj):
                @functools.wraps(obj)
                def inner(*args, **kwargs):
                    return obj(*args, **kwargs)

                return obj if isinstance(obj, type) else inner

            return wrapper

        pytest.mark.xdist_group = fake_xdist_group  # monkey patch
    

    With this in place the warning disappears when running tests serially. To run tests in parallel, we'll manually enable xdist:

    $ python3 -m pytest -p xdist -n auto --dist loadgroup
    

    Optimizing test collection time

    By default, pytest searches the entire project directory for tests, adding unnecessary collection overhead. In pyproject.toml you can tell pytest where test files are located, and to consider only test_*.py files:

    [tool.pytest.ini_options]
    testpaths = ["psutil/tests/"]
    python_files = ["test_*.py"]
    

    Collection time dropped from 0.20s to 0.17s, another ~0.03s shaved off.

    Putting it all together

    With these small optimizations, I managed to reduce pytest startup time by ~0.12 seconds, bringing it down from 0.42 seconds. While this improvement is insignificant for full test runs, it makes a noticeable difference (~28% faster) when repeatedly running individual tests from the command line, which is something I do frequently during development. Final result is visible in PR-2538.

  3. Improved process_iter()

    This is part of the psutil 5.3.0 release (see the changelog for the full list of changes).

    The old pattern

    Iterating over processes and collecting attributes requires more boilerplate than it should. A process returned by psutil.process_iter() may disappear before you access it, or require elevated privileges, so every lookup has to be guarded with a try / except:

    >>> import psutil
    >>> for proc in psutil.process_iter():
    ...     try:
    ...         pinfo = proc.as_dict(attrs=['pid', 'name'])
    ...     except (psutil.NoSuchProcess, psutil.AccessDenied):
    ...         pass
    ...     else:
    ...         print(pinfo)
    ...
    {'pid': 1, 'name': 'systemd'}
    {'pid': 2, 'name': 'kthreadd'}
    {'pid': 3, 'name': 'ksoftirqd/0'}
    

    This is not decorative: it's necessary to handle the race condition of a process disappearing mid-iteration.

    The new pattern

    5.3.0 adds attrs and ad_value parameters to psutil.process_iter(). With these, the loop body becomes:

    >>> import psutil
    >>> for proc in psutil.process_iter(attrs=['pid', 'name']):
    ...     print(proc.info)
    ...
    {'pid': 1, 'name': 'systemd'}
    {'pid': 2, 'name': 'kthreadd'}
    {'pid': 3, 'name': 'ksoftirqd/0'}
    

    Internally, process_iter() attaches an info dict to the Process instance. The attributes are pre-fetched in one shot. Processes that disappear during iteration are silently skipped, and attributes that would raise AccessDenied get assigned ad_value, which defaults to None:

    import psutil

    for p in psutil.process_iter(['name', 'username'], ad_value="N/A"):
        print(p.info['name'], p.info['username'])
    

    Performance

    Beyond the syntactic win, the new API is also faster than calling individual methods in a loop. process_iter(attrs=[...]) is equivalent to using Process.oneshot() on each process (see Making psutil twice as fast for how that works): attributes that share a syscall or a /proc file are fetched together instead of being re-read on every method call, which is a lot faster.
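
    Conceptually, it behaves roughly like this (a simplified sketch, not psutil's actual implementation):

    import psutil

    def process_iter_with_attrs(attrs, ad_value=None):
        for proc in psutil.process_iter():
            try:
                # as_dict() fetches all attributes within a single oneshot()
                proc.info = proc.as_dict(attrs=attrs, ad_value=ad_value)
            except psutil.NoSuchProcess:
                continue  # process disappeared mid-iteration: skip it
            yield proc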

    Comprehensions

    With the exception boilerplate out of the way, comprehensions finally work cleanly. E.g. getting processes owned by the current user can be written as:

    >>> import psutil, getpass
    >>> from pprint import pprint as pp
    >>> pp([(p.pid, p.info['name'])
    ...     for p in psutil.process_iter(attrs=['name', 'username'])
    ...     if p.info['username'] == getpass.getuser()])
    [(16832, 'bash'),
     (19772, 'ssh'),
     (20492, 'python')]
    
  4. Making psutil twice as fast

    Starting from psutil 5.0.0 you can query multiple Process fields around twice as fast as before (see #799 and Process.oneshot() doc). It took 7 months, 108 commits, and a massive refactoring of psutil internals (PR-937), and I think it's one of the best improvements ever shipped in a psutil release.

    The problem

    How process information is retrieved varies by OS. Sometimes it means reading a file in /proc (Linux), other times calling C (Windows, BSD, macOS, SunOS), but it's always done differently. Psutil abstracts this away: you call Process.name() without worrying about what happens under the hood or which OS you're on.

    Internally, multiple pieces of process info (e.g. Process.name(), Process.ppid(), Process.uids(), Process.create_time()) are fetched by the same syscall. On Linux we read /proc/PID/stat to get the process name, terminal, CPU times, creation time, status and parent PID, but only one value is returned: the others are discarded. As a consequence, this code reads /proc/PID/stat 6 times:

    >>> import psutil
    >>> p = psutil.Process()
    >>> p.name()
    >>> p.cpu_times()
    >>> p.create_time()
    >>> p.ppid()
    >>> p.status()
    >>> p.terminal()
    

    On BSD most process metrics can be fetched with a single sysctl(), yet psutil was invoking it for each process method (e.g. see here and here).

    Do it in one shot

    It's clear that this approach is inefficient, especially in tools like top or htop, where process info is continuously fetched in a loop. psutil 5.0.0 introduces a new Process.oneshot() context manager. Inside it, the internal routine runs once (in the example, on the first Process.name() call) and the other values are cached. Subsequent calls sharing the same internal routine (read /proc/PID/stat, call sysctl() or whatever) return the cached value. The code above can now be rewritten like this, and on Linux it runs 2.4 times faster:

    >>> import psutil
    >>> p = psutil.Process()
    >>> with p.oneshot():
    ...     p.name()
    ...     p.cpu_times()
    ...     p.create_time()
    ...     p.ppid()
    ...     p.status()
    ...     p.terminal()
    

    Implementation

    One great thing about psutil's design is its abstraction. It is divided into 3 "layers". Layer 1 is represented by the main Process class (Python), which exposes the high-level API. Layer 2 is the OS-specific Python module, which is a thin wrapper on top of the OS-specific C extension module (layer 3).
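
    Schematically, the layering looks like this (an illustrative sketch; the real class and module names in psutil differ):

    import os

    # layer 3: normally a C extension (e.g. _psutil_linux); faked here in Python
    class cext:
        @staticmethod
        def proc_name(pid):
            with open(f"/proc/{pid}/stat", "rb") as f:  # Linux-only
                return f.read().split(b"(")[1].split(b")")[0].decode()

    # layer 2: the OS-specific Python module (e.g. psutil/_pslinux.py)
    class LinuxProcess:
        def __init__(self, pid):
            self.pid = pid

        def name(self):
            return cext.proc_name(self.pid)  # thin wrapper around the C layer

    # layer 1: the main Process class (psutil/__init__.py), high-level API
    class Process:
        def __init__(self, pid):
            self._proc = LinuxProcess(pid)  # OS-specific implementation

        def name(self):
            return self._proc.name()  # users never see layers 2 and 3

    print(Process(os.getpid()).name())  # e.g. "python3"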

    Because the code was organized this way (modular), the refactoring was reasonably smooth. I first refactored those C functions that collect multiple pieces of info and grouped them into a single function (e.g. see BSD implementation). Then I wrote a decorator that enables the cache only when requested (when entering the context manager), and decorated the "grouped functions" with it. The caching mechanism is controlled by the Process.oneshot() context manager, which is the only thing exposed to the end user. Here's the decorator:

    import functools

    def memoize_when_activated(fun):
        """A memoize decorator which is disabled by default. It can be
        activated and deactivated on request.
        """
        @functools.wraps(fun)
        def wrapper(self):
            if not wrapper.cache_activated:
                return fun(self)
            else:
                try:
                    ret = cache[fun]
                except KeyError:
                    ret = cache[fun] = fun(self)
                return ret
    
        def cache_activate():
            """Activate cache."""
            wrapper.cache_activated = True
    
        def cache_deactivate():
            """Deactivate and clear cache."""
            wrapper.cache_activated = False
            cache.clear()
    
        cache = {}
        wrapper.cache_activated = False
        wrapper.cache_activate = cache_activate
        wrapper.cache_deactivate = cache_deactivate
        return wrapper
    
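    To make the mechanics concrete, here's a toy example of how a context manager can drive the decorator (illustrative only; psutil's real oneshot() activates several grouped functions at once):

    import contextlib

    class DemoProcess:
        @memoize_when_activated
        def _fetch_info(self):
            print("expensive fetch")  # executed only once inside oneshot()
            return {"name": "python", "ppid": 1}

        def name(self):
            return self._fetch_info()["name"]

        def ppid(self):
            return self._fetch_info()["ppid"]

        @contextlib.contextmanager
        def oneshot(self):
            self._fetch_info.cache_activate()
            try:
                yield
            finally:
                self._fetch_info.cache_deactivate()

    p = DemoProcess()
    with p.oneshot():
        p.name(), p.ppid()  # "expensive fetch" is printed only once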

    To measure the speedup I wrote a benchmark script (well, two actually), and kept tuning until I was sure the change actually made psutil faster. The scripts report the speedup for calling all the "grouped" methods together (best-case scenario).
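
    The gist of those scripts is something like this (a minimal sketch, not the actual benchmark code):

    import time
    import psutil

    def bench(use_oneshot, iterations=1000):
        p = psutil.Process()

        def query():
            p.name(); p.cpu_times(); p.create_time()
            p.ppid(); p.status(); p.terminal()

        start = time.perf_counter()
        for _ in range(iterations):
            if use_oneshot:
                with p.oneshot():
                    query()
            else:
                query()
        return time.perf_counter() - start

    print(bench(False) / bench(True))  # speedup factor: roughly 2.5x on Linux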

    Linux: +2.56x speedup

    The Linux implementation is mostly Python, reading files in /proc. These files typically expose multiple pieces of info per process; /proc/PID/stat and /proc/PID/status are the perfect example. We aggregate them into three groups. See the relevant code here.

    Windows: from +1.9x to +6.5x speedup

    Windows is an interesting one. For a process owned by our user, we group only Process.num_threads(), Process.num_ctx_switches() and Process.num_handles(), for a +1.9x speedup if we access those methods in one shot.

    Windows is special though, because certain methods have a dual implementation (#304): a "fast method" is tried first, but if the process is owned by another user it fails with AccessDenied. psutil then falls back to a second, "slower" method (see here for example).

    It's slower because it iterates over all PIDs, but unlike the "plain" Windows APIs it can still retrieve multiple pieces of information in one shot: number of threads, context switches, handles, CPU times, create time, and I/O counters.

    That's why querying processes owned by other users results in an impressive +6.5x speedup.

    macOS: +1.92x speedup

    On macOS we can get 2 groups of information. With sysctl() we get process parent PID, uids, gids, terminal, create time, name. With proc_info() we get CPU times (for PIDs owned by another user), memory metrics and ctx switches. Not bad.

    BSD: +2.18x speedup

    On BSD we gather tons of process info just by calling sysctl() (see implementation): process name, ppid, status, uids, gids, IO counters, CPU and create times, terminal and ctx switches.

    SunOS: +1.37x speedup

    SunOS is like Linux (it reads files in /proc), but the code is in C. Here too, we group different metrics together (see here and here).
