Blog posts for tags/performance

  1. Speedup pytest startup

    Preface: the migration to pytest

    Last year, after 17 years since the inception of the project, I decided to start adopting pytest into psutil (see psutil/#2446). The advantages over unittest are numerous, but the two I cared about most are:

    • Being able to use base assert statements instead of unittest's self.assert*() APIs.
    • The excellent pytest-xdist extension, that lets you run tests in parallel, basically for free.

    Beyond that, I don't rely on any pytest-specific features in the code, like fixtures or conftest.py. I still organize tests in classes, with each one inheriting from unittest.TestCase. Why?

    • I like unittest's self.addCleanup too much to give it up (see some usages). I find it superior to fixtures. Less magical and more explicit.
    • I want users to be able to test their psutil installation in production environments where pytest might not be installed. To accommodate this, I created a minimal "fake" pytest class that emulates essential features like pytest.raises, @pytest.skip etc. (see PR-2456).

    But that's a separate topic. What I want to focus on here is one of pytest's most frustrating aspects: slow startup times.

    pytest invocation is slow

    To measure pytest's startup time, let's run a very simple test where execution time won't significantly affect the results:

    $ time python3 -m pytest --no-header psutil/tests/test_misc.py::TestMisc::test_version
    ============================= test session starts =============================
    collected 1 item
    psutil/tests/test_misc.py::TestMisc::test_version PASSED
    ============================== 1 passed in 0.05s ==============================
    
    real    0m0,427s
    user    0m0,375s
    sys     0m0,051s
    

    0,427s. Almost half of a second. That's excessive for something I frequently execute during development. For comparison, running the same test with unittest:

    $ time python3 -m unittest psutil.tests.test_misc.TestMisc.test_version
    ----------------------------------------------------------------------
    Ran 1 test in 0.000s
    OK
    
    real    0m0,204s
    user    0m0,169s
    sys     0m0,035s
    

    0,204 secs. Meaning unittest is roughly twice as fast as pytest. But why?

    Where is time being spent?

    A significant portion of pytest's overhead comes from import time:

    $ time python3 -c "import pytest"
    real    0m0,151s
    user    0m0,135s
    sys     0m0,016s
    
    $ time python3 -c "import unittest"
    real    0m0,065s
    user    0m0,055s
    sys     0m0,010s
    

    There's nothing I can do about that. For the record, psutil import timing is:

    $ time python3 -c "import psutil"
    real    0m0,056s
    user    0m0,050s
    sys     0m0,006s
    

    Disable plugin auto loading

    After some research, I discovered that pytest automatically loads all plugins installed on the system, even if they aren't used. Here's how to list them (output is cut):

    $ pytest --trace-config --collect-only
    ...
    active plugins:
        ...
        setupplan           : ~/.local/lib/python3.12/site-packages/_pytest/setupplan.py
        stepwise            : ~/.local/lib/python3.12/site-packages/_pytest/stepwise.py
        warnings            : ~/.local/lib/python3.12/site-packages/_pytest/warnings.py
        logging             : ~/.local/lib/python3.12/site-packages/_pytest/logging.py
        reports             : ~/.local/lib/python3.12/site-packages/_pytest/reports.py
        python_path         : ~/.local/lib/python3.12/site-packages/_pytest/python_path.py
        unraisableexception : ~/.local/lib/python3.12/site-packages/_pytest/unraisableexception.py
        threadexception     : ~/.local/lib/python3.12/site-packages/_pytest/threadexception.py
        faulthandler        : ~/.local/lib/python3.12/site-packages/_pytest/faulthandler.py
        instafail           : ~/.local/lib/python3.12/site-packages/pytest_instafail.py
        anyio               : ~/.local/lib/python3.12/site-packages/anyio/pytest_plugin.py
        pytest_cov          : ~/.local/lib/python3.12/site-packages/pytest_cov/plugin.py
        subtests            : ~/.local/lib/python3.12/site-packages/pytest_subtests/plugin.py
        xdist               : ~/.local/lib/python3.12/site-packages/xdist/plugin.py
        xdist.looponfail    : ~/.local/lib/python3.12/site-packages/xdist/looponfail.py
        ...
    

    It turns out PYTEST_DISABLE_PLUGIN_AUTOLOAD environment variable can be used to disable them. By running PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest --trace-config --collect-only again I can see that the following plugins disappeared:

    anyio
    pytest_cov
    pytest_instafail
    pytest_subtests
    xdist
    xdist.looponfail
    

    Now let's run the test again by using PYTEST_DISABLE_PLUGIN_AUTOLOAD:

    $ time PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 python3 -m pytest --no-header psutil/tests/test_misc.py::TestMisc::test_version
    ============================= test session starts =============================
    collected 1 item
    psutil/tests/test_misc.py::TestMisc::test_version PASSED
    ============================== 1 passed in 0.05s ==============================
    
    real    0m0,285s
    user    0m0,267s
    sys     0m0,040s
    

    We went from 0,427 secs to 0,285 secs, a ~40% improvement. Not bad. We now need to selectively enable only the plugins we actually use, via -p CLI option. Plugins used by psutil are pytest-instafail and pytest-subtests (we'll think about pytest-xdist later):

    $ time PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 python3 -m pytest -p instafail -p subtests --no-header psutil/tests/test_misc.py::TestMisc::test_version
    ========================================================= test session starts =========================================================
    collected 1 item
    psutil/tests/test_misc.py::TestMisc::test_version PASSED
    ========================================================== 1 passed in 0.05s ==========================================================
    real    0m0,320s
    user    0m0,283s
    sys     0m0,037s
    

    Time went up again, from 0,285 secs to 0,320s. Quite a slowdown, but still better than the initial 0,427s. Now, let's add pytest-xdist to the mix:

    $ time PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 python3 -m pytest -p instafail -p subtests -p xdist --no-header psutil/tests/test_misc.py::TestMisc::test_version
    ========================================================= test session starts =========================================================
    collected 1 item
    psutil/tests/test_misc.py::TestMisc::test_version PASSED
    ========================================================== 1 passed in 0.05s ==========================================================
    
    real    0m0,369s
    user    0m0,286s
    sys     0m0,049s
    

    We now went from 0,320s to 0,369s. Not too much, but still it's a pity to pay the price when NOT running tests in parallel.

    Handling pytest-xdist

    If we disable pytest-xdist psutil tests still run, but we get a warning:

    psutil/tests/test_testutils.py:367
      ~/svn/psutil/psutil/tests/test_testutils.py:367: PytestUnknownMarkWarning: Unknown pytest.mark.xdist_group - is this a typo?  You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html
        @pytest.mark.xdist_group(name="serial")
    

    This warning appears for methods that are intended to run serially, those decorated with @pytest.mark.xdist_group(name="serial"). However, since pytest-xdist is now disabled, the decorator no longer exists. To address this, I implemented the following solution in psutil/tests/__init__.py:

    import pytest, functools
    
    PYTEST_PARALLEL = "PYTEST_XDIST_WORKER" in os.environ  # True if running parallel tests
    
    if not PYTEST_PARALLEL:
        def fake_xdist_group(*_args, **_kwargs):
            """Mimics `@pytest.mark.xdist_group` decorator. No-op: it just
            calls the test method or return the decorated class."""
            def wrapper(obj):
                @functools.wraps(obj)
                def inner(*args, **kwargs):
                    return obj(*args, **kwargs)
    
                return obj if isinstance(obj, type) else inner
    
            return wrapper
    
        pytest.mark.xdist_group = fake_xdist_group  # monkey patch
    

    With this in place the warning disappears when running tests serially. To run tests in parallel, we'll manually enable xdist:

    $ python3 -m pytest -p xdist -n auto --dist loadgroup
    

    Disable some default plugins

    pytests also loads quite a bunch of plugins by default (see output of pytest --trace-config --collect-only). I tried to disable some of them with:

    pytest -p no:junitxml -p no:doctest -p no:nose -p no:pastebin
    

    ...but that didn't make much of a difference.

    Optimizing test collection time

    By default, pytest searches the entire directory for tests, adding unnecessary overhead. In pyproject.toml you can tell pytest where test files are located:

    [tool.pytest.ini_options]
    testpaths = ["psutil/tests/"]
    

    With this I saved another 0.03 seconds. Before:

    $ python3 -m pytest --collect-only
    ...
    ======================== 685 tests collected in 0.20s =========================
    

    After:

    $ python3 -m pytest --collect-only
    ...
    ======================== 685 tests collected in 0.17s =========================
    

    Putting it all together

    With these small optimizations, I managed to reduce pytest startup time by ~0.12 seconds, bringing it down from 0.42 seconds. While this improvement is insignificant for full test runs, it somewhat makes a noticeable difference (~28% faster) when repeatedly running individual tests from the command line, which is something I do frequently during development. Final result is visible in PR-2538.

    Other links which may be useful

  2. psutil 5.5.0 is twice as fast

    OK, this is a big one. Starting from psutil 5.0.0 you can query multiple Process information around twice as fast than with previous versions (see original ticket and updated doc). It took me 7 months, 108 commits and a massive refactoring of psutil internals (here is the big commit), and I can safely say this is one of the best improvements and long standing issues which have been addressed in a major psutil release. Here goes.

    The problem

    Except for some cases, the way different process information are retrieved varies depending on the OS. Sometimes it requires reading a file in /proc filesystem (Linux), some other times it requires using C (Windows, BSD, OSX, SunOS), but every time it's done differently. Psutil abstracts this complexity by providing a nice high-level interface so that you, say, call Process.name() without worrying about what happens behind the curtains or on what OS you're on.

    Internally, it is not rare that multiple process info (e.g. name(), ppid(), uids(), create_time()) may be fetched by using the same routine. For example, on Linux we read /proc/stat to get the process name, terminal, CPU times, creation time, status and parent PID, but only one value is returned and the others are discarded. On Linux the code below reads /proc/stat 6 times:

    >>> import psutil
    >>> p = psutil.Process()
    >>> p.name()
    >>> p.cpu_times()
    >>> p.create_time()
    >>> p.ppid()
    >>> p.status()
    >>> p.terminal()
    

    Another example is BSD. In order to get process name, memory, CPU times and other metrics, a single sysctl() call is necessary, but again, because of how psutil used to work so far that same sysctl() call is executed every time (see here, here, and so on), one information is returned (say name()) and the rest is discarded. Not anymore.

    Do it in one shot

    It appears clear how the approach described above is not efficient, also considering that applications similar to top, htop, ps or glances usually collect more than one info per-process. psutil 5.0.0 introduces a new oneshot() context manager. When used, the internal routine is executed once (in the example below on name()) and the other values are cached. The subsequent calls sharing the same internal routine (read /proc/stat, call sysctl() or whatever) will return the cached value. With psutil 5.0.0 the code above can be rewritten like this, and on Linux it will run 2.4 times faster:

    >>> import psutil
    >>> p = psutil.Process()
    >>> with p.oneshot():
    ...     p.name()
    ...     p.cpu_times()
    ...     p.create_time()
    ...     p.ppid()
    ...     p.status()
    ...     p.terminal()
    

    Implementation

    One great thing about psutil design is its abstraction. It is dived in 3 "layers". The first layer is represented by the main Process class (python), which is what dictates the end-user high-level API. The second layer is the OS-specific Python module which is thin wrapper on top of the OS-specific C extension module (third layer). Because this was organized this way (modularly) the refactoring was reasonably smooth. In order to do this I first refactored those C functions collecting multiple info and grouped them in a single function (e.g. see BSD implementation). Then I wrote a decorator which enables the cache only when requested (when entering the context manager) and decorated the "grouped functions" with with it. The whole thing is enabled on request by the highest-level oneshot() context manager, which is the only thing which is exposed to the end user. Here's the decorator:

    def memoize_when_activated(fun):
        """A memoize decorator which is disabled by default. It can be
        activated and deactivated on request.
        """
        @functools.wraps(fun)
        def wrapper(self):
            if not wrapper.cache_activated:
                return fun(self)
            else:
                try:
                    ret = cache[fun]
                except KeyError:
                    ret = cache[fun] = fun(self)
                return ret
    
        def cache_activate():
            """Activate cache."""
            wrapper.cache_activated = True
    
        def cache_deactivate():
            """Deactivate and clear cache."""
            wrapper.cache_activated = False
            cache.clear()
    
        cache = {}
        wrapper.cache_activated = False
        wrapper.cache_activate = cache_activate
        wrapper.cache_deactivate = cache_deactivate
        return wrapper
    

    In order to measure the various speedups I finally wrote a benchmark script (well, two actually) and kept tuning until I was sure the various changes made psutil actually faster. The benchmark scripts calculate the speedup you can get if you call all the "grouped" methods together (best case scenario).

    Linux: +2.56x speedup

    Linux process is the only pure-python implementation as (almost) all process info are gathered by reading files in the /proc filesystem. /proc files typically contain different information about the process and /proc/PID/stat and /proc/PID/status are the perfect examples. That's why on Linux we aggregate them in 3 groups. The relevant part of the Linux implementation can be seen here.

    Windows: from +1.9x to +6.5x speedup

    Windows is an interesting one. In normal circumstances, if we're querying a process owned by our user, we group together only process' num_threads(), num_ctx_switches() and num_handles(), getting a +1.9x speedup if we access those methods in one shot. Windows is particular though, because certain methods use a dual implementation: a "fast method" is attempted first, but if the process is owned by another user it fails with AccessDenied. In that case psutil falls back on using a second "slower" method (see here for example). The second method is slower because it iterates over all PIDs but differently than "plain" Windows APIs it can be used to get multiple info in one shot: num threads, context switches, handles, CPU times, create time and IO counters. That is why querying processes owned by other users results in an impressive +6.5 speedup.

    OSX: +1.92x speedup

    On OSX we can get 2 groups of information. With sysctl() syscall we get process parent PID, uids, gids, terminal, create time, name. With proc_info() syscall we get CPU times (for PIDs owned by another user) memory metrics and ctx switches. Not bad.

    BSD: +2.18x speedup

    BSD was an interesting one as we gather a tons of process info just by calling sysctl() (see implementation). In a single shot we get process name, ppid, status, uids, gids, IO counters, CPU and create times, terminal and ctx switches.

    SunOS: +1.37 speedup

    SunOS implementation is similar to Linux implementation in that it reads files in /proc filesystem but differently from Linux this is done in C. Also in this case, we can group different metrics together (see here and here).

Social

Feeds