Blog posts for tags/python

From Python 3.3 to today: ending 15 years of subprocess polling
Featured 28 Jan 2026 Tags: psutil, python, python-core, performance

One of the less fun aspects of process management on POSIX systems is waiting for a process to terminate. The standard library's subprocess module has relied on a busy-loop polling approach since the timeout parameter was added to subprocess.Popen.wait() in Python 3.3, around 15 years ago (see source). And psutil's Process.wait() method uses exactly the same technique (see source).

The logic is straightforward: check whether the process has exited using non-blocking waitpid(WNOHANG), sleep briefly, check again, sleep a bit longer, and so on.
```
import os, time

def wait_busy(pid, timeout):
    end = time.monotonic() + timeout
    interval = 0.0001
    while time.monotonic() < end:
        pid_done, _ = os.waitpid(pid, os.WNOHANG)
        if pid_done:
            return
        time.sleep(interval)
        interval = min(interval * 2, 0.04)
    raise TimeoutError
```
In this blog post I'll show how I finally addressed this long-standing inefficiency, first in psutil, and most excitingly, directly in CPython's standard library subprocess module.
The problem with busy-polling¶
- CPU wake-ups: even with exponential backoff (starting at 0.1ms, capping at 40ms), the system constantly wakes up to check process status, wasting CPU cycles and draining batteries.
- Latency: there's always a gap between when a process actually terminates and when you detect it.
- Scalability: monitoring many processes simultaneously magnifies all of the above.
Event-driven waiting¶

All POSIX systems provide at least one mechanism to be notified when a file descriptor becomes ready. These are select(), poll(), epoll() (Linux) and kqueue() (BSD / macOS) system calls. Until recently, I believed they could only be used with file descriptors referencing sockets, pipes, etc., but it turns out they can also be used to wait for events on process PIDs!
Linux¶

In 2019, Linux 5.3 introduced a new syscall, pidfd_open(), which Python 3.9 exposed as os.pidfd_open(). It returns a file descriptor referencing a process PID. The interesting thing is that this file descriptor can be used in conjunction with select(), poll() or epoll() to effectively wait until the process exits. E.g. by using poll():
```
import os, select

def wait_pidfd(pid, timeout):
    pidfd = os.pidfd_open(pid)
    try:
        poller = select.poll()
        poller.register(pidfd, select.POLLIN)
        # block until process exits or timeout occurs
        events = poller.poll(timeout * 1000)
    finally:
        os.close(pidfd)  # don't leak the descriptor
    if events:
        return
    raise TimeoutError
```
This approach has zero busy-looping. The kernel wakes us up exactly when the process terminates or when the timeout expires if the PID is still alive.

I chose poll() over select() because select() has a historical file descriptor limit (FD_SETSIZE), which typically caps it at 1024 file descriptors per-process (reminded me of BPO-1685000).

I chose poll() over epoll() because it does not require creating an additional file descriptor. It also needs only a single syscall, which should make it a bit more efficient when monitoring a single FD rather than many.
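Because a pidfd is an ordinary file descriptor, the same poll() call can in principle watch several processes at once. The following is a minimal sketch of that idea, not psutil's or CPython's implementation: `wait_any` is a hypothetical helper name, and it assumes Linux >= 5.3 and Python >= 3.9.

```python
import os
import select


def wait_any(pids, timeout):
    """Return the set of PIDs (from `pids`) that exited within `timeout` seconds.

    Hypothetical helper for illustration only. Requires os.pidfd_open()
    (Linux >= 5.3, Python >= 3.9); any OSError is propagated to the caller.
    """
    fd_to_pid = {}
    poller = select.poll()
    try:
        for pid in pids:
            fd = os.pidfd_open(pid)
            fd_to_pid[fd] = pid
            poller.register(fd, select.POLLIN)
        # A single blocking syscall covers every registered process.
        ready = poller.poll(int(timeout * 1000))
        return {fd_to_pid[fd] for fd, _event in ready}
    finally:
        # Close every pidfd we managed to open, even on error.
        for fd in fd_to_pid:
            os.close(fd)
```

An empty set means the timeout expired with all processes still alive. With a large number of PIDs, epoll() may become the better choice, since the trade-off described above changes once many descriptors are involved.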
macOS and BSD¶

BSD-derived systems (including macOS) provide the kqueue() syscall. It's conceptually similar to select(), poll() and epoll(), but more powerful (e.g. it can also handle regular files). kqueue() can be passed a PID directly, and it will return once the PID disappears or the timeout expires:
```
import select

def wait_kqueue(pid, timeout):
    kq = select.kqueue()
    try:
        kev = select.kevent(
            pid,
            filter=select.KQ_FILTER_PROC,
            flags=select.KQ_EV_ADD | select.KQ_EV_ONESHOT,
            fflags=select.KQ_NOTE_EXIT,
        )
        # block until process exits or timeout occurs
        events = kq.control([kev], 1, timeout)
    finally:
        kq.close()  # don't leak the kqueue descriptor
    if events:
        return
    raise TimeoutError
```
Windows¶

On Windows, neither psutil nor the subprocess module busy-loops, thanks to WaitForSingleObject. This means Windows has effectively had event-driven process waiting from the start, so there is nothing to do on that front.

Graceful fallbacks¶

Both pidfd_open() and kqueue() can fail for different reasons. For example, with EMFILE if the process runs out of file descriptors (usually 1024), or with EACCES / EPERM if the syscall was explicitly blocked at the system level by the sysadmin (e.g. via SECCOMP). In all cases, psutil silently falls back to the traditional busy-loop polling approach rather than raising an exception.

This fast-path-with-fallback approach is similar in spirit to BPO-33671, where I sped up shutil.copyfile() by using zero-copy system calls back in 2018. There, the more efficient os.sendfile() is attempted first, and if it fails (e.g. on network filesystems) we fall back to the traditional read() / write() approach to copy regular files.
Measurement¶

As an experiment, here's a simple program which waits on itself for 10 seconds without terminating:
```
# test.py
import psutil, os

try:
    psutil.Process(os.getpid()).wait(timeout=10)
except psutil.TimeoutExpired:
    pass
```
We can measure the CPU context switching using /usr/bin/time -v. Before the patch (the busy-loop):
```
$ /usr/bin/time -v python3 test.py 2>&1 | grep context
    Voluntary context switches: 258
    Involuntary context switches: 4
```
After the patch (the event-driven approach):
```
$ /usr/bin/time -v python3 test.py 2>&1 | grep context
    Voluntary context switches: 2
    Involuntary context switches: 1
```
This shows that instead of spinning in userspace, the process blocks in poll() / kqueue(), and is woken up only when the kernel notifies it, resulting in just a few CPU context switches.
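As a rough cross-check, the number of sleeps performed by the busy loop can be estimated from the backoff schedule alone. This sketch (the `count_sleeps` name is illustrative) replays the intervals used by `wait_busy` shown earlier without actually sleeping, and ignores the time spent in waitpid() itself:

```python
def count_sleeps(timeout, first=0.0001, cap=0.04):
    """Count how many times a backoff loop sleeps before `timeout` elapses."""
    elapsed = 0.0
    interval = first
    sleeps = 0
    while elapsed < timeout:
        elapsed += interval  # pretend we slept for `interval` seconds
        sleeps += 1
        interval = min(interval * 2, cap)
    return sleeps


print(count_sleeps(10))
```

For a 10 second timeout this gives 258: 9 sleeps while the interval doubles up to the 40ms cap, then 249 sleeps of 40ms each. That is in line with the voluntary context switches reported for the busy-loop run.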
Sleeping state¶

It's also interesting to note that waiting via poll() (or kqueue()) puts the process into the exact same sleeping state as a plain time.sleep call. From the kernel's perspective, both are interruptible sleeps: the process is de-scheduled, consumes zero CPU, and sits quietly in kernel space.

The "S+" state shown below by ps means that the process "sleeps in foreground".
- time.sleep:
```
$ (python3 -c 'import time; time.sleep(10)' & pid=$!; sleep 0.3; ps -o pid,stat,comm -p $pid) && fg &>/dev/null
    PID STAT COMMAND
 491573 S+   python3
```
- select.poll:
```
$ (python3 -c 'import os,select; fd = os.pidfd_open(os.getpid(),0); p = select.poll(); p.register(fd,select.POLLIN); p.poll(10_000)' & pid=$!; sleep 0.3; ps -o pid,stat,comm -p $pid) && fg &>/dev/null
    PID STAT COMMAND
 491748 S+   python3
```
CPython contribution¶

After landing the psutil implementation (PR-2706), I took the extra step and submitted a matching pull request for CPython's subprocess module: cpython/PR-144047.

I'm especially proud of this one: this is the third time in psutil's 17+ year history that a feature developed in psutil made its way upstream into the Python standard library.
- The first was back in 2010, when Process.nice() inspired os.getpriority() and os.setpriority(), see BPO-10784. Landed in Python 3.3.
- The second was back in 2011, when psutil.disk_usage() inspired shutil.disk_usage(), see python-ideas ML proposal. Landed in Python 3.3.
Funny thing: 15 years ago, Python 3.3 added the timeout parameter to subprocess.Popen.wait (see commit). That's probably where I took inspiration when I first added the timeout parameter to psutil's Process.wait() around the same time (see commit). Now, 15 years later, I'm contributing back a similar improvement for that very same timeout parameter. The circle is complete.
Links¶

Topics related to this:
- #2712: proposal to extend this to multiple PIDs (psutil.wait_procs()).
- #2703: proposal for asynchronous Process.wait() integration with asyncio.
- cpython/#144211: proposal to extend the selectors module to enable asyncio optimization on BSD / macOS via kqueue().
Discussion¶
- Reddit
- Hacker News
- Medium
- Linkedin

Detecting memory leaks in C extensions with psutil and psleak

Memory leaks in Python are usually straightforward to diagnose. Just look at RSS, track Python object counts, follow reference graphs, etc. But leaks inside C extension modules are another story. Traditional memory metrics such as RSS and VMS fail to reveal them because Python's memory allocator (pymalloc) sits above the platform's native heap. If something in an extension calls malloc() without a corresponding free(), that memory often won't show up in RSS / VMS. You have a leak, and you don't know.

psutil 7.2.0 introduces two new APIs for C heap introspection, designed specifically to catch these kinds of native leaks. They give you a window directly into the underlying platform allocator (e.g. glibc's malloc), letting you track how much memory the C layer actually allocates. If your RSS is flat but your C heap usage climbs, you now have a way to see it.

Why native heap introspection matters¶

Many Python projects rely on C extensions: psutil, NumPy, pandas, PIL, lxml, psycopg, PyTorch, custom in-house modules, etc. And even CPython itself, which implements many of its standard library modules in C. If any of these components mishandle memory at the C level, you get a leak that doesn't show up in:

- Python reference counts (sys.getrefcount).
- tracemalloc module.
- Python's gc stats.
- RSS, VMS or USS due to allocator caching, especially for small objects. This can happen, for example, when you forget to Py_DECREF a Python object.

psutil's new functions let you query the allocator (e.g. glibc) directly, returning low-level metrics from the platform's native heap.

heap_info(): direct allocator statistics¶

psutil.heap_info() exposes the following metrics:

- heap_used: total number of bytes currently allocated via malloc() (small allocations).
- mmap_used: total number of bytes currently allocated via mmap() or via large malloc() allocations.
- heap_count: (Windows only) number of private heaps created via HeapCreate().

Example:

>>> import psutil
>>> psutil.heap_info()
pheap(heap_used=5177792, mmap_used=819200)

Reference for what contributes to each field:

| Platform | Allocation type | Field affected |
|---|---|---|
| UNIX / Windows | small `malloc()` ≤128 KB without `free()` | `heap_used` |
| UNIX / Windows | large `malloc()` >128 KB without `free()`, or `mmap()` without `munmap()` (UNIX) | `mmap_used` |
| Windows | `HeapAlloc()` without `HeapFree()` | `heap_used` |
| Windows | `VirtualAlloc()` without `VirtualFree()` | `mmap_used` |
| Windows | `HeapCreate()` without `HeapDestroy()` | `heap_count` |

heap_trim(): returning unused heap memory¶

psutil.heap_trim() provides a cross-platform way to request that the underlying allocator free any unused memory it's holding in the heap (typically small malloc() allocations).

In practice, modern allocators rarely comply, so this is not a general-purpose memory-reduction tool and won't meaningfully shrink RSS in real programs. Its primary value is in leak detection tools. Calling psutil.heap_trim() before taking measurements helps reduce allocator noise, giving you a cleaner baseline so that changes in heap_used come from the code you're testing, not from internal allocator caching or fragmentation.

Real-world use: finding a C extension leak¶

The workflow is simple:

1. Take a baseline snapshot of the heap.
2. Call the C extension hundreds of times.
3. Take another snapshot.
4. Compare.

import psutil

psutil.heap_trim()  # reduce noise

before = psutil.heap_info()
for _ in range(200):
    my_cext_function()
after = psutil.heap_info()

print("delta heap_used =", after.heap_used - before.heap_used)
print("delta mmap_used =", after.mmap_used - before.mmap_used)

If heap_used or mmap_used values increase consistently, you've found a native leak.

To reduce false positives, repeat the test multiple times, increasing the number of calls on each retry. This approach helps distinguish real leaks from random noise or transient allocations.
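The retry strategy can be sketched as a small loop. This is only an illustration of the idea, not psleak's actual algorithm: `grows_on_every_run` and its parameters are hypothetical names, and the memory reading is passed in as a callable so the example runs without any C extension.

```python
def grows_on_every_run(fun, measure, calls=200, retries=5, step=100):
    """Return True if `measure()` increased after each of `retries` runs.

    `fun` is the function under test; `measure` returns the current
    memory figure in bytes.
    """
    for _ in range(retries):
        before = measure()
        for _ in range(calls):
            fun()
        if measure() - before <= 0:
            return False  # one flat run is enough to call it noise
        calls += step  # use more calls on each retry
    return True


# Stand-in for a leaking function: a list that only ever grows.
leaked = []
print(grows_on_every_run(lambda: leaked.append(0), lambda: len(leaked)))
print(grows_on_every_run(lambda: None, lambda: len(leaked)))
```

With psutil 7.2.0 or later, `measure` could be `lambda: psutil.heap_info().heap_used`, ideally after a call to `psutil.heap_trim()`.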

A new tool: psleak¶

The strategy described above is exactly what I implemented in a new PyPI package, which I called psleak. It runs the target function repeatedly, trims the allocator before each run, and tracks differences across retries. Memory that grows consistently after several runs is flagged as a leak.

A minimal test suite looks like this:

from psleak import MemoryLeakTestCase

class TestLeaks(MemoryLeakTestCase):
    def test_fun(self):
        self.execute(some_c_function)

If the function leaks memory, the test will fail with a descriptive exception:

psleak.MemoryLeakError: memory kept increasing after 10 runs
Run # 1: heap=+388160  | uss=+356352  | rss=+327680  | (calls= 200, avg/call=+1940)
Run # 2: heap=+584848  | uss=+614400  | rss=+491520  | (calls= 300, avg/call=+1949)
Run # 3: heap=+778320  | uss=+782336  | rss=+819200  | (calls= 400, avg/call=+1945)
Run # 4: heap=+970512  | uss=+1032192 | rss=+1146880 | (calls= 500, avg/call=+1941)
Run # 5: heap=+1169024 | uss=+1171456 | rss=+1146880 | (calls= 600, avg/call=+1948)
Run # 6: heap=+1357360 | uss=+1413120 | rss=+1310720 | (calls= 700, avg/call=+1939)
Run # 7: heap=+1552336 | uss=+1634304 | rss=+1638400 | (calls= 800, avg/call=+1940)
Run # 8: heap=+1752032 | uss=+1781760 | rss=+1802240 | (calls= 900, avg/call=+1946)
Run # 9: heap=+1945056 | uss=+2031616 | rss=+2129920 | (calls=1000, avg/call=+1945)
Run #10: heap=+2140624 | uss=+2179072 | rss=+2293760 | (calls=1100, avg/call=+1946)

psleak is now part of the psutil test suite. All psutil APIs are tested (see test_memleaks.py), making it a de facto regression-testing tool.

It's worth noting that without inspecting heap metrics, missing calls in the C code such as Py_CLEAR and Py_DECREF often go unnoticed, because they don't affect RSS, VMS, and USS. I confirmed this by commenting them out. Monitoring the heap is therefore essential to reliably detect memory leaks in Python C extensions.

Under the hood¶

For those interested in seeing how I did this in terms of code:

- Linux: uses glibc's mallinfo2() to report uordblks (heap allocations) and hblkhd (mmap-backed blocks).
- Windows: enumerates heaps and aggregates HeapAlloc / VirtualAlloc usage.
- macOS: uses malloc zone statistics.
- BSD: uses jemalloc's arena and stats interfaces.

References¶

- psleak, the new memory leak testing framework.
- PR-2692, the implementation.
- #1275, the original proposal from 8 years earlier.

Wheels for free-threaded Python now available
25 Oct 2025 Tags: psutil, python, wheels, community, python-core

With the release of psutil 7.1.2, wheels for free-threaded Python are now available. This milestone was achieved largely through a community effort, as several internal refactorings to the C code were required to make it possible (see #2565). Many of these changes were contributed by Lysandros Nikolaou. Thanks to him for the effort and for bearing with me in code reviews! ;-)

What is free-threaded Python?¶

Free-threaded Python (available since Python 3.13) refers to Python builds that are compiled with the GIL (Global Interpreter Lock) disabled, allowing true parallel execution of Python bytecodes across multiple threads. This is particularly beneficial for CPU-bound applications, as it enables better utilization of multi-core processors.
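To check which kind of interpreter you are running, something along these lines should work. It is a small sketch, assuming the `Py_GIL_DISABLED` build variable and the `sys._is_gil_enabled()` function that CPython provides since 3.13; the two helper names are made up for this example.

```python
import sys
import sysconfig


def is_free_threaded_build():
    """True if this interpreter was compiled with the GIL disabled."""
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


def gil_is_enabled():
    """True if the GIL is active right now (always True before Python 3.13)."""
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else check()


print("free-threaded build:", is_free_threaded_build())
print("GIL currently enabled:", gil_is_enabled())
```

The two answers can differ: a free-threaded build may still run with the GIL enabled at runtime, which is why both checks are shown.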

The state of free-threaded wheels¶

According to Hugo van Kemenade's free-threaded wheels tracker, the adoption of free-threaded wheels among the top 360 most-downloaded PyPI packages with C extensions is still limited. Only 128 out of these 360 packages provide wheels compiled for free-threaded Python, meaning they can run on Python builds with the GIL disabled. This shows that, while progress has been made, most popular packages with C extensions still do not offer ready-made wheels for free-threaded Python.

What it means for users¶

When a library author provides a wheel, users can install a pre-compiled binary package without having to build it from source. This is especially important for packages with C extensions, like psutil, which is largely written in C. Such packages often have complex build requirements and require installing a C compiler. On Windows, that means installing Visual Studio or the Build Tools, which can take several gigabytes and a significant setup effort. Providing wheels spares users from this hassle, makes installation far simpler, and is effectively essential for the users of that package. You basically pip install psutil and you're done.
What it means for library authors¶

Currently, universal wheels for free-threaded Python do not exist. Each wheel must be built specifically for a Python version, so right now authors must create separate wheels for Python 3.13 and 3.14. That already means distributing a lot of files:
```
psutil-7.1.2-cp313-cp313t-macosx_10_13_x86_64.whl
psutil-7.1.2-cp313-cp313t-macosx_11_0_arm64.whl
psutil-7.1.2-cp313-cp313t-manylinux2010_x86_64.manylinux_2_12_x86_64.manylinux_2_28_x86_64.whl
psutil-7.1.2-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl
psutil-7.1.2-cp313-cp313t-win_amd64.whl
psutil-7.1.2-cp313-cp313t-win_arm64.whl
psutil-7.1.2-cp314-cp314t-macosx_10_15_x86_64.whl
psutil-7.1.2-cp314-cp314t-macosx_11_0_arm64.whl
psutil-7.1.2-cp314-cp314t-manylinux2010_x86_64.manylinux_2_12_x86_64.manylinux_2_28_x86_64.whl
psutil-7.1.2-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl
psutil-7.1.2-cp314-cp314t-win_amd64.whl
psutil-7.1.2-cp314-cp314t-win_arm64.whl
```
This also multiplies CI jobs and slows down the test matrix (see build.yml). A true universal wheel would greatly reduce this overhead, allowing a single wheel to support multiple Python versions and platforms. Hopefully, Python 3.15 will simplify this process. Two competing proposals, PEP 803 and PEP 809, aim to standardize wheel naming and metadata to allow producing a single wheel that covers multiple Python versions. That would drastically reduce distribution complexity for library authors, and it's fair to say it's essential for free-threaded CPython to truly succeed.
How to install free-threaded psutil¶

You can now install psutil for free-threaded Python directly via pip:
pip install psutil --only-binary=:all:
This ensures you get the pre-compiled wheels without triggering a source build.
Discussion¶
- Reddit

Speeding up pytest startup

Preface: the migration to pytest¶

Last year, after 17 years of unittest, I started adopting pytest in psutil (see #2446). The two advantages I cared about most were plain assert statements and pytest-xdist's free parallelism. Tests still inherit from unittest.TestCase and there's no conftest.py or fixture use (rationale in PR-2456).

What I want to focus on here is one of pytest's most frustrating aspects: slow startup.

pytest invocation is slow¶

To measure pytest's startup time, let's run a very simple test where execution time itself is negligible:

$ time python3 -m pytest psutil/tests/test_misc.py::TestMisc::test_version
1 passed in 0.05s
real    0m0,427s

Almost half a second, which is excessive for something I run repeatedly during development. For comparison, the same test under unittest:

$ time python3 -m unittest psutil.tests.test_misc.TestMisc.test_version
Ran 1 test in 0.000s
real    0m0,204s

Roughly twice as fast. Why?

Where is time being spent?¶

A significant portion of pytest's overhead comes from import time, and there's not much one can do about it:

$ time python3 -c "import pytest"
real    0m0,151s

$ time python3 -c "import unittest"
real    0m0,065s

$ time python3 -c "import psutil"
real    0m0,056s

Disable plugin auto loading¶

After some research, I discovered that pytest automatically loads all plugins installed on the system, even if they aren't used. Here's how to list them (output is cut):

$ pytest --trace-config --collect-only
...
active plugins:
    ...
    setupplan           : ~/.local/lib/python3.12/site-packages/_pytest/setupplan.py
    stepwise            : ~/.local/lib/python3.12/site-packages/_pytest/stepwise.py
    warnings            : ~/.local/lib/python3.12/site-packages/_pytest/warnings.py
    logging             : ~/.local/lib/python3.12/site-packages/_pytest/logging.py
    reports             : ~/.local/lib/python3.12/site-packages/_pytest/reports.py
    python_path         : ~/.local/lib/python3.12/site-packages/_pytest/python_path.py
    unraisableexception : ~/.local/lib/python3.12/site-packages/_pytest/unraisableexception.py
    threadexception     : ~/.local/lib/python3.12/site-packages/_pytest/threadexception.py
    faulthandler        : ~/.local/lib/python3.12/site-packages/_pytest/faulthandler.py
    instafail           : ~/.local/lib/python3.12/site-packages/pytest_instafail.py
    anyio               : ~/.local/lib/python3.12/site-packages/anyio/pytest_plugin.py
    pytest_cov          : ~/.local/lib/python3.12/site-packages/pytest_cov/plugin.py
    subtests            : ~/.local/lib/python3.12/site-packages/pytest_subtests/plugin.py
    xdist               : ~/.local/lib/python3.12/site-packages/xdist/plugin.py
    xdist.looponfail    : ~/.local/lib/python3.12/site-packages/xdist/looponfail.py
    ...

It turns out PYTEST_DISABLE_PLUGIN_AUTOLOAD environment variable can be used to disable them. By running PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest --trace-config --collect-only again I can see that the following plugins disappeared:

anyio
pytest_cov
pytest_instafail
pytest_subtests
xdist
xdist.looponfail

Now let's run the test again with PYTEST_DISABLE_PLUGIN_AUTOLOAD:

$ time PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 python3 -m pytest psutil/tests/test_misc.py::TestMisc::test_version
1 passed in 0.05s
real    0m0,285s

We went from 0.427s to 0.285s, a ~33% reduction. Not bad. We now need to selectively enable only the plugins we actually use, via -p. psutil uses pytest-instafail and pytest-subtests (we'll deal with pytest-xdist later):

$ time PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 python3 -m pytest -p instafail -p subtests ...
real    0m0,320s

Time went back up to 0.320s. Quite a slowdown, but still better than the original 0.427s. Adding pytest-xdist:

$ time PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 python3 -m pytest -p instafail -p subtests -p xdist ...
real    0m0,369s

0.369s. Not much, but still a pity to pay the price when NOT running tests in parallel.

Handling pytest-xdist¶

If we disable pytest-xdist psutil tests still run, but we get a warning:

psutil/tests/test_testutils.py:367
  ~/svn/psutil/psutil/tests/test_testutils.py:367: PytestUnknownMarkWarning: Unknown pytest.mark.xdist_group - is this a typo?  You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html
    @pytest.mark.xdist_group(name="serial")

This warning appears for methods that are intended to run serially, those decorated with @pytest.mark.xdist_group(name="serial"). Since pytest-xdist is no longer loaded, the xdist_group mark is no longer registered, so pytest treats it as unknown. To address this, I implemented the following solution in psutil/tests/__init__.py:

import functools, os

import pytest

PYTEST_PARALLEL = "PYTEST_XDIST_WORKER" in os.environ  # True if running parallel tests

if not PYTEST_PARALLEL:
    def fake_xdist_group(*_args, **_kwargs):
        """Mimics `@pytest.mark.xdist_group` decorator. No-op: it just
        calls the test method or return the decorated class."""
        def wrapper(obj):
            @functools.wraps(obj)
            def inner(*args, **kwargs):
                return obj(*args, **kwargs)

            return obj if isinstance(obj, type) else inner

        return wrapper

    pytest.mark.xdist_group = fake_xdist_group  # monkey patch

With this in place the warning disappears when running tests serially. To run tests in parallel, we'll manually enable xdist:

$ python3 -m pytest -p xdist -n auto --dist loadgroup
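Typing the environment variable and the -p flags by hand gets tedious, so the invocation can be wrapped in a small helper. The sketch below is hypothetical (psutil drives this differently, and `pytest_command` is a made-up name); it only assembles the command line and environment described above.

```python
import os
import sys


def pytest_command(test_id, plugins=("instafail", "subtests"), parallel=False):
    """Return (argv, env) for a pytest run with plugin autoloading disabled."""
    env = dict(os.environ, PYTEST_DISABLE_PLUGIN_AUTOLOAD="1")
    argv = [sys.executable, "-m", "pytest"]
    for name in plugins:
        argv += ["-p", name]  # load only the plugins we actually use
    if parallel:
        # xdist is loaded explicitly, and only for parallel runs
        argv += ["-p", "xdist", "-n", "auto", "--dist", "loadgroup"]
    argv.append(test_id)
    return argv, env


argv, env = pytest_command("psutil/tests/test_misc.py::TestMisc::test_version")
print(" ".join(argv[1:]))
```

The result can be passed straight to `subprocess.run(argv, env=env)`.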

Optimizing test collection time¶

By default, pytest searches the entire directory for tests, adding unnecessary overhead. In pyproject.toml you can tell pytest where test files are located and to consider only test_*.py files:

[tool.pytest.ini_options]
testpaths = ["psutil/tests/"]
python_files = ["test_*.py"]

Collection time dropped from 0.20s to 0.17s, another ~0.03s shaved off.

Putting it all together¶

With these small optimizations, I managed to reduce pytest startup time by ~0.12 seconds, bringing it down from 0.42 to roughly 0.30 seconds. While this improvement is insignificant for full test runs, it makes a noticeable difference (~28% faster) when repeatedly running individual tests from the command line, which is something I do frequently during development. The final result is visible in PR-2538.
