Starting from psutil 5.0.0 you can query multiple Process fields around twice as fast as before (see #799 and Process.oneshot() doc). It took 7 months, 108 commits, and a massive refactoring of psutil internals (PR-937), and I think it's one of the best improvements ever shipped in a psutil release.
The problem¶
How process information is retrieved varies by OS. Sometimes it means reading a file in /proc (Linux), other times calling C (Windows, BSD, macOS, SunOS), but it's always done differently. Psutil abstracts this away: you call Process.name() without worrying about what happens under the hood or which OS you're on.
Internally, multiple pieces of process info (e.g. Process.name(), Process.ppid(), Process.uids(), Process.create_time()) are fetched by the same syscall. On Linux we read /proc/PID/stat to get the process name, terminal, CPU times, creation time, status and parent PID, but only one value is returned: the others are discarded. On Linux this code reads /proc/PID/stat 6 times:
>>> import psutil
>>> p = psutil.Process()
>>> p.name()
>>> p.cpu_times()
>>> p.create_time()
>>> p.ppid()
>>> p.status()
>>> p.terminal()
On BSD most process metrics can be fetched with a single sysctl(), yet psutil was invoking it for each process method (e.g. see here and here).
Do it in one shot¶
It's clear that this approach is inefficient, especially in tools like top or htop, where process info is continuously fetched in a loop. psutil 5.0.0 introduces a new Process.oneshot() context manager. Inside it, the internal routine runs once (in the example, on the first Process.name() call) and the other values are cached. Subsequent calls sharing the same internal routine (read /proc/PID/stat, call sysctl() or whatever) return the cached value. The code above can now be rewritten like this, and on Linux it runs 2.4 times faster:
>>> import psutil
>>> p = psutil.Process()
>>> with p.oneshot():
... p.name()
... p.cpu_times()
... p.create_time()
... p.ppid()
... p.status()
... p.terminal()
Implementation¶
One great thing about psutil's design is its abstraction. It is divided into 3 "layers". Layer 1 is represented by the main Process class (Python), which exposes the high-level API. Layer 2 is the OS-specific Python module, which is a thin wrapper on top of the OS-specific C extension module (layer 3).
Because the code was organized this way (modular), the refactoring was reasonably smooth. I first refactored those C functions that collect multiple pieces of info and grouped them into a single function (e.g. see BSD implementation). Then I wrote a decorator that enables the cache only when requested (when entering the context manager), and decorated the "grouped functions" with it. The caching mechanism is controlled by the Process.oneshot() context manager, which is the only thing exposed to the end user. Here's the decorator:
def memoize_when_activated(fun):
"""A memoize decorator which is disabled by default. It can be
activated and deactivated on request.
"""
@functools.wraps(fun)
def wrapper(self):
if not wrapper.cache_activated:
return fun(self)
else:
try:
ret = cache[fun]
except KeyError:
ret = cache[fun] = fun(self)
return ret
def cache_activate():
"""Activate cache."""
wrapper.cache_activated = True
def cache_deactivate():
"""Deactivate and clear cache."""
wrapper.cache_activated = False
cache.clear()
cache = {}
wrapper.cache_activated = False
wrapper.cache_activate = cache_activate
wrapper.cache_deactivate = cache_deactivate
return wrapper
To measure the speedup I wrote a benchmark script (well, two actually), and kept tuning until I was sure the change actually made psutil faster. The scripts report the speedup for calling all the "grouped" methods together (best-case scenario).
Linux: +2.56x speedup¶
The Linux implementation is mostly Python, reading files in /proc. These files typically expose multiple pieces of info per process; /proc/PID/stat and /proc/PID/status are the perfect example. We aggregate them into three groups. See the relevant code here.
Windows: from +1.9x to +6.5x speedup¶
Windows is an interesting one. For a process owned by our user, we group only Process.num_threads(), Process.num_ctx_switches() and Process.num_handles(), for a +1.9x speedup if we access those methods in one shot.
Windows is special though, because certain methods have a dual implementation (#304): a "fast method" is tried first, but if the process is owned by another user it fails with AccessDenied. psutil then falls back to a second, "slower" method (see here for example).
It's slower because it iterates over all PIDs, but unlike the "plain" Windows APIs it can still retrieve multiple pieces of information in one shot: number of threads, context switches, handles, CPU times, create time, and I/O counters.
That's why querying processes owned by other users results in an impressive +6.5x speedup.
macOS: +1.92x speedup¶
On macOS we can get 2 groups of information. With sysctl() we get process parent PID, uids, gids, terminal, create time, name. With proc_info() we get CPU times (for PIDs owned by another user), memory metrics and ctx switches. Not bad.
BSD: +2.18x speedup¶
On BSD we gather tons of process info just by calling sysctl() (see implementation): process name, ppid, status, uids, gids, IO counters, CPU and create times, terminal and ctx switches.