How to get world-class SMP performance with NetBSD, by ad and rmind
Andrew Doran is currently employed by The NetBSD Foundation to change
NetBSD's SMP implementation from big-lock to fine-grained kernel locking.
With him, Mindaugas Rasiukevicius has done a lot of work on NetBSD's
scheduler, and Yamamoto Takashi has added a fair share of further
infrastructure work in the kernel.
I've asked them about their progress
in the past months, and the following points give a rough idea of what
has been achieved so far, and what can still be expected.
The story so far.
Andrew Doran writes: ``
The kernel synchronization model has been completely revamped since NetBSD
4.0, with the goal of making NetBSD a fully multithreaded, multiprocessor
system with complete support for soft real-time applications.
Through NetBSD 4.0, the kernel used spinlocks and a per-CPU interrupt
priority level (the spl(9) system) to provide mutual exclusion. These
mechanisms did not lend themselves well to a multiprocessor environment
supporting kernel preemption. Rules governing their use had been built up
over many years of development, making them difficult to understand and use
well. The use of thread-based (lock) synchronization was limited, and the
available synchronization primitive (lockmgr) was inefficient and slow.
In the development branch that will become NetBSD 5.0, a rich set of
synchronization primitives and software tools has been developed to ease
writing multithreaded kernel code that executes efficiently and safely on
SMP systems. Some of these are:
Mindaugas adds a few more items: ``
- Thread-based adaptive mutexes. These are lightweight, exclusive locks that
use threads as the focus of synchronization activity. Adaptive mutexes
typically behave like spinlocks, but under specific conditions an attempt
to acquire an already held adaptive mutex may cause the acquiring thread to
sleep. Sleep activity occurs rarely. Busy-waiting is typically more
efficient because mutex hold times are most often short. In contrast to
pure spinlocks, a thread holding an adaptive mutex may be preempted in the
kernel, which can allow for reduced latency where soft real-time
applications are in use on the system.
- Reader/writer locks. These are lightweight shared/exclusive locks that
again use threads as the focus of synchronization activity. Their area of
use is limited, most of it being in the file system code.
- CPU-based spin mutexes, used mainly within the scheduler, device drivers
and device driver support code. Pure spin mutexes are used when it is not
safe, or not possible, for a thread to use a synchronization method that
could block, such as an adaptive mutex.
- Priority inheritance, implemented in support of soft real-time applications.
Where a high priority thread is blocked waiting for a resource held by a
lower priority thread, the kernel can temporarily "lend" a high priority
level to the lower priority thread. This helps to ensure that the lower
priority thread does not unduly obstruct the progress of the higher
priority thread.
- Atomic operations. A full set of atomic operations implementing arithmetic
and memory barrier operations has been provided. The atomic operations are
available both in the kernel and to user applications, via the C library.
- Generic cross calls: a facility that allows one CPU or thread to easily
make an arbitrary function call on any other CPUs in the system.
- The interrupt priority level interfaces (spl(9)) have long been used to
block interrupts on a CPU-local basis. These interfaces have been
simplified and streamlined to allow for code and algorithms that make use
of low cost CPU-local synchronization. In addition, APIs are provided that
allow detection of kernel preemption and allow the programmer to
temporarily disable preemption across critical sections of code that
cannot tolerate any interruption.
- "percpu" memory allocator: a simple memory allocator that provides
arbitrary amounts of keyed storage. Allocations are replicated across all
CPUs in the system and each CPU has its own private instance of any
allocated object. Together, the cross call facility, atomic operations,
spl(9) interfaces and percpu allocator make it easy to build lightweight,
scalable kernel subsystems.
- Lockless memory allocators: the standard kernel memory allocators have been
augmented with per-CPU caches, which significantly reduce the costly
synchronization overhead typically associated with allocating memory on a
multiprocessor system.
Besides those achievements, there is more development work ongoing,
and a number of items were presented for review and comment over the
past week, which will have further impact on NetBSD's performance
on multicore and SMP machines:
- A new thread scheduler, named M2: it reduces lock contention and
increases thread affinity to avoid cache thrashing, which essentially
improves performance on SMP systems. M2 implements a time-sharing class
and the POSIX real-time extensions used by soft real-time applications.
- Processor sets and an affinity API provide the ability to bind processes
or threads to a specific CPU or group of CPUs. This allows applications
to achieve better concurrency and CPU utilization, to avoid cache
thrashing, and thus to improve performance on SMP systems.
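As a rough illustration of the concept, here is how a process can pin itself to a single CPU using Python's `os.sched_setaffinity`. Note this wraps the Linux/glibc facility, not NetBSD's own processor-set and affinity interfaces, so it is an analogy only.

```python
import os

def pin_to_one_cpu():
    """Restrict the calling process to a single CPU and return its id.
    Uses Linux's sched_setaffinity; NetBSD's own API differs."""
    cpus = sorted(os.sched_getaffinity(0))  # CPUs we may currently run on
    target = cpus[0]
    os.sched_setaffinity(0, {target})       # bind to exactly one CPU
    return target
```

Keeping a thread on one CPU preserves its cache working set, which is exactly the cache-thrashing avoidance the affinity API aims for.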
A call for participation: Benchmark!
I think this is a very long list of changes, which will all be available
in the next major release of NetBSD. Starting now, it would be interesting
to measure and estimate the performance of NetBSD in comparison to
other operating systems that emphasize SMP (but still keep performance
a goal on uniprocessor machines) -- FreeBSD, Linux and Solaris/x86 come
to mind. Possible benchmarks range from simple ones like Bytebench,
Dhrystone and Bonnie to more complex ones like Postmark, and database
and webserver benchmarks. If anyone has numbers and/or graphs,
please post them to the tech-perform@NetBSD.org mailing list!
- A scheduler is responsible for distributing workload across CPUs,
and besides the 4BSD scheduler, a more modern "M2" scheduler was
recently added to NetBSD, see above. Parts of that scheduler were
now suggested for inclusion in the general scheduling framework.
That way, the 4BSD scheduler gains processor affinity (so threads and
processes stay bound to a single CPU, reducing the cache misses
incurred when migrating between CPUs/cores).
With other changes surrounding this, NetBSD-current beats
FreeBSD 7.0 and all earlier NetBSD releases when running build.sh
(i.e. compiling the whole system) on an 8-core machine. In the image,
small values mean less time for the build, and are thus good.
I find the results impressive.
For more information, see
Andrew's posting to tech-kern.
- Reader/writer locks
are a locking primitive used to allow multiple readers, but to
block them if one or more processes want to write to a resource.
Those locks are used throughout the NetBSD kernel.
In order to further optimize the performance of the
rwlock primitives, a few optimizations were
suggested by Andrew Doran
which reduce the build time on an 8-CPU machine by 4%:
``There is less idle time during the build because the windows where
a rwlock can be held but the owner is not running are reduced''.
- Another optimization was suggested
which cuts down another 5% of the time for a complete system build
via build.sh on an 8-core machine, this time by replacing a linear
list of locks in the namei cache with a hash table for the locks.
The namei cache helps to speed up translations from a path name
to the corresponding vnodes.
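The semantics of the reader/writer locks discussed above can be sketched in a few lines of userland Python. This is a minimal, condition-variable-based illustration, not the kernel's rwlock(9) implementation: any number of readers may hold the lock at once, while a writer requires exclusive access.

```python
import threading

class RWLock:
    """Minimal reader/writer lock sketch (illustration only)."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0       # number of readers currently holding the lock
        self._writer = False    # whether a writer holds the lock

    def acquire_read(self):
        with self._cond:
            while self._writer:             # readers wait only for a writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()     # wake any waiting writer

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:  # writers need exclusivity
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()         # wake readers and writers
```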
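Replacing a single linear list of locks with a hash table of locks, as in the namei cache change above, is an instance of lock striping. A minimal sketch in Python (the names and table size are illustrative, not the kernel's actual code):

```python
import threading

NLOCKS = 64  # illustrative power-of-two table size
_lock_table = [threading.Lock() for _ in range(NLOCKS)]

def lock_for(name):
    """Hash a cache key (e.g. a path component) to one lock in the table.
    Lookups of unrelated names usually map to different locks, so they no
    longer contend on, or scan, one shared list of locks."""
    return _lock_table[hash(name) & (NLOCKS - 1)]
```

A caller simply does `with lock_for("/usr/bin"): ...`; the same key always yields the same lock, and contention is spread across the whole table.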