How to get world-class SMP performance with NetBSD, by ad and rmind
Andrew Doran is currently employed by The NetBSD Foundation to change
NetBSD's SMP implementation from big-lock to fine-grained kernel locking.
With him, Mindaugas Rasiukevicius has done a lot of work on NetBSD's
scheduler, and Yamamoto Takashi has added a fair share of further
infrastructure work in the kernel.
I've asked them about their progress
in the past months, and the following points give a rough idea of what
has been achieved so far, and what can still be expected.
The story so far.
Andrew Doran writes: ``
The kernel synchronization model has been completely revamped since NetBSD
4.0, with the goal of making NetBSD a fully multithreaded, multiprocessor
system with complete support for soft real-time applications.
Through NetBSD 4.0, the kernel used spinlocks and a per-CPU interrupt
priority level (the spl(9) system) to provide mutual exclusion. These
mechanisms did not lend themselves well to a multiprocessor environment
supporting kernel preemption. Rules governing their use had been built up
over many years of development, making them difficult to understand and use
well. The use of thread-based (lock) synchronization was limited, and the
available synchronization primitive (lockmgr) was inefficient and slow.
In the development branch that will become NetBSD 5.0, a rich set of
synchronization primitives and software tools has been developed to ease
writing multithreaded kernel code that executes efficiently and safely on
SMP systems. Some of these are:
Mindaugas adds a few more items: ``
- Thread-based adaptive mutexes. These are lightweight, exclusive locks that
use threads as the focus of synchronization activity. Adaptive mutexes
typically behave like spinlocks, but under specific conditions an attempt
to acquire an already held adaptive mutex may cause the acquiring thread to
sleep. Sleep activity occurs rarely. Busy-waiting is typically more
efficient because mutex hold times are most often short. In contrast to
pure spinlocks, a thread holding an adaptive mutex may be preempted in the
kernel, which can allow for reduced latency where soft real-time
applications are in use on the system.
- Reader/writer locks. These are lightweight shared/exclusive locks that
again use threads as the focus of synchronization activity. Their area of
use is limited, most of it being in the file system code.
- CPU-based spin mutexes, used mainly within the scheduler, device drivers
and device driver support code. Pure spin mutexes are used when it is not
safe, or not possible, for a thread to use a synchronization method that
could block, such as an adaptive mutex.
- Priority inheritance, implemented in support of soft real-time applications.
Where a high priority thread is blocked waiting for a resource held by a
lower priority thread, the kernel can temporarily "lend" a high priority
level to the lower priority thread. This helps to ensure that the lower
priority thread does not unduly obstruct the progress of the higher
priority thread.
- Atomic operations. A full set of atomic operations implementing arithmetic
and memory barrier operations has been provided. The atomic operations are
available both in the kernel and to user applications, via the C library.
- Generic cross calls: a facility that allows one CPU or thread to easily
make an arbitrary function call on any other CPUs in the system.
- The interrupt priority level interfaces (spl(9)) have long been used to
block interrupts on a CPU-local basis. These interfaces have been
simplified and streamlined to allow for code and algorithms that make use
of low cost CPU-local synchronization. In addition, APIs are provided that
allow detection of kernel preemption and allow the programmer to
temporarily disable preemption across critical sections of code that
cannot tolerate any interruption.
- "percpu" memory allocator: a simple memory allocator that provides
arbitrary amounts of keyed storage. Allocations are replicated across all
CPUs in the system and each CPU has its own private instance of any
allocated object. Together, the cross call facility, atomic operations,
spl(9) interfaces and percpu allocator make it easy to build lightweight,
scalable kernel subsystems.
- Lockless memory allocators: the standard kernel memory allocators have been
augmented with per-CPU caches, which significantly reduce the costly
synchronization overhead typically associated with allocating memory on a
multiprocessor system.
Besides those achievements, there is more development work ongoing,
and a number of items were presented for review and comment over the
past week, which will have further impact on NetBSD's performance
on multicore and SMP machines:
- A new thread scheduler, named M2: it reduces lock contention and
increases thread affinity to avoid cache thrashing, which essentially
improves performance on SMP systems. M2 implements a time-sharing class
and the POSIX real-time extensions used by soft real-time applications.
- Processor sets and an affinity API provide the ability to bind processes
or threads to a specific CPU or group of CPUs. This allows applications
to achieve better concurrency and CPU utilization, to avoid cache
thrashing, and thus to improve performance on SMP systems.
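As a rough illustration of the concept, here is how a process can pin itself to a single CPU using Python's `os.sched_setaffinity`. Note this wraps the Linux/glibc facility, not NetBSD's own processor-set and affinity interfaces, so it is an analogy only.

```python
import os

def pin_to_one_cpu():
    """Restrict the calling process to a single CPU and return its id.
    Uses Linux's sched_setaffinity; NetBSD's own API differs."""
    cpus = sorted(os.sched_getaffinity(0))  # CPUs we may currently run on
    target = cpus[0]
    os.sched_setaffinity(0, {target})       # bind to exactly one CPU
    return target
```

Keeping a thread on one CPU preserves its cache working set, which is exactly the cache-thrashing avoidance the affinity API aims for.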
A call for participation: Benchmark!
I think this is a very long list of changes, which will all be available
in the next major release of NetBSD. Starting now, it would be interesting
to measure and estimate the performance of NetBSD in comparison to
other operating systems that emphasize SMP (but still keep performance
a goal on uniprocessor machines) -- FreeBSD, Linux and Solaris/x86 come
to mind. Possible benchmarks range from simple ones like Bytebench,
Dhrystone and Bonnie to more complex ones like Postmark, and database
and webserver benchmarks. If anyone has numbers and/or graphs,
please post them to the tech-perform@NetBSD.org mailing list!
- A scheduler is responsible for distributing workload across CPUs,
and besides the 4BSD scheduler, a more modern "M2" scheduler was
recently added to NetBSD, see above. Parts of that scheduler were
now suggested for inclusion in the general scheduling framework.
That way, the 4BSD scheduler gains processor affinity (so threads and
processes stay bound to a single CPU, reducing the cache misses
incurred when migrating between CPUs/cores).
With other changes surrounding this, NetBSD-current beats
FreeBSD 7.0 and all earlier NetBSD releases when running build.sh
(i.e. compiling the whole system) on an 8-core machine. In the image,
small values mean less time for the build, and are thus good.
I find the results impressive.
For more information, see
Andrew's posting to tech-kern.
- Reader/writer locks
are a locking primitive used to allow multiple readers, but to
block them if one or more processes want to write to a resource.
Those locks are used throughout the NetBSD kernel.
In order to further optimize the performance of the
rwlock primitives, a few optimizations were
suggested by Andrew Doran
which reduce the build time on an 8-CPU machine by 4%:
``There is less idle time during the build because the windows where
a rwlock can be held but the owner is not running are reduced''.
- Another optimization was suggested
which cuts down another 5% of the time for a complete system build
via build.sh on an 8-core machine, this time by replacing a linear
list of locks in the namei cache with a hash table for the locks.
The namei cache helps to speed up translations from a path name
to the corresponding vnodes.
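The semantics of the reader/writer locks discussed above can be sketched in a few lines of userland Python. This is a minimal, condition-variable-based illustration, not the kernel's rwlock(9) implementation: any number of readers may hold the lock at once, while a writer requires exclusive access.

```python
import threading

class RWLock:
    """Minimal reader/writer lock sketch (illustration only)."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0       # number of readers currently holding the lock
        self._writer = False    # whether a writer holds the lock

    def acquire_read(self):
        with self._cond:
            while self._writer:             # readers wait only for a writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()     # wake any waiting writer

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:  # writers need exclusivity
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()         # wake readers and writers
```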
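Replacing a single linear list of locks with a hash table of locks, as in the namei cache change above, is an instance of lock striping. A minimal sketch in Python (the names and table size are illustrative, not the kernel's actual code):

```python
import threading

NLOCKS = 64  # illustrative power-of-two table size
_lock_table = [threading.Lock() for _ in range(NLOCKS)]

def lock_for(name):
    """Hash a cache key (e.g. a path component) to one lock in the table.
    Lookups of unrelated names usually map to different locks, so they no
    longer contend on, or scan, one shared list of locks."""
    return _lock_table[hash(name) & (NLOCKS - 1)]
```

A caller simply does `with lock_for("/usr/bin"): ...`; the same key always yields the same lock, and contention is spread across the whole table.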