Documenting NetBSD's scheduler tweaks
NetBSD's scheduler was recently changed to better
distribute the load of long-running processes across multiple CPUs.
So far, the associated sysctl tweaks were not documented.
This has now been fixed:
the kern.sched sysctls are documented.
For reference, here is the text that was added to the
Influence the scheduling of LWPs, their prioritisation and how they
are distributed on and moved between CPUs.
Third level name             Type       Changeable
kern.sched.cacheht_time      integer    yes
kern.sched.balance_period    integer    yes
kern.sched.average_weight    integer    yes
kern.sched.min_catch         integer    yes
kern.sched.timesoftints      integer    yes
kern.sched.kpreempt_pri      integer    yes
kern.sched.upreempt_pri      integer    yes
kern.sched.maxts             integer    yes
kern.sched.mints             integer    yes
kern.sched.name              string     no
kern.sched.rtts              integer    no
kern.sched.pri_min           integer    no
kern.sched.pri_max           integer    no
The variables are as follows:
kern.sched.cacheht_time
    Cache hotness time in which a LWP is kept on one particular
    CPU and not moved to another CPU. This reduces the
    overhead of flushing and reloading caches. Defaults to
    3ms. Needs to be given in ``hz'' units, see mstohz(9).
kern.sched.balance_period
    Interval at which the CPU queues are checked for re-balancing.
    Defaults to 300ms. Needs to be given in ``hz''
    units, see mstohz(9).
kern.sched.average_weight
    Can be used to influence how likely LWPs are to be
    migrated from one CPU's queue of LWPs that are ready to
    run to a different, idle CPU. The value gives the percentage
    for weighting the average count of migratable
    threads from the past against the current number of
    migratable threads. A small value gives more weight to
    the past, a larger value more weight to the current situation.
    Defaults to 50 and must be between 0 and 100.
kern.sched.min_catch
    Minimum count of migratable (runnable) threads for catching
    (stealing) from another CPU. Defaults to 1 but can
    be increased to decrease the chance of thread migration.
kern.sched.timesoftints
    Enable tracking of CPU time for soft interrupts as part
    of a LWP's real execution time. Set to a non-zero value
    to enable, and see ps(1) for printing CPU times.
kern.sched.kpreempt_pri
    Minimum priority to trigger kernel preemption.
kern.sched.upreempt_pri
    Minimum priority to trigger user preemption.
kern.sched.maxts
    Scheduler specific maximal time quantum (in milliseconds).
    Must be set to a value larger than ``mints'' and
    between 10 and ``hz'' as given by the ``kern.clockrate''
    sysctl. Provided by the M2 scheduler.
kern.sched.mints
    Scheduler specific minimal time quantum (in milliseconds).
    Must be set to a value smaller than ``maxts'' and
    between 1 and ``hz'' as given by the ``kern.clockrate''
    sysctl. Provided by the M2 scheduler.
kern.sched.name
    Scheduler name. Provided both by the M2 and the 4BSD
    schedulers.
kern.sched.rtts
    Fixed scheduler specific round-robin time quantum in milliseconds.
    Provided both by the M2 and the 4BSD schedulers.
kern.sched.pri_min
    Minimal POSIX real-time priority. See sched(3).
kern.sched.pri_max
    Maximal POSIX real-time priority. See sched(3).
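To get a feeling for the units and limits in the descriptions above, here is a
minimal sketch in Python. It is not taken from the NetBSD sources: the function
names are made up for this example, and the tick conversion is assumed to be a
plain truncating integer division, which may differ from what mstohz(9)
really does. The range checks simply restate the rules from the text.

```python
def ms_to_ticks(ms: int, hz: int) -> int:
    """Convert milliseconds to clock ticks (assumption: truncating division)."""
    return (ms * hz) // 1000


def check_settings(hz: int, mints: int, maxts: int, average_weight: int) -> list:
    """Return a list of problems found; an empty list means the values fit
    the ranges given in the descriptions above (hypothetical helper)."""
    problems = []
    if not 1 <= mints <= hz:
        problems.append("mints must be between 1 and hz")
    if not 10 <= maxts <= hz:
        problems.append("maxts must be between 10 and hz")
    if mints >= maxts:
        problems.append("mints must be smaller than maxts")
    if not 0 <= average_weight <= 100:
        problems.append("average_weight must be between 0 and 100")
    return problems


if __name__ == "__main__":
    hz = 100  # example value only; the real one comes from kern.clockrate
    print("balance_period of 300ms in ticks:", ms_to_ticks(300, hz))
    print("problems:", check_settings(hz, mints=1, maxts=50, average_weight=50))
```

Under these assumptions, 300ms at hz=100 comes out as 30 ticks. Very small
values can truncate to fewer ticks than expected, so it is worth checking the
tick value that is actually in effect.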
[Tags: scheduler, sysctl]
NetBSD 7.1_RC1 available
Well, subject says it all. To quote
from Soren Jacobsen's email:
``The first release candidate of NetBSD 7.1 is now available for
Those of you who prefer to build from source can continue to follow
the netbsd-7 branch or use the netbsd-7-1-RC1 tag.
There have been quite a lot of changes since 7.0. See
src/doc/CHANGES-7.1 for the full list.
Please help us out by testing 7.1_RC1. We love any and all feedback.
Report problems through the usual channels (submit a PR or write to
the appropriate list). More general feedback is welcome at
Hotplugging RAM - uvm_hotplug(9), the Xen balloon(4) driver and portmasters' FAQ
Adding and removing hardware components in operation is common in
today's commoditized computing environments. This was not always
the case - in the past century, one had to power down a machine
in order to change network cards, hard disks or RAM.
A major step towards changing a system's configuration at runtime
for customers came with USB, but that's not where it ends - other
systems like PCI support hotplugging as well.
Another area where the system's configuration can change is
the amount of Random Access Memory (RAM) of a system.
Usually fixed, this is determined at system start time, and
then managed by the operating system's memory management system.
But especially with today's virtualized hardware systems, even
the amount of RAM assigned to a system can easily be changed.
For example, a VM can be assigned more RAM when needed,
without even rebooting the system, leading to better
performance without introducing swapping/paging overhead.
Of course this requires support from the operating system and
its memory management subsystem.
For NetBSD, the UVM
virtual memory system has now been changed to support this via the
uvm_hotplug(9) API, and a first user for this is the Xen
balloon(4) driver.
Quoting from the balloon(4) manpage:
``The balloon driver supports the memory ballooning operations offered in
Xen environments. It allows shrinking or extending a domain's available
memory by passing pages between different domains.''
The uvm_hotplug(9) manpage gives us more information on the UVM hotplug functionality:
``When the kernel is compiled with 'options UVM_HOTPLUG',
memory segments are handled in a dynamic data structure (rbtree(3)) com-
pared to a static array when not. This enables kernel code to add or
remove information about memory segments at any point after boot - thus
To answer more questions for portmasters who want to change
their ports, Cherry G. Mathew has now
posted a uvm_hotplug(9) port masters' FAQ.
It covers questions on the background, affected files,
and needed changes.
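The difference between a static array and a dynamic structure for memory
segments, as mentioned in the uvm_hotplug(9) quote above, can be pictured with
a small sketch. This is only an illustration in Python and not the
uvm_hotplug(9) API: the class and method names are invented here, and a sorted
list with binary search stands in for the red-black tree.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: int  # first page frame number of the segment
    end: int    # one past the last page frame number


class SegmentMap:
    """Ordered set of non-overlapping memory segments (illustrative only)."""

    def __init__(self):
        self._starts = []    # sorted start frames, parallel to _segments
        self._segments = []

    def plug(self, start: int, end: int) -> None:
        """Add a segment at runtime, rejecting empty or overlapping ranges."""
        if start >= end:
            raise ValueError("empty segment")
        i = bisect.bisect_left(self._starts, start)
        if i > 0 and self._segments[i - 1].end > start:
            raise ValueError("overlaps previous segment")
        if i < len(self._segments) and self._segments[i].start < end:
            raise ValueError("overlaps next segment")
        self._starts.insert(i, start)
        self._segments.insert(i, Segment(start, end))

    def unplug(self, start: int) -> Segment:
        """Remove and return the segment that begins at 'start'."""
        i = bisect.bisect_left(self._starts, start)
        if i == len(self._starts) or self._starts[i] != start:
            raise KeyError(start)
        del self._starts[i]
        return self._segments.pop(i)

    def find(self, pfn: int):
        """Return the segment containing page frame 'pfn', or None."""
        i = bisect.bisect_right(self._starts, pfn) - 1
        if i >= 0 and self._segments[i].end > pfn:
            return self._segments[i]
        return None
```

The point of the sketch is that segments can come and go after boot while
lookups stay cheap, which a fixed-size array filled once at startup cannot
offer.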
For more information on UVM,
see Charles "Chuck" Cranor's PhD dissertation on
Design and Implementation of UVM
(PDF) as well as his
Usenix talk on the UVM Virtual Memory System (PS).
There is also
plenty of information available on Xen ballooning
- check it out and share your experiences on NetBSD's
port-xen mailing list!
[Tags: balloon, faq, hotplug, uvm, xen]
Bringing the scheduler saga to the finishing line
After my last
blog postings on the NetBSD scheduler,
some time went by. What has happened is that the code
to handle process migration was rewritten to give
more knobs for tuning, and some testing was done.
The initial problem stated
in PR kern/51615
is solved by the code.
To reach a wider audience and get more testing,
the code was
committed to NetBSD-current today.
Now, two things remain to be seen:
- More testing. This best involves situations that compare
  the system's behaviour without and with the patch.
  Situations to test include
  - pure computation jobs that involve multiple parallel processes
  - a mix of CPU-crunching and input/output, again on a number of
  - full build.sh examples
  If you have time and an interesting set of numbers,
  please feel free to
  let us know on tech-kern@..
- Documentation. There are already a number of undocumented
  sysctls under "kern.sched", which have now been extended by one more,
  "average_weight". While it's obvious to add the knob from
  the formula, testing it under various real-life conditions
  and seeing how things change is left to be determined by
  a PhD thesis or two - be sure to drop us your patches for
  if you can come up with a comprehensible
  description of all the scheduler sysctls!
So just when you thought there was no more research to be
done in scheduling algorithms, here is your chance
for fame and glory! :-)
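For the "pure computation jobs" kind of test mentioned above, a small driver
script may help to get comparable numbers. The following is just one possible
sketch in Python and not part of the patch or of any NetBSD tool: the function
name and the iteration count are arbitrary choices made here.

```python
import subprocess
import sys
import time

# Child program: a pure computation loop without any input/output.
BUSY_LOOP = "n = 0\nfor i in range({iterations}):\n    n += i * i\n"


def run_cpu_hogs(count: int, iterations: int = 2_000_000):
    """Start 'count' CPU-bound child processes in parallel, wait for all of
    them, and return (elapsed wall-clock seconds, list of exit codes)."""
    start = time.monotonic()
    children = [
        subprocess.Popen(
            [sys.executable, "-c", BUSY_LOOP.format(iterations=iterations)]
        )
        for _ in range(count)
    ]
    exit_codes = [child.wait() for child in children]
    return time.monotonic() - start, exit_codes


if __name__ == "__main__":
    for count in (1, 2, 4):
        elapsed, _ = run_cpu_hogs(count)
        print(f"{count} process(es): {elapsed:.2f}s")
```

If load is spread over the CPUs, the wall-clock time should stay roughly flat
as long as there are no more processes than CPUs. If the processes fight over
one CPU, the time grows with the number of processes. Run it once without and
once with the patch to compare.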
[Tags: scheduler, Xen]
Apple Releases macOS 10.12 Sierra Open Source Darwin Code
Interesting news comes in via Slashdot:
Apple Releases macOS 10.12 Sierra Open Source Darwin Code:
``Apple has released the open source Darwin code for macOS 10.12 Sierra. The code, located on Apple's open source website, can be accessed via
direct link now, although it doesn't yet appear on the site's home page. The release builds on a long-standing library of open source code that dates all the way back to OS X 10.0. There, you'll also find the Open Source Reference Library, developer tools, along with iOS and OS X Server resources. The lowest layers of macOS, including the kernel, BSD portions, and drivers are based mainly on open source technologies, collectively called Darwin. As such, Apple provides download links to the latest versions of these technologies for the open source community to learn and to use.''
This may not only be of interest to the
(or rather their successors in PureDarwin)
but more investigation, not only of the code itself
but also of the license it is released under, is necessary
to learn if anything can be gained back for NetBSD.
Why "back"? As you may or may not remember,
macOS includes some parts of NetBSD (besides lots of
FreeBSD, probably some OpenBSD, much other Open Source
software and surely a big lot of Apple's own code).
[Tags: apple, opendarwin, puredarwin]
BSD now 169: Scheduling your NetBSD, plus a comment
BSD Now 169 is out, entitled "Scheduling your NetBSD".
Yay, exciting content in this BSD-centric video podcast!
As it turns out, Allan Jude and Kris Moore
actually read from some guy's NetBSD blog
starting at 0:22:50, going over
the NetBSD scheduler there.
Exciting - I think I want to blog about this to get
more NetBSD content on the 'net. ;-)
Now, seriously, to avoid getting into a recursive content loop,
I'd like to add one thing that may have caused a bit of confusion
at the end:
The problem mentioned at the end that led to the statement
that the patch wasn't perfect wasn't to blame on the patch,
but on my testing environment.
Using all CPU cores on VMware left none for my
normal operating system, and as such it was no fun to test.
That was the reason why I aborted the build test and went from 4 to 2 CPU cores.
Nothing related to the patch itself.
Sorry to Allan and Kris if that didn't come out clearly.
Feel free to add that in BSD now 170! :-)
[Tags: bsdnow, scheduler]
EuroBSDCon 2016 Talks and NetBSD
This year's EuroBSDCon took place in Belgrade, and
the slides are now available.
Have a look at the full lot - or pick the ones
that are relevant to NetBSD:
- Abhinav Upadhyay: Automated Learning from Man Pages with NetBSD's apropos(1) - kudos for cool HTML presentation!
- Arun Thomas: DTrace Internals: Digging into DTrace - Goes from a big chunk of interesting, general DTrace material into FreeBSD, definitely calls for an introduction of DTrace on NetBSD :)
- Ryota Ozaki, Kengo Nakahara: Toward MP-safe Networking in NetBSD - excellent feature of SMP architecture, locking, testing-howto and much more
- Sevan Janiyan: Synchronisation of userland source code amongst the BSD's - I have no clue what this talk is about only from looking at the slides. Video, anyone?
- William Dobbins, Philip Nelson: NetBSD Subfiles - What they are, where they come from, and state of the existing work
- Joerg Sonnenberger: Bulk Building In The Many Core Era - hard facts on pbulk, where it comes from, recent challenges and where to go from there
Reminder: Presentations about either NetBSD itself,
its internals but also how to use NetBSD to do something
cool, neat, useful or just utterly obscure are always welcome.
Let me know, or even better: file your (Euro)BSDCon talk! :)
In-kernel audio mixing ahead
NetBSD's sound device is currently only available for exclusive use.
If one program uses it, another cannot. So if you want to play
some music (mp3, audio stream) that's fine, but if you want
to also have your web browser or mail client make some noise,
this is not possible. Until now.
The solution is to mix
multiple audio sources together, in effect making
/dev/sound (etc.) access no longer exclusive to a
single process but open to several at once.
To make this happen, audio from those sources needs to be
mixed to come out of the same speaker, and since data
written to /dev/sound ends up inside the kernel, that is
a good place to do the mixing.
Challenges come into play if audio sources are of different
quality (bitrate, stereo/mono), so some adjusting
may be needed. All this is handled by
the latest patch by Nathanial Sloss; see
his posting to tech-kern
for more information.
Also, note his request for review and testing! :-)
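To illustrate what "mixing" means at the sample level, here is a minimal
sketch in Python. It is not how the patch does it, and all names are made up
for this example. It assumes signed 16-bit samples at the same sample rate,
sums them per position, and clips the result so that loud sources do not wrap
around. A mono source is first widened to stereo by duplicating each sample.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def mono_to_stereo(samples):
    """Duplicate each mono sample into an interleaved left/right pair."""
    out = []
    for sample in samples:
        out.extend((sample, sample))
    return out


def mix(*streams):
    """Mix several sample streams by summing them position by position.

    Shorter streams are treated as silent once they run out, and sums are
    clipped (saturated) to the signed 16-bit range.
    """
    length = max((len(stream) for stream in streams), default=0)
    mixed = []
    for i in range(length):
        total = sum(stream[i] for stream in streams if i < len(stream))
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed


if __name__ == "__main__":
    music = [1000, -1000, 2000, -2000]   # already interleaved stereo
    beep = mono_to_stereo([500, 500])    # mono source widened to stereo
    print(mix(music, beep))
```

A real mixer also has to convert between sample rates and encodings, which is
left out here to keep the idea visible.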
[Tags: audio, mixer, sound]
Learning more about the NetBSD scheduler (... than I wanted to know)
I've had another chat with Michael on the scheduler issue,
and we agreed that someone should review his proposed patch.
Some interesting things came out from there:
- I learned a bit more about the scheduler from Michael.
  With multiple CPUs, each CPU has a queue of processes that
  are either "on the CPU" (running) or waiting to be serviced
  (run) on that CPU. Those processes count as "migratable"
  Every now and then, the system checks all its run queues
  to see if a CPU is idle, and can thus "steal" (migrate) processes from
  a busy CPU. This is done in
  Such "stealing" (migration) has the positive effect that the
  process doesn't have to wait for getting serviced on the CPU
  it's currently waiting on. On the other hand, migrating the
  process has effects on the CPU's data and instruction caches,
  so switching CPUs shouldn't be taken too lightly.
  If migration happens, then this should be done from the CPU
  with the most processes that are waiting for CPU time.
  In this calculation, not only the current number should be
  counted in, but a bit of the CPU's history is taken into
  account, so processes that just started on a CPU are
  not taken away again immediately. This is what is done
  with the help of the processes currently migratable
  (r_mcount) and also some "historic"
  average. This "historic" value is taken from the previous round in
  More or less weight can be given to this, and it seems
  that the current number of migratable processes had too
  little weight overall to be considered.
  What happens in effect is that a process is not taken from its
  CPU, left waiting there, with another CPU spinning idle.
  Which is exactly what I saw
  in the first place.
- What I also learned from Michael was that there are a number of
  sysctl variables that can be used to influence the scheduler.
  Those are available under the "kern.sched" sysctl-tree:
    % sysctl -d kern.sched
    kern.sched.cacheht_time: Cache hotness time (in ticks)
    kern.sched.balance_period: Balance period (in ticks)
    kern.sched.min_catch: Minimal count of threads for catching
    kern.sched.timesoftints: Track CPU time for soft interrupts
    kern.sched.kpreempt_pri: Minimum priority to trigger kernel preemption
    kern.sched.upreempt_pri: Minimum priority to trigger user preemption
    kern.sched.rtts: Round-robin time quantum (in milliseconds)
    kern.sched.pri_min: Minimal POSIX real-time priority
    kern.sched.pri_max: Maximal POSIX real-time priority
  The above text shows that much more can be written about
  the scheduler and its whereabouts, but this remains to be done
  by someone else (volunteers welcome!).
- Now, while digging into this, I also learned that I'm not the first
  to discover this issue, and there is already another PR on this.
  I have opened PR
  but there is also
  kern/43561. Funnily enough, the solution proposed there is about the same,
  though with a slightly different implementation. Still, *2 and
  <<1 are the same as are /2 and >>1, so no change there.
  And renaming variables for fun doesn't count anyways. ;)
Last but not least, it's worth noting that this whole
issue is not Xen-specific.
So, with this in mind, I went to do a bit of testing.
I had already tested running concurrent, long-running processes
that used up all the CPU they got, and the test was good.
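To make the idea of "current count plus a bit of history" from the scheduler
discussion more tangible, here is a toy model in Python. It is a sketch under
my own assumptions and not the kernel code: the function names are invented,
the average is kept as a floating point number, and comparing the blended
average against a min_catch threshold is a simplification made for this
example. A weight of 0 means "only history counts", 100 means "only the
current count counts".

```python
def balance_round(averages, mcounts, weight):
    """One balancing pass: blend each CPU's past average with its current
    number of migratable processes, using 'weight' percent for the present."""
    return [
        ((100 - weight) * avg + weight * mcount) / 100.0
        for avg, mcount in zip(averages, mcounts)
    ]


def pick_victim(averages, idle_cpu, min_catch=1):
    """Return the busiest CPU (by blended average) to steal from, or None
    if no other CPU reaches the min_catch threshold."""
    candidates = [
        cpu for cpu in range(len(averages))
        if cpu != idle_cpu and averages[cpu] >= min_catch
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda cpu: averages[cpu])


def rounds_until_steal(mcounts, idle_cpu, weight, min_catch=1, max_rounds=100):
    """Count balancing rounds until the idle CPU finds a victim, starting
    from an all-zero history. Returns None if that never happens."""
    averages = [0.0] * len(mcounts)
    for rounds in range(1, max_rounds + 1):
        averages = balance_round(averages, mcounts, weight)
        if pick_victim(averages, idle_cpu, min_catch) is not None:
            return rounds
    return None


if __name__ == "__main__":
    # CPU 0 holds two migratable processes, CPU 1 is idle.
    for weight in (10, 50, 100):
        print(weight, rounds_until_steal([2, 0], idle_cpu=1, weight=weight))
```

In this model, the less weight the current situation gets, the more balancing
rounds pass before the idle CPU is allowed to take a process over. That is the
effect described above, with a process left waiting while another CPU spins
idle.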
To test a different load on the system,
I've started a "build.sh -j8" on a (VMware Fusion) VM with 4 CPUs on a
MacBook Pro, and it nearly brought the machine to a halt - what I saw was
lots of idle time on all CPUs though. I aborted the exercise to get some
CPU cycles back for myself. I blame the VM handling here, not the guest
I restarted the exercise with 2 CPUs in the same VM, and there I saw load
distribution on both CPUs (not much wonder with -j8), but there was also
quite some idle time in the 'make clean / install' phases that I'm not
sure is normal. During the actual build phases I wasn't able to see idle
time, though the system spent quite some time in the kernel (system).
Example top(1) output:
load averages: 9.01, 8.60, 7.15; up 0+01:24:11 01:19:33
67 processes: 7 runnable, 58 sleeping, 2 on CPU
CPU0 states: 0.0% user, 55.4% nice, 44.6% system, 0.0% interrupt, 0.0% idle
CPU1 states: 0.0% user, 69.3% nice, 30.7% system, 0.0% interrupt, 0.0% idle
Memory: 311M Act, 99M Inact, 6736K Wired, 23M Exec, 322M File, 395M Free
Swap: 1536M Total, 21M Used, 1516M Free
  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
27028 feyrer    20    5    62M   27M CPU/1      0:00  9.74%  0.93% cc1
  728 feyrer    85    0    78M 3808K select/1   1:03  0.73%  0.73% sshd
23274 feyrer    21    5    36M   14M RUN/0      0:00 10.00%  0.49% cc1
21634 feyrer    20    5    44M   20M RUN/0      0:00  7.00%  0.34% cc1
24697 feyrer    77    5  7988K 2480K select/1   0:00  0.31%  0.15% nbmake
24964 feyrer    74    5    11M 5496K select/1   0:00  0.44%  0.15% nbmake
18221 feyrer    21    5    49M   15M RUN/0      0:00  2.00%  0.10% cc1
14513 feyrer    20    5    43M   16M RUN/0      0:00  2.00%  0.10% cc1
  518 feyrer    43    0    15M 1764K CPU/0      0:02  0.00%  0.00% top
20842 feyrer    21    5  6992K  340K RUN/0      0:00  0.00%  0.00% x86_64--netb
16215 feyrer    21    5    28M  172K RUN/0      0:00  0.00%  0.00% cc1
 8922 feyrer    20    5    51M   14M RUN/0      0:00  0.00%  0.00% cc1
All in all, I'd say the patch is a good step forward from the current
situation, which does not properly distribute pure CPU hogs, at all.
[Tags: scheduler, smp]
Looking at the scheduler issue again (Updated)
I've encountered a
funny scheduler behaviour the other day in a Xen environment.
The behaviour was that
CPU load was not distributed evenly on all CPUs, i.e. in my case
on a 2-CPU-system, two CPU-bound processes fought over the same
CPU, leaving the other one idle.
I had another look at this today, and was able to reproduce the
behaviour using VMware Fusion with two CPU cores
on both NetBSD 7.0_STABLE as well as -current, both with
sources as of today, 2016-11-08.
I have made a screenshot available
that shows the issue on both systems.
I have also
filed a problem report
to document the issue.
The one hint that I got so far was from Michael van Elst that
there may be a rounding error in sched_balance().
Looking at the code, there is not much room for a rounding error.
But I am not familiar enough (at all) with the code, so I cannot judge
if crucial bits are dropped here, or how that function fits in the
Pondering on the "rounding error", I've set up both VMs with
4 CPUs, and the behaviour shown there is that load is
distributed to about three and a half CPUs - three CPUs under
full load, and one not reaching 100%. There's definitely
something fishy in there.
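Just to show what kind of rounding error could hide in a load average at all,
here is a purely hypothetical sketch in Python. I am not claiming that this is
what sched_balance() does. It assumes an average that is updated by adding the
current count and halving with integer arithmetic, and compares it to the same
update done on a scaled (fixed-point) value. All names are made up.

```python
def halving_average(avg: int, count: int) -> int:
    """Integer update: the fractional half is dropped on every round."""
    return (avg + count) >> 1


def scaled_halving_average(avg_scaled: int, count: int, scale: int = 256) -> int:
    """Same update, but the value is kept multiplied by 'scale' so that
    fractions below one survive the shift."""
    return (avg_scaled + count * scale) >> 1


def simulate(count: int, rounds: int):
    """Feed the same count into both averages for a number of rounds."""
    plain, scaled = 0, 0
    for _ in range(rounds):
        plain = halving_average(plain, count)
        scaled = scaled_halving_average(scaled, count)
    return plain, scaled


if __name__ == "__main__":
    plain, scaled = simulate(count=1, rounds=10)
    print("plain integer average:", plain)
    print("scaled average (x256):", scaled)
```

With a single waiting process, the plain integer version starts at zero and
never leaves it, because (0 + 1) >> 1 is 0 again. The scaled version keeps
climbing towards the scaled count. Whether something like this is behind the
behaviour seen here remains to be checked against the real code.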
Splitting up the four CPUs on different processor sets with one process
assigned to each set (using psrset(8)) leads to an even load distribution
here, too. This leads me to thinking that the NetBSD scheduling works
well between different processor sets, but is busted within one set.
[Tags: amd64, scheduler, xen]