Learning more about the NetBSD scheduler (... than I wanted to know)
I've had another chat with Michael on the scheduler issue,
and we agreed that someone should review his proposed patch.
Some interesting things came out of that:
So, with this in mind, I went to do a bit of testing.
I had already tested running concurrent, long-running processes
that used up all the CPU time they got, and that test went well.
- I learned a bit more about the scheduler from Michael.
With multiple CPUs, each CPU has a queue of processes that
are either "on the CPU" (running) or waiting to be serviced
(run) on that CPU. Those processes count as "migratable".
Every now and then, the system checks all its run queues
to see if a CPU is idle, and can thus "steal" (migrate) processes from
a busy CPU. This is done in
Such "stealing" (migration) has the positive effect that the
process doesn't have to wait to get serviced on the CPU
it's currently waiting on. On the other hand, migrating the
process affects the CPUs' data and instruction caches,
so switching CPUs shouldn't be taken lightly.
If migration happens, then this should be done from the CPU
with the most processes that are waiting for CPU time.
This calculation should count not only the current number of
waiting processes but also a bit of the CPU's history,
so that processes that just started on a CPU are
not taken away again immediately. This is done by combining
the number of processes currently migratable
(r_mcount) with a "historic"
average. This "historic" value is taken from the previous round in
More or less weight can be given to each of these, and it seems
that the current number of migratable processes had too
little weight overall to be considered.
In effect, a process is not taken from its
CPU but left waiting there while another CPU sits idle,
which is exactly what I saw
in the first place.
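To make the weighting problem concrete, here is a minimal sketch in C. It is not the kernel's code: the struct, the function names, the cur_weight parameter and the exact formula are assumptions for illustration, and only r_mcount is a name taken from the text above. It assumes the average is kept as an integer and halved on every round.

```c
#include <stddef.h>

/* Hypothetical per-CPU bookkeeping; only r_mcount is a name from the text. */
struct runq_stats {
	unsigned int r_mcount;	/* processes currently migratable */
	unsigned int avg;	/* "historic" average from the previous round */
};

/*
 * Blend the previous average with the current count using integer
 * arithmetic. cur_weight is an assumed knob that says how strongly
 * the current count is weighted.
 */
static unsigned int
blend_avg(unsigned int prev_avg, unsigned int mcount, unsigned int cur_weight)
{
	return (prev_avg + cur_weight * mcount) / 2;
}

/*
 * Update every CPU's average and return the index of the CPU with the
 * highest average, or -1 if no CPU has a nonzero average (nothing worth
 * stealing from).
 */
static int
pick_busiest(struct runq_stats *rq, size_t ncpu, unsigned int cur_weight)
{
	int busiest = -1;
	unsigned int best = 0;

	for (size_t i = 0; i < ncpu; i++) {
		rq[i].avg = blend_avg(rq[i].avg, rq[i].r_mcount, cur_weight);
		if (rq[i].avg > best) {
			best = rq[i].avg;
			busiest = (int)i;
		}
	}
	return busiest;
}
```

In this sketch, a CPU with a single migratable process and no history never gets a nonzero average with cur_weight = 1, because (0 + 1) / 2 is 0 in integer arithmetic, so it is never picked. With cur_weight = 2 the average becomes 1 and the CPU qualifies as a source for migration.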
- What I also learned from Michael was that there are a number of
sysctl variables that can be used to influence the scheduler.
Those are available under the "kern.sched" sysctl-tree:
% sysctl -d kern.sched
kern.sched.cacheht_time: Cache hotness time (in ticks)
kern.sched.balance_period: Balance period (in ticks)
kern.sched.min_catch: Minimal count of threads for catching
kern.sched.timesoftints: Track CPU time for soft interrupts
kern.sched.kpreempt_pri: Minimum priority to trigger kernel preemption
kern.sched.upreempt_pri: Minimum priority to trigger user preemption
kern.sched.rtts: Round-robin time quantum (in milliseconds)
kern.sched.pri_min: Minimal POSIX real-time priority
kern.sched.pri_max: Maximal POSIX real-time priority
The above text shows that much more can be written about
the scheduler and its inner workings, but this remains to be done
by someone else (volunteers welcome!).
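Several of these values are given in ticks, so a small helper is handy when reasoning about them in wall-clock terms. The following is a sketch only: the function name is made up, and the clock rate hz has to be supplied by the caller (on NetBSD it is reported under kern.clockrate).

```c
/*
 * Convert a number of scheduler ticks to milliseconds for a given
 * clock rate hz (ticks per second). hz must be nonzero.
 */
static unsigned long
ticks_to_ms(unsigned long ticks, unsigned long hz)
{
	return ticks * 1000UL / hz;
}
```

For example, with hz = 100 a hypothetical setting of 30 ticks corresponds to 300 milliseconds, while the same 30 ticks at hz = 1000 are only 30 milliseconds.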
- Now, while digging into this, I also learned that I'm not the first
to discover this issue, and there is already another PR on this.
I have opened PR
but there is also
kern/43561. Funnily enough, the solution proposed there is about the same,
though with a slightly different implementation. Still, *2 and
<<1 are the same, as are /2 and >>1, so no change there.
And renaming variables for fun doesn't count anyway. ;)
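The remark about *2 versus <<1 can be checked with a few lines of C. This is only an illustration with a made-up function name, and it deliberately uses unsigned values: for negative signed integers, /2 truncates toward zero while the result of >>1 is implementation-defined, so the equivalence is only guaranteed for non-negative counts such as the ones discussed here.

```c
/*
 * Return 1 if doubling and halving x give the same result whether
 * written as arithmetic or as shifts, 0 otherwise. For unsigned
 * operands both forms wrap around identically, so this always holds.
 */
static int
shifts_match(unsigned int x)
{
	return (x * 2 == x << 1) && (x / 2 == x >> 1);
}
```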
Last but not least, it's worth noting that this whole
issue is not Xen-specific.
To test a different load on the system,
I started a "build.sh -j8" on a (VMware Fusion) VM with 4 CPUs on a
MacBook Pro, and it nearly brought the machine to a halt. What I saw was
lots of idle time on all CPUs, though. I aborted the exercise to get some
CPU cycles back for myself. I blame the VM handling here, not the guest.
I restarted the exercise with 2 CPUs in the same VM, and there I saw the load
distributed across both CPUs (no wonder with -j8), but there was also
quite some idle time in the 'make clean / install' phases that I'm not
sure is normal. During the actual build phases I wasn't able to see idle
time, though the system spent quite some time in the kernel (system).
Example top(1) output:
load averages: 9.01, 8.60, 7.15; up 0+01:24:11 01:19:33
67 processes: 7 runnable, 58 sleeping, 2 on CPU
CPU0 states: 0.0% user, 55.4% nice, 44.6% system, 0.0% interrupt, 0.0% idle
CPU1 states: 0.0% user, 69.3% nice, 30.7% system, 0.0% interrupt, 0.0% idle
Memory: 311M Act, 99M Inact, 6736K Wired, 23M Exec, 322M File, 395M Free
Swap: 1536M Total, 21M Used, 1516M Free
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
27028 feyrer 20 5 62M 27M CPU/1 0:00 9.74% 0.93% cc1
728 feyrer 85 0 78M 3808K select/1 1:03 0.73% 0.73% sshd
23274 feyrer 21 5 36M 14M RUN/0 0:00 10.00% 0.49% cc1
21634 feyrer 20 5 44M 20M RUN/0 0:00 7.00% 0.34% cc1
24697 feyrer 77 5 7988K 2480K select/1 0:00 0.31% 0.15% nbmake
24964 feyrer 74 5 11M 5496K select/1 0:00 0.44% 0.15% nbmake
18221 feyrer 21 5 49M 15M RUN/0 0:00 2.00% 0.10% cc1
14513 feyrer 20 5 43M 16M RUN/0 0:00 2.00% 0.10% cc1
518 feyrer 43 0 15M 1764K CPU/0 0:02 0.00% 0.00% top
20842 feyrer 21 5 6992K 340K RUN/0 0:00 0.00% 0.00% x86_64--netb
16215 feyrer 21 5 28M 172K RUN/0 0:00 0.00% 0.00% cc1
8922 feyrer 20 5 51M 14M RUN/0 0:00 0.00% 0.00% cc1
All in all, I'd say the patch is a good step forward from the current
situation, which does not properly distribute pure CPU hogs at all.
[Tags: scheduler, smp]