2002-10-06 16:47:26

by Erich Focht

Subject: [RFC] NUMA schedulers benchmark results

Hi,

here are some results from testing various versions and approaches
to a NUMA scheduler. I used the numa_test benchmark, which I'll post
in a separate email. It runs N tasks in parallel, all doing the same
job: randomly accessing a large array. As the array is too large to
fit into cache, the test is very sensitive to memory latency, and to
memory bandwidth as well. To emulate a real multi-user environment,
the jobs are disturbed by a short load peak, simulated by a call to
"hackbench" 3 seconds after the tasks are started. The performance
of the user tasks depends very much on where they are scheduled, and
as they are CPU hoggers, the user times are quite independent of the
non-scheduler parts of the underlying kernel. The elapsed times depend
on "hackbench", which effectively blocks the machine while it runs.
Since hackbench performance depends on the underlying kernel,
one should compare "elapsed_time - hackbench_time".

The test machine is a 16 CPU NEC Azusa with Itanium processors and
four nodes. The tested schedulers are:

A: O(1) scheduler in 2.5.39
B: O(1) scheduler with task steal limited to only one task (node
affine scheduler with CONFIG_NUMA_SCHED=n) under 2.4.18
C: Michael Hohnbaum's "simple NUMA scheduler" under 2.5.39
D: pooling NUMA scheduler, no initial load balancing, idle pool_delay
set to 0, under 2.4.18
E: node affine scheduler with initial load balancing and static homenode
F: node affine scheduler without initial load balancing and dynamic
homenode selection (homenode selected where most of the memory is
allocated).
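The homenode rule in (F) is only described in words above; picking the
node holding most of a task's memory amounts to a simple arg-max over
per-node page counts. The array name and the bookkeeping behind it are
illustrative, not taken from the patch:

```c
#include <assert.h>

#define MAX_NUMNODES 4    /* four nodes, as on the Azusa test machine */

/* Return the node on which the task has most pages resident.
 * pages_on_node[] stands in for whatever per-task page accounting
 * the scheduler patch maintains; that part is not shown here. */
static int select_homenode(const unsigned long pages_on_node[])
{
    int best = 0;

    for (int nid = 1; nid < MAX_NUMNODES; nid++)
        if (pages_on_node[nid] > pages_on_node[best])
            best = nid;
    return best;
}
```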

As I'm rewriting the node affine scheduler to be more modular, I'll
redo the tests for cases D, E, F on top of 2.5.X kernels soon.

The results are summarized in the tables below. A set of outputs (for
N=8, 16, 32) is attached. They show clearly why the node affine
scheduler beats them all: the initial load balancing is node-aware,
and so is the task stealing. Sometimes the node affine force is not
large enough to bring a task back to its homenode, but it is almost
always good enough.

The (F) solution with dynamically determined homenode shows that the
initial load balancing is crucial, as the equal node balance is not
strongly enforced dynamically. So the optimal solution is probably
(F) combined with initial load balancing.


Average user time (U) and total user time (TU). Minimum per row should
be considered.
----------------------------------------------------------------------
Scheduler: A B C D E F
N=4 U 28.12 30.77 33.00 - 27.20 30.29
TU 112.55 123.13 132.08 - 108.88 121.25
N=8 U 30.47 31.39 31.65 30.76 28.67 30.08
TU 243.86 251.27 253.30 246.23 229.51 240.75
N=16 U 36.42 33.64 32.18 32.27 31.50 32.83
TU 582.91 538.49 515.11 516.53 504.17 525.59
N=32 U 38.69 34.83 34.05 33.76 33.89 34.11
TU 1238.4 1114.9 1090.1 1080.8 1084.9 1091.9
N=64 U 39.73 34.73 34.23 - (33.32) 34.98
TU 2543.4 2223.4 2191.7 - (2133) 2239.5


Elapsed time (E), hackbench time (H). Differences between 2.4.18 and
2.5.39 based kernels are due to "hackbench", which performs
differently. Compare E-H within a row, but don't take it too seriously.
--------------------------------------------------------------------
Scheduler: A B C D E F
N=4 E 37.33 37.96 48.31 - 28.14 35.91
H 9.98 1.49 10.65 - 1.99 1.43
N=8 E 46.17 39.50 42.53 39.72 30.28 38.28
H 9.64 1.86 7.27 2.07 2.33 1.86
N=16 E 47.21 44.67 49.66 42.97 36.98 42.51
H 5.90 4.69 2.93 5.18 5.56 5.94
N=32 E 88.60 79.92 80.34 78.35 76.84 77.38
H 6.29 5.23 2.85 4.51 5.29 4.28
N=64 E 167.10 147.16 150.59 - (133.9) 148.94
H 5.96 4.67 3.10 - (-) 6.86

(The E:N=64 results are without hackbench disturbance.)

Best regards,
Erich


Attachments:
NUMA_SCHED_8_16_32.results (23.48 kB)

2002-10-06 20:20:41

by Erich Focht

Subject: Re: [RFC] NUMA schedulers benchmark results

Hi,

here comes the benchmark I used for the NUMA scheduler test. Would
be interesting to know whether it is useful to any other NUMA
developer...

Regards,
Erich

PS: it uses a 'time' command that understands the --format option,
e.g. GNU time 1.7. Change it in the main script if it doesn't
work for you.




Attachments:
numa_test (8.13 kB)

2002-10-06 23:57:38

by Martin J. Bligh

Subject: Re: [RFC] NUMA schedulers benchmark results

Errm, your magic shellscript is too weird (aka doesn't work).
Can you just send the files out as attachments?

M.

--On Sunday, October 06, 2002 10:24 PM +0200 Erich Focht <[email protected]> wrote:

> Hi,
>
> here comes the benchmark I used for the NUMA scheduler test. Would
> be interesting to know whether it is useful to any other NUMA
> developer...
>
> Regards,
> Erich
>
> PS: it uses a 'time' command that understands the --format option,
> e.g. GNU time 1.7. Change it in the main script, if it doesn't
> work for you.


2002-10-07 00:55:41

by Martin J. Bligh

Subject: Re: [RFC] NUMA schedulers benchmark results

> As I'm rewriting the node affine scheduler to be more modular, I'll
> redo the tests for cases D, E, F on top of 2.5.X kernels soon.

It's really hard to see anything comparing 2.4 vs 2.5 results,
so when you have results for everything across the same kernel,
it'll be a lot easier to decipher ...

For the record, I really don't care whose code gets accepted, I'm
just interested in having a scheduler that works well for me ;-)

However, I think it would be a good idea to run some other benchmarks
on these, that are a little more general ... dbench, kernel compile,
sdet, whatever, as *well* as your own benchmark ... I think we're
working on that.

> Average user time (U) and total user time (TU). Minimum per row should
> be considered.
> ----------------------------------------------------------------------
> Scheduler: A B C D E F
> N=4 U 28.12 30.77 33.00 - 27.20 30.29
> N=8 U 30.47 31.39 31.65 30.76 28.67 30.08
> N=16 U 36.42 33.64 32.18 32.27 31.50 32.83
> N=32 U 38.69 34.83 34.05 33.76 33.89 34.11
> N=64 U 39.73 34.73 34.23 - (33.32) 34.98

So lower numbers are better, right? Michael's stuff seems to
work fine at the higher end, but poorly at the lower end. I think
this may be due to a bug he found on Friday; if that gets fixed, it
might make a difference.

> The results are summarized in the tables below. A set of outputs
> (for N=8, 16, 32) is attached. They show clearly why the node affine
> scheduler beats them all

It would be a lot clearer if your baselines weren't different.
Now maybe I'm just totally misreading your results (tell us if so)
but I presume it's more valid to compare:

1. difference between A and C (2.5 virgin vs 2.5 + Michael's stuff)
vs
2. difference between E and B (2.4 virgin vs 2.4 + your stuff)

rather than across 2.4 vs 2.5, which seems meaningless.

N=32 (E) .... C vs A 80.34 vs 88.60 (about 10% improvement)
E vs B 76.84 vs 79.92 (about 5% improvement)
N=32 (H) .... C vs A 2.85 vs 6.29 (over 50% improvement)
E vs B 5.29 vs 5.23 (about 1% regression)

Which actually makes C better than E on both measures, surely?
Obviously Michael's stuff is worse at the low end, but to say
"They show clearly why the node affine scheduler beats them all"
seems a little optimistic ;-)

I'm not trying to say you're wrong, or one scheduler is better than
the other ... I'm just saying I can't really draw any clear-cut
conclusion from your results other than that Michael's stuff looks
borked for low numbers of tasks ...

I'll run your tests on the NUMA-Q comparing 2.5.40 vs Michael's stuff.
If you send me a cleaned up version (ie without the ia64 stuff in
generic code) of your stuff against 2.5.40 (or ideally 2.5.40-mm2),
I'll run that too ... (if you're putting the latencies into arch code,
you can set i386 (well NUMA-Q) for 20:1 if you think that'll help).

M.

PS. the weird "this one was run without hackbench" thing makes the
results even harder to read ...

PPS. Does Kimio's discontigmem patch work for you on 2.5?

2002-10-07 07:23:41

by Ingo Molnar

Subject: Re: [RFC] NUMA schedulers benchmark results


On Mon, 7 Oct 2002, Martin J. Bligh wrote:

> Profile looks horrible (what the hell is fib_node_get_info?
> net/ipv4/fib_semantics.c):

most likely the wrong System.map?

Ingo

2002-10-07 07:22:15

by Martin J. Bligh

Subject: Re: [RFC] NUMA schedulers benchmark results

Just a data dump for you, I'm sleepy ... will look at it in the
morning. 16-way NUMA-Q running your benchmark:

2.5.40-mm2

time.4:AverageUserTime 40.69 seconds
time.4:ElapsedTime 52.14
time.4:TotalUserTime 162.85
time.4:TotalSysTime 0.71
time.8:AverageUserTime 36.90 seconds
time.8:ElapsedTime 62.12
time.8:TotalUserTime 295.29
time.8:TotalSysTime 1.79
time.16:AverageUserTime 63.91 seconds
time.16:ElapsedTime 88.82
time.16:TotalUserTime 1022.71
time.16:TotalSysTime 4.85
time.32:AverageUserTime 102.11 seconds
time.32:ElapsedTime 224.24
time.32:TotalUserTime 3267.84
time.32:TotalSysTime 11.73
time.64:AverageUserTime 148.33 seconds
time.64:ElapsedTime 611.71
time.64:TotalUserTime 9494.10
time.64:TotalSysTime 25.81

2.5.40-mm2 + michael's stuff (latest on tux1)

time.4:AverageUserTime 31.55 seconds
time.4:ElapsedTime 41.58
time.4:TotalUserTime 126.23
time.4:TotalSysTime 1.11
time.8:AverageUserTime 51.80 seconds
time.8:ElapsedTime 80.38
time.8:TotalUserTime 414.50
time.8:TotalSysTime 2.67
time.16:AverageUserTime 50.75 seconds
time.16:ElapsedTime 72.28
time.16:TotalUserTime 812.08
time.16:TotalSysTime 4.60
time.32:AverageUserTime 55.78 seconds
time.32:ElapsedTime 126.45
time.32:TotalUserTime 1785.21
time.32:TotalSysTime 8.79
time.64:AverageUserTime 57.49 seconds
time.64:ElapsedTime 243.27
time.64:TotalUserTime 3679.94
time.64:TotalSysTime 15.09

Looks pretty sweet to me, apart from 8, which is just weird.

Profile looks horrible (what the hell is fib_node_get_info?
net/ipv4/fib_semantics.c):

without:

241652 total 0.2455
120117 default_idle
19728 fib_node_get_info
17400 swapin_readahead
12805 kmalloc
11047 kfree
7918 sock_alloc_send_pskb
5956 alloc_skb
4690 __wake_up
4194 kmem_cache_free

with:

213131 total 0.2165
138082 default_idle
14620 fib_node_get_info
9631 swapin_readahead
5316 kmalloc
5279 kfree
5251 sock_alloc_send_pskb
4154 __wake_up
4116 schedule
3002 alloc_skb

2002-10-07 16:36:33

by Michael Hohnbaum

Subject: Re: [RFC] NUMA schedulers benchmark results

Erich,

These numbers look pretty good for your scheduler. It would be good
if you could post numbers against 2.5.x where x is the same for all
tests. I thought you had your scheduler working on 2.5.x recently.

Running numa_test on my systems I find a lot of variation (as much
as 10%) between identical runs. Are all of these results averaged
over multiple runs? Any idea why the hackbench times vary so widely?

Looking at the results for my scheduler, it is really obvious what
algorithm I use for selecting a cpu for a process. This has not been
quite as apparent on my systems, probably due to other tasks running
in the background. The code in your scheduler that starts the search
for a cpu at a different node really pays off here.

I know that there are some balance issues for my scheduler, especially
under light loads. I've got two tweaks that help, but I think that
adding something similar to what you have done (keeping track of the
last node picked and starting the search at the next node) could be
beneficial.
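That rotation could be sketched like this; the names are illustrative,
as neither patch's actual code appears in this thread:

```c
#include <assert.h>

#define NUM_NODES 4

/* Remember where the previous placement search started and begin the
 * next one at the following node, so consecutive placements rotate
 * across nodes instead of piling onto node 0. */
static int last_start_node = NUM_NODES - 1;

static int next_start_node(void)
{
    last_start_node = (last_start_node + 1) % NUM_NODES;
    return last_start_node;
}
```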

Michael

On Sun, 2002-10-06 at 09:51, Erich Focht wrote:

> A: 2.5.39 O(1) scheduler
>
> Running 'hackbench 20' in the background: Time: 9.639
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 7.5 92.5 0.0 0.0 | 0 1 *| 38.29
> 2 0.0 0.0 0.0 100.0 | 0 3 *| 30.19
> 3 0.0 100.0 0.0 0.0 | 0 1 *| 27.84
> 4 0.0 0.0 0.0 100.0 | 0 3 *| 30.38
> 5 100.0 0.0 0.0 0.0 | 0 0 | 29.96
> 6 100.0 0.0 0.0 0.0 | 0 0 | 29.63
> 7 0.0 0.0 100.0 0.0 | 0 2 *| 27.33
> 8 0.0 0.0 0.0 100.0 | 0 3 *| 30.14
> AverageUserTime 30.47 seconds
> ElapsedTime 46.17
> TotalUserTime 243.86
> TotalSysTime 0.41
>
> Running 'hackbench 20' in the background: Time: 5.896
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 0.0 98.7 1.3 0.0 | 0 1 *| 41.07
> 2 89.2 10.8 0.0 0.0 | 0 0 | 39.16
> 3 9.4 0.0 90.6 0.0 | 0 2 *| 40.72
> 4 0.0 0.0 0.0 100.0 | 0 3 *| 31.96
> 5 0.0 0.1 99.0 0.9 | 0 2 *| 41.36
> 6 0.0 0.0 100.0 0.0 | 0 2 *| 30.62
> 7 0.6 0.0 86.2 13.2 | 0 2 *| 40.11
> 8 100.0 0.0 0.0 0.0 | 0 0 | 31.25
> 9 8.0 0.0 0.0 92.0 | 0 3 *| 40.63
> 10 0.0 100.0 0.0 0.0 | 0 1 *| 30.43
> 11 0.0 0.0 0.0 100.0 | 0 3 *| 31.64
> 12 91.6 0.0 8.4 0.0 | 0 0 | 39.92
> 13 0.0 100.0 0.0 0.0 | 0 1 *| 30.46
> 14 92.1 0.0 7.9 0.0 | 0 0 | 40.36
> 15 0.0 0.0 0.0 100.0 | 0 3 *| 32.29
> 16 4.8 95.2 0.0 0.0 | 0 1 *| 40.70
> AverageUserTime 36.42 seconds
> ElapsedTime 47.21
> TotalUserTime 582.91
> TotalSysTime 0.85
>
> Running 'hackbench 20' in the background: Time: 6.293
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 0.0 0.0 100.0 0.0 | 0 2 *| 31.53
> 2 100.0 0.0 0.0 0.0 | 0 0 | 34.84
> 3 100.0 0.0 0.0 0.0 | 0 0 | 32.70
> 4 0.0 100.0 0.0 0.0 | 0 1 *| 32.25
> 5 0.0 0.0 0.0 100.0 | 0 3 *| 33.41
> 6 96.5 2.7 0.8 0.0 | 0 0 | 43.39
> 7 5.6 0.0 94.4 0.0 | 0 2 *| 44.42
> 8 73.4 22.9 3.7 0.0 | 0 0 | 42.14
> 9 0.9 0.0 99.1 0.0 | 0 2 *| 45.10
> 10 0.0 8.4 91.6 0.0 | 0 2 *| 42.85
> 11 0.0 0.0 0.0 100.0 | 0 3 *| 32.11
> 12 0.0 100.0 0.0 0.0 | 0 1 *| 32.16
> 13 0.0 0.0 0.0 100.0 | 0 3 *| 33.48
> 14 0.0 0.0 100.0 0.0 | 0 2 *| 31.35
> 15 0.0 0.3 3.2 96.4 | 0 3 *| 42.27
> 16 91.0 0.0 0.0 9.0 | 0 0 | 35.50
> 17 0.0 100.0 0.0 0.0 | 0 1 *| 32.45
> 18 0.0 97.6 2.4 0.0 | 0 1 *| 41.97
> 19 0.0 100.0 0.0 0.0 | 0 1 *| 32.46
> 20 1.3 0.0 74.2 24.5 | 0 2 *| 43.78
> 21 0.0 0.0 0.9 99.1 | 0 3 *| 32.74
> 22 0.8 0.0 99.2 0.0 | 0 2 *| 45.61
> 23 100.0 0.0 0.0 0.0 | 0 0 | 33.78
> 24 97.0 1.0 0.0 1.9 | 0 0 | 43.74
> 25 0.0 1.8 0.0 98.2 | 0 3 *| 43.31
> 26 0.0 96.1 0.0 3.9 | 0 1 *| 43.61
> 27 2.6 0.0 0.3 97.0 | 0 3 *| 46.54
> 28 0.0 0.0 0.0 100.0 | 0 3 *| 33.59
> 29 0.0 92.1 0.0 7.9 | 0 1 *| 43.43
> 30 2.2 0.0 97.8 0.0 | 0 2 *| 45.19
> 31 26.0 72.3 0.0 1.8 | 0 1 *| 43.29
> 32 98.5 1.3 0.2 0.0 | 0 0 | 42.97
> AverageUserTime 38.69 seconds
> ElapsedTime 88.60
> TotalUserTime 1238.43
> TotalSysTime 1.85
>
> -------------------------------------------------------------
> B: O(1) scheduler equivalent, 1 task steal per load balance step
> (2.4.18 node affine scheduler with CONFIG_NUMA_SCHED=n)
>
> Running 'hackbench 20' in the background: Time: 1.861
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 0.0 0.0 100.0 0.0 | 2 2 | 31.62
> 2 0.0 0.0 100.0 0.0 | 2 2 | 31.68
> 3 100.0 0.0 0.0 0.0 | 0 0 | 29.40
> 4 0.0 0.0 100.0 0.0 | 2 2 | 31.56
> 5 0.0 0.0 100.0 0.0 | 2 2 | 31.60
> 6 95.0 5.0 0.0 0.0 | 1 0 *| 38.14
> 7 100.0 0.0 0.0 0.0 | 0 0 | 29.27
> 8 0.0 100.0 0.0 0.0 | 1 1 | 27.87
> AverageUserTime 31.39 seconds
> ElapsedTime 39.50
> TotalUserTime 251.27
> TotalSysTime 0.48
>
> Running 'hackbench 20' in the background: Time: 4.688
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 0.0 0.0 100.0 0.0 | 2 2 | 31.49
> 2 0.0 0.0 100.0 0.0 | 2 2 | 31.33
> 3 0.0 0.0 100.0 0.0 | 2 2 | 31.30
> 4 100.0 0.0 0.0 0.0 | 0 0 | 31.98
> 5 100.0 0.0 0.0 0.0 | 0 0 | 32.21
> 6 0.0 91.3 8.7 0.0 | 2 1 *| 39.88
> 7 0.0 100.0 0.0 0.0 | 1 1 | 30.72
> 8 0.0 0.0 0.0 100.0 | 3 3 | 31.56
> 9 1.2 98.8 0.0 0.0 | 0 1 *| 41.31
> 10 0.0 0.0 0.0 100.0 | 3 3 | 31.56
> 11 0.0 100.0 0.0 0.0 | 1 1 | 30.67
> 12 0.0 0.0 0.0 100.0 | 3 3 | 31.45
> 13 100.0 0.0 0.0 0.0 | 0 0 | 31.94
> 14 5.5 0.0 94.5 0.0 | 0 2 *| 40.83
> 15 0.0 0.0 0.0 100.0 | 3 3 | 31.53
> 16 83.0 17.0 0.0 0.0 | 1 0 *| 38.50
> AverageUserTime 33.64 seconds
> ElapsedTime 44.67
> TotalUserTime 538.49
> TotalSysTime 1.02
>
> Running 'hackbench 20' in the background: Time: 5.229
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 67.6 0.0 17.4 15.0 | 2 0 *| 41.36
> 2 0.0 0.0 100.0 0.0 | 2 2 | 32.78
> 3 100.0 0.0 0.0 0.0 | 0 0 | 32.92
> 4 0.0 0.0 100.0 0.0 | 2 2 | 33.56
> 5 22.4 75.3 2.3 0.0 | 2 1 *| 43.26
> 6 0.0 0.0 100.0 0.0 | 2 2 | 33.38
> 7 100.0 0.0 0.0 0.0 | 0 0 | 33.16
> 8 0.0 0.0 0.0 100.0 | 3 3 | 32.03
> 9 0.0 100.0 0.0 0.0 | 1 1 | 32.64
> 10 0.0 0.0 0.0 100.0 | 3 3 | 32.00
> 11 4.5 0.0 0.0 95.5 | 0 3 *| 43.82
> 12 0.0 0.0 0.0 100.0 | 3 3 | 31.91
> 13 100.0 0.0 0.0 0.0 | 0 0 | 33.46
> 14 0.0 100.0 0.0 0.0 | 1 1 | 32.39
> 15 0.0 0.0 0.0 100.0 | 3 3 | 31.86
> 16 0.0 0.0 100.0 0.0 | 2 2 | 33.35
> 17 100.0 0.0 0.0 0.0 | 0 0 | 33.95
> 18 0.0 0.0 0.0 100.0 | 3 3 | 31.68
> 19 0.0 5.3 0.0 94.7 | 1 3 *| 42.97
> 20 2.0 98.0 0.0 0.0 | 0 1 *| 43.56
> 21 0.0 0.0 0.0 100.0 | 3 3 | 31.65
> 22 100.0 0.0 0.0 0.0 | 0 0 | 34.13
> 23 0.0 100.0 0.0 0.0 | 1 1 | 32.52
> 24 9.9 90.1 0.0 0.0 | 1 1 | 33.58
> 25 0.0 3.2 96.8 0.0 | 1 2 *| 42.51
> 26 89.8 0.0 0.0 10.2 | 0 0 | 34.18
> 27 0.0 100.0 0.0 0.0 | 1 1 | 32.44
> 28 0.0 2.4 97.6 0.0 | 2 2 | 33.59
> 29 8.4 0.0 91.6 0.0 | 2 2 | 33.78
> 30 100.0 0.0 0.0 0.0 | 0 0 | 33.52
> 31 0.0 6.3 93.7 0.0 | 2 2 | 33.37
> 32 0.0 100.0 0.0 0.0 | 1 1 | 33.20
> AverageUserTime 34.83 seconds
> ElapsedTime 79.92
> TotalUserTime 1114.91
> TotalSysTime 2.30
>
> -----------------------------------------------------------
> C: simple scheduler MH
>
> Running 'hackbench 20' in the background: Time: 7.271
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 100.0 0.0 0.0 0.0 | 0 0 | 32.66
> 2 100.0 0.0 0.0 0.0 | 0 0 | 32.53
> 3 100.0 0.0 0.0 0.0 | 0 0 | 32.47
> 4 0.0 100.0 0.0 0.0 | 1 1 | 30.33
> 5 0.0 100.0 0.0 0.0 | 1 1 | 30.20
> 6 0.0 100.0 0.0 0.0 | 1 1 | 30.06
> 7 100.0 0.0 0.0 0.0 | 0 0 | 32.35
> 8 100.0 0.0 0.0 0.0 | 0 0 | 32.58
> AverageUserTime 31.65 seconds
> ElapsedTime 42.53
> TotalUserTime 253.30
> TotalSysTime 0.48
>
> Running 'hackbench 20' in the background: Time: 2.929
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 100.0 0.0 0.0 0.0 | 0 0 | 31.07
> 2 100.0 0.0 0.0 0.0 | 0 0 | 30.97
> 3 100.0 0.0 0.0 0.0 | 0 0 | 30.86
> 4 0.0 100.0 0.0 0.0 | 1 1 | 31.91
> 5 0.0 100.0 0.0 0.0 | 1 1 | 32.04
> 6 0.0 100.0 0.0 0.0 | 1 1 | 31.98
> 7 0.0 100.0 0.0 0.0 | 1 1 | 32.01
> 8 0.0 0.0 100.0 0.0 | 2 2 | 31.83
> 9 0.0 0.0 100.0 0.0 | 2 2 | 31.66
> 10 0.0 0.0 100.0 0.0 | 2 2 | 31.47
> 11 0.0 0.0 100.0 0.0 | 2 2 | 31.74
> 12 0.0 0.0 0.0 100.0 | 3 3 | 31.54
> 13 0.0 0.0 0.0 100.0 | 3 3 | 31.48
> 14 0.0 0.0 0.0 100.0 | 3 3 | 31.49
> 15 0.0 0.0 0.0 100.0 | 3 3 | 31.97
> 16 5.4 0.0 0.0 94.6 | 0 3 *| 40.86
> AverageUserTime 32.18 seconds
> ElapsedTime 49.66
> TotalUserTime 515.11
> TotalSysTime 0.95
>
> Running 'hackbench 20' in the background: Time: 2.849
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 100.0 0.0 0.0 0.0 | 0 0 | 33.75
> 2 100.0 0.0 0.0 0.0 | 0 0 | 33.89
> 3 100.0 0.0 0.0 0.0 | 0 0 | 33.87
> 4 0.0 100.0 0.0 0.0 | 1 1 | 33.34
> 5 0.0 100.0 0.0 0.0 | 1 1 | 33.41
> 6 0.0 100.0 0.0 0.0 | 1 1 | 33.16
> 7 0.0 100.0 0.0 0.0 | 1 1 | 33.25
> 8 0.0 0.0 100.0 0.0 | 2 2 | 32.90
> 9 0.0 0.0 100.0 0.0 | 2 2 | 33.08
> 10 0.0 0.0 100.0 0.0 | 2 2 | 33.07
> 11 0.0 0.0 100.0 0.0 | 2 2 | 33.11
> 12 0.0 0.0 0.0 100.0 | 3 3 | 31.46
> 13 0.0 0.0 0.0 100.0 | 3 3 | 32.04
> 14 0.0 0.0 0.0 100.0 | 3 3 | 31.59
> 15 0.0 0.0 0.0 100.0 | 3 3 | 31.83
> 16 100.0 0.0 0.0 0.0 | 0 0 | 33.95
> 17 1.2 0.0 0.0 98.8 | 0 3 *| 44.66
> 18 100.0 0.0 0.0 0.0 | 0 0 | 34.02
> 19 55.4 0.0 0.0 44.6 | 3 0 | 37.11
> 20 100.0 0.0 0.0 0.0 | 0 0 | 33.62
> 21 0.0 100.0 0.0 0.0 | 1 1 | 33.33
> 22 0.0 100.0 0.0 0.0 | 1 1 | 33.26
> 23 0.0 100.0 0.0 0.0 | 1 1 | 33.41
> 24 0.0 100.0 0.0 0.0 | 1 1 | 33.14
> 25 0.0 0.0 100.0 0.0 | 2 2 | 33.26
> 26 0.0 0.0 100.0 0.0 | 2 2 | 33.05
> 27 0.0 0.0 3.1 96.9 | 2 3 *| 43.50
> 28 100.0 0.0 0.0 0.0 | 0 0 | 34.02
> 29 60.4 0.0 39.6 0.0 | 2 0 *| 37.05
> 30 0.0 0.0 0.0 100.0 | 3 3 | 31.53
> 31 100.0 0.0 0.0 0.0 | 0 0 | 33.94
> 32 0.0 0.0 100.0 0.0 | 2 2 | 33.03
> AverageUserTime 34.05 seconds
> ElapsedTime 80.34
> TotalUserTime 1090.10
> TotalSysTime 2.15
>
> ---------------------------------------------------------------
> D: numasched : pooling only (idle pool delay set to 0)
>
> Running 'hackbench 20' in the background: Time: 2.066
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 0.0 0.0 0.0 100.0 | 3 3 | 28.50
> 2 100.0 0.0 0.0 0.0 | 0 0 | 30.97
> 3 0.0 0.0 0.0 100.0 | 3 3 | 28.49
> 4 0.0 0.0 100.0 0.0 | 2 2 | 28.65
> 5 100.0 0.0 0.0 0.0 | 0 0 | 31.18
> 6 100.0 0.0 0.0 0.0 | 0 0 | 30.96
> 7 5.7 94.3 0.0 0.0 | 0 1 *| 38.69
> 8 0.0 0.0 100.0 0.0 | 2 2 | 28.65
> AverageUserTime 30.76 seconds
> ElapsedTime 39.72
> TotalUserTime 246.23
> TotalSysTime 0.46
>
> Running 'hackbench 20' in the background: Time: 5.178
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 100.0 0.0 0.0 0.0 | 0 0 | 32.01
> 2 100.0 0.0 0.0 0.0 | 0 0 | 31.82
> 3 0.0 3.4 0.0 96.6 | 1 3 *| 41.01
> 4 100.0 0.0 0.0 0.0 | 0 0 | 32.24
> 5 100.0 0.0 0.0 0.0 | 0 0 | 31.91
> 6 0.0 0.0 100.0 0.0 | 2 2 | 31.75
> 7 0.0 0.0 0.0 100.0 | 3 3 | 30.55
> 8 0.0 0.0 100.0 0.0 | 2 2 | 31.74
> 9 0.0 0.0 100.0 0.0 | 2 2 | 31.74
> 10 0.0 0.0 0.0 100.0 | 3 3 | 30.63
> 11 0.0 0.0 0.0 100.0 | 3 3 | 30.67
> 12 0.0 0.0 100.0 0.0 | 2 2 | 31.75
> 13 0.0 100.0 0.0 0.0 | 1 1 | 31.94
> 14 0.0 100.0 0.0 0.0 | 1 1 | 32.20
> 15 0.0 100.0 0.0 0.0 | 1 1 | 32.24
> 16 0.0 100.0 0.0 0.0 | 1 1 | 32.07
> AverageUserTime 32.27 seconds
> ElapsedTime 42.97
> TotalUserTime 516.53
> TotalSysTime 1.16
>
> Running 'hackbench 20' in the background: Time: 4.505
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 0.0 49.4 0.0 50.6 | 3 3 | 36.48
> 2 0.3 99.7 0.0 0.0 | 0 1 *| 39.21
> 3 0.0 100.0 0.0 0.0 | 1 1 | 32.23
> 4 100.0 0.0 0.0 0.0 | 0 0 | 32.88
> 5 100.0 0.0 0.0 0.0 | 0 0 | 32.98
> 6 100.0 0.0 0.0 0.0 | 0 0 | 32.97
> 7 0.0 18.3 81.7 0.0 | 2 2 | 35.13
> 8 0.0 5.7 94.3 0.0 | 2 2 | 34.00
> 9 0.0 0.0 100.0 0.0 | 2 2 | 33.14
> 10 0.0 3.3 96.7 0.0 | 2 2 | 34.22
> 11 0.0 100.0 0.0 0.0 | 1 1 | 32.34
> 12 0.0 100.0 0.0 0.0 | 1 1 | 32.37
> 13 0.0 13.4 0.0 86.6 | 3 3 | 34.55
> 14 0.0 0.0 0.0 100.0 | 3 3 | 33.93
> 15 0.0 0.0 0.0 100.0 | 3 3 | 33.08
> 16 100.0 0.0 0.0 0.0 | 0 0 | 33.31
> 17 100.0 0.0 0.0 0.0 | 0 0 | 33.08
> 18 0.0 8.2 0.0 91.8 | 3 3 | 33.76
> 19 0.0 0.0 0.0 100.0 | 3 3 | 33.23
> 20 0.0 32.7 0.0 67.3 | 3 3 | 34.79
> 21 0.0 0.0 0.0 100.0 | 3 3 | 32.94
> 22 100.0 0.0 0.0 0.0 | 0 0 | 33.23
> 23 0.0 100.0 0.0 0.0 | 1 1 | 32.35
> 24 0.0 0.0 100.0 0.0 | 2 2 | 33.44
> 25 0.0 100.0 0.0 0.0 | 1 1 | 32.42
> 26 0.0 0.0 100.0 0.0 | 2 2 | 33.25
> 27 0.0 0.0 100.0 0.0 | 2 2 | 33.82
> 28 0.0 0.0 100.0 0.0 | 2 2 | 32.88
> 29 0.0 0.0 0.0 100.0 | 3 3 | 31.99
> 30 0.0 100.0 0.0 0.0 | 1 1 | 31.90
> 31 100.0 0.0 0.0 0.0 | 0 0 | 32.93
> 32 99.6 0.0 0.4 0.0 | 2 0 *| 41.48
> AverageUserTime 33.76 seconds
> ElapsedTime 78.35
> TotalUserTime 1080.75
> TotalSysTime 2.03
>
> -----------------------------------------------------------
>
> E: node affine scheduler with initial load balancing, statically
> assigned homenode
>
> Running 'hackbench 20' in the background: Time: 2.330
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 0.0 0.0 100.0 0.0 | 2 2 | 28.86
> 2 0.0 100.0 0.0 0.0 | 1 1 | 28.82
> 3 0.0 100.0 0.0 0.0 | 1 1 | 28.85
> 4 0.0 0.0 0.0 100.0 | 3 3 | 28.47
> 5 100.0 0.0 0.0 0.0 | 0 0 | 28.69
> 6 0.0 0.0 100.0 0.0 | 2 2 | 28.65
> 7 0.0 0.0 0.0 100.0 | 3 3 | 28.49
> 8 100.0 0.0 0.0 0.0 | 0 0 | 28.55
> AverageUserTime 28.67 seconds
> ElapsedTime 30.28
> TotalUserTime 229.51
> TotalSysTime 0.43
>
>
> Running 'hackbench 20' in the background: Time: 5.562
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 0.0 100.0 0.0 0.0 | 1 1 | 31.47
> 2 0.0 100.0 0.0 0.0 | 1 1 | 31.38
> 3 0.0 0.0 0.0 100.0 | 3 3 | 31.53
> 4 0.0 0.0 0.0 100.0 | 3 3 | 31.46
> 5 100.0 0.0 0.0 0.0 | 0 0 | 31.64
> 6 100.0 0.0 0.0 0.0 | 0 0 | 31.57
> 7 100.0 0.0 0.0 0.0 | 0 0 | 31.57
> 8 0.0 0.0 0.0 100.0 | 3 3 | 31.51
> 9 0.0 0.0 100.0 0.0 | 2 2 | 31.63
> 10 0.0 0.0 100.0 0.0 | 2 2 | 31.56
> 11 0.0 0.0 100.0 0.0 | 2 2 | 31.42
> 12 0.0 100.0 0.0 0.0 | 1 1 | 31.44
> 13 0.0 0.0 0.0 100.0 | 3 3 | 31.43
> 14 0.0 0.0 100.0 0.0 | 2 2 | 31.61
> 15 100.0 0.0 0.0 0.0 | 0 0 | 31.34
> 16 0.0 100.0 0.0 0.0 | 1 1 | 31.36
> AverageUserTime 31.50 seconds
> ElapsedTime 36.98
> TotalUserTime 504.17
> TotalSysTime 1.03
>
> Running 'hackbench 20' in the background: Time: 5.288
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 0.0 0.0 46.1 53.9 | 2 3 *| 37.11
> 2 0.0 0.0 0.0 100.0 | 3 3 | 31.32
> 3 0.0 100.0 0.0 0.0 | 1 1 | 33.15
> 4 0.0 4.9 95.1 0.0 | 2 2 | 33.55
> 5 0.0 0.0 100.0 0.0 | 2 2 | 32.81
> 6 0.0 100.0 0.0 0.0 | 1 1 | 32.88
> 7 0.0 100.0 0.0 0.0 | 1 1 | 33.31
> 8 82.9 17.1 0.0 0.0 | 1 0 *| 40.70
> 9 100.0 0.0 0.0 0.0 | 0 0 | 33.36
> 10 9.3 0.0 0.0 90.7 | 0 3 *| 42.12
> 11 100.0 0.0 0.0 0.0 | 0 0 | 33.38
> 12 0.0 0.0 0.0 100.0 | 3 3 | 31.28
> 13 0.0 0.0 0.0 100.0 | 3 3 | 31.43
> 14 0.0 100.0 0.0 0.0 | 1 1 | 33.33
> 15 0.0 100.0 0.0 0.0 | 1 1 | 33.04
> 16 100.0 0.0 0.0 0.0 | 0 0 | 33.02
> 17 95.2 4.8 0.0 0.0 | 0 0 | 33.82
> 18 3.8 0.0 0.0 96.2 | 0 3 *| 43.21
> 19 0.0 0.0 0.0 100.0 | 3 3 | 31.11
> 20 0.0 0.0 100.0 0.0 | 2 2 | 32.19
> 21 0.0 100.0 0.0 0.0 | 1 1 | 33.43
> 22 0.0 0.0 100.0 0.0 | 2 2 | 33.03
> 23 0.0 0.0 100.0 0.0 | 2 2 | 33.06
> 24 100.0 0.0 0.0 0.0 | 0 0 | 33.04
> 25 100.0 0.0 0.0 0.0 | 0 0 | 33.07
> 26 0.0 0.0 100.0 0.0 | 2 2 | 33.07
> 27 0.0 0.0 100.0 0.0 | 2 2 | 33.22
> 28 0.0 0.0 0.0 100.0 | 3 3 | 31.06
> 29 0.0 15.3 84.7 0.0 | 2 2 | 33.25
> 30 0.0 100.0 0.0 0.0 | 1 1 | 33.36
> 31 85.0 0.0 0.0 15.0 | 0 0 | 34.50
> 32 0.0 86.4 0.0 13.6 | 1 1 | 34.27
> AverageUserTime 33.89 seconds
> ElapsedTime 76.84
> TotalUserTime 1084.86
> TotalSysTime 2.12
>
> --------------------------------------------------------
>
> F: node affine scheduler, dynamically assigned homenode, no
> initial load balancing
>
> Running 'hackbench 20' in the background: Time: 1.861
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 0.0 0.0 0.0 100.0 | 3 3 | 30.75
> 2 100.0 0.0 0.0 0.0 | 0 0 | 27.12
> 3 0.0 0.0 0.0 100.0 | 3 3 | 30.60
> 4 0.0 0.0 0.0 100.0 | 3 3 | 30.65
> 5 0.0 100.0 0.0 0.0 | 1 1 | 28.65
> 6 0.0 0.0 78.4 21.6 | 3 2 *| 36.63
> 7 0.0 100.0 0.0 0.0 | 1 1 | 28.57
> 8 0.0 0.0 100.0 0.0 | 2 2 | 27.65
> AverageUserTime 30.08 seconds
> ElapsedTime 38.28
> TotalUserTime 240.75
> TotalSysTime 0.49
>
> Running 'hackbench 20' in the background: Time: 5.936
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 0.0 0.0 0.0 100.0 | 3 3 | 31.47
> 2 100.0 0.0 0.0 0.0 | 0 0 | 29.90
> 3 88.2 0.0 4.6 7.2 | 2 0 *| 40.96
> 4 0.0 0.0 0.0 100.0 | 3 3 | 31.53
> 5 0.0 0.0 0.0 100.0 | 3 3 | 31.50
> 6 0.0 100.0 0.0 0.0 | 1 1 | 32.25
> 7 0.0 84.5 0.0 15.5 | 1 1 | 33.41
> 8 0.0 0.0 100.0 0.0 | 2 2 | 32.38
> 9 83.0 17.0 0.0 0.0 | 1 0 *| 39.29
> 10 0.0 0.0 0.0 100.0 | 3 3 | 31.51
> 11 0.0 100.0 0.0 0.0 | 1 1 | 32.38
> 12 0.0 0.0 100.0 0.0 | 2 2 | 32.30
> 13 0.0 100.0 0.0 0.0 | 1 1 | 32.06
> 14 100.0 0.0 0.0 0.0 | 0 0 | 29.85
> 15 0.0 0.0 100.0 0.0 | 2 2 | 32.18
> 16 0.0 0.0 100.0 0.0 | 2 2 | 32.39
> AverageUserTime 32.83 seconds
> ElapsedTime 42.51
> TotalUserTime 525.59
> TotalSysTime 1.14
>
> Running 'hackbench 20' in the background: Time: 4.277
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 0.0 0.0 0.0 100.0 | 3 3 | 31.69
> 2 0.0 0.0 100.0 0.0 | 2 2 | 33.81
> 3 100.0 0.0 0.0 0.0 | 0 0 | 33.10
> 4 1.4 4.9 2.8 90.9 | 2 3 *| 44.26
> 5 0.0 0.0 0.0 100.0 | 3 3 | 31.76
> 6 0.0 0.0 0.0 100.0 | 3 3 | 31.69
> 7 0.0 0.0 3.6 96.4 | 2 3 *| 43.73
> 8 0.0 100.0 0.0 0.0 | 1 1 | 33.07
> 9 0.0 0.0 100.0 0.0 | 2 2 | 34.04
> 10 0.0 100.0 0.0 0.0 | 1 1 | 33.24
> 11 0.0 100.0 0.0 0.0 | 1 1 | 33.23
> 12 100.0 0.0 0.0 0.0 | 0 0 | 33.77
> 13 0.0 100.0 0.0 0.0 | 1 1 | 33.10
> 14 100.0 0.0 0.0 0.0 | 0 0 | 33.26
> 15 0.0 100.0 0.0 0.0 | 1 1 | 33.16
> 16 100.0 0.0 0.0 0.0 | 0 0 | 33.23
> 17 3.9 0.0 0.0 96.1 | 0 3 *| 42.74
> 18 0.0 100.0 0.0 0.0 | 1 1 | 32.97
> 19 100.0 0.0 0.0 0.0 | 0 0 | 33.21
> 20 0.0 0.0 92.1 7.9 | 2 2 | 34.30
> 21 0.0 0.0 95.9 4.1 | 2 2 | 34.31
> 22 0.0 0.0 100.0 0.0 | 2 2 | 33.88
> 23 0.0 0.0 0.0 100.0 | 3 3 | 30.85
> 24 0.0 100.0 0.0 0.0 | 1 1 | 32.99
> 25 0.0 0.0 100.0 0.0 | 2 2 | 33.84
> 26 0.0 100.0 0.0 0.0 | 1 1 | 33.35
> 27 100.0 0.0 0.0 0.0 | 0 0 | 33.29
> 28 0.0 0.0 100.0 0.0 | 2 2 | 33.79
> 29 100.0 0.0 0.0 0.0 | 0 0 | 33.25
> 30 100.0 0.0 0.0 0.0 | 0 0 | 33.41
> 31 0.0 0.0 100.0 0.0 | 2 2 | 33.92
> 32 0.0 0.0 0.0 100.0 | 3 3 | 31.16
> AverageUserTime 34.11 seconds
> ElapsedTime 77.38
> TotalUserTime 1091.86
> TotalSysTime 2.14
>
--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2002-10-07 16:48:24

by Erich Focht

Subject: Re: [RFC] NUMA schedulers benchmark results

On Monday 07 October 2002 02:58, Martin J. Bligh wrote:
> > As I'm rewriting the node affine scheduler to be more modular, I'll
> > redo the tests for cases D, E, F on top of 2.5.X kernels soon.
>
> It's really hard to see anything comparing 2.4 vs 2.5 results,
> so when you have results for everything across the same kernel,
> it'll be a lot easier to decipher ...

The user times are really comparable. And total user time, too. I'm
taking the node affine patches to 2.5.39 and will rerun them ASAP.

> For the record, I really don't care whose code gets accepted, I'm
> just interested in having a scheduler that works well for me ;-)

I was running several of these to see which features help and to try
to understand where the advantage comes from. In the output tables
you can see where the job has spent its time and on which node it started.
You can also recognize from the user time how well a job ran. Alone on
its node: 27s, away from its memory: 36s (or so), sharing the node
with others: 27-32s, running on the wrong node with others: up to 43s. And
seeing these you can understand how well a feature works or how much
you miss another feature. It wasn't my plan to advertise the node
affine scheduler, it just has most of the features.

> However, I think it would be a good idea to run some other benchmarks
> on these, that are a little more general ... dbench, kernel compile,
> sdet, whatever, as *well* as your own benchmark ... I think we're
> working on that.
Yes, great!

> So lower numbers are better, right? So Michael's stuff seems to
> work fine at the higher end, but poorly at the lower end - I think
> this may be due to a bug he found on Friday, if that gets fixed, it
> might make a difference.

OK, but how about the node-wise selection and balancing? Then we get
the approaches closer...

> I'll run your tests on the NUMA-Q comparing 2.5.40 vs Michael's stuff.
> If you send me a cleaned up version (ie without the ia64 stuff in
> generic code) of your stuff against 2.5.40 (or ideally 2.5.40-mm2),
> I'll run that too ... (if you're putting the latencies into arch code,
> you can set i386 (well NUMA-Q) for 20:1 if you think that'll help).

Is in preparation. First step: pooling scheduler without multi-level
support but node-wise initial balancing.

> PS. the weird "this one was run without hackbench" thing makes the
> results even harder to read ...
Sorry about that. Didn't have a newer measurement.

> PPS. Does Kimio's discontigmem patch work for you on 2.5?
Yes. We're trying to get that stuff in but David's away from his email
for a while as you know.

Regards,
Erich

2002-10-07 20:07:20

by Erich Focht

Subject: [PATCH] pooling NUMA scheduler with initial load balancing

Hi,

here come two patches which together implement the pooling NUMA scheduler
with initial load balancing stripped of the node affine scheduler.

First patch does:
- implement CPU pools structure
- changes find_busiest_queue to first scan the own node, then (if no
imbalance found) find the most loaded node and steal a task from it
if the imbalance exceeds 25%.
- The steal is delayed 100ms if we steal from a remote node and the own
node is averagely (within 25%) loaded.
- The steal is delayed 1ms if the own node is unloaded (this gives
CPUs from the own node an advantage).

I dropped the multilevel structure as it is only complicating things at
this stage and can be introduced later. The patch is prepared for it.
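
The scan order described above can be sketched as a small userspace model.
This is a hedged illustration of the logic, not the patch's actual code:
the array layout and the integer handling of the 25% threshold are my
assumptions.

```c
#define NODES 4
#define CPUS_PER_NODE 4

/* Model: per-CPU run-queue lengths, grouped by node. */
static int rq_len[NODES][CPUS_PER_NODE];

/* Busiest CPU on one node; -1 unless its load exceeds my_load by >25%. */
static int busiest_on_node(int node, int my_load)
{
    int cpu, best = -1;
    int best_load = my_load + my_load / 4;      /* the 25% threshold */

    for (cpu = 0; cpu < CPUS_PER_NODE; cpu++) {
        if (rq_len[node][cpu] > best_load) {
            best_load = rq_len[node][cpu];
            best = cpu;
        }
    }
    return best;
}

/* Scan the own node first; only if no imbalance is found there, find
 * the most loaded remote node and steal from its busiest CPU. */
static int find_busiest(int my_node, int my_load, int *node_out)
{
    int node, cpu, best_node = -1, best_sum = 0;

    cpu = busiest_on_node(my_node, my_load);
    if (cpu >= 0) {
        *node_out = my_node;
        return cpu;
    }
    for (node = 0; node < NODES; node++) {
        int c, sum = 0;

        if (node == my_node)
            continue;
        for (c = 0; c < CPUS_PER_NODE; c++)
            sum += rq_len[node][c];
        if (sum > best_sum) {
            best_sum = sum;
            best_node = node;
        }
    }
    if (best_node < 0)
        return -1;
    *node_out = best_node;
    return busiest_on_node(best_node, my_load);
}
```

The steal delays (100ms remote, 1ms idle) would sit in the caller, between
finding the busiest queue and actually migrating the task.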

The second patch implements initial balancing by selecting the most
unloaded node and its most unloaded CPU for an exec'ed task. It does
a round-robin selection of the next node if the loads are equal.
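
That selection could be modeled in userspace like this (a hedged sketch;
the names, the global CPU numbering, and the round-robin cursor are my
assumptions, not the patch's identifiers):

```c
#define NODES 4
#define CPUS_PER_NODE 4

static int node_load[NODES];                 /* per-node run-queue sums */
static int cpu_load[NODES][CPUS_PER_NODE];   /* per-CPU run-queue lengths */
static int rr_next;                          /* round-robin tie-break cursor */

/* For a task that just exec'ed: pick the least loaded node (rotating
 * among equally loaded ones), then that node's least loaded CPU. */
static int sched_balance_exec_model(void)
{
    int i, cpu, best_cpu = 0;
    int node = rr_next % NODES;
    int min_load = node_load[node];

    for (i = 1; i < NODES; i++) {
        int n = (rr_next + i) % NODES;

        if (node_load[n] < min_load) {
            min_load = node_load[n];
            node = n;
        }
    }
    rr_next = (node + 1) % NODES;            /* rotate for equal loads */

    for (cpu = 1; cpu < CPUS_PER_NODE; cpu++)
        if (cpu_load[node][cpu] < cpu_load[node][best_cpu])
            best_cpu = cpu;
    return node * CPUS_PER_NODE + best_cpu;  /* global CPU number */
}
```

Starting the scan at the cursor is what spreads simultaneously exec'ed
tasks across nodes when all loads are equal.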

The node affine patch will be on top of this.

The pool setup is pretty messy; it needs some sort of RCU to become
nicer. Also, undefining CONFIG_NUMA_SCHED could lead to errors, I didn't
clean this up. This is mainly for testing and as a base for experimenting
with the node affine extensions.

Michael, you could just take the second patch for initial load balancing
of your approach. Compared to your patch this one tries to achieve
equal node load also in the find_busiest_queue step. With more effort
and code, though... I hope it works for you; I removed the arch
dependencies and just need __cpu_to_node().

Regards,
Erich


Attachments:
01-numa_sched_core5-noML-2.5.39.patch (20.67 kB)
02-numa_sched_ilb-2.5.39.patch (2.28 kB)

2002-10-07 20:31:31

by Erich Focht

Subject: Re: [RFC] NUMA schedulers benchmark results

Hi Michael,
On Monday 07 October 2002 18:37, Michael Hohnbaum wrote:
> These numbers look pretty good for your scheduler. It would be good
> if you could post numbers against 2.5.x where x is the same for all
> tests. I thought you had your scheduler working on 2.5.x recently.
I had it running under 2.5.35, but some of the important peripherals
of the machine didn't work with that and I hate doing experiments
when I only have the serial console. I'll get the node affine stuff
finished tomorrow and will try to find a testing slot (that's another
problem :-( )

> Running numa_test on my systems I find a lot of variation (as much
> as 10%) between identical runs. Are all of these results averaged
> over multiple runs? Any idea why the hackbench times vary so widely?
No, they are not averaged. I often made 2-3 runs and they usually didn't
differ significantly (within 2-3% of average user time), so I just took
one. Would be cool to have a script for averaging several results, I'll
make that if I get some time, but then we can't look at the distribution
over the nodes. And that's what I find most useful for understanding
what's going on.
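
A minimal averaging helper in the spirit of that script could look like
this (a sketch; it assumes the `+-` figures reported later in the thread
are mean plus/minus sample standard deviation, and uses a hand-rolled
Newton iteration to stay independent of libm):

```c
/* Newton iteration for the square root, to avoid a libm dependency. */
static double sqrt_newton(double s)
{
    double x = s > 1.0 ? s : 1.0;
    int i;

    if (s <= 0.0)
        return 0.0;
    for (i = 0; i < 60; i++)
        x = 0.5 * (x + s / x);
    return x;
}

/* Mean and sample standard deviation of n timing results. */
static void mean_stddev(const double *t, int n, double *mean, double *sd)
{
    double sum = 0.0, var = 0.0;
    int i;

    for (i = 0; i < n; i++)
        sum += t[i];
    *mean = sum / n;
    for (i = 0; i < n; i++)
        var += (t[i] - *mean) * (t[i] - *mean);
    *sd = (n > 1) ? sqrt_newton(var / (n - 1)) : 0.0;
}
```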

> Looking at the results for my scheduler, it is real obvious what the
> algorithm is that I use for selecting a cpu for a process. This has
> not been quite as apparent on my systems, probably due to other tasks
> running in the background. The code in your scheduler that causes a
> different node to start the search for a cpu really pays off here.
Yes, but it's not perfect, as it cannot predict how long the processes
will live. Also it takes into account every running task, no matter
whether it's just a short-lived migration_thread or something else.
A running average might be better, but I have no experience with that.
Any ideas?
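
One common form of such a running average is an exponentially decaying
one; a hedged sketch (the 1/8 decay factor and the 1/1024 fixed-point
scale are assumptions for illustration):

```c
/* Fixed-point exponentially decayed load average in 1/1024 units:
 * each update keeps 7/8 of the history and blends in 1/8 of the
 * instantaneous load, so a briefly waking migration_thread barely
 * moves the value while a long-running hog converges to FIX_1. */
#define FIX_1 1024

static unsigned long decay_load(unsigned long avg, unsigned long load)
{
    return avg - avg / 8 + (load * FIX_1) / 8;
}
```

Called once per tick (or per balance interval), this would let the
balancer ignore tasks that were only momentarily runnable.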

Regarding the hackbench times: it forks 20 tasks which each fork 20
tasks, and these send messages to each other. Depending on the load
balancing (which is not really predictable) it can sometimes happen that
you end up with bad distributions of senders/receivers.
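
The structure just described can be approximated by a tiny userspace
model (a hedged sketch: one shared pipe and a single collector in the
parent, instead of hackbench's real mesh of socketpairs between every
sender and receiver):

```c
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* One shared pipe, nsenders forked writers, the parent collects all
 * bytes.  Returns the total received, which should be exactly
 * nsenders * msgs * msgsize. */
static long run_group(int nsenders, int msgs, int msgsize)
{
    char buf[256];
    long total = 0;
    int fd[2], i;

    if (msgsize > (int)sizeof(buf) || pipe(fd) < 0)
        return -1;

    for (i = 0; i < nsenders; i++) {
        if (fork() == 0) {                   /* sender child */
            int m;

            memset(buf, 'x', msgsize);
            for (m = 0; m < msgs; m++)
                if (write(fd[1], buf, msgsize) != msgsize)
                    _exit(1);
            _exit(0);
        }
    }
    close(fd[1]);                            /* parent keeps read end only */
    for (;;) {
        ssize_t n = read(fd[0], buf, sizeof(buf));

        if (n <= 0)                          /* EOF once all senders exit */
            break;
        total += n;
    }
    close(fd[0]);
    while (wait(NULL) > 0)                   /* reap the senders */
        ;
    return total;
}
```

Where the scheduler places those short-lived senders relative to the
readers is exactly the hard-to-predict part that makes the elapsed times
vary.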

Best regards,
Erich

2002-10-08 17:28:48

by Erich Focht

Subject: Re: [PATCH] pooling NUMA scheduler with initial load balancing

On Tuesday 08 October 2002 08:08, Martin J. Bligh wrote:
> I did:
>
> 1. remove asm/numa.h declaration (you need to work out something
> generic here ... mmzone.h should include it, I think?)
> 2. changed NR_NODES to MAX_NUMNODES everywhere.
> 3. Put in an extern declaration into fs/exec.c for sched_balance_exec
>
> Then I got bored around the linking problem. Does this compile for
> you somehow?

Aaargh, you got the wrong second patch :-( Sorry for that...

Thanks for the hints, I cleaned up the first patch, too. No
CONFIG_NUMA_SCHED any more, switched to MAX_NUMNODES, including
asm/numa.h from asm/topology.h, so no need for you to see it.

Erich


Attachments:
01-numa_sched_core6-2.5.39.patch (18.86 kB)
02-numa_sched_ilb-2.5.39.patch (2.72 kB)

2002-10-08 17:56:44

by Christoph Hellwig

Subject: Re: [PATCH] pooling NUMA scheduler with initial load balancing

On Tue, Oct 08, 2002 at 07:33:06PM +0200, Erich Focht wrote:
> Aaargh, you got the wrong second patch :-( Sorry for that...
>
> Thanks for the hints, I cleaned up the first patch, too. No
> CONFIG_NUMA_SCHED any more, switched to MAX_NUMNODES, including
> asm/numa.h from asm/topology.h, so no need for you to see it.


diff -urNp a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c Fri Sep 27 23:49:54 2002
+++ b/arch/i386/kernel/smpboot.c Tue Oct 8 11:37:56 2002
@@ -1194,6 +1194,11 @@ int __devinit __cpu_up(unsigned int cpu)
void __init smp_cpus_done(unsigned int max_cpus)
{
zap_low_mappings();
+#ifdef CONFIG_NUMA
+ pooldata_lock();
+ bld_pools();
+ pooldata_unlock();
+#endif

All callers of bld_pools() need the pooldata lock - taking
it inside that function makes the code a little more readable..
Also I'd suggest to rename bld_pools() to build_pools() ;)

- cache_decay_ticks = 10; /* XXX base this on PAL info and cache-bandwidth estimate */
+ cache_decay_ticks = 8; /* XXX base this on PAL info and cache-bandwidth estimate */

Could you explain this change? And its effect on non-NUMA IA64
machines?

/**
+ * atomic_inc_return - increment atomic variable and return new value
+ * @v: pointer of type atomic_t
+ *
+ * Atomically increments @v by 1 and return it's new value. Note that
+ * the guaranteed useful range of an atomic_t is only 24 bits.
+ */
+static inline int atomic_inc_return(atomic_t *v){
+ atomic_inc(v);
+ return v->counter;
+}

How do you guarantee this is atomic? Please make it fit
Documentation/CodingStyle, btw...

+int numpools, pool_ptr[MAX_NUMNODES+1], pool_cpus[NR_CPUS], pool_nr_cpus[MAX_NUMNODES];
+unsigned long pool_mask[MAX_NUMNODES];

Hmm, shouldn't those [MAX_NUMNODES] arrays be in some per-node array
to avoid cacheline-bouncing?

+void pooldata_lock(void)
+{
+ int i;
+ retry:
+ while (atomic_read(&pool_lock));
+ if (atomic_inc_return(&pool_lock) > 1) {
+ atomic_dec(&pool_lock);
+ goto retry;
+ }

Why not a simple spin_lock()?

+ /*
+ * Wait a while, any loops using pool data should finish
+ * in between. This is VERY ugly and should be replaced
+ * by some real RCU stuff. [EF]
+ */
+ for (i=0; i<100; i++)
+ udelay(1000);

Urgg. I'd suggest you switch to RCU now and make your patch apply
ontop of it - another reason to apply the RCU core patch..

+void pooldata_unlock(void)
+{
+ atomic_dec(&pool_lock);
+}

Dito for spin_unlock.

+ /* avoid deadlock by timer interrupt on own cpu */
+ if (atomic_read(&pool_lock)) return;

spin_trylock..

All in all your code doesn't seem to be very cacheline-friendly,
lots of global bouncing. Do you have any numbers on what your
patch changes for normal SMP configurations?

2002-10-08 19:47:06

by Martin J. Bligh

Subject: Re: [PATCH] pooling NUMA scheduler with initial load balancing

> Aaargh, you got the wrong second patch :-( Sorry for that...

No problem

> Thanks for the hints, I cleaned up the first patch, too. No
> CONFIG_NUMA_SCHED any more, switched to MAX_NUMNODES, including
> asm/numa.h from asm/topology.h, so no need for you to see it.

Cool. Well it compiles now, but:

CPU 3 IS NOW UP!
Starting migration thread for cpu 3
Bringing up 4
CP4 4ivi e e UPr: 0000
ing
igrU:i 4
eaIP: p060:[<c0114231>] Not tainted
EFLAGS: 00010046
EIP is at calc_pool_load+0x109/0x120
eax: 00000000 ebx: 00000001 ecx: c034f380 edx: 00000000
esi: c028efa0 edi: 00000020 ebp: c3a37efc esp: c3a37ed4
ds: 0068 es: 0068 ss: 0068
Process swapper (pid: 0, threadinfo=c3a36000 task=f01bf060)
Stack: c028f948 c028efa0 f01bf060 00002928 00002944 00002944 ffffffff ffffffff
c028efa0 ffffffff c3a37f78 c01142d5 c028f948 00000004 00000001 00000001
c3a36000 c028efa0 f01bf060 00000010 00000008 00000000 00000001 00002a13
Call Trace:
[<c01142d5>] load_balance+0x8d/0x5c4
[<c0118a8d>] call_console_drivers+0xdd/0xe4
[<c0118cc2>] release_console_sem+0x42/0xa4
[<c0114c21>] schedule+0xd5/0x3ac
[<c0105300>] default_idle+0x0/0x34
[<c01053bf>] cpu_idle+0x43/0x48
[<c0118cc2>] release_console_sem+0x42/0xa4
[<c0118c1d>] printk+0x125/0x140

Code: f7 3c 99 8b 55 08 89 84 9a 80 00 00 00 8b 45 f4 5b 5e 5f 89

2002-10-09 10:24:48

by Erich Focht

Subject: Re: [PATCH] pooling NUMA scheduler with initial load balancing

Hi Christoph,

thanks for looking at this and the comments. Replies are below.

On Wednesday 09 October 2002 03:15, Christoph Hellwig wrote:
> On Tue, Oct 08, 2002 at 07:33:06PM +0200, Erich Focht wrote:
> diff -urNp a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
> --- a/arch/i386/kernel/smpboot.c Fri Sep 27 23:49:54 2002
> +++ b/arch/i386/kernel/smpboot.c Tue Oct 8 11:37:56 2002
> @@ -1194,6 +1194,11 @@ int __devinit __cpu_up(unsigned int cpu)
> void __init smp_cpus_done(unsigned int max_cpus)
> {
> zap_low_mappings();
> +#ifdef CONFIG_NUMA
> + pooldata_lock();
> + bld_pools();
> + pooldata_unlock();
> +#endif
>
> All callers of bld_pools() need the pooldata lock - taking
> it inside that function makes the code a little more readable..
> Also I'd suggest to rename bld_pools() to build_pools() ;)
The separation is due to the needs of CPU hot-add. When integrating
with Kimio's CPU hotplug patch within the Atlas project, we found that
there was more to do in the scheduler blocked phase when adding one CPU
to the running system (besides recomputing the pools). Looking too far
into the future here...

> - cache_decay_ticks = 10; /* XXX base this on PAL info and cache-bandwidth
> estimate */ + cache_decay_ticks = 8; /* XXX base this on PAL info and
> cache-bandwidth estimate */
>
> Could you explain this change? And it's affect on non-NUMA IA64
> machines?

At some stage I looked up the minimal timeslice and found it to be 10.
This means that when you have two tasks on a runqueue and zero on
another, you might see both tasks as absolutely cache-hot and cannot
steal any of them. cache_decay_ticks is a very rough approximation
anyway. With the task_to_steal() selection from the patch this
change should have no significant impact because we make a ranking
and select the cache coolest task which slept more than 8ms. The default
O(1) behavior is to select the first task which slept more than 10ms,
without the ranking.
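
The ranking just described can be modeled like this (a sketch following
the description above, not the actual patch; the struct layout and the
8-tick threshold constant are assumptions):

```c
#define STEAL_THRESHOLD 8   /* ticks a task must have slept to be stealable */

struct task_model {
    long last_run;          /* tick when the task last ran on its CPU */
};

/* Rank all candidates and take the cache-coolest one that slept more
 * than the threshold; plain O(1) would instead take the first task
 * past cache_decay_ticks.  Returns -1 if everything is still hot. */
static int task_to_steal_model(const struct task_model *t, int n, long now)
{
    long best_cold = STEAL_THRESHOLD;
    int i, best = -1;

    for (i = 0; i < n; i++) {
        long cold = now - t[i].last_run;

        if (cold > best_cold) {
            best_cold = cold;
            best = i;
        }
    }
    return best;
}
```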

> /**
> + * atomic_inc_return - increment atomic variable and return new value
> + * @v: pointer of type atomic_t
> + *
> + * Atomically increments @v by 1 and return it's new value. Note that
> + * the guaranteed useful range of an atomic_t is only 24 bits.
> + */
> +static inline int atomic_inc_return(atomic_t *v){
> + atomic_inc(v);
> + return v->counter;
> +}
>
> How do you guarantee this is atomic? Please make it fit
> Documentation/CodingStyle, btw...

This code comes from a port of the node affine scheduler done by
somebody from IBM for the 2.4.18 version. On IA64 we have really atomic
stuff for this operation. It's a quick hack for now. I'm not good at
ia32 assembly and don't know how to replace it.
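
For reference, the required semantics -- increment and read as one
indivisible operation -- can be stated with C11 atomics (a modern
illustration of what the helper must guarantee; a 2.5-era i386 kernel
would need `lock xadd` inline assembly instead):

```c
#include <stdatomic.h>

/* Atomically add 1 and return the new value.  Unlike the quoted
 * helper, the increment and the read are one indivisible operation,
 * so two concurrent callers can never observe the same result. */
static inline int atomic_inc_return_model(atomic_int *v)
{
    return atomic_fetch_add(v, 1) + 1;  /* fetch_add returns the old value */
}
```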

> +int numpools, pool_ptr[MAX_NUMNODES+1], pool_cpus[NR_CPUS],
> pool_nr_cpus[MAX_NUMNODES]; +unsigned long pool_mask[MAX_NUMNODES];
>
> Hmm, shouldn't those [MAX_NUMNODES] arrays be in some per-node array
> to avoid cacheline-bouncing?

They could be replicated but I don't think this is necessary. They
populate cachelines on each CPU. As these arrays are never changed,
there is no need for bouncing them. The cachelines on every CPU are
only invalidated when some CPU changes these arrays, which just doesn't
happen after boot (except when you add a CPU, which you'll do rarely).

> +void pooldata_lock(void)
> +{
> + int i;
> + retry:
> + while (atomic_read(&pool_lock));
> + if (atomic_inc_return(&pool_lock) > 1) {
> + atomic_dec(&pool_lock);
> + goto retry;
> + }
>
> Why not a simple spin_lock()?

Actually we need something like write_lock and read_lock. But read_lock
counts the readers, therefore changes the lock and we get
cacheline-bouncing.

A spinlock here would be ok. Then the readers would need to do something
like "while (spin_is_locked(pool_lock));". This only reads the lock
and (if nobody changes it) avoids cacheline bouncing. We still need to
check that all readers went out of the section where the pool data is
accessed (following ugly block)...

> + /*
> + * Wait a while, any loops using pool data should finish
> + * in between. This is VERY ugly and should be replaced
> + * by some real RCU stuff. [EF]
> + */
> + for (i=0; i<100; i++)
> + udelay(1000);
>
> Urgg. I'd suggest you switch to RCU now and make your patch apply
> ontop of it - another reason to apply the RCU core patch..

I agree (see the comment). I didn't replace it by RCU because RCU is
not in the mainline and it is easier to develop with an ugly but portable
replacement. But add this to the list of potential users of RCU.
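
The reader-poll / writer-lock scheme described above can be modeled with
C11 atomics (a hedged userspace sketch of the intent, with the
grace-period wait elided; the names are mine, not the patch's):

```c
#include <stdatomic.h>

/* Writers (rare: boot, CPU hot-add) flip the flag with an atomic
 * compare-and-exchange; readers only load it, so the cacheline stays
 * in shared state and never bounces.  The grace-period wait -- the
 * udelay() loop that RCU would replace -- is elided here. */
static atomic_int pool_update_in_progress;

static void pooldata_write_begin(void)
{
    int expected = 0;

    while (!atomic_compare_exchange_weak(&pool_update_in_progress,
                                         &expected, 1))
        expected = 0;       /* spin until we flip 0 -> 1 */
    /* ... wait until all in-flight readers are done ... */
}

static void pooldata_write_end(void)
{
    atomic_store(&pool_update_in_progress, 0);
}

/* Reader side, e.g. from the timer tick: a pure load, no stores. */
static int pooldata_stable(void)
{
    return atomic_load(&pool_update_in_progress) == 0;
}
```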

> +void pooldata_unlock(void)
> +{
> + atomic_dec(&pool_lock);
> +}
>
> Dito for spin_unlock.
>
> + /* avoid deadlock by timer interrupt on own cpu */
> + if (atomic_read(&pool_lock)) return;
>
> spin_trylock..
No! No trylock here, that would change the lock and lead to cacheline
bouncing! We don't need the lock, just to make sure that the pool data is
not being changed right now.

> All in all your code doesn't seem to be very cacheline-friendly,
> lots of global bouncing. Do you have any numbers on what your
> patch changes for normal SMP configurations?

Once again: once the address of pool_lock is in the cache, I don't
expect any bouncing. The value of that lock almost never changes (except
in the boot phase and when you want to hot-add/remove a CPU), therefore
there is no reason for cache-line bouncing. The same applies for the
pool variables.

As you see, I thought a lot about cacheline bouncing and tried to avoid
it. I'll replace the pool_lock with a spin_lock, as the atomic_inc_return
for ia32 isn't really atomic, you're absolutely right here. As the
load balancing code is called pretty rarely (at most every 1ms, normally
every 250ms per CPU), additional cycles invested here are ok if the
benefit for NUMA is reasonable. If CONFIG_NUMA is undefined, the code
should be optimized by the compiler such that it looks like normal SMP
code. Still need some changes to get there, but that's not the big
problem.

Regards,
Erich

2002-10-09 16:22:23

by Erich Focht

Subject: Re: [PATCH] pooling NUMA scheduler with initial load balancing

On Tuesday 08 October 2002 21:44, Martin J. Bligh wrote:
> > Thanks for the hints, I cleaned up the first patch, too. No
> > CONFIG_NUMA_SCHED any more, switched to MAX_NUMNODES, including
> > asm/numa.h from asm/topology.h, so no need for you to see it.
>
> Cool. Well it compiles now, but:
>
> CPU 3 IS NOW UP!
> Starting migration thread for cpu 3
> Bringing up 4
> CP4 4ivi e e UPr: 0000
> ing
> igrU:i 4
> eaIP: p060:[<c0114231>] Not tainted
> EFLAGS: 00010046
> EIP is at calc_pool_load+0x109/0x120

This is strange. It works for me really reliably. I added a check
for non-online CPUs in calc_pool_load and changed the pool_lock to
be a spinlock. The patches are attached again.
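
Such a check presumably amounts to something like this userspace sketch
(names and array layout assumed, not the actual patch): skip CPUs that
are not online yet, and never divide by a pool's CPU count while it is
still zero during bring-up.

```c
/* Average load of a pool, returning 0 while the pool has no online
 * CPUs yet (during CPU bring-up) instead of dividing by zero. */
static int calc_pool_load_model(const int *cpu_load, const int *online,
                                int ncpus)
{
    int i, sum = 0, n = 0;

    for (i = 0; i < ncpus; i++) {
        if (!online[i])
            continue;       /* CPU not brought up yet */
        sum += cpu_load[i];
        n++;
    }
    return n ? sum / n : 0;
}
```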

Results so far (all based on 2.5.39 + ia64 + discontig, measured on
16 CPU Azusa), averaged over 4 measurements. The statistics show that
we need more measurements to get something meaningful.

2.5.39 numa_sched+ilb
numa_test 4 4
AverageUserTime 31.18+-0.67 31.25+- 2.51
ElapsedTime 41.59+-1.95 38.55+- 4.79
TotalUserTime 124.78+-2.71 125.08+-10.03

numa_test 8 4
AverageUserTime 30.84+-0.51 28.94+- 0.04
ElapsedTime 45.02+-1.54 30.00+- 0.24
TotalUserTime 246.82+-4.02 231.66+- 0.34

numa_test 16 4
AverageUserTime 33.61+- 0.72 32.15+- 0.33
ElapsedTime 47.16+- 0.95 37.45+- 3.20
TotalUserTime 538.05+-11.58 514.72+- 5.25

numa_test 32 4
AverageUserTime 37.94+- 0.74 33.36+- 0.02
ElapsedTime 84.83+- 1.28 68.78+- 0.18
TotalUserTime 1214.37+-23.64 1068.06+- 0.53

numa_test 64 4
AverageUserTime 39.58+- 1.34 33.64+- 0.09
ElapsedTime 168.04+- 4.77 142.84+- 5.44
TotalUserTime 2533.82+-85.96 2153.69+- 5.58

Regards,
Erich


Attachments:
01-numa_sched_core7-2.5.39.patch (18.81 kB)
02-numa_sched_ilb-2.5.39.patch (2.73 kB)

2002-10-09 17:30:21

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH] pooling NUMA scheduler with initial load balancing

> This is strange. It works for me really reliably. I added a check
> for non-online CPUs in calc_pool_load and changed the pool_lock to
> be a spinlock. The patches are attached again.

I'm testing on 2.5.41-mm1 ... your patches still apply clean. That
has a whole bunch of nice NUMA mods, you might want to try that
instead? All the changes in there will end up in mainline real soon
anyway ;-)

One minor warning:

arch/i386/kernel/smpboot.c: In function `smp_cpus_done':
arch/i386/kernel/smpboot.c:1199: warning: implicit declaration of function `bld_pools'

And the same panic:

Starting migration thread for cpu 3
Bringing up 4
CPU>dividNOWrro!
0St0
igrU: 4
ead :or 0060:[<c011422d>] Not tainted
EFLAGS: 00010046
EIP is at calc_pool_load+0x115/0x12c
eax: 00000000 ebx: 00000001 ecx: c028f948 edx: 00000000
esi: c034f380 edi: 00000020 ebp: c3a37efc esp: c3a37ed4
ds: 0068 es: 0068 ss: 0068
Process swapper (pid: 0, threadinfo=c3a36000 task=f01bf060)
Stack: c028f948 c028efa0 f01bf060 00002928 00002944 00002944 ffffffff ffffffff
c028efa0 ffffffff c3a37f78 c01142d1 c028f948 00000004 00000001 00000001
c3a36000 c028efa0 f01bf060 00000010 00000008 00000000 00000001 00002a13
Call Trace:
[<c01142d1>] load_balance+0x8d/0x5c8
[<c0118a9d>] call_console_drivers+0xdd/0xe4
[<c0118cd2>] release_console_sem+0x42/0xa4
[<c0114c21>] schedule+0xd5/0x3ac
[<c0105300>] default_idle+0x0/0x34
[<c01053bf>] cpu_idle+0x43/0x48
[<c0118cd2>] release_console_sem+0x42/0xa4
[<c0118c2d>] printk+0x125/0x140

Code: f7 3c 9e 89 84 99 80 00 00 00 8b 45 f4 5b 5e 5f 89 ec 5d c3

2002-10-09 18:03:47

by Andrew Theurer

Subject: Re: [PATCH] pooling NUMA scheduler with initial load balancing

> I'm testing on 2.5.41-mm1 ... your patches still apply clean. That
> has a whole bunch of nice NUMA mods, you might want to try that
> instead? All the changes in there will end up in mainline real soon
> anyway ;-)
>
> One minor warning:
>
> arch/i386/kernel/smpboot.c: In function `smp_cpus_done':
> arch/i386/kernel/smpboot.c:1199: warning: implicit declaration of function
> `bld_pools'
>
> And the same panic:
>
> Starting migration thread for cpu 3
> Bringing up 4
> CPU>dividNOWrro!

I got the same thing on 2.5.40-mm1. It looks like it may be a divide by
zero in calc_pool_load. I am attempting to boot a band-aid version right
now. OK, got a little further:

PCI: Cannot allocate resource region 1 of device 03:0a.0
PCI: Cannot allocate resource region 4 of device 03:0e.1
PCI: Cannot allocate resource region 2 of device 05:0a.0
PCI: Cannot allocate resource region 0 of device 05:0b.0
PCI: Cannot allocate resource region 4 of device 05:0e.1
PCI: Cannot allocate resource region 1 of device 07:0a.0
PCI: Cannot allocate resource region 0 of device 07:0b.0
PCI: Cannot allocate resource region 4 of device 07:0e.1
Starting kswapd
d<4>highmem bounce poo
sCPU: 3
eEIP: 0060:[<c011a3d3>] Not tainted
EFLAGS: 00010046
eax: 00000001 ebx: 00000000 ecx: 00000003 edx: 00000000
esi: c02c98c4 edi: c39c5880 ebp: f0199ee8 esp: f0199ec0
ds: 0068 es: 0068 ss: 0068
Process swapper (pid: 0, threadinfo=f0198000 task=f01c1740)
Stack: c39c58a0 c02c94c8 c02c993c 00000000 c02c94c4 00000000 0000007d c02c94a0
00000001 00000003 f0199f1c c0117bec c02c94a0 00000003 00000003 00000001
00000000 c01135d1 f01a3f10 00000000 00000003 f01c1740 c02cb4e0 f0199f44
Call Trace: [<c0117bec>] [<c01135d1>] [<c0117eed>] [<c0122587>] [<c0105420>]
[<c0125e60>] [<c0113ca7>] [<c0105420>] [<c010818e>] [<c0105420>] [<c0105420>]
[<c010544a>] [<c01054ea>] [<c011cf50>]
Code: f7 f3 3b 45 e4 7e 06 89 45 e4 89 7d ec 8b 45 d8 8b 00 39 f0

-Andrew Theurer


2002-10-09 18:18:53

by Andrew Theurer

Subject: Re: [PATCH] pooling NUMA scheduler with initial load balancing

On Wednesday 09 October 2002 12:58 pm, Andrew Theurer wrote:
> > I'm testing on 2.5.41-mm1 ... your patches still apply clean. That
> > has a whole bunch of nice NUMA mods, you might want to try that
> > instead? All the changes in there will end up in mainline real soon
> > anyway ;-)
> >
> > One minor warning:
> >
> > arch/i386/kernel/smpboot.c: In function `smp_cpus_done':
> > arch/i386/kernel/smpboot.c:1199: warning: implicit declaration of
> > function `bld_pools'
> >
> > And the same panic:
> >
> > Starting migration thread for cpu 3
> > Bringing up 4
> > CPU>dividNOWrro!
>
> I got the same thing on 2.5.40-mm1. It looks like it may be a divide by
> zero in calc_pool_load. I am attempting to boot a band-aid version right
> now. OK, got a little further:
>
A little more detail:
Code: f7 f3 3b 45 e4 7e 06 89 45 e4 89 7d ec 8b 45 d8 8b 00 39 f0
EFLAGS: 00010046
eax: 00000001 ebx: 00000000 ecx: 00000003 edx: 00000000
esi: c02c98c4 edi: c39c5880 ebp: f0199ee8 esp: f0199ec0
ds: 0068 es: 0068 ss: 0068
Stack: c39c58a0 c02c94c8 c02c993c 00000000 c02c94c4 00000000 0000007d c02c94a0
00000001 00000003 f0199f1c c0117bec c02c94a0 00000003 00000003 00000001
00000000 c01135d1 f01a3f10 00000000 00000003 f01c1740 c02cb4e0 f0199f44
Call Trace: [<c0117bec>] [<c01135d1>] [<c0117eed>] [<c0122587>] [<c0105420>]
[<c0125e60>] [<c0113ca7>] [<c0105420>] [<c010818e>] [<c0105420>] [<c0105420>]
[<c010544a>] [<c01054ea>] [<c011cf50>]
Code: f7 f3 3b 45 e4 7e 06 89 45 e4 89 7d ec 8b 45 d8 8b 00 39 f0
Using defaults from ksymoops -t elf32-i386 -a i386


>>esi; c02c98c4 <runqueues+424/15800>
>>edi; c39c5880 <END_OF_CODE+36419c4/????>
>>ebp; f0199ee8 <END_OF_CODE+2fe1602c/????>
>>esp; f0199ec0 <END_OF_CODE+2fe16004/????>

Trace; c0117bec <load_balance+8c/140>
Trace; c01135d1 <smp_call_function_interrupt+41/90>
Trace; c0117eed <scheduler_tick+24d/390>
Trace; c0122587 <tasklet_hi_action+77/c0>
Trace; c0105420 <default_idle+0/50>
Trace; c0125e60 <update_process_times+40/50>
Trace; c0113ca7 <smp_apic_timer_interrupt+117/120>
Trace; c0105420 <default_idle+0/50>
Trace; c010818e <apic_timer_interrupt+1a/20>
Trace; c0105420 <default_idle+0/50>
Trace; c0105420 <default_idle+0/50>
Trace; c010544a <default_idle+2a/50>
Trace; c01054ea <cpu_idle+3a/50>
Trace; c011cf50 <printk+140/180>

Code; 00000000 Before first symbol
00000000 <_EIP>:
Code; 00000000 Before first symbol
0: f7 f3 div %ebx
Code; 00000002 Before first symbol
2: 3b 45 e4 cmp 0xffffffe4(%ebp),%eax
Code; 00000005 Before first symbol
5: 7e 06 jle d <_EIP+0xd> 0000000d Before first symbol
Code; 00000007 Before first symbol
7: 89 45 e4 mov %eax,0xffffffe4(%ebp)
Code; 0000000a Before first symbol
a: 89 7d ec mov %edi,0xffffffec(%ebp)
Code; 0000000d Before first symbol
d: 8b 45 d8 mov 0xffffffd8(%ebp),%eax
Code; 00000010 Before first symbol
10: 8b 00 mov (%eax),%eax
Code; 00000012 Before first symbol
12: 39 f0 cmp %esi,%eax




2002-10-09 22:57:49

by Erich Focht

Subject: Re: [PATCH] pooling NUMA scheduler with initial load balancing

> > Starting migration thread for cpu 3
> > Bringing up 4
> > CPU>dividNOWrro!
>
> I got the same thing on 2.5.40-mm1. It looks like it may be a divide by
> zero in calc_pool_load. I am attempting to boot a band-aid version right
> now. OK, got a little further:
This opened my eyes, thanks for all your help and patience!!!

The problem is that the load balancer is called before the CPU pools
were set up. That's fine, I thought, because I define in sched_init
the default pool 0 to include all CPUs. But: in find_busiest_queue()
the cpu_to_node(this_cpu) delivers a non-zero pool which is not set up
yet, therefore pool_nr_cpus[pool]=0 and we get a zero divide.

I'm still wondering why this doesn't happen on our architecture. Maybe
the interrupts are disabled longer, I'll check. Anyway, a fix is to
force this_pool to be 0 as long as numpools=1. The attached patch is a
quick untested hack, maybe one can do it better. Has to be applied on top
of the other 2.

Going to sleep now...

Bye,
Erich


Attachments:
01a-numa_sched_fix-2.5.39.patch (1.27 kB)

2002-10-10 17:40:20

by Andrew Theurer

Subject: Re: [PATCH] pooling NUMA scheduler with initial load balancing

On Wednesday 09 October 2002 6:02 pm, Erich Focht wrote:
> > > Starting migration thread for cpu 3
> > > Bringing up 4
> > > CPU>dividNOWrro!
> >
> > I got the same thing on 2.5.40-mm1. It looks like it may be a divide
> > by zero in calc_pool_load. I am attempting to boot a band-aid version
> > right now. OK, got a little further:
>
> This opened my eyes, thanks for all your help and patience!!!
>
> The problem is that the load balancer is called before the CPU pools
> were set up. That's fine, I thought, because I define in sched_init
> the default pool 0 to include all CPUs. But: in find_busiest_queue()
> the cpu_to_node(this_cpu) delivers a non-zero pool which is not set up
> yet, therefore pool_nr_cpus[pool]=0 and we get a zero divide.
>
> I'm still wondering why this doesn't happen on our architecture. Maybe
> the interrupts are disabled longer, I'll check. Anyway, a fix is to
> force this_pool to be 0 as long as numpools=1. The attached patch is a
> quick untested hack, maybe one can do it better. Has to be applied on top
> of the other 2.

Thanks very much Erich. I did come across another problem here on numa-q. In
task_to_steal() there is a divide by cache_decay_ticks, which apparently is 0
on my system. This may have to do with notsc, but I am not sure. I set
cache_decay_ticks to 8 (I cannot boot without using notsc), which is probably
not correct, but I can now boot the 16-processor numa-q on 2.5.40-mm1 with your
patches! I'll get some benchmark results soon.

Andrew Theurer

2002-10-11 08:23:32

by Erich Focht

Subject: Re: [PATCH] pooling NUMA scheduler with initial load balancing

On Friday 11 October 2002 09:47, Erich Focht wrote:
> Hi Andrew,
>
> On Thursday 10 October 2002 19:34, Andrew Theurer wrote:
> > Thanks very much Erich. I did come across another problem here on
> > numa-q. In task_to_steal() there is a divide by cache_decay_ticks, which
> > apparently is 0 on my system. This may have to do with notsc, but I am
> > not sure. I set cache_decay_ticks to 8 (I cannot boot without using
> > notsc), which is probably not correct, but I can now boot the 16-processor
> > numa-q on 2.5.40-mm1 with your patches! I'll get some benchmark results
> > soon.
>
> oops... This is a bug in 2.5-i386. It means that the O(1) scheduler in
> 2.5 doesn't work well either because it doesn't take into account cache
> coolness. I'll post a fix to LKML in a few minutes.

Sorry, I thought the smp_tune_scheduling() call got lost during the
transition to the new cpu boot scheme. But it's there. And the problem
is indeed "notsc". So you'll have to fix it, I can't.

If you set the cache_decay_ticks to something non-zero, you should
_really_ do this for all the scheduler tests, otherwise your measurements
will not be comparable.

Regards,
Erich

2002-10-11 14:46:44

by Martin J. Bligh

Subject: Re: [PATCH] pooling NUMA scheduler with initial load balancing

> Sorry, I thought the smp_tune_scheduling() call got lost during the
> transition to the new cpu boot scheme. But it's there. And the problem
> is indeed "notsc". So you'll have to fix it, I can't.

Errrm ... not sure what you mean by this. notsc is a perfectly
valid thing to do, so if your patch panics with that option, I
suggest you make something conditional on notsc inside your patch?
Works just fine without your patch, or with Michael's patch ....

M.

2002-10-11 15:25:10

by Erich Focht

Subject: Re: [PATCH] pooling NUMA scheduler with initial load balancing

On Friday 11 October 2002 16:47, Martin J. Bligh wrote:
> > Sorry, I thought the smp_tune_scheduling() call got lost during the
> > transition to the new cpu boot scheme. But it's there. And the problem
> > is indeed "notsc". So you'll have to fix it, I can't.
>
> Errrm ... not sure what you mean by this. notsc is a perfectly
> valid thing to do, so if your patch panics with that option, I
> suggest you make something conditional on notsc inside your patch?
> Works just fine without your patch, or with Michael's patch ....

Martin,

arch/i386/kernel/smpboot.c:smp_tune_scheduling() says:

	if (!cpu_khz) {
		/*
		 * this basically disables processor-affinity
		 * scheduling on SMP without a TSC.
		 */
		cacheflush_time = 0;
		return;
	}

If you boot with notsc, you won't have cache affinity on your machine,
which means that the load_balancer eventually selects cache-hot tasks
for stealing. The O(1) scheduler doesn't do that under normal conditions!

Of course I'll add something to my patch such that it doesn't crash
if cache_decay_ticks is unset. But you might be measuring the wrong things
right now if you leave cache_decay_ticks=0, as then cache affinity
on NUMAQ is switched off with the vanilla O(1) scheduler and with Michael's
patch. In other words: you cannot evaluate the impact of Michael's patches
if you don't fix that. This issue is independent of my patches.

Regards,
Erich

2002-10-11 15:36:58

by Martin J. Bligh

Subject: Re: [PATCH] pooling NUMA scheduler with initial load balancing

> arch/i386/kernel/smpboot.c:smp_tune_scheduling() says:
>
> 	if (!cpu_khz) {
> 		/*
> 		 * this basically disables processor-affinity
> 		 * scheduling on SMP without a TSC.
> 		 */
> 		cacheflush_time = 0;
> 		return;
> 	}
>
> If you boot with notsc, you won't have cache affinity on your machine,
> which means that the load_balancer eventually selects cache-hot tasks
> for stealing. The O(1) scheduler doesn't do that under normal conditions!

OK, that makes more sense ... I'll go stare at the code some more and
see what can be done.

> Of course I'll add something to my patch such that it doesn't crash
> if cache_decay_ticks is unset. But you might be measuring the wrong things
> right now if you leave cache_decay_ticks=0, as then cache affinity
> on NUMAQ is switched off with the vanilla O(1) scheduler and with Michael's
> patch. In other words: you cannot evaluate the impact of Michael's patches
> if you don't fix that. This issue is independent of my patches.

OK, I'll make sure the tests are uniform so we get a fair
comparison.

Thanks!

M.