2003-01-09 23:52:51

by Martin J. Bligh

Subject: Miniature NUMA scheduler

I tried a small experiment today - did a simple restriction of
the O(1) scheduler to only balance inside a node. Coupled with
the small initial load balancing patch floating around, this
covers 95% of cases, is a trivial change (3 lines), performs
just as well as Erich's patch on a kernel compile, and actually
better on schedbench.

This is NOT meant to be a replacement for the code Erich wrote,
it's meant to be a simple way to get integration and acceptance.
Code that just forks and never execs will stay on one node - but
we can take the code Erich wrote and put it in a separate rebalancer
that fires much less often to do a cross-node rebalance. All that
would be under #ifdef CONFIG_NUMA, the only thing that would touch
mainline is these three lines of change, and it's trivial to see
they're completely equivalent to the current code on non-NUMA systems.

I also believe that this is the more correct approach in design; it
should result in much less cross-node migration of tasks and less
scanning of remote runqueues.

Opinions / comments?

M.

Kernbench:
Elapsed User System CPU
2.5.54-mjb3 19.41s 186.38s 39.624s 1191.4%
2.5.54-mjb3-mjbsched 19.508s 186.356s 39.888s 1164.6%

Schedbench 4:
AvgUser Elapsed TotalUser TotalSys
2.5.54-mjb3 0.00 35.14 88.82 0.64
2.5.54-mjb3-mjbsched 0.00 31.84 88.91 0.49

Schedbench 8:
AvgUser Elapsed TotalUser TotalSys
2.5.54-mjb3 0.00 47.55 269.36 1.48
2.5.54-mjb3-mjbsched 0.00 41.01 252.34 1.07

Schedbench 16:
AvgUser Elapsed TotalUser TotalSys
2.5.54-mjb3 0.00 76.53 957.48 4.17
2.5.54-mjb3-mjbsched 0.00 69.01 792.71 2.74

Schedbench 32:
AvgUser Elapsed TotalUser TotalSys
2.5.54-mjb3 0.00 145.20 1993.97 11.05
2.5.54-mjb3-mjbsched 0.00 117.47 1798.93 5.95

Schedbench 64:
AvgUser Elapsed TotalUser TotalSys
2.5.54-mjb3 0.00 307.80 4643.55 20.36
2.5.54-mjb3-mjbsched 0.00 241.04 3589.55 12.74

-----------------------------------------

diff -purN -X /home/mbligh/.diff.exclude virgin/kernel/sched.c mjbsched/kernel/sched.c
--- virgin/kernel/sched.c Mon Dec 9 18:46:15 2002
+++ mjbsched/kernel/sched.c Thu Jan 9 14:09:17 2003
@@ -654,7 +654,7 @@ static inline unsigned int double_lock_b
/*
* find_busiest_queue - find the busiest runqueue.
*/
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance, unsigned long cpumask)
{
int nr_running, load, max_load, i;
runqueue_t *busiest, *rq_src;
@@ -689,7 +689,7 @@ static inline runqueue_t *find_busiest_q
busiest = NULL;
max_load = 1;
for (i = 0; i < NR_CPUS; i++) {
- if (!cpu_online(i))
+ if (!cpu_online(i) || !((1 << i) & cpumask) )
continue;

rq_src = cpu_rq(i);
@@ -764,7 +764,8 @@ static void load_balance(runqueue_t *thi
struct list_head *head, *curr;
task_t *tmp;

- busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+ busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance,
+ __node_to_cpu_mask(__cpu_to_node(this_cpu)) );
if (!busiest)
goto out;


---------------------------------------------------

A tiny change in the current ilb patch is also needed to stop it
using a macro from the first patch:

diff -purN -X /home/mbligh/.diff.exclude ilbold/kernel/sched.c ilbnew/kernel/sched.c
--- ilbold/kernel/sched.c Thu Jan 9 15:20:53 2003
+++ ilbnew/kernel/sched.c Thu Jan 9 15:27:49 2003
@@ -2213,6 +2213,7 @@ static void sched_migrate_task(task_t *p
static int sched_best_cpu(struct task_struct *p)
{
int i, minload, load, best_cpu, node = 0;
+ unsigned long cpumask;

best_cpu = task_cpu(p);
if (cpu_rq(best_cpu)->nr_running <= 2)
@@ -2226,9 +2227,11 @@ static int sched_best_cpu(struct task_st
node = i;
}
}
+
minload = 10000000;
- loop_over_node(i,node) {
- if (!cpu_online(i))
+ cpumask = __node_to_cpu_mask(node);
+ for (i = 0; i < NR_CPUS; ++i) {
+ if (!(cpumask & (1 << i)))
continue;
if (cpu_rq(i)->nr_running < minload) {
best_cpu = i;




2003-01-10 05:28:21

by Michael Hohnbaum

Subject: Re: [Lse-tech] Miniature NUMA scheduler

On Thu, 2003-01-09 at 15:54, Martin J. Bligh wrote:
> I tried a small experiment today - did a simple restriction of
> the O(1) scheduler to only balance inside a node. Coupled with
> the small initial load balancing patch floating around, this
> covers 95% of cases, is a trivial change (3 lines), performs
> just as well as Erich's patch on a kernel compile, and actually
> better on schedbench.
>
> This is NOT meant to be a replacement for the code Erich wrote,
> it's meant to be a simple way to get integration and acceptance.
> Code that just forks and never execs will stay on one node - but
> we can take the code Erich wrote and put it in a separate rebalancer
> that fires much less often to do a cross-node rebalance.

I tried this on my 4 node NUMAQ (16 procs, 16GB memory) and got
similar results. Also, added in the cputime_stats patch and am
attaching the matrix results from the 32 process run. Basically,
all runs show that the initial load balancer is able to place the
tasks evenly across the nodes, and the better overall times show
that not migrating to keep the nodes balanced over time results
in better performance - at least on these boxes.

Obviously, there can be situations where load balancing across
nodes is necessary, but these results point to less load balancing
being better, at least on these machines. It will be interesting
to repeat this on boxes with other interconnects.

$ reportbench hacksched54 sched54 stock54
Kernbench:
Elapsed User System CPU
hacksched54 29.406s 282.18s 81.716s 1236.8%
sched54 29.112s 283.888s 82.84s 1259.4%
stock54 31.348s 303.134s 87.824s 1247.2%

Schedbench 4:
AvgUser Elapsed TotalUser TotalSys
hacksched54 21.94 31.93 87.81 0.53
sched54 22.03 34.90 88.15 0.75
stock54 49.35 57.53 197.45 0.86

Schedbench 8:
AvgUser Elapsed TotalUser TotalSys
hacksched54 28.23 31.62 225.87 1.11
sched54 27.95 37.12 223.66 1.50
stock54 43.14 62.97 345.18 2.12

Schedbench 16:
AvgUser Elapsed TotalUser TotalSys
hacksched54 49.29 71.31 788.83 2.88
sched54 55.37 69.58 886.10 3.79
stock54 66.00 81.25 1056.25 7.12

Schedbench 32:
AvgUser Elapsed TotalUser TotalSys
hacksched54 56.41 117.98 1805.35 5.90
sched54 57.93 132.11 1854.01 10.74
stock54 77.81 173.26 2490.31 12.37

Schedbench 64:
AvgUser Elapsed TotalUser TotalSys
hacksched54 56.62 231.93 3624.42 13.32
sched54 72.91 308.87 4667.03 21.06
stock54 86.68 368.55 5548.57 25.73

hacksched54 = 2.5.54 + Martin's tiny NUMA patch +
03-cputimes_stat-2.5.53.patch +
02-numa-sched-ilb-2.5.53-21.patch
sched54 = 2.5.54 + 01-numa-sched-core-2.5.53-24.patch +
02-ilb-2.5.53-21.patch02 +
03-cputimes_stat-2.5.53.patch
stock54 = 2.5.54 + 03-cputimes_stat-2.5.53.patch

Detailed results from running numa_test 32:

Executing 32 times: ./randupdt 1000000
Running 'hackbench 20' in the background: Time: 4.557
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 100.0 0.0 0.0 0.0 | 0 0 | 57.12
2 0.0 100.0 0.0 0.0 | 1 1 | 55.89
3 100.0 0.0 0.0 0.0 | 0 0 | 55.39
4 0.0 0.0 100.0 0.0 | 2 2 | 56.67
5 0.0 0.0 0.0 100.0 | 3 3 | 57.08
6 0.0 100.0 0.0 0.0 | 1 1 | 56.31
7 100.0 0.0 0.0 0.0 | 0 0 | 57.11
8 0.0 0.0 0.0 100.0 | 3 3 | 56.66
9 0.0 100.0 0.0 0.0 | 1 1 | 55.87
10 0.0 0.0 100.0 0.0 | 2 2 | 55.83
11 0.0 0.0 0.0 100.0 | 3 3 | 56.01
12 0.0 100.0 0.0 0.0 | 1 1 | 56.56
13 0.0 0.0 100.0 0.0 | 2 2 | 56.31
14 0.0 0.0 0.0 100.0 | 3 3 | 56.40
15 100.0 0.0 0.0 0.0 | 0 0 | 55.82
16 0.0 100.0 0.0 0.0 | 1 1 | 57.47
17 0.0 0.0 100.0 0.0 | 2 2 | 56.76
18 0.0 0.0 0.0 100.0 | 3 3 | 57.10
19 0.0 100.0 0.0 0.0 | 1 1 | 57.26
20 0.0 0.0 100.0 0.0 | 2 2 | 56.48
21 0.0 0.0 0.0 100.0 | 3 3 | 56.65
22 0.0 100.0 0.0 0.0 | 1 1 | 55.81
23 0.0 0.0 100.0 0.0 | 2 2 | 55.77
24 0.0 0.0 0.0 100.0 | 3 3 | 56.91
25 0.0 100.0 0.0 0.0 | 1 1 | 56.86
26 0.0 0.0 100.0 0.0 | 2 2 | 56.62
27 0.0 0.0 0.0 100.0 | 3 3 | 57.16
28 0.0 0.0 100.0 0.0 | 2 2 | 56.36
29 100.0 0.0 0.0 0.0 | 0 0 | 55.72
30 100.0 0.0 0.0 0.0 | 0 0 | 56.00
31 100.0 0.0 0.0 0.0 | 0 0 | 55.48
32 100.0 0.0 0.0 0.0 | 0 0 | 55.59
AverageUserTime 56.41 seconds
ElapsedTime 117.98
TotalUserTime 1805.35
TotalSysTime 5.90

Runs with 4, 8, 16, and 64 processes all showed the same even
distribution across all nodes and 100% time on node for each
process.
--
Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2003-01-10 16:26:06

by Erich Focht

Subject: Re: [Lse-tech] Miniature NUMA scheduler

Hi Martin & Michael,

indeed, restricting a process to the node on which it was started
helps, as the memory will always be local. The NUMA scheduler allows a
process to move away from its node if the load conditions require
it, but in the current form the process will not come back to its
homenode. That's what the "node affine scheduler" tried to realise.

The miniature NUMA scheduler relies on the quality of the initial load
balancer, and that one seems to be good enough. As you mentioned,
multithreaded jobs are disadvantaged as their threads have to stick on
the originating node.

Having some sort of automatic node affinity of processes and equal
node loads in mind (as design targets), we could:
- take the minimal NUMA scheduler
- if the normal (node-restricted) find_busiest_queue() fails and
certain conditions are fulfilled (tried to balance inside own node
for a while and didn't succeed, own CPU idle, etc... ???) rebalance
over node boundaries (eg. using my load balancer)
This actually resembles the original design of the node affine
scheduler; having the cross-node balancing separate is ok and might
make the ideas clearer.
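
Spelled out as code, the idea might look roughly like this (just a sketch in
plain C; all helper names are made up and only illustrate the trigger logic,
they are not from any of the patches):

/*
 * Illustrative model of the proposed two-level balancing.
 * find_busiest_queue_in_node(), find_busiest_queue_crossnode(),
 * cpu_is_idle() and node_balance_failures() are hypothetical stubs.
 */
#include <stddef.h>

struct runqueue;

extern struct runqueue *find_busiest_queue_in_node(int this_cpu);
extern struct runqueue *find_busiest_queue_crossnode(int this_cpu);
extern int cpu_is_idle(int this_cpu);
extern int node_balance_failures(int this_cpu);

#define NODE_BALANCE_RETRIES 10  /* "tried for a while" threshold (made up) */

static struct runqueue *pick_balance_target(int this_cpu)
{
    /* first balance strictly inside the own node (mini NUMA scheduler) */
    struct runqueue *busiest = find_busiest_queue_in_node(this_cpu);

    if (busiest)
        return busiest;

    /* only if intra-node balancing kept failing and the CPU is idle,
     * fall back to a cross-node rebalance */
    if (cpu_is_idle(this_cpu) &&
        node_balance_failures(this_cpu) > NODE_BALANCE_RETRIES)
        return find_busiest_queue_crossnode(this_cpu);

    return NULL;
}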

I made some measurements, too, and found basically what I
expected. The numbers are from a machine with 4 nodes of 2 CPUs
each. It's on ia64, so 2.5.52 based.

As the minsched cannot make mistakes (by moving tasks away from their
homenode), it leads to the best results with numa_test. OTOH hackbench
suffers a lot from the limitation to one node. The hackbench tasks are
not latency/bandwidth limited: the faster they spread over the whole
machine, the quicker the job finishes. That's why NUMA-sched is
slightly worse than a stock kernel, but minsched loses >50%. Funnily,
on my machine kernbench is slightly faster with the normal NUMA
scheduler.

Regards,
Erich

Results on a 8 CPU machine with 4 nodes (2 CPUs per node).

kernbench:
elapsed user system
stock52 134.52(0.84) 951.64(0.97) 20.72(0.22)
sched52 133.19(1.49) 944.24(0.50) 21.36(0.24)
minsched52 135.47(0.47) 937.61(0.20) 21.30(0.14)

schedbench/hackbench: time(s)
10 25 50 100
stock52 0.81(0.04) 2.06(0.07) 4.09(0.13) 7.89(0.25)
sched52 0.81(0.04) 2.03(0.07) 4.14(0.20) 8.61(0.35)
minsched52 1.28(0.05) 3.19(0.06) 6.59(0.13) 13.56(0.27)

numabench/numa_test 4
AvgUserTime ElapsedTime TotUserTime TotSysTime
stock52 --- 27.23(0.52) 89.30(4.18) 0.09(0.01)
sched52 22.32(1.00) 27.39(0.42) 89.29(4.02) 0.10(0.01)
minsched52 20.01(0.01) 23.40(0.13) 80.05(0.02) 0.08(0.01)

numabench/numa_test 8
AvgUserTime ElapsedTime TotUserTime TotSysTime
stock52 --- 27.50(2.58) 174.74(6.66) 0.18(0.01)
sched52 21.73(1.00) 33.70(1.82) 173.87(7.96) 0.18(0.01)
minsched52 20.31(0.00) 23.50(0.12) 162.47(0.04) 0.16(0.01)

numabench/numa_test 16
AvgUserTime ElapsedTime TotUserTime TotSysTime
stock52 --- 52.68(1.51) 390.03(15.10) 0.34(0.01)
sched52 21.51(0.80) 47.18(3.24) 344.29(12.78) 0.36(0.01)
minsched52 20.50(0.03) 43.82(0.08) 328.05(0.45) 0.34(0.01)

numabench/numa_test 32
AvgUserTime ElapsedTime TotUserTime TotSysTime
stock52 --- 102.60(3.89) 794.57(31.72) 0.65(0.01)
sched52 21.93(0.57) 92.46(1.10) 701.75(18.38) 0.67(0.02)
minsched52 20.64(0.10) 89.95(3.16) 660.72(3.13) 0.68(0.07)



On Friday 10 January 2003 06:36, Michael Hohnbaum wrote:
> [...]

2003-01-10 16:49:10

by Martin J. Bligh

Subject: Re: [Lse-tech] Miniature NUMA scheduler

> Having some sort of automatic node affinity of processes and equal
> node loads in mind (as design targets), we could:
> - take the minimal NUMA scheduler
> - if the normal (node-restricted) find_busiest_queue() fails and
> certain conditions are fulfilled (tried to balance inside own node
> for a while and didn't succeed, own CPU idle, etc... ???) rebalance
> over node boundaries (eg. using my load balancer)
> This actually resembles the original design of the node affine
> scheduler, having the cross-node balancing separate is ok and might
> make the ideas clearer.

This seems like the right approach to me, apart from the trigger to
do the cross-node rebalance. I don't believe that has anything to do
with whether we're internally balanced within a node or not; it's
whether the nodes are balanced relative to each other. I think we should
just check that every N ticks, looking at node load averages, and do
a cross-node rebalance if they're "significantly out".

The definition of "N ticks" and "significantly out" would be a tunable
number, defined by each platform; roughly speaking, the lower the NUMA
ratio, the lower these numbers would be. That also allows us to wedge
all sorts of smarts in the NUMA rebalance part of the scheduler, such
as moving the tasks with the smallest RSS off node. The NUMA rebalancer
is obviously completely missing from the current implementation, and
I expect we'd use mainly Erich's current code to implement that.
However, it's surprising how well we do with no rebalancer at all,
apart from the exec-time initial load balance code.
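
A rough sketch of what such a tick-driven trigger could look like
(illustrative only; CROSS_NODE_INTERVAL, NODE_IMBALANCE_PCT and node_load()
are invented names, and both constants would really be per-platform tunables
scaled by the NUMA ratio):

/* Illustrative: periodic cross-node rebalance trigger. */
#define CROSS_NODE_INTERVAL 100  /* "every N ticks" (made-up value) */
#define NODE_IMBALANCE_PCT  125  /* "significantly out" threshold */

extern int node_load(int node);  /* running tasks on a node (stub) */
extern int numnodes;

static int should_cross_node_balance(int this_node, unsigned long ticks)
{
    int i, this_load, max_load = 0;

    if (ticks % CROSS_NODE_INTERVAL)
        return 0;

    this_load = node_load(this_node);
    for (i = 0; i < numnodes; i++)
        if (i != this_node && node_load(i) > max_load)
            max_load = node_load(i);

    /* rebalance only if some other node is significantly busier */
    return max_load * 100 > this_load * NODE_IMBALANCE_PCT;
}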

Another big advantage of this approach is that it *obviously* changes
nothing at all for standard SMP systems (whereas your current patch does),
so it should be much easier to get it accepted into mainline ....

M.

2003-01-11 14:38:06

by Bill Davidsen

Subject: Re: [Lse-tech] Miniature NUMA scheduler

On Fri, 10 Jan 2003, Erich Focht wrote:

> Having some sort of automatic node affinity of processes and equal
> node loads in mind (as design targets), we could:
> - take the minimal NUMA scheduler
> - if the normal (node-restricted) find_busiest_queue() fails and
> certain conditions are fulfilled (tried to balance inside own node
> for a while and didn't succeed, own CPU idle, etc... ???) rebalance
> over node boundaries (eg. using my load balancer)
> This actually resembles the original design of the node affine
> scheduler, having the cross-node balancing separate is ok and might
> make the ideas clearer.

Agreed, but honestly just this explanation would make it easier to
understand! I'm not sure you have the "balance of nodes" trigger defined
quite right, but I'm assuming if this gets implemented as described that
some long-term imbalance detection mechanism will be run occasionally.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-01-12 23:15:56

by Erich Focht

Subject: Re: [Lse-tech] Miniature NUMA scheduler

On Saturday 11 January 2003 15:43, Bill Davidsen wrote:
> Agreed, but honestly just this explanation would make it easier to
> understand! I'm not sure you have the "balance of nodes" trigger defined
> quite right, but I'm assuming if this gets implemented as described that
> some long-term imbalance detection mechanism will be run occasionally.

Yes, the current plan is to extend the miniature NUMA scheduler with an
inter-node balancer which is called less frequently.

Regards,
Erich


2003-01-12 23:25:51

by Erich Focht

Subject: Re: [Lse-tech] Miniature NUMA scheduler

On Friday 10 January 2003 17:57, Martin J. Bligh wrote:
> This seems like the right approach to me, apart from the trigger to
> do the cross-node rebalance. I don't believe that has anything to do
> with when we're internally balanced within a node or not, it's
> whether the nodes are balanced relative to each other. I think we should
> just check that every N ticks, looking at node load averages, and do
> a cross-node rebalance if they're "significantly out".

OK, I changed my mind about the trigger and made some experiments with
the cross-node balancer called after every N calls of
load_balance. If we make N tunable, we can even have some dynamics in
case the nodes are unbalanced: make N small if the current node is
less loaded than the average node loads, make N large if the node is
averagely loaded. I'll send the patches in a separate email.
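
A sketch of that variable interval (the MIN/MAX values below match the
(4,40) variant measured later in this thread, but the helper names are
invented):

#define MIN_INTERNODE_LB  4   /* underloaded node: look across nodes often */
#define MAX_INTERNODE_LB 40   /* averagely loaded node: look across rarely */

extern int node_load(int node);   /* stub: running tasks on a node */
extern int avg_node_load(void);   /* stub: total running tasks / numnodes */

static int internode_lb_interval(int this_node)
{
    /* less loaded than average: try to steal from other nodes sooner */
    if (node_load(this_node) < avg_node_load())
        return MIN_INTERNODE_LB;
    return MAX_INTERNODE_LB;
}

/* load_balance() would widen its cpumask to the busiest remote node only
 * every internode_lb_interval()-th call. */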

> The NUMA rebalancer
> is obviously completely missing from the current implementation, and
> I expect we'd use mainly Erich's current code to implement that.
> However, it's surprising how well we do with no rebalancer at all,
> apart from the exec-time initial load balance code.

The fact that we're doing fine on kernbench and numa_test/schedbench
is (I think) understandable. In both benchmarks a process cannot
change its node during its lifetime and therefore has minimal memory
latency. In numa_test the "disturbing" hackbench just cannot move away
any of the tasks from their originating node. Therefore the results
are the best possible.

Regards,
Erich

2003-01-12 23:46:31

by Erich Focht

Subject: NUMA scheduler 2nd approach

Hi Martin & Michael,

as discussed on the LSE call I played around with a cross-node
balancer approach put on top of the miniature NUMA scheduler. The
patches are attached and it seems to be clear that we can regain the
good performance for hackbench by adding a cross-node balancer.

The patches are:

01-minisched-2.5.55.patch : the miniature scheduler from
Martin. Balances strictly within a node. Added an inlined function to
make the integration of the cross-node balancer easier. The added code
is optimised away by the compiler.

02-initial-lb-2.5.55.patch : Michael's initial load balancer at
exec().

03-internode-lb-2.5.55.patch : internode load balancer core. Called
every INTERNODE_LB calls of the intra-node load balancer. The most
loaded node's cpu-mask is ORed into the own node's cpu-mask and
find_busiest_in_mask() finds the most loaded CPU in this set.

04-smooth-node-load-2.5.55.patch : The node load measure is smoothed
by adding half of the previous node load (and 1/4 of the one before,
etc..., as discussed in the LSE call; see the sketch after this list).
This should improve the behavior a bit in case of short-lived load
peaks and avoid bouncing tasks between nodes.

05-var-intnode-lb2-2.5.55.patch : Replaces the fixed INTERNODE_LB
interval (between cross-node balancer calls) by a variable
node-specific interval. Currently only two values used. Certainly
needs some tweaking and tuning.
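
To make the smoothing in patch 04 concrete, here is a small standalone model
of the decaying load measure (variable names are illustrative, not taken from
the patch):

/* Each call mixes the instantaneous node load with the previous smoothed
 * value, so older samples enter with weights 1/2, 1/4, 1/8, ... */
#include <stdio.h>

#define MAX_NUMNODES 4

static int prev_load[MAX_NUMNODES];

static int smoothed_node_load(int node, int raw_load)
{
    int load = raw_load + prev_load[node] / 2;  /* new = raw + old/2 */

    prev_load[node] = load;
    return load;
}

int main(void)
{
    /* a short load spike on node 0 fades out over a few samples instead
     * of immediately triggering a cross-node steal */
    int samples[] = { 2, 8, 2, 2, 2 };
    int i;

    for (i = 0; i < 5; i++)
        printf("sample %d: smoothed load %d\n", i,
               smoothed_node_load(0, samples[i]));
    return 0;
}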

01, 02, 03 are enough to produce a working NUMA scheduler; when
including all patches the behavior should be better. I made some
measurements which I'd like to post in a separate email tomorrow.

Comments? Ideas?

Regards,
Erich


Attachments:
01-minisched-2.5.55.patch (1.42 kB)
02-initial-lb-2.5.55.patch (5.48 kB)
03-internode-lb-2.5.55.patch (2.25 kB)
04-smooth-node-load-2.5.55.patch (2.18 kB)
05-var-intnode-lb2-2.5.55.patch (2.19 kB)

2003-01-13 07:53:28

by Christoph Hellwig

Subject: Re: NUMA scheduler 2nd approach

On Mon, Jan 13, 2003 at 12:55:28AM +0100, Erich Focht wrote:
> Hi Martin & Michael,
>
> as discussed on the LSE call I played around with a cross-node
> balancer approach put on top of the miniature NUMA scheduler. The
> patches are attached and it seems to be clear that we can regain the
> good performance for hackbench by adding a cross-node balancer.

The changes look fine to me, but I think there are some coding style
issues that need cleaning up (see below). Also, is there a reason
patches 2/3 and 4/5 aren't merged into one patch each?

/*
- * find_busiest_queue - find the busiest runqueue.
+ * find_busiest_in_mask - find the busiest runqueue among the cpus in cpumask
*/
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static inline runqueue_t *find_busiest_in_mask(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance, unsigned long cpumask)


find_busiest_queue has just one caller in 2.5.56, so I'd suggest just
changing the prototype and updating that single caller to pass in
the cpumask opencoded.


@@ -160,7 +160,6 @@ extern void update_one_process(struct ta
extern void scheduler_tick(int user_tick, int system);
extern unsigned long cache_decay_ticks;

-
#define MAX_SCHEDULE_TIMEOUT LONG_MAX
extern signed long FASTCALL(schedule_timeout(signed long timeout));
asmlinkage void schedule(void);

I don't think you need this spurious whitespace change :)

+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
+ rq->nr_running++
+#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
+ rq->nr_running--

static inline void nr_running_inc(runqueue_t *rq)
{
        atomic_inc(rq->node_ptr);
        rq->nr_running++;
}

etc.. would look a bit nicer.

diff -urNp linux-2.5.55-ms/kernel/sched.c linux-2.5.55-ms-ilb/kernel/sched.c
--- linux-2.5.55-ms/kernel/sched.c 2003-01-10 23:01:02.000000000 +0100
+++ linux-2.5.55-ms-ilb/kernel/sched.c 2003-01-11 01:12:43.000000000 +0100
@@ -153,6 +153,7 @@ struct runqueue {
task_t *curr, *idle;
prio_array_t *active, *expired, arrays[2];
int prev_nr_running[NR_CPUS];
+ atomic_t * node_ptr;

atomic_t *node_ptr; would match the style above.

+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};

Maybe wants some linewrapping after 80 chars?


+__init void node_nr_running_init(void)
+{
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+ }
+ return;
+}

The braces and the return are superfluous. Also, kernel/sched.c (or
mingo code in general) seems to prefer array + i instead of &array[i]
(not that I have a general preference, but you should try to match
the surrounding code).

+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+ unsigned long old_mask;
+
+ old_mask = p->cpus_allowed;
+ if (!(old_mask & (1UL << dest_cpu)))
+ return;
+ /* force the process onto the specified CPU */
+ set_cpus_allowed(p, 1UL << dest_cpu);
+
+ /* restore the cpus allowed mask */
+ set_cpus_allowed(p, old_mask);

This double set_cpus_allowed doesn't look nice to me. I don't
have a better suggestion off-hand, though :(

+#define decl_numa_int(ctr) int ctr

This is ugly as hell. I'd prefer wasting one int in each runqueue
or even an ifdef in the struct declaration over this obfuscation
all the time.

@@ -816,6 +834,16 @@ out:
static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
{
unsigned long cpumask = __node_to_cpu_mask(__cpu_to_node(this_cpu));
+#if CONFIG_NUMA
+ int node;
+# define INTERNODE_LB 10

This wants to be put to the other magic constants in the scheduler
and needs an explanation there.


#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
rq->nr_running--
#define decl_numa_int(ctr) int ctr
+#define decl_numa_nodeint(v) int v[MAX_NUMNODES]

Another one of those... You should really just stick the CONFIG_NUMA
ifdef into the actual struct declaration.

+/*
+ * Find the busiest node. All previous node loads contribute with a geometrically
+ * decaying weight to the load measure:

Linewrapping again..


2003-01-13 11:23:28

by Erich Focht

Subject: Re: NUMA scheduler 2nd approach

Hi Christoph,

thanks for reviewing the code and the helpful comments!

On Monday 13 January 2003 09:02, Christoph Hellwig wrote:
> On Mon, Jan 13, 2003 at 12:55:28AM +0100, Erich Focht wrote:
> > as discussed on the LSE call I played around with a cross-node
> > balancer approach put on top of the miniature NUMA scheduler. The
> > patches are attached and it seems to be clear that we can regain the
> > good performance for hackbench by adding a cross-node balancer.
>
> The changes look fine to me. But I think there's some coding style
> issues that need cleaning up (see below). Also is there a reason
> patches 2/3 and 4/5 aren't merged into one patch each?

The patches are separated by their functionality. Patch 2 comes from
Michael Hohnbaum, so I kept that separate for that reason. Right now
we can exchange the single components but when we decide that they are
doing the job, I'd also prefer to have just one patch.

> /*
> - * find_busiest_queue - find the busiest runqueue.
> + * find_busiest_in_mask - find the busiest runqueue among the cpus in
> cpumask */
> -static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int
> this_cpu, int idle, int *imbalance) +static inline runqueue_t
> *find_busiest_in_mask(runqueue_t *this_rq, int this_cpu, int idle, int
> *imbalance, unsigned long cpumask)
>
>
> find_busiest_queue has just one caller in 2.5.56, I'd suggest just
> changing the prototype and updating that single caller to pass in
> the cpumask opencoded.

Having find_busiest_queue() and find_busiest_in_mask() as separate
functions makes it simpler to merge in the cross-node balancer (patch
3). Otherwise we'd have to add two #ifdef CONFIG_NUMA blocks into
load_balance() (one for new variable declarations, the other one for
selecting the target node mask). We might have several calls to
find_busiest_in_mask() later, if we decide to add multi-level node
hierarchy support...


> I don't think you need this spurious whitespace change :)

:-) slipped in somehow.


> +#ifdef CONFIG_NUMA
> +extern void sched_balance_exec(void);
> +extern void node_nr_running_init(void);
> +#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
> + rq->nr_running++
> +#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
> + rq->nr_running--
>
> static inline void nr_running_inc(runqueue_t *rq)
> {
> atomic_inc(rq->node_ptr);
> rq->nr_running++
> }
>
> etc.. would look a bit nicer.

We can change this. Michael, ok with you?


> +#if CONFIG_NUMA
> +static atomic_t node_nr_running[MAX_NUMNODES]
> ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
>
> Maybe wants some linewrapping after 80 chars?

Yes.


> + for (i = 0; i < NR_CPUS; i++) {
> + cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
> + }
> + return;
> +}
>
> The braces and the return are superfluous. Also kernel/sched.c (or
> mingo code in general) seems to prefer array + i instead of &array[i]
> (not that I have a general preference, but you should try to match
> the surrounding code)

Will change the braces and remove the return. I personally find
&array[i] more readable.

> +static void sched_migrate_task(task_t *p, int dest_cpu)
> +{
> + unsigned long old_mask;
> +
> + old_mask = p->cpus_allowed;
> + if (!(old_mask & (1UL << dest_cpu)))
> + return;
> + /* force the process onto the specified CPU */
> + set_cpus_allowed(p, 1UL << dest_cpu);
> +
> + /* restore the cpus allowed mask */
> + set_cpus_allowed(p, old_mask);
>
> This double set_cpus_allowed doesn't look nice to me. I don't
> have a better suggestion off-hand, though :(

This is not that bad. It involves only a single wakeup of the
migration thread, and that's more important. Doing it another way
would mean replicating the set_cpus_allowed() code.

> +#define decl_numa_int(ctr) int ctr
>
> This is ugly as hell. I'd prefer wasting one int in each runqueue
> or even an ifdef in the struct declaration over this obfuscation
> all the time.

Agreed :-) Just trying to avoid #ifdefs in sched.c as much as
possible. Somehow had the feeling Linus doesn't like that. On the
other hand: CONFIG_NUMA is a special case of CONFIG_SMP and nobody has
anything against CONFIG_SMP in sched.c...


> @@ -816,6 +834,16 @@ out:
> static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int
> this_cpu, int idle, int *imbalance) {
> unsigned long cpumask = __node_to_cpu_mask(__cpu_to_node(this_cpu));
> +#if CONFIG_NUMA
> + int node;
> +# define INTERNODE_LB 10
>
> This wants to be put to the other magic constants in the scheduler
> and needs an explanation there.

We actually get rid of this in patch #5 (variable internode_lb,
depending on the load of the current node). But ok, I'll move the
MIN_INTERNODE_LB and MAX_INTERNODE_LB variables to the magic
constants. They'll be outside the #ifdef CONFIG_NUMA block...

Thanks,

best regards,
Erich

2003-01-13 15:18:14

by Christoph Hellwig

Subject: Re: [Lse-tech] Re: NUMA scheduler 2nd approach

Anyone interested in this cleaned up minimal numa scheduler? This
is basically Erich's patches 1-3 with my suggestions applied.

This does not mean I don't like 4 & 5, but I'd rather get a small,
non-intrusive patch into Linus' tree now and do the fine-tuning later.


--- 1.62/fs/exec.c Fri Jan 10 08:21:00 2003
+++ edited/fs/exec.c Mon Jan 13 15:33:32 2003
@@ -1031,6 +1031,8 @@
int retval;
int i;

+ sched_balance_exec();
+
file = open_exec(filename);

retval = PTR_ERR(file);
--- 1.119/include/linux/sched.h Sat Jan 11 07:44:15 2003
+++ edited/include/linux/sched.h Mon Jan 13 15:58:11 2003
@@ -444,6 +444,14 @@
# define set_cpus_allowed(p, new_mask) do { } while (0)
#endif

+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#else
+# define sched_balance_exec() do { } while (0)
+# define node_nr_running_init() do { } while (0)
+#endif
+
extern void set_user_nice(task_t *p, long nice);
extern int task_prio(task_t *p);
extern int task_nice(task_t *p);
--- 1.91/init/main.c Mon Jan 6 04:08:49 2003
+++ edited/init/main.c Mon Jan 13 15:33:33 2003
@@ -495,6 +495,7 @@

migration_init();
#endif
+ node_nr_running_init();
spawn_ksoftirqd();
}

--- 1.148/kernel/sched.c Sat Jan 11 07:44:22 2003
+++ edited/kernel/sched.c Mon Jan 13 16:17:34 2003
@@ -67,6 +67,7 @@
#define INTERACTIVE_DELTA 2
#define MAX_SLEEP_AVG (2*HZ)
#define STARVATION_LIMIT (2*HZ)
+#define NODE_BALANCE_RATIO 10

/*
* If a task is 'interactive' then we reinsert it in the active
@@ -154,6 +155,11 @@
prio_array_t *active, *expired, arrays[2];
int prev_nr_running[NR_CPUS];

+#ifdef CONFIG_NUMA
+ atomic_t *node_nr_running;
+ int nr_balanced;
+#endif
+
task_t *migration_thread;
struct list_head migration_queue;

@@ -178,6 +184,38 @@
#endif

/*
+ * Keep track of running tasks.
+ */
+#if CONFIG_NUMA
+
+/* XXX(hch): this should go into a struct sched_node_data */
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
+ {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+static inline void nr_running_init(struct runqueue *rq)
+{
+ rq->node_nr_running = &node_nr_running[0];
+}
+
+static inline void nr_running_inc(struct runqueue *rq)
+{
+ atomic_inc(rq->node_nr_running);
+ rq->nr_running++;
+}
+
+static inline void nr_running_dec(struct runqueue *rq)
+{
+ atomic_dec(rq->node_nr_running);
+ rq->nr_running--;
+}
+
+#else
+# define nr_running_init(rq) do { } while (0)
+# define nr_running_inc(rq) do { (rq)->nr_running++; } while (0)
+# define nr_running_dec(rq) do { (rq)->nr_running--; } while (0)
+#endif /* CONFIG_NUMA */
+
+/*
* task_rq_lock - lock the runqueue a given task resides on and disable
* interrupts. Note the ordering: we can safely lookup the task_rq without
* explicitly disabling preemption.
@@ -294,7 +332,7 @@
p->prio = effective_prio(p);
}
enqueue_task(p, array);
- rq->nr_running++;
+ nr_running_inc(rq);
}

/*
@@ -302,7 +340,7 @@
*/
static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
{
- rq->nr_running--;
+ nr_running_dec(rq);
if (p->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
dequeue_task(p, p->array);
@@ -624,9 +662,108 @@
spin_unlock(&rq2->lock);
}

-#if CONFIG_SMP
+#if CONFIG_NUMA
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu. Then
+ * the cpu_allowed mask is restored.
+ *
+ * Note: This isn't actually numa-specific, but just not used otherwise.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+ unsigned long old_mask = p->cpus_allowed;
+
+ if (old_mask & (1UL << dest_cpu)) {
+ unsigned long flags;
+ struct runqueue *rq;
+
+ /* force the process onto the specified CPU */
+ set_cpus_allowed(p, 1UL << dest_cpu);
+
+ /* restore the cpus allowed mask */
+ rq = task_rq_lock(p, &flags);
+ p->cpus_allowed = old_mask;
+ task_rq_unlock(rq, &flags);
+ }
+}

/*
+ * Find the least loaded CPU. Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+ int i, minload, load, best_cpu, node = 0;
+ unsigned long cpumask;
+
+ best_cpu = task_cpu(p);
+ if (cpu_rq(best_cpu)->nr_running <= 2)
+ return best_cpu;
+
+ minload = 10000000;
+ for (i = 0; i < numnodes; i++) {
+ load = atomic_read(&node_nr_running[i]);
+ if (load < minload) {
+ minload = load;
+ node = i;
+ }
+ }
+
+ minload = 10000000;
+ cpumask = __node_to_cpu_mask(node);
+ for (i = 0; i < NR_CPUS; ++i) {
+ if (!(cpumask & (1UL << i)))
+ continue;
+ if (cpu_rq(i)->nr_running < minload) {
+ best_cpu = i;
+ minload = cpu_rq(i)->nr_running;
+ }
+ }
+ return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+ int new_cpu;
+
+ if (numnodes > 1) {
+ new_cpu = sched_best_cpu(current);
+ if (new_cpu != smp_processor_id())
+ sched_migrate_task(current, new_cpu);
+ }
+}
+
+static int find_busiest_node(int this_node)
+{
+ int i, node = this_node, load, this_load, maxload;
+
+ this_load = maxload = atomic_read(&node_nr_running[this_node]);
+ for (i = 0; i < numnodes; i++) {
+ if (i == this_node)
+ continue;
+ load = atomic_read(&node_nr_running[i]);
+ if (load > maxload && (4*load > ((5*4*this_load)/4))) {
+ maxload = load;
+ node = i;
+ }
+ }
+
+ return node;
+}
+
+__init void node_nr_running_init(void)
+{
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++)
+ cpu_rq(i)->node_nr_running = node_nr_running + __cpu_to_node(i);
+}
+#endif /* CONFIG_NUMA */
+
+#if CONFIG_SMP
+/*
* double_lock_balance - lock the busiest runqueue
*
* this_rq is locked already. Recalculate nr_running if we have to
@@ -652,9 +789,10 @@
}

/*
- * find_busiest_queue - find the busiest runqueue.
+ * find_busiest_queue - find the busiest runqueue among the cpus in cpumask
*/
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu,
+ int idle, int *imbalance, unsigned long cpumask)
{
int nr_running, load, max_load, i;
runqueue_t *busiest, *rq_src;
@@ -689,7 +827,7 @@
busiest = NULL;
max_load = 1;
for (i = 0; i < NR_CPUS; i++) {
- if (!cpu_online(i))
+ if (!cpu_online(i) || !((1UL << i) & cpumask))
continue;

rq_src = cpu_rq(i);
@@ -736,9 +874,9 @@
static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
{
dequeue_task(p, src_array);
- src_rq->nr_running--;
+ nr_running_dec(src_rq);
set_task_cpu(p, this_cpu);
- this_rq->nr_running++;
+ nr_running_inc(this_rq);
enqueue_task(p, this_rq->active);
/*
* Note that idle threads have a prio of MAX_PRIO, for this test
@@ -758,13 +896,27 @@
*/
static void load_balance(runqueue_t *this_rq, int idle)
{
- int imbalance, idx, this_cpu = smp_processor_id();
+ int imbalance, idx, this_cpu, this_node;
+ unsigned long cpumask;
runqueue_t *busiest;
prio_array_t *array;
struct list_head *head, *curr;
task_t *tmp;

- busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+ this_cpu = smp_processor_id();
+ this_node = __cpu_to_node(this_cpu);
+ cpumask = __node_to_cpu_mask(this_node);
+
+#if CONFIG_NUMA
+ /*
+ * Avoid rebalancing between nodes too often.
+ */
+ if (!(++this_rq->nr_balanced % NODE_BALANCE_RATIO))
+ cpumask |= __node_to_cpu_mask(find_busiest_node(this_node));
+#endif
+
+ busiest = find_busiest_queue(this_rq, this_cpu, idle,
+ &imbalance, cpumask);
if (!busiest)
goto out;

@@ -2231,6 +2383,7 @@
spin_lock_init(&rq->lock);
INIT_LIST_HEAD(&rq->migration_queue);
atomic_set(&rq->nr_iowait, 0);
+ nr_running_init(rq);

for (j = 0; j < 2; j++) {
array = rq->arrays + j;

2003-01-13 15:36:57

by Erich Focht

Subject: Re: [Lse-tech] Re: NUMA scheduler 2nd approach

Hi Christoph,

I just finished some experiments which show that the fine-tuning can
really be left for later. So this approach is ok for me. I hope we can
get enough support for integrating this tiny numa scheduler.

I didn't do all possible measurements, the interesting ones are with
patches 1-4 (nb-smooth) and 1-5 (nb-sm-var1, nb-sm-var2) applied. They
show pretty consistent results (within error bars). The fine-tuning in
patch #5 doesn't buy us much right now (on my platform), so we can
leave it out.

Here's the data:

Results on a 8 CPU ia64 machine with 4 nodes (2 CPUs per node).

kernbench:
elapsed user system
stock52 134.52(0.84) 951.64(0.97) 20.72(0.22)
sched52 133.19(1.49) 944.24(0.50) 21.36(0.24)
minsched52 135.47(0.47) 937.61(0.20) 21.30(0.14)
nb-smooth 133.61(0.71) 944.71(0.35) 21.22(0.22)
nb-sm-var1 135.23(2.07) 943.78(0.54) 21.54(0.17)
nb-sm-var2 133.87(0.61) 944.18(0.62) 21.32(0.13)

schedbench/hackbench: time(s)
10 25 50 100
stock52 0.81(0.04) 2.06(0.07) 4.09(0.13) 7.89(0.25)
sched52 0.81(0.04) 2.03(0.07) 4.14(0.20) 8.61(0.35)
minsched52 1.28(0.05) 3.19(0.06) 6.59(0.13) 13.56(0.27)
nb-smooth 0.77(0.03) 1.94(0.04) 3.80(0.06) 7.97(0.35)
nb-sm-var1 0.81(0.05) 2.01(0.09) 3.89(0.21) 8.20(0.34)
nb-sm-var2 0.82(0.04) 2.10(0.09) 4.19(0.14) 8.15(0.24)

numabench/numa_test 4
AvgUserTime ElapsedTime TotUserTime TotSysTime
stock52 --- 27.23(0.52) 89.30(4.18) 0.09(0.01)
sched52 22.32(1.00) 27.39(0.42) 89.29(4.02) 0.10(0.01)
minsched52 20.01(0.01) 23.40(0.13) 80.05(0.02) 0.08(0.01)
nb-smooth 21.01(0.79) 24.70(2.75) 84.04(3.15) 0.09(0.01)
nb-sm-var1 21.39(0.83) 26.03(2.15) 85.56(3.31) 0.09(0.01)
nb-sm-var2 22.18(0.74) 27.36(0.42) 88.72(2.94) 0.09(0.01)

numabench/numa_test 8
AvgUserTime ElapsedTime TotUserTime TotSysTime
stock52 --- 27.50(2.58) 174.74(6.66) 0.18(0.01)
sched52 21.73(1.00) 33.70(1.82) 173.87(7.96) 0.18(0.01)
minsched52 20.31(0.00) 23.50(0.12) 162.47(0.04) 0.16(0.01)
nb-smooth 20.46(0.44) 24.21(1.95) 163.68(3.56) 0.16(0.01)
nb-sm-var1 20.45(0.44) 23.95(1.73) 163.62(3.49) 0.17(0.01)
nb-sm-var2 20.71(0.82) 23.78(2.42) 165.74(6.58) 0.17(0.01)

numabench/numa_test 16
AvgUserTime ElapsedTime TotUserTime TotSysTime
stock52 --- 52.68(1.51) 390.03(15.10) 0.34(0.01)
sched52 21.51(0.80) 47.18(3.24) 344.29(12.78) 0.36(0.01)
minsched52 20.50(0.03) 43.82(0.08) 328.05(0.45) 0.34(0.01)
nb-smooth 21.12(0.69) 47.42(4.02) 337.99(10.99) 0.34(0.01)
nb-sm-var1 21.18(0.77) 48.19(5.05) 338.94(12.38) 0.34(0.01)
nb-sm-var2 21.69(0.91) 49.05(4.36) 347.03(14.49) 0.34(0.01)

numabench/numa_test 32
AvgUserTime ElapsedTime TotUserTime TotSysTime
stock52 --- 102.60(3.89) 794.57(31.72) 0.65(0.01)
sched52 21.93(0.57) 92.46(1.10) 701.75(18.38) 0.67(0.02)
minsched52 20.64(0.10) 89.95(3.16) 660.72(3.13) 0.68(0.07)
nb-smooth 20.95(0.19) 86.63(1.74) 670.56(6.02) 0.66(0.02)
nb-sm-var1 21.47(0.54) 90.95(3.28) 687.12(17.42) 0.67(0.02)
nb-sm-var2 21.45(0.67) 89.91(3.80) 686.47(21.37) 0.68(0.02)


The kernels used:
- stock52 : 2.5.52 + ia64 patch
- sched52 : stock52 + old numa scheduler
- minsched52 : stock52 + miniature NUMA scheduler (cannot load
balance across nodes)
- nb-smooth : minsched52 + node balancer + smooth node load patch
- nb-sm-var1 : nb-smooth + variable internode_lb, (MIN,MAX) = (4,40)
- nb-sm-var2 : nb-smooth + variable internode_lb, (MIN,MAX) = (1,16)

Best regards,
Erich


On Monday 13 January 2003 16:26, Christoph Hellwig wrote:
> Anyone interested in this cleaned up minimal numa scheduler? This
> is basically Erich's patches 1-3 with my suggestions applied.
>
> This does not mean I don't like 4 & 5, but I'd rather get a small,
> non-intrusive patch into Linus' tree now and do the fine-tuning later.
>
[...]

2003-01-13 18:55:39

by Michael Hohnbaum

Subject: Re: NUMA scheduler 2nd approach

On Mon, 2003-01-13 at 03:32, Erich Focht wrote:
> Hi Christoph,
>
> thanks for reviewing the code and the helpful comments!
>
> On Monday 13 January 2003 09:02, Christoph Hellwig wrote:
> > On Mon, Jan 13, 2003 at 12:55:28AM +0100, Erich Focht wrote:
> > > as discussed on the LSE call I played around with a cross-node
> > > balancer approach put on top of the miniature NUMA scheduler. The
> > > patches are attached and it seems to be clear that we can regain the
> > > good performance for hackbench by adding a cross-node balancer.
> >
> > The changes look fine to me. But I think there's some coding style
> > issues that need cleaning up (see below). Also is there a reason
> > patches 2/3 and 4/5 aren't merged into one patch each?
>
> The patches are separated by their functionality. Patch 2 comes from
> Michael Hohnbaum, so I kept that separate for that reason. Right now
> we can exchange the single components but when we decide that they are
> doing the job, I'd also prefer to have just one patch.

I'm not opposed to combining the patches, but isn't it typically
preferred to have patches as small as possible? The initial load
balance patch (patch 2) applies and functions fairly independently
of the other patches, so it has been easy to maintain as a separate
patch.
>
>
> > +#ifdef CONFIG_NUMA
> > +extern void sched_balance_exec(void);
> > +extern void node_nr_running_init(void);
> > +#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
> > + rq->nr_running++
> > +#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
> > + rq->nr_running--
> >
> > static inline void nr_running_inc(runqueue_t *rq)
> > {
> > atomic_inc(rq->node_ptr);
> > rq->nr_running++
> > }
> >
> > etc.. would look a bit nicer.
>
> We can change this. Michael, ok with you?

Fine with me.
>
>
> > +#if CONFIG_NUMA
> > +static atomic_t node_nr_running[MAX_NUMNODES]
> > ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
> >
> > Maybe wants some linewrapping after 80 chars?
>
> Yes.
>
>
> > + for (i = 0; i < NR_CPUS; i++) {
> > + cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
> > + }
> > + return;
> > +}
> >
> > The braces and the return are superfluous. Also kernel/sched.c (or
> > mingo code in general) seems to prefer array + i instead of &array[i]
> > (not that I have a general preference, but you should try to match
> > the surrounding code)
>
> Will change the braces and remove the return. I personally find
> &array[i] more readable.

I agree with Erich here - &array[i] is clearer.
>
> > +static void sched_migrate_task(task_t *p, int dest_cpu)
> > +{
> > + unsigned long old_mask;
> > +
> > + old_mask = p->cpus_allowed;
> > + if (!(old_mask & (1UL << dest_cpu)))
> > + return;
> > + /* force the process onto the specified CPU */
> > + set_cpus_allowed(p, 1UL << dest_cpu);
> > +
> > + /* restore the cpus allowed mask */
> > + set_cpus_allowed(p, old_mask);
> >
> > This double set_cpus_allowed doesn't look nice to me. I don't
> > have a better suggestion off-hand, though :(
>
> This is not that bad. It involves only one single wakeup of the
> migration thread, and that's more important. Doing it another way
> would mean to replicate the set_cpus_allowed() code.

While this shows up in the patch credited to me, it was actually
Erich's idea, and I think a quite good one. My initial impression
of this was that it was ugly, but upon closer examination (and trying
to implement this some other way) I realized that Erich's implementation,
which I borrowed from, was quite good.

Good work Erich in getting the internode balancer written and
working. I'm currently building a kernel with these patches and
will post results later in the day. Thanks.

--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2003-01-14 01:13:35

by Michael Hohnbaum

Subject: Re: NUMA scheduler 2nd approach

On Sun, 2003-01-12 at 15:55, Erich Focht wrote:
> Hi Martin & Michael,
>
> as discussed on the LSE call I played around with a cross-node
> balancer approach put on top of the miniature NUMA scheduler. The
> patches are attached and it seems to be clear that we can regain the
> good performance for hackbench by adding a cross-node balancer.

Erich,

I played with this today on my 4 node (16 CPU) NUMAQ. Spent most
of the time working with the first three patches. What I found was
that rebalancing was happening too much between nodes. I tried a
few things to change this, but have not yet settled on the best
approach. A key item to work with is the check in find_busiest_node
to determine if the found node is sufficiently busier to warrant stealing
from it. Currently the check is that the node has more than 125% of the load
of the current node. I think that, for my system at least, we need
to add in a constant to this equation. I tried using 4 and that
helped a little. Finally I added in the 04 patch, and that helped
a lot. Still, there is too much process movement between nodes.
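
For reference, the two variants being compared look roughly like this in
plain C (the stock test is the 4*load > 5*this_load check from
find_busiest_node, i.e. >125%; the exact form of the "+4" experiment is not
shown in this mail, so the second function is only one plausible reading of
it):

static int node_busier_stock(int load, int this_load)
{
    return 4 * load > 5 * this_load;          /* load > 1.25 * this_load */
}

static int node_busier_plus_constant(int load, int this_load)
{
    return 4 * load > 5 * this_load + 4 * 4;  /* load > 1.25 * this_load + 4 */
}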

Tomorrow, I will continue experiments, but base them on the first
4 patches. Two suggestions for minor changes:

* Make the check in find_busiest_node into a macro that is defined
in the arch specific topology header file. Then different NUMA
architectures can tune this appropriately.

* In find_busiest_queue change:

cpumask |= __node_to_cpu_mask(node);
to:
cpumask = __node_to_cpu_mask(node) | (1UL << (this_cpu));


There is no reason to iterate over the runqueues on the current
node, which is what the code currently does.

Some numbers for anyone interested:Kernbench:

All numbers are based on a 2.5.55 kernel with the cputime stats patch:
* stock55 = no additional patches
* mini+rebal-55 = patches 01, 02, and 03
* rebal+4+fix = patches 01, 02, 03, and the cpumask change described
above, and a +4 constant added to the check in find_busiest_node
* rebal+4+fix+04 = above with the 04 patch added

Elapsed User System CPU
rebal+4+fix+04-55 29.302s 285.136s 82.106s 1253%
rebal+4+fix-55 30.498s 286.586s 88.176s 1228.6%
mini+rebal-55 30.756s 287.646s 85.512s 1212.8%
stock55 31.018s 303.084s 86.626s 1256.2%

Schedbench 4:
AvgUser Elapsed TotalUser TotalSys
rebal+4+fix+04-55 27.34 40.49 109.39 0.88
rebal+4+fix-55 24.73 38.50 98.94 0.84
mini+rebal-55 25.18 43.23 100.76 0.68
stock55 31.38 41.55 125.54 1.24

Schedbench 8:
AvgUser Elapsed TotalUser TotalSys
rebal+4+fix+04-55 30.05 44.15 240.48 2.50
rebal+4+fix-55 34.33 46.40 274.73 2.31
mini+rebal-55 32.99 52.42 264.00 2.08
stock55 44.63 61.28 357.11 2.22

Schedbench 16:
AvgUser Elapsed TotalUser TotalSys
rebal+4+fix+04-55 52.13 57.68 834.23 3.55
rebal+4+fix-55 52.72 65.16 843.70 4.55
mini+rebal-55 57.29 71.51 916.84 5.10
stock55 66.91 85.08 1070.72 6.05

Schedbench 32:
AvgUser Elapsed TotalUser TotalSys
rebal+4+fix+04-55 56.38 124.09 1804.67 7.71
rebal+4+fix-55 55.13 115.18 1764.46 8.86
mini+rebal-55 57.83 125.80 1850.84 10.19
stock55 80.38 181.80 2572.70 13.22

Schedbench 64:
AvgUser Elapsed TotalUser TotalSys
rebal+4+fix+04-55 57.42 238.32 3675.77 17.68
rebal+4+fix-55 60.06 252.96 3844.62 18.88
mini+rebal-55 58.15 245.30 3722.38 19.64
stock55 123.96 513.66 7934.07 26.39


And here are the results from running numa_test 32 on rebal+4+fix+04:

Executing 32 times: ./randupdt 1000000
Running 'hackbench 20' in the background: Time: 8.383
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 100.0 0.0 0.0 0.0 | 0 0 | 56.19
2 100.0 0.0 0.0 0.0 | 0 0 | 53.80
3 0.0 0.0 100.0 0.0 | 2 2 | 55.61
4 100.0 0.0 0.0 0.0 | 0 0 | 54.13
5 3.7 0.0 0.0 96.3 | 3 3 | 56.48
6 0.0 0.0 100.0 0.0 | 2 2 | 55.11
7 0.0 0.0 100.0 0.0 | 2 2 | 55.94
8 0.0 0.0 100.0 0.0 | 2 2 | 55.69
9 80.6 19.4 0.0 0.0 | 1 0 *| 56.53
10 0.0 0.0 0.0 100.0 | 3 3 | 53.00
11 0.0 99.2 0.0 0.8 | 1 1 | 56.72
12 0.0 0.0 0.0 100.0 | 3 3 | 54.58
13 0.0 100.0 0.0 0.0 | 1 1 | 59.38
14 0.0 55.6 0.0 44.4 | 3 1 *| 63.06
15 0.0 100.0 0.0 0.0 | 1 1 | 56.02
16 0.0 19.2 0.0 80.8 | 1 3 *| 58.07
17 0.0 100.0 0.0 0.0 | 1 1 | 53.78
18 0.0 0.0 100.0 0.0 | 2 2 | 55.28
19 0.0 78.6 0.0 21.4 | 3 1 *| 63.20
20 0.0 100.0 0.0 0.0 | 1 1 | 53.27
21 0.0 0.0 100.0 0.0 | 2 2 | 55.79
22 0.0 0.0 0.0 100.0 | 3 3 | 57.23
23 12.4 19.1 0.0 68.5 | 1 3 *| 61.05
24 0.0 0.0 100.0 0.0 | 2 2 | 54.50
25 0.0 0.0 0.0 100.0 | 3 3 | 56.82
26 0.0 0.0 100.0 0.0 | 2 2 | 56.28
27 15.3 0.0 0.0 84.7 | 3 3 | 57.12
28 100.0 0.0 0.0 0.0 | 0 0 | 53.85
29 32.7 67.2 0.0 0.0 | 0 1 *| 62.66
30 100.0 0.0 0.0 0.0 | 0 0 | 53.86
31 100.0 0.0 0.0 0.0 | 0 0 | 53.94
32 100.0 0.0 0.0 0.0 | 0 0 | 55.36
AverageUserTime 56.38 seconds
ElapsedTime 124.09
TotalUserTime 1804.67
TotalSysTime 7.71

Ideally, there would be nothing but 100.0 in all non-zero entries.
I'll try adding in the 05 patch, and if that does not help, will
try a few other adjustments.


Thanks for the quick effort on putting together the node rebalance
code. I'll also get some hackbench numbers soon.
--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2003-01-14 10:47:07

by Erich Focht

Subject: Re: NUMA scheduler 2nd approach

Hi Michael,

thanks for the many experiments, they help giving some more
insight. It's quite clear that we'll have to tune the knobs a bit to
make this fit well.

At first a comment on the "target" of the tuning:

> 26 0.0 0.0 100.0 0.0 | 2 2 | 56.28
> 27 15.3 0.0 0.0 84.7 | 3 3 | 57.12
> 28 100.0 0.0 0.0 0.0 | 0 0 | 53.85
> 29 32.7 67.2 0.0 0.0 | 0 1 *| 62.66
> 30 100.0 0.0 0.0 0.0 | 0 0 | 53.86
> 31 100.0 0.0 0.0 0.0 | 0 0 | 53.94
> 32 100.0 0.0 0.0 0.0 | 0 0 | 55.36
> AverageUserTime 56.38 seconds
> ElapsedTime 124.09
> TotalUserTime 1804.67
> TotalSysTime 7.71
>
> Ideally, there would be nothing but 100.0 in all non-zero entries.
> I'll try adding in the 05 patch, and if that does not help, will
> try a few other adjustments.

The numa_test is written such that you MUST get numbers below 100%
with this kind of scheduler! There is a "disturbing" process started 3
seconds after the parallel tasks are up and running which has the aim
to "shake" the system a bit and the scheduler ends up with some of
the parallel tasks moved away from their homenode. This is a
realistic situation for long-running parallel background jobs. In the
min-sched case this cannot happen, which is why you see the 100% all
over!

Raising the barrier for moving a task away from its node ever higher
won't necessarily help the scheduler, as we end up with the
min-sched limitations. The aim will be to introduce node affinity for
processes such that they can return to their homenode when the
exceptional load situation is gone. We should not try to reach the
min-sched numbers without having node-affine tasks.


> * Make the check in find_busiest_node into a macro that is defined
> in the arch specific topology header file. Then different NUMA
> architectures can tune this appropriately.

I still think that the load balancing condition should be clear
and the same for every architecture. But tunable. I understand that
the constant added helps but I don't have a clear motivation for it.
We could keep the imbalance ratio as a condition for searching the
maximum:
> + if (load > maxload && (4*load > ((5*4*this_load)/4))) {
(with tunable barrier height: say replace 5/4 by 11/8 or similar, i.e.
(8*load > ((11*8*this_load)/8))
)
and rule out any special unwanted cases after the loop. These would be
cases where it is obvious that we don't have to steal anything, like:
load of remote node is <= number of cpus in remote node.

For making this tunable I suggest something like:
#define CROSS_NODE_BAL 125
(100*load > ((CROSS_NODE_BAL*100*this_load)/100))
and let platforms define their own value. Using 100 makes it easy to
understand.
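
Written out, the per-platform tunable could look roughly like this (only the
CROSS_NODE_BAL name, the 125 default and the condition come from the
suggestion above; putting the override into the arch topology header is
Michael's idea from the previous mail and an assumption here):

/* default threshold in percent; an arch topology header may override it */
#ifndef CROSS_NODE_BAL
#define CROSS_NODE_BAL 125
#endif

static inline int node_worth_stealing_from(int load, int this_load)
{
    /* same shape as the quoted condition, with 5/4 generalised to
     * CROSS_NODE_BAL/100 */
    return 100 * load > CROSS_NODE_BAL * this_load;
}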

> * In find_busiest_queue change:
>
> cpumask |= __node_to_cpu_mask(node);
> to:
> cpumask = __node_to_cpu_mask(node) | (1UL << (this_cpu));
>
>
> There is no reason to iterate over the runqueues on the current
> node, which is what the code currently does.

Here I thought I'd give the load balancer the chance to balance
internally just in case the own node's load jumped up in the meantime.
But this should ideally be caught by find_busiest_node(), so I agree
with your change.


> Some numbers for anyone interested:Kernbench:
>
> All numbers are based on a 2.5.55 kernel with the cputime stats patch:
> * stock55 = no additional patches
> * mini+rebal-55 = patches 01, 02, and 03
> * rebal+4+fix = patches 01, 02, 03, and the cpumask change described
> above, and a +4 constant added to the check in find_busiest_node
> * rebal+4+fix+04 = above with the 04 patch added
>
> Elapsed User System CPU
> rebal+4+fix+04-55 29.302s 285.136s 82.106s 1253%
> rebal+4+fix-55 30.498s 286.586s 88.176s 1228.6%
> mini+rebal-55 30.756s 287.646s 85.512s 1212.8%
> stock55 31.018s 303.084s 86.626s 1256.2%

The first line is VERY good!

The results below (for Schedbench) are not really meaningful without
statistics... But the combination rebal+4+fix+04-55 is the clear
winner for me. Still, I'd be curious to know the rebal+fix+04-55
numbers ;-)

> Schedbench 4:
> AvgUser Elapsed TotalUser TotalSys
> rebal+4+fix+04-55 27.34 40.49 109.39 0.88
> rebal+4+fix-55 24.73 38.50 98.94 0.84
> mini+rebal-55 25.18 43.23 100.76 0.68
> stock55 31.38 41.55 125.54 1.24
>
> Schedbench 8:
> AvgUser Elapsed TotalUser TotalSys
> rebal+4+fix+04-55 30.05 44.15 240.48 2.50
> rebal+4+fix-55 34.33 46.40 274.73 2.31
> mini+rebal-55 32.99 52.42 264.00 2.08
> stock55 44.63 61.28 357.11 2.22
>
> Schedbench 16:
> AvgUser Elapsed TotalUser TotalSys
> rebal+4+fix+04-55 52.13 57.68 834.23 3.55
> rebal+4+fix-55 52.72 65.16 843.70 4.55
> mini+rebal-55 57.29 71.51 916.84 5.10
> stock55 66.91 85.08 1070.72 6.05
>
> Schedbench 32:
> AvgUser Elapsed TotalUser TotalSys
> rebal+4+fix+04-55 56.38 124.09 1804.67 7.71
> rebal+4+fix-55 55.13 115.18 1764.46 8.86
> mini+rebal-55 57.83 125.80 1850.84 10.19
> stock55 80.38 181.80 2572.70 13.22
>
> Schedbench 64:
> AvgUser Elapsed TotalUser TotalSys
> rebal+4+fix+04-55 57.42 238.32 3675.77 17.68
> rebal+4+fix-55 60.06 252.96 3844.62 18.88
> mini+rebal-55 58.15 245.30 3722.38 19.64
> stock55 123.96 513.66 7934.07 26.39
>
>
> And here are the results from running numa_test 32 on rebal+4+fix+04:
>
> Executing 32 times: ./randupdt 1000000
> Running 'hackbench 20' in the background: Time: 8.383
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 100.0 0.0 0.0 0.0 | 0 0 | 56.19
> 2 100.0 0.0 0.0 0.0 | 0 0 | 53.80
> 3 0.0 0.0 100.0 0.0 | 2 2 | 55.61
> 4 100.0 0.0 0.0 0.0 | 0 0 | 54.13
> 5 3.7 0.0 0.0 96.3 | 3 3 | 56.48
> 6 0.0 0.0 100.0 0.0 | 2 2 | 55.11
> 7 0.0 0.0 100.0 0.0 | 2 2 | 55.94
> 8 0.0 0.0 100.0 0.0 | 2 2 | 55.69
> 9 80.6 19.4 0.0 0.0 | 1 0 *| 56.53
> 10 0.0 0.0 0.0 100.0 | 3 3 | 53.00
> 11 0.0 99.2 0.0 0.8 | 1 1 | 56.72
> 12 0.0 0.0 0.0 100.0 | 3 3 | 54.58
> 13 0.0 100.0 0.0 0.0 | 1 1 | 59.38
> 14 0.0 55.6 0.0 44.4 | 3 1 *| 63.06
> 15 0.0 100.0 0.0 0.0 | 1 1 | 56.02
> 16 0.0 19.2 0.0 80.8 | 1 3 *| 58.07
> 17 0.0 100.0 0.0 0.0 | 1 1 | 53.78
> 18 0.0 0.0 100.0 0.0 | 2 2 | 55.28
> 19 0.0 78.6 0.0 21.4 | 3 1 *| 63.20
> 20 0.0 100.0 0.0 0.0 | 1 1 | 53.27
> 21 0.0 0.0 100.0 0.0 | 2 2 | 55.79
> 22 0.0 0.0 0.0 100.0 | 3 3 | 57.23
> 23 12.4 19.1 0.0 68.5 | 1 3 *| 61.05
> 24 0.0 0.0 100.0 0.0 | 2 2 | 54.50
> 25 0.0 0.0 0.0 100.0 | 3 3 | 56.82
> 26 0.0 0.0 100.0 0.0 | 2 2 | 56.28
> 27 15.3 0.0 0.0 84.7 | 3 3 | 57.12
> 28 100.0 0.0 0.0 0.0 | 0 0 | 53.85
> 29 32.7 67.2 0.0 0.0 | 0 1 *| 62.66
> 30 100.0 0.0 0.0 0.0 | 0 0 | 53.86
> 31 100.0 0.0 0.0 0.0 | 0 0 | 53.94
> 32 100.0 0.0 0.0 0.0 | 0 0 | 55.36
> AverageUserTime 56.38 seconds
> ElapsedTime 124.09
> TotalUserTime 1804.67
> TotalSysTime 7.71
>
> Ideally, there would be nothing but 100.0 in all non-zero entries.
> I'll try adding in the 05 patch, and if that does not help, will
> try a few other adjustments.
>
> Thanks for the quick effort on putting together the node rebalance
> code. I'll also get some hackbench numbers soon.

Best regards,

Erich