Commit d7102e95b7b9c00277562c29aad421d2d521c5f6 in linus's git tree
[PATCH] sched: filter affine wakeups
From: Nick Piggin <[email protected]>
Track the last waker CPU, and only consider wakeup-balancing if there's a
match between current waker CPU and the previous waker CPU. This ensures
that there is some correlation between two subsequent wakeup events before
we move the task. Should help random-wakeup workloads on large SMP
systems, by reducing the migration attempts by a factor of nr_cpus.
Apparently caused more than 10% performance regression for aim7 benchmark.
The setup in use is 16-cpu HP rx8620, 64Gb of memory and 12 MSA1000s with
144 disks. Each disk is 72Gb with a single ext3 filesystem (courtesy of
HP, who supplied benchmark results).
The problem is, for aim7, the wake-up pattern is random, but it still needs
load balancing action in the wake-up path to achieve best performance. With
the above commit, lack of load balancing hurts that workload.
However, for workloads like database transaction processing, the requirement
is exactly opposite. In the wake up path, best performance is achieved with
absolutely zero load balancing. We simply wake up the process on the CPU
that it was previously run. Worst performance is obtained when we do load
balancing at wake up.
There isn't an easy way to auto detect the workload characteristics. Ingo's
earlier patch that detects idle CPU and decide whether to load balance or not
doesn't perform with aim7 either since all CPUs are busy (it causes even
bigger perf. regression).
We should back out the above commit and add a sysctl variable to control the
behavior of load balancing in wake up path, so user can dynamically select
a mode that best fit for the workload environment. And kernel can achieve
best performance in two extreme ends of incompatible workload environments.
Two patches to follow.
- Ken
"Chen, Kenneth W" <[email protected]> wrote:
>
> Commit d7102e95b7b9c00277562c29aad421d2d521c5f6 in linus's git tree
>
> [PATCH] sched: filter affine wakeups
> From: Nick Piggin <[email protected]>
> Track the last waker CPU, and only consider wakeup-balancing if there's a
> match between current waker CPU and the previous waker CPU. This ensures
> that there is some correlation between two subsequent wakeup events before
> we move the task. Should help random-wakeup workloads on large SMP
> systems, by reducing the migration attempts by a factor of nr_cpus.
>
>
> Apparently caused more than 10% performance regression for aim7 benchmark.
Post-mortem time. Why was it merged?
This patch was added to -mm on 8 November 2006. Was merged into mainline
12 January 2006. That's two months in -mm and one month in mainline.
I don't think it's reasonable to stretch the latency of scheduler patches
to even longer than three months and I doubt if that'll solve the problem.
Oh well, at least we found it.
>
> We should back out the above commit and add a sysctl variable to control the
> behavior of load balancing in wake up path, so user can dynamically select
> a mode that best fit for the workload environment. And kernel can achieve
> best performance in two extreme ends of incompatible workload environments.
Well I don't see any benchmark numbers in the original patch. Just an
assertion that it "should" help something.
I'm more inclined to revert it and not add the sysctl (ugh) until we have a
good reason to do so.
Andrew Morton wrote:
>
>
> Well I don't see any benchmark numbers in the original patch. Just an
> assertion that it "should" help something.
>
The regression was in a Ken's commercial database benchmark. I couldn't
reproduce it but presumably it did fix it otherwise Ken would would have
piped up?
> I'm more inclined to revert it and not add the sysctl (ugh) until we have a
> good reason to do so.
>
If you revert this then Ken's database benchmark gets worse. Hence the
sysctl.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Nick Piggin wrote:
> Andrew Morton wrote:
>
>>
>>
>> Well I don't see any benchmark numbers in the original patch. Just an
>> assertion that it "should" help something.
>>
>
> The regression was in a Ken's commercial database benchmark. I couldn't
> reproduce it but presumably it did fix it otherwise Ken would would have
> piped up?
>
BTW, I did actually ask that you hold off merging it until Ken came
back with some numbers.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Nick Piggin <[email protected]> wrote:
>
> > I'm more inclined to revert it and not add the sysctl (ugh) until we have a
> > good reason to do so.
> >
>
> If you revert this then Ken's database benchmark gets worse. Hence the
> sysctl.
Putting that knob in there hurts. There's just no way in which we can
autodetect the right setting?
Ho hum. Well could someone please resend Ken's sysctl-knob patch, this
time
a) with a good changelog which is useful to people who write things to
/proc. ie: end users, not scheduler hackers and
b) with an update to Documentation/kernel-parameters.txt and
c) if poss, with a better name? Something which describes what its
effect is upon the overall system not its effect upon mysterious
scheduler internals.
If this thing is to be useful to administrators, they actually have to have
a chance of understanding what it does.
Thanks.
Nick Piggin <[email protected]> wrote:
>
> Nick Piggin wrote:
> > Andrew Morton wrote:
> >
> >>
> >>
> >> Well I don't see any benchmark numbers in the original patch. Just an
> >> assertion that it "should" help something.
> >>
> >
> > The regression was in a Ken's commercial database benchmark. I couldn't
> > reproduce it but presumably it did fix it otherwise Ken would would have
> > piped up?
> >
>
> BTW, I did actually ask that you hold off merging it until Ken came
> back with some numbers.
>
So you did.
Andrew Morton wrote on Monday, February 13, 2006 7:39 PM
> Post-mortem time. Why was it merged?
>
> This patch was added to -mm on 8 November 2006. Was merged into mainline
> 12 January 2006. That's two months in -mm and one month in mainline.
>
> I don't think it's reasonable to stretch the latency of scheduler patches
> to even longer than three months and I doubt if that'll solve the problem.
>
> Oh well, at least we found it.
I guess I'm the one to blame :-( Our kernel performance project needs to
be more active in testing pending performance patches that sits in -mm.
> > We should back out the above commit and add a sysctl variable to control the
> > behavior of load balancing in wake up path, so user can dynamically select
> > a mode that best fit for the workload environment. And kernel can achieve
> > best performance in two extreme ends of incompatible workload environments.
>
> Well I don't see any benchmark numbers in the original patch. Just an
> assertion that it "should" help something.
>
> I'm more inclined to revert it and not add the sysctl (ugh) until we have a
> good reason to do so.
The discussion on wake-up load balancing started July 28 last year [1].
In that post, it was outlined that the performance patch gave 2.3% gain
for db transaction processing workload. It was then followed up by a
couple of iterations, including one from Ingo[2] and one from Nick[3].
I think we just demonstrated that there are two workloads exist today
which has completely opposite requirement on load balancing in the
wake up path. People has tried to make kernel smart and detect work-
load environment, but so far it hasn't worked. And hence the sysctl
that I'm proposing right now.
[1] http://marc.theaimsgroup.com/?l=linux-kernel&m=112259224917701&w=2
[2] http://marc.theaimsgroup.com/?l=linux-kernel&m=112265450426975&w=2
[3] http://marc.theaimsgroup.com/?l=linux-kernel&m=113051195601837&w=2
Nick Piggin wrote on Monday, February 13, 2006 7:44 PM
> Andrew Morton wrote:
> > Well I don't see any benchmark numbers in the original patch. Just an
> > assertion that it "should" help something.
> >
> The regression was in a Ken's commercial database benchmark. I couldn't
> reproduce it but presumably it did fix it otherwise Ken would would have
> piped up?
I wasn't entirely happy though ;-)
> > I'm more inclined to revert it and not add the sysctl (ugh) until we have a
> > good reason to do so.
> >
>
> If you revert this then Ken's database benchmark gets worse. Hence the
> sysctl.
Yes, Nick is correct. For db workload, the wake-ups type are mixed.
Some of them are random, some of them are not because we do interrupt
binding. The break down between random/fixed is about 30/70. Thus,
Nick's patch helps 30% of time. With sysctl, we can regain the entire
up side for db workload while retain workload on the other end of
spectrum.
- Ken
* Andrew Morton <[email protected]> wrote:
> > We should back out the above commit and add a sysctl variable to control the
> > behavior of load balancing in wake up path, so user can dynamically select
> > a mode that best fit for the workload environment. And kernel can achieve
> > best performance in two extreme ends of incompatible workload environments.
>
> Well I don't see any benchmark numbers in the original patch. Just an
> assertion that it "should" help something.
>
> I'm more inclined to revert it and not add the sysctl (ugh) until we
> have a good reason to do so.
yeah, we should do the revert:
Acked-by: Ingo Molnar <[email protected]>
but i'm quite against a sysctl knob. Especially not for 2.6.16.
Ingo
Andrew Morton wrote on Monday, February 13, 2006 8:00 PM
> Ho hum. Well could someone please resend Ken's sysctl-knob patch,
> this time
>
> a) with a good changelog which is useful to people who write things to
> /proc. ie: end users, not scheduler hackers and
>
> b) with an update to Documentation/kernel-parameters.txt and
>
> c) if poss, with a better name? Something which describes what its
> effect is upon the overall system not its effect upon mysterious
> scheduler internals.
>
> If this thing is to be useful to administrators, they actually have to
> have a chance of understanding what it does.
Here is a respin of the patch with more documentation. I can't come up
with a better name without make it long: cpu_load_balance_on_wakeup?
(quickly duck behind a bunker).
[patch] add sysctl to control wake up load balancing
Some workload like aim7 relies on CPU load balancing action in the
wake-up path to achieve best performance. Process wake-up time is
typically a good opportunity for kernel to perform CPU load balance
and by using heuristics, it can perform relatively well for workloads
that exhibits large dynamics. However, there are other workload which
exhibit fixed wake-up pattern, i.e. because of interrupt binding or
already established balanced CPU load via periodic load balancing,
such balancing action in wake up will disturb the system and cause
excessive process migration. Experiments showed that for aim7 type
of workload, 10% performance is gained if load balance is performed,
while 2.3% performance is gained for db transaction processing workload
if load balance is not performed. Because of opposite requirements
for these two type of environment, it is best to have a sysctl variable
to control the behavior of load balancing in wake up path, so user can
dynamically select a mode that best fit for the workload environment.
And kernel can achieve best performance in two extreme ends of
incompatible workload environments.
Signed-off-by: Ken Chen <[email protected]>
--- linux-2.6.16-rc2/Documentation/filesystems/proc.txt.orig
2006-02-14 00:06:14.692245170 -0800
+++ linux-2.6.16-rc2/Documentation/filesystems/proc.txt 2006-02-14
00:08:37.735212168 -0800
@@ -1130,6 +1130,22 @@ If a system hangs up, try pressing the N
And NMI watchdog will be disabled when the value in this file is set
to
non-zero.
+wake_balance
+------------
+
+When value in this file is 1 [default], kernel will performance CPU
load
+balance when wake up process. Wake-up is a term used to describe
transition
+of a sleeping process become run-able and getting scheduled on
available CPU.
+Kernel uses heuristics to achieve best performance for vast majority of
+workloads, e.g. workloads exhibit large amount of dynamics. When the
value
+is 0, kernel will not perform any CPU load balance and directly queue
up the
+process on the CPU that was previously ran. This is best suitable for
workload
+environment that exhibit fixed wake-up pattern or already established
balanced
+CPU load via periodic load balancing. Thus keep disturbance on process
+migration to minimal. If such value is to be changed from the default,
it
+is recommended that sys admin to measure the effect. If in doubt, use
default
+value.
+
2.4 /proc/sys/vm - The virtual memory subsystem
-----------------------------------------------
--- linux-2.6.16-rc2/include/linux/sysctl.h.orig 2006-02-13
18:48:13.287205480 -0800
+++ linux-2.6.16-rc2/include/linux/sysctl.h 2006-02-13
18:49:32.138767014 -0800
@@ -146,6 +146,7 @@ enum
KERN_RANDOMIZE=68, /* int: randomize virtual address space */
KERN_SETUID_DUMPABLE=69, /* int: behaviour of dumps for setuid
core */
KERN_SPIN_RETRY=70, /* int: number of spinlock retries */
+ KERN_WAKE_BALANCE=71, /* int: load balance on wakeup */
};
--- linux-2.6.16-rc2/kernel/sched.c.orig 2006-02-13
18:24:03.035270121 -0800
+++ linux-2.6.16-rc2/kernel/sched.c 2006-02-13 19:04:22.872154540
-0800
@@ -1237,6 +1237,8 @@ static inline int wake_idle(int cpu, tas
}
#endif
+int sysctl_wake_balance=1;
+
/***
* try_to_wake_up - wake up a thread
* @p: the to-be-woken-up thread
@@ -1294,6 +1296,9 @@ static int try_to_wake_up(task_t *p, uns
}
}
+ if (!sysctl_wake_balance)
+ goto out_activate;
+
if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
goto out_set_cpu;
--- linux-2.6.16-rc2/kernel/sysctl.c.orig 2006-02-13
18:49:44.342868427 -0800
+++ linux-2.6.16-rc2/kernel/sysctl.c 2006-02-13 18:52:37.038178812
-0800
@@ -71,6 +71,7 @@ extern int printk_ratelimit_burst;
extern int pid_max_min, pid_max_max;
extern int sysctl_drop_caches;
extern int percpu_pagelist_fraction;
+extern int sysctl_wake_balance;
#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
int unknown_nmi_panic;
@@ -658,6 +659,14 @@ static ctl_table kern_table[] = {
.proc_handler = &proc_dointvec,
},
#endif
+ {
+ .ctl_name = KERN_WAKE_BALANCE,
+ .procname = "wake_balance",
+ .data = &sysctl_wake_balance,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
{ .ctl_name = 0 }
};
"Chen, Kenneth W" <[email protected]> wrote:
>
> Here is a respin of the patch with more documentation.
>
Thanks. Can you send me an unwordwrapped version off-list?
Did I mention "ug"?
Ingo, what's your plan here?
My stupid email client .... Resent, hopefully it is better without
word wrap.
Andrew Morton wrote on Monday, February 13, 2006 8:00 PM
> Ho hum. Well could someone please resend Ken's sysctl-knob patch,
> this time
>
> a) with a good changelog which is useful to people who write things to
> /proc. ie: end users, not scheduler hackers and
>
> b) with an update to Documentation/kernel-parameters.txt and
>
> c) if poss, with a better name? Something which describes what its
> effect is upon the overall system not its effect upon mysterious
> scheduler internals.
>
> If this thing is to be useful to administrators, they actually have to
> have a chance of understanding what it does.
Here is a respin of the patch with more documentation. I can't come up
with a better name without make it long: cpu_load_balance_on_wakeup?
(quickly duck behind a bunker).
[patch] add sysctl to control wake up load balancing
Some workload like aim7 relies on CPU load balancing action in the
wake-up path to achieve best performance. Process wake-up time is
typically a good opportunity for kernel to perform CPU load balance
and by using heuristics, it can perform relatively well for workloads
that exhibits large dynamics. However, there are other workload which
exhibit fixed wake-up pattern, i.e. because of interrupt binding or
already established balanced CPU load via periodic load balancing,
such balancing action in wake up will disturb the system and cause
excessive process migration. Experiments showed that for aim7 type
of workload, 10% performance is gained if load balance is performed,
while 2.3% performance is gained for db transaction processing workload
if load balance is not performed. Because of opposite requirements
for these two type of environment, it is best to have a sysctl variable
to control the behavior of load balancing in wake up path, so user can
dynamically select a mode that best fit for the workload environment.
And kernel can achieve best performance in two extreme ends of
incompatible workload environments.
Signed-off-by: Ken Chen <[email protected]>
--- linux-2.6.16-rc2/Documentation/filesystems/proc.txt.orig
2006-02-14 00:06:14.692245170 -0800
+++ linux-2.6.16-rc2/Documentation/filesystems/proc.txt 2006-02-14
00:08:37.735212168 -0800
@@ -1130,6 +1130,22 @@ If a system hangs up, try pressing the N
And NMI watchdog will be disabled when the value in this file is set
to
non-zero.
+wake_balance
+------------
+
+When value in this file is 1 [default], kernel will performance CPU
load
+balance when wake up process. Wake-up is a term used to describe
transition
+of a sleeping process become run-able and getting scheduled on
available CPU.
+Kernel uses heuristics to achieve best performance for vast majority of
+workloads, e.g. workloads exhibit large amount of dynamics. When the
value
+is 0, kernel will not perform any CPU load balance and directly queue
up the
+process on the CPU that was previously ran. This is best suitable for
workload
+environment that exhibit fixed wake-up pattern or already established
balanced
+CPU load via periodic load balancing. Thus keep disturbance on process
+migration to minimal. If such value is to be changed from the default,
it
+is recommended that sys admin to measure the effect. If in doubt, use
default
+value.
+
2.4 /proc/sys/vm - The virtual memory subsystem
-----------------------------------------------
--- linux-2.6.16-rc2/include/linux/sysctl.h.orig 2006-02-13
18:48:13.287205480 -0800
+++ linux-2.6.16-rc2/include/linux/sysctl.h 2006-02-13
18:49:32.138767014 -0800
@@ -146,6 +146,7 @@ enum
KERN_RANDOMIZE=68, /* int: randomize virtual address space */
KERN_SETUID_DUMPABLE=69, /* int: behaviour of dumps for setuid
core */
KERN_SPIN_RETRY=70, /* int: number of spinlock retries */
+ KERN_WAKE_BALANCE=71, /* int: load balance on wakeup */
};
--- linux-2.6.16-rc2/kernel/sched.c.orig 2006-02-13
18:24:03.035270121 -0800
+++ linux-2.6.16-rc2/kernel/sched.c 2006-02-13 19:04:22.872154540
-0800
@@ -1237,6 +1237,8 @@ static inline int wake_idle(int cpu, tas
}
#endif
+int sysctl_wake_balance=1;
+
/***
* try_to_wake_up - wake up a thread
* @p: the to-be-woken-up thread
@@ -1294,6 +1296,9 @@ static int try_to_wake_up(task_t *p, uns
}
}
+ if (!sysctl_wake_balance)
+ goto out_activate;
+
if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
goto out_set_cpu;
--- linux-2.6.16-rc2/kernel/sysctl.c.orig 2006-02-13
18:49:44.342868427 -0800
+++ linux-2.6.16-rc2/kernel/sysctl.c 2006-02-13 18:52:37.038178812
-0800
@@ -71,6 +71,7 @@ extern int printk_ratelimit_burst;
extern int pid_max_min, pid_max_max;
extern int sysctl_drop_caches;
extern int percpu_pagelist_fraction;
+extern int sysctl_wake_balance;
#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
int unknown_nmi_panic;
@@ -658,6 +659,14 @@ static ctl_table kern_table[] = {
.proc_handler = &proc_dointvec,
},
#endif
+ {
+ .ctl_name = KERN_WAKE_BALANCE,
+ .procname = "wake_balance",
+ .data = &sysctl_wake_balance,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
{ .ctl_name = 0 }
};
* Andrew Morton <[email protected]> wrote:
> "Chen, Kenneth W" <[email protected]> wrote:
> >
> > Here is a respin of the patch with more documentation.
> >
>
> Thanks. Can you send me an unwordwrapped version off-list?
>
> Did I mention "ug"?
>
> Ingo, what's your plan here?
I really dont like the sysctl hack. Firstly, which precise kernel
version was tested - do we know that it wasnt e.g. the smpnice
regression interfering? Secondly, i dont like the sysctl concept itself:
i really think we should try to find a way for _applications_ to be
woken up according to their workload.
If we add the sysctl then basically only the benchmarkers will use it -
99.99% of users will get whatever default we (and distros) provide, and
the problem wont be solved in any way. In fact, we'll never be able to
get rid of the knob again i suspect. I'd rather have the wakeup patch
reverted, and some better method presented. Adding the sysctl just
removes all the incentive for people to work on solving this problem in
some real way.
I also refuse to regard this as any sort of emergency that justifies the
sysctl hack. The test results came clearly late and i suggested to the
benchmarking guys a long time ago that if they want us to care about
their workload, and if it's complex to reproduce the benchmark, they
should distill some simpler test-app for us to so that we can reproduce
those cases. I'd much rather like to do the simplest thing: revert the
wakeup patch (we were fine without it for 15 kernel releases), than to
paper over [permanently!] this particular incarnation of a wider
problem.
Ingo
Ingo Molnar wrote on Tuesday, February 14, 2006 4:48 AM
> * Andrew Morton <[email protected]> wrote:
> > Ingo, what's your plan here?
>
> I really dont like the sysctl hack. Firstly, which precise kernel
> version was tested - do we know that it wasnt e.g. the smpnice
> regression interfering? Secondly, i dont like the sysctl concept itself:
> i really think we should try to find a way for _applications_ to be
> woken up according to their workload.
Ingo, you have tried to come up with an algorithm, Nick tried equally,
and I have tried. None of the "smart algorithms" works right now. The
workload requirements are on two opposite end of spectrum. If you do
one thing, it will hurt the other and vice versa. I don't see a solution
other than a sysctl. If you have implementation idea, I would be more
than happy to code it up and test. But I think we are really running dry
here.
- Ken