Log since RFC:
1. Throttle only when wake-affine failed. (thanks to PeterZ)
2. Do throttle inside wake_affine(). (thanks to PeterZ)
3. Other small fix.
Recent testing shows that the wake-affine logic causes a regression on pgbench; the
hiding rat has finally been caught.
The wake-affine logic always tries to pull the wakee close to the waker. In theory,
this benefits us when the waker's cpu has cached data that is hot for the wakee, or
in the extreme ping-pong case.
However, the logic works somewhat blindly: load balance is the only factor it
guarantees, and since the work itself is time-consuming, some workloads suffer;
pgbench is just the one that has been found so far.
Thus, throttling the wake-affine logic for such workloads is necessary.
This patch introduces a new knob 'sysctl_sched_wake_affine_interval' with a
default value of 1ms (the default minimum balance interval), which means wake-affine
will keep silent for 1ms after it returns false.
By tuning the new knob, workloads that suffer get a chance to avoid the
regression.
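For reference, a minimal userspace sketch of tuning the knob, assuming (as the
kern_table entry below suggests) that it appears as
/proc/sys/kernel/sched_wake_affine_interval and takes a value in milliseconds:

/*
 * Hypothetical helper (illustration only): set the wake-affine throttle
 * interval from userspace. The path and units are assumptions derived
 * from the sysctl entry added by this patch.
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	const char *path = "/proc/sys/kernel/sched_wake_affine_interval";
	unsigned int interval_ms = (argc > 1) ? (unsigned int)atoi(argv[1]) : 1;
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return 1;
	}
	/* e.g. pass 100 to reproduce the "100ms interval" column below */
	fprintf(f, "%u\n", interval_ms);
	fclose(f);
	return 0;
}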
Test:
Test with 12 cpu X86 server and tip 3.9.0-rc2.
default
base 1ms interval 10ms interval 100ms interval
| db_size | clients | tps | | tps | | tps | | tps |
+---------+---------+-------+- +-------+ +-------+ +-------+
| 21 MB | 1 | 10572 | | 10804 | | 10802 | | 10801 |
| 21 MB | 2 | 21275 | | 21533 | | 21400 | | 21498 |
| 21 MB | 4 | 41866 | | 42158 | | 42410 | | 42306 |
| 21 MB | 8 | 53931 | | 55796 | | 58608 | +8.67% | 59916 | +11.10%
| 21 MB | 12 | 50956 | | 52266 | | 54586 | +7.12% | 55982 | +9.86%
| 21 MB | 16 | 49911 | | 52862 | +5.91% | 55668 | +11.53% | 57255 | +14.71%
| 21 MB | 24 | 46046 | | 48820 | +6.02% | 54269 | +17.86% | 58113 | +26.21%
| 21 MB | 32 | 43405 | | 46635 | +7.44% | 53690 | +23.70% | 57729 | +33.00%
| 7483 MB | 1 | 7734 | | 8013 | | 8046 | | 7879 |
| 7483 MB | 2 | 19375 | | 19459 | | 19448 | | 19421 |
| 7483 MB | 4 | 37408 | | 37780 | | 37937 | | 37819 |
| 7483 MB | 8 | 49033 | | 50389 | | 51636 | +5.31% | 52294 | +6.65%
| 7483 MB | 12 | 45525 | | 47794 | +4.98% | 49828 | +9.45% | 50571 | +11.08%
| 7483 MB | 16 | 45731 | | 47921 | +4.79% | 50203 | +9.78% | 52033 | +13.78%
| 7483 MB | 24 | 41533 | | 44301 | +6.67% | 49697 | +19.66% | 53833 | +29.62%
| 7483 MB | 32 | 36370 | | 38301 | +5.31% | 48146 | +32.38% | 52795 | +45.16%
| 15 GB | 1 | 7576 | | 7926 | | 7722 | | 7969 |
| 15 GB | 2 | 19157 | | 19284 | | 19294 | | 19304 |
| 15 GB | 4 | 37285 | | 37539 | | 37281 | | 37508 |
| 15 GB | 8 | 48718 | | 49176 | | 50836 | +4.35% | 51239 | +5.17%
| 15 GB | 12 | 45167 | | 47180 | +4.45% | 49206 | +8.94% | 50126 | +10.98%
| 15 GB | 16 | 45270 | | 47293 | +4.47% | 49638 | +9.65% | 51748 | +14.31%
| 15 GB | 24 | 40984 | | 43366 | +5.81% | 49356 | +20.43% | 53157 | +29.70%
| 15 GB | 32 | 35918 | | 37632 | +4.77% | 47923 | +33.42% | 52241 | +45.45%
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
include/linux/sched.h | 5 +++++
kernel/sched/fair.c | 31 +++++++++++++++++++++++++++++++
kernel/sysctl.c | 10 ++++++++++
3 files changed, 46 insertions(+), 0 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..e9efd3a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1197,6 +1197,10 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};
+#ifdef CONFIG_SMP
+extern unsigned int sysctl_sched_wake_affine_interval;
+#endif
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1207,6 +1211,7 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
+ unsigned long next_wake_affine;
#endif
int on_rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..68eedd7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3087,6 +3087,22 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,
#endif
+/*
+ * Default is 1ms, to prevent the wake_affine() stuff working too frequently.
+ */
+unsigned int sysctl_sched_wake_affine_interval = 1U;
+
+static inline int wake_affine_throttled(struct task_struct *p)
+{
+ return time_before(jiffies, p->next_wake_affine);
+}
+
+static inline void wake_affine_throttle(struct task_struct *p)
+{
+ p->next_wake_affine = jiffies +
+ msecs_to_jiffies(sysctl_sched_wake_affine_interval);
+}
+
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
{
s64 this_load, load;
@@ -3096,6 +3112,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
unsigned long weight;
int balanced;
+ if (wake_affine_throttled(p))
+ return 0;
+
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -3167,6 +3186,18 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
return 1;
}
+
+ /*
+ * wake_affine() stuff try to pull wakee to the cpu
+ * around waker, this will benefit us if the data
+ * cached on waker cpu is hot for wakee, or the extreme
+ * ping-pong case.
+ *
+ * However, do such blindly work too frequently will
+ * cause regression to some workload, thus, each time
+ * when wake_affine() failed, throttle it for a while.
+ */
+ wake_affine_throttle(p);
return 0;
}
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afc1dc6..6ebfc18 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -437,6 +437,16 @@ static struct ctl_table kern_table[] = {
.extra1 = &one,
},
#endif
+#ifdef CONFIG_SMP
+ {
+ .procname = "sched_wake_affine_interval",
+ .data = &sysctl_sched_wake_affine_interval,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",
--
1.7.4.1
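As a side note, the throttle test above leans on the jiffies-wrap-safe comparison
that time_before() provides; here is a small standalone userspace sketch of that
idiom (not kernel code, the names simply mirror the patch):

#include <stdio.h>

static unsigned long jiffies;		/* stand-in for the kernel counter */
static unsigned long next_wake_affine;	/* per-task throttle deadline */

/* equivalent of time_before(jiffies, next_wake_affine): the signed
 * subtraction keeps the test correct across counter wraparound */
static int wake_affine_throttled(void)
{
	return (long)(jiffies - next_wake_affine) < 0;
}

static void wake_affine_throttle(unsigned long interval)
{
	next_wake_affine = jiffies + interval;
}

int main(void)
{
	jiffies = (unsigned long)-2;	/* counter about to wrap */
	wake_affine_throttle(10);	/* deadline lands past the wrap */
	jiffies += 5;			/* counter has wrapped by now */
	printf("throttled: %d\n", wake_affine_throttled());	/* prints 1 */
	return 0;
}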
On 04/10/2013 11:30 AM, Michael Wang wrote:
> Suggested-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Michael Wang <[email protected]>
Reviewed-by: Alex Shi <[email protected]>
BTW, could you try kbuild, hackbench and aim for this?
--
Thanks Alex
On 04/10/2013 12:16 PM, Alex Shi wrote:
> On 04/10/2013 11:30 AM, Michael Wang wrote:
>> Suggested-by: Peter Zijlstra <[email protected]>
>> Signed-off-by: Michael Wang <[email protected]>
>
> Reviewed-by: Alex Shi <[email protected]>
Thanks for your review :)
>
> BTW, could you try the kbulid, hackbench and aim for this?
Sure, the patch has already been tested with aim7, as well as hackbench,
kbench, and ebizzy; no notable changes on my box with the default 1ms
interval.
Regards,
Michael Wang
>
On 04/10/2013 01:11 PM, Michael Wang wrote:
>> > BTW, could you try the kbulid, hackbench and aim for this?
> Sure, the patch has already been tested with aim7, also the hackbench,
> kbench, and ebizzy, no notable changes on my box with the default 1ms
> interval.
That's fine.
--
Thanks Alex
On Wed, 2013-04-10 at 11:30 +0800, Michael Wang wrote:
> | 15 GB | 32 | 35918 | | 37632 | +4.77% | 47923 | +33.42% |
> 52241 | +45.45%
So I don't get this... is wake_affine() once every millisecond _that_
expensive?
Seeing we get a 45%!! improvement out of once every 100ms that would
mean we're like spending 1/3rd of our time in wake_affine()? that's
preposterous. So what's happening?
Hi, Peter
Thanks for your reply :)
On 04/10/2013 04:51 PM, Peter Zijlstra wrote:
> On Wed, 2013-04-10 at 11:30 +0800, Michael Wang wrote:
>> | 15 GB | 32 | 35918 | | 37632 | +4.77% | 47923 | +33.42% |
>> 52241 | +45.45%
>
> So I don't get this... is wake_affine() once every milisecond _that_
> expensive?
>
> Seeing we get a 45%!! improvement out of once every 100ms that would
> mean we're like spending 1/3rd of our time in wake_affine()? that's
> preposterous. So what's happening?
Not all of the regression is caused by overhead; adopting curr_cpu instead of
prev_cpu for select_idle_sibling() is a more important reason for the
pgbench regression.
In other words, for pgbench we waste time in wake_affine() and make the
wrong decision most of the time. The previous patch showed that
wake_affine() does pull unrelated tasks together; that is good if the current
cpu still has hot data cached for the wakee, but that is not the case for a
workload like pgbench.
The workload simply isn't served by the decisions wake-affine makes: the more
active wake-affine is, the more the workload suffers. That's why 100ms shows
better results than 1ms, though beyond some rate the benefit and the loss of
wake-affine balance out.
Regards,
Michael Wang
>
>
>
On 04/10/2013 05:22 PM, Michael Wang wrote:
> Hi, Peter
>
> Thanks for your reply :)
>
> On 04/10/2013 04:51 PM, Peter Zijlstra wrote:
>> On Wed, 2013-04-10 at 11:30 +0800, Michael Wang wrote:
>>> | 15 GB | 32 | 35918 | | 37632 | +4.77% | 47923 | +33.42% |
>>> 52241 | +45.45%
>>
>> So I don't get this... is wake_affine() once every milisecond _that_
>> expensive?
>>
>> Seeing we get a 45%!! improvement out of once every 100ms that would
>> mean we're like spending 1/3rd of our time in wake_affine()? that's
>> preposterous. So what's happening?
>
> Not all the regression was caused by overhead, adopt curr_cpu not
> prev_cpu for select_idle_sibling() is a more important reason for the
> regression of pgbench.
>
> In other word, for pgbench, we waste time in wake_affine() and make the
> wrong decision at most of the time, the previously patch show
> wake_affine() do pull unrelated tasks together, that's good if current
> cpu still cached hot data for wakee, but that's not the case of the
> workload like pgbench.
Please let me know if I have failed to express my thoughts clearly.
I know it's hard to figure out why throttling brings so much benefit,
since the wake-affine logic is a black box with too many unmeasurable
factors, but that's actually the reason why we finally settled on this
throttle idea rather than an approach like wakeup-buddy, although both of
them help to stop the regression.
It's fortunate that there is a benchmark that helps to expose the
regression, and now we have a simple and efficient approach ready for
action ;-)
Regards,
Michael Wang
>
> The workload just don't satisfied the decision changed by wake-affine,
> the more wake-affine active, the more it suffered, that's why 100ms show
> better results than 1ms, but when reached some rate, the benefit and
> lost of wake-affine will be balanced.
>
> Regards,
> Michael Wang
>
>>
>>
>>
>
On Thu, 2013-04-11 at 14:01 +0800, Michael Wang wrote:
> On 04/10/2013 05:22 PM, Michael Wang wrote:
> > Hi, Peter
> >
> > Thanks for your reply :)
> >
> > On 04/10/2013 04:51 PM, Peter Zijlstra wrote:
> >> On Wed, 2013-04-10 at 11:30 +0800, Michael Wang wrote:
> >>> | 15 GB | 32 | 35918 | | 37632 | +4.77% | 47923 | +33.42% |
> >>> 52241 | +45.45%
> >>
> >> So I don't get this... is wake_affine() once every milisecond _that_
> >> expensive?
> >>
> >> Seeing we get a 45%!! improvement out of once every 100ms that would
> >> mean we're like spending 1/3rd of our time in wake_affine()? that's
> >> preposterous. So what's happening?
> >
> > Not all the regression was caused by overhead, adopt curr_cpu not
> > prev_cpu for select_idle_sibling() is a more important reason for the
> > regression of pgbench.
> >
> > In other word, for pgbench, we waste time in wake_affine() and make the
> > wrong decision at most of the time, the previously patch show
> > wake_affine() do pull unrelated tasks together, that's good if current
> > cpu still cached hot data for wakee, but that's not the case of the
> > workload like pgbench.
>
> Please let me know if I failed to express my thought clearly.
>
> I know it's hard to figure out why throttle could bring so many benefit,
> since the wake-affine stuff is a black box with too many unmeasurable
> factors, but that's actually the reason why we finally figure out this
> throttle idea, not the approach like wakeup-buddy, although both of them
> help to stop the regression.
For that load, as soon as clients+server exceeds socket size, pull is
doomed to always be a guaranteed loser. There simply is no way to win,
some tasks must drag their data cross node no matter what you do,
because there is one and only one source of data, so you can not
possibly do anything but harm by pulling or in any other way disturbing
task placement, because you will force tasks to re-heat their footprint
every time you migrate someone with zero benefit to offset cost. That
is why the closer you get to completely killing all migration, the
better your throughput gets with this load.. you're killing the cost of
migration in a situation there simply is no gain to be had.
That's why that wakeup-buddy thingy is a ~good idea. It will allow 1:1
buddies that can and do benefit from motion to pair up and jabber in a
shared cache (though that motion needs slowing down too), _and_ detect
the case where wakeup migration is utterly pointless. Just killing
wakeup migration OTOH should (I'd say very emphatic will) hurt pgbench
just as much, because spreading a smallish set which could share a cache
across several nodes hurts things like pgbench via misses just as much
as any other load.. it's just that once this load (or ilk) doesn't fit
in a node, you're absolutely screwed as far as misses go, you will eat
that because there simply is no other option.
Any migration is pointless for this thing once it exceeds socket size,
and fairness plays a dominant role, is absolutely not throughputs best
friend when any component of a load requires more CPU than the other
components, which very definitely is the case with pgbench. Fairness
hurts this thing a lot. That's why pgbench took a whopping huge hit
when I fixed up select_idle_sibling() to not completely rape fast/light
communicating tasks, it forced pgbench to face the consequences of a
fair scheduler, by cutting off the escape routes that searching for
_any_ even ever so briefly idle spot to place tasks such that wakeup
preemption just didn't happen, and when we failed to pull, we instead
did the very same thing on wakees original socket, thus providing
pgbench the fairness escape mechanism that it needs.
When you wake to idle cores, you do not have a nanosecond resolution
ultra fair scheduler, with the fairness price to be paid.. tasks run as
long as they want to run, or at least full ticks, which of course makes
the hard working load components a lot more productive. Hogs can be
hogs. For pgbench run in 1:N mode, the hardest working load component
is the mother of all work, the (singular) server. Any time 'mom' is not
continuously working her little digital a$$ off to keep all those kids
fed, you have a performance problem on your hands, the entire load
stalls, lives and dies with one and only 'mom'.
-Mike
Hi, Mike
On 04/11/2013 03:30 PM, Mike Galbraith wrote:
[snip]
>>
>> Please let me know if I failed to express my thought clearly.
>>
>> I know it's hard to figure out why throttle could bring so many benefit,
>> since the wake-affine stuff is a black box with too many unmeasurable
>> factors, but that's actually the reason why we finally figure out this
>> throttle idea, not the approach like wakeup-buddy, although both of them
>> help to stop the regression.
>
> For that load, as soon as clients+server exceeds socket size, pull is
> doomed to always be a guaranteed loser. There simply is no way to win,
> some tasks must drag their data cross node no matter what you do,
> because there is one and only one source of data, so you can not
> possibly do anything but harm by pulling or in any other way disturbing
> task placement, because you will force tasks to re-heat their footprint
> every time you migrate someone with zero benefit to offset cost. That
> is why the closer you get to completely killing all migration, the
> better your throughput gets with this load.. you're killing the cost of
> migration in a situation there simply is no gain to be had.
>
> That's why that wakeup-buddy thingy is a ~good idea. It will allow 1:1
> buddies that can and do benefit from motion to pair up and jabber in a
> shared cache (though that motion needs slowing down too), _and_ detect
> the case where wakeup migration is utterly pointless. Just killing
> wakeup migration OTOH should (I'd say very emphatic will) hurt pgbench
> just as much, because spreading a smallish set which could share a cache
> across several nodes hurts things like pgbench via misses just as much
> as any other load.. it's just that once this load (or ilk) doesn't fit
> in a node, you're absolutely screwed as far as misses go, you will eat
> that because there simply is no other option.
Wow, it's really hard to write down the logic behind this, but you made it ;-)
The 1:N pattern is a good explanation of why the chance that the wakee's hot
data is cached on curr_cpu is lower, and since it's just 'lower', not 'extinct',
once the throttle interval is large enough things will balance out. This can
be seen in my tests: when the interval becomes too big, the improvement
starts to drop.
>
> Any migration is pointless for this thing once it exceeds socket size,
> and fairness plays a dominant role, is absolutely not throughputs best
> friend when any component of a load requires more CPU than the other
> components, which very definitely is the case with pgbench. Fairness
> hurts this thing a lot. That's why pgbench took a whopping huge hit
> when I fixed up select_idle_sibling() to not completely rape fast/light
> communicating tasks, it forced pgbench to face the consequences of a
> fair scheduler, by cutting off the escape routes that searching for
> _any_ even ever so briefly idle spot to place tasks such that wakeup
> preemption just didn't happen, and when we failed to pull, we instead
> did the very same thing on wakees original socket, thus providing
> pgbench the fairness escape mechanism that it needs.
>
> When you wake to idle cores, you do not have a nanosecond resolution
> ultra fair scheduler, with the fairness price to be paid.. tasks run as
> long as they want to run, or at least full ticks, which of course makes
> the hard working load components a lot more productive. Hogs can be
> hogs. For pgbench run in 1:N mode, the hardest working load component
> is the mother of all work, the (singular) server. Any time 'mom' is not
> continuously working her little digital a$$ off to keep all those kids
> fed, you have a performance problem on your hands, the entire load
> stalls, lives and dies with one and only 'mom'.
Hmm... that's an interesting point: the workload contains work of different
'priority' that depends on each other, and if the mother is starving, all the
kids can do nothing but wait for her. Maybe that's the reason why the
benefit is so significant, since in such a case even a slightly quicker
response from the mother will make all the kids happy :)
Regards,
Michael Wang
>
> -Mike
>
On Thu, 2013-04-11 at 16:26 +0800, Michael Wang wrote:
> The 1:N is a good reason to explain why the chance that wakee's hot data
> cached on curr_cpu is lower, and since it's just 'lower' not 'extinct',
> after the throttle interval large enough, it will be balanced, this
> could be proved, since during my test, when the interval become too big,
> the improvement start to drop.
Magnitude of improvement drops just because there's less damage done
methinks. You'll eventually run out of measurable damage :)
Yes, it's not really extinct, you _can_ reap a gain, it's just not at
all likely to work out. A more symmetric load will fare better, but any
1:N thing just has to spread far and wide to have any chance to perform.
> Hmm...that's an interesting point, the workload contain different
> 'priority' works, and depend on each other, if mother starving, all the
> kids could do nothing but wait for her, may be that's the reason why the
> benefit is so significant, since in such case, mother's little quicker
> respond will make all the kids happy :)
Exactly. The entire load is server latency bound. Keep the server on
cpu, the load performs as best it can given unavoidable data miss cost.
-Mike
On Thu, 2013-04-11 at 10:44 +0200, Mike Galbraith wrote:
> On Thu, 2013-04-11 at 16:26 +0800, Michael Wang wrote:
>
> > The 1:N is a good reason to explain why the chance that wakee's hot data
> > cached on curr_cpu is lower, and since it's just 'lower' not 'extinct',
> > after the throttle interval large enough, it will be balanced, this
> > could be proved, since during my test, when the interval become too big,
> > the improvement start to drop.
>
> Magnitude of improvement drops just because there's less damage done
> methinks. You'll eventually run out of measurable damage :)
>
> Yes, it's not really extinct, you _can_ reap a gain, it's just not at
> all likely to work out. A more symetric load will fare better, but any
> 1:N thing just has to spread far and wide to have any chance to perform.
> > Hmm...that's an interesting point, the workload contain different
> > 'priority' works, and depend on each other, if mother starving, all the
> > kids could do nothing but wait for her, may be that's the reason why the
> > benefit is so significant, since in such case, mother's little quicker
> > respond will make all the kids happy :)
>
> Exactly. The entire load is server latency bound. Keep the server on
> cpu, the load performs as best it can given unavoidable data miss cost.
(ie serial producer, parallel consumer... choke point lies with utterly
unscalable serial work producer)
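For readers following along, a toy sketch of that 1:N shape, one serial producer
("mom") feeding N consumers over pipes; this is purely illustrative and nothing
like real pgbench, but it shows why the single producer's latency bounds the
whole load:

/*
 * Toy illustration of the 1:N pattern described above. Throughput is
 * bounded by the single producer no matter how many consumers or CPUs
 * exist. Illustration only.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

#define NCONSUMERS 4
#define MESSAGES   100000L

int main(void)
{
	int pipes[NCONSUMERS][2];
	int i, k;
	long m;

	for (i = 0; i < NCONSUMERS; i++) {
		if (pipe(pipes[i]) < 0) {
			perror("pipe");
			return 1;
		}
	}

	for (i = 0; i < NCONSUMERS; i++) {
		if (fork() == 0) {
			char buf[64];

			/* consumer keeps only its own read end */
			for (k = 0; k < NCONSUMERS; k++) {
				close(pipes[k][1]);
				if (k != i)
					close(pipes[k][0]);
			}
			while (read(pipes[i][0], buf, sizeof(buf)) > 0)
				;
			_exit(0);
		}
	}

	/* the producer is the choke point: every message passes through
	 * this one loop, so its latency governs the whole load */
	for (i = 0; i < NCONSUMERS; i++)
		close(pipes[i][0]);
	for (m = 0; m < MESSAGES; m++)
		write(pipes[m % NCONSUMERS][1], "work\n", 5);
	for (i = 0; i < NCONSUMERS; i++)
		close(pipes[i][1]);
	for (i = 0; i < NCONSUMERS; i++)
		wait(NULL);

	return 0;
}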
On 04/11/2013 04:44 PM, Mike Galbraith wrote:
> On Thu, 2013-04-11 at 16:26 +0800, Michael Wang wrote:
>
>> The 1:N is a good reason to explain why the chance that wakee's hot data
>> cached on curr_cpu is lower, and since it's just 'lower' not 'extinct',
>> after the throttle interval large enough, it will be balanced, this
>> could be proved, since during my test, when the interval become too big,
>> the improvement start to drop.
>
> Magnitude of improvement drops just because there's less damage done
> methinks. You'll eventually run out of measurable damage :)
>
> Yes, it's not really extinct, you _can_ reap a gain, it's just not at
> all likely to work out. A more symetric load will fare better, but any
> 1:N thing just has to spread far and wide to have any chance to perform.
Agree.
>> Hmm...that's an interesting point, the workload contain different
>> 'priority' works, and depend on each other, if mother starving, all the
>> kids could do nothing but wait for her, may be that's the reason why the
>> benefit is so significant, since in such case, mother's little quicker
>> respond will make all the kids happy :)
>
> Exactly. The entire load is server latency bound. Keep the server on
> cpu, the load performs as best it can given unavoidable data miss cost.
Nice point :)
Regards,
Michael Wang
>
> -Mike
>
On 04/10/2013 04:51 PM, Peter Zijlstra wrote:
> On Wed, 2013-04-10 at 11:30 +0800, Michael Wang wrote:
>> | 15 GB | 32 | 35918 | | 37632 | +4.77% | 47923 | +33.42% |
>> 52241 | +45.45%
>
> So I don't get this... is wake_affine() once every milisecond _that_
> expensive?
>
> Seeing we get a 45%!! improvement out of once every 100ms that would
> mean we're like spending 1/3rd of our time in wake_affine()? that's
> preposterous. So what's happening?
Hi, Peter
I think Mike has explained very well why throttling brings us benefit and
why the benefit looks so significant when the interval gets higher; could you
please take a look at his mail and see whether it addresses this concern?
And thanks Mike again for the excellent analysis :)
Regards,
Michael Wang
On 04/10/2013 11:30 AM, Michael Wang wrote:
> Log since RFC:
> 1. Throttle only when wake-affine failed. (thanks to PeterZ)
> 2. Do throttle inside wake_affine(). (thanks to PeterZ)
> 3. Other small fix.
>
> Recently testing show that wake-affine stuff cause regression on pgbench, the
> hiding rat was finally catched out.
>
> wake-affine stuff is always trying to pull wakee close to waker, by theory,
> this will benefit us if waker's cpu cached hot data for wakee, or the extreme
> ping-pong case.
>
> However, the whole stuff is somewhat blindly, load balance is the only factor
> to be guaranteed, and since the stuff itself is time-consuming, some workload
> suffered, pgbench is just the one who has been found.
>
> Thus, throttle the wake-affine stuff for such workload is necessary.
>
> This patch introduced a new knob 'sysctl_sched_wake_affine_interval' with the
> default value 1ms (default minimum balance interval), which means wake-affine
> will keep silent for 1ms after it returned false.
>
> By turning the new knob, those workload who suffered will have the chance to
> stop the regression.
I have tested the latest tip 3.9.0-rc7; the huge regression on pgbench is
still there, and this approach still works well. Should we take action now?
Regards,
Michael Wang
On Mon, 2013-04-22 at 12:21 +0800, Michael Wang wrote:
> On 04/10/2013 11:30 AM, Michael Wang wrote:
> > Log since RFC:
> > 1. Throttle only when wake-affine failed. (thanks to PeterZ)
> > 2. Do throttle inside wake_affine(). (thanks to PeterZ)
> > 3. Other small fix.
> >
> > Recently testing show that wake-affine stuff cause regression on pgbench, the
> > hiding rat was finally catched out.
> >
> > wake-affine stuff is always trying to pull wakee close to waker, by theory,
> > this will benefit us if waker's cpu cached hot data for wakee, or the extreme
> > ping-pong case.
> >
> > However, the whole stuff is somewhat blindly, load balance is the only factor
> > to be guaranteed, and since the stuff itself is time-consuming, some workload
> > suffered, pgbench is just the one who has been found.
> >
> > Thus, throttle the wake-affine stuff for such workload is necessary.
> >
> > This patch introduced a new knob 'sysctl_sched_wake_affine_interval' with the
> > default value 1ms (default minimum balance interval), which means wake-affine
> > will keep silent for 1ms after it returned false.
> >
> > By turning the new knob, those workload who suffered will have the chance to
> > stop the regression.
>
> I have tested the latest tip 3.9.0-rc7, huge regression on pgbench is
> still there and this approach still works well, should we take the
> action now?
(It's not a _regression_ per se, more of a long-standing issue for this
sort of load. Not that the load cares much what we call the problem;)
Hi, Mike
Thanks for your reply :)
On 04/22/2013 01:27 PM, Mike Galbraith wrote:
> On Mon, 2013-04-22 at 12:21 +0800, Michael Wang wrote:
>> On 04/10/2013 11:30 AM, Michael Wang wrote:
>>> Log since RFC:
>>> 1. Throttle only when wake-affine failed. (thanks to PeterZ)
>>> 2. Do throttle inside wake_affine(). (thanks to PeterZ)
>>> 3. Other small fix.
>>>
>>> Recently testing show that wake-affine stuff cause regression on pgbench, the
>>> hiding rat was finally catched out.
>>>
>>> wake-affine stuff is always trying to pull wakee close to waker, by theory,
>>> this will benefit us if waker's cpu cached hot data for wakee, or the extreme
>>> ping-pong case.
>>>
>>> However, the whole stuff is somewhat blindly, load balance is the only factor
>>> to be guaranteed, and since the stuff itself is time-consuming, some workload
>>> suffered, pgbench is just the one who has been found.
>>>
>>> Thus, throttle the wake-affine stuff for such workload is necessary.
>>>
>>> This patch introduced a new knob 'sysctl_sched_wake_affine_interval' with the
>>> default value 1ms (default minimum balance interval), which means wake-affine
>>> will keep silent for 1ms after it returned false.
>>>
>>> By turning the new knob, those workload who suffered will have the chance to
>>> stop the regression.
>>
>> I have tested the latest tip 3.9.0-rc7, huge regression on pgbench is
>> still there and this approach still works well, should we take the
>> action now?
>
> (It's not a _regression_ per se, more of a long standing issue for this
> sort of load. Not that the load cares much what we call the problem;)
Yeah, it's not general but specific to one kind of workload (or maybe several);
compared with a world without wake-affine, I'd still call it the regression
caused by wake-affine on pgbench ;)
Regards,
Michael Wang
OK,.. Ingo said that pipe-test was the original motivation for
wake_affine() and since that's currently broken to pieces due to
select_idle_sibling() is there still a benefit to having it at all?
Can anybody find any significant regression when simply killing
wake_affine()?
* Peter Zijlstra <[email protected]> wrote:
> OK,.. Ingo said that pipe-test was the original motivation for
> wake_affine() and since that's currently broken to pieces due to
> select_idle_sibling() is there still a benefit to having it at all?
>
> Can anybody find any significant regression when simply killing
> wake_affine()?
I'd suggest doing a patch that does:
s/SD_WAKE_AFFINE/0*SD_WAKE_AFFINE
in all the relevant topology.h files, but otherwise keep the logic in
place. That way it's easy to revert.
Thanks,
Ingo
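For context, a plain textual substitution like that works because the SD_*_INIT
flag templates in those headers of this era are written in the 1*FLAG / 0*FLAG
style, so the replacement turns the term into 1*0*SD_WAKE_AFFINE, i.e. zero. A
tiny standalone illustration (flag values made up for the example; the real
definitions live in include/linux/sched.h):

/*
 * Illustration only: why s/SD_WAKE_AFFINE/0*SD_WAKE_AFFINE/ disables the
 * flag in initializers written in the 1*FLAG style.
 */
#include <stdio.h>

#define SD_LOAD_BALANCE	0x0001	/* made-up value for illustration */
#define SD_WAKE_AFFINE	0x0020	/* made-up value for illustration */

int main(void)
{
	int before = 1*SD_LOAD_BALANCE | 1*SD_WAKE_AFFINE;
	/* after the substitution the same initializer line reads
	 * 1*0*SD_WAKE_AFFINE, which evaluates to 0 */
	int after  = 1*SD_LOAD_BALANCE | 1*0*SD_WAKE_AFFINE;

	printf("before: %#x, after: %#x\n", before, after);
	return 0;
}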
On Mon, Apr 22, 2013 at 3:23 AM, Peter Zijlstra <[email protected]> wrote:
>
> OK,.. Ingo said that pipe-test was the original motivation for
> wake_affine() and since that's currently broken to pieces due to
> select_idle_sibling() is there still a benefit to having it at all?
>
> Can anybody find any significant regression when simply killing
> wake_affine()?
>
This matches our experience; we've always tuned toward a more
aggressive select_idle_sibling() for in-socket wake-ups.
We did recently turn it back on in the cross-socket case (e.g. SD_NODE),
however, as we were able to observe a benefit on a few workloads.
On 04/22/2013 06:23 PM, Peter Zijlstra wrote:
>
> OK,.. Ingo said that pipe-test was the original motivation for
> wake_affine() and since that's currently broken to pieces due to
> select_idle_sibling() is there still a benefit to having it at all?
>
> Can anybody find any significant regression when simply killing
> wake_affine()?
IMHO, keeping this logic may be the better choice: it does bring benefit, and
pgbench is just one workload where it also does damage.
By throttling, we can reduce the damage while keeping the benefit; at some rate
the two will balance and show the best performance, and I suspect simply
killing the logic won't show that peak...
Besides, we still have many theories about the possible benefits of this logic,
along with side effects like the spread behaviour. It has been there too long;
the influence of killing it is unmeasurable (and tends to be for the worse). I
really prefer to keep it...
And if we really want to kill it, we could just throttle it with an enormous
interval and make it unable to breathe any more ;-)
Regards,
Michael Wang
On 04/22/2013 06:35 PM, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
>> OK,.. Ingo said that pipe-test was the original motivation for
>> wake_affine() and since that's currently broken to pieces due to
>> select_idle_sibling() is there still a benefit to having it at all?
>>
>> Can anybody find any significant regression when simply killing
>> wake_affine()?
>
> I'd suggest doing a patch that does:
>
> s/SD_WAKE_AFFINE/0*SD_WAKE_AFFINE
But by doing this we won't be able to find 'affine_sd' any more, which will
also skip the select_idle_sibling() logic (in the current code), won't it?
If we really want to kill the logic (I prefer not to...), I suggest we forbid
wake-affine by throttling it with an enormous interval; that's also easy to
revert :)
Regards,
Michael Wang
>
> in all the relevant toplogy.h files, but otherwise keep the logic in
> place. That way it's easy to revert.
>
> Thanks,
>
> Ingo
On 04/22/2013 06:23 PM, Peter Zijlstra wrote:
>
> OK,.. Ingo said that pipe-test was the original motivation for
> wake_affine() and since that's currently broken to pieces due to
> select_idle_sibling() is there still a benefit to having it at all?
>
> Can anybody find any significant regression when simply killing
> wake_affine()?
Is it time for us to make the decision now?
Forgive me for pushing for progress, but I'm really confused about what is
stopping us from adopting the throttle approach...
Is there any other option that is simpler, safer and faster?
Regards,
Michael Wang
On 04/22/2013 06:23 PM, Peter Zijlstra wrote:
>
> OK,.. Ingo said that pipe-test was the original motivation for
> wake_affine() and since that's currently broken to pieces due to
> select_idle_sibling() is there still a benefit to having it at all?
>
> Can anybody find any significant regression when simply killing
> wake_affine()?
I finally got the proof that we cannot simply kill this logic.
It's hackbench with a high pipe count, still on the 12 cpu box; the
result of "./hackbench 48 process 10000" is:
Running with 48*40 (== 1920) tasks.
Time: 33.372
After killing wake-affine, the result is:
Running with 48*40 (== 1920) tasks.
Time: 38.205
Without wake-affine, performance drops by about 14.48% (the runtime grows
from 33.372 to 38.205, i.e. (38.205 - 33.372) / 33.372); I guess it is caused
by the missing spread behaviour.
I've run the test several times and also compared with the throttle approach:
the default 1ms interval still works very well, and the regression on
hackbench only starts to exceed 2% when the interval reaches 100ms on my box,
but please note that pgbench has already gained a lot of benefit at that point.
I think we can now say that wake-affine is useful and cannot simply be killed.
So I still suggest we adopt the throttle approach; then we can make
adjustments according to demand.
And please let me know if there are any concerns ;-)
Regards,
Michael Wang
On Thu, 2013-05-02 at 13:48 +0800, Michael Wang wrote:
> On 04/22/2013 06:23 PM, Peter Zijlstra wrote:
> >
> > OK,.. Ingo said that pipe-test was the original motivation for
> > wake_affine() and since that's currently broken to pieces due to
> > select_idle_sibling() is there still a benefit to having it at all?
> >
> > Can anybody find any significant regression when simply killing
> > wake_affine()?
>
> I got the proof that we could not simply killing the stuff (finally...).
>
> It's the hackbench with a high pipe number, still on 12 cpu box, the
> result of "./hackbench 48 process 10000" is:
>
> Running with 48*40 (== 1920) tasks.
> Time: 33.372
>
> After killed the wake-affine, the result is:
>
> Running with 48*40 (== 1920) tasks.
> Time: 38.205
>
> About 14.48% performance dropped without wake-affine, I guess it was
> caused by the missing spread behaviour.
>
> I've done the test for several times, also compared with the throttle
> approach, default 1ms interval still works very well, the regression on
> hackbench start to exceed 2% when interval become 100ms on my box, but
> please note the pgbench already gain a lot benefit at that time.
>
> I think now we could say that wake-affine is useful, and we could not
> simply kill it.
Oh, it's definitely useful. Communicating tasks deeply resent talking
over interconnects (advanced tin cans and string). My little Q6600 box
can even be described as dinky-numa given enough imagination.. place
communicating tasks on different core2 "nodes" if you will, throughput
falls through the floor. Shared L2 is quick like bunny, dram ain't.
-Mike
On 05/02/2013 03:10 PM, Mike Galbraith wrote:
>>
>> I've done the test for several times, also compared with the throttle
>> approach, default 1ms interval still works very well, the regression on
>> hackbench start to exceed 2% when interval become 100ms on my box, but
>> please note the pgbench already gain a lot benefit at that time.
>>
>> I think now we could say that wake-affine is useful, and we could not
>> simply kill it.
>
> Oh, it's definitely useful. Communicating tasks deeply resent talking
> over interconnects (advanced tin cans and string). My little Q6600 box
> can even be described as dinky-numa given enough imagination.. place
> communicating tasks on different core2 "nodes" if you will, throughput
> falls through the floor. Shared L2 is quick like bunny, dram ain't.
Nice, so we got another proof to defend wake-affine now ;-)
Regards,
Michael Wang
>
> -Mike
>
On 04/10/2013 11:30 AM, Michael Wang wrote:
Now that a long time has passed since the first version, I'd like to
summarize the current state:
On a 12 cpu box with tip 3.9.0-rc7, tests show that:
1. removing the wake-affine logic causes a regression on hackbench (up to 15%).
2. keeping the wake-affine logic causes a regression on pgbench (up to 45%).
Since neither removing nor keeping it as-is is acceptable, this patch
introduces a knob to throttle the wake-affine logic.
By tuning the knob, we can benefit both hackbench and pgbench, or sacrifice
one to significantly benefit the other.
All similar workloads which prefer or dislike the wake-affine behaviour can
be taken care of.
If this approach raises any concerns, please let me know ;-)
Regards,
Michael Wang
On Fri, 2013-05-03 at 11:46 +0800, Michael Wang wrote:
> On 04/10/2013 11:30 AM, Michael Wang wrote:
>
> Now long time has been passed since the first version, I'd like to
> do some summary about current states:
>
> On a 12 cpu box with tip 3.9.0-rc7, test show that:
>
> 1. remove wake-affine stuff cause regression on hackbench (could be 15%).
> 2. reserve wake-affine stuff cause regression on pgbench (could be 45%).
>
> And since neither remove or reserve is acceptable, this patch introduced a
> knob to throttle wake-affine stuff.
>
> By turning the knob, we could benefit both hackbench and pgbench, or sacrifice
> one to significantly benefit another.
>
> All similar workload which prefer or hate wake-affine behaviour could
> been cared.
>
> If this approach caused any concerns, please let me know ;-)
I wonder if throttling on failure is the way to go. Note the minimal
gain for pgbench with the default 1ms throttle interval. It's not very
effective out of the box for the load type it's targeted to help, and
people generally don't twiddle scheduler knobs. If you throttle on
success, you directly restrict migration frequency without that being
affected by what other tasks are doing. Seems that would be a bit more
effective.
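To make the contrast concrete, here is a toy userspace model of the two policies;
purely illustrative, the simulated clock and the decision pattern are made up, and
the only difference between the variants is where the throttle is armed:

#include <stdbool.h>
#include <stdio.h>

#define INTERVAL 4                      /* throttle interval, in fake jiffies */

static unsigned long jiffies;           /* simulated clock */
static unsigned long next_wake_affine;  /* the per-task field in the patch */

/* Returns true when the wakeup would be pulled to the waker's cpu. */
static bool wake_affine(bool would_pull, bool throttle_on_failure)
{
        if (jiffies < next_wake_affine)         /* i.e. time_before() */
                return false;
        if (would_pull) {
                if (!throttle_on_failure)       /* throttle-on-success variant */
                        next_wake_affine = jiffies + INTERVAL;
                return true;
        }
        if (throttle_on_failure)                /* variant in the posted patch */
                next_wake_affine = jiffies + INTERVAL;
        return false;
}

static int run(const bool *pattern, int n, bool throttle_on_failure)
{
        int i, pulls = 0;

        jiffies = next_wake_affine = 0;
        for (i = 0; i < n; i++, jiffies++)
                pulls += wake_affine(pattern[i], throttle_on_failure);
        return pulls;
}

int main(void)
{
        /* made-up sequence of decisions the load check would produce */
        const bool pattern[] = { true, false, true, true, false, true, true, false };

        printf("pulls, throttle on failure: %d\n", run(pattern, 8, true));
        printf("pulls, throttle on success: %d\n", run(pattern, 8, false));
        return 0;
}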
(I still like the wakeup buddy thing, it's more effective because it
adds and uses knowledge, though without the knob, cache domain size.
Peter is right about the interrupt wakeups though, that could very
easily cause regressions, dirt simple throttle is much safer).
-Mike
Hi, Mike
Thanks for your reply.
On 05/03/2013 01:01 PM, Mike Galbraith wrote:
[snip]
>>
>> If this approach caused any concerns, please let me know ;-)
>
> I wonder if throttling on failure is the way to go. Note the minimal
> gain for pgbench with the default 1ms throttle interval. It's not very
> effective out of the box for the load type it's targeted to help, and
> people generally don't twiddle scheduler knobs. If you throttle on
> success, you directly restrict migration frequency without that being
> affected by what other tasks are doing. Seems that would be a bit more
> effective.
This is a good time to draw some conclusions on this problem ;-)
Let's suppose that when wake-affine fails, the next attempt has a higher chance
of failing too; then whether to throttle on failure becomes a question like:
	should the throttle interval cover more of the failing attempts
	or more of the succeeding ones?
Obviously we should cover more of the failing attempts, since a failed
wake_affine() just wastes cycles and changes nothing.
However, I used to be concerned about the damage of letting wake-affine succeed
at that rate; sure, it also brings benefit, but which one is bigger?
Now if we look at the RFC version, which throttled on success, the default 1ms
interval gives pgbench a benefit of less than 5%, while the current version,
which throttles on failure, brings up to 7%.
And that eliminates my concern :)
>
> (I still like the wakeup buddy thing, it's more effective because it
> adds and uses knowledge, though without the knob, cache domain size.
> Peter is right about the interrupt wakeups though, that could very
> easily cause regressions, dirt simple throttle is much safer).
Exactly, a dark issue deserves a dark solution, let darkness guide it...
Regards,
Michael Wang
>
> -Mike
>
On Fri, 2013-05-03 at 13:57 +0800, Michael Wang wrote:
> Hi, Mike
>
> Thanks for your reply.
>
> On 05/03/2013 01:01 PM, Mike Galbraith wrote:
> [snip]
> >>
> >> If this approach caused any concerns, please let me know ;-)
> >
> > I wonder if throttling on failure is the way to go. Note the minimal
> > gain for pgbench with the default 1ms throttle interval. It's not very
> > effective out of the box for the load type it's targeted to help, and
> > people generally don't twiddle scheduler knobs. If you throttle on
> > success, you directly restrict migration frequency without that being
> > affected by what other tasks are doing. Seems that would be a bit more
> > effective.
>
> This is a good timing to make some conclusion for this problem ;-)
>
> Let's suppose when wake-affine failed, next time slice got a higher
> failure chance, then whether throttle on failure could be the question like:
>
> throttle interval should cover more failure timing
> or more success timing?
>
> Obviously we should cover more failure timing, since it's just wasting
> cycle and change nothing.
>
> However, I used to concern about the damage of succeed wake-affine at
> that rapid, sure it also contain the benefit, but which one is bigger?
>
> Now if we look at the RFC version which throttle on succeed, for
> pgbench, we could find that the default 1ms benefit is < 5%, while the
> current version which throttle on failure bring 7% at most.
OK, so scratch that thought. Would still be good to find a dirt simple
dirt cheap way to increase effectiveness a bit, and eliminate the knob.
Until a better idea comes along, this helps pgbench some, and will also
help fast movers ala tbench on AMD, where select_idle_sibling() wasn't
particularly wonderful per my measurements.
-Mike
On 05/03/2013 02:14 PM, Mike Galbraith wrote:
> On Fri, 2013-05-03 at 13:57 +0800, Michael Wang wrote:
>> Hi, Mike
>>
>> Thanks for your reply.
>>
>> On 05/03/2013 01:01 PM, Mike Galbraith wrote:
>> [snip]
>>>>
>>>> If this approach caused any concerns, please let me know ;-)
>>>
>>> I wonder if throttling on failure is the way to go. Note the minimal
>>> gain for pgbench with the default 1ms throttle interval. It's not very
>>> effective out of the box for the load type it's targeted to help, and
>>> people generally don't twiddle scheduler knobs. If you throttle on
>>> success, you directly restrict migration frequency without that being
>>> affected by what other tasks are doing. Seems that would be a bit more
>>> effective.
>>
>> This is a good timing to make some conclusion for this problem ;-)
>>
>> Let's suppose when wake-affine failed, next time slice got a higher
>> failure chance, then whether throttle on failure could be the question like:
>>
>> throttle interval should cover more failure timing
>> or more success timing?
>>
>> Obviously we should cover more failure timing, since it's just wasting
>> cycle and change nothing.
>>
>> However, I used to concern about the damage of succeed wake-affine at
>> that rapid, sure it also contain the benefit, but which one is bigger?
>>
>> Now if we look at the RFC version which throttle on succeed, for
>> pgbench, we could find that the default 1ms benefit is < 5%, while the
>> current version which throttle on failure bring 7% at most.
>
> OK, so scratch that thought. Would still be good to find a dirt simple
> dirt cheap way to increase effectiveness a bit, and eliminate the knob.
> Until a better idea comes along, this helps pgbench some, and will also
> help fast movers ala tbench on AMD, where select_idle_sibling() wasn't
> particularly wonderful per my measurements.
Yep, another advantage of this approach is its simplicity: when we figure out
a better idea later, it can easily be replaced. For now, I would prefer to use
it as an urgent rescue for the workloads that suffer ;-)
Regards,
Michael Wang
>
> -Mike
>
On 05/03/2013 11:46 AM, Michael Wang wrote:
> On 04/10/2013 11:30 AM, Michael Wang wrote:
>
> Now long time has been passed since the first version, I'd like to
> do some summary about current states:
>
> On a 12 cpu box with tip 3.9.0-rc7, test show that:
>
> 1. remove wake-affine stuff cause regression on hackbench (could be 15%).
> 2. reserve wake-affine stuff cause regression on pgbench (could be 45%).
>
> And since neither remove or reserve is acceptable, this patch introduced a
> knob to throttle wake-affine stuff.
>
> By turning the knob, we could benefit both hackbench and pgbench, or sacrifice
> one to significantly benefit another.
>
> All similar workload which prefer or hate wake-affine behaviour could
> been cared.
>
> If this approach caused any concerns, please let me know ;-)
Are there any more concerns? Or should we take some action now?
Regards,
Michael Wang
>
> Regards,
> Michael Wang
>
>>
>> Recently testing show that wake-affine stuff cause regression on pgbench, the
>> hiding rat was finally catched out.
>>
>> wake-affine stuff is always trying to pull wakee close to waker, by theory,
>> this will benefit us if waker's cpu cached hot data for wakee, or the extreme
>> ping-pong case.
>>
>> However, the whole stuff is somewhat blindly, load balance is the only factor
>> to be guaranteed, and since the stuff itself is time-consuming, some workload
>> suffered, pgbench is just the one who has been found.
>>
>> Thus, throttle the wake-affine stuff for such workload is necessary.
>>
>> This patch introduced a new knob 'sysctl_sched_wake_affine_interval' with the
>> default value 1ms (default minimum balance interval), which means wake-affine
>> will keep silent for 1ms after it returned false.
>>
>> By turning the new knob, those workload who suffered will have the chance to
>> stop the regression.
>>
>> Test:
>> Test with 12 cpu X86 server and tip 3.9.0-rc2.
>>
>> default
>> base 1ms interval 10ms interval 100ms interval
>> | db_size | clients | tps | | tps | | tps | | tps |
>> +---------+---------+-------+- +-------+ +-------+ +-------+
>> | 21 MB | 1 | 10572 | | 10804 | | 10802 | | 10801 |
>> | 21 MB | 2 | 21275 | | 21533 | | 21400 | | 21498 |
>> | 21 MB | 4 | 41866 | | 42158 | | 42410 | | 42306 |
>> | 21 MB | 8 | 53931 | | 55796 | | 58608 | +8.67% | 59916 | +11.10%
>> | 21 MB | 12 | 50956 | | 52266 | | 54586 | +7.12% | 55982 | +9.86%
>> | 21 MB | 16 | 49911 | | 52862 | +5.91% | 55668 | +11.53% | 57255 | +14.71%
>> | 21 MB | 24 | 46046 | | 48820 | +6.02% | 54269 | +17.86% | 58113 | +26.21%
>> | 21 MB | 32 | 43405 | | 46635 | +7.44% | 53690 | +23.70% | 57729 | +33.00%
>> | 7483 MB | 1 | 7734 | | 8013 | | 8046 | | 7879 |
>> | 7483 MB | 2 | 19375 | | 19459 | | 19448 | | 19421 |
>> | 7483 MB | 4 | 37408 | | 37780 | | 37937 | | 37819 |
>> | 7483 MB | 8 | 49033 | | 50389 | | 51636 | +5.31% | 52294 | +6.65%
>> | 7483 MB | 12 | 45525 | | 47794 | +4.98% | 49828 | +9.45% | 50571 | +11.08%
>> | 7483 MB | 16 | 45731 | | 47921 | +4.79% | 50203 | +9.78% | 52033 | +13.78%
>> | 7483 MB | 24 | 41533 | | 44301 | +6.67% | 49697 | +19.66% | 53833 | +29.62%
>> | 7483 MB | 32 | 36370 | | 38301 | +5.31% | 48146 | +32.38% | 52795 | +45.16%
>> | 15 GB | 1 | 7576 | | 7926 | | 7722 | | 7969 |
>> | 15 GB | 2 | 19157 | | 19284 | | 19294 | | 19304 |
>> | 15 GB | 4 | 37285 | | 37539 | | 37281 | | 37508 |
>> | 15 GB | 8 | 48718 | | 49176 | | 50836 | +4.35% | 51239 | +5.17%
>> | 15 GB | 12 | 45167 | | 47180 | +4.45% | 49206 | +8.94% | 50126 | +10.98%
>> | 15 GB | 16 | 45270 | | 47293 | +4.47% | 49638 | +9.65% | 51748 | +14.31%
>> | 15 GB | 24 | 40984 | | 43366 | +5.81% | 49356 | +20.43% | 53157 | +29.70%
>> | 15 GB | 32 | 35918 | | 37632 | +4.77% | 47923 | +33.42% | 52241 | +45.45%
>>
>> Suggested-by: Peter Zijlstra <[email protected]>
>> Signed-off-by: Michael Wang <[email protected]>
>> ---
>> include/linux/sched.h | 5 +++++
>> kernel/sched/fair.c | 31 +++++++++++++++++++++++++++++++
>> kernel/sysctl.c | 10 ++++++++++
>> 3 files changed, 46 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index d35d2b6..e9efd3a 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1197,6 +1197,10 @@ enum perf_event_task_context {
>> perf_nr_task_contexts,
>> };
>>
>> +#ifdef CONFIG_SMP
>> +extern unsigned int sysctl_sched_wake_affine_interval;
>> +#endif
>> +
>> struct task_struct {
>> volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
>> void *stack;
>> @@ -1207,6 +1211,7 @@ struct task_struct {
>> #ifdef CONFIG_SMP
>> struct llist_node wake_entry;
>> int on_cpu;
>> + unsigned long next_wake_affine;
>> #endif
>> int on_rq;
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7a33e59..68eedd7 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3087,6 +3087,22 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,
>>
>> #endif
>>
>> +/*
>> + * Default is 1ms, to prevent the wake_affine() stuff working too frequently.
>> + */
>> +unsigned int sysctl_sched_wake_affine_interval = 1U;
>> +
>> +static inline int wake_affine_throttled(struct task_struct *p)
>> +{
>> + return time_before(jiffies, p->next_wake_affine);
>> +}
>> +
>> +static inline void wake_affine_throttle(struct task_struct *p)
>> +{
>> + p->next_wake_affine = jiffies +
>> + msecs_to_jiffies(sysctl_sched_wake_affine_interval);
>> +}
>> +
>> static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>> {
>> s64 this_load, load;
>> @@ -3096,6 +3112,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>> unsigned long weight;
>> int balanced;
>>
>> + if (wake_affine_throttled(p))
>> + return 0;
>> +
>> idx = sd->wake_idx;
>> this_cpu = smp_processor_id();
>> prev_cpu = task_cpu(p);
>> @@ -3167,6 +3186,18 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>
>> return 1;
>> }
>> +
>> + /*
>> + * wake_affine() stuff try to pull wakee to the cpu
>> + * around waker, this will benefit us if the data
>> + * cached on waker cpu is hot for wakee, or the extreme
>> + * ping-pong case.
>> + *
>> + * However, do such blindly work too frequently will
>> + * cause regression to some workload, thus, each time
>> + * when wake_affine() failed, throttle it for a while.
>> + */
>> + wake_affine_throttle(p);
>> return 0;
>> }
>>
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index afc1dc6..6ebfc18 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -437,6 +437,16 @@ static struct ctl_table kern_table[] = {
>> .extra1 = &one,
>> },
>> #endif
>> +#ifdef CONFIG_SMP
>> + {
>> + .procname = "sched_wake_affine_interval",
>> + .data = &sysctl_sched_wake_affine_interval,
>> + .maxlen = sizeof(unsigned int),
>> + .mode = 0644,
>> + .proc_handler = proc_dointvec_minmax,
>> + .extra1 = &zero,
>> + },
>> +#endif
>> #ifdef CONFIG_PROVE_LOCKING
>> {
>> .procname = "prove_locking",
>>
>
On 05/07/2013 10:46 AM, Michael Wang wrote:
> On 05/03/2013 11:46 AM, Michael Wang wrote:
>> On 04/10/2013 11:30 AM, Michael Wang wrote:
>>
>> Now long time has been passed since the first version, I'd like to
>> do some summary about current states:
>>
>> On a 12 cpu box with tip 3.9.0-rc7, test show that:
>>
>> 1. remove wake-affine stuff cause regression on hackbench (could be 15%).
>> 2. reserve wake-affine stuff cause regression on pgbench (could be 45%).
>>
>> And since neither remove or reserve is acceptable, this patch introduced a
>> knob to throttle wake-affine stuff.
>>
>> By turning the knob, we could benefit both hackbench and pgbench, or sacrifice
>> one to significantly benefit another.
>>
>> All similar workload which prefer or hate wake-affine behaviour could
>> been cared.
>>
>> If this approach caused any concerns, please let me know ;-)
>
> Is there any more concerns? or should we take some action now?
Seems like folks have no more concerns :)
Peter, would you like to pick up the patch? Or perhaps give some comments?
Regards,
Michael Wang
>
> Regards,
> Michael Wang
>
>>
>> Regards,
>> Michael Wang
>>
>>>
>>> Recently testing show that wake-affine stuff cause regression on pgbench, the
>>> hiding rat was finally catched out.
>>>
>>> wake-affine stuff is always trying to pull wakee close to waker, by theory,
>>> this will benefit us if waker's cpu cached hot data for wakee, or the extreme
>>> ping-pong case.
>>>
>>> However, the whole stuff is somewhat blindly, load balance is the only factor
>>> to be guaranteed, and since the stuff itself is time-consuming, some workload
>>> suffered, pgbench is just the one who has been found.
>>>
>>> Thus, throttle the wake-affine stuff for such workload is necessary.
>>>
>>> This patch introduced a new knob 'sysctl_sched_wake_affine_interval' with the
>>> default value 1ms (default minimum balance interval), which means wake-affine
>>> will keep silent for 1ms after it returned false.
>>>
>>> By turning the new knob, those workload who suffered will have the chance to
>>> stop the regression.
>>>
>>> Test:
>>> Test with 12 cpu X86 server and tip 3.9.0-rc2.
>>>
>>> default
>>> base 1ms interval 10ms interval 100ms interval
>>> | db_size | clients | tps | | tps | | tps | | tps |
>>> +---------+---------+-------+- +-------+ +-------+ +-------+
>>> | 21 MB | 1 | 10572 | | 10804 | | 10802 | | 10801 |
>>> | 21 MB | 2 | 21275 | | 21533 | | 21400 | | 21498 |
>>> | 21 MB | 4 | 41866 | | 42158 | | 42410 | | 42306 |
>>> | 21 MB | 8 | 53931 | | 55796 | | 58608 | +8.67% | 59916 | +11.10%
>>> | 21 MB | 12 | 50956 | | 52266 | | 54586 | +7.12% | 55982 | +9.86%
>>> | 21 MB | 16 | 49911 | | 52862 | +5.91% | 55668 | +11.53% | 57255 | +14.71%
>>> | 21 MB | 24 | 46046 | | 48820 | +6.02% | 54269 | +17.86% | 58113 | +26.21%
>>> | 21 MB | 32 | 43405 | | 46635 | +7.44% | 53690 | +23.70% | 57729 | +33.00%
>>> | 7483 MB | 1 | 7734 | | 8013 | | 8046 | | 7879 |
>>> | 7483 MB | 2 | 19375 | | 19459 | | 19448 | | 19421 |
>>> | 7483 MB | 4 | 37408 | | 37780 | | 37937 | | 37819 |
>>> | 7483 MB | 8 | 49033 | | 50389 | | 51636 | +5.31% | 52294 | +6.65%
>>> | 7483 MB | 12 | 45525 | | 47794 | +4.98% | 49828 | +9.45% | 50571 | +11.08%
>>> | 7483 MB | 16 | 45731 | | 47921 | +4.79% | 50203 | +9.78% | 52033 | +13.78%
>>> | 7483 MB | 24 | 41533 | | 44301 | +6.67% | 49697 | +19.66% | 53833 | +29.62%
>>> | 7483 MB | 32 | 36370 | | 38301 | +5.31% | 48146 | +32.38% | 52795 | +45.16%
>>> | 15 GB | 1 | 7576 | | 7926 | | 7722 | | 7969 |
>>> | 15 GB | 2 | 19157 | | 19284 | | 19294 | | 19304 |
>>> | 15 GB | 4 | 37285 | | 37539 | | 37281 | | 37508 |
>>> | 15 GB | 8 | 48718 | | 49176 | | 50836 | +4.35% | 51239 | +5.17%
>>> | 15 GB | 12 | 45167 | | 47180 | +4.45% | 49206 | +8.94% | 50126 | +10.98%
>>> | 15 GB | 16 | 45270 | | 47293 | +4.47% | 49638 | +9.65% | 51748 | +14.31%
>>> | 15 GB | 24 | 40984 | | 43366 | +5.81% | 49356 | +20.43% | 53157 | +29.70%
>>> | 15 GB | 32 | 35918 | | 37632 | +4.77% | 47923 | +33.42% | 52241 | +45.45%
>>>
>>> Suggested-by: Peter Zijlstra <[email protected]>
>>> Signed-off-by: Michael Wang <[email protected]>
>>> ---
>>> include/linux/sched.h | 5 +++++
>>> kernel/sched/fair.c | 31 +++++++++++++++++++++++++++++++
>>> kernel/sysctl.c | 10 ++++++++++
>>> 3 files changed, 46 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index d35d2b6..e9efd3a 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -1197,6 +1197,10 @@ enum perf_event_task_context {
>>> perf_nr_task_contexts,
>>> };
>>>
>>> +#ifdef CONFIG_SMP
>>> +extern unsigned int sysctl_sched_wake_affine_interval;
>>> +#endif
>>> +
>>> struct task_struct {
>>> volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
>>> void *stack;
>>> @@ -1207,6 +1211,7 @@ struct task_struct {
>>> #ifdef CONFIG_SMP
>>> struct llist_node wake_entry;
>>> int on_cpu;
>>> + unsigned long next_wake_affine;
>>> #endif
>>> int on_rq;
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 7a33e59..68eedd7 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -3087,6 +3087,22 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,
>>>
>>> #endif
>>>
>>> +/*
>>> + * Default is 1ms, to prevent the wake_affine() stuff working too frequently.
>>> + */
>>> +unsigned int sysctl_sched_wake_affine_interval = 1U;
>>> +
>>> +static inline int wake_affine_throttled(struct task_struct *p)
>>> +{
>>> + return time_before(jiffies, p->next_wake_affine);
>>> +}
>>> +
>>> +static inline void wake_affine_throttle(struct task_struct *p)
>>> +{
>>> + p->next_wake_affine = jiffies +
>>> + msecs_to_jiffies(sysctl_sched_wake_affine_interval);
>>> +}
>>> +
>>> static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>> {
>>> s64 this_load, load;
>>> @@ -3096,6 +3112,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>> unsigned long weight;
>>> int balanced;
>>>
>>> + if (wake_affine_throttled(p))
>>> + return 0;
>>> +
>>> idx = sd->wake_idx;
>>> this_cpu = smp_processor_id();
>>> prev_cpu = task_cpu(p);
>>> @@ -3167,6 +3186,18 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>>
>>> return 1;
>>> }
>>> +
>>> + /*
>>> + * wake_affine() stuff try to pull wakee to the cpu
>>> + * around waker, this will benefit us if the data
>>> + * cached on waker cpu is hot for wakee, or the extreme
>>> + * ping-pong case.
>>> + *
>>> + * However, do such blindly work too frequently will
>>> + * cause regression to some workload, thus, each time
>>> + * when wake_affine() failed, throttle it for a while.
>>> + */
>>> + wake_affine_throttle(p);
>>> return 0;
>>> }
>>>
>>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>>> index afc1dc6..6ebfc18 100644
>>> --- a/kernel/sysctl.c
>>> +++ b/kernel/sysctl.c
>>> @@ -437,6 +437,16 @@ static struct ctl_table kern_table[] = {
>>> .extra1 = &one,
>>> },
>>> #endif
>>> +#ifdef CONFIG_SMP
>>> + {
>>> + .procname = "sched_wake_affine_interval",
>>> + .data = &sysctl_sched_wake_affine_interval,
>>> + .maxlen = sizeof(unsigned int),
>>> + .mode = 0644,
>>> + .proc_handler = proc_dointvec_minmax,
>>> + .extra1 = &zero,
>>> + },
>>> +#endif
>>> #ifdef CONFIG_PROVE_LOCKING
>>> {
>>> .procname = "prove_locking",
>>>
>>
On 05/03/2013 11:46 AM, Michael Wang wrote:
> On 04/10/2013 11:30 AM, Michael Wang wrote:
>
> Now long time has been passed since the first version, I'd like to
> do some summary about current states:
>
> On a 12 cpu box with tip 3.9.0-rc7, test show that:
>
> 1. remove wake-affine stuff cause regression on hackbench (could be 15%).
> 2. reserve wake-affine stuff cause regression on pgbench (could be 45%).
>
> And since neither remove or reserve is acceptable, this patch introduced a
> knob to throttle wake-affine stuff.
>
> By turning the knob, we could benefit both hackbench and pgbench, or sacrifice
> one to significantly benefit another.
>
> All similar workload which prefer or hate wake-affine behaviour could
> been cared.
>
> If this approach caused any concerns, please let me know ;-)
I've just tested tip 3.10.0-rc1; it inherits the same issue from 3.9.0,
pgbench is still sacrificed and we still don't have any way to stop the damage...
Regards,
Michael Wang
>
> Regards,
> Michael Wang
>
>>
>> Recently testing show that wake-affine stuff cause regression on pgbench, the
>> hiding rat was finally catched out.
>>
>> wake-affine stuff is always trying to pull wakee close to waker, by theory,
>> this will benefit us if waker's cpu cached hot data for wakee, or the extreme
>> ping-pong case.
>>
>> However, the whole stuff is somewhat blindly, load balance is the only factor
>> to be guaranteed, and since the stuff itself is time-consuming, some workload
>> suffered, pgbench is just the one who has been found.
>>
>> Thus, throttle the wake-affine stuff for such workload is necessary.
>>
>> This patch introduced a new knob 'sysctl_sched_wake_affine_interval' with the
>> default value 1ms (default minimum balance interval), which means wake-affine
>> will keep silent for 1ms after it returned false.
>>
>> By turning the new knob, those workload who suffered will have the chance to
>> stop the regression.
>>
>> Test:
>> Test with 12 cpu X86 server and tip 3.9.0-rc2.
>>
>> default
>> base 1ms interval 10ms interval 100ms interval
>> | db_size | clients | tps | | tps | | tps | | tps |
>> +---------+---------+-------+- +-------+ +-------+ +-------+
>> | 21 MB | 1 | 10572 | | 10804 | | 10802 | | 10801 |
>> | 21 MB | 2 | 21275 | | 21533 | | 21400 | | 21498 |
>> | 21 MB | 4 | 41866 | | 42158 | | 42410 | | 42306 |
>> | 21 MB | 8 | 53931 | | 55796 | | 58608 | +8.67% | 59916 | +11.10%
>> | 21 MB | 12 | 50956 | | 52266 | | 54586 | +7.12% | 55982 | +9.86%
>> | 21 MB | 16 | 49911 | | 52862 | +5.91% | 55668 | +11.53% | 57255 | +14.71%
>> | 21 MB | 24 | 46046 | | 48820 | +6.02% | 54269 | +17.86% | 58113 | +26.21%
>> | 21 MB | 32 | 43405 | | 46635 | +7.44% | 53690 | +23.70% | 57729 | +33.00%
>> | 7483 MB | 1 | 7734 | | 8013 | | 8046 | | 7879 |
>> | 7483 MB | 2 | 19375 | | 19459 | | 19448 | | 19421 |
>> | 7483 MB | 4 | 37408 | | 37780 | | 37937 | | 37819 |
>> | 7483 MB | 8 | 49033 | | 50389 | | 51636 | +5.31% | 52294 | +6.65%
>> | 7483 MB | 12 | 45525 | | 47794 | +4.98% | 49828 | +9.45% | 50571 | +11.08%
>> | 7483 MB | 16 | 45731 | | 47921 | +4.79% | 50203 | +9.78% | 52033 | +13.78%
>> | 7483 MB | 24 | 41533 | | 44301 | +6.67% | 49697 | +19.66% | 53833 | +29.62%
>> | 7483 MB | 32 | 36370 | | 38301 | +5.31% | 48146 | +32.38% | 52795 | +45.16%
>> | 15 GB | 1 | 7576 | | 7926 | | 7722 | | 7969 |
>> | 15 GB | 2 | 19157 | | 19284 | | 19294 | | 19304 |
>> | 15 GB | 4 | 37285 | | 37539 | | 37281 | | 37508 |
>> | 15 GB | 8 | 48718 | | 49176 | | 50836 | +4.35% | 51239 | +5.17%
>> | 15 GB | 12 | 45167 | | 47180 | +4.45% | 49206 | +8.94% | 50126 | +10.98%
>> | 15 GB | 16 | 45270 | | 47293 | +4.47% | 49638 | +9.65% | 51748 | +14.31%
>> | 15 GB | 24 | 40984 | | 43366 | +5.81% | 49356 | +20.43% | 53157 | +29.70%
>> | 15 GB | 32 | 35918 | | 37632 | +4.77% | 47923 | +33.42% | 52241 | +45.45%
>>
>> Suggested-by: Peter Zijlstra <[email protected]>
>> Signed-off-by: Michael Wang <[email protected]>
>> ---
>> include/linux/sched.h | 5 +++++
>> kernel/sched/fair.c | 31 +++++++++++++++++++++++++++++++
>> kernel/sysctl.c | 10 ++++++++++
>> 3 files changed, 46 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index d35d2b6..e9efd3a 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1197,6 +1197,10 @@ enum perf_event_task_context {
>> perf_nr_task_contexts,
>> };
>>
>> +#ifdef CONFIG_SMP
>> +extern unsigned int sysctl_sched_wake_affine_interval;
>> +#endif
>> +
>> struct task_struct {
>> volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
>> void *stack;
>> @@ -1207,6 +1211,7 @@ struct task_struct {
>> #ifdef CONFIG_SMP
>> struct llist_node wake_entry;
>> int on_cpu;
>> + unsigned long next_wake_affine;
>> #endif
>> int on_rq;
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7a33e59..68eedd7 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3087,6 +3087,22 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,
>>
>> #endif
>>
>> +/*
>> + * Default is 1ms, to prevent the wake_affine() stuff working too frequently.
>> + */
>> +unsigned int sysctl_sched_wake_affine_interval = 1U;
>> +
>> +static inline int wake_affine_throttled(struct task_struct *p)
>> +{
>> + return time_before(jiffies, p->next_wake_affine);
>> +}
>> +
>> +static inline void wake_affine_throttle(struct task_struct *p)
>> +{
>> + p->next_wake_affine = jiffies +
>> + msecs_to_jiffies(sysctl_sched_wake_affine_interval);
>> +}
>> +
>> static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>> {
>> s64 this_load, load;
>> @@ -3096,6 +3112,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>> unsigned long weight;
>> int balanced;
>>
>> + if (wake_affine_throttled(p))
>> + return 0;
>> +
>> idx = sd->wake_idx;
>> this_cpu = smp_processor_id();
>> prev_cpu = task_cpu(p);
>> @@ -3167,6 +3186,18 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>
>> return 1;
>> }
>> +
>> + /*
>> + * wake_affine() stuff try to pull wakee to the cpu
>> + * around waker, this will benefit us if the data
>> + * cached on waker cpu is hot for wakee, or the extreme
>> + * ping-pong case.
>> + *
>> + * However, do such blindly work too frequently will
>> + * cause regression to some workload, thus, each time
>> + * when wake_affine() failed, throttle it for a while.
>> + */
>> + wake_affine_throttle(p);
>> return 0;
>> }
>>
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index afc1dc6..6ebfc18 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -437,6 +437,16 @@ static struct ctl_table kern_table[] = {
>> .extra1 = &one,
>> },
>> #endif
>> +#ifdef CONFIG_SMP
>> + {
>> + .procname = "sched_wake_affine_interval",
>> + .data = &sysctl_sched_wake_affine_interval,
>> + .maxlen = sizeof(unsigned int),
>> + .mode = 0644,
>> + .proc_handler = proc_dointvec_minmax,
>> + .extra1 = &zero,
>> + },
>> +#endif
>> #ifdef CONFIG_PROVE_LOCKING
>> {
>> .procname = "prove_locking",
>>
>
On 04/10/2013 11:30 AM, Michael Wang wrote:
> Log since RFC:
> 1. Throttle only when wake-affine failed. (thanks to PeterZ)
> 2. Do throttle inside wake_affine(). (thanks to PeterZ)
> 3. Other small fix.
>
> Recently testing show that wake-affine stuff cause regression on pgbench, the
> hiding rat was finally catched out.
>
> wake-affine stuff is always trying to pull wakee close to waker, by theory,
> this will benefit us if waker's cpu cached hot data for wakee, or the extreme
> ping-pong case.
>
> However, the whole stuff is somewhat blindly, load balance is the only factor
> to be guaranteed, and since the stuff itself is time-consuming, some workload
> suffered, pgbench is just the one who has been found.
>
> Thus, throttle the wake-affine stuff for such workload is necessary.
>
> This patch introduced a new knob 'sysctl_sched_wake_affine_interval' with the
> default value 1ms (default minimum balance interval), which means wake-affine
> will keep silent for 1ms after it returned false.
>
> By turning the new knob, those workload who suffered will have the chance to
> stop the regression.
>
> Test:
> Test with 12 cpu X86 server and tip 3.9.0-rc2.
>
> default
> base 1ms interval 10ms interval 100ms interval
> | db_size | clients | tps | | tps | | tps | | tps |
> +---------+---------+-------+- +-------+ +-------+ +-------+
> | 21 MB | 1 | 10572 | | 10804 | | 10802 | | 10801 |
> | 21 MB | 2 | 21275 | | 21533 | | 21400 | | 21498 |
> | 21 MB | 4 | 41866 | | 42158 | | 42410 | | 42306 |
> | 21 MB | 8 | 53931 | | 55796 | | 58608 | +8.67% | 59916 | +11.10%
> | 21 MB | 12 | 50956 | | 52266 | | 54586 | +7.12% | 55982 | +9.86%
> | 21 MB | 16 | 49911 | | 52862 | +5.91% | 55668 | +11.53% | 57255 | +14.71%
> | 21 MB | 24 | 46046 | | 48820 | +6.02% | 54269 | +17.86% | 58113 | +26.21%
> | 21 MB | 32 | 43405 | | 46635 | +7.44% | 53690 | +23.70% | 57729 | +33.00%
> | 7483 MB | 1 | 7734 | | 8013 | | 8046 | | 7879 |
> | 7483 MB | 2 | 19375 | | 19459 | | 19448 | | 19421 |
> | 7483 MB | 4 | 37408 | | 37780 | | 37937 | | 37819 |
> | 7483 MB | 8 | 49033 | | 50389 | | 51636 | +5.31% | 52294 | +6.65%
> | 7483 MB | 12 | 45525 | | 47794 | +4.98% | 49828 | +9.45% | 50571 | +11.08%
> | 7483 MB | 16 | 45731 | | 47921 | +4.79% | 50203 | +9.78% | 52033 | +13.78%
> | 7483 MB | 24 | 41533 | | 44301 | +6.67% | 49697 | +19.66% | 53833 | +29.62%
> | 7483 MB | 32 | 36370 | | 38301 | +5.31% | 48146 | +32.38% | 52795 | +45.16%
> | 15 GB | 1 | 7576 | | 7926 | | 7722 | | 7969 |
> | 15 GB | 2 | 19157 | | 19284 | | 19294 | | 19304 |
> | 15 GB | 4 | 37285 | | 37539 | | 37281 | | 37508 |
> | 15 GB | 8 | 48718 | | 49176 | | 50836 | +4.35% | 51239 | +5.17%
> | 15 GB | 12 | 45167 | | 47180 | +4.45% | 49206 | +8.94% | 50126 | +10.98%
> | 15 GB | 16 | 45270 | | 47293 | +4.47% | 49638 | +9.65% | 51748 | +14.31%
> | 15 GB | 24 | 40984 | | 43366 | +5.81% | 49356 | +20.43% | 53157 | +29.70%
> | 15 GB | 32 | 35918 | | 37632 | +4.77% | 47923 | +33.42% | 52241 | +45.45%
>
Add cc list :)
CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Mike Galbraith <[email protected]>
CC: Alex Shi <[email protected]>
> Suggested-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Michael Wang <[email protected]>
> ---
> include/linux/sched.h | 5 +++++
> kernel/sched/fair.c | 31 +++++++++++++++++++++++++++++++
> kernel/sysctl.c | 10 ++++++++++
> 3 files changed, 46 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d35d2b6..e9efd3a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1197,6 +1197,10 @@ enum perf_event_task_context {
> perf_nr_task_contexts,
> };
>
> +#ifdef CONFIG_SMP
> +extern unsigned int sysctl_sched_wake_affine_interval;
> +#endif
> +
> struct task_struct {
> volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
> void *stack;
> @@ -1207,6 +1211,7 @@ struct task_struct {
> #ifdef CONFIG_SMP
> struct llist_node wake_entry;
> int on_cpu;
> + unsigned long next_wake_affine;
> #endif
> int on_rq;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a33e59..68eedd7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3087,6 +3087,22 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,
>
> #endif
>
> +/*
> + * Default is 1ms, to prevent the wake_affine() stuff working too frequently.
> + */
> +unsigned int sysctl_sched_wake_affine_interval = 1U;
> +
> +static inline int wake_affine_throttled(struct task_struct *p)
> +{
> + return time_before(jiffies, p->next_wake_affine);
> +}
> +
> +static inline void wake_affine_throttle(struct task_struct *p)
> +{
> + p->next_wake_affine = jiffies +
> + msecs_to_jiffies(sysctl_sched_wake_affine_interval);
> +}
> +
> static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> {
> s64 this_load, load;
> @@ -3096,6 +3112,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> unsigned long weight;
> int balanced;
>
> + if (wake_affine_throttled(p))
> + return 0;
> +
> idx = sd->wake_idx;
> this_cpu = smp_processor_id();
> prev_cpu = task_cpu(p);
> @@ -3167,6 +3186,18 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>
> return 1;
> }
> +
> + /*
> + * wake_affine() stuff try to pull wakee to the cpu
> + * around waker, this will benefit us if the data
> + * cached on waker cpu is hot for wakee, or the extreme
> + * ping-pong case.
> + *
> + * However, do such blindly work too frequently will
> + * cause regression to some workload, thus, each time
> + * when wake_affine() failed, throttle it for a while.
> + */
> + wake_affine_throttle(p);
> return 0;
> }
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index afc1dc6..6ebfc18 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -437,6 +437,16 @@ static struct ctl_table kern_table[] = {
> .extra1 = &one,
> },
> #endif
> +#ifdef CONFIG_SMP
> + {
> + .procname = "sched_wake_affine_interval",
> + .data = &sysctl_sched_wake_affine_interval,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = &zero,
> + },
> +#endif
> #ifdef CONFIG_PROVE_LOCKING
> {
> .procname = "prove_locking",
>
Log since v1:
Add cc list.
Add more comments.
Tested on tip 3.10.0-rc1.
The wake-affine stuff always tries to pull the wakee close to the waker; in theory
this benefits us if the waker's cpu has cached data that is hot for the wakee, or in
the extreme ping-pong case, and testing shows it can benefit hackbench by up to 15%.
However, the whole feature works somewhat blindly; load balance is the only factor
it guarantees, and since the stuff itself is time-consuming, some workloads
suffer, and testing shows it can damage pgbench by up to 41%.
The feature is currently settled in mainline, which means the current scheduler
forcibly sacrifices some workloads to benefit others, and that is definitely unfair.
Thus, this patch provides a way to throttle the wake-affine stuff, in order to
adjust the gain and loss according to demand.
The patch introduces a new knob 'sysctl_sched_wake_affine_interval' with a
default value of 1ms (the default minimum balance interval), which means wake-affine
will keep silent for 1ms after a failure.
By tuning the new knob, pgbench shows up to a 41% improvement compared with
mainline, which currently uses wake-affine blindly.
Link:
Analysis from Mike Galbraith about the improvement:
https://lkml.org/lkml/2013/4/11/54
Analysis of the reason for throttling on failure:
https://lkml.org/lkml/2013/5/3/31
Test:
Test with 12 cpu X86 server and tip 3.10.0-rc1.
default
base 1ms interval 10ms interval 100ms interval
| db_size | clients | tps | | tps | | tps | | tps |
+---------+---------+-------+ +-------+ +-------+ +-------+
| 22 MB | 1 | 10828 | | 10850 | | 10795 | | 10845 |
| 22 MB | 2 | 21434 | | 21469 | | 21463 | | 21455 |
| 22 MB | 4 | 41563 | | 41826 | | 41789 | | 41779 |
| 22 MB | 8 | 53451 | | 54917 | | 59250 | | 59097 |
| 22 MB | 12 | 48681 | | 50454 | | 53248 | | 54881 |
| 22 MB | 16 | 46352 | | 49627 | +7.07% | 54029 | +16.56% | 55935 | +20.67%
| 22 MB | 24 | 44200 | | 46745 | +5.76% | 52106 | +17.89% | 57907 | +31.01%
| 22 MB | 32 | 43567 | | 45264 | +3.90% | 51463 | +18.12% | 57122 | +31.11%
| 7484 MB | 1 | 8926 | | 8959 | | 8765 | | 8682 |
| 7484 MB | 2 | 19308 | | 19470 | | 19397 | | 19409 |
| 7484 MB | 4 | 37269 | | 37501 | | 37552 | | 37470 |
| 7484 MB | 8 | 47277 | | 48452 | | 51535 | | 52095 |
| 7484 MB | 12 | 42815 | | 45347 | | 48478 | | 49256 |
| 7484 MB | 16 | 40951 | | 44063 | +7.60% | 48536 | +18.52% | 51141 | +24.88%
| 7484 MB | 24 | 37389 | | 39620 | +5.97% | 47052 | +25.84% | 52720 | +41.00%
| 7484 MB | 32 | 36705 | | 38109 | +3.83% | 45932 | +25.14% | 51456 | +40.19%
| 15 GB | 1 | 8642 | | 8850 | | 9092 | | 8560 |
| 15 GB | 2 | 19256 | | 19285 | | 19362 | | 19322 |
| 15 GB | 4 | 37114 | | 37131 | | 37221 | | 37257 |
| 15 GB | 8 | 47120 | | 48053 | | 50845 | | 50923 |
| 15 GB | 12 | 42386 | | 44748 | | 47868 | | 48875 |
| 15 GB | 16 | 40624 | | 43414 | +6.87% | 48169 | +18.57% | 50814 | +25.08%
| 15 GB | 24 | 37110 | | 39096 | +5.35% | 46594 | +25.56% | 52477 | +41.41%
| 15 GB | 32 | 36252 | | 37316 | +2.94% | 45327 | +25.03% | 51217 | +41.28%
CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Mike Galbraith <[email protected]>
CC: Alex Shi <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
include/linux/sched.h | 5 +++++
kernel/sched/fair.c | 35 +++++++++++++++++++++++++++++++++++
kernel/sysctl.c | 10 ++++++++++
3 files changed, 50 insertions(+), 0 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..1af1473 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1031,6 +1031,10 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};
+#ifdef CONFIG_SMP
+extern unsigned int sysctl_sched_wake_affine_interval;
+#endif
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1041,6 +1045,7 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
+ unsigned long next_wake_affine;
#endif
int on_rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f62b16d..417fa87 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3127,6 +3127,22 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,
#endif
+/*
+ * Default is 1ms, to prevent the wake_affine() stuff from working too frequently.
+ */
+unsigned int sysctl_sched_wake_affine_interval = 1U;
+
+static inline int wake_affine_throttled(struct task_struct *p)
+{
+ return time_before(jiffies, p->next_wake_affine);
+}
+
+static inline void wake_affine_throttle(struct task_struct *p)
+{
+ p->next_wake_affine = jiffies +
+ msecs_to_jiffies(sysctl_sched_wake_affine_interval);
+}
+
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
{
s64 this_load, load;
@@ -3136,6 +3152,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
unsigned long weight;
int balanced;
+ if (wake_affine_throttled(p))
+ return 0;
+
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -3207,6 +3226,22 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
return 1;
}
+
+ /*
+ * wake_affine() tries to pull the wakee to a cpu near the
+ * waker; this benefits us if the data cached on the waker's
+ * cpu is hot for the wakee, or in the extreme ping-pong case.
+ *
+ * However, doing such blind work too frequently causes
+ * regressions for some workloads; thus, each time
+ * wake_affine() fails, throttle it for a while.
+ *
+ * Throttling on failure is supposed to make the interval
+ * cover more failures, since a failed wake_affine()
+ * is nothing but wasted cpu cycles.
+ */
+ wake_affine_throttle(p);
return 0;
}
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9edcf45..3ca46d7 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -436,6 +436,16 @@ static struct ctl_table kern_table[] = {
.extra1 = &one,
},
#endif
+#ifdef CONFIG_SMP
+ {
+ .procname = "sched_wake_affine_interval",
+ .data = &sysctl_sched_wake_affine_interval,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",
--
1.7.4.1
On 05/21/2013 11:20 AM, Michael Wang wrote:
> CC: Mike Galbraith <[email protected]>
> CC: Alex Shi <[email protected]>
You can change CC to
Reviewed-by: Alex Shi <[email protected]>
--
Thanks
Alex
On 05/21/2013 02:47 PM, Alex Shi wrote:
> On 05/21/2013 11:20 AM, Michael Wang wrote:
>> CC: Mike Galbraith <[email protected]>
>> CC: Alex Shi <[email protected]>
>
> You can change CC to
> Reviewed-by: Alex Shi <[email protected]>
Got it, thanks for your review :)
Regards,
Michael Wang
>
On Tue, May 21, 2013 at 11:20:18AM +0800, Michael Wang wrote:
>
> wake-affine stuff is always trying to pull wakee close to waker, by theory,
> this will benefit us if waker's cpu cached hot data for wakee, or the extreme
> ping-pong case, and testing show it could benefit hackbench 15% at most.
>
> However, the whole feature is somewhat blindly, load balance is the only factor
> to be guaranteed, and since the stuff itself is time-consuming, some workload
> suffered, and testing show it could damage pgbench 41% at most.
>
> The feature currently settled in mainline, which means the current scheduler
> force sacrificed some workloads to benefit others, that is definitely unfair.
>
> Thus, this patch provide the way to throttle wake-affine stuff, in order to
> adjust the gain and loss according to demand.
>
> The patch introduced a new knob 'sysctl_sched_wake_affine_interval' with the
> default value 1ms (default minimum balance interval), which means wake-affine
> will keep silent for 1ms after it's failure.
>
> By turning the new knob, compared with mainline, which currently blindly using
> wake-affine, pgbench show 41% improvement at most.
>
> Link:
> Analysis from Mike Galbraith about the improvement:
> https://lkml.org/lkml/2013/4/11/54
>
> Analysis about the reason of throttle after failed:
> https://lkml.org/lkml/2013/5/3/31
>
> Test:
> Test with 12 cpu X86 server and tip 3.10.0-rc1.
>
> default
> base 1ms interval 10ms interval 100ms interval
> | db_size | clients | tps | | tps | | tps | | tps |
> +---------+---------+-------+ +-------+ +-------+ +-------+
> | 22 MB | 1 | 10828 | | 10850 | | 10795 | | 10845 |
> | 22 MB | 2 | 21434 | | 21469 | | 21463 | | 21455 |
> | 22 MB | 4 | 41563 | | 41826 | | 41789 | | 41779 |
> | 22 MB | 8 | 53451 | | 54917 | | 59250 | | 59097 |
> | 22 MB | 12 | 48681 | | 50454 | | 53248 | | 54881 |
> | 22 MB | 16 | 46352 | | 49627 | +7.07% | 54029 | +16.56% | 55935 | +20.67%
> | 22 MB | 24 | 44200 | | 46745 | +5.76% | 52106 | +17.89% | 57907 | +31.01%
> | 22 MB | 32 | 43567 | | 45264 | +3.90% | 51463 | +18.12% | 57122 | +31.11%
> | 7484 MB | 1 | 8926 | | 8959 | | 8765 | | 8682 |
> | 7484 MB | 2 | 19308 | | 19470 | | 19397 | | 19409 |
> | 7484 MB | 4 | 37269 | | 37501 | | 37552 | | 37470 |
> | 7484 MB | 8 | 47277 | | 48452 | | 51535 | | 52095 |
> | 7484 MB | 12 | 42815 | | 45347 | | 48478 | | 49256 |
> | 7484 MB | 16 | 40951 | | 44063 | +7.60% | 48536 | +18.52% | 51141 | +24.88%
> | 7484 MB | 24 | 37389 | | 39620 | +5.97% | 47052 | +25.84% | 52720 | +41.00%
> | 7484 MB | 32 | 36705 | | 38109 | +3.83% | 45932 | +25.14% | 51456 | +40.19%
> | 15 GB | 1 | 8642 | | 8850 | | 9092 | | 8560 |
> | 15 GB | 2 | 19256 | | 19285 | | 19362 | | 19322 |
> | 15 GB | 4 | 37114 | | 37131 | | 37221 | | 37257 |
> | 15 GB | 8 | 47120 | | 48053 | | 50845 | | 50923 |
> | 15 GB | 12 | 42386 | | 44748 | | 47868 | | 48875 |
> | 15 GB | 16 | 40624 | | 43414 | +6.87% | 48169 | +18.57% | 50814 | +25.08%
> | 15 GB | 24 | 37110 | | 39096 | +5.35% | 46594 | +25.56% | 52477 | +41.41%
> | 15 GB | 32 | 36252 | | 37316 | +2.94% | 45327 | +25.03% | 51217 | +41.28%
>
> CC: Ingo Molnar <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Mike Galbraith <[email protected]>
> CC: Alex Shi <[email protected]>
> Suggested-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Michael Wang <[email protected]>
So I utterly hate this patch. I hate it worse than your initial buddy
patch :/
And I know it's got a Suggested-by there; but that was when you led me to
believe that wake_affine() itself was expensive to run; it's not, it's the
result of those runs you don't like.
While we have a ton of (too many, to be sure) scheduler tunables, users
shouldn't ever need to actually touch those. It's just that every time we
have to make a random choice it's as easy to make it a debug knob as to
hardcode it.
The problem with this patch is that users _have_ to frob knobs and while
doing so potentially wreck other workloads.
To make it worse, the knob isn't anything fundamental, it's a random
hack.
So I would really either improve the smarts of wake_affine, with for
example your wake buddy relation thing (and simply exempt [Soft]IRQs) or
kill wake_affine and be done with it.
Either avenue has the risk of regressing some workload, but at least
when that happens (and people report it) we'll have a counter-example to
learn from and incorporate.
On 05/22/2013 04:49 PM, Peter Zijlstra wrote:
[snip]
>>
>> CC: Ingo Molnar <[email protected]>
>> CC: Peter Zijlstra <[email protected]>
>> CC: Mike Galbraith <[email protected]>
>> CC: Alex Shi <[email protected]>
>> Suggested-by: Peter Zijlstra <[email protected]>
>> Signed-off-by: Michael Wang <[email protected]>
Thanks for your reply, I've been looking forward to it for a long time...
>
> So I utterly hate this patch. I hate it worse than your initial buddy
> patch :/
Then we nuke it, and figure out a better one ;-)
>
> And I know its got a Suggested-by there; but that was when you led me to
> believe that wake_affine() itself was expensive to run; its not, its the
> result of those runs you don't like.
Both are reasons; it's just a trade-off between gain, loss and cost. Your
suggestion definitely is a good choice, otherwise I wouldn't have spent time on
it, and I would call it the best one if we were looking for a quick fix.
>
> While we have a ton (too many to be sure) scheduler tunables, users
> shouldn't ever need to actually touch those. Its just that every time we
> have to make a random choice its as easy to make it a debug knob as to
> hardcode it.
>
> The problem with this patch is that users _have_ to frob knobs and while
> doing so potentially wreck other workloads.
>
> To make it worse, the knob isn't anything fundamental, its a random
> hack.
So we discard it.
>
> So I would really either improve the smarts of wake_affine, with for
> example your wake buddy relation thing (and simply exempt [Soft]IRQs) or
> kill wake_affine and be done with it.
No kill... we show mercy. I will go back to the wakeup-buddy approach, and let's
forget the IRQ case temporarily unless some regression report appears.
>
> Either avenue has the risk of regressing some workload, but at least
> when that happens (and people report it) we'll have a counter-example to
> learn from and incorporate.
I've not tested hackbench with wakeup-buddy before, I will do it this
time; I suppose the 15% 'illegal income' will suffer, but anyway, it's
illegal :)
Regards,
Michael Wang
On Wed, 2013-05-22 at 17:25 +0800, Michael Wang wrote:
> I've not test the hackbench with wakeup-buddy before, will do it this
> time, I suppose the 15% illegal income will suffered, anyway, it's
> illegal :)
On a 4 socket 40 core (+SMT) box, hackbench wasn't too happy.
defaultx = wakeup buddy
tbench 30 localhost
1 5 10 20 40 80 160 320
3.10.0-default 188.13 953.03 1860.03 3180.93 5378.34 10826.06 11342.90 11651.30
3.10.0-defaultx 187.26 934.55 1859.30 3160.29 5016.35 12477.15 12567.05 12068.47
hackbench -l 400 -g 400
3.10.0-default Time: 3.919 4.250 4.116
3.10.0-defaultx Time: 9.074 10.985 9.849
aim7 compute
AIM Multiuser Benchmark - Suite VII "1.1" 3.10.0-default AIM Multiuser Benchmark - Suite VII "1.1" 3.10.0-defaultx
Tasks Jobs/Min JTI Real CPU Jobs/sec/task Tasks Jobs/Min JTI Real CPU Jobs/sec/task
1 428.0 100 14.2 4.1 7.1328 1 428.9 100 14.1 4.1 7.1479
1 417.9 100 14.5 4.1 6.9655 1 430.4 100 14.1 4.0 7.1733
1 427.7 100 14.2 4.2 7.1277 1 424.7 100 14.3 4.2 7.0778
5 2350.7 99 12.9 13.8 7.8355 5 2156.6 99 14.1 19.8 7.1886
5 2422.1 99 12.5 12.1 8.0735 5 2155.0 99 14.1 19.7 7.1835
5 2189.3 98 13.8 18.3 7.2977 5 2108.6 99 14.4 21.4 7.0285
10 4515.6 93 13.4 27.6 7.5261 10 4529.1 96 13.4 29.5 7.5486
10 4708.6 96 12.9 24.3 7.8477 10 4597.9 96 13.2 26.9 7.6631
10 4636.6 96 13.1 25.7 7.7276 10 5197.3 98 11.7 14.8 8.6621
20 8053.2 95 15.1 78.1 6.7110 20 9431.9 98 12.8 49.1 7.8599
20 8250.5 92 14.7 67.5 6.8754 20 7973.7 97 15.2 93.4 6.6447
20 8178.1 97 14.8 78.5 6.8151 20 8145.2 95 14.9 78.4 6.7876
40 17413.8 94 13.9 88.6 7.2557 40 16312.2 92 14.9 115.6 6.7968
40 16775.1 93 14.5 111.6 6.9896 40 17070.4 94 14.2 110.6 7.1127
40 16031.7 93 15.1 147.1 6.6799 40 17578.0 94 13.8 96.9 7.3241
80 33666.7 95 14.4 138.4 7.0139 80 33854.7 96 14.3 177.9 7.0531
80 33949.6 97 14.3 128.0 7.0728 80 34164.9 96 14.2 146.4 7.1177
80 35752.2 96 13.6 159.1 7.4484 80 33807.5 96 14.3 127.6 7.0432
160 74814.8 98 13.0 149.8 7.7932 160 75162.8 98 12.9 148.6 7.8295
160 74015.3 97 13.1 149.5 7.7099 160 74642.0 98 13.0 168.2 7.7752
160 73621.9 98 13.2 146.3 7.6689 160 75572.9 98 12.8 163.6 7.8722
320 139210.3 96 13.9 280.1 7.2505 320 139010.8 97 14.0 282.4 7.2401
320 135135.9 96 14.3 277.8 7.0383 320 139611.2 96 13.9 282.2 7.2714
320 139110.5 96 13.9 280.4 7.2453 320 138514.3 97 14.0 281.0 7.2143
640 223538.9 98 17.4 577.2 5.8213 640 224704.5 95 17.3 567.3 5.8517
640 224055.5 97 17.3 575.7 5.8348 640 222385.3 95 17.4 580.1 5.7913
640 225488.4 96 17.2 566.3 5.8721 640 225882.4 93 17.2 563.0 5.8824
On 05/22/2013 10:55 PM, Mike Galbraith wrote:
> On Wed, 2013-05-22 at 17:25 +0800, Michael Wang wrote:
>
>> I've not test the hackbench with wakeup-buddy before, will do it this
>> time, I suppose the 15% illegal income will suffered, anyway, it's
>> illegal :)
>
Thanks, Mike, this is very helpful data ;-)
> On a 4 socket 40 core (+SMT) box, hackbench wasn't too happy.
I'm not sure whether this is caused by the SoftIRQ case; I have some ideas
in mind which may help to solve that issue, but if this huge
regression is caused by other reasons...
Anyway, I will rework the idea and let's see whether we can make
things better :)
Regards,
Michael Wang
>
> defaultx = wakeup buddy
>
> tbench 30 localhost
> 1 5 10 20 40 80 160 320
> 3.10.0-default 188.13 953.03 1860.03 3180.93 5378.34 10826.06 11342.90 11651.30
> 3.10.0-defaultx 187.26 934.55 1859.30 3160.29 5016.35 12477.15 12567.05 12068.47
>
> hackbench -l 400 -g 400
> 3.10.0-default Time: 3.919 4.250 4.116
> 3.10.0-defaultx Time: 9.074 10.985 9.849
>
> aim7 compute
> AIM Multiuser Benchmark - Suite VII "1.1" 3.10.0-default AIM Multiuser Benchmark - Suite VII "1.1" 3.10.0-defaultx
>
> Tasks Jobs/Min JTI Real CPU Jobs/sec/task Tasks Jobs/Min JTI Real CPU Jobs/sec/task
> 1 428.0 100 14.2 4.1 7.1328 1 428.9 100 14.1 4.1 7.1479
> 1 417.9 100 14.5 4.1 6.9655 1 430.4 100 14.1 4.0 7.1733
> 1 427.7 100 14.2 4.2 7.1277 1 424.7 100 14.3 4.2 7.0778
> 5 2350.7 99 12.9 13.8 7.8355 5 2156.6 99 14.1 19.8 7.1886
> 5 2422.1 99 12.5 12.1 8.0735 5 2155.0 99 14.1 19.7 7.1835
> 5 2189.3 98 13.8 18.3 7.2977 5 2108.6 99 14.4 21.4 7.0285
> 10 4515.6 93 13.4 27.6 7.5261 10 4529.1 96 13.4 29.5 7.5486
> 10 4708.6 96 12.9 24.3 7.8477 10 4597.9 96 13.2 26.9 7.6631
> 10 4636.6 96 13.1 25.7 7.7276 10 5197.3 98 11.7 14.8 8.6621
> 20 8053.2 95 15.1 78.1 6.7110 20 9431.9 98 12.8 49.1 7.8599
> 20 8250.5 92 14.7 67.5 6.8754 20 7973.7 97 15.2 93.4 6.6447
> 20 8178.1 97 14.8 78.5 6.8151 20 8145.2 95 14.9 78.4 6.7876
> 40 17413.8 94 13.9 88.6 7.2557 40 16312.2 92 14.9 115.6 6.7968
> 40 16775.1 93 14.5 111.6 6.9896 40 17070.4 94 14.2 110.6 7.1127
> 40 16031.7 93 15.1 147.1 6.6799 40 17578.0 94 13.8 96.9 7.3241
> 80 33666.7 95 14.4 138.4 7.0139 80 33854.7 96 14.3 177.9 7.0531
> 80 33949.6 97 14.3 128.0 7.0728 80 34164.9 96 14.2 146.4 7.1177
> 80 35752.2 96 13.6 159.1 7.4484 80 33807.5 96 14.3 127.6 7.0432
> 160 74814.8 98 13.0 149.8 7.7932 160 75162.8 98 12.9 148.6 7.8295
> 160 74015.3 97 13.1 149.5 7.7099 160 74642.0 98 13.0 168.2 7.7752
> 160 73621.9 98 13.2 146.3 7.6689 160 75572.9 98 12.8 163.6 7.8722
> 320 139210.3 96 13.9 280.1 7.2505 320 139010.8 97 14.0 282.4 7.2401
> 320 135135.9 96 14.3 277.8 7.0383 320 139611.2 96 13.9 282.2 7.2714
> 320 139110.5 96 13.9 280.4 7.2453 320 138514.3 97 14.0 281.0 7.2143
> 640 223538.9 98 17.4 577.2 5.8213 640 224704.5 95 17.3 567.3 5.8517
> 640 224055.5 97 17.3 575.7 5.8348 640 222385.3 95 17.4 580.1 5.7913
> 640 225488.4 96 17.2 566.3 5.8721 640 225882.4 93 17.2 563.0 5.8824
>
>
On 05/22/2013 10:55 PM, Mike Galbraith wrote:
> On Wed, 2013-05-22 at 17:25 +0800, Michael Wang wrote:
>
>> I've not test the hackbench with wakeup-buddy before, will do it this
>> time, I suppose the 15% illegal income will suffered, anyway, it's
>> illegal :)
>
> On a 4 socket 40 core (+SMT) box, hackbench wasn't too happy.
I've done more testing and now I've found the reason for the regression...
The writers and readers in hackbench are N:N: each writer writes to all
the fds before the next writer takes over and repeats the same work, so
it's impossible to set up the buddy relationship by just recording the
last wakee; we would have to record the whole waker/wakee history, and
that means unacceptable memory overhead...
So this buddy idea seems to be a bad one...
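To make the N:N pattern concrete, here is a minimal user-space sketch
(all names and sizes are made up for illustration; nothing here is taken
from a posted patch) showing how a single "previous wakee" field churns
under a hackbench-like wakeup pattern, so a stable buddy never forms:

/*
 * Purely illustrative user-space sketch -- not kernel code, and no
 * names here are taken from any posted patch.  With an N:N wakeup
 * pattern like hackbench, a single "previous wakee" field flips on
 * almost every wakeup, so it never settles into a useful buddy.
 */
#include <stdio.h>

#define NR_TASKS  4     /* hypothetical: 4 writers waking 4 readers */
#define NR_ROUNDS 10

struct fake_task {
	int prev_wakee;     /* -1 means "no buddy recorded yet" */
	int buddy_flips;    /* how often the recorded buddy changed */
};

static void note_wakee(struct fake_task *waker, int wakee)
{
	if (waker->prev_wakee != wakee) {
		if (waker->prev_wakee != -1)
			waker->buddy_flips++;
		waker->prev_wakee = wakee;
	}
}

int main(void)
{
	struct fake_task writers[NR_TASKS];

	for (int w = 0; w < NR_TASKS; w++) {
		writers[w].prev_wakee = -1;
		writers[w].buddy_flips = 0;
	}

	/* each writer writes to every fd in turn, waking a different reader */
	for (int round = 0; round < NR_ROUNDS; round++)
		for (int w = 0; w < NR_TASKS; w++)
			for (int fd = 0; fd < NR_TASKS; fd++)
				note_wakee(&writers[w], fd);

	for (int w = 0; w < NR_TASKS; w++)
		printf("writer %d: buddy flipped %d times over %d wakeups\n",
		       w, writers[w].buddy_flips, NR_ROUNDS * NR_TASKS);
	return 0;
}

Every writer ends up flipping its recorded buddy on nearly every wakeup,
which is exactly why recording only the last wakee cannot work here.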
I think a better way may be to allow the pull most of the time, but to
carefully filter out the very bad cases.
For a workload like pgbench, we actually just need to avoid the pull
when it would damage the 'mother' thread, which is busy and relied
upon by many 'child' threads.
I will send out a new implementation; let's see whether it can solve
these problems ;-)
Regards,
Michael Wang
>
> defaultx = wakeup buddy
>
> tbench 30 localhost
> 1 5 10 20 40 80 160 320
> 3.10.0-default 188.13 953.03 1860.03 3180.93 5378.34 10826.06 11342.90 11651.30
> 3.10.0-defaultx 187.26 934.55 1859.30 3160.29 5016.35 12477.15 12567.05 12068.47
>
> hackbench -l 400 -g 400
> 3.10.0-default Time: 3.919 4.250 4.116
> 3.10.0-defaultx Time: 9.074 10.985 9.849
>
> aim7 compute
> AIM Multiuser Benchmark - Suite VII "1.1" 3.10.0-default AIM Multiuser Benchmark - Suite VII "1.1" 3.10.0-defaultx
>
> Tasks Jobs/Min JTI Real CPU Jobs/sec/task Tasks Jobs/Min JTI Real CPU Jobs/sec/task
> 1 428.0 100 14.2 4.1 7.1328 1 428.9 100 14.1 4.1 7.1479
> 1 417.9 100 14.5 4.1 6.9655 1 430.4 100 14.1 4.0 7.1733
> 1 427.7 100 14.2 4.2 7.1277 1 424.7 100 14.3 4.2 7.0778
> 5 2350.7 99 12.9 13.8 7.8355 5 2156.6 99 14.1 19.8 7.1886
> 5 2422.1 99 12.5 12.1 8.0735 5 2155.0 99 14.1 19.7 7.1835
> 5 2189.3 98 13.8 18.3 7.2977 5 2108.6 99 14.4 21.4 7.0285
> 10 4515.6 93 13.4 27.6 7.5261 10 4529.1 96 13.4 29.5 7.5486
> 10 4708.6 96 12.9 24.3 7.8477 10 4597.9 96 13.2 26.9 7.6631
> 10 4636.6 96 13.1 25.7 7.7276 10 5197.3 98 11.7 14.8 8.6621
> 20 8053.2 95 15.1 78.1 6.7110 20 9431.9 98 12.8 49.1 7.8599
> 20 8250.5 92 14.7 67.5 6.8754 20 7973.7 97 15.2 93.4 6.6447
> 20 8178.1 97 14.8 78.5 6.8151 20 8145.2 95 14.9 78.4 6.7876
> 40 17413.8 94 13.9 88.6 7.2557 40 16312.2 92 14.9 115.6 6.7968
> 40 16775.1 93 14.5 111.6 6.9896 40 17070.4 94 14.2 110.6 7.1127
> 40 16031.7 93 15.1 147.1 6.6799 40 17578.0 94 13.8 96.9 7.3241
> 80 33666.7 95 14.4 138.4 7.0139 80 33854.7 96 14.3 177.9 7.0531
> 80 33949.6 97 14.3 128.0 7.0728 80 34164.9 96 14.2 146.4 7.1177
> 80 35752.2 96 13.6 159.1 7.4484 80 33807.5 96 14.3 127.6 7.0432
> 160 74814.8 98 13.0 149.8 7.7932 160 75162.8 98 12.9 148.6 7.8295
> 160 74015.3 97 13.1 149.5 7.7099 160 74642.0 98 13.0 168.2 7.7752
> 160 73621.9 98 13.2 146.3 7.6689 160 75572.9 98 12.8 163.6 7.8722
> 320 139210.3 96 13.9 280.1 7.2505 320 139010.8 97 14.0 282.4 7.2401
> 320 135135.9 96 14.3 277.8 7.0383 320 139611.2 96 13.9 282.2 7.2714
> 320 139110.5 96 13.9 280.4 7.2453 320 138514.3 97 14.0 281.0 7.2143
> 640 223538.9 98 17.4 577.2 5.8213 640 224704.5 95 17.3 567.3 5.8517
> 640 224055.5 97 17.3 575.7 5.8348 640 222385.3 95 17.4 580.1 5.7913
> 640 225488.4 96 17.2 566.3 5.8721 640 225882.4 93 17.2 563.0 5.8824
>
>
On Tue, 2013-05-28 at 13:02 +0800, Michael Wang wrote:
> On 05/22/2013 10:55 PM, Mike Galbraith wrote:
> > On Wed, 2013-05-22 at 17:25 +0800, Michael Wang wrote:
> >
> >> I've not test the hackbench with wakeup-buddy before, will do it this
> >> time, I suppose the 15% illegal income will suffered, anyway, it's
> >> illegal :)
> >
> > On a 4 socket 40 core (+SMT) box, hackbench wasn't too happy.
>
> I've done more test and now I got the reason of regression...
>
> The writer and reader in hackbench is N:N, prev writer will write all
> the fd then switch to next writer and repeat the same work, so it's
> impossible to setup the buddy relationship by just record the last one,
> and we have to record all the waker/wakee in history, but that means
> unacceptable memory overhead...
Yeah, that's why I was thinking we'd need a dinky/fast as hell FIFO of
tokens or such to bind waker/wakee more or less reliably. Making such a
scheme cheap enough could be hard.
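As a rough illustration of the kind of "dinky FIFO" being floated here
(purely hypothetical names and sizes, written as plain user-space C
rather than kernel code), each task could keep a tiny ring of recent
wakee ids that the scheduler can probe cheaply:

/*
 * Rough sketch of the kind of small wakee FIFO being floated here.
 * All names and sizes are made up; this is not from any posted patch.
 * Each task keeps a tiny ring of recent wakee ids, so the scheduler
 * could cheaply ask "have I woken this task recently?".
 */
#include <stdbool.h>
#include <stdio.h>

#define WAKEE_FIFO_SIZE 4   /* hypothetical: small power of two */

struct wakee_fifo {
	int ids[WAKEE_FIFO_SIZE];
	unsigned int head;
};

static void wakee_fifo_push(struct wakee_fifo *f, int wakee_id)
{
	f->ids[f->head++ & (WAKEE_FIFO_SIZE - 1)] = wakee_id;
}

static bool wakee_fifo_contains(const struct wakee_fifo *f, int wakee_id)
{
	for (unsigned int i = 0; i < WAKEE_FIFO_SIZE; i++)
		if (f->ids[i] == wakee_id)
			return true;
	return false;
}

int main(void)
{
	struct wakee_fifo f = { .ids = { -1, -1, -1, -1 }, .head = 0 };

	wakee_fifo_push(&f, 10);
	wakee_fifo_push(&f, 11);
	wakee_fifo_push(&f, 12);

	printf("recently woke 11? %d\n", wakee_fifo_contains(&f, 11)); /* 1 */
	printf("recently woke 99? %d\n", wakee_fifo_contains(&f, 99)); /* 0 */
	return 0;
}

Even a structure this small still costs a push per wakeup and a short
scan per query, which is where the "cheap enough" worry comes from.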
> So this buddy idea seems to be bad...
>
> I think a better way may should be allowing pull in most time, but
> filter the very bad cases carefully.
Any way that is cheap and fairly accurately recognizes when we're being
stupid will help. First and foremost, it has to be dirt cheap :)
> For workload like pgbench, we actually just need to avoid pull if that
> will damage the 'mother' thread, which is busy and be relied by many
> 'child'.
Yeah, 'mom' is the key player. If we can cheaply recognize mom, that
should get us a generic improvement. Not as good as being able to
recognize the size of her+brood as size changes, but better anyway.
-Mike
On 05/28/2013 02:29 PM, Mike Galbraith wrote:
> On Tue, 2013-05-28 at 13:02 +0800, Michael Wang wrote:
>> On 05/22/2013 10:55 PM, Mike Galbraith wrote:
>>> On Wed, 2013-05-22 at 17:25 +0800, Michael Wang wrote:
>>>
>>>> I've not test the hackbench with wakeup-buddy before, will do it this
>>>> time, I suppose the 15% illegal income will suffered, anyway, it's
>>>> illegal :)
>>>
>>> On a 4 socket 40 core (+SMT) box, hackbench wasn't too happy.
>>
>> I've done more test and now I got the reason of regression...
>>
>> The writer and reader in hackbench is N:N, prev writer will write all
>> the fd then switch to next writer and repeat the same work, so it's
>> impossible to setup the buddy relationship by just record the last one,
>> and we have to record all the waker/wakee in history, but that means
>> unacceptable memory overhead...
>
> Yeah, that's why I was thinking we'd need a dinky/fast as hell FIFO of
> tokens or such to bind waker/wakee more or less reliably. Making such a
> scheme cheap enough could be hard.
>
>> So this buddy idea seems to be bad...
>>
>> I think a better way may should be allowing pull in most time, but
>> filter the very bad cases carefully.
>
> Any way that is cheap, and fairly accurately recognizes when we're being
> stupid will help. First and foremost, it has to be dirt cheap :)
>
>> For workload like pgbench, we actually just need to avoid pull if that
>> will damage the 'mother' thread, which is busy and be relied by many
>> 'child'.
>
> Yeah, 'mom' is the key player. If we can cheaply recognize mom, that
> should get us a generic improvement. Not as good as being able to
> recognize the size of her+brood as size changes, but better anyway.
That's right. In the new idea I'm trying to rely on the frequency with
which a task switches its wakee; it's really cheap and somewhat
reliable. I'd appreciate it if you could keep an eye on the new patch
and let me know your opinion :)
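A simplified user-space illustration of that idea (made-up names and
threshold; this is a sketch of the heuristic as described, not the
actual posted patch): count how often a task switches to a different
wakee, and skip the affine pull once that count gets high, which is
what a pgbench-style 'mother' serving many 'children' looks like:

/*
 * Simplified user-space illustration -- made-up names and threshold,
 * not the actual posted patch.  Count how often a task switches to a
 * different wakee; once that count is high the waker is fanning out
 * (a pgbench-style 'mother'), so an affine pull would likely do more
 * harm than good.
 */
#include <stdbool.h>
#include <stdio.h>

#define FANOUT_THRESHOLD 8  /* hypothetical cut-off */

struct waker_stats {
	int prev_wakee;
	unsigned int wakee_switches;
};

static void track_wakee(struct waker_stats *w, int wakee_id)
{
	if (w->prev_wakee != wakee_id) {
		w->prev_wakee = wakee_id;
		w->wakee_switches++;  /* a real version would decay this */
	}
}

/* true when the waker switches wakees frequently: skip the pull */
static bool waker_fans_out(const struct waker_stats *w)
{
	return w->wakee_switches > FANOUT_THRESHOLD;
}

int main(void)
{
	struct waker_stats mom = { .prev_wakee = -1, .wakee_switches = 0 };

	/* a 'mother' thread waking 16 different 'child' threads */
	for (int child = 0; child < 16; child++)
		track_wakee(&mom, child);

	printf("wakee switches = %u, skip affine pull? %d\n",
	       mom.wakee_switches, waker_fans_out(&mom));
	return 0;
}

In a real implementation the switch count would presumably need to
decay over time, so a task is not penalised forever for a short burst
of fan-out wakeups.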
Regards,
Michael Wang
>
> -Mike
>
On Tue, 2013-05-28 at 15:22 +0800, Michael Wang wrote:
> On 05/28/2013 02:29 PM, Mike Galbraith wrote:
> > On Tue, 2013-05-28 at 13:02 +0800, Michael Wang wrote:
> >> On 05/22/2013 10:55 PM, Mike Galbraith wrote:
> >>> On Wed, 2013-05-22 at 17:25 +0800, Michael Wang wrote:
> >>>
> >>>> I've not test the hackbench with wakeup-buddy before, will do it this
> >>>> time, I suppose the 15% illegal income will suffered, anyway, it's
> >>>> illegal :)
> >>>
> >>> On a 4 socket 40 core (+SMT) box, hackbench wasn't too happy.
> >>
> >> I've done more test and now I got the reason of regression...
> >>
> >> The writer and reader in hackbench is N:N, prev writer will write all
> >> the fd then switch to next writer and repeat the same work, so it's
> >> impossible to setup the buddy relationship by just record the last one,
> >> and we have to record all the waker/wakee in history, but that means
> >> unacceptable memory overhead...
> >
> > Yeah, that's why I was thinking we'd need a dinky/fast as hell FIFO of
> > tokens or such to bind waker/wakee more or less reliably. Making such a
> > scheme cheap enough could be hard.
> >
> >> So this buddy idea seems to be bad...
> >>
> >> I think a better way may should be allowing pull in most time, but
> >> filter the very bad cases carefully.
> >
> > Any way that is cheap, and fairly accurately recognizes when we're being
> > stupid will help. First and foremost, it has to be dirt cheap :)
> >
> >> For workload like pgbench, we actually just need to avoid pull if that
> >> will damage the 'mother' thread, which is busy and be relied by many
> >> 'child'.
> >
> > Yeah, 'mom' is the key player. If we can cheaply recognize mom, that
> > should get us a generic improvement. Not as good as being able to
> > recognize the size of her+brood as size changes, but better anyway.
>
> That's right, I'm trying to rely on the frequency of a task switching
> it's wakee in the new idea, it's really cheap and somewhat reliable, I
> appreciate if you could pay an eye on the new patch and let me know you
> opinion :)
I'll feed it to my 40 core box, hopefully soon.
-Mike
On 05/28/2013 04:49 PM, Mike Galbraith wrote:
> On Tue, 2013-05-28 at 15:22 +0800, Michael Wang wrote:
>> On 05/28/2013 02:29 PM, Mike Galbraith wrote:
>>> On Tue, 2013-05-28 at 13:02 +0800, Michael Wang wrote:
>>>> On 05/22/2013 10:55 PM, Mike Galbraith wrote:
>>>>> On Wed, 2013-05-22 at 17:25 +0800, Michael Wang wrote:
>>>>>
>>>>>> I've not test the hackbench with wakeup-buddy before, will do it this
>>>>>> time, I suppose the 15% illegal income will suffered, anyway, it's
>>>>>> illegal :)
>>>>>
>>>>> On a 4 socket 40 core (+SMT) box, hackbench wasn't too happy.
>>>>
>>>> I've done more test and now I got the reason of regression...
>>>>
>>>> The writer and reader in hackbench is N:N, prev writer will write all
>>>> the fd then switch to next writer and repeat the same work, so it's
>>>> impossible to setup the buddy relationship by just record the last one,
>>>> and we have to record all the waker/wakee in history, but that means
>>>> unacceptable memory overhead...
>>>
>>> Yeah, that's why I was thinking we'd need a dinky/fast as hell FIFO of
>>> tokens or such to bind waker/wakee more or less reliably. Making such a
>>> scheme cheap enough could be hard.
>>>
>>>> So this buddy idea seems to be bad...
>>>>
>>>> I think a better way may should be allowing pull in most time, but
>>>> filter the very bad cases carefully.
>>>
>>> Any way that is cheap, and fairly accurately recognizes when we're being
>>> stupid will help. First and foremost, it has to be dirt cheap :)
>>>
>>>> For workload like pgbench, we actually just need to avoid pull if that
>>>> will damage the 'mother' thread, which is busy and be relied by many
>>>> 'child'.
>>>
>>> Yeah, 'mom' is the key player. If we can cheaply recognize mom, that
>>> should get us a generic improvement. Not as good as being able to
>>> recognize the size of her+brood as size changes, but better anyway.
>>
>> That's right, I'm trying to rely on the frequency of a task switching
>> it's wakee in the new idea, it's really cheap and somewhat reliable, I
>> appreciate if you could pay an eye on the new patch and let me know you
>> opinion :)
>
> I'll feed it to my 40 core box, hopefully soon.
Thanks for that, I hope it tastes delicious ;-)
Regards,
Michael Wang
>
> -Mike
>