2013-03-25 05:24:27

by Michael Wang

Subject: [RFC PATCH] sched: wake-affine throttle

Recent testing shows that the wake-affine logic causes a regression on
pgbench; the hiding rat has finally been caught out.

The wake-affine logic always tries to pull the wakee close to the waker.
In theory this benefits us when the waker's cpu has cached data that is
hot for the wakee, or in the extreme ping-pong case.

However, the whole thing is done somewhat blindly: the relationship
between waker and wakee is never examined, and since the logic itself is
time-consuming, some workloads suffer; pgbench is simply the one that
has been caught so far.

Thus, throttling the wake-affine logic for such workloads is necessary.

This patch introduces a new knob 'sysctl_sched_wake_affine_interval'
with a default value of 1ms, which means the wake-affine logic takes
effect at most once per 1ms, which is usually the minimum balance
interval (the idea being that wake-affine should fire no more rapidly
than load-balance does).

By tuning the new knob, workloads that suffered get a chance to stop
the regression.
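
For illustration only (not part of the patch itself): with the sysctl
entry added below, the knob shows up as
/proc/sys/kernel/sched_wake_affine_interval, so a run can be re-tuned
from user space with something as small as the sketch here.

/*
 * Minimal sketch, assuming this patch is applied: write a new interval
 * (in milliseconds) into the sysctl file registered in kern_table.
 * 100 is the value used for the pgbench numbers below.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/sched_wake_affine_interval", "w");

	if (!f) {
		perror("sched_wake_affine_interval");
		return 1;
	}
	fprintf(f, "%u\n", 100);	/* throttle affine wakeups to once per 100ms */
	fclose(f);
	return 0;
}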

Test:
Tested on a 12-cpu x86 server with tip 3.9.0-rc2.

The default 1ms interval brings a limited performance improvement (<5%)
for pgbench; significant improvement starts to show when the knob is
turned up to 100ms.

original 100ms

| db_size | clients | tps | | tps |
+---------+---------+-------+ +-------+
| 21 MB | 1 | 10572 | | 10675 |
| 21 MB | 2 | 21275 | | 21228 |
| 21 MB | 4 | 41866 | | 41946 |
| 21 MB | 8 | 53931 | | 55176 |
| 21 MB | 12 | 50956 | | 54457 | +6.87%
| 21 MB | 16 | 49911 | | 55468 | +11.11%
| 21 MB | 24 | 46046 | | 56446 | +22.59%
| 21 MB | 32 | 43405 | | 55177 | +27.12%
| 7483 MB | 1 | 7734 | | 7721 |
| 7483 MB | 2 | 19375 | | 19277 |
| 7483 MB | 4 | 37408 | | 37685 |
| 7483 MB | 8 | 49033 | | 49152 |
| 7483 MB | 12 | 45525 | | 49241 | +8.16%
| 7483 MB | 16 | 45731 | | 51425 | +12.45%
| 7483 MB | 24 | 41533 | | 52349 | +26.04%
| 7483 MB | 32 | 36370 | | 51022 | +40.28%
| 15 GB | 1 | 7576 | | 7422 |
| 15 GB | 2 | 19157 | | 19176 |
| 15 GB | 4 | 37285 | | 36982 |
| 15 GB | 8 | 48718 | | 48413 |
| 15 GB | 12 | 45167 | | 48497 | +7.37%
| 15 GB | 16 | 45270 | | 51276 | +13.27%
| 15 GB | 24 | 40984 | | 51628 | +25.97%
| 15 GB | 32 | 35918 | | 51060 | +42.16%

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
include/linux/sched.h | 5 +++++
kernel/sched/fair.c | 33 ++++++++++++++++++++++++++++++++-
kernel/sysctl.c | 10 ++++++++++
3 files changed, 47 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..e9efd3a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1197,6 +1197,10 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};

+#ifdef CONFIG_SMP
+extern unsigned int sysctl_sched_wake_affine_interval;
+#endif
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1207,6 +1211,7 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
+ unsigned long next_wake_affine;
#endif
int on_rq;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..00d7f45 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3087,6 +3087,22 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,

#endif

+/*
+ * Default is 1ms, to prevent the wake_affine() stuff from working too frequently.
+ */
+unsigned int sysctl_sched_wake_affine_interval = 1U;
+
+static inline int wake_affine_throttled(struct task_struct *p)
+{
+ return time_before(jiffies, p->next_wake_affine);
+}
+
+static inline void wake_affine_throttle(struct task_struct *p)
+{
+ p->next_wake_affine = jiffies +
+ msecs_to_jiffies(sysctl_sched_wake_affine_interval);
+}
+
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
{
s64 this_load, load;
@@ -3096,6 +3112,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
unsigned long weight;
int balanced;

+ if (wake_affine_throttled(p))
+ return 0;
+
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -3342,8 +3361,20 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
}

if (affine_sd) {
- if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+ if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
+ /*
+ * wake_affine() tries to pull the wakee to a cpu near the
+ * waker; this benefits us if the data cached on the waker's
+ * cpu is hot for the wakee, or in the extreme ping-pong case.
+ *
+ * However, doing this blind work too frequently causes
+ * regressions for some workloads, thus each time wake_affine()
+ * succeeds, throttle it for a while.
+ */
+ wake_affine_throttle(p);
prev_cpu = cpu;
+ }

new_cpu = select_idle_sibling(p, prev_cpu);
goto unlock;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afc1dc6..6b798b6 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -437,6 +437,16 @@ static struct ctl_table kern_table[] = {
.extra1 = &one,
},
#endif
+#ifdef CONFIG_SMP
+ {
+ .procname = "sched_wake_affine_interval",
+ .data = &sysctl_sched_wake_affine_interval,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &one,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",
--
1.7.4.1


2013-03-25 09:22:54

by Mike Galbraith

Subject: Re: [RFC PATCH] sched: wake-affine throttle

On Mon, 2013-03-25 at 13:24 +0800, Michael Wang wrote:
> Recent testing shows that the wake-affine logic causes a regression on
> pgbench; the hiding rat has finally been caught out.
>
> The wake-affine logic always tries to pull the wakee close to the waker.
> In theory this benefits us when the waker's cpu has cached data that is
> hot for the wakee, or in the extreme ping-pong case.
>
> However, the whole thing is done somewhat blindly: the relationship
> between waker and wakee is never examined, and since the logic itself is
> time-consuming, some workloads suffer; pgbench is simply the one that
> has been caught so far.
>
> Thus, throttling the wake-affine logic for such workloads is necessary.
>
> This patch introduces a new knob 'sysctl_sched_wake_affine_interval'
> with a default value of 1ms, which means the wake-affine logic takes
> effect at most once per 1ms, which is usually the minimum balance
> interval (the idea being that wake-affine should fire no more rapidly
> than load-balance does).
>
> By tuning the new knob, workloads that suffered get a chance to stop
> the regression.

I wouldn't do it quite this way. Per task, yes (I suggested that too,
better agree;), but for one, using jiffies in the scheduler when we have
a spiffy clock sitting there ready to use seems wrong, and secondly,
when you've got a bursty load, not pulling can hurt like hell. Alex
encountered that while working on his patch set.
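
For illustration, a ns-resolution variant of the same per-task
bookkeeping might look roughly like the sketch below (untested, and it
assumes the next_wake_affine field becomes a u64):

/*
 * Rough sketch only: the jiffies-based throttle expressed with the
 * scheduler's ns clock instead.  Assumes p->next_wake_affine is a u64
 * and the interval knob stays in milliseconds.
 */
static inline int wake_affine_throttled(struct task_struct *p)
{
	return (s64)(sched_clock_cpu(smp_processor_id()) -
		     p->next_wake_affine) < 0;
}

static inline void wake_affine_throttle(struct task_struct *p)
{
	p->next_wake_affine = sched_clock_cpu(smp_processor_id()) +
		(u64)sysctl_sched_wake_affine_interval * NSEC_PER_MSEC;
}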

> Test:
> Tested on a 12-cpu x86 server with tip 3.9.0-rc2.
>
> The default 1ms interval brings a limited performance improvement (<5%)
> for pgbench; significant improvement starts to show when the knob is
> turned up to 100ms.

So it seems you'd be better served by an on/off switch for this load.
100ms in the future for many tasks is akin to a human todo list entry
scheduled for Solar radius >= Earth orbital radius day ;-)

> original 100ms
>
> | db_size | clients | tps | | tps |
> +---------+---------+-------+ +-------+
> | 21 MB | 1 | 10572 | | 10675 |
> | 21 MB | 2 | 21275 | | 21228 |
> | 21 MB | 4 | 41866 | | 41946 |
> | 21 MB | 8 | 53931 | | 55176 |
> | 21 MB | 12 | 50956 | | 54457 | +6.87%
> | 21 MB | 16 | 49911 | | 55468 | +11.11%
> | 21 MB | 24 | 46046 | | 56446 | +22.59%
> | 21 MB | 32 | 43405 | | 55177 | +27.12%
> | 7483 MB | 1 | 7734 | | 7721 |
> | 7483 MB | 2 | 19375 | | 19277 |
> | 7483 MB | 4 | 37408 | | 37685 |
> | 7483 MB | 8 | 49033 | | 49152 |
> | 7483 MB | 12 | 45525 | | 49241 | +8.16%
> | 7483 MB | 16 | 45731 | | 51425 | +12.45%
> | 7483 MB | 24 | 41533 | | 52349 | +26.04%
> | 7483 MB | 32 | 36370 | | 51022 | +40.28%
> | 15 GB | 1 | 7576 | | 7422 |
> | 15 GB | 2 | 19157 | | 19176 |
> | 15 GB | 4 | 37285 | | 36982 |
> | 15 GB | 8 | 48718 | | 48413 |
> | 15 GB | 12 | 45167 | | 48497 | +7.37%
> | 15 GB | 16 | 45270 | | 51276 | +13.27%
> | 15 GB | 24 | 40984 | | 51628 | +25.97%
> | 15 GB | 32 | 35918 | | 51060 | +42.16%

The benefit you get with not pulling is twofold at least, first and
foremost it keeps the forked off clients the hell away from the mother
of all work so it can keep the kids fed. Second, you keep the load
spread out, which is the only way the full box sized load can possibly
perform in the first place. The full box benefit seems clear from the
numbers.. hard working server can compete best for its share when it's
competing against the same set of clients, that's likely why you have to
set the knob to 100ms to get the big win.

With small burst loads of short running tasks, even things like pgbench
will benefit from pulling to local llc more frequently than 100ms, iff
burst does not exceed socket size. That pulling is not completely evil,
it automagically consolidates your mostly idle NUMA box to its most
efficient task placement for both power saving and throughput, so IMHO,
you can't just let tasks sit cross node over ~extended idle periods
without doing harm.

OTOH, if the box is mostly busy, or if there's only one llc, picking a
smallish per-task number is dirt simple, and should be a general-case
improvement over the overly 1:1 buddy-centric current behavior.

Have you tried the patch set by Alex Shi? In my fiddling with them,
they put a very big dent in the evil side of select_idle_sibling() and
affine wakeups in general, and should help pgbench and ilk heaping
truckloads.

-Mike

2013-03-25 10:21:22

by Michael Wang

Subject: Re: [RFC PATCH] sched: wake-affine throttle

Hi, Mike

Thanks for your reply :)

On 03/25/2013 05:22 PM, Mike Galbraith wrote:
> On Mon, 2013-03-25 at 13:24 +0800, Michael Wang wrote:
>> Recent testing shows that the wake-affine logic causes a regression on
>> pgbench; the hiding rat has finally been caught out.
>>
>> The wake-affine logic always tries to pull the wakee close to the waker.
>> In theory this benefits us when the waker's cpu has cached data that is
>> hot for the wakee, or in the extreme ping-pong case.
>>
>> However, the whole thing is done somewhat blindly: the relationship
>> between waker and wakee is never examined, and since the logic itself is
>> time-consuming, some workloads suffer; pgbench is simply the one that
>> has been caught so far.
>>
>> Thus, throttling the wake-affine logic for such workloads is necessary.
>>
>> This patch introduces a new knob 'sysctl_sched_wake_affine_interval'
>> with a default value of 1ms, which means the wake-affine logic takes
>> effect at most once per 1ms, which is usually the minimum balance
>> interval (the idea being that wake-affine should fire no more rapidly
>> than load-balance does).
>>
>> By tuning the new knob, workloads that suffered get a chance to stop
>> the regression.
>
> I wouldn't do it quite this way. Per task, yes (I suggested that too,
> better agree;), but for one, using jiffies in the scheduler when we have
> a spiffy clock sitting there ready to use seems wrong,

Well, I took the approach from the load-balance code; it is an existing
way of handling intervals, and I was just trying to keep things consistent...

> and secondly,
> when you've got a bursty load, not pulling can hurt like hell. Alex
> encountered that while working on his patch set.
>
>> Test:
>> Tested on a 12-cpu x86 server with tip 3.9.0-rc2.
>>
>> The default 1ms interval brings a limited performance improvement (<5%)
>> for pgbench; significant improvement starts to show when the knob is
>> turned up to 100ms.
>
> So it seems you'd be better served by an on/off switch for this load.
> 100ms in the future for many tasks is akin to a human todo list entry
> scheduled for Solar radius >= Earth orbital radius day ;-)

Do you mean the 1ms interval is still too big, and you would prefer to
have a 0 option?

>
>> original 100ms
>>
>> | db_size | clients | tps | | tps |
>> +---------+---------+-------+ +-------+
>> | 21 MB | 1 | 10572 | | 10675 |
>> | 21 MB | 2 | 21275 | | 21228 |
>> | 21 MB | 4 | 41866 | | 41946 |
>> | 21 MB | 8 | 53931 | | 55176 |
>> | 21 MB | 12 | 50956 | | 54457 | +6.87%
>> | 21 MB | 16 | 49911 | | 55468 | +11.11%
>> | 21 MB | 24 | 46046 | | 56446 | +22.59%
>> | 21 MB | 32 | 43405 | | 55177 | +27.12%
>> | 7483 MB | 1 | 7734 | | 7721 |
>> | 7483 MB | 2 | 19375 | | 19277 |
>> | 7483 MB | 4 | 37408 | | 37685 |
>> | 7483 MB | 8 | 49033 | | 49152 |
>> | 7483 MB | 12 | 45525 | | 49241 | +8.16%
>> | 7483 MB | 16 | 45731 | | 51425 | +12.45%
>> | 7483 MB | 24 | 41533 | | 52349 | +26.04%
>> | 7483 MB | 32 | 36370 | | 51022 | +40.28%
>> | 15 GB | 1 | 7576 | | 7422 |
>> | 15 GB | 2 | 19157 | | 19176 |
>> | 15 GB | 4 | 37285 | | 36982 |
>> | 15 GB | 8 | 48718 | | 48413 |
>> | 15 GB | 12 | 45167 | | 48497 | +7.37%
>> | 15 GB | 16 | 45270 | | 51276 | +13.27%
>> | 15 GB | 24 | 40984 | | 51628 | +25.97%
>> | 15 GB | 32 | 35918 | | 51060 | +42.16%
>
> The benefit you get with not pulling is two fold at least, first and
> foremost it keeps the forked off clients the hell away from the mother
> of all work so it can keep the kids fed. Second, you keep the load
> spread out, which is the only way the full box sized load can possibly
> perform in the first place. The full box benefit seems clear from the
> numbers.. hard working server can compete best for its share when it's
> competing against the same set of clients, that's likely why you have to
> set the knob to 100ms to get the big win.

Actually 10ms also gets around 27% improvement at most; I used 100ms
since it looks more significant...

I haven't tried intervals between 1 and 10, but I suppose the benefit
follows some kind of parabola; it's not a sudden change, but a smooth one.

>
> With small burst loads of short running tasks, even things like pgbench
> will benefit from pulling to local llc more frequently than 100ms, iff
> burst does not exceed socket size. That pulling is not completely evil,
> it automagically consolidates your mostly idle NUMA box to it's most
> efficient task placement for both power saving and throughput, so IMHO,
> you can't just let tasks sit cross node over ~extended idle periods
> without doing harm.

I see, and actually that's the reason for this proposal: it just tries
to preserve all the possible benefit of wake-affine while providing a
way to control how rapidly it fires.

I think your point here is still that we need a 0 option, is that correct?

>
> OTOH, if the box is mostly busy, or if there's only one llc, per task
> pick a smallish number is dirt simple, and should be a general case
> improvement over the overly 1:1 buddy centric current behavior.
>
> Have you tried the patch set by Alex Shi? In my fiddling with them,
> they put a very big dent in the evil side of select_idle_sibling() and
> affine wakeups in general, and should help pgbench and ilk heaping
> truckloads.

I haven't tried it yet; in any case, a way to throttle wake-affine is
necessary, since we have proof that it damages some workloads.

OTOH, in theory we may pull unrelated tasks together, and we can't
promise this will help the system all the time, can we?

Regards,
Michael Wang

>
> -Mike

2013-03-25 14:32:07

by Mike Galbraith

Subject: Re: [RFC PATCH] sched: wake-affine throttle

On Mon, 2013-03-25 at 18:21 +0800, Michael Wang wrote:
> Hi, Mike
>
> Thanks for your reply :)
>
> On 03/25/2013 05:22 PM, Mike Galbraith wrote:
> > On Mon, 2013-03-25 at 13:24 +0800, Michael Wang wrote:
> >> Recent testing shows that the wake-affine logic causes a regression on
> >> pgbench; the hiding rat has finally been caught out.
> >>
> >> The wake-affine logic always tries to pull the wakee close to the waker.
> >> In theory this benefits us when the waker's cpu has cached data that is
> >> hot for the wakee, or in the extreme ping-pong case.
> >>
> >> However, the whole thing is done somewhat blindly: the relationship
> >> between waker and wakee is never examined, and since the logic itself is
> >> time-consuming, some workloads suffer; pgbench is simply the one that
> >> has been caught so far.
> >>
> >> Thus, throttling the wake-affine logic for such workloads is necessary.
> >>
> >> This patch introduces a new knob 'sysctl_sched_wake_affine_interval'
> >> with a default value of 1ms, which means the wake-affine logic takes
> >> effect at most once per 1ms, which is usually the minimum balance
> >> interval (the idea being that wake-affine should fire no more rapidly
> >> than load-balance does).
> >>
> >> By tuning the new knob, workloads that suffered get a chance to stop
> >> the regression.
> >
> > I wouldn't do it quite this way. Per task, yes (I suggested that too,
> > better agree;), but for one, using jiffies in the scheduler when we have
> > a spiffy clock sitting there ready to use seems wrong,
>
> Well, I get the approach from load-balance code, this is one existed way
> to play interval stuff, just try to keep consistent...
>
> > and secondly,
> > when you've got a bursty load, not pulling can hurt like hell. Alex
> > encountered that while working on his patch set.
> >
> >> Test:
> >> Tested on a 12-cpu x86 server with tip 3.9.0-rc2.
> >>
> >> The default 1ms interval brings a limited performance improvement (<5%)
> >> for pgbench; significant improvement starts to show when the knob is
> >> turned up to 100ms.
> >
> > So it seems you'd be better served by an on/off switch for this load.
> > 100ms in the future for many tasks is akin to a human todo list entry
> > scheduled for Solar radius >= Earth orbital radius day ;-)
>
> Do you mean 1ms interval is still too big? and you prefer to have a 0
> option?

Not really, I just think a fixed interval may not be good enough without
some idle time consideration. Once a single load gets going less
balancing is more, it's just when load is fluctuating a lot, and mixed
loads where I can imagine troubles.

Perhaps ramp up to knob interval after an idle period trigger of.. say
migration_cost, or whatever. Something dirt simple that makes it open
the gates when it's most likely to matter.
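
Something like the sketch below, perhaps (rough illustration only; the
idle test reuses the exec_start comparison that task_hot() does, and
nothing here is tested):

/*
 * Illustrative only: skip stamping the throttle window when the wakee
 * has been off the cpu for longer than sysctl_sched_migration_cost, so
 * the first wakeup after a long idle period can still pull and
 * consolidate; only steadily-waking tasks are held to the knob interval.
 */
static inline void wake_affine_throttle(struct task_struct *p)
{
	/* approximate time since the wakee last ran, in ns */
	u64 idle = task_rq(p)->clock_task - p->se.exec_start;

	if (idle > sysctl_sched_migration_cost)
		return;	/* long sleep: leave the gates open */

	p->next_wake_affine = jiffies +
			msecs_to_jiffies(sysctl_sched_wake_affine_interval);
}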

> >
> >> original 100ms
> >>
> >> | db_size | clients | tps | | tps |
> >> +---------+---------+-------+ +-------+
> >> | 21 MB | 1 | 10572 | | 10675 |
> >> | 21 MB | 2 | 21275 | | 21228 |
> >> | 21 MB | 4 | 41866 | | 41946 |
> >> | 21 MB | 8 | 53931 | | 55176 |
> >> | 21 MB | 12 | 50956 | | 54457 | +6.87%
> >> | 21 MB | 16 | 49911 | | 55468 | +11.11%
> >> | 21 MB | 24 | 46046 | | 56446 | +22.59%
> >> | 21 MB | 32 | 43405 | | 55177 | +27.12%
> >> | 7483 MB | 1 | 7734 | | 7721 |
> >> | 7483 MB | 2 | 19375 | | 19277 |
> >> | 7483 MB | 4 | 37408 | | 37685 |
> >> | 7483 MB | 8 | 49033 | | 49152 |
> >> | 7483 MB | 12 | 45525 | | 49241 | +8.16%
> >> | 7483 MB | 16 | 45731 | | 51425 | +12.45%
> >> | 7483 MB | 24 | 41533 | | 52349 | +26.04%
> >> | 7483 MB | 32 | 36370 | | 51022 | +40.28%
> >> | 15 GB | 1 | 7576 | | 7422 |
> >> | 15 GB | 2 | 19157 | | 19176 |
> >> | 15 GB | 4 | 37285 | | 36982 |
> >> | 15 GB | 8 | 48718 | | 48413 |
> >> | 15 GB | 12 | 45167 | | 48497 | +7.37%
> >> | 15 GB | 16 | 45270 | | 51276 | +13.27%
> >> | 15 GB | 24 | 40984 | | 51628 | +25.97%
> >> | 15 GB | 32 | 35918 | | 51060 | +42.16%
> >
> > The benefit you get with not pulling is two fold at least, first and
> > foremost it keeps the forked off clients the hell away from the mother
> > of all work so it can keep the kids fed. Second, you keep the load
> > spread out, which is the only way the full box sized load can possibly
> > perform in the first place. The full box benefit seems clear from the
> > numbers.. hard working server can compete best for its share when it's
> > competing against the same set of clients, that's likely why you have to
> > set the knob to 100ms to get the big win.
>
> Actually the 10ms will also get around 27% improvement at most, I use
> 100ms since it looks more significant...
>
> I haven't tried the interval between 1 and 10, but I suppose the benefit
> could be some kind of parabola, it's not a suddenly change, but smoothly.
>
> >
> > With small burst loads of short running tasks, even things like pgbench
> > will benefit from pulling to local llc more frequently than 100ms, iff
> > burst does not exceed socket size. That pulling is not completely evil,
> > it automagically consolidates your mostly idle NUMA box to it's most
> > efficient task placement for both power saving and throughput, so IMHO,
> > you can't just let tasks sit cross node over ~extended idle periods
> > without doing harm.
>
> I see, and actually that's the reason for this proposal, it's just try
> to reserve all the possible benefit of wake-affine, and provide a way to
> control the rapid.
>
> I think your point here is still that we need a 0 option, it that correct?

No, zero is pretty much what we've got, and it's less than wonderful
after ramping up.

-Mike

2013-03-26 02:46:03

by Michael Wang

Subject: Re: [RFC PATCH] sched: wake-affine throttle

On 03/25/2013 10:31 PM, Mike Galbraith wrote:
[snip]
>>
>> Do you mean 1ms interval is still too big? and you prefer to have a 0
>> option?
>
> Not really, I just think a fixed interval may not be good enough without
> some idle time consideration. Once a single load gets going less
> balancing is more, it's just when load is fluctuating a lot, and mixed
> loads where I can imagine troubles.
>
> Perhaps ramp up to knob interval after an idle period trigger of.. say
> migration_cost, or whatever. Something dirt simple that makes it open
> the gates when it's most likely to matter.
>

So a dynamic adjustment; that sounds attractive ;-)

However, IMHO, I don't think we would be able to figure out when and how
to adjust; actually we don't even have the data to count on, otherwise
there would be no need to throttle the wake-affine logic at all...

Maybe doing such work in user space would be better?

This knob is nothing but a compromise; besides, it highlights that we
still have a feature waiting to be improved. If we later find a way to
build an accurate wake-affine, removing the knob should be easy.

Regards,
Michael Wang


2013-04-08 02:09:12

by Michael Wang

Subject: Re: [RFC PATCH] sched: wake-affine throttle

On 03/25/2013 01:24 PM, Michael Wang wrote:
> Recent testing shows that the wake-affine logic causes a regression on
> pgbench; the hiding rat has finally been caught out.
>
> The wake-affine logic always tries to pull the wakee close to the waker.
> In theory this benefits us when the waker's cpu has cached data that is
> hot for the wakee, or in the extreme ping-pong case.
>
> However, the whole thing is done somewhat blindly: the relationship
> between waker and wakee is never examined, and since the logic itself is
> time-consuming, some workloads suffer; pgbench is simply the one that
> has been caught so far.
>
> Thus, throttling the wake-affine logic for such workloads is necessary.
>
> This patch introduces a new knob 'sysctl_sched_wake_affine_interval'
> with a default value of 1ms, which means the wake-affine logic takes
> effect at most once per 1ms, which is usually the minimum balance
> interval (the idea being that wake-affine should fire no more rapidly
> than load-balance does).
>
> By tuning the new knob, workloads that suffered get a chance to stop
> the regression.

My recent testing shows this idea still works well after wake-affine
itself was optimized.

Throttling seems to be an easy and efficient way to stop the pgbench
regression caused by the wake-affine logic; should we adopt this approach?

Please let me know if there are still any concerns ;-)

Regards,
Michael Wang

>
> Test:
> Tested on a 12-cpu x86 server with tip 3.9.0-rc2.
>
> The default 1ms interval brings a limited performance improvement (<5%)
> for pgbench; significant improvement starts to show when the knob is
> turned up to 100ms.
>
> original 100ms
>
> | db_size | clients | tps | | tps |
> +---------+---------+-------+ +-------+
> | 21 MB | 1 | 10572 | | 10675 |
> | 21 MB | 2 | 21275 | | 21228 |
> | 21 MB | 4 | 41866 | | 41946 |
> | 21 MB | 8 | 53931 | | 55176 |
> | 21 MB | 12 | 50956 | | 54457 | +6.87%
> | 21 MB | 16 | 49911 | | 55468 | +11.11%
> | 21 MB | 24 | 46046 | | 56446 | +22.59%
> | 21 MB | 32 | 43405 | | 55177 | +27.12%
> | 7483 MB | 1 | 7734 | | 7721 |
> | 7483 MB | 2 | 19375 | | 19277 |
> | 7483 MB | 4 | 37408 | | 37685 |
> | 7483 MB | 8 | 49033 | | 49152 |
> | 7483 MB | 12 | 45525 | | 49241 | +8.16%
> | 7483 MB | 16 | 45731 | | 51425 | +12.45%
> | 7483 MB | 24 | 41533 | | 52349 | +26.04%
> | 7483 MB | 32 | 36370 | | 51022 | +40.28%
> | 15 GB | 1 | 7576 | | 7422 |
> | 15 GB | 2 | 19157 | | 19176 |
> | 15 GB | 4 | 37285 | | 36982 |
> | 15 GB | 8 | 48718 | | 48413 |
> | 15 GB | 12 | 45167 | | 48497 | +7.37%
> | 15 GB | 16 | 45270 | | 51276 | +13.27%
> | 15 GB | 24 | 40984 | | 51628 | +25.97%
> | 15 GB | 32 | 35918 | | 51060 | +42.16%
>
> Suggested-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Michael Wang <[email protected]>
> ---
> include/linux/sched.h | 5 +++++
> kernel/sched/fair.c | 33 ++++++++++++++++++++++++++++++++-
> kernel/sysctl.c | 10 ++++++++++
> 3 files changed, 47 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d35d2b6..e9efd3a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1197,6 +1197,10 @@ enum perf_event_task_context {
> perf_nr_task_contexts,
> };
>
> +#ifdef CONFIG_SMP
> +extern unsigned int sysctl_sched_wake_affine_interval;
> +#endif
> +
> struct task_struct {
> volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
> void *stack;
> @@ -1207,6 +1211,7 @@ struct task_struct {
> #ifdef CONFIG_SMP
> struct llist_node wake_entry;
> int on_cpu;
> + unsigned long next_wake_affine;
> #endif
> int on_rq;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a33e59..00d7f45 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3087,6 +3087,22 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,
>
> #endif
>
> +/*
> + * Default is 1ms, to prevent the wake_affine() stuff from working too frequently.
> + */
> +unsigned int sysctl_sched_wake_affine_interval = 1U;
> +
> +static inline int wake_affine_throttled(struct task_struct *p)
> +{
> + return time_before(jiffies, p->next_wake_affine);
> +}
> +
> +static inline void wake_affine_throttle(struct task_struct *p)
> +{
> + p->next_wake_affine = jiffies +
> + msecs_to_jiffies(sysctl_sched_wake_affine_interval);
> +}
> +
> static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> {
> s64 this_load, load;
> @@ -3096,6 +3112,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> unsigned long weight;
> int balanced;
>
> + if (wake_affine_throttled(p))
> + return 0;
> +
> idx = sd->wake_idx;
> this_cpu = smp_processor_id();
> prev_cpu = task_cpu(p);
> @@ -3342,8 +3361,20 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> }
>
> if (affine_sd) {
> - if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> + if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
> + /*
> + * wake_affine() tries to pull the wakee to a cpu near the
> + * waker; this benefits us if the data cached on the waker's
> + * cpu is hot for the wakee, or in the extreme ping-pong case.
> + *
> + * However, doing this blind work too frequently causes
> + * regressions for some workloads, thus each time wake_affine()
> + * succeeds, throttle it for a while.
> + */
> + wake_affine_throttle(p);
> prev_cpu = cpu;
> + }
>
> new_cpu = select_idle_sibling(p, prev_cpu);
> goto unlock;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index afc1dc6..6b798b6 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -437,6 +437,16 @@ static struct ctl_table kern_table[] = {
> .extra1 = &one,
> },
> #endif
> +#ifdef CONFIG_SMP
> + {
> + .procname = "sched_wake_affine_interval",
> + .data = &sysctl_sched_wake_affine_interval,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = &one,
> + },
> +#endif
> #ifdef CONFIG_PROVE_LOCKING
> {
> .procname = "prove_locking",
>

2013-04-08 10:00:59

by Peter Zijlstra

Subject: Re: [RFC PATCH] sched: wake-affine throttle

On Mon, 2013-03-25 at 13:24 +0800, Michael Wang wrote:
> if (affine_sd) {
> - if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> + if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
> + /*
> + * wake_affine() tries to pull the wakee to a cpu near the
> + * waker; this benefits us if the data cached on the waker's
> + * cpu is hot for the wakee, or in the extreme ping-pong case.
> + *
> + * However, doing this blind work too frequently causes
> + * regressions for some workloads, thus each time wake_affine()
> + * succeeds, throttle it for a while.
> + */
> + wake_affine_throttle(p);
> prev_cpu = cpu;
> + }

How about only throttling when wake_affine() starts returning false? At
that point it's lost its benefit.

Also, why not place this inside wake_affine() like you did the throttled
test.
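
Roughly something of this shape, perhaps (just a sketch of the
suggestion, not a tested patch; wake_affine_balanced() is a hypothetical
stand-in for the existing load computation), which would also drop the
extra hunk in select_task_rq_fair():

/*
 * Sketch only: keep both the throttled test and the stamping inside
 * wake_affine(), and only start an interval once the affine decision
 * comes out negative, i.e. once pulling has stopped paying off.
 */
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
{
	if (wake_affine_throttled(p))
		return 0;

	/* hypothetical helper wrapping the existing load comparison */
	if (!wake_affine_balanced(sd, p, sync)) {
		/* pulling stopped paying off: back off for an interval */
		wake_affine_throttle(p);
		return 0;
	}

	return 1;
}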

2013-04-09 05:01:27

by Michael Wang

Subject: Re: [RFC PATCH] sched: wake-affine throttle

On 04/08/2013 06:00 PM, Peter Zijlstra wrote:
> On Mon, 2013-03-25 at 13:24 +0800, Michael Wang wrote:
>> if (affine_sd) {
>> - if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>> + if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
>> + /*
>> + * wake_affine() tries to pull the wakee to a cpu near the
>> + * waker; this benefits us if the data cached on the waker's
>> + * cpu is hot for the wakee, or in the extreme ping-pong case.
>> + *
>> + * However, doing this blind work too frequently causes
>> + * regressions for some workloads, thus each time wake_affine()
>> + * succeeds, throttle it for a while.
>> + */
>> + wake_affine_throttle(p);
>> prev_cpu = cpu;
>> + }
>
> How about only throttling when wake_affine() starts returning false? At
> that point it's lost its benefit.

Honestly, I used to think this wouldn't work, but the test results show
it is even better than throttling on success... your suggestion is
correct, and it will be applied in the next version.

But please allow me to express my concern here:

In summary, I don't think 'return false' is the only point at which
wake-affine loses its benefit.

When wake_affine() returns true, the only thing guaranteed is the
balance, but what about factors like:

1. wakee has no hot data cached on curr_cpu
2. the hot data cached on curr_cpu for the wakee is less than that on prev_cpu
3. wakee has no relationship with waker at all
...

Those factors are not considered when wake_affine() returns true, but
they could be a reason for regression too, couldn't they?

Anyway, according to the test results, the benefit of wake-affine
preserved by 'throttle on failure' is impressive; those factors failed
to defend themselves...
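
For instance (purely illustrative, not something that has been posted;
both fields are hypothetical additions to task_struct), the kind of
relationship check that is missing could start as small as:

/*
 * Hypothetical sketch: remember the last wakee and how many times in a
 * row it repeats, and only treat a frequently-repeated pair as
 * "related" and worth pulling together.  Neither 'last_wakee' nor
 * 'wakee_repeats' exists in task_struct; they are assumptions made for
 * this illustration.
 */
static inline int wakee_related(struct task_struct *waker,
				struct task_struct *wakee)
{
	if (waker->last_wakee == wakee) {
		if (waker->wakee_repeats < 100)
			waker->wakee_repeats++;
		return waker->wakee_repeats > 1;
	}

	waker->last_wakee = wakee;
	waker->wakee_repeats = 0;
	return 0;
}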

>
> Also, why not place this inside wake_affine() like you did the throttled
> test.

That's right, will correct it :)

Regards,
Michael Wang
