2013-05-28 05:05:39

by Michael Wang

Subject: [RFC PATCH] sched: smart wake-affine

The wake-affine logic always tries to pull the wakee close to the waker.
In theory this brings benefit if the waker's cpu holds cache-hot data for
the wakee, or in the extreme ping-pong case.

Testing shows it can benefit hackbench by up to 15%.

However, the pull is done somewhat blindly and is time-consuming, so some
workloads suffer.

Testing shows it can damage pgbench by up to 50%.

Thus, the wake-affine logic should be smarter and realise when to stop
its thankless effort.

This patch introduces a per-task 'nr_wakee_switch', which is increased
each time the task switches its wakee.

So a high 'nr_wakee_switch' means the task has more than one wakee; for a
given count, the fewer wakees there are, the higher the wakeup frequency.
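A 1:1 waker/wakee pair, for example, never switches its wakee, while a
server that round-robins wakeups among many clients switches on almost
every wakeup.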

Now, when deciding whether to pull or not, pay attention to a wakee with
a high 'nr_wakee_switch': pulling such a task may benefit the wakee, but
it also implies that the waker will face cruel competition later. That
competition could be very cruel or very brief, depending on the story
behind 'nr_wakee_switch'; either way, the waker suffers.

Furthermore, if the waker also has a high 'nr_wakee_switch', multiple
tasks rely on it, so the waker's increased latency will damage all of
them; pulling the wakee in such cases seems to be a bad deal.

Thus, as 'waker->nr_wakee_switch / wakee->nr_wakee_switch' grows higher
and higher, the deal gets worse and worse.

This patch therefore makes the wake-affine logic stop its work when:

	wakee->nr_wakee_switch > factor &&
	waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)

The factor here is the number of online cpus, so more cpus lead to more
pulls, since the filter's trigger condition becomes more severe.
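For instance, on the 12-cpu test box below the factor is 12, so
wake-affine is bypassed only for a wakee that has switched wakees more
than 12 times within the current second, and only when the waker's
switch count exceeds twelve times the wakee's.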

With the patch applied, pgbench shows up to 42% improvement.

Test:
Tested on a 12-cpu x86 server with tip 3.10.0-rc1.

                              base                 smart

| db_size | clients |  tps  |        |  tps  |
+---------+---------+-------+        +-------+
| 21 MB   |       1 | 10749 |        | 10337 |
| 21 MB   |       2 | 21382 |        | 21391 |
| 21 MB   |       4 | 41570 |        | 41808 |
| 21 MB   |       8 | 52828 |        | 58792 |
| 21 MB   |      12 | 48447 |        | 54553 |
| 21 MB   |      16 | 46246 |        | 56726 |  +22.66%
| 21 MB   |      24 | 43850 |        | 56853 |  +29.65%
| 21 MB   |      32 | 43455 |        | 55846 |  +28.51%
| 7483 MB |       1 |  9290 |        |  8848 |
| 7483 MB |       2 | 19347 |        | 19351 |
| 7483 MB |       4 | 37135 |        | 37511 |
| 7483 MB |       8 | 47310 |        | 50210 |
| 7483 MB |      12 | 42721 |        | 49396 |
| 7483 MB |      16 | 41016 |        | 51826 |  +26.36%
| 7483 MB |      24 | 37540 |        | 52579 |  +40.06%
| 7483 MB |      32 | 36756 |        | 51332 |  +39.66%
| 15 GB   |       1 |  8758 |        |  8670 |
| 15 GB   |       2 | 19204 |        | 19249 |
| 15 GB   |       4 | 36997 |        | 37199 |
| 15 GB   |       8 | 46578 |        | 50681 |
| 15 GB   |      12 | 42141 |        | 48671 |
| 15 GB   |      16 | 40518 |        | 51280 |  +26.56%
| 15 GB   |      24 | 36788 |        | 52329 |  +42.24%
| 15 GB   |      32 | 36056 |        | 50350 |  +39.64%



CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Mike Galbraith <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/sched/fair.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 48 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..1c996c7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1041,6 +1041,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
+        struct task_struct *last_wakee;
+        unsigned long nr_wakee_switch;
+        unsigned long last_switch_decay;
#endif
int on_rq;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f62b16d..eaaceb7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3127,6 +3127,45 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,

#endif

+static void record_wakee(struct task_struct *p)
+{
+        /*
+         * Rough decay: don't worry about the boundary; a really
+         * active task won't mind the imprecision.
+         */
+        if (jiffies > current->last_switch_decay + HZ) {
+                current->nr_wakee_switch = 0;
+                current->last_switch_decay = jiffies;
+        }
+
+        if (current->last_wakee != p) {
+                current->last_wakee = p;
+                current->nr_wakee_switch++;
+        }
+}
+
+static int nasty_pull(struct task_struct *p)
+{
+        int factor = cpumask_weight(cpu_online_mask);
+
+        /*
+         * Yeah, it's the switching frequency: it could mean many wakees
+         * or rapid switching. Using the factor here just helps to
+         * automatically adjust the degree of looseness, so more cpus
+         * will lead to more pulls.
+         */
+        if (p->nr_wakee_switch > factor) {
+                /*
+                 * The wakee is somewhat hot and needs a certain amount
+                 * of cpu resource, so if the waker is far hotter, prefer
+                 * to leave it alone.
+                 */
+                if (current->nr_wakee_switch > (factor * p->nr_wakee_switch))
+                        return 1;
+        }
+
+        return 0;
+}
+
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
{
s64 this_load, load;
@@ -3136,6 +3175,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
unsigned long weight;
int balanced;

+        if (nasty_pull(p))
+                return 0;
+
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -3428,6 +3470,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
/* while loop will break here if sd == NULL */
}
unlock:
+        if (sd_flag & SD_BALANCE_WAKE)
+                record_wakee(p);
+
rcu_read_unlock();

return new_cpu;
--
1.7.4.1


2013-06-03 02:29:18

by Michael Wang

Subject: Re: [RFC PATCH] sched: smart wake-affine

On 05/28/2013 01:05 PM, Michael Wang wrote:
> The wake-affine logic always tries to pull the wakee close to the waker.
> In theory this brings benefit if the waker's cpu holds cache-hot data
> for the wakee, or in the extreme ping-pong case.
>
> Testing shows it can benefit hackbench by up to 15%.
>
> However, the pull is done somewhat blindly and is time-consuming, so
> some workloads suffer.
>
> Testing shows it can damage pgbench by up to 50%.
>
> Thus, the wake-affine logic should be smarter and realise when to stop
> its thankless effort.

Are there any comments?

Peter, do you have any comments on this idea? Is this the kind of fix we
are looking for? I think you mentioned we want some kind of filter
rather than a knob, correct?

Folks, please let me know your concerns so I can push the research
work forward :)

Regards,
Michael Wang


> [snip]

2013-06-03 03:09:21

by Mike Galbraith

Subject: Re: [RFC PATCH] sched: smart wake-affine

On Mon, 2013-06-03 at 10:28 +0800, Michael Wang wrote:
> On 05/28/2013 01:05 PM, Michael Wang wrote:
> > [snip]
>
> Are there any comments?

(I haven't had time to test-drive yet, -rt munches time like popcorn)

2013-06-03 03:26:33

by Michael Wang

Subject: Re: [RFC PATCH] sched: smart wake-affine

On 06/03/2013 11:09 AM, Mike Galbraith wrote:
> On Mon, 2013-06-03 at 10:28 +0800, Michael Wang wrote:
>> On 05/28/2013 01:05 PM, Michael Wang wrote:
>>> [snip]
>>
>> Are there any comments?
>
> (I haven't had time to test-drive yet, -rt munches time like popcorn)

I see ;-)

During my testing, this one worked well on the box: it solved the
pgbench issue and didn't harm hackbench at all; I think we have caught
a good point here :)

Regards,
Michael Wang


2013-06-03 03:53:32

by Mike Galbraith

Subject: Re: [RFC PATCH] sched: smart wake-affine

On Mon, 2013-06-03 at 11:26 +0800, Michael Wang wrote:
> On 06/03/2013 11:09 AM, Mike Galbraith wrote:
> > On Mon, 2013-06-03 at 10:28 +0800, Michael Wang wrote:
> >> On 05/28/2013 01:05 PM, Michael Wang wrote:
> >>> [snip]
> >>
> >> Are there any comments?
> >
> > (I haven't had time to test-drive yet, -rt munches time like popcorn)
>
> I see ;-)
>
> During my testing, this one worked well on the box: it solved the
> pgbench issue and didn't harm hackbench at all; I think we have caught
> a good point here :)

Some wider spectrum testing needs doing though. Hackbench is a good
sign, but localhost and db type stuff that really suffers from misses
would be good to test. Java crud tends to be sensitive too. I used to
watch vmark (crap) as an indicator: if you see unhappiness there, you'll
very likely see it in other loads as well; it is very fond of
cache-affine wakeups, but loathes preemption (super heavy loads usually do).

-Mike

2013-06-03 04:53:13

by Michael Wang

Subject: Re: [RFC PATCH] sched: smart wake-affine

On 06/03/2013 11:53 AM, Mike Galbraith wrote:
> On Mon, 2013-06-03 at 11:26 +0800, Michael Wang wrote:
>> On 06/03/2013 11:09 AM, Mike Galbraith wrote:
>>> On Mon, 2013-06-03 at 10:28 +0800, Michael Wang wrote:
>>>> On 05/28/2013 01:05 PM, Michael Wang wrote:
>>>>> [snip]
>>>>
>>>> Are there any comments?
>>>
>>> (I haven't had time to test-drive yet, -rt munches time like popcorn)
>>
>> I see ;-)
>>
>> During my testing, this one worked well on the box: it solved the
>> pgbench issue and didn't harm hackbench at all; I think we have caught
>> a good point here :)
>
> Some wider spectrum testing needs doing though.

That's right. The benchmarks I currently have are hackbench, pgbench,
ebizzy, aim7, tbench, dbench and kbench; are there any other good
candidates we should add to the test?

> Hackbench is a good
> sign, but localhost and db type stuff that really suffers from misses
> would be good to test. Java crud tends to be sensitive too. I used to
> watch vmark (crap) as an indicator:

I can't get it from google...do you mean vmmark?

> if you see unhappiness there, you'll
> very likely see it in other loads as well; it is very fond of
> cache-affine wakeups, but loathes preemption (super heavy loads usually do).

I agree that this idea, in other words 'stop wake-affine when current is
busy with wakeups', may miss some chances to bring benefit; although I
could not find such a workload, I can't promise there is none...

However, IMHO wake-affine is really an optional accelerator: if we want
to adopt it without any knob, what we have to make sure is that it won't
damage any workload, not that it catches every chance to bring benefit.

My point here is that we stop the additional accelerator when it seems
to bring damage, and we have currently found one such case ;-)

So the worst outcome of this idea is a missed chance to accelerate;
however, since such acceleration would sacrifice others and we don't
want a knob, we simply filter it out.

Regards,
Michael Wang

>
> -Mike

2013-06-03 05:22:41

by Mike Galbraith

Subject: Re: [RFC PATCH] sched: smart wake-affine

On Mon, 2013-06-03 at 12:52 +0800, Michael Wang wrote:
> On 06/03/2013 11:53 AM, Mike Galbraith wrote:
> > On Mon, 2013-06-03 at 11:26 +0800, Michael Wang wrote:
> >> On 06/03/2013 11:09 AM, Mike Galbraith wrote:
> >>> On Mon, 2013-06-03 at 10:28 +0800, Michael Wang wrote:
> >>>> On 05/28/2013 01:05 PM, Michael Wang wrote:
> >>>>> [snip]
> >>>>
> >>>> Are there any comments?
> >>>
> >>> (I haven't had time to test-drive yet, -rt munches time like popcorn)
> >>
> >> I see ;-)
> >>
> >> During my testing, this one worked well on the box: it solved the
> >> pgbench issue and didn't harm hackbench at all; I think we have
> >> caught a good point here :)
> >
> > Some wider spectrum testing needs doing though.
>
> That's right. The benchmarks I currently have are hackbench, pgbench,
> ebizzy, aim7, tbench, dbench and kbench; are there any other good
> candidates we should add to the test?

pgsql/mysql+oltp are useful. I used to track mysql especially all the
time, until I lost my fast mysql database, and couldn't re-create
anything that wasn't a complete slug.

> > Hackbench is a good
> > sign, but localhost and db type stuff that really suffers from misses
> > would be good to test. Java crud tends to be sensitive too. I used to
> > watch vmark (crap) as an indicator:
>
> I can't get it from google...do you mean vmmark?

No, crusty old, and widely disparaged as being a useless POS benchmark,
volanomark. The big boys do SPECjbb, maybe that's better quality java
crud, dunno, never had it to play with.

> > if you see unhappiness there, you'll
> > very likely see it in other loads as well; it is very fond of
> > cache-affine wakeups, but loathes preemption (super heavy loads usually do).
>
> I agree that this idea, in other words 'stop wake-affine when current is
> busy with wakeups', may miss some chances to bring benefit; although I
> could not find such a workload, I can't promise there is none...

Someday we'll find the perfect balance... likely the day before the sun
turns into a red giant and melts the earth.

-Mike

2013-06-03 05:50:45

by Michael Wang

Subject: Re: [RFC PATCH] sched: smart wake-affine

On 06/03/2013 01:22 PM, Mike Galbraith wrote:
[snip]
>>
>> I agree that this idea, in other words 'stop wake-affine when current is
>> busy with wakeups', may miss some chances to bring benefit; although I
>> could not find such a workload, I can't promise there is none...
>
> Someday we'll find the perfect balance... likely the day before the sun
> turns into a red giant and melts the earth.

Won't take so long ;-)

I would like to stop the regression on pgbench first. As PeterZ
mentioned, if someone reports other regressions we will know what is
missing; if a fix is possible, we fix it, and if the cost is too high,
then I say we give up the ill-gotten gain; after all, we should not
benefit one workload at the cost of sacrificing others...

I'd like to fix the problem ASAP; from my point of view it's really a
big, urgent problem, but it hasn't won as much attention as I thought it would...

The regression has been buried and hidden for too long; if we miss the
chance to correct the issue this time, it will no doubt be buried again...

Regards,
Michael Wang

>
> -Mike

2013-06-03 06:05:31

by Mike Galbraith

Subject: Re: [RFC PATCH] sched: smart wake-affine

On Mon, 2013-06-03 at 13:50 +0800, Michael Wang wrote:
> On 06/03/2013 01:22 PM, Mike Galbraith wrote:
> [snip]
> >>
> >> I agree that this idea, in other words 'stop wake-affine when current
> >> is busy with wakeups', may miss some chances to bring benefit;
> >> although I could not find such a workload, I can't promise there is
> >> none...
> >
> > Someday we'll find the perfect balance... likely the day before the sun
> > turns into a red giant and melts the earth.
>
> Won't take so long ;-)
>
> I would like to stop the regression on pgbench first. As PeterZ
> mentioned, if someone reports other regressions we will know what is
> missing; if a fix is possible, we fix it, and if the cost is too high,
> then I say we give up the ill-gotten gain; after all, we should not
> benefit one workload at the cost of sacrificing others...
>
> I'd like to fix the problem ASAP; from my point of view it's really a
> big, urgent problem, but it hasn't won as much attention as I thought
> it would...

I fully agree that it's a problem, but not that it's a regression. The
"we became too buddy-centric" problem has existed for a long time, it's
just that pgbench in 1:N mode shows us how much that pull pull pull can
cost us in scalability.

A much more interesting pgbench test (imho) would be with one server per
socket. 1 server (mother of all work) driving a multi-socket sized load
is just silly, can't possibly scale, so it's important that improving
1:N pgbench (we can, and need to) doesn't harm sane loads.

-Mike

2013-06-03 06:31:52

by Michael Wang

Subject: Re: [RFC PATCH] sched: smart wake-affine

On 06/03/2013 02:05 PM, Mike Galbraith wrote:
> On Mon, 2013-06-03 at 13:50 +0800, Michael Wang wrote:
>> [snip]
>
> I fully agree that it's a problem, but not that it's a regression.

I won't disagree; it's just a matter of viewpoint. I used to compare the
whole thing with/without the wake-affine stuff, and I found it damages
pgbench to benefit hackbench; furthermore, that is the mainline default...

> The
> "we became too buddy-centric" problem has existed for a long time, it's
> just that pgbench in 1:N mode shows us how much that pull pull pull can
> cost us in scalability.

Agreed, that was something we missed while we were so happily figuring
out the good side of the pull...

>
> A much more interesting pgbench test (imho) would be with one server per
> socket.

Maybe configuring the database could achieve that; it sounds like a
better deployment to me ;-)

> 1 server (mother of all work) driving a multi-socket sized load
> is just silly, can't possibly scale, so it's important that improving
> 1:N pgbench (we can, and need to) doesn't harm sane loads.

I guess the pgbench behaviour depends heavily on the database
configuration on the box; I think there are still many interesting
points we could dig into :)

Anyway, I still prefer to do it step by step: if we don't fix the issue
we have already found, then the deeper research will be based on a
forked, unreal world...

Regards,
Michael Wang


>
> -Mike

2013-06-13 03:09:29

by Michael Wang

Subject: Re: [RFC PATCH] sched: smart wake-affine

Hi, Peter

Would you like to give some comments on this approach? Or maybe just a
hint about what the concern is? The damage to pgbench is still there...

Regards,
Michael Wang


On 05/28/2013 01:05 PM, Michael Wang wrote:
> [snip]

2013-07-02 04:43:56

by Michael Wang

Subject: [PATCH] sched: smart wake-affine

Since RFC:
Tested again with the latest tip 3.10.0-rc7.

The wake-affine logic always tries to pull the wakee close to the waker.
In theory this brings benefit if the waker's cpu holds cache-hot data for
the wakee, or in the extreme ping-pong case.

Testing shows it can benefit hackbench by up to 15%.

However, the pull is done somewhat blindly and is time-consuming, so some
workloads suffer.

Testing shows it can damage pgbench by up to 50%.

Thus, the wake-affine logic should be smarter and realise when to stop
its thankless effort.

This patch introduces 'nr_wakee_switch', which is increased each time
the task switches its wakee.

So a high 'nr_wakee_switch' means the task has more than one wakee, and
the bigger the number, the higher the wakeup frequency.

Now, when deciding whether to pull or not, pay attention to a wakee with
a high 'nr_wakee_switch': pulling such a task may benefit the wakee, but
it also implies that the waker will face cruel competition later. That
competition could be very cruel or very brief, depending on the story
behind 'nr_wakee_switch'; either way, the waker suffers.

Furthermore, if the waker also has a high 'nr_wakee_switch', multiple
tasks rely on it, so the waker's increased latency will damage all of
them; pulling the wakee seems to be a bad deal.

Thus, as 'waker->nr_wakee_switch / wakee->nr_wakee_switch' grows higher
and higher, the deal gets worse and worse.

The patch therefore helps the wake-affine logic to stop its work when:

	wakee->nr_wakee_switch > factor &&
	waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)

The factor here is the number of online cpus, and more cpus lead to more
pulls, since the filter's trigger condition becomes more severe.

With the patch applied, pgbench shows up to 40% improvement.

Test:
Tested on a 12-cpu x86 server with tip 3.10.0-rc7.

                              base                 smart

| db_size | clients |  tps  |        |  tps  |
+---------+---------+-------+        +-------+
| 22 MB   |       1 | 10598 |        | 10693 |
| 22 MB   |       2 | 21257 |        | 21409 |
| 22 MB   |       4 | 41386 |        | 41517 |
| 22 MB   |       8 | 51253 |        | 58173 |
| 22 MB   |      12 | 48570 |        | 53817 |
| 22 MB   |      16 | 46748 |        | 55992 |  +19.77%
| 22 MB   |      24 | 44346 |        | 56087 |  +26.48%
| 22 MB   |      32 | 43460 |        | 54781 |  +26.05%
| 7484 MB |       1 |  8951 |        |  9336 |
| 7484 MB |       2 | 19233 |        | 19348 |
| 7484 MB |       4 | 37239 |        | 37316 |
| 7484 MB |       8 | 46087 |        | 49329 |
| 7484 MB |      12 | 42054 |        | 49231 |
| 7484 MB |      16 | 40765 |        | 51082 |  +25.31%
| 7484 MB |      24 | 37651 |        | 52740 |  +40.08%
| 7484 MB |      32 | 37056 |        | 50866 |  +37.27%
| 15 GB   |       1 |  8845 |        |  9124 |
| 15 GB   |       2 | 19094 |        | 19187 |
| 15 GB   |       4 | 36979 |        | 37178 |
| 15 GB   |       8 | 46087 |        | 50075 |
| 15 GB   |      12 | 41901 |        | 48098 |
| 15 GB   |      16 | 40147 |        | 51463 |  +28.19%
| 15 GB   |      24 | 37250 |        | 51750 |  +38.93%
| 15 GB   |      32 | 36470 |        | 50807 |  +39.31%

CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Mike Galbraith <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/sched/fair.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 48 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..1c996c7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1041,6 +1041,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
+        struct task_struct *last_wakee;
+        unsigned long nr_wakee_switch;
+        unsigned long last_switch_decay;
#endif
int on_rq;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c61a614..591c113 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3109,6 +3109,45 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,

#endif

+static void record_wakee(struct task_struct *p)
+{
+        /*
+         * Rough decay: don't worry about the boundary; a really
+         * active task won't mind the imprecision.
+         */
+        if (jiffies > current->last_switch_decay + HZ) {
+                current->nr_wakee_switch = 0;
+                current->last_switch_decay = jiffies;
+        }
+
+        if (current->last_wakee != p) {
+                current->last_wakee = p;
+                current->nr_wakee_switch++;
+        }
+}
+
+static int nasty_pull(struct task_struct *p)
+{
+        int factor = cpumask_weight(cpu_online_mask);
+
+        /*
+         * Yeah, it's the switching frequency: it could mean many wakees
+         * or rapid switching. Using the factor here just helps to
+         * automatically adjust the degree of looseness, so more cpus
+         * will lead to more pulls.
+         */
+        if (p->nr_wakee_switch > factor) {
+                /*
+                 * The wakee is somewhat hot and needs a certain amount
+                 * of cpu resource, so if the waker is far hotter, prefer
+                 * to leave it alone.
+                 */
+                if (current->nr_wakee_switch > (factor * p->nr_wakee_switch))
+                        return 1;
+        }
+
+        return 0;
+}
+
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
{
s64 this_load, load;
@@ -3118,6 +3157,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
unsigned long weight;
int balanced;

+        if (nasty_pull(p))
+                return 0;
+
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -3410,6 +3452,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
/* while loop will break here if sd == NULL */
}
unlock:
+        if (sd_flag & SD_BALANCE_WAKE)
+                record_wakee(p);
+
rcu_read_unlock();

return new_cpu;
--
1.7.4.1

2013-07-02 05:38:46

by Mike Galbraith

Subject: Re: [PATCH] sched: smart wake-affine

On Tue, 2013-07-02 at 12:43 +0800, Michael Wang wrote:

> +static int nasty_pull(struct task_struct *p)
> +{
> +        int factor = cpumask_weight(cpu_online_mask);
> +
> +        /*
> +         * Yeah, it's the switching frequency: it could mean many wakees
> +         * or rapid switching. Using the factor here just helps to
> +         * automatically adjust the degree of looseness, so more cpus
> +         * will lead to more pulls.
> +         */
> +        if (p->nr_wakee_switch > factor) {
> +                /*
> +                 * The wakee is somewhat hot and needs a certain amount
> +                 * of cpu resource, so if the waker is far hotter, prefer
> +                 * to leave it alone.
> +                 */
> +                if (current->nr_wakee_switch > (factor * p->nr_wakee_switch))
> +                        return 1;
> +        }
> +
> +        return 0;
> +}

Ew. I haven't gotten around to test-driving this patchlet, and I see
you haven't gotten around to finding a better name either. Any other
name will likely have a better chance of flying.

tasks_related()
...
well, nearly any..
tasks_think_wake_affine_sucks_rocks()
..that won't fly either :)

-Mike

2013-07-02 05:50:43

by Michael Wang

Subject: Re: [PATCH] sched: smart wake-affine

On 07/02/2013 01:38 PM, Mike Galbraith wrote:
> On Tue, 2013-07-02 at 12:43 +0800, Michael Wang wrote:
>
>> [snip]
>
> Ew. I haven't gotten around to test-driving this patchlet, and I see
> you haven't gotten around to finding a better name either. Any other
> name will likely have a better chance of flying.

Trust me, I've tried to find a good name... and some cells in my brain
were bravely sacrificed for it ;-)

>
> tasks_related()
> ...
> well, nearly any..
> tasks_think_wake_affine_sucks_rocks()
> ..that won't fly either :)

Hmm... better than those in my mind (like dragon_wake_affine(); well...
at least a dragon could fly).

Anyway, if the idea itself becomes acceptable then any name is OK for
me; let's figure out a good one at that time :)

Regards,
Michael Wang


>
> -Mike
>

2013-07-02 05:54:54

by Mike Galbraith

Subject: Re: [PATCH] sched: smart wake-affine

On Tue, 2013-07-02 at 12:43 +0800, Michael Wang wrote:
> Since RFC:
> Tested again with the latest tip 3.10.0-rc7.
>
> The wake-affine logic always tries to pull the wakee close to the waker.
> In theory this brings benefit if the waker's cpu holds cache-hot data
> for the wakee, or in the extreme ping-pong case.
>
> Testing shows it can benefit hackbench by up to 15%.

How much does this still help with Alex's patches integrated?

aside: were I a maintainer, I'd be a little concerned that what this
helps with collides somewhat with the ongoing numa work.

-Mike

2013-07-02 06:18:00

by Michael Wang

Subject: Re: [PATCH] sched: smart wake-affine

On 07/02/2013 01:54 PM, Mike Galbraith wrote:
> On Tue, 2013-07-02 at 12:43 +0800, Michael Wang wrote:
>> Since RFC:
>> Tested again with the latest tip 3.10.0-rc7.
>>
>> The wake-affine logic always tries to pull the wakee close to the waker.
>> In theory this brings benefit if the waker's cpu holds cache-hot data
>> for the wakee, or in the extreme ping-pong case.
>>
>> Testing shows it can benefit hackbench by up to 15%.
>
> How much does this still help with Alex's patches integrated?

I remember Alex already tested hackbench; as for wake_affine(), his
patch set is a kind of load filter while mine is an nr_wakee filter, so
they are separate. But I will do more testing on this point once it
becomes the last remaining concern.

>
> aside: were I a maintainer, I'd be a little concerned that what this
> helps with collides somewhat with the ongoing numa work.

As Peter mentioned before, we currently need some solution like the
buddy idea, and if folks report regressions (I suppose they won't...),
we will have more data then.

So we can first try to regain the lost pgbench performance; if that
strips the benefit from other benchmarks, let's fix it, and in the end
we will have a really smart wake-affine that no one will complain about ;-)

Regards,
Michael Wang

>
> -Mike

2013-07-02 06:29:25

by Mike Galbraith

Subject: Re: [PATCH] sched: smart wake-affine

On Tue, 2013-07-02 at 14:17 +0800, Michael Wang wrote:

> As Peter mentioned before, we currently need some solution like the
> buddy-idea, and when folks report regression (I suppose they won't...),
> we will have more data then.
>
> So we could firstly try to regain the lost performance of pgbench, if it
> strip the benefit of other benchmarks, let's fix it, and at last we will
> have a real smart wake-affine and no one will complain ;-)

The idea is plenty simple (and the fastpath has a deep and abiding love
of simple) so the idea itself flies in my book. It doesn't add as much
knowledge as may be nice to have, but if it adds enough to help pgbench
and ilk without harming others, cool.

-Mike

2013-07-02 06:45:35

by Michael Wang

Subject: Re: [PATCH] sched: smart wake-affine

On 07/02/2013 02:29 PM, Mike Galbraith wrote:
> On Tue, 2013-07-02 at 14:17 +0800, Michael Wang wrote:
>
>> As Peter mentioned before, we currently need some solution like the
>> buddy-idea, and when folks report regression (I suppose they won't...),
>> we will have more data then.
>>
>> So we could firstly try to regain the lost performance of pgbench, if it
>> strip the benefit of other benchmarks, let's fix it, and at last we will
>> have a real smart wake-affine and no one will complain ;-)
>
> The idea is plenty simple (and the fastpath has a deep and abiding love
> of simple) so the idea itself flies in my book. It doesn't add as much
> knowledge as may be nice to have, but if it adds enough to help pgbench
> and ilk without harming others, cool.

Nice to know you like it ;-)

There is some thinking behind the idea. Since a knob is unacceptable, I
tried to make the filter stricter: we could actually get all of the lost
50% performance back, but that runs the risk of stripping others'
benefit (like hackbench's); if we only take 40% of the performance back,
we can reduce that risk to nearly 0.

So the principle of this idea is to filter out the extremely bad cases,
where we are sure the chance of messing things up is very high; thus
wake_affine() becomes a little smarter and knows to stop in front of
the cliff...

Regards,
Michael Wang


>
> -Mike

2013-07-02 08:52:17

by Peter Zijlstra

Subject: Re: [PATCH] sched: smart wake-affine

On Tue, Jul 02, 2013 at 12:43:44PM +0800, Michael Wang wrote:

> Signed-off-by: Michael Wang <[email protected]>
> ---
> include/linux/sched.h | 3 +++
> kernel/sched/fair.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 48 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 178a8d9..1c996c7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1041,6 +1041,9 @@ struct task_struct {
> #ifdef CONFIG_SMP
> struct llist_node wake_entry;
> int on_cpu;
> + struct task_struct *last_wakee;
> + unsigned long nr_wakee_switch;
> + unsigned long last_switch_decay;
> #endif
> int on_rq;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c61a614..591c113 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3109,6 +3109,45 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,
>
> #endif
>
> +static void record_wakee(struct task_struct *p)
> +{
> +        /*
> +         * Rough decay: don't worry about the boundary; a really
> +         * active task won't mind the imprecision.
> +         */

OK so we 'decay' once a second.

> + if (jiffies > current->last_switch_decay + HZ) {
> + current->nr_wakee_switch = 0;
> + current->last_switch_decay = jiffies;
> + }

This isn't so much a decay as it is wiping state. Did you try an actual
decay -- something like: current->nr_wakee_switch >>= 1; ?

I suppose you wanted to avoid something like:

        now = jiffies;
        while (now > current->last_switch_decay + HZ) {
                current->nr_wakee_switch >>= 1;
                current->last_switch_decay += HZ;
        }

?

And we increment every time we wake someone else. Gaining a measure of
how often we wake someone else.

> + if (current->last_wakee != p) {
> + current->last_wakee = p;
> + current->nr_wakee_switch++;
> + }
> +}
> +
> +static int nasty_pull(struct task_struct *p)

I've seen there's some discussion as to this function name.. good :-) It
really wants to change. How about something like:

int wake_affine()
{
        ...

        /*
         * If we wake multiple tasks be careful to not bounce
         * ourselves around too much.
         */
        if (wake_wide(p))
                return 0;


> +{
> + int factor = cpumask_weight(cpu_online_mask);

We have num_online_cpus() for this.. however both are rather expensive.
Having to walk and count a 4096-bit bitmap for every wakeup is going to
get tiresome real quick.

I suppose the question is; to what level do we really want to scale?

One fair answer would be node size I suppose; do you really want to go
bigger than that?

Also, you compare a size against a switching frequency; that's not
really an apples-to-apples comparison.

> +
> +        /*
> +         * Yeah, it's the switching frequency: it could mean many wakees
> +         * or rapid switching. Using the factor here just helps to
> +         * automatically adjust the degree of looseness, so more cpus
> +         * will lead to more pulls.
> +         */
> +        if (p->nr_wakee_switch > factor) {
> +                /*
> +                 * The wakee is somewhat hot and needs a certain amount
> +                 * of cpu resource, so if the waker is far hotter, prefer
> +                 * to leave it alone.
> +                 */
> +                if (current->nr_wakee_switch > (factor * p->nr_wakee_switch))
> +                        return 1;

Ah ok, this makes more sense; the first is simply a filter to avoid
doing the second dereference I suppose.

> + }
> +
> + return 0;
> +}
> +
> static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> {
> s64 this_load, load;
> @@ -3118,6 +3157,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> unsigned long weight;
> int balanced;
>
> + if (nasty_pull(p))
> + return 0;
> +
> idx = sd->wake_idx;
> this_cpu = smp_processor_id();
> prev_cpu = task_cpu(p);
> @@ -3410,6 +3452,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> /* while loop will break here if sd == NULL */
> }
> unlock:
> + if (sd_flag & SD_BALANCE_WAKE)
> + record_wakee(p);

if we put this in task_waking_fair() we can avoid an entire conditional!
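A minimal sketch of what that could look like (the exact placement is an
assumption here; the callback's existing vruntime handling is elided):

        static void task_waking_fair(struct task_struct *p)
        {
                /* ... existing min_vruntime adjustment stays as-is ... */

                /*
                 * task_waking_fair() is only called on the wakeup path,
                 * so recording the wakee here needs no SD_BALANCE_WAKE
                 * check.
                 */
                record_wakee(p);
        }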

2013-07-02 09:36:07

by Michael Wang

Subject: Re: [PATCH] sched: smart wake-affine

Hi, Peter

Thanks for your review :)

On 07/02/2013 04:52 PM, Peter Zijlstra wrote:
[snip]
>> +static void record_wakee(struct task_struct *p)
>> +{
>> +        /*
>> +         * Rough decay: don't worry about the boundary; a really
>> +         * active task won't mind the imprecision.
>> +         */
>
> OK so we 'decay' once a second.
>
>> + if (jiffies > current->last_switch_decay + HZ) {
>> + current->nr_wakee_switch = 0;
>> + current->last_switch_decay = jiffies;
>> + }
>
> This isn't so much a decay as it is wiping state. Did you try an actual
> decay -- something like: current->nr_wakee_switch >>= 1; ?
>
> I suppose you wanted to avoid something like:
>
>         now = jiffies;
>         while (now > current->last_switch_decay + HZ) {
>                 current->nr_wakee_switch >>= 1;
>                 current->last_switch_decay += HZ;
>         }

Right. Actually I have thought about the decay problem and run some
tests, including some similar implementations like this one, but one
issue I could not solve is:

	a task woken up 10secs after dequeue and a task woken up 1sec
	after dequeue will suffer the same decay.

Thus, in order to stay fair, we would have to do some calculation here
to make the decay correct, but that means cost...

So I picked this wiping method, and its cost/performance is not so bad :)
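Just to illustrate the kind of calculation I mean, an untested sketch
(not part of the patch) that decays in proportion to the number of whole
seconds elapsed, so a long sleep loses more history without looping:

        static void decay_nr_wakee_switch(struct task_struct *p)
        {
                unsigned long secs = (jiffies - p->last_switch_decay) / HZ;

                if (!secs)
                        return;

                /* one halving per elapsed second; long gaps simply clear it */
                if (secs < BITS_PER_LONG)
                        p->nr_wakee_switch >>= secs;
                else
                        p->nr_wakee_switch = 0;

                p->last_switch_decay = jiffies;
        }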

>
> ?
>
> And we increment every time we wake someone else. Gaining a measure of
> how often we wake someone else.
>
>> + if (current->last_wakee != p) {
>> + current->last_wakee = p;
>> + current->nr_wakee_switch++;
>> + }
>> +}
>> +
>> +static int nasty_pull(struct task_struct *p)
>
> I've seen there's some discussion as to this function name.. good :-) It
> really wants to change. How about something like:
>
> int wake_affine()
> {
>         ...
>
>         /*
>          * If we wake multiple tasks be careful to not bounce
>          * ourselves around too much.
>          */
>         if (wake_wide(p))
>                 return 0;

Do you mean wake_wipe() here?

>
>
>> +{
>> + int factor = cpumask_weight(cpu_online_mask);
>
>> We have num_online_cpus() for this.. however both are rather expensive.
>> Having to walk and count a 4096-bit bitmap for every wakeup is going to
>> get tiresome real quick.
>
> I suppose the question is; to what level do we really want to scale?
>
> One fair answer would be node size I suppose; do you really want to go
> bigger than that?

Agreed, it sounds more reasonable; let me do some testing on it.

>
> Also, you compare a size against a switching frequency; that's not
> really an apples-to-apples comparison.
>
>> +
>> +        /*
>> +         * Yeah, it's the switching frequency: it could mean many wakees
>> +         * or rapid switching. Using the factor here just helps to
>> +         * automatically adjust the degree of looseness, so more cpus
>> +         * will lead to more pulls.
>> +         */
>> +        if (p->nr_wakee_switch > factor) {
>> +                /*
>> +                 * The wakee is somewhat hot and needs a certain amount
>> +                 * of cpu resource, so if the waker is far hotter, prefer
>> +                 * to leave it alone.
>> +                 */
>> +                if (current->nr_wakee_switch > (factor * p->nr_wakee_switch))
>> +                        return 1;
>
> Ah ok, this makes more sense; the first is simply a filter to avoid
> doing the second dereference I suppose.

Yeah, the first one is a kind of coarse filter; the second one is the
core filter ;-)

>
>> + }
>> +
>> + return 0;
>> +}
>> +
>> static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>> {
>> s64 this_load, load;
>> @@ -3118,6 +3157,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>> unsigned long weight;
>> int balanced;
>>
>> + if (nasty_pull(p))
>> + return 0;
>> +
>> idx = sd->wake_idx;
>> this_cpu = smp_processor_id();
>> prev_cpu = task_cpu(p);
>> @@ -3410,6 +3452,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>> /* while loop will break here if sd == NULL */
>> }
>> unlock:
>> + if (sd_flag & SD_BALANCE_WAKE)
>> + record_wakee(p);
>
> if we put this in task_waking_fair() we can avoid an entire conditional!

Nice, I will do it in the next version :)

Regards,
Michael Wang


2013-07-02 09:44:27

by Michael Wang

Subject: Re: [PATCH] sched: smart wake-affine

On 07/02/2013 05:35 PM, Michael Wang wrote:
[snip]
>> I've seen there's some discussion as to this function name.. good :-) It
>> really wants to change. How about something like:
>>
>> int wake_affine()
>> {
>>         ...
>>
>>         /*
>>          * If we wake multiple tasks be careful to not bounce
>>          * ourselves around too much.
>>          */
>>         if (wake_wide(p))
>>                 return 0;
>
> Do you mean wake_wipe() here?

Oh, wake_wide() means don't pull tasks together, I got it ;-)

Regards,
Michael Wang

> [snip]

2013-07-03 06:10:46

by Michael Wang

Subject: [PATCH v2] sched: smart wake-affine

Since v1:
        1. Invoke record_wakee() at a better time. (Thanks to PeterZ)
        2. Use node size instead of the online cpu number as the factor,
           to reduce overhead and make the logic more reasonable.
           (Thanks to PeterZ)
        3. Name fix and more comments. (Thanks to PeterZ)
        4. Tested again with tip 3.10.0-rc7.

The wake-affine logic always tries to pull the wakee close to the waker.
In theory this brings benefit when the waker's cpu has cached hot data
for the wakee, or in the extreme ping-pong case.

Testing shows it can benefit hackbench by up to 15%.

However, the logic is somewhat blind and time-consuming, so some
workloads suffer.

Testing shows it can damage pgbench by up to 50%.

Thus, the wake-affine logic should be smarter and realise when to stop
its thankless effort.

This patch introduces 'nr_wakee_switch', which is increased each time a
task switches its wakee.

So a high 'nr_wakee_switch' means the task has more than one wakee, and
the bigger the number, the higher the wakeup frequency.

Now, when deciding whether to pull, pay attention to a wakee with a high
'nr_wakee_switch': pulling such a task may benefit the wakee, but it
also implies that the waker will face cruel competition later; it could
be very cruel or very fast depending on the story behind
'nr_wakee_switch', and either way the waker suffers.

Furthermore, if the waker also has a high 'nr_wakee_switch', multiple
tasks rely on it, so the waker's higher latency will damage all of them,
and pulling the wakee seems a bad deal.

Thus, as 'waker->nr_wakee_switch / wakee->nr_wakee_switch' grows, the
deal gets worse and worse.

The patch therefore makes the wake-affine logic stop its work when:

wakee->nr_wakee_switch > factor &&
waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)

The factor here is the node size of the current cpu, so a bigger node
leads to more pull since the test becomes more severe.
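As an illustration only (a stand-alone user-space sketch with made-up
counter values, not the kernel code itself), the criterion behaves like
this:

#include <stdio.h>

/* Stand-alone model of the criterion; 'factor' stands for the node
 * size and the counters for the per-task nr_wakee_switch values. */
static int wake_wide_check(unsigned long waker_switch,
			   unsigned long wakee_switch,
			   unsigned long factor)
{
	return wakee_switch > factor &&
	       waker_switch > factor * wakee_switch;
}

int main(void)
{
	unsigned long factor = 12;	/* e.g. a 12-cpu node */

	/* wakee 15 > 12 and waker 200 > 12 * 15 == 180: stop pulling */
	printf("%d\n", wake_wide_check(200, 15, factor));	/* 1 */
	/* waker 100 < 180: the ratio is low enough, keep pulling */
	printf("%d\n", wake_wide_check(100, 15, factor));	/* 0 */
	return 0;
}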

After applying the patch, pgbench shows up to 40% improvement.

Test:
Tested on a 12-cpu x86 server with tip 3.10.0-rc7.

pgbench base smart

| db_size | clients | tps | | tps |
+---------+---------+-------+ +-------+
| 22 MB | 1 | 10598 | | 10796 |
| 22 MB | 2 | 21257 | | 21336 |
| 22 MB | 4 | 41386 | | 41622 |
| 22 MB | 8 | 51253 | | 57932 |
| 22 MB | 12 | 48570 | | 54000 |
| 22 MB | 16 | 46748 | | 55982 | +19.75%
| 22 MB | 24 | 44346 | | 55847 | +25.93%
| 22 MB | 32 | 43460 | | 54614 | +25.66%
| 7484 MB | 1 | 8951 | | 9193 |
| 7484 MB | 2 | 19233 | | 19240 |
| 7484 MB | 4 | 37239 | | 37302 |
| 7484 MB | 8 | 46087 | | 50018 |
| 7484 MB | 12 | 42054 | | 48763 |
| 7484 MB | 16 | 40765 | | 51633 | +26.66%
| 7484 MB | 24 | 37651 | | 52377 | +39.11%
| 7484 MB | 32 | 37056 | | 51108 | +37.92%
| 15 GB | 1 | 8845 | | 9104 |
| 15 GB | 2 | 19094 | | 19162 |
| 15 GB | 4 | 36979 | | 36983 |
| 15 GB | 8 | 46087 | | 49977 |
| 15 GB | 12 | 41901 | | 48591 |
| 15 GB | 16 | 40147 | | 50651 | +26.16%
| 15 GB | 24 | 37250 | | 52365 | +40.58%
| 15 GB | 32 | 36470 | | 50015 | +37.14%

CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Mike Galbraith <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/sched/fair.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..1c996c7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1041,6 +1041,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
+ struct task_struct *last_wakee;
+ unsigned long nr_wakee_switch;
+ unsigned long last_switch_decay;
#endif
int on_rq;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c61a614..a4ddbf5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2971,6 +2971,23 @@ static unsigned long cpu_avg_load_per_task(int cpu)
return 0;
}

+static void record_wakee(struct task_struct *p)
+{
+ /*
+ * Rough decay (wiping) for cost saving; don't worry
+ * about the boundary, a really active task won't care
+ * about the loss.
+ */
+ if (jiffies > current->last_switch_decay + HZ) {
+ current->nr_wakee_switch = 0;
+ current->last_switch_decay = jiffies;
+ }
+
+ if (current->last_wakee != p) {
+ current->last_wakee = p;
+ current->nr_wakee_switch++;
+ }
+}

static void task_waking_fair(struct task_struct *p)
{
@@ -2991,6 +3008,7 @@ static void task_waking_fair(struct task_struct *p)
#endif

se->vruntime -= min_vruntime;
+ record_wakee(p);
}

#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -3109,6 +3127,28 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,

#endif

+static int wake_wide(struct task_struct *p)
+{
+ int factor = nr_cpus_node(cpu_to_node(smp_processor_id()));
+
+ /*
+ * This is the switching frequency: it could mean many wakees or
+ * rapid switching. Using the factor here automatically adjusts
+ * the loose-degree, so a bigger node will lead to more pull.
+ */
+ if (p->nr_wakee_switch > factor) {
+ /*
+ * The wakee is somewhat hot and needs a certain amount of
+ * cpu resource, so if the waker is far hotter, prefer to
+ * leave it alone.
+ */
+ if (current->nr_wakee_switch > (factor * p->nr_wakee_switch))
+ return 1;
+ }
+
+ return 0;
+}
+
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
{
s64 this_load, load;
@@ -3118,6 +3158,13 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
unsigned long weight;
int balanced;

+ /*
+ * If we wake multiple tasks be careful to not bounce
+ * ourselves around too much.
+ */
+ if (wake_wide(p))
+ return 0;
+
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
--
1.7.4.1

2013-07-03 08:52:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2] sched: smart wake-affine

On Wed, Jul 03, 2013 at 02:10:32PM +0800, Michael Wang wrote:
> +static int wake_wide(struct task_struct *p)
> +{
> + int factor = nr_cpus_node(cpu_to_node(smp_processor_id()));

That's still a cpumask_weight() in there... we should avoid that. How
about something like the below:

> +
> + /*
> + * This is the switching frequency: it could mean many wakees or
> + * rapid switching. Using the factor here automatically adjusts
> + * the loose-degree, so a bigger node will lead to more pull.
> + */
> + if (p->nr_wakee_switch > factor) {
> + /*
> + * The wakee is somewhat hot and needs a certain amount of
> + * cpu resource, so if the waker is far hotter, prefer to
> + * leave it alone.
> + */
> + if (current->nr_wakee_switch > (factor * p->nr_wakee_switch))
> + return 1;
> + }
> +
> + return 0;
> +}

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9b1f2e5..166ab9b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5081,18 +5083,23 @@ static void destroy_sched_domains(struct sched_domain *sd, int cpu)
* two cpus are in the same cache domain, see cpus_share_cache().
*/
DEFINE_PER_CPU(struct sched_domain *, sd_llc);
+DEFINE_PER_CPU(int, sd_llc_size);
DEFINE_PER_CPU(int, sd_llc_id);

static void update_top_cache_domain(int cpu)
{
struct sched_domain *sd;
int id = cpu;
+ int size = 1;

sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
- if (sd)
+ if (sd) {
id = cpumask_first(sched_domain_span(sd));
+ size = cpumask_weight(sched_domain_span(sd));
+ }

rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
+ per_cpu(sd_llc_size, cpu) = size;
per_cpu(sd_llc_id, cpu) = id;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ef0a7b2..c992f58 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -595,6 +595,7 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
}

DECLARE_PER_CPU(struct sched_domain *, sd_llc);
+DECLARE_PER_CPU(int, sd_llc_size);
DECLARE_PER_CPU(int, sd_llc_id);

struct sched_group_power {
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3175,7 +3175,7 @@ static inline unsigned long effective_lo

static int wake_wide(struct task_struct *p)
{
- int factor = nr_cpus_node(cpu_to_node(smp_processor_id()));
+ int factor = this_cpu_read(sd_llc_size);

/*
* This is the switching frequency: it could mean many wakees or
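The point, as a toy user-space sketch (the names mirror the patch, but
this is illustration only, not kernel code): pay the cpumask walk once
at domain-rebuild time, and let every wakeup read a cached integer:

#include <stdio.h>

#define NR_CPUS 4

static int sd_llc_size[NR_CPUS];	/* stands in for the per-cpu var */

/* pretend this walks a big cpumask -- the expensive part */
static int llc_span_weight(int cpu)
{
	return 4;
}

/* slow path: runs only when the topology changes */
static void update_top_cache_domain(int cpu)
{
	sd_llc_size[cpu] = llc_span_weight(cpu);
}

/* hot path: one memory read per wakeup instead of a bitmap walk */
static int wake_wide_factor(int cpu)
{
	return sd_llc_size[cpu];
}

int main(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		update_top_cache_domain(cpu);
	printf("factor on cpu 0: %d\n", wake_wide_factor(0));
	return 0;
}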

2013-07-03 09:11:57

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH v2] sched: smart wake-affine

On 07/03/2013 04:50 PM, Peter Zijlstra wrote:
> On Wed, Jul 03, 2013 at 02:10:32PM +0800, Michael Wang wrote:
>> +static int wake_wide(struct task_struct *p)
>> +{
>> + int factor = nr_cpus_node(cpu_to_node(smp_processor_id()));
>
> That's still a cpumask_weight() in there...

Exactly, the scale is lower, but it's still a mask walk...

> we should avoid that. How about something like the below:

Amazing, the cost will be lowest with such a counter; I'll make these
into a patch set and test again ;-)

Regards,
Michael Wang

>
>> +
>> + /*
>> + * This is the switching frequency: it could mean many wakees or
>> + * rapid switching. Using the factor here automatically adjusts
>> + * the loose-degree, so a bigger node will lead to more pull.
>> + */
>> + if (p->nr_wakee_switch > factor) {
>> + /*
>> + * The wakee is somewhat hot and needs a certain amount of
>> + * cpu resource, so if the waker is far hotter, prefer to
>> + * leave it alone.
>> + */
>> + if (current->nr_wakee_switch > (factor * p->nr_wakee_switch))
>> + return 1;
>> + }
>> +
>> + return 0;
>> +}
>
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9b1f2e5..166ab9b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5081,18 +5083,23 @@ static void destroy_sched_domains(struct sched_domain *sd, int cpu)
> * two cpus are in the same cache domain, see cpus_share_cache().
> */
> DEFINE_PER_CPU(struct sched_domain *, sd_llc);
> +DEFINE_PER_CPU(int, sd_llc_size);
> DEFINE_PER_CPU(int, sd_llc_id);
>
> static void update_top_cache_domain(int cpu)
> {
> struct sched_domain *sd;
> int id = cpu;
> + int size = 1;
>
> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
> - if (sd)
> + if (sd) {
> id = cpumask_first(sched_domain_span(sd));
> + size = cpumask_weight(sched_domain_span(sd));
> + }
>
> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> + per_cpu(sd_llc_size, cpu) = size;
> per_cpu(sd_llc_id, cpu) = id;
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ef0a7b2..c992f58 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -595,6 +595,7 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
> }
>
> DECLARE_PER_CPU(struct sched_domain *, sd_llc);
> +DECLARE_PER_CPU(int, sd_llc_size);
> DECLARE_PER_CPU(int, sd_llc_id);
>
> struct sched_group_power {
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3175,7 +3175,7 @@ static inline unsigned long effective_lo
>
> static int wake_wide(struct task_struct *p)
> {
> - int factor = nr_cpus_node(cpu_to_node(smp_processor_id()));
> + int factor = this_cpu_read(sd_llc_size);
>
> /*
> * This is the switching frequency: it could mean many wakees or
>

2013-07-04 09:14:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On Tue, Jul 02, 2013 at 05:35:33PM +0800, Michael Wang wrote:
> >> + if (jiffies > current->last_switch_decay + HZ) {
> >> + current->nr_wakee_switch = 0;
> >> + current->last_switch_decay = jiffies;
> >> + }
> >
> > This isn't so much a decay as it is wiping state. Did you try an actual
> > decay -- something like: current->nr_wakee_switch >>= 1; ?
> >
> > I suppose you wanted to avoid something like:
> >
> > now = jiffies;
> > while (now > current->last_switch_decay + HZ) {
> > current->nr_wakee_switch >>= 1;
> > current->last_switch_decay += HZ;
> > }
>
> Right, actually I have thought about the decay problem with some testing,
> including some similar implementations like this, but one issue I could
> not solve is:
>
> a task woken up 10 seconds after dequeue and a task woken up
> 1 second after dequeue will suffer the same decay.
>
> Thus, in order to keep it fair, we have to do some calculation here to
> make the decay correct, but that means cost...

Right, but something like the below is limited in cost to at most 32/64 (I
forgot the type) shifts. Now it's probably not worth doing, but it shows
things like that can be done in 'constant' time.

now = jiffies;
if (now - p->last_switch_decay > 8*sizeof(p->nr_wakee_switch)*HZ) {
p->nr_wakee_switch = 0;
p->last_switch_decay = now;
} else while (now > p->last_switch_decay + HZ) {
p->nr_wakee_switch >>= 1;
p->last_switch_decay += HZ;
}
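For illustration, a small user-space model of that capped decay
(made-up numbers; jiffies and HZ are stand-ins):

#include <stdio.h>

/* Toy model of the capped decay above: halve the counter once per
 * elapsed "second", but never loop longer than the counter's width,
 * since by then it has decayed to zero anyway. */
static void decay(unsigned long *counter, unsigned long *last,
		  unsigned long now, unsigned long hz)
{
	if (now - *last > 8 * sizeof(*counter) * hz) {
		*counter = 0;
		*last = now;
	} else while (now > *last + hz) {
		*counter >>= 1;
		*last += hz;
	}
}

int main(void)
{
	unsigned long counter = 100, last = 0, hz = 1000;

	decay(&counter, &last, 3500, hz);	/* ~3.5 "seconds" idle */
	printf("%lu\n", counter);		/* 100 >> 3 == 12 */
	return 0;
}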

2013-07-04 09:38:47

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On 07/04/2013 05:13 PM, Peter Zijlstra wrote:
[snip]
>
> Right, but something like the below is limited in cost to at most 32/64 (I
> forgot the type) shifts. Now its probably not worth doing, but it shows
> things like that can be done in 'constant' time.
>
> now = jiffies;
> if (now - p->last_switch_decay > 8*sizeof(p->nr_wakee_switch)*HZ) {
> p->nr_wakee_switch = 0;
> p->last_switch_decay = now;
> } else while (now > p->last_switch_decay + HZ) {
> p->nr_wakee_switch >>= 1;
> p->last_switch_decay += HZ;
> }

Hmm... interesting, some kind of cataclysmic decay; I'm not sure how it
works yet, but yes, the cost is capped.

Well, it seems we still have plenty of follow-up research work after
fixing the issue ;-)

Regards,
Michael Wang


2013-07-04 10:33:44

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On Thu, 2013-07-04 at 17:38 +0800, Michael Wang wrote:
> On 07/04/2013 05:13 PM, Peter Zijlstra wrote:
> [snip]
> >
> > Right, but something like the below is limited in cost to at most 32/64 (I
> > forgot the type) shifts. Now its probably not worth doing, but it shows
> > things like that can be done in 'constant' time.
> >
> > now = jiffies;
> > if (now - p->last_switch_decay > 8*sizeof(p->nr_wakee_switch)*HZ) {
> > p->nr_wakee_switch = 0;
> > p->last_switch_decay = now;
> > } else while (now > p->last_switch_decay + HZ) {
> > p->nr_wakee_switch >>= 1;
> > p->last_switch_decay += HZ;
> > }
>
> Hmm...interesting, some kind of cataclysm decay, not sure how it works
> but yes, the cost was capped.
>
> Well, seems like we still have many follow-up research works after fix
> the issue ;-)

Yeah. Like how to exterminate the plus signs; they munch cache lines,
and have a general tendency to negatively impact benchmarks.

Q6600 box, hackbench -l 1000
avg
3.10.0-regress 2.293 2.297 2.313 2.291 2.295 2.297 1.000
3.10.0-regressx 2.560 2.524 2.427 2.599 2.602 2.542 1.106

pahole said...

marge:/usr/local/src/kernel/linux-3.x.git # tail virgin
long unsigned int timer_slack_ns; /* 1512 8 */
long unsigned int default_timer_slack_ns; /* 1520 8 */
atomic_t ptrace_bp_refcnt; /* 1528 4 */

/* size: 1536, cachelines: 24, members: 125 */
/* sum members: 1509, holes: 6, sum holes: 23 */
/* bit holes: 1, sum bit holes: 26 bits */
/* padding: 4 */
/* paddings: 1, sum paddings: 4 */
};

marge:/usr/local/src/kernel/linux-3.x.git # tail michael
long unsigned int default_timer_slack_ns; /* 1552 8 */
atomic_t ptrace_bp_refcnt; /* 1560 4 */

/* size: 1568, cachelines: 25, members: 128 */
/* sum members: 1533, holes: 8, sum holes: 31 */
/* bit holes: 1, sum bit holes: 26 bits */
/* padding: 4 */
/* paddings: 1, sum paddings: 4 */
/* last cacheline: 32 bytes */
};

..but plugging holes didn't help, nor did moving this/that around, nor
letting pahole go wild to get the line back. It's plus signs I tell ya,
the evil things must die ;-)

-Mike

2013-07-05 02:48:07

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On 07/04/2013 06:33 PM, Mike Galbraith wrote:
[snip]
>> Well, seems like we still have many follow-up research works after fix
>> the issue ;-)
>
> Yeah. Like how to how to exterminate the plus sign, they munch cache
> lines, and have a general tendency to negatively impact benchmarks.
>
> Q6600 box, hackbench -l 1000
> avg
> 3.10.0-regress 2.293 2.297 2.313 2.291 2.295 2.297 1.000
> 3.10.0-regressx 2.560 2.524 2.427 2.599 2.602 2.542 1.106

Wow, I used to think such an issue would be very hard to track with
benchmarks; is this regression stable?

My tests could not get a stable difference: one run loses a little, the
next wins a little, it's always floating, perhaps caused by different
chip cache behaviour I suppose...

>
> pahole said...
>
> marge:/usr/local/src/kernel/linux-3.x.git # tail virgin
> long unsigned int timer_slack_ns; /* 1512 8 */
> long unsigned int default_timer_slack_ns; /* 1520 8 */
> atomic_t ptrace_bp_refcnt; /* 1528 4 */
>
> /* size: 1536, cachelines: 24, members: 125 */
> /* sum members: 1509, holes: 6, sum holes: 23 */
> /* bit holes: 1, sum bit holes: 26 bits */
> /* padding: 4 */
> /* paddings: 1, sum paddings: 4 */
> };
>
> marge:/usr/local/src/kernel/linux-3.x.git # tail michael
> long unsigned int default_timer_slack_ns; /* 1552 8 */
> atomic_t ptrace_bp_refcnt; /* 1560 4 */
>
> /* size: 1568, cachelines: 25, members: 128 */
> /* sum members: 1533, holes: 8, sum holes: 31 */
> /* bit holes: 1, sum bit holes: 26 bits */
> /* padding: 4 */
> /* paddings: 1, sum paddings: 4 */
> /* last cacheline: 32 bytes */
> };
>
> ..but plugging holes, didn't help, moving this/that around neither, nor
> did letting pahole go wild to get the line back. It's plus signs I tell
> ya, the evil things must die ;-)

Hmm... so the new members kicked some tail members onto a new cache
line... or it may be totally different once the compiler takes part...

It's really hard to estimate the influence, especially while task_struct
keeps changing...

But task_struct is really a little big now; maybe we could put the
'cold' members into a new structure and just keep a pointer to it,
roughly like the sketch below. That might increase the chance of cache
hits on the hot members, but it's platform-related and not so easy to
detect...
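Purely as a sketch of the idea (illustrative only, not a proposal for
the real task_struct):

#include <stdlib.h>

struct task_cold {			/* rarely touched members */
	unsigned long timer_slack_ns;
	unsigned long default_timer_slack_ns;
};

struct task_hot {			/* touched on every wakeup */
	int on_cpu;
	unsigned long nr_wakee_switch;
	unsigned long last_switch_decay;
	struct task_cold *cold;		/* one pointer, one extra deref */
};

int main(void)
{
	struct task_hot t = { .cold = calloc(1, sizeof(struct task_cold)) };

	t.nr_wakee_switch++;		/* hot access: stays in few lines */
	t.cold->timer_slack_ns = 50000;	/* cold access: one dereference */
	free(t.cold);
	return 0;
}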

Regards,
Michael Wang

>
> -Mike
>

2013-07-05 04:08:57

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On Fri, 2013-07-05 at 10:47 +0800, Michael Wang wrote:
> On 07/04/2013 06:33 PM, Mike Galbraith wrote:
> [snip]
> >> Well, seems like we still have many follow-up research works after fix
> >> the issue ;-)
> >
> > Yeah. Like how to how to exterminate the plus sign, they munch cache
> > lines, and have a general tendency to negatively impact benchmarks.
> >
> > Q6600 box, hackbench -l 1000
> > avg
> > 3.10.0-regress 2.293 2.297 2.313 2.291 2.295 2.297 1.000
> > 3.10.0-regressx 2.560 2.524 2.427 2.599 2.602 2.542 1.106
>
> Wow, I used to think such issue is very hard to be tracked by
> benchmarks, is this regression stable?

Yeah, seems to be. I was curious as to why you saw an improvement to
hackbench, didn't seem there should be any, so thought I'd try it on my
little box on the way to a long weekend. The unexpected happened.

> > pahole said...
> >
> > marge:/usr/local/src/kernel/linux-3.x.git # tail virgin
> > long unsigned int timer_slack_ns; /* 1512 8 */
> > long unsigned int default_timer_slack_ns; /* 1520 8 */
> > atomic_t ptrace_bp_refcnt; /* 1528 4 */
> >
> > /* size: 1536, cachelines: 24, members: 125 */
> > /* sum members: 1509, holes: 6, sum holes: 23 */
> > /* bit holes: 1, sum bit holes: 26 bits */
> > /* padding: 4 */
> > /* paddings: 1, sum paddings: 4 */
> > };
> >
> > marge:/usr/local/src/kernel/linux-3.x.git # tail michael
> > long unsigned int default_timer_slack_ns; /* 1552 8 */
> > atomic_t ptrace_bp_refcnt; /* 1560 4 */
> >
> > /* size: 1568, cachelines: 25, members: 128 */
> > /* sum members: 1533, holes: 8, sum holes: 31 */
> > /* bit holes: 1, sum bit holes: 26 bits */
> > /* padding: 4 */
> > /* paddings: 1, sum paddings: 4 */
> > /* last cacheline: 32 bytes */
> > };
> >
> > ..but plugging holes, didn't help, moving this/that around neither, nor
> > did letting pahole go wild to get the line back. It's plus signs I tell
> > ya, the evil things must die ;-)
>
> Hmm...so the new members kicked some tail members to a new line...or may
> be totally different when compiler take part in...
>
> It's really hard to estimate the influence, especially when the
> task_struct is still keep changing...

Yeah, could be memory layout crud that disappears with the next
pull/build. Wouldn't be the first time.

-Mike

2013-07-05 04:33:21

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On 07/05/2013 12:08 PM, Mike Galbraith wrote:
[snip]
>>
>> Wow, I used to think such issue is very hard to be tracked by
>> benchmarks, is this regression stable?
>
> Yeah, seems to be. I was curious as to why you saw an improvement to
> hackbench, didn't seem there should be any, so though I'd try it on my
> little box on the way to a long weekend. The unexpected happened.

Oh, I think I failed to explain things clearly in the comments...

It's not the patch that brings the 15% benefit to hackbench, but the
wake-affine stuff itself.

In a previous test, I removed the whole stuff and found that hackbench
dropped 15%, which means that with wake-affine enabled we gain a 15%
benefit (and that's actually the reason why we don't kill the stuff).

And this idea tries not to harm that 15% benefit while regaining the
performance pgbench lost; thus, applying this patch to mainline won't
improve hackbench performance, but will improve pgbench performance.

But this regression is really unexpected... I can hardly believe it's
just caused by a cache issue now, since the number is not small (10% at
most?).

Have you tried more loops and groups? Will that show even bigger
regressions?

BTW, are these the results of 10 groups and 40 sockets == 400 tasks?

Regards,
Michael Wang

>
>>> pahole said...
>>>
>>> marge:/usr/local/src/kernel/linux-3.x.git # tail virgin
>>> long unsigned int timer_slack_ns; /* 1512 8 */
>>> long unsigned int default_timer_slack_ns; /* 1520 8 */
>>> atomic_t ptrace_bp_refcnt; /* 1528 4 */
>>>
>>> /* size: 1536, cachelines: 24, members: 125 */
>>> /* sum members: 1509, holes: 6, sum holes: 23 */
>>> /* bit holes: 1, sum bit holes: 26 bits */
>>> /* padding: 4 */
>>> /* paddings: 1, sum paddings: 4 */
>>> };
>>>
>>> marge:/usr/local/src/kernel/linux-3.x.git # tail michael
>>> long unsigned int default_timer_slack_ns; /* 1552 8 */
>>> atomic_t ptrace_bp_refcnt; /* 1560 4 */
>>>
>>> /* size: 1568, cachelines: 25, members: 128 */
>>> /* sum members: 1533, holes: 8, sum holes: 31 */
>>> /* bit holes: 1, sum bit holes: 26 bits */
>>> /* padding: 4 */
>>> /* paddings: 1, sum paddings: 4 */
>>> /* last cacheline: 32 bytes */
>>> };
>>>
>>> ..but plugging holes, didn't help, moving this/that around neither, nor
>>> did letting pahole go wild to get the line back. It's plus signs I tell
>>> ya, the evil things must die ;-)
>>
>> Hmm...so the new members kicked some tail members to a new line...or may
>> be totally different when compiler take part in...
>>
>> It's really hard to estimate the influence, especially when the
>> task_struct is still keep changing...
>
> Yeah, could be memory layout crud that disappears with the next
> pull/build. Wouldn't be the first time.
>
> -Mike
>

2013-07-05 05:41:23

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On Fri, 2013-07-05 at 12:33 +0800, Michael Wang wrote:
> On 07/05/2013 12:08 PM, Mike Galbraith wrote:
> [snip]
> >>
> >> Wow, I used to think such issue is very hard to be tracked by
> >> benchmarks, is this regression stable?
> >
> > Yeah, seems to be. I was curious as to why you saw an improvement to
> > hackbench, didn't seem there should be any, so though I'd try it on my
> > little box on the way to a long weekend. The unexpected happened.
>
> Oh, I think I failed to explain things clearly in comments...
>
> It's not the patch who bring 15% benefit to hackbench, but the
> wake-affine stuff itself.
>
> In the prev-test, I removed the whole stuff and find that hackbench
> dropped 15%, which means with wake-affine enabled, we will gain 15%
> benefit (and that's actually the reason why we don't kill the stuff).

Ah.

> And this idea is try to not harm that 15% benefit, and meanwhile regain
> the pgbench lost performance, thus, apply this patch to mainline won't
> improve hackbench performance, but improve pgbench performance.
>
> But this regression is really unexpected... I could hardly believe it's
> just caused by cache issue now, since the number is not small (10% at
> most?).
>
> Have you tried to use more loops and groups? will that show even bigger
> regressions?

Nope, less on either side.

hackbench -g 100 -l 1000
avg
3.10.0-regress 21.895 21.564 21.777 21.958 22.093 21.857 1.000
3.10.0-regressx 22.844 23.268 23.056 23.231 22.375 22.954 1.050

hackbench -g 1 -l 100000
avg
3.10.0-regress 29.913 29.711 30.395 30.213 30.236 30.093 1.000
3.10.0-regressx 30.392 31.003 30.728 31.008 30.389 30.704 1.020

> BTW, is this the results of 10 group and 40 sockets == 400 tasks?

Yeah, stock.

Off to do some body/mind tuning. Bavarian mushrooms don't hide as well
as memory access thingies.. and I can still out run 'em.

-Mike


2013-07-05 06:16:37

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On 07/05/2013 01:41 PM, Mike Galbraith wrote:
[snip]
>>
>> Have you tried to use more loops and groups? will that show even bigger
>> regressions?
>
> Nope, less on either side.
>
> hackbench -g 100 -l 1000
> avg
> 3.10.0-regress 21.895 21.564 21.777 21.958 22.093 21.857 1.000
> 3.10.0-regressx 22.844 23.268 23.056 23.231 22.375 22.954 1.050
>
> hackbench -g 1 -l 100000
> avg
> 3.10.0-regress 29.913 29.711 30.395 30.213 30.236 30.093 1.000
> 3.10.0-regressx 30.392 31.003 30.728 31.008 30.389 30.704 1.020

Hmm... I'm not expecting to preserve all of the 15%, but this still
seems a little too much...

PeterZ suggested some optimizations, which I sent out yesterday; I
suppose they haven't been included in this test yet, correct?

Since I currently cannot reproduce the issue on my box with that patch,
I suppose it may have solved the issue ;-)

Regards,
Michael Wang

>
>> BTW, is this the results of 10 group and 40 sockets == 400 tasks?
>
> Yeah, stock.
>
> Off to do some body/mind tuning. Bavarian mushrooms don't hide as well
> as memory access thingies.. and I can still out run 'em.
>
> -Mike

2013-07-07 06:43:48

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On Fri, 2013-07-05 at 14:16 +0800, Michael Wang wrote:

> PeterZ has suggested some optimization which I sent out yesterday, I
> suppose they haven't been included into this test yet, correct?

No, that was with both v3 patches applied. hackbench -l 1000 delta
tracked to next pull/build fwiw, but I don't see anything noteworthy
elsewhere, so I'm filing it under "tasks/core ~= zillion ~= hohum".

-Mike

2013-07-08 02:50:10

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On 07/07/2013 02:43 PM, Mike Galbraith wrote:
> On Fri, 2013-07-05 at 14:16 +0800, Michael Wang wrote:
>
>> PeterZ has suggested some optimization which I sent out yesterday, I
>> suppose they haven't been included into this test yet, correct?
>
> No, that was with both v3 patches applied. hackbench -l 1000 delta
> tracked to next pull/build fwiw, but I don't see anything noteworthy
> elsewhere, so I'm filing it under "tasks/core ~= zillion ~= hohum".

I see, thanks for the testing, Mike :)

I still can't reproduce the issue on my box, so I suggest we put this
down temporarily; if more similar reports show up later, let's try to
catch the point.

BTW, could you please show me the '/proc/cpuinfo' of your box? I'd like
to collect some data for analysis later ;-)

Regards,
Michael Wang

>
> -Mike
>

2013-07-08 03:13:09

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On Mon, 2013-07-08 at 10:49 +0800, Michael Wang wrote:

> BTW, could you please show me the '/proc/cpuinfo' of your box? I'd like
> to collect some data for analyse later ;-)

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
stepping : 11
microcode : 0xba
cpu MHz : 1600.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm tpr_shadow vnmi flexpriority
bogomips : 4784.89
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

-Mike

2013-07-08 08:21:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On Sun, Jul 07, 2013 at 08:43:25AM +0200, Mike Galbraith wrote:
> On Fri, 2013-07-05 at 14:16 +0800, Michael Wang wrote:
>
> > PeterZ has suggested some optimization which I sent out yesterday, I
> > suppose they haven't been included into this test yet, correct?
>
> No, that was with both v3 patches applied. hackbench -l 1000 delta
> tracked to next pull/build fwiw, but I don't see anything noteworthy
> elsewhere, so I'm filing it under "tasks/core ~= zillion ~= hohum".

OK, I'll apply the patches and we'll see what happens. If there's
significant fallout we'll immediately have more information anyway ;-)

2013-07-08 08:50:12

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On Mon, 2013-07-08 at 10:21 +0200, Peter Zijlstra wrote:
> On Sun, Jul 07, 2013 at 08:43:25AM +0200, Mike Galbraith wrote:
> > On Fri, 2013-07-05 at 14:16 +0800, Michael Wang wrote:
> >
> > > PeterZ has suggested some optimization which I sent out yesterday, I
> > > suppose they haven't been included into this test yet, correct?
> >
> > No, that was with both v3 patches applied. hackbench -l 1000 delta
> > tracked to next pull/build fwiw, but I don't see anything noteworthy
> > elsewhere, so I'm filing it under "tasks/core ~= zillion ~= hohum".
>
> OK, I'll apply the patches, we'll see what happens. If there significant
> fallout we'll immediately have more information anyway ;-)

Yup. It's a repeatable oddity on my box, but it doesn't even track to
itself when using different hackbench parameters. All I can really
conclude from test results is "golly, changing stuff affects stuff" :)

-Mike

2013-07-08 08:58:38

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On 07/08/2013 04:21 PM, Peter Zijlstra wrote:
> On Sun, Jul 07, 2013 at 08:43:25AM +0200, Mike Galbraith wrote:
>> On Fri, 2013-07-05 at 14:16 +0800, Michael Wang wrote:
>>
>>> PeterZ has suggested some optimization which I sent out yesterday, I
>>> suppose they haven't been included into this test yet, correct?
>>
>> No, that was with both v3 patches applied. hackbench -l 1000 delta
>> tracked to next pull/build fwiw, but I don't see anything noteworthy
>> elsewhere, so I'm filing it under "tasks/core ~= zillion ~= hohum".
>
> OK, I'll apply the patches, we'll see what happens. If there significant
> fallout we'll immediately have more information anyway ;-)

Agreed, let's begin the alpha test and monitor it ;-)

And thanks for all the advice and help from you folks, I really learned
a lot :)

Regards,
Michael Wang


2013-07-08 09:08:55

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On 07/08/2013 04:49 PM, Mike Galbraith wrote:
> On Mon, 2013-07-08 at 10:21 +0200, Peter Zijlstra wrote:
>> On Sun, Jul 07, 2013 at 08:43:25AM +0200, Mike Galbraith wrote:
>>> On Fri, 2013-07-05 at 14:16 +0800, Michael Wang wrote:
>>>
>>>> PeterZ has suggested some optimization which I sent out yesterday, I
>>>> suppose they haven't been included into this test yet, correct?
>>>
>>> No, that was with both v3 patches applied. hackbench -l 1000 delta
>>> tracked to next pull/build fwiw, but I don't see anything noteworthy
>>> elsewhere, so I'm filing it under "tasks/core ~= zillion ~= hohum".
>>
>> OK, I'll apply the patches, we'll see what happens. If there significant
>> fallout we'll immediately have more information anyway ;-)
>
> Yup. It's a repeatable oddity on my box, but it doesn't even track to
> itself when using different hackbench parameters. All I can really
> conclude from test results is "golly, changing stuff affects stuff" :)

If there are problems, they will show up sooner or later, and we'll
conquer them then ;-)

And thanks again for all your help :)

Regards,
Michael Wang

>
> -Mike
>

2013-07-08 19:00:02

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On Mon, 2013-07-08 at 10:21 +0200, Peter Zijlstra wrote:
> On Sun, Jul 07, 2013 at 08:43:25AM +0200, Mike Galbraith wrote:
> > On Fri, 2013-07-05 at 14:16 +0800, Michael Wang wrote:
> >
> > > PeterZ has suggested some optimization which I sent out yesterday, I
> > > suppose they haven't been included into this test yet, correct?
> >
> > No, that was with both v3 patches applied. hackbench -l 1000 delta
> > tracked to next pull/build fwiw, but I don't see anything noteworthy
> > elsewhere, so I'm filing it under "tasks/core ~= zillion ~= hohum".
>
> OK, I'll apply the patches, we'll see what happens. If there significant
> fallout we'll immediately have more information anyway ;-)

So I gave the v2 a spin on my aim7 benchmark on an 80-core, 8-socket
DL980. Not much changed; most numbers are in the noise range. However,
with HT off, the high_systime workload suffered in throughput with this
patch at higher concurrency (after 600 users). Image attached.

Thanks,
Davidlohr


Attachments:
high_systime-htoff.jpg (56.88 kB)

2013-07-09 02:30:56

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

Hi, Davidlohr

Thanks for the testing :)

On 07/09/2013 02:59 AM, Davidlohr Bueso wrote:
[snip]
>>
>> OK, I'll apply the patches, we'll see what happens. If there significant
>> fallout we'll immediately have more information anyway ;-)
>
> So I gave the v2 a spin on my aim7 benchmark on an 80-core 8 socket
> DL980. Not much changed, most numbers are in the noise range, however,
> with HT off, the high_systime workload suffered in throughput with this
> patch with higher concurrency (after 600 users). Image attached.

To make sure I'm not on the wrong track... HT here means hyperthreading,
correct?

I have some questions:
1. How do you disable hyperthreading, manually or some other way?
2. Was the 3.10-rc5 run in the image also done with hyperthreading disabled?
3. Does the v3 patch set show the same issue?

Regards,
Michael Wang

>
> Thanks,
> Davidlohr
>

2013-07-09 02:36:50

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On Tue, 2013-07-09 at 10:30 +0800, Michael Wang wrote:
> Hi, Davidlohr
>
> Thanks for the testing :)
>
> On 07/09/2013 02:59 AM, Davidlohr Bueso wrote:
> [snip]
> >>
> >> OK, I'll apply the patches, we'll see what happens. If there significant
> >> fallout we'll immediately have more information anyway ;-)
> >
> > So I gave the v2 a spin on my aim7 benchmark on an 80-core 8 socket
> > DL980. Not much changed, most numbers are in the noise range, however,
> > with HT off, the high_systime workload suffered in throughput with this
> > patch with higher concurrency (after 600 users). Image attached.
>
> To make sure I'm not on the wrong way... HT here means hyperthreading,
> correct?

Yep :)

>
> I have some questions like:
> 1. how do you disable the hyperthreading? by manual or some other way?

Manually, from the BIOS.

> 2. is the 3.10-rc5 in image also disabled the hyperthreading?

Yes, I happened to have data already collected for 3.10-rc5. While the
runs with this patch were with -rc7, unless there was some
performance-related commit I missed, I don't think the performance
difference was because of that.

> 3. is the v3 patch set show the same issue?

Uhmmm shoot, I didn't realize there was a v3, sorry about that.

/me takes another look at the thread.

Thanks,
Davidlohr

2013-07-09 02:52:49

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On 07/09/2013 10:36 AM, Davidlohr Bueso wrote:
[snip]
>> 2. is the 3.10-rc5 in image also disabled the hyperthreading?
>
> Yes, I happened to have data already collected for 3.10-rc5. While the
> runs with this patch was with -rc7, unless there was some performance
> related commit I missed, I don't think the performance difference was
> because of that.
>
>> 3. is the v3 patch set show the same issue?
>
> Uhmmm shoot, I didn't realize there was a v3, sorry about that.
>
> /me takes another look at the thread.

V3 reduces the overhead and should make things better, especially when
the workload is high and the platform is big (your box is really what I
desire ;-), honestly).

And if possible, a comparison against the same baseline would be
better :)

Regards,
Michael Wang

>
> Thanks,
> Davidlohr
>

2013-07-15 05:14:18

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

Hi, Davidlohr

On 07/09/2013 10:52 AM, Michael Wang wrote:
> On 07/09/2013 10:36 AM, Davidlohr Bueso wrote:
> [snip]
>>> 2. is the 3.10-rc5 in image also disabled the hyperthreading?
>>
>> Yes, I happened to have data already collected for 3.10-rc5. While the
>> runs with this patch was with -rc7, unless there was some performance
>> related commit I missed, I don't think the performance difference was
>> because of that.
>>
>>> 3. is the v3 patch set show the same issue?
>>
>> Uhmmm shoot, I didn't realize there was a v3, sorry about that.
>>
>> /me takes another look at the thread.
>
> V3 will reduce the overhead, should make things better, especially when
> workload is high and platform is big (your box is really what I desired
> ;-), honestly).
>
> And if it is possible, comparison based on the same basement will be
> better :)

I have done some tests with the reaim high_systime workfile, and I could
not find a regression on my box (the tool itself has some issues, but I
got the results); v3 works well.

I suppose the issue has been addressed, but please let us know if v3
also shows a regression on your box, and we'll try to solve the problem ;-)

Regards,
Michael Wang

>
> Regards,
> Michael Wang
>
>>
>> Thanks,
>> Davidlohr
>>
>

2013-07-15 05:58:03

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On Mon, 2013-07-15 at 13:13 +0800, Michael Wang wrote:
> Hi, Davidlohr
>
> On 07/09/2013 10:52 AM, Michael Wang wrote:
> > On 07/09/2013 10:36 AM, Davidlohr Bueso wrote:
> > [snip]
> >>> 2. is the 3.10-rc5 in image also disabled the hyperthreading?
> >>
> >> Yes, I happened to have data already collected for 3.10-rc5. While the
> >> runs with this patch was with -rc7, unless there was some performance
> >> related commit I missed, I don't think the performance difference was
> >> because of that.
> >>
> >>> 3. is the v3 patch set show the same issue?
> >>
> >> Uhmmm shoot, I didn't realize there was a v3, sorry about that.
> >>
> >> /me takes another look at the thread.
> >
> > V3 will reduce the overhead, should make things better, especially when
> > workload is high and platform is big (your box is really what I desired
> > ;-), honestly).
> >
> > And if it is possible, comparison based on the same basement will be
> > better :)
>
> I have done some tests with reaim high_systime workfile, and I could not
> find regression on my box (the tool itself has some issue, but I got the
> results), v3 works well.

Sorry for the delay. I'm glad you reminded me since I already had the
data.
>
> I suppose the issue have been addressed, but please let us know if v3
> also show regression on your box, we could try to solve the problem ;-)

It has, high_systime no longer shows a hit and all workloads remain
basically the same.

Tested-by: Davidlohr Bueso <[email protected]>

Thanks.

>
> Regards,
> Michael Wang
>
> >
> > Regards,
> > Michael Wang
> >
> >>
> >> Thanks,
> >> Davidlohr
> >>

2013-07-15 06:01:42

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

On 07/15/2013 01:57 PM, Davidlohr Bueso wrote:
[snip]
>>>
>>> And if it is possible, comparison based on the same basement will be
>>> better :)
>>
>> I have done some tests with reaim high_systime workfile, and I could not
>> find regression on my box (the tool itself has some issue, but I got the
>> results), v3 works well.
>
> Sorry for the delay. I'm glad you reminded me since I already had the
> data.
>>
>> I suppose the issue have been addressed, but please let us know if v3
>> also show regression on your box, we could try to solve the problem ;-)
>
> It has, high_systime no longer shows a hit and all workloads remain
> basically the same.
>
> Tested-by: Davidlohr Bueso <[email protected]>

Thanks for the testing, I think we could go ahead now ;-)

Regards,
Michael Wang

>
> Thanks.
>
>>
>> Regards,
>> Michael Wang
>>
>>>
>>> Regards,
>>> Michael Wang
>>>
>>>>
>>>> Thanks,
>>>> Davidlohr
>>>>

2013-07-18 02:15:33

by Michael wang

[permalink] [raw]
Subject: Re: [PATCH] sched: smart wake-affine

Hi, Peter

Davidlohr has tested the v3 patch set and it works well as reported (and
in my tests too); I think your patch has solved the issue he found in v2 :)

Thus I think v3 is ready for the next step now; I hope it has not been
removed from your apply-queue ;-)

Regards,
Michael Wang

On 07/15/2013 01:57 PM, Davidlohr Bueso wrote:
[snip]
>
> Sorry for the delay. I'm glad you reminded me since I already had the
> data.
>>
>> I suppose the issue have been addressed, but please let us know if v3
>> also show regression on your box, we could try to solve the problem ;-)
>
> It has, high_systime no longer shows a hit and all workloads remain
> basically the same.
>
> Tested-by: Davidlohr Bueso <[email protected]>
>
> Thanks.
>
>>
>> Regards,
>> Michael Wang
>>
>>>
>>> Regards,
>>> Michael Wang
>>>
>>>>
>>>> Thanks,
>>>> Davidlohr
>>>>