2013-09-25 07:53:56

by Peter Zijlstra

Subject: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

Subject: sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
From: Peter Zijlstra <[email protected]>
Date: Wed Sep 25 08:28:39 CEST 2013

When a task is the only running task and does a sync wakeup; avoid
going through select_idle_sibling() as it doesn't know the current CPU
is going to be idle shortly.

Without this two sync wakers will ping-pong between CPUs for no
reason.

Suggested-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/sched/fair.c | 10 ++++++++++
1 file changed, 10 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3461,6 +3461,16 @@ select_task_rq_fair(struct task_struct *
 		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 			prev_cpu = cpu;
 
+		/*
+		 * Don't bother with select_idle_sibling() in the case of a sync wakeup
+		 * where we know the only running task will soon go away. Going
+		 * through select_idle_sibling will only lead to pointless ping-pong.
+		 */
+		if (sync && prev_cpu == cpu && cpu_rq(cpu)->nr_running == 1) {
+			new_cpu = cpu;
+			goto unlock;
+		}
+
 		new_cpu = select_idle_sibling(p, prev_cpu);
 		goto unlock;
 	}


2013-09-25 08:56:27

by Mike Galbraith

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

On Wed, 2013-09-25 at 09:53 +0200, Peter Zijlstra wrote:
> Subject: sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
> From: Peter Zijlstra <[email protected]>
> Date: Wed Sep 25 08:28:39 CEST 2013
>
> When a task is the only running task and does a sync wakeup; avoid
> going through select_idle_sibling() as it doesn't know the current CPU
> is going to be idle shortly.
>
> Without this two sync wakers will ping-pong between CPUs for no
> reason.

That will make pipe-test go fugly -> pretty, and help very fast/light
localhost network, but eat heavier localhost overlap recovery. We need
a working (and cheap) overlap detector scheme, so we can know when there
is enough to be worth going after.

(I sent you some lmbench numbers offline a while back showing the
two-faced little <b-word> in action, doing both good and evil)
> Suggested-by: Rik van Riel <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> kernel/sched/fair.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3461,6 +3461,16 @@ select_task_rq_fair(struct task_struct *
> if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> prev_cpu = cpu;
>
> + /*
> + * Don't bother with select_idle_sibling() in the case of a sync wakeup
> + * where we know the only running task will soon go away. Going
> + * through select_idle_sibling will only lead to pointless ping-pong.
> + */
> + if (sync && prev_cpu == cpu && cpu_rq(cpu)->nr_running == 1) {
> + new_cpu = cpu;
> + goto unlock;
> + }
> +
> new_cpu = select_idle_sibling(p, prev_cpu);
> goto unlock;
> }

2013-09-26 02:50:25

by Michael wang

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

On 09/25/2013 04:56 PM, Mike Galbraith wrote:
> On Wed, 2013-09-25 at 09:53 +0200, Peter Zijlstra wrote:
>> Subject: sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
>> From: Peter Zijlstra <[email protected]>
>> Date: Wed Sep 25 08:28:39 CEST 2013
>>
>> When a task is the only running task and does a sync wakeup; avoid
>> going through select_idle_sibling() as it doesn't know the current CPU
>> is going to be idle shortly.
>>
>> Without this two sync wakers will ping-pong between CPUs for no
>> reason.
>
> That will make pipe-test go fugly -> pretty, and help very fast/light
> localhost network, but eat heavier localhost overlap recovery. We need
> a working (and cheap) overlap detector scheme, so we can know when there
> is enough to be worth going after.
>
> (I sent you some lmbench numbers offline a while back showing the
> two-faced little <b-word> in action, doing both good and evil)

It seems like a trade-off between the overhead and a small chance
to balance the load :)

Like the case when we have:

core0 sg core1 sg
cpu0 cpu1 cpu2 cpu3
waker busy idle idle

If the sync wakeup was on cpu0, we can:

1. choose a cpu in the core1 sg like we usually do
some overhead, but tends to keep the load a little more balanced
core0 sg core1 sg
cpu0 cpu1 cpu2 cpu3
idle busy wakee idle

2. choose cpu0 like the patch proposed
no overhead, but tends to make the load a little more unbalanced
core0 sg core1 sg
cpu0 cpu1 cpu2 cpu3
wakee busy idle idle

Maybe we should add a higher-scope load balance check in wake_affine(),
but that means higher overhead, which is just what the patch wants to
reduce...

What about some discount for the sync case inside select_idle_sibling()?
For example, consider the sync cpu as idle and prefer it over the others?
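
A minimal sketch of that idea (purely illustrative, not a posted patch; the
function name and the fallback helper are made up for this example), treating
the sync waker's cpu as if it were already idle before doing the normal scan:

	/*
	 * Hypothetical sketch: count the sync waker's cpu as idle and prefer
	 * it before the usual idle-sibling scan.  select_idle_sibling_scan()
	 * stands in for the existing domain walk in select_idle_sibling().
	 */
	static int select_idle_sibling_sync(struct task_struct *p, int target, int sync)
	{
		int this_cpu = smp_processor_id();

		if (idle_cpu(target))
			return target;

		/* The sync waker claims it will sleep shortly: treat its cpu as idle. */
		if (sync && target == this_cpu && cpu_rq(this_cpu)->nr_running == 1)
			return target;

		return select_idle_sibling_scan(p, target);
	}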

Regards,
Michael Wang


>> Suggested-by: Rik van Riel <[email protected]>
>> Signed-off-by: Peter Zijlstra <[email protected]>
>> ---
>> kernel/sched/fair.c | 10 ++++++++++
>> 1 file changed, 10 insertions(+)
>>
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3461,6 +3461,16 @@ select_task_rq_fair(struct task_struct *
>> if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>> prev_cpu = cpu;
>>
>> + /*
>> + * Don't bother with select_idle_sibling() in the case of a sync wakeup
>> + * where we know the only running task will soon go away. Going
>> + * through select_idle_sibling will only lead to pointless ping-pong.
>> + */
>> + if (sync && prev_cpu == cpu && cpu_rq(cpu)->nr_running == 1) {
>> + new_cpu = cpu;
>> + goto unlock;
>> + }
>> +
>> new_cpu = select_idle_sibling(p, prev_cpu);
>> goto unlock;
>> }
>
>

2013-09-26 03:41:53

by Mike Galbraith

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

On Thu, 2013-09-26 at 10:50 +0800, Michael wang wrote:
> On 09/25/2013 04:56 PM, Mike Galbraith wrote:
> > On Wed, 2013-09-25 at 09:53 +0200, Peter Zijlstra wrote:
> >> Subject: sched: Avoid select_idle_sibling() for wake_affine(.sync=true)
> >> From: Peter Zijlstra <[email protected]>
> >> Date: Wed Sep 25 08:28:39 CEST 2013
> >>
> >> When a task is the only running task and does a sync wakeup; avoid
> >> going through select_idle_sibling() as it doesn't know the current CPU
> >> is going to be idle shortly.
> >>
> >> Without this two sync wakers will ping-pong between CPUs for no
> >> reason.
> >
> > That will make pipe-test go fugly -> pretty, and help very fast/light
> > localhost network, but eat heavier localhost overlap recovery. We need
> > a working (and cheap) overlap detector scheme, so we can know when there
> > is enough to be worth going after.
> >
> > (I sent you some lmbench numbers offline a while back showing the
> > two-faced little <b-word> in action, doing both good and evil)
>
> It seems like the choice between the overhead and a little possibility
> to balance the load :)
>
> Like the case when we have:
>
> core0 sg core1 sg
> cpu0 cpu1 cpu2 cpu3
> waker busy idle idle
>
> If the sync wakeup was on cpu0, we can:
>
> 1. choose cpu in core1 sg like we did usually
> some overhead but tend to make the load a little balance
> core0 sg core1 sg
> cpu0 cpu1 cpu2 cpu3
> idle busy wakee idle

Reducing latency and increasing throughput when the waker isn't really
really going to immediately schedule off as the hint implies. Nice for
bursty loads and ramp.

The breakeven point is going up though. If you don't have nohz
throttled, you eat tick start/stop overhead, and the menu governor
recently added yet more overhead, so maybe we should say hell with it.

> 2. choose cpu0 like the patch proposed
> no overhead but tend to make the load a little more unbalance
> core0 sg core1 sg
> cpu0 cpu1 cpu2 cpu3
> wakee busy idle idle
>
> May be we should add a higher scope load balance check in wake_affine(),
> but that means higher overhead which is just what the patch want to
> reduce...

Yeah, more overhead is the last thing we need.

> What about some discount for sync case inside select_idle_sibling()?
> For example we consider sync cpu as idle and prefer it more than the others?

That's what the sync hint does. Problem is, it's a hint. If it were
truth, there would be no point in calling select_idle_sibling().

-Mike

2013-09-26 05:12:58

by Michael wang

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

On 09/26/2013 11:41 AM, Mike Galbraith wrote:
[snip]
>> Like the case when we have:
>>
>> core0 sg core1 sg
>> cpu0 cpu1 cpu2 cpu3
>> waker busy idle idle
>>
>> If the sync wakeup was on cpu0, we can:
>>
>> 1. choose cpu in core1 sg like we did usually
>> some overhead but tend to make the load a little balance
>> core0 sg core1 sg
>> cpu0 cpu1 cpu2 cpu3
>> idle busy wakee idle
>
> Reducing latency and increasing throughput when the waker isn't really
> really going to immediately schedule off as the hint implies. Nice for
> bursty loads and ramp.
>
> The breakeven point is going up though. If you don't have nohz
> throttled, you eat tick start/stop overhead, and the menu governor
> recently added yet more overhead, so maybe we should say hell with it.

Exactly, there are more and more factors to consider; we say things get
balanced, but actually it's not the best choice...

>
>> 2. choose cpu0 like the patch proposed
>> no overhead but tend to make the load a little more unbalance
>> core0 sg core1 sg
>> cpu0 cpu1 cpu2 cpu3
>> wakee busy idle idle
>>
>> May be we should add a higher scope load balance check in wake_affine(),
>> but that means higher overhead which is just what the patch want to
>> reduce...
>
> Yeah, more overhead is the last thing we need.
>
>> What about some discount for sync case inside select_idle_sibling()?
>> For example we consider sync cpu as idle and prefer it more than the others?
>
> That's what the sync hint does. Problem is, it's a hint. If it were
> truth, there would be no point in calling select_idle_sibling().

Just wondering: if the hint is wrong most of the time, then why don't
we remove it...

Otherwise I think we can still utilize it to make decisions that tend to
be correct, can't we?

Regards,
Michael Wang

>
> -Mike
>

2013-09-26 05:34:54

by Mike Galbraith

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

On Thu, 2013-09-26 at 13:12 +0800, Michael wang wrote:
> On 09/26/2013 11:41 AM, Mike Galbraith wrote:
> [snip]
> >> Like the case when we have:
> >>
> >> core0 sg core1 sg
> >> cpu0 cpu1 cpu2 cpu3
> >> waker busy idle idle
> >>
> >> If the sync wakeup was on cpu0, we can:
> >>
> >> 1. choose cpu in core1 sg like we did usually
> >> some overhead but tend to make the load a little balance
> >> core0 sg core1 sg
> >> cpu0 cpu1 cpu2 cpu3
> >> idle busy wakee idle
> >
> > Reducing latency and increasing throughput when the waker isn't really
> > really going to immediately schedule off as the hint implies. Nice for
> > bursty loads and ramp.
> >
> > The breakeven point is going up though. If you don't have nohz
> > throttled, you eat tick start/stop overhead, and the menu governor
> > recently added yet more overhead, so maybe we should say hell with it.
>
> Exactly, more and more factors to be considered, we say things get
> balanced but actually it's not the best choice...
>
> >
> >> 2. choose cpu0 like the patch proposed
> >> no overhead but tend to make the load a little more unbalance
> >> core0 sg core1 sg
> >> cpu0 cpu1 cpu2 cpu3
> >> wakee busy idle idle
> >>
> >> May be we should add a higher scope load balance check in wake_affine(),
> >> but that means higher overhead which is just what the patch want to
> >> reduce...
> >
> > Yeah, more overhead is the last thing we need.
> >
> >> What about some discount for sync case inside select_idle_sibling()?
> >> For example we consider sync cpu as idle and prefer it more than the others?
> >
> > That's what the sync hint does. Problem is, it's a hint. If it were
> > truth, there would be no point in calling select_idle_sibling().
>
> Just wondering if the hint was wrong in most of the time, then why don't
> we remove it...

For very fast/light network ping-pong micro-benchmarks, it is right.
For pipe-test, it's absolutely right, jabbering parties are 100%
synchronous, there is nada/nil/zip/diddly squat overlap reclaimable..
but in the real world, it ain't necessarily so.

> Otherwise I think we can still utilize it to make some decision tends to
> be correct, don't we?

Sometimes :)

-Mike

2013-09-26 06:16:05

by Mike Galbraith

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

On Thu, 2013-09-26 at 07:34 +0200, Mike Galbraith wrote:
> On Thu, 2013-09-26 at 13:12 +0800, Michael wang wrote:
> > On 09/26/2013 11:41 AM, Mike Galbraith wrote:
> > [snip]
> > >> Like the case when we have:
> > >>
> > >> core0 sg core1 sg
> > >> cpu0 cpu1 cpu2 cpu3
> > >> waker busy idle idle
> > >>
> > >> If the sync wakeup was on cpu0, we can:
> > >>
> > >> 1. choose cpu in core1 sg like we did usually
> > >> some overhead but tend to make the load a little balance
> > >> core0 sg core1 sg
> > >> cpu0 cpu1 cpu2 cpu3
> > >> idle busy wakee idle
> > >
> > > Reducing latency and increasing throughput when the waker isn't really
> > > really going to immediately schedule off as the hint implies. Nice for
> > > bursty loads and ramp.
> > >
> > > The breakeven point is going up though. If you don't have nohz
> > > throttled, you eat tick start/stop overhead, and the menu governor
> > > recently added yet more overhead, so maybe we should say hell with it.
> >
> > Exactly, more and more factors to be considered, we say things get
> > balanced but actually it's not the best choice...
> >
> > >
> > >> 2. choose cpu0 like the patch proposed
> > >> no overhead but tend to make the load a little more unbalance
> > >> core0 sg core1 sg
> > >> cpu0 cpu1 cpu2 cpu3
> > >> wakee busy idle idle
> > >>
> > >> May be we should add a higher scope load balance check in wake_affine(),
> > >> but that means higher overhead which is just what the patch want to
> > >> reduce...
> > >
> > > Yeah, more overhead is the last thing we need.
> > >
> > >> What about some discount for sync case inside select_idle_sibling()?
> > >> For example we consider sync cpu as idle and prefer it more than the others?
> > >
> > > That's what the sync hint does. Problem is, it's a hint. If it were
> > > truth, there would be no point in calling select_idle_sibling().
> >
> > Just wondering if the hint was wrong in most of the time, then why don't
> > we remove it...
>
> For very fast/light network ping-pong micro-benchmarks, it is right.
> For pipe-test, it's absolutely right, jabbering parties are 100%
> synchronous, there is nada/nil/zip/diddly squat overlap reclaimable..
> but in the real world, it ain't necessarily so.
>
> > Otherwise I think we can still utilize it to make some decision tends to
> > be correct, don't we?
>
> Sometimes :)

P.S. while we're slapping select_idle_sibling()'s _evil_ face, let's
give it a pat on the head too. It showed regressions in bright red.
Put pipe-test on one core, you only see scheduler weight.. but entering
and exiting idle is part of the fast path, whether you're exercising it
by doing something silly or not.

-Mike

2013-09-26 06:32:19

by Michael wang

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

On 09/26/2013 01:34 PM, Mike Galbraith wrote:
> On Thu, 2013-09-26 at 13:12 +0800, Michael wang wrote:
>> On 09/26/2013 11:41 AM, Mike Galbraith wrote:
>> [snip]
>>>> Like the case when we have:
>>>>
>>>> core0 sg core1 sg
>>>> cpu0 cpu1 cpu2 cpu3
>>>> waker busy idle idle
>>>>
>>>> If the sync wakeup was on cpu0, we can:
>>>>
>>>> 1. choose cpu in core1 sg like we did usually
>>>> some overhead but tend to make the load a little balance
>>>> core0 sg core1 sg
>>>> cpu0 cpu1 cpu2 cpu3
>>>> idle busy wakee idle
>>>
>>> Reducing latency and increasing throughput when the waker isn't really
>>> really going to immediately schedule off as the hint implies. Nice for
>>> bursty loads and ramp.
>>>
>>> The breakeven point is going up though. If you don't have nohz
>>> throttled, you eat tick start/stop overhead, and the menu governor
>>> recently added yet more overhead, so maybe we should say hell with it.
>>
>> Exactly, more and more factors to be considered, we say things get
>> balanced but actually it's not the best choice...
>>
>>>
>>>> 2. choose cpu0 like the patch proposed
>>>> no overhead but tend to make the load a little more unbalance
>>>> core0 sg core1 sg
>>>> cpu0 cpu1 cpu2 cpu3
>>>> wakee busy idle idle
>>>>
>>>> May be we should add a higher scope load balance check in wake_affine(),
>>>> but that means higher overhead which is just what the patch want to
>>>> reduce...
>>>
>>> Yeah, more overhead is the last thing we need.
>>>
>>>> What about some discount for sync case inside select_idle_sibling()?
>>>> For example we consider sync cpu as idle and prefer it more than the others?
>>>
>>> That's what the sync hint does. Problem is, it's a hint. If it were
>>> truth, there would be no point in calling select_idle_sibling().
>>
>> Just wondering if the hint was wrong in most of the time, then why don't
>> we remove it...
>
> For very fast/light network ping-pong micro-benchmarks, it is right.
> For pipe-test, it's absolutely right, jabbering parties are 100%
> synchronous, there is nada/nil/zip/diddly squat overlap reclaimable..
> but in the real world, it ain't necessarily so.
>
>> Otherwise I think we can still utilize it to make some decision tends to
>> be correct, don't we?
>
> Sometimes :)

Ok, a double-edged sword I see :)

Maybe we can wield it carefully here, and give the discount to a bigger
scope rather than just the sync cpu, for example:

sg1 sg2
cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 cpu6 cpu7
waker idle idle idle idle idle idle idle

If it's a sync wakeup on cpu0 (the only waker), and the sg is wide enough
that one cpu is not so influential, then supposing cpu0 to be idle could be
safer, and preferring sg1 over sg2 is more likely to be right.

And we can still choose an idle cpu at the final step, like cpu1 in this case,
to avoid the risk that the waker doesn't get off as the hint said.

The key point is to reduce the influence of sync: trust it a little but not
totally ;-)
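
One way to read that as code (again just a sketch; the helper name is made up,
and taking "bigger scope" to mean the waker's LLC is my assumption):

	/*
	 * Hypothetical sketch: on a sync wakeup, look for an idle cpu that
	 * shares cache with the waker (like cpu1 above); if none is idle,
	 * trust the sync hint a little and stay on the waker's cpu.
	 */
	static int select_idle_near_waker(struct task_struct *p, int waker_cpu)
	{
		int cpu;

		for_each_cpu_and(cpu, cpu_online_mask, tsk_cpus_allowed(p)) {
			if (cpu != waker_cpu && cpus_share_cache(waker_cpu, cpu) &&
			    idle_cpu(cpu))
				return cpu;
		}

		return waker_cpu;
	}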

Regards,
Michael Wang

>
> -Mike
>

2013-09-26 07:10:04

by Mike Galbraith

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

On Thu, 2013-09-26 at 14:32 +0800, Michael wang wrote:
> On 09/26/2013 01:34 PM, Mike Galbraith wrote:
> > On Thu, 2013-09-26 at 13:12 +0800, Michael wang wrote:
> >> On 09/26/2013 11:41 AM, Mike Galbraith wrote:
> >> [snip]
> >>>> Like the case when we have:
> >>>>
> >>>> core0 sg core1 sg
> >>>> cpu0 cpu1 cpu2 cpu3
> >>>> waker busy idle idle
> >>>>
> >>>> If the sync wakeup was on cpu0, we can:
> >>>>
> >>>> 1. choose cpu in core1 sg like we did usually
> >>>> some overhead but tend to make the load a little balance
> >>>> core0 sg core1 sg
> >>>> cpu0 cpu1 cpu2 cpu3
> >>>> idle busy wakee idle
> >>>
> >>> Reducing latency and increasing throughput when the waker isn't really
> >>> really going to immediately schedule off as the hint implies. Nice for
> >>> bursty loads and ramp.
> >>>
> >>> The breakeven point is going up though. If you don't have nohz
> >>> throttled, you eat tick start/stop overhead, and the menu governor
> >>> recently added yet more overhead, so maybe we should say hell with it.
> >>
> >> Exactly, more and more factors to be considered, we say things get
> >> balanced but actually it's not the best choice...
> >>
> >>>
> >>>> 2. choose cpu0 like the patch proposed
> >>>> no overhead but tend to make the load a little more unbalance
> >>>> core0 sg core1 sg
> >>>> cpu0 cpu1 cpu2 cpu3
> >>>> wakee busy idle idle
> >>>>
> >>>> May be we should add a higher scope load balance check in wake_affine(),
> >>>> but that means higher overhead which is just what the patch want to
> >>>> reduce...
> >>>
> >>> Yeah, more overhead is the last thing we need.
> >>>
> >>>> What about some discount for sync case inside select_idle_sibling()?
> >>>> For example we consider sync cpu as idle and prefer it more than the others?
> >>>
> >>> That's what the sync hint does. Problem is, it's a hint. If it were
> >>> truth, there would be no point in calling select_idle_sibling().
> >>
> >> Just wondering if the hint was wrong in most of the time, then why don't
> >> we remove it...
> >
> > For very fast/light network ping-pong micro-benchmarks, it is right.
> > For pipe-test, it's absolutely right, jabbering parties are 100%
> > synchronous, there is nada/nil/zip/diddly squat overlap reclaimable..
> > but in the real world, it ain't necessarily so.
> >
> >> Otherwise I think we can still utilize it to make some decision tends to
> >> be correct, don't we?
> >
> > Sometimes :)
>
> Ok, a double-edged sword I see :)
>
> May be we can wave it carefully here, give the discount to a bigger
> scope not the sync cpu, for example:
>
> sg1 sg2
> cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 cpu6 cpu7
> waker idle idle idle idle idle idle idle
>
> If it's sync wakeup on cpu0 (only waker), and the sg is wide enough,
> which means one cpu is not so influencial, then suppose cpu0 to be idle
> could be more safe, also prefer sg1 than sg2 is more likely to be right.
>
> And we can still choose idle-cpu at final step, like cpu1 in this case,
> to avoid the risk that waker don't get off as it said.
>
> The key point is to reduce the influence of sync, trust a little but not
> totally ;-)

What we need is a dirt cheap way to fairly accurately predict overlap
potential (todo: write omniscience().. patent, buy planet).

-Mike

2013-09-26 07:27:05

by Michael wang

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

On 09/26/2013 03:09 PM, Mike Galbraith wrote:
[snip]
>>
>> Ok, a double-edged sword I see :)
>>
>> May be we can wave it carefully here, give the discount to a bigger
>> scope not the sync cpu, for example:
>>
>> sg1 sg2
>> cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 cpu6 cpu7
>> waker idle idle idle idle idle idle idle
>>
>> If it's sync wakeup on cpu0 (only waker), and the sg is wide enough,
>> which means one cpu is not so influencial, then suppose cpu0 to be idle
>> could be more safe, also prefer sg1 than sg2 is more likely to be right.
>>
>> And we can still choose idle-cpu at final step, like cpu1 in this case,
>> to avoid the risk that waker don't get off as it said.
>>
>> The key point is to reduce the influence of sync, trust a little but not
>> totally ;-)
>
> What we need is a dirt cheap way to fairly accurately predict overlap
> potential (todo: write omniscience().. patent, buy planet).

Agreed, solutions for such cases are usually incredible ;-)

Regards,
Michael Wang

>
> -Mike
>

2013-09-26 09:58:24

by Peter Zijlstra

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

On Wed, Sep 25, 2013 at 10:56:17AM +0200, Mike Galbraith wrote:
> That will make pipe-test go fugly -> pretty, and help very fast/light
> localhost network, but eat heavier localhost overlap recovery. We need
> a working (and cheap) overlap detector scheme, so we can know when there
> is enough to be worth going after.

We used to have an overlap detectoring thing.. It went away though.

But see if you can make something like the below work?

You could make it a general overlap thing and try without the sync too I
suppose..

---
 include/linux/sched.h |    3 +++
 kernel/sched/fair.c   |   25 +++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b5344de..5428016 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -974,6 +974,9 @@ struct sched_entity {
 	u64			vruntime;
 	u64			prev_sum_exec_runtime;
 
+	u64			last_sync_wakeup;
+	u64			avg_overlap;
+
 	u64			nr_migrations;
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2b89cd2..47b0d0f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2913,6 +2913,17 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct sched_entity *se = &p->se;
 	int task_sleep = flags & DEQUEUE_SLEEP;
 
+	if (se->last_sync_wakeup) {
+		u64 overlap;
+		s64 diff;
+
+		overlap = rq->clock - se->last_sync_wakeup;
+		se->last_sync_wakeup = 0;
+
+		diff = overlap - se->avg_overlap;
+		se->avg_overlap += diff >> 8;
+	}
+
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
@@ -3429,6 +3440,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
 
+	if (sync)
+		p->se.last_sync_wakeup = sched_clock_cpu(cpu);
+
 	if (p->nr_cpus_allowed == 1)
 		return prev_cpu;
 
@@ -3461,6 +3475,17 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 			prev_cpu = cpu;
 
+		/*
+		 * Don't bother with select_idle_sibling() in the case of a sync wakeup
+		 * where we know the only running task will soon go-away. Going
+		 * through select_idle_sibling will only lead to pointless ping-pong.
+		 */
+		if (sync && prev_cpu == cpu && cpu_rq(cpu)->nr_running == 1 &&
+		    current->se.avg_overlap < 10000) {
+			new_cpu = cpu;
+			goto unlock;
+		}
+
 		new_cpu = select_idle_sibling(p, prev_cpu);
 		goto unlock;
 	}
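
For reference, the dequeue_task_fair() hunk above maintains avg_overlap as a
cheap exponentially weighted moving average: avg += (sample - avg) >> 8, so
each overlap sample contributes with weight 1/256. With sched_clock_cpu() in
nanoseconds, the current->se.avg_overlap < 10000 test is meant to say the
waker has historically gone to sleep within roughly 10us of issuing a sync
wakeup (though, as noted further down the thread, the stamp is taken on the
wakee p rather than on current). A standalone illustration of the same
arithmetic, as a hypothetical helper rather than part of the patch:

	/* Same update as the patch: an EWMA with per-sample weight 1/256. */
	static inline u64 overlap_ewma(u64 avg, u64 sample)
	{
		s64 diff = sample - avg;

		return avg + (diff >> 8);
	}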

2013-09-26 10:05:21

by Peter Zijlstra

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

On Thu, Sep 26, 2013 at 11:58:12AM +0200, Peter Zijlstra wrote:
> @@ -3429,6 +3440,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> int want_affine = 0;
> int sync = wake_flags & WF_SYNC;
>
> + if (sync)
> + p->se.last_sync_wakeup = sched_clock_cpu(cpu);
> +
> if (p->nr_cpus_allowed == 1)
> return prev_cpu;
>

Oh, I suppose something like:

	if (sync && !p->se.last_sync_wakeup)
		p->se.last_sync_wakeup = sched_clock_cpu(cpu);

is also a nice variation to try..

2013-09-26 10:56:27

by Paul Turner

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

On Thu, Sep 26, 2013 at 2:58 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Sep 25, 2013 at 10:56:17AM +0200, Mike Galbraith wrote:
>> That will make pipe-test go fugly -> pretty, and help very fast/light
>> localhost network, but eat heavier localhost overlap recovery. We need
>> a working (and cheap) overlap detector scheme, so we can know when there
>> is enough to be worth going after.
>
> We used to have an overlap detectoring thing.. It went away though.
>
> But see if you can make something like the below work?
>
> You could make it a general overlap thing and try without the sync too I
> suppose..
>
> ---
> include/linux/sched.h | 3 +++
> kernel/sched/fair.c | 25 +++++++++++++++++++++++++
> 2 files changed, 28 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b5344de..5428016 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -974,6 +974,9 @@ struct sched_entity {
> u64 vruntime;
> u64 prev_sum_exec_runtime;
>
> + u64 last_sync_wakeup;
> + u64 avg_overlap;
> +
> u64 nr_migrations;
>
> #ifdef CONFIG_SCHEDSTATS
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2b89cd2..47b0d0f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2913,6 +2913,17 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> struct sched_entity *se = &p->se;
> int task_sleep = flags & DEQUEUE_SLEEP;
>
> + if (se->last_sync_wakeup) {
> + u64 overlap;
> + s64 diff;
> +
> + overlap = rq->clock - se->last_sync_wakeup;
> + se->last_sync_wakeup = 0;
> +
> + diff = overlap - se->avg_overlap;
> + se->avg_overlap += diff >> 8;
> + }
> +
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);
> dequeue_entity(cfs_rq, se, flags);
> @@ -3429,6 +3440,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> int want_affine = 0;
> int sync = wake_flags & WF_SYNC;
>
> + if (sync)
> + p->se.last_sync_wakeup = sched_clock_cpu(cpu);
> +
> if (p->nr_cpus_allowed == 1)
> return prev_cpu;
>
> @@ -3461,6 +3475,17 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> prev_cpu = cpu;
>
> + /*
> + * Don't bother with select_idle_sibling() in the case of a sync wakeup
> + * where we know the only running task will soon go-away. Going
> + * through select_idle_sibling will only lead to pointless ping-pong.
> + */
> + if (sync && prev_cpu == cpu && cpu_rq(cpu)->nr_running == 1 &&

I've long thought of trying something like this.

I like the intent but I'd go a step further in that I think we want to
also implicitly extract WF_SYNC itself. While pipe_test is a good
microbenchmark it's not entirely representative since, in reality,
overlap is most usefully applied to threads in the same process -- and
they rarely communicate using a pipe.

What we really then care about is predicting the overlap associated
with userspace synchronization objects, typically built on top of
futexes. Unfortunately the existence/use of per-thread futexes
reduces how much state you could usefully associate with the futex.
One approach might be to hash (with some small saturating counter)
against rip. But this gets more complicated quite quickly.
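
A rough sketch of that per-call-site idea (everything here is assumed for
illustration: the table shape, the 2-bit counters, and where it would hook
into the wakeup/sleep path; hash_long() is from linux/hash.h):

	/* Hypothetical per-task predictor: saturating counters keyed by wakeup RIP. */
	#define OVERLAP_BITS	6
	struct overlap_predictor {
		u8 counter[1 << OVERLAP_BITS];
	};

	/* Update once we know whether the waker really blocked after the wakeup. */
	static void overlap_update(struct overlap_predictor *op, unsigned long rip,
				   bool waker_blocked)
	{
		u8 *c = &op->counter[hash_long(rip, OVERLAP_BITS)];

		if (waker_blocked && *c < 3)
			(*c)++;
		else if (!waker_blocked && *c > 0)
			(*c)--;
	}

	/* Trust the sync hint only when the counter for this call site is confident. */
	static bool overlap_predict_sync(struct overlap_predictor *op, unsigned long rip)
	{
		return op->counter[hash_long(rip, OVERLAP_BITS)] >= 2;
	}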

> + current->se.avg_overlap < 10000) {
> + new_cpu = cpu;
> + goto unlock;
> + }
> +
> new_cpu = select_idle_sibling(p, prev_cpu);
> goto unlock;
> }

2013-09-26 11:16:28

by Peter Zijlstra

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

On Thu, Sep 26, 2013 at 03:55:55AM -0700, Paul Turner wrote:
> > + /*
> > + * Don't bother with select_idle_sibling() in the case of a sync wakeup
> > + * where we know the only running task will soon go-away. Going
> > + * through select_idle_sibling will only lead to pointless ping-pong.
> > + */
> > + if (sync && prev_cpu == cpu && cpu_rq(cpu)->nr_running == 1 &&
>
> I've long thought of trying something like this.
>
> I like the intent but I'd go a step further in that I think we want to
> also implicitly extract WF_SYNC itself.

I have vague memories of actually trying something like that a good
number of years ago.. sadly that's all I remember about it.

> What we really then care about is predicting the overlap associated
> with userspace synchronization objects, typically built on top of
> futexes. Unfortunately the existence/use of per-thread futexes
> reduces how much state you could usefully associate with the futex.
> One approach might be to hash (with some small saturating counter)
> against rip. But this gets more complicated quite quickly.

Why would you need per object storage? To further granulate the
predicted overlap? instead of having one per task, you have one per
object?

2013-09-26 11:40:04

by Paul Turner

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

On Thu, Sep 26, 2013 at 4:16 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Sep 26, 2013 at 03:55:55AM -0700, Paul Turner wrote:
>> > + /*
>> > + * Don't bother with select_idle_sibling() in the case of a sync wakeup
>> > + * where we know the only running task will soon go-away. Going
>> > + * through select_idle_sibling will only lead to pointless ping-pong.
>> > + */
>> > + if (sync && prev_cpu == cpu && cpu_rq(cpu)->nr_running == 1 &&
>>
>> I've long thought of trying something like this.
>>
>> I like the intent but I'd go a step further in that I think we want to
>> also implicitly extract WF_SYNC itself.
>
> I have vague memories of actually trying something like that a good
> number of years ago.. sadly that's all I remember about it.
>
>> What we really then care about is predicting the overlap associated
>> with userspace synchronization objects, typically built on top of
>> futexes. Unfortunately the existence/use of per-thread futexes
>> reduces how much state you could usefully associate with the futex.
>> One approach might be to hash (with some small saturating counter)
>> against rip. But this gets more complicated quite quickly.
>
> Why would you need per object storage? To further granulate the
> predicted overlap? instead of having one per task, you have one per
> object?

It is my intuition that there are a few common objects with fairly
polarized behavior: i.e. for condition variables and producer-consumer
queues, a wakeup strongly predicts blocking, whereas locks protecting
objects, e.g. a mutex, would be expected to show the opposite behavior.

For this hint to be beneficial you have to get it right frequently;
getting it wrong in the first case hurts cache locality, and in the
second hurts parallelism. Today we always err on the side of hurting
locality since the cost of getting it wrong is better bounded. These are
sufficiently common, and likely to be interspersed, that I suspect
allowing them to interact on a thread-wide counter will basically give
a mushy result (or even be an anti-predictor, since it will strongly
favor the last observation) as an input.

2013-09-26 13:46:14

by Mike Galbraith

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

On Thu, 2013-09-26 at 11:58 +0200, Peter Zijlstra wrote:
> On Wed, Sep 25, 2013 at 10:56:17AM +0200, Mike Galbraith wrote:
> > That will make pipe-test go fugly -> pretty, and help very fast/light
> > localhost network, but eat heavier localhost overlap recovery. We need
> > a working (and cheap) overlap detector scheme, so we can know when there
> > is enough to be worth going after.
>
> We used to have an overlap detectoring thing.. It went away though.

Yup.. guilty as charged your honor.

> But see if you can make something like the below work?

It'll have the same problems (preemption etc.) as the whacked code. Could
bring that back and have another go at making it actually work; maybe
it'll be good enough to cause more gain than pain.

> You could make it a general overlap thing and try without the sync too I
> suppose..
>
> ---
> include/linux/sched.h | 3 +++
> kernel/sched/fair.c | 25 +++++++++++++++++++++++++
> 2 files changed, 28 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b5344de..5428016 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -974,6 +974,9 @@ struct sched_entity {
> u64 vruntime;
> u64 prev_sum_exec_runtime;
>
> + u64 last_sync_wakeup;
> + u64 avg_overlap;
> +
> u64 nr_migrations;
>
> #ifdef CONFIG_SCHEDSTATS
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2b89cd2..47b0d0f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2913,6 +2913,17 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> struct sched_entity *se = &p->se;
> int task_sleep = flags & DEQUEUE_SLEEP;
>
> + if (se->last_sync_wakeup) {
> + u64 overlap;
> + s64 diff;
> +
> + overlap = rq->clock - se->last_sync_wakeup;
> + se->last_sync_wakeup = 0;
> +
> + diff = overlap - se->avg_overlap;
> + se->avg_overlap += diff >> 8;
> + }
> +
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);
> dequeue_entity(cfs_rq, se, flags);
> @@ -3429,6 +3440,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> int want_affine = 0;
> int sync = wake_flags & WF_SYNC;
>
> + if (sync)
> + p->se.last_sync_wakeup = sched_clock_cpu(cpu);
> +
> if (p->nr_cpus_allowed == 1)
> return prev_cpu;
>
> @@ -3461,6 +3475,17 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> prev_cpu = cpu;
>
> + /*
> + * Don't bother with select_idle_sibling() in the case of a sync wakeup
> + * where we know the only running task will soon go-away. Going
> + * through select_idle_sibling will only lead to pointless ping-pong.
> + */
> + if (sync && prev_cpu == cpu && cpu_rq(cpu)->nr_running == 1 &&
> + current->se.avg_overlap < 10000) {
> + new_cpu = cpu;
> + goto unlock;
> + }
> +
> new_cpu = select_idle_sibling(p, prev_cpu);
> goto unlock;
> }

2013-09-26 14:35:44

by Peter Zijlstra

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

On Thu, Sep 26, 2013 at 04:39:30AM -0700, Paul Turner wrote:
> It is my intuition that there are a few common objects with fairly
> polarized behavior: I.e. For condition variables and producer
> consumer queues, a wakeup strongly predicts blocking. Whereas for
> locks protecting objects, e.g. a Mutex, would be expected to have the
> opposite behavior.

Agreed; however none of those seem to have the property we're looking
for.

Even producer-consumer queues on their own don't generate the
alternating patterns we're looking for with the SYNC hint.

We need a 'guarantee' that the waker is going to stop until the wakee is
done.

What we're looking for is the typical synchronous request-reply like
pattern -- and that doesn't seem to correlate to any one locking object.

Rather it is an inter-task relation; so task state does make sense in
finding them. We could for instance try and infer which task is
servicing requests; and then we know that requesting tasks will sleep
until reply.

2013-09-26 15:10:07

by Michael wang

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

Hi, Peter

On 09/26/2013 05:58 PM, Peter Zijlstra wrote:
[snip]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2b89cd2..47b0d0f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2913,6 +2913,17 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> struct sched_entity *se = &p->se;
> int task_sleep = flags & DEQUEUE_SLEEP;
>
> + if (se->last_sync_wakeup) {
> + u64 overlap;
> + s64 diff;
> +
> + overlap = rq->clock - se->last_sync_wakeup;
> + se->last_sync_wakeup = 0;
> +
> + diff = overlap - se->avg_overlap;
> + se->avg_overlap += diff >> 8;
> + }
> +
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);
> dequeue_entity(cfs_rq, se, flags);
> @@ -3429,6 +3440,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> int want_affine = 0;
> int sync = wake_flags & WF_SYNC;
>
> + if (sync)
> + p->se.last_sync_wakeup = sched_clock_cpu(cpu);

Forgive me, but I'm trying to understand it... why 'p' and not 'current'
here? Do we want the get-off speed of the waker or the working time of the wakee?

Regards,
Michael Wang

> +
> if (p->nr_cpus_allowed == 1)
> return prev_cpu;
>
> @@ -3461,6 +3475,17 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> prev_cpu = cpu;
>
> + /*
> + * Don't bother with select_idle_sibling() in the case of a sync wakeup
> + * where we know the only running task will soon go-away. Going
> + * through select_idle_sibling will only lead to pointless ping-pong.
> + */
> + if (sync && prev_cpu == cpu && cpu_rq(cpu)->nr_running == 1 &&
> + current->se.avg_overlap < 10000) {
> + new_cpu = cpu;
> + goto unlock;
> + }
> +
> new_cpu = select_idle_sibling(p, prev_cpu);
> goto unlock;
> }

2013-09-26 15:43:13

by Peter Zijlstra

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

On Thu, Sep 26, 2013 at 04:35:33PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 26, 2013 at 04:39:30AM -0700, Paul Turner wrote:
> > It is my intuition that there are a few common objects with fairly
> > polarized behavior: I.e. For condition variables and producer
> > consumer queues, a wakeup strongly predicts blocking. Whereas for
> > locks protecting objects, e.g. a Mutex, would be expected to have the
> > opposite behavior.
>
> Agreed; however none of those seem to have the property we're looking
> for.
>
> Even produces consumer queues on their own don't generate the
> alternating patterns we're looking for with the SYNC hint.
>
> We need a 'guarantee' that the waker is going to stop until the wakee is
> done.
>
> What we're looking for is the typical synchronous request-reply like
> pattern -- and that doesn't seem to correlate to any one locking object.
>
> Rather it is an inter-task relation; so task state does make sense in
> finding them. We could for instance try and infer which task is
> servicing requests; and then we know that requesting tasks will sleep
> until reply.
>

Oh never mind, I see what you meant, the edges in that graph are the
locks.

Can't use RIPs for futexes though; you'd likely end up in the one
pthread_mutex_lock() implementation or such.

2013-09-26 15:44:26

by Peter Zijlstra

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

On Thu, Sep 26, 2013 at 11:09:46PM +0800, Michael wang wrote:
> > + if (sync)
> > + p->se.last_sync_wakeup = sched_clock_cpu(cpu);
>
> Forgive me but I'm trying to understand it... why not 'current' but 'p'
> here? we want the get off speed of waker or the working time of wakee?

Because I'm an idiot? ;-)
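
(Since the short-cut in select_task_rq_fair() tests current->se.avg_overlap,
the implied fix here is presumably to stamp the waker rather than the wakee,
something along the lines of:

	if (sync)
		current->se.last_sync_wakeup = sched_clock_cpu(cpu);

so that avg_overlap ends up measuring how quickly the waker schedules off
after issuing a sync wakeup.)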

2013-09-27 01:19:22

by Michael wang

Subject: Re: [RFC][PATCH] sched: Avoid select_idle_sibling() for wake_affine(.sync=true)

On 09/26/2013 11:44 PM, Peter Zijlstra wrote:
> On Thu, Sep 26, 2013 at 11:09:46PM +0800, Michael wang wrote:
>>> + if (sync)
>>> + p->se.last_sync_wakeup = sched_clock_cpu(cpu);
>>
>> Forgive me but I'm trying to understand it... why not 'current' but 'p'
>> here? we want the get off speed of waker or the working time of wakee?
>
> Because I'm an idiot? ;-)

Well, I know you are full of good ideas ;-)

Regards,
Michael Wang
