2015-06-12 05:35:15

by Mike Galbraith

Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote:

> @@ -5022,22 +5026,28 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> * If both cpu and prev_cpu are part of this domain,
> * cpu is a valid SD_WAKE_AFFINE target.
> */
> - if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> - cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> + if (want_affine && !affine_sd &&
> + (tmp->flags & SD_WAKE_AFFINE) &&
> + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
> affine_sd = tmp;
> - break;
> - }
>
> if (tmp->flags & sd_flag)
> sd = tmp;
> + else if (!want_affine || (want_affine && affine_sd))
> + break;
> }

Hm, new_cpu == cpu.

> - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> + if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
> prev_cpu = cpu;
> + sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */
> + }

If branch above is not taken, new_cpu remains cpu.

> if (sd_flag & SD_BALANCE_WAKE) {
> - new_cpu = select_idle_sibling(p, prev_cpu);
> - goto unlock;
> + int tmp = select_idle_sibling(p, prev_cpu);
> + if (tmp >= 0) {
> + new_cpu = tmp;
> + goto unlock;
> + }
> }

If select_idle_sibling() returns -1, new_cpu remains cpu.

>
> while (sd) {

If sd == NULL, we fall through and try to pull wakee despite nacked-by
tsk_cpus_allowed() or wake_affine().

-Mike


2015-06-17 18:07:03

by Josef Bacik

Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

On 06/11/2015 10:35 PM, Mike Galbraith wrote:
> On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote:
>
>> @@ -5022,22 +5026,28 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>> * If both cpu and prev_cpu are part of this domain,
>> * cpu is a valid SD_WAKE_AFFINE target.
>> */
>> - if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
>> - cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
>> + if (want_affine && !affine_sd &&
>> + (tmp->flags & SD_WAKE_AFFINE) &&
>> + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
>> affine_sd = tmp;
>> - break;
>> - }
>>
>> if (tmp->flags & sd_flag)
>> sd = tmp;
>> + else if (!want_affine || (want_affine && affine_sd))
>> + break;
>> }
>
> Hm, new_cpu == cpu.
>
>> - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>> + if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
>> prev_cpu = cpu;
>> + sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */
>> + }
>
> If branch above is not taken, new_cpu remains cpu.
>
>> if (sd_flag & SD_BALANCE_WAKE) {
>> - new_cpu = select_idle_sibling(p, prev_cpu);
>> - goto unlock;
>> + int tmp = select_idle_sibling(p, prev_cpu);
>> + if (tmp >= 0) {
>> + new_cpu = tmp;
>> + goto unlock;
>> + }
>> }
>
> If select_idle_sibling() returns -1, new_cpu remains cpu.
>
>>
>> while (sd) {
>
> If sd == NULL, we fall through and try to pull wakee despite nacked-by
> tsk_cpus_allowed() or wake_affine().
>

So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something
like this

	if (tmp >= 0) {
		new_cpu = tmp;
		goto unlock;
	} else if (!want_affine) {
		new_cpu = prev_cpu;
	}
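
i.e., with Peter's hunk above folded in, the whole block would read roughly
like this (untested):

	if (sd_flag & SD_BALANCE_WAKE) {
		int tmp = select_idle_sibling(p, prev_cpu);

		if (tmp >= 0) {
			new_cpu = tmp;
			goto unlock;
		} else if (!want_affine) {
			/* don't fall back to cpu, wakee may not be allowed there */
			new_cpu = prev_cpu;
		}
	}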

so we can make sure we're not being pushed onto a cpu that we aren't
allowed on? Thanks,

Josef

2015-06-18 00:55:48

by Mike Galbraith

Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

On Wed, 2015-06-17 at 11:06 -0700, Josef Bacik wrote:
> On 06/11/2015 10:35 PM, Mike Galbraith wrote:
> > On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote:

> > If sd == NULL, we fall through and try to pull wakee despite nacked-by
> > tsk_cpus_allowed() or wake_affine().
> >
>
> So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something
> like this
>
> if (tmp >= 0) {
> new_cpu = tmp;
> goto unlock;
> } else if (!want_affine) {
> new_cpu = prev_cpu;
> }
>
> so we can make sure we're not being pushed onto a cpu that we aren't
> allowed on? Thanks,

The buglet is a messenger methinks. You saying the patch helped without
SD_BALANCE_WAKE being set is why I looked. The buglet would seem to say
that preferring cache is not harming your load after all. It now sounds
as though wake_wide() may be what you're squabbling with.
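
For reference, the heuristic in question is the current mainline wake_wide(),
condensed here from the lines that show up as removals in the diffs later in
this thread:

	static int wake_wide(struct task_struct *p)
	{
		int factor = this_cpu_read(sd_llc_size);

		if (p->wakee_flips > factor) {
			/*
			 * wakee is somewhat hot; if the waker flips far more
			 * often still, prefer to leave the wakee alone.
			 */
			if (current->wakee_flips > (factor * p->wakee_flips))
				return 1;
		}
		return 0;
	}

Note it returns 1 (go wide, don't pull) only when a waker of many wakes one
of its wakees, never when one of those wakees wakes the waker back.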

Things aren't adding up all that well.

-Mike

2015-06-18 03:47:39

by Josef Bacik

Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

On 06/17/2015 05:55 PM, Mike Galbraith wrote:
> On Wed, 2015-06-17 at 11:06 -0700, Josef Bacik wrote:
>> On 06/11/2015 10:35 PM, Mike Galbraith wrote:
>>> On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote:
>
>>> If sd == NULL, we fall through and try to pull wakee despite nacked-by
>>> tsk_cpus_allowed() or wake_affine().
>>>
>>
>> So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something
>> like this
>>
>> if (tmp >= 0) {
>> new_cpu = tmp;
>> goto unlock;
>> } else if (!want_affine) {
>> new_cpu = prev_cpu;
>> }
>>
>> so we can make sure we're not being pushed onto a cpu that we aren't
>> allowed on? Thanks,
>
> The buglet is a messenger methinks. You saying the patch helped without
> SD_BALANCE_WAKE being set is why I looked. The buglet would seem to say
> that preferring cache is not harming your load after all. It now sounds
> as though wake_wide() may be what you're squabbling with.
>
> Things aren't adding up all that well.

Yeah I'm horribly confused. The other thing is I had to switch clusters
(I know, I know, I'm changing the parameters of the test). So these new
boxes are haswell boxes, but basically the same otherwise, 2 socket 12
core with HT, just newer/faster CPUs. I'll re-run everything again and
give the numbers so we're all on the same page again, but as it stands
now I think we have this

3.10 with wake_idle forward ported - good
4.0 stock - 20% perf drop
4.0 w/ Peter's patch - good
4.0 w/ Peter's patch + SD_BALANCE_WAKE - 5% perf drop

I can do all these iterations again to verify, is there any other
permutation you'd like to see? Thanks,

Josef

2015-06-18 04:12:57

by Mike Galbraith

Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

On Wed, 2015-06-17 at 20:46 -0700, Josef Bacik wrote:
> On 06/17/2015 05:55 PM, Mike Galbraith wrote:
> > On Wed, 2015-06-17 at 11:06 -0700, Josef Bacik wrote:
> >> On 06/11/2015 10:35 PM, Mike Galbraith wrote:
> >>> On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote:
> >
> >>> If sd == NULL, we fall through and try to pull wakee despite nacked-by
> >>> tsk_cpus_allowed() or wake_affine().
> >>>
> >>
> >> So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something
> >> like this
> >>
> >> if (tmp >= 0) {
> >> new_cpu = tmp;
> >> goto unlock;
> >> } else if (!want_affine) {
> >> new_cpu = prev_cpu;
> >> }
> >>
> >> so we can make sure we're not being pushed onto a cpu that we aren't
> >> allowed on? Thanks,
> >
> > The buglet is a messenger methinks. You saying the patch helped without
> > SD_BALANCE_WAKE being set is why I looked. The buglet would seem to say
> > that preferring cache is not harming your load after all. It now sounds
> > as though wake_wide() may be what you're squabbling with.
> >
> > Things aren't adding up all that well.
>
> Yeah I'm horribly confused. The other thing is I had to switch clusters
> (I know, I know, I'm changing the parameters of the test). So these new
> boxes are haswell boxes, but basically the same otherwise, 2 socket 12
> core with HT, just newer/faster CPUs. I'll re-run everything again and
> give the numbers so we're all on the same page again, but as it stands
> now I think we have this
>
> 3.10 with wake_idle forward ported - good
> 4.0 stock - 20% perf drop
> 4.0 w/ Peter's patch - good
> 4.0 w/ Peter's patch + SD_BALANCE_WAKE - 5% perf drop
>
> I can do all these iterations again to verify, is there any other
> permutation you'd like to see? Thanks,

Yeah, after re-baseline, please apply/poke these buttons individually in
4.0-virgin.

(cat /sys/kernel/debug/sched_features, prepend NO_, echo it back)

---
kernel/sched/fair.c | 4 ++--
kernel/sched/features.h | 2 ++
2 files changed, 4 insertions(+), 2 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4506,7 +4506,7 @@ static int wake_affine(struct sched_doma
* If we wake multiple tasks be careful to not bounce
* ourselves around too much.
*/
- if (wake_wide(p))
+ if (sched_feat(WAKE_WIDE) && wake_wide(p))
return 0;

idx = sd->wake_idx;
@@ -4682,7 +4682,7 @@ static int select_idle_sibling(struct ta
struct sched_group *sg;
int i = task_cpu(p);

- if (idle_cpu(target))
+ if (!sched_feat(PREFER_IDLE) || idle_cpu(target))
return target;

/*
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -59,6 +59,8 @@ SCHED_FEAT(TTWU_QUEUE, true)
SCHED_FEAT(FORCE_SD_OVERLAP, false)
SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
+SCHED_FEAT(PREFER_IDLE, true)
+SCHED_FEAT(WAKE_WIDE, true)

/*
* Apply the automatic NUMA scheduling policy. Enabled automatically

2015-07-02 17:45:47

by Josef Bacik

Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

On 06/18/2015 12:12 AM, Mike Galbraith wrote:
> On Wed, 2015-06-17 at 20:46 -0700, Josef Bacik wrote:
>> On 06/17/2015 05:55 PM, Mike Galbraith wrote:
>>> On Wed, 2015-06-17 at 11:06 -0700, Josef Bacik wrote:
>>>> On 06/11/2015 10:35 PM, Mike Galbraith wrote:
>>>>> On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote:
>>>
>>>>> If sd == NULL, we fall through and try to pull wakee despite nacked-by
>>>>> tsk_cpus_allowed() or wake_affine().
>>>>>
>>>>
>>>> So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something
>>>> like this
>>>>
>>>> if (tmp >= 0) {
>>>> new_cpu = tmp;
>>>> goto unlock;
>>>> } else if (!want_affine) {
>>>> new_cpu = prev_cpu;
>>>> }
>>>>
>>>> so we can make sure we're not being pushed onto a cpu that we aren't
>>>> allowed on? Thanks,
>>>
>>> The buglet is a messenger methinks. You saying the patch helped without
>>> SD_BALANCE_WAKE being set is why I looked. The buglet would seem to say
>>> that preferring cache is not harming your load after all. It now sounds
>>> as though wake_wide() may be what you're squabbling with.
>>>
>>> Things aren't adding up all that well.
>>
>> Yeah I'm horribly confused. The other thing is I had to switch clusters
>> (I know, I know, I'm changing the parameters of the test). So these new
>> boxes are haswell boxes, but basically the same otherwise, 2 socket 12
>> core with HT, just newer/faster CPUs. I'll re-run everything again and
>> give the numbers so we're all on the same page again, but as it stands
>> now I think we have this
>>
>> 3.10 with wake_idle forward ported - good
>> 4.0 stock - 20% perf drop
>> 4.0 w/ Peter's patch - good
>> 4.0 w/ Peter's patch + SD_BALANCE_WAKE - 5% perf drop
>>
>> I can do all these iterations again to verify, is there any other
>> permutation you'd like to see? Thanks,
>
> Yeah, after re-baseline, please apply/poke these buttons individually in
> 4.0-virgin.
>
> (cat /sys/kernel/debug/sched_features, prepend NO_, echo it back)
>

Sorry it took me a while to get these numbers to you; migrating the
whole fleet to a new setup broke the performance test suite thing, so
I've only just been able to run tests again. I'll do my best to
describe what is going on and hopefully that will make the results make
sense.

This is on our webservers, which run HHVM. A request comes in for a page
and this goes onto one of the two hhvm.node.# threads, one thread per
NUMA node. From there it is farmed off to one of the worker threads.
If there are no idle workers the request gets put on what is called the
"select_queue". Basically the select_queue should never be larger than
0 in a perfect world. If it's more than 0 we've hit latency somewhere,
and that's not good. The other measurement we care about is how long a
thread spends on a request before it sends a response (this would be the
actual work being done).
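
To make the shape of that load concrete, here is a toy userspace model of the
dispatcher/worker pattern (a hypothetical sketch with made-up names and
numbers, not our actual code); the point is simply that one thread per node
does nearly all of the waking, each time waking a different worker:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NR_WORKERS	16

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t work_ready = PTHREAD_COND_INITIALIZER;
static int pending;		/* requests handed off, not yet picked up */
static int idle_workers = NR_WORKERS;
static int select_queue_events;	/* dispatches that found no idle worker */

static void *worker(void *arg)
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&lock);
		while (!pending)
			pthread_cond_wait(&work_ready, &lock);
		pending--;
		idle_workers--;
		pthread_mutex_unlock(&lock);

		usleep(1000);		/* "request duration": the actual work */

		pthread_mutex_lock(&lock);
		idle_workers++;
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

/* the "1" in 1:N -- wakes a different worker nearly every time */
static void dispatch(void)
{
	pthread_mutex_lock(&lock);
	if (!idle_workers)
		select_queue_events++;	/* the metric we watch, should stay ~0 */
	pending++;
	pthread_cond_signal(&work_ready);
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	pthread_t tid[NR_WORKERS];
	int i;

	for (i = 0; i < NR_WORKERS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < 100000; i++) {
		dispatch();
		usleep(50);		/* incoming request rate */
	}
	printf("select_queue events: %d\n", select_queue_events);
	return 0;
}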

Our tester slowly increases load to a group of servers until the select
queue is consistently >= 1. That means we've loaded the boxes so high
that they can't process the requests as soon as they've come in. Then
it backs down and then ramps up a second time. It takes all of these
measurements and puts them into these pretty graphs. There are 2 graphs
we care about, the duration of the requests vs the requests per second
and the probability that our select queue is >= 1 vs requests per second.

Now for 3.10 vs 4.0 our request duration time is the same if not
slightly better on 4.0, so once the workers are doing their job
everything is a-ok.

The problem is the probability the select queue >= 1 is way different on
4.0 vs 3.10. Normally this graph looks like an S: it's essentially 0 up
to some RPS (requests per second) threshold and then shoots up to 100%
after the threshold. I'll make a table from these graphs that hopefully
makes sense; the numbers are different from run to run because of
traffic and such, and the test and control are both run at the same time.
The header is the probability the select queue is >= 1, and the values
are the RPS at which that probability is reached.

25% 50% 75%
4.0 plain: 371 388 402
control: 386 394 402
difference: 15 6 0

So with 4.0 it's basically a straight line: at lower RPS we are getting a
higher probability of a select queue >= 1. We are measuring the cpu
delay avg ms thing from the scheduler netlink stuff, which is how I
noticed it was scheduler related; our cpu delay is way higher on 4.0
than it is on 3.10 or on 4.0 with the wake idle patch.

So the next test is NO_PREFER_IDLE. This is slightly better than 4.0 plain
25% 50% 75%
NO_PREFER_IDLE: 399 401 414
control: 385 408 416
difference: 14 7 2

The numbers don't really show it well, but the graphs are closer
together; it's slightly more S-shaped, but still not great.

Next is NO_WAKE_WIDE, which is horrible

25% 50% 75%
NO_WAKE_WIDE: 315 344 369
control: 373 380 388
difference: 58 36 19

This isn't even in the same ballpark, it's a way worse regression than
plain.

The next bit is NO_WAKE_WIDE|NO_PREFER_IDLE, which is just as bad

25% 50% 75%
EVERYTHING: 327 360 383
control: 381 390 399
difference: 54 30 19

Hopefully that helps. Thanks,

Josef

2015-07-03 06:41:07

by Mike Galbraith

Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

On Thu, 2015-07-02 at 13:44 -0400, Josef Bacik wrote:

> Now for 3.10 vs 4.0 our request duration time is the same if not
> slightly better on 4.0, so once the workers are doing their job
> everything is a-ok.
>
> The problem is the probability the select queue >= 1 is way different on
> 4.0 vs 3.10. Normally this graph looks like an S, it's essentially 0 up
> to some RPS (requests per second) threshold and then shoots up to 100%
> after the threshold. I'll make a table of these graphs that hopefully
> makes sense, the numbers are different from run to run because of
> traffic and such, the test and control are both run at the same time.
> The header is the probability the select queue >=1
>
> 25% 50% 75%
> 4.0 plain: 371 388 402
> control: 386 394 402
> difference: 15 6 0

So control is 3.10? Virgin?

> So with 4.0 its basically a straight line, at lower RPS we are getting a
> higher probability of a select queue >= 1. We are measuring the cpu
> delay avg ms thing from the scheduler netlink stuff which is how I
> noticed it was scheduler related, our cpu delay is way higher on 4.0
> than it is on 3.10 or 4.0 with the wake idle patch.
>
> So the next test is NO_PREFER_IDLE. This is slightly better than 4.0 plain
> 25% 50% 75%
> NO_PREFER_IDLE: 399 401 414
> control: 385 408 416
> difference: 14 7 2

Hm. Throttling nohz may make larger delta. But never mind that.

> The numbers don't really show it well, but the graphs are closer
> together, it's slightly more s shaped, but still not great.
>
> Next is NO_WAKE_WIDE, which is horrible
>
> 25% 50% 75%
> NO_WAKE_WIDE: 315 344 369
> control: 373 380 388
> difference: 58 36 19
>
> This isn't even in the same ballpark, it's a way worse regression than
> plain.

Ok, this jibes perfectly with 1:N waker/wakee thingy.

> The next bit is NO_WAKE_WIDE|NO_PREFER_IDLE, which is just as bad
>
> 25% 50% 75%
> EVERYTHING: 327 360 383
> control: 381 390 399
> difference: 54 30 19

Ditto.

Hm. Seems what this load should like best is if we detect 1:N, skip all
of the routine gyrations, ie move the N (workers) infrequently, expend
search cycles frequently only on the 1 (dispatch).

Ponder..

-Mike

2015-07-03 09:29:22

by Mike Galbraith

Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

On Fri, 2015-07-03 at 08:40 +0200, Mike Galbraith wrote:

> Hm. Seems what this load should like best is if we detect 1:N, skip all
> of the routine gyrations, ie move the N (workers) infrequently, expend
> search cycles frequently only on the 1 (dispatch).
>
> Ponder..

While taking a refresher peek at the wake_wide() thing, seems it's not
really paying attention when the waker of many is awakened. I wonder if
your load would see more benefit if it watched like so.. rashly assuming
I didn't wreck it completely (iow, completely untested).

---
kernel/sched/fair.c | 36 ++++++++++++++++++++++--------------
1 file changed, 22 insertions(+), 14 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4586,10 +4586,23 @@ static void record_wakee(struct task_str
current->wakee_flips >>= 1;
current->wakee_flip_decay_ts = jiffies;
}
+ if (time_after(jiffies, p->wakee_flip_decay_ts + HZ)) {
+ p->wakee_flips >>= 1;
+ p->wakee_flip_decay_ts = jiffies;
+ }

if (current->last_wakee != p) {
current->last_wakee = p;
current->wakee_flips++;
+ /*
+ * Flip the buddy as well. It's the ratio of flips
+ * with a socket size decayed cutoff that determines
+ * whether the pair are considered to be part of 1:N
+ * or M*N loads of a size that we need to spread, so
+ * ensure flips of both load components. The waker
+ * of many will have many more flips than its wakees.
+ */
+ p->wakee_flips++;
}
}

@@ -4732,24 +4745,19 @@ static long effective_load(struct task_g

static int wake_wide(struct task_struct *p)
{
+ unsigned long max = max(current->wakee_flips, p->wakee_flips);
+ unsigned long min = min(current->wakee_flips, p->wakee_flips);
int factor = this_cpu_read(sd_llc_size);

/*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
+ * Yeah, it's a switching-frequency heuristic, and could mean the
+ * intended many wakees/waker relationship, or rapidly switching
+ * between a few. Use factor to try to automatically adjust such
+ * that the load spreads when it grows beyond what will fit in llc.
*/
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
- }
-
- return 0;
+ if (min < factor)
+ return 0;
+ return max > min * factor;
}

static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
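
To see what the reworked test does with concrete numbers, here is a small
standalone illustration (the flip counts and an llc_size of 12 are made up
for the example; this mirrors the condition above, it is not kernel code):

#include <stdio.h>

/*
 * Both partners must flip at least llc_size times, and one must flip more
 * than llc_size times as often as the other, before we go wide.
 */
static int wake_wide_sketch(unsigned long waker_flips,
			    unsigned long wakee_flips, int factor)
{
	unsigned long max = waker_flips > wakee_flips ? waker_flips : wakee_flips;
	unsigned long min = waker_flips < wakee_flips ? waker_flips : wakee_flips;

	if (min < factor)
		return 0;
	return max > min * factor;
}

int main(void)
{
	int factor = 12;	/* hypothetical sd_llc_size */

	/* dispatcher (many flips) waking one of its workers: go wide */
	printf("%d\n", wake_wide_sketch(2500, 15, factor));	/* 1 */
	/* worker waking the dispatcher back: now also goes wide */
	printf("%d\n", wake_wide_sketch(15, 2500, factor));	/* 1 */
	/* two tasks that mostly wake only each other: stay affine */
	printf("%d\n", wake_wide_sketch(40, 40, factor));	/* 0 */
	/* rarely-flipping wakee: stay affine */
	printf("%d\n", wake_wide_sketch(2500, 5, factor));	/* 0 */
	return 0;
}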

2015-07-05 10:29:35

by Mike Galbraith

Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

On Fri, 2015-07-03 at 08:40 +0200, Mike Galbraith wrote:

> Hm. Seems what this load should like best is if we detect 1:N, skip all
> of the routine gyrations, ie move the N (workers) infrequently, expend
> search cycles frequently only on the 1 (dispatch).
>
> Ponder..

Since it was too hot to do outside chores (any excuse will do;)...

If we're (read /me) on track, the below should help. Per my tracing,
it may want a wee bit of toning down actually, though when I trace
virgin source I expect to see the same, namely Xorg and friends having
"wide-load" tattooed across their hindquarters earlier than they should.
It doesn't seem to hurt anything, but then demolishing a single llc box
is a tad more difficult than demolishing a NUMA box.


sched: beef up wake_wide()

Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to wakeup of the 1:N waker,
returning 1 only when waking one of its N minions.

Correct that, and give the user the option to do an expensive balance IFF
select_idle_sibling() doesn't find an idle CPU, and IFF the wakee is
the 1:N dispatcher of work, thus worth some extra effort.

Not-Signed-off-by: Mike Galbraith <[email protected]>
---
kernel/sched/fair.c | 89 +++++++++++++++++++++++++-----------------------
kernel/sched/features.h | 6 +++
2 files changed, 54 insertions(+), 41 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -666,7 +666,7 @@ static u64 sched_vslice(struct cfs_rq *c
}

#ifdef CONFIG_SMP
-static int select_idle_sibling(struct task_struct *p, int cpu);
+static int select_idle_sibling(struct task_struct *p, int cpu, void *clear);
static unsigned long task_h_load(struct task_struct *p);

static inline void __update_task_entity_contrib(struct sched_entity *se);
@@ -1375,7 +1375,7 @@ static void task_numa_compare(struct tas
* Call select_idle_sibling to maybe find a better one.
*/
if (!cur)
- env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
+ env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu, NULL);

assign:
task_numa_assign(env, cur, imp);
@@ -4730,26 +4730,30 @@ static long effective_load(struct task_g

#endif

+/*
+ * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that we are seeing a 1:N
+ * relationship, and that load size exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
- int factor = this_cpu_read(sd_llc_size);
-
- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
+ unsigned long waker_flips = current->wakee_flips;
+ unsigned long wakee_flips = p->wakee_flips;
+ int factor = this_cpu_read(sd_llc_size), ret = 1;
+
+ if (waker_flips < wakee_flips) {
+ swap(waker_flips, wakee_flips);
+ /* Tell the caller that we're waking a 1:N waker */
+ ret += sched_feat(WAKE_WIDE_BALANCE);
}
-
- return 0;
+ if (wakee_flips < factor || waker_flips < wakee_flips * factor)
+ return 0;
+ return ret;
}

static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4765,6 @@ static int wake_affine(struct sched_doma
unsigned long weight;
int balanced;

- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -4935,20 +4932,22 @@ find_idlest_cpu(struct sched_group *grou
/*
* Try and locate an idle CPU in the sched_domain.
*/
-static int select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p, int target, void *clear)
{
struct sched_domain *sd;
struct sched_group *sg;
int i = task_cpu(p);

if (idle_cpu(target))
- return target;
+ goto done;

/*
* If the prevous cpu is cache affine and idle, don't be stupid.
*/
- if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
- return i;
+ if (i != target && cpus_share_cache(i, target) && idle_cpu(i)) {
+ target = i;
+ goto done;
+ }

/*
* Otherwise, iterate the domains and find an elegible idle cpu.
@@ -4973,7 +4972,11 @@ static int select_idle_sibling(struct ta
sg = sg->next;
} while (sg != sd->groups);
}
+ return target;
done:
+ if (clear)
+ *(void **)clear = 0;
+
return target;
}
/*
@@ -5021,14 +5024,19 @@ select_task_rq_fair(struct task_struct *
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
- int want_affine = 0;
+ int new_cpu = prev_cpu;
+ int want_affine = 0, want_balance = 0;
int sync = wake_flags & WF_SYNC;

- if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
-
rcu_read_lock();
+ if (sd_flag & SD_BALANCE_WAKE) {
+ want_affine = wake_wide(p);
+ want_balance = want_affine > 1;
+ want_affine = !want_affine && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ if (!want_affine && !want_balance)
+ goto select;
+ }
+
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
continue;
@@ -5043,23 +5051,23 @@ select_task_rq_fair(struct task_struct *
break;
}

- if (tmp->flags & sd_flag)
+ if (tmp->flags & sd_flag || want_balance)
sd = tmp;
}

if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
+ new_cpu = cpu;

if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
+select:
+ new_cpu = select_idle_sibling(p, new_cpu, &sd);
}

while (sd) {
struct sched_group *group;
int weight;

- if (!(sd->flags & sd_flag)) {
+ if (!(sd->flags & sd_flag) && !want_balance) {
sd = sd->child;
continue;
}
@@ -5089,7 +5097,6 @@ select_task_rq_fair(struct task_struct *
}
/* while loop will break here if sd == NULL */
}
-unlock:
rcu_read_unlock();

return new_cpu;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -96,3 +96,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
*/
SCHED_FEAT(NUMA_RESIST_LOWER, false)
#endif
+
+/*
+ * Perform expensive full wake balance for 1:N wakers when the
+ * selected cpu is not completely idle.
+ */
+SCHED_FEAT(WAKE_WIDE_BALANCE, false)

2015-07-05 08:24:15

by Mike Galbraith

Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

On Sat, 2015-07-04 at 17:57 +0200, Mike Galbraith wrote:

> If we're (read /me) on track, the below should help. Per my tracing,
> it may want a wee bit of toning down actually, though when I trace
> virgin source I expect to see the same, namely Xorg and friends having
> "wide-load" tattooed across their hindquarters earlier than they should.

That's true, the only difference is that the virgin kernel will still
try to pull the multi-waker when it is being awakened, which I can
easily imagine going either way performance wise, depending on a pile of
variables.. so let's just ask your box.

V2: drop select_idle_sibling() changes (ick), try to save some cycles.


sched: beef up wake_wide()

Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to wakeup of the 1:N waker,
returning 1 only when the 1:N waker is waking one of its minions.

Correct that, and give the user the option to do an expensive balance IFF
select_idle_sibling() doesn't find an idle CPU, and IFF the wakee is the
1:N waker, the dispatcher of work, thus worth some extra effort. Don't
drill down through all domains though, stop searching at highest, we'll
either have found the desired completely idle CPU, or if heavily loaded,
the least loaded CPU of the least loaded group, which should still add
up to an average scheduling latency improvement (knock wood).


Not-Signed-off-by: Mike Galbraith <[email protected]>
---
include/linux/sched.h | 7 ++-
kernel/sched/fair.c | 86 ++++++++++++++++++++++++++++--------------------
kernel/sched/features.h | 6 +++
3 files changed, 61 insertions(+), 38 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -962,7 +962,8 @@ extern void wake_up_q(struct wake_q_head
#define SD_BALANCE_EXEC 0x0004 /* Balance on exec */
#define SD_BALANCE_FORK 0x0008 /* Balance on fork, clone */
#define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */
-#define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */
+#define SD_BALANCE_WAKE_IDLE 0x0020 /* Balance on wakeup, searching for an idle CPU */
+#define SD_WAKE_AFFINE 0x0040 /* Wake task to waking CPU */
#define SD_SHARE_CPUCAPACITY 0x0080 /* Domain members share cpu power */
#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */
#define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */
@@ -1353,9 +1354,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
- struct task_struct *last_wakee;
- unsigned long wakee_flips;
+ unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
+ struct task_struct *last_wakee;

int wake_cpu;
#endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4730,26 +4730,30 @@ static long effective_load(struct task_g

#endif

+/*
+ * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that we are seeing a 1:N
+ * relationship, and that load size exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
- int factor = this_cpu_read(sd_llc_size);
-
- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
+ unsigned int waker_flips = current->wakee_flips;
+ unsigned int wakee_flips = p->wakee_flips;
+ int factor = this_cpu_read(sd_llc_size), ret = 1;
+
+ if (waker_flips < wakee_flips) {
+ swap(waker_flips, wakee_flips);
+ /* Tell the caller that we're waking a 1:N waker */
+ ret += sched_feat(WAKE_WIDE_IDLE);
}
-
- return 0;
+ if (wakee_flips < factor || waker_flips < wakee_flips * factor)
+ return 0;
+ return ret;
}

static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4765,6 @@ static int wake_affine(struct sched_doma
unsigned long weight;
int balanced;

- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -4865,6 +4862,10 @@ find_idlest_group(struct sched_domain *s
load = target_load(i, load_idx);

avg_load += load;
+
+ if (sd_flag & SD_BALANCE_WAKE_IDLE && idle_cpu(i) &&
+ cpumask_test_cpu(i, tsk_cpus_allowed(p)))
+ return group;
}

/* Adjust by relative CPU capacity of the group */
@@ -5021,14 +5022,21 @@ select_task_rq_fair(struct task_struct *
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
- int want_affine = 0;
+ int new_cpu = prev_cpu;
+ int want_affine = 0, want_idle = 0;
int sync = wake_flags & WF_SYNC;

- if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
-
rcu_read_lock();
+ if (sd_flag & SD_BALANCE_WAKE) {
+ want_idle = wake_wide(p);
+ want_affine = !want_idle && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ want_idle = want_idle > 1;
+ if (!want_affine && !want_idle)
+ goto select;
+ if (want_idle)
+ sd_flag |= SD_BALANCE_WAKE_IDLE;
+ }
+
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
continue;
@@ -5043,23 +5051,25 @@ select_task_rq_fair(struct task_struct *
break;
}

- if (tmp->flags & sd_flag)
+ if (tmp->flags & sd_flag || want_idle)
sd = tmp;
}

if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
+ new_cpu = cpu;

if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
+select:
+ new_cpu = select_idle_sibling(p, new_cpu);
+ if (want_idle && (new_cpu != prev_cpu || idle_cpu(prev_cpu)))
+ sd = NULL;
}

while (sd) {
struct sched_group *group;
int weight;

- if (!(sd->flags & sd_flag)) {
+ if (!(sd->flags & sd_flag) && !want_idle) {
sd = sd->child;
continue;
}
@@ -5077,6 +5087,13 @@ select_task_rq_fair(struct task_struct *
continue;
}

+ /*
+ * We've either found an idle CPU, or the least loaded CPU in
+ * the least load group of the highest domain. Good enough.
+ */
+ if (want_idle)
+ break;
+
/* Now try balancing at a lower domain level of new_cpu */
cpu = new_cpu;
weight = sd->span_weight;
@@ -5089,7 +5106,6 @@ select_task_rq_fair(struct task_struct *
}
/* while loop will break here if sd == NULL */
}
-unlock:
rcu_read_unlock();

return new_cpu;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -96,3 +96,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
*/
SCHED_FEAT(NUMA_RESIST_LOWER, false)
#endif
+
+/*
+ * Perform expensive full wake balance for 1:N wakers when the
+ * selected cpu is not completely idle.
+ */
+SCHED_FEAT(WAKE_WIDE_IDLE, false)


2015-07-06 05:13:21

by Mike Galbraith

Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

Hm. Piddling with pgbench, which doesn't seem to collapse into a
quivering heap when load exceeds cores these days, deltas weren't all
that impressive, but it does appreciate the extra effort a bit, and a
bit more when clients receive it as well.

If you test, and have time to piddle, you could try letting wake_wide()
return 1 + sched_feat(WAKE_WIDE_IDLE) instead of adding only if wakee is
the dispatcher.
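
In other words, an untested sketch of that variant, relative to the V2
wake_wide() above in this thread:

	static int wake_wide(struct task_struct *p)
	{
		unsigned int waker_flips = current->wakee_flips;
		unsigned int wakee_flips = p->wakee_flips;
		int factor = this_cpu_read(sd_llc_size);

		if (waker_flips < wakee_flips)
			swap(waker_flips, wakee_flips);
		if (wakee_flips < factor || waker_flips < wakee_flips * factor)
			return 0;
		/* request the idle balance for workers as well as the dispatcher */
		return 1 + sched_feat(WAKE_WIDE_IDLE);
	}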

Numbers from my little desktop box.

NO_WAKE_WIDE_IDLE
postgres@homer:~> pgbench.sh
clients 8 tps = 116697.697662
clients 12 tps = 115160.230523
clients 16 tps = 115569.804548
clients 20 tps = 117879.230514
clients 24 tps = 118281.753040
clients 28 tps = 116974.796627
clients 32 tps = 119082.163998 avg 117092.239 1.000

WAKE_WIDE_IDLE
postgres@homer:~> pgbench.sh
clients 8 tps = 124351.735754
clients 12 tps = 124419.673135
clients 16 tps = 125050.716498
clients 20 tps = 124813.042352
clients 24 tps = 126047.442307
clients 28 tps = 125373.719401
clients 32 tps = 126711.243383 avg 125252.510 1.069 1.000

WAKE_WIDE_IDLE (clients as well as server)
postgres@homer:~> pgbench.sh
clients 8 tps = 130539.795246
clients 12 tps = 128984.648554
clients 16 tps = 130564.386447
clients 20 tps = 129149.693118
clients 24 tps = 130211.119780
clients 28 tps = 130325.355433
clients 32 tps = 129585.656963 avg 129908.665 1.109 1.037

2015-07-06 14:35:18

by Josef Bacik

Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

On 07/06/2015 01:13 AM, Mike Galbraith wrote:
> Hm. Piddling with pgbench, which doesn't seem to collapse into a
> quivering heap when load exceeds cores these days, deltas weren't all
> that impressive, but it does appreciate the extra effort a bit, and a
> bit more when clients receive it as well.
>
> If you test, and have time to piddle, you could try letting wake_wide()
> return 1 + sched_feat(WAKE_WIDE_IDLE) instead of adding only if wakee is
> the dispatcher.
>
> Numbers from my little desktop box.
>
> NO_WAKE_WIDE_IDLE
> postgres@homer:~> pgbench.sh
> clients 8 tps = 116697.697662
> clients 12 tps = 115160.230523
> clients 16 tps = 115569.804548
> clients 20 tps = 117879.230514
> clients 24 tps = 118281.753040
> clients 28 tps = 116974.796627
> clients 32 tps = 119082.163998 avg 117092.239 1.000
>
> WAKE_WIDE_IDLE
> postgres@homer:~> pgbench.sh
> clients 8 tps = 124351.735754
> clients 12 tps = 124419.673135
> clients 16 tps = 125050.716498
> clients 20 tps = 124813.042352
> clients 24 tps = 126047.442307
> clients 28 tps = 125373.719401
> clients 32 tps = 126711.243383 avg 125252.510 1.069 1.000
>
> WAKE_WIDE_IDLE (clients as well as server)
> postgres@homer:~> pgbench.sh
> clients 8 tps = 130539.795246
> clients 12 tps = 128984.648554
> clients 16 tps = 130564.386447
> clients 20 tps = 129149.693118
> clients 24 tps = 130211.119780
> clients 28 tps = 130325.355433
> clients 32 tps = 129585.656963 avg 129908.665 1.109 1.037
>

I have time for twiddling, we're carrying ye olde WAKE_IDLE until we get
this solved upstream and then I'll rip out the old and put in the new,
I'm happy to screw around until we're all happy. I'll throw this in a
kernel this morning and run stuff today. Barring any issues with the
testing infrastructure I should have results today. Thanks,

Josef

2015-07-06 18:36:33

by Mike Galbraith

Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

On Mon, 2015-07-06 at 10:34 -0400, Josef Bacik wrote:
> On 07/06/2015 01:13 AM, Mike Galbraith wrote:
> > Hm. Piddling with pgbench, which doesn't seem to collapse into a
> > quivering heap when load exceeds cores these days, deltas weren't all
> > that impressive, but it does appreciate the extra effort a bit, and a
> > bit more when clients receive it as well.
> >
> > If you test, and have time to piddle, you could try letting wake_wide()
> > return 1 + sched_feat(WAKE_WIDE_IDLE) instead of adding only if wakee is
> > the dispatcher.
> >
> > Numbers from my little desktop box.
> >
> > NO_WAKE_WIDE_IDLE
> > postgres@homer:~> pgbench.sh
> > clients 8 tps = 116697.697662
> > clients 12 tps = 115160.230523
> > clients 16 tps = 115569.804548
> > clients 20 tps = 117879.230514
> > clients 24 tps = 118281.753040
> > clients 28 tps = 116974.796627
> > clients 32 tps = 119082.163998 avg 117092.239 1.000
> >
> > WAKE_WIDE_IDLE
> > postgres@homer:~> pgbench.sh
> > clients 8 tps = 124351.735754
> > clients 12 tps = 124419.673135
> > clients 16 tps = 125050.716498
> > clients 20 tps = 124813.042352
> > clients 24 tps = 126047.442307
> > clients 28 tps = 125373.719401
> > clients 32 tps = 126711.243383 avg 125252.510 1.069 1.000
> >
> > WAKE_WIDE_IDLE (clients as well as server)
> > postgres@homer:~> pgbench.sh
> > clients 8 tps = 130539.795246
> > clients 12 tps = 128984.648554
> > clients 16 tps = 130564.386447
> > clients 20 tps = 129149.693118
> > clients 24 tps = 130211.119780
> > clients 28 tps = 130325.355433
> > clients 32 tps = 129585.656963 avg 129908.665 1.109 1.037

I had a typo in my script, so those desktop box numbers were all doing
the same number of clients. It doesn't invalidate anything, but the
individual deltas are just run to run variance.. not to mention that
single cache box is not all that interesting for this anyway. That
happens when interconnect becomes a player.

> I have time for twiddling, we're carrying ye olde WAKE_IDLE until we get
> this solved upstream and then I'll rip out the old and put in the new,
> I'm happy to screw around until we're all happy. I'll throw this in a
> kernel this morning and run stuff today. Barring any issues with the
> testing infrastructure I should have results today. Thanks,

I'll be interested in your results. Taking pgbench to a little NUMA
box, I'm seeing _nada_ outside of variance with master (crap). I have a
way to win significantly for _older_ kernels, and that win over master
_may_ provide some useful insight, but I don't trust postgres/pgbench as
far as I can toss the planet, so don't have a warm fuzzy about trying to
use it to approximate your real world load.

BTW, what's your topology look like (numactl --hardware).

-Mike

2015-07-06 19:41:40

by Josef Bacik

Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

On 07/06/2015 02:36 PM, Mike Galbraith wrote:
> On Mon, 2015-07-06 at 10:34 -0400, Josef Bacik wrote:
>> On 07/06/2015 01:13 AM, Mike Galbraith wrote:
>>> Hm. Piddling with pgbench, which doesn't seem to collapse into a
>>> quivering heap when load exceeds cores these days, deltas weren't all
>>> that impressive, but it does appreciate the extra effort a bit, and a
>>> bit more when clients receive it as well.
>>>
>>> If you test, and have time to piddle, you could try letting wake_wide()
>>> return 1 + sched_feat(WAKE_WIDE_IDLE) instead of adding only if wakee is
>>> the dispatcher.
>>>
>>> Numbers from my little desktop box.
>>>
>>> NO_WAKE_WIDE_IDLE
>>> postgres@homer:~> pgbench.sh
>>> clients 8 tps = 116697.697662
>>> clients 12 tps = 115160.230523
>>> clients 16 tps = 115569.804548
>>> clients 20 tps = 117879.230514
>>> clients 24 tps = 118281.753040
>>> clients 28 tps = 116974.796627
>>> clients 32 tps = 119082.163998 avg 117092.239 1.000
>>>
>>> WAKE_WIDE_IDLE
>>> postgres@homer:~> pgbench.sh
>>> clients 8 tps = 124351.735754
>>> clients 12 tps = 124419.673135
>>> clients 16 tps = 125050.716498
>>> clients 20 tps = 124813.042352
>>> clients 24 tps = 126047.442307
>>> clients 28 tps = 125373.719401
>>> clients 32 tps = 126711.243383 avg 125252.510 1.069 1.000
>>>
>>> WAKE_WIDE_IDLE (clients as well as server)
>>> postgres@homer:~> pgbench.sh
>>> clients 8 tps = 130539.795246
>>> clients 12 tps = 128984.648554
>>> clients 16 tps = 130564.386447
>>> clients 20 tps = 129149.693118
>>> clients 24 tps = 130211.119780
>>> clients 28 tps = 130325.355433
>>> clients 32 tps = 129585.656963 avg 129908.665 1.109 1.037
>
> I had a typo in my script, so those desktop box numbers were all doing
> the same number of clients. It doesn't invalidate anything, but the
> individual deltas are just run to run variance.. not to mention that
> single cache box is not all that interesting for this anyway. That
> happens when interconnect becomes a player.
>
>> I have time for twiddling, we're carrying ye olde WAKE_IDLE until we get
>> this solved upstream and then I'll rip out the old and put in the new,
>> I'm happy to screw around until we're all happy. I'll throw this in a
>> kernel this morning and run stuff today. Barring any issues with the
>> testing infrastructure I should have results today. Thanks,
>
> I'll be interested in your results. Taking pgbench to a little NUMA
> box, I'm seeing _nada_ outside of variance with master (crap). I have a
> way to win significantly for _older_ kernels, and that win over master
> _may_ provide some useful insight, but I don't trust postgres/pgbench as
> far as I can toss the planet, so don't have a warm fuzzy about trying to
> use it to approximate your real world load.
>
> BTW, what's your topology look like (numactl --hardware).
>

So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the
baseline with a slight regression at lower RPS and a slight improvement
at high RPS. I'm running with WAKE_WIDE_IDLE set now, that should be
done soonish and then I'll do the 1 + sched_feat(WAKE_WIDE_IDLE) thing
next and those results should come in the morning. Here is the numa
information from one of the boxes in the test cluster

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
node 0 size: 15890 MB
node 0 free: 2651 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
node 1 size: 16125 MB
node 1 free: 2063 MB
node distances:
node 0 1
0: 10 20
1: 20 10

Thanks,

Josef

2015-07-07 04:01:34

by Mike Galbraith

Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

On Mon, 2015-07-06 at 15:41 -0400, Josef Bacik wrote:

> So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the
> baseline with a slight regression at lower RPS and a slight improvement
> at high RPS.

Good. I can likely drop the rest then (I like dinky, so do CPUs;). I'm
not real keen on the feature unless your numbers are really good, and
odds are that ain't gonna happen.

-Mike

2015-07-07 09:43:56

by Mike Galbraith

Subject: [patch] Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

On Tue, 2015-07-07 at 06:01 +0200, Mike Galbraith wrote:
> On Mon, 2015-07-06 at 15:41 -0400, Josef Bacik wrote:
>
> > So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the
> > baseline with a slight regression at lower RPS and a slight improvement
> > at high RPS.
>
> Good. I can likely drop the rest then (I like dinky, so do CPUs;). I'm
> not real keen on the feature unless your numbers are really good, and
> odds are that ain't gonna happen.

More extensive testing in pedantic-man mode increased my confidence of
that enough to sign off and ship the dirt simple version. Any further
twiddles should grow their own wings if they want to fly anyway, the
simplest form helps your real world load, as well as the not so real
pgbench, my numbers for that below.

virgin master, 2 socket box
postgres@nessler:~> pgbench.sh
clients 12 tps = 96233.854271 1.000
clients 24 tps = 142234.686166 1.000
clients 36 tps = 148433.534531 1.000
clients 48 tps = 133105.634302 1.000
clients 60 tps = 128903.080371 1.000
clients 72 tps = 128591.821782 1.000
clients 84 tps = 114445.967116 1.000
clients 96 tps = 109803.557524 1.000 avg 125219.017 1.000

V3 (KISS, below)
postgres@nessler:~> pgbench.sh
clients 12 tps = 120793.023637 1.255
clients 24 tps = 144668.961468 1.017
clients 36 tps = 156705.239251 1.055
clients 48 tps = 152004.886893 1.141
clients 60 tps = 138582.113864 1.075
clients 72 tps = 136286.891104 1.059
clients 84 tps = 137420.986043 1.200
clients 96 tps = 135199.060242 1.231 avg 140207.645 1.119 1.000

V2 NO_WAKE_WIDE_IDLE
postgres@nessler:~> pgbench.sh
clients 12 tps = 121821.966162 1.265
clients 24 tps = 146446.388366 1.029
clients 36 tps = 151373.362190 1.019
clients 48 tps = 156806.730746 1.178
clients 60 tps = 133933.491567 1.039
clients 72 tps = 131460.489424 1.022
clients 84 tps = 130859.340261 1.143
clients 96 tps = 130787.476584 1.191 avg 137936.155 1.101 0.983

V2 WAKE_WIDE_IDLE (crawl in a hole feature, you're dead)
postgres@nessler:~> pgbench.sh
clients 12 tps = 121297.791570
clients 24 tps = 145939.488312
clients 36 tps = 155336.090263
clients 48 tps = 149018.245323
clients 60 tps = 136730.079391
clients 72 tps = 134886.116831
clients 84 tps = 130493.283398
clients 96 tps = 126043.336074


sched: beef up wake_wide()

Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to wakeup of the 1:N waker,
returning 1 only when the 1:N waker is waking one of its minions.

Correct that, and don't bother doing domain traversal when we know
that all we need to do is check for an idle cpu.

Signed-off-by: Mike Galbraith <[email protected]>
---
include/linux/sched.h | 4 +--
kernel/sched/fair.c | 56 ++++++++++++++++++++++++--------------------------
2 files changed, 29 insertions(+), 31 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1351,9 +1351,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
- struct task_struct *last_wakee;
- unsigned long wakee_flips;
+ unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
+ struct task_struct *last_wakee;

int wake_cpu;
#endif
Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -4730,26 +4730,27 @@ static long effective_load(struct task_g

#endif

+/*
+ * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that we are seeing a 1:N
+ * relationship, and that load size exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
+ unsigned int waker_flips = current->wakee_flips;
+ unsigned int wakee_flips = p->wakee_flips;
int factor = this_cpu_read(sd_llc_size);

- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
- }
-
- return 0;
+ if (waker_flips < wakee_flips)
+ swap(waker_flips, wakee_flips);
+ if (wakee_flips < factor || waker_flips < wakee_flips * factor)
+ return 0;
+ return 1;
}

static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4762,6 @@ static int wake_affine(struct sched_doma
unsigned long weight;
int balanced;

- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
+ int new_cpu = prev_cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;

- if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
-
rcu_read_lock();
+ if (sd_flag & SD_BALANCE_WAKE) {
+ want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ if (!want_affine)
+ goto select_idle;
+ }
+
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
continue;
@@ -5048,10 +5045,11 @@ select_task_rq_fair(struct task_struct *
}

if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
+ new_cpu = cpu;

if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
+select_idle:
+ new_cpu = select_idle_sibling(p, new_cpu);
goto unlock;
}


2015-07-07 13:41:06

by Josef Bacik

Subject: Re: [patch] Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

On 07/07/2015 05:43 AM, Mike Galbraith wrote:
> On Tue, 2015-07-07 at 06:01 +0200, Mike Galbraith wrote:
>> On Mon, 2015-07-06 at 15:41 -0400, Josef Bacik wrote:
>>
>>> So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the
>>> baseline with a slight regression at lower RPS and a slight improvement
>>> at high RPS.
>>
>> Good. I can likely drop the rest then (I like dinky, so do CPUs;). I'm
>> not real keen on the feature unless your numbers are really good, and
>> odds are that ain't gonna happen.
>
> More extensive testing in pedantic-man mode increased my confidence of
> that enough to sign off and ship the dirt simple version. Any further
> twiddles should grow their own wings if they want to fly anyway, the
> simplest form helps your real world load, as well as the not so real
> pgbench, my numbers for that below.
>

The WAKE_WIDE_IDLE run was basically the same so I'm good with the KISS
version. I'll run that through the load tests this morning and let you
know how it goes. I'm still seeing a slight regression at lower RPS,
but it's like 1-2%, compared to ~15%. Once load ramps up we're good to
go, not sure why that is but it may also be my sample size (the cluster
is only 32 boxes, the minimum for decent results). Thanks,

Josef

2015-07-07 15:25:13

by Mike Galbraith

Subject: Re: [patch] Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

On Tue, 2015-07-07 at 09:40 -0400, Josef Bacik wrote:

> The WAKE_WIDE_IDLE run was basically the same so I'm good with the KISS
> version. I'll run that through the load tests this morning and let you
> know how it goes. I'm still seeing a slight regression at lower RPS,
> but it's like 1-2%, compared to ~15%.

If I'm to believe pgbench, and configs really are as close as they
should be, there may be something between 3.12 and master which could
account for your delta and then some. I may go hunting.. heck, if I
trusted pgbench I would be hunting.

patched SLE (resembles 3.12 if you squint)
postgres@nessler:~> pgbench.sh
clients 12 tps = 134795.901777
clients 24 tps = 156955.584523
clients 36 tps = 161450.044316
clients 48 tps = 171181.069044
clients 60 tps = 157896.858179
clients 72 tps = 155816.051709
clients 84 tps = 152456.003468
clients 96 tps = 121129.848008

patched master
postgres@nessler:~> pgbench.sh
clients 12 tps = 120793.023637
clients 24 tps = 144668.961468
clients 36 tps = 156705.239251
clients 48 tps = 152004.886893
clients 60 tps = 138582.113864
clients 72 tps = 136286.891104
clients 84 tps = 137420.986043
clients 96 tps = 135199.060242

2015-07-07 17:07:11

by Josef Bacik

Subject: Re: [patch] Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

On 07/07/2015 05:43 AM, Mike Galbraith wrote:
> On Tue, 2015-07-07 at 06:01 +0200, Mike Galbraith wrote:
>> On Mon, 2015-07-06 at 15:41 -0400, Josef Bacik wrote:
>>
>>> So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the
>>> baseline with a slight regression at lower RPS and a slight improvement
>>> at high RPS.
>>
>> Good. I can likely drop the rest then (I like dinky, so do CPUs;). I'm
>> not real keen on the feature unless your numbers are really good, and
>> odds are that ain't gonna happen.
>
> More extensive testing in pedantic-man mode increased my confidence of
> that enough to sign off and ship the dirt simple version. Any further
> twiddles should grow their own wings if they want to fly anyway, the
> simplest form helps your real world load, as well as the not so real
> pgbench, my numbers for that below.
>
> virgin master, 2 socket box
> postgres@nessler:~> pgbench.sh
> clients 12 tps = 96233.854271 1.000
> clients 24 tps = 142234.686166 1.000
> clients 36 tps = 148433.534531 1.000
> clients 48 tps = 133105.634302 1.000
> clients 60 tps = 128903.080371 1.000
> clients 72 tps = 128591.821782 1.000
> clients 84 tps = 114445.967116 1.000
> clients 96 tps = 109803.557524 1.000 avg 125219.017 1.000
>
> V3 (KISS, below)
> postgres@nessler:~> pgbench.sh
> clients 12 tps = 120793.023637 1.255
> clients 24 tps = 144668.961468 1.017
> clients 36 tps = 156705.239251 1.055
> clients 48 tps = 152004.886893 1.141
> clients 60 tps = 138582.113864 1.075
> clients 72 tps = 136286.891104 1.059
> clients 84 tps = 137420.986043 1.200
> clients 96 tps = 135199.060242 1.231 avg 140207.645 1.119 1.000
>
> V2 NO_WAKE_WIDE_IDLE
> postgres@nessler:~> pgbench.sh
> clients 12 tps = 121821.966162 1.265
> clients 24 tps = 146446.388366 1.029
> clients 36 tps = 151373.362190 1.019
> clients 48 tps = 156806.730746 1.178
> clients 60 tps = 133933.491567 1.039
> clients 72 tps = 131460.489424 1.022
> clients 84 tps = 130859.340261 1.143
> clients 96 tps = 130787.476584 1.191 avg 137936.155 1.101 0.983
>
> V2 WAKE_WIDE_IDLE (crawl in a hole feature, you're dead)
> postgres@nessler:~> pgbench.sh
> clients 12 tps = 121297.791570
> clients 24 tps = 145939.488312
> clients 36 tps = 155336.090263
> clients 48 tps = 149018.245323
> clients 60 tps = 136730.079391
> clients 72 tps = 134886.116831
> clients 84 tps = 130493.283398
> clients 96 tps = 126043.336074
>
>
> sched: beef up wake_wide()
>
> Josef Bacik reported that Facebook sees better performance with their
> 1:N load (1 dispatch/node, N workers/node) when carrying an old patch
> to try very hard to wake to an idle CPU. While looking at wake_wide(),
> I noticed that it doesn't pay attention to wakeup of the 1:N waker,
> returning 1 only when the 1:N waker is waking one of its minions.
>
> Correct that, and don't bother doing domain traversal when we know
> that all we need to do is check for an idle cpu.
>
> Signed-off-by: Mike Galbraith <[email protected]>

Ok you can add

Tested-by: Josef Bacik <[email protected]>

The new patch is the best across the board: there are no regressions, and
about a 5% improvement for the bulk of the run (25th percentile and up).
Thanks for your help on this!

Josef

2015-07-08 06:13:57

by Mike Galbraith

[permalink] [raw]
Subject: [patch] sched: beef up wake_wide()


Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to wakeup of the 1:N waker,
returning 1 only when the 1:N waker is waking one of its minions.

Correct that, and don't bother doing domain traversal when we know
that all we need to do is check for an idle cpu.

While at it, adjust task_struct bits, we don't need a 64bit counter.

Signed-off-by: Mike Galbraith <[email protected]>
Tested-by: Josef Bacik <[email protected]>
---
include/linux/sched.h | 4 +--
kernel/sched/fair.c | 56 ++++++++++++++++++++++++--------------------------
2 files changed, 29 insertions(+), 31 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1351,9 +1351,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
- struct task_struct *last_wakee;
- unsigned long wakee_flips;
+ unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
+ struct task_struct *last_wakee;

int wake_cpu;
#endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4730,26 +4730,27 @@ static long effective_load(struct task_g

#endif

+/*
+ * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolodating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of lls_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that we are seeing a 1:N
+ * relationship, and that load size exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
+ unsigned int waker_flips = current->wakee_flips;
+ unsigned int wakee_flips = p->wakee_flips;
int factor = this_cpu_read(sd_llc_size);

- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
- }
-
- return 0;
+ if (waker_flips < wakee_flips)
+ swap(waker_flips, wakee_flips);
+ if (wakee_flips < factor || waker_flips < wakee_flips * factor)
+ return 0;
+ return 1;
}

static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4762,6 @@ static int wake_affine(struct sched_doma
unsigned long weight;
int balanced;

- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
+ int new_cpu = prev_cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;

- if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
-
rcu_read_lock();
+ if (sd_flag & SD_BALANCE_WAKE) {
+ want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ if (!want_affine)
+ goto select_idle;
+ }
+
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
continue;
@@ -5048,10 +5045,11 @@ select_task_rq_fair(struct task_struct *
}

if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
+ new_cpu = cpu;

if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
+select_idle:
+ new_cpu = select_idle_sibling(p, new_cpu);
goto unlock;
}
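
For illustration only, a minimal user-space sketch of the flip comparison
above: the flip counts, the llc_size of 6 and the helper name
wake_wide_sketch() are made up for the example, but the arithmetic is the
same as in the patch. A dispatcher that has flipped 150 times against a
worker that has flipped 8 times trips both thresholds and gets spread; a
plain 1:1 pair never does.

#include <stdio.h>

/* Same decision as the patched wake_wide(), with the kernel state
 * replaced by plain function arguments. */
static int wake_wide_sketch(unsigned int waker_flips,
			    unsigned int wakee_flips,
			    unsigned int factor /* llc_size */)
{
	/* Order the pair so waker_flips holds the larger (1:N) side. */
	if (waker_flips < wakee_flips) {
		unsigned int tmp = waker_flips;
		waker_flips = wakee_flips;
		wakee_flips = tmp;
	}
	/* Both partners must flip at least llc_size times, and the busy
	 * side roughly llc_size times more often, to justify spreading. */
	if (wakee_flips < factor || waker_flips < wakee_flips * factor)
		return 0;
	return 1;
}

int main(void)
{
	/* Hypothetical 1:24 dispatcher/worker pattern, llc_size = 6. */
	printf("dispatcher(150) vs worker(8), llc_size 6 -> %d\n",
	       wake_wide_sketch(150, 8, 6));	/* 1: wake wide */
	/* A 1:1 pair never reaches the thresholds. */
	printf("1:1 pair, 4 vs 3 flips, llc_size 6       -> %d\n",
	       wake_wide_sketch(4, 3, 6));	/* 0: stay affine */
	return 0;
}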


2015-07-09 13:27:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] sched: beef up wake_wide()

On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
>
> +/*
> + * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
> + * A waker of many should wake a different task than the one last awakened
> + * at a frequency roughly N times higher than one of its wakees. In order
> + * to determine whether we should let the load spread vs consolodating to
> + * shared cache, we look for a minimum 'flip' frequency of llc_size in one
> + * partner, and a factor of lls_size higher frequency in the other. With
> + * both conditions met, we can be relatively sure that we are seeing a 1:N
> + * relationship, and that load size exceeds socket size.
> + */
> static int wake_wide(struct task_struct *p)
> {
> + unsigned int waker_flips = current->wakee_flips;
> + unsigned int wakee_flips = p->wakee_flips;
> int factor = this_cpu_read(sd_llc_size);
>
> + if (waker_flips < wakee_flips)
> + swap(waker_flips, wakee_flips);

This makes the wakee/waker names useless, the end result is more like
wakee_flips := client_flips, waker_flips := server_flips.

> + if (wakee_flips < factor || waker_flips < wakee_flips * factor)
> + return 0;

I don't get the first condition... why would the client ever flip? It
only talks to that one server.

> + return 1;
> }


> @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
> {
> struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
> int cpu = smp_processor_id();
> + int new_cpu = prev_cpu;
> int want_affine = 0;
> int sync = wake_flags & WF_SYNC;
>
> rcu_read_lock();
> + if (sd_flag & SD_BALANCE_WAKE) {
> + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> + if (!want_affine)
> + goto select_idle;
> + }

So this preserves/makes worse the bug Morten spotted, even without
want_affine we should still attempt SD_BALANCE_WAKE if set.

> +
> for_each_domain(cpu, tmp) {
> if (!(tmp->flags & SD_LOAD_BALANCE))
> continue;

2015-07-09 14:07:36

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] sched: beef up wake_wide()

On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote:
> On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
> >
> > +/*
> > + * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
> > + * A waker of many should wake a different task than the one last awakened
> > + * at a frequency roughly N times higher than one of its wakees. In order
> > + * to determine whether we should let the load spread vs consolodating to
> > + * shared cache, we look for a minimum 'flip' frequency of llc_size in one
> > + * partner, and a factor of lls_size higher frequency in the other. With
> > + * both conditions met, we can be relatively sure that we are seeing a 1:N
> > + * relationship, and that load size exceeds socket size.
> > + */
> > static int wake_wide(struct task_struct *p)
> > {
> > + unsigned int waker_flips = current->wakee_flips;
> > + unsigned int wakee_flips = p->wakee_flips;
> > int factor = this_cpu_read(sd_llc_size);
> >
> > + if (waker_flips < wakee_flips)
> > + swap(waker_flips, wakee_flips);
>
> This makes the wakee/waker names useless, the end result is more like
> wakee_flips := client_flips, waker_flips := server_flips.

True, perhaps a rename is in order.

> > + if (wakee_flips < factor || waker_flips < wakee_flips * factor)
> > + return 0;
>
> I don't get the first condition... why would the client ever flip? It
> only talks to that one server.

So I was thinking too, and I initially cemented the relationship by
flipping both. However, the thing works in virgin source, ie clients do
in fact flip, so I removed that cementing based on the hard evidence.

> > @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
> > {
> > struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
> > int cpu = smp_processor_id();
> > + int new_cpu = prev_cpu;
> > int want_affine = 0;
> > int sync = wake_flags & WF_SYNC;
> >
> > rcu_read_lock();
> > + if (sd_flag & SD_BALANCE_WAKE) {
> > + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> > + if (!want_affine)
> > + goto select_idle;
> > + }
>
> So this preserves/makes worse the bug Morten spotted, even without
> want_affine we should still attempt SD_BALANCE_WAKE if set.

Yeah. I can redo it if you want, but it seems a shame to traverse for
nothing given we know SD_BALANCE_WAKE is so painful that nobody really
really wants to do that. One has to override the other in any case, no?

-Mike

2015-07-09 14:46:21

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] sched: beef up wake_wide()

On Thu, 2015-07-09 at 16:07 +0200, Mike Galbraith wrote:
> On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote:
> > On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
> > >
> > > +/*
> > > + * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
> > > + * A waker of many should wake a different task than the one last awakened
> > > + * at a frequency roughly N times higher than one of its wakees. In order
> > > + * to determine whether we should let the load spread vs consolodating to
> > > + * shared cache, we look for a minimum 'flip' frequency of llc_size in one
> > > + * partner, and a factor of lls_size higher frequency in the other. With
> > > + * both conditions met, we can be relatively sure that we are seeing a 1:N
> > > + * relationship, and that load size exceeds socket size.
> > > + */
> > > static int wake_wide(struct task_struct *p)
> > > {
> > > + unsigned int waker_flips = current->wakee_flips;
> > > + unsigned int wakee_flips = p->wakee_flips;
> > > int factor = this_cpu_read(sd_llc_size);
> > >
> > > + if (waker_flips < wakee_flips)
> > > + swap(waker_flips, wakee_flips);
> >
> > This makes the wakee/waker names useless, the end result is more like
> > wakee_flips := client_flips, waker_flips := server_flips.
>
> True, perhaps a rename is in order.

I should perhaps add that waker/wakee_flips does make sense to me in the
sense that the task whose flips end up in the waker_flips bucket is our
waker of many vs being one of its many wakees.

-Mike

2015-07-10 05:19:38

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] sched: beef up wake_wide()

On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote:
> On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
> > static int wake_wide(struct task_struct *p)
> > {
> > + unsigned int waker_flips = current->wakee_flips;
> > + unsigned int wakee_flips = p->wakee_flips;
> > int factor = this_cpu_read(sd_llc_size);
> >
> > + if (waker_flips < wakee_flips)
> > + swap(waker_flips, wakee_flips);
>
> This makes the wakee/waker names useless, the end result is more like
> wakee_flips := client_flips, waker_flips := server_flips.

I settled on master/slave plus a hopefully improved comment block.

> > + if (wakee_flips < factor || waker_flips < wakee_flips * factor)
> > + return 0;
>
> I don't get the first condition... why would the client ever flip? It
> only talks to that one server.

(tightening the heuristic up a bit by one means or another would be good,
but "if it ain't broke, don't fix it" applies for this patchlet)

> > @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
> > {
> > struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
> > int cpu = smp_processor_id();
> > + int new_cpu = prev_cpu;
> > int want_affine = 0;
> > int sync = wake_flags & WF_SYNC;
> >
> > rcu_read_lock();
> > + if (sd_flag & SD_BALANCE_WAKE) {
> > + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> > + if (!want_affine)
> > + goto select_idle;
> > + }
>
> So this preserves/makes worse the bug Morten spotted, even without
> want_affine we should still attempt SD_BALANCE_WAKE if set.

Fixed. wake_wide() may override want_affine as before, want_affine may
override other ->flags as before, but a surviving domain selection now
results in a full balance instead of a select_idle_sibling() call.

sched: beef up wake_wide()

Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to the wakeup of a many partner
waker, returning 1 only when waking one of its many partners.

Correct that, letting explicit domain flags override the heuristic.

While at it, adjust task_struct bits, we don't need a 64bit counter.

Signed-off-by: Mike Galbraith <[email protected]>
Tested-by: Josef Bacik <[email protected]>
---
include/linux/sched.h | 4 +--
kernel/sched/fair.c | 57 ++++++++++++++++++++++----------------------------
2 files changed, 28 insertions(+), 33 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1351,9 +1351,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
- struct task_struct *last_wakee;
- unsigned long wakee_flips;
+ unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
+ struct task_struct *last_wakee;

int wake_cpu;
#endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4730,26 +4730,29 @@ static long effective_load(struct task_g

#endif

+/*
+ * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolodating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of lls_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that the relationship is
+ * non-monogamous, with partner count exceeding socket size. Waker/wakee
+ * being client/server, worker/dispatcher, interrupt source or whatever is
+ * irrelevant, spread criteria is apparent partner count exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
+ unsigned int master = current->wakee_flips;
+ unsigned int slave = p->wakee_flips;
int factor = this_cpu_read(sd_llc_size);

- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
- }
-
- return 0;
+ if (master < slave)
+ swap(master, slave);
+ if (slave < factor || master < slave * factor)
+ return 0;
+ return 1;
}

static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4764,6 @@ static int wake_affine(struct sched_doma
unsigned long weight;
int balanced;

- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -5021,12 +5017,12 @@ select_task_rq_fair(struct task_struct *
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
+ int new_cpu = prev_cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;

if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));

rcu_read_lock();
for_each_domain(cpu, tmp) {
@@ -5040,6 +5036,8 @@ select_task_rq_fair(struct task_struct *
if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
affine_sd = tmp;
+ /* Prefer affinity over any other flags */
+ sd = NULL;
break;
}

@@ -5048,12 +5046,10 @@ select_task_rq_fair(struct task_struct *
}

if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
+ new_cpu = cpu;

- if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
- }
+ if ((sd_flag & SD_BALANCE_WAKE) && (!sd || (!(sd->flags & SD_BALANCE_WAKE))))
+ new_cpu = select_idle_sibling(p, new_cpu);

while (sd) {
struct sched_group *group;
@@ -5089,7 +5085,6 @@ select_task_rq_fair(struct task_struct *
}
/* while loop will break here if sd == NULL */
}
-unlock:
rcu_read_unlock();

return new_cpu;
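
A condensed, non-kernel sketch of that precedence, with stub predicates
standing in for wake_wide(), the cpumask test and the domain walk (the stub
names and their return values are invented for the example); only the order
of the checks mirrors the patch above.

#include <stdbool.h>
#include <stdio.h>

/* Stubs standing in for the kernel helpers; return values are arbitrary
 * and only exist to exercise the ordering of the checks. */
static bool wake_wide_p(void)			{ return true;  } /* 1:N detected */
static bool cpu_allowed_p(void)			{ return true;  }
static bool affine_domain_found(bool want_affine) { return want_affine; }
static bool wake_balance_domain_found(void)	{ return false; }

static const char *select_path(void)
{
	/* wake_wide() may veto want_affine. */
	bool want_affine = !wake_wide_p() && cpu_allowed_p();

	/* Affinity, when wanted and available, trumps the balance flags. */
	if (affine_domain_found(want_affine))
		return "wake_affine() + select_idle_sibling() near the waker";
	/* A surviving balance domain means a full balance pass. */
	if (wake_balance_domain_found())
		return "full balance over the surviving domain";
	/* Otherwise look for an idle cpu around prev_cpu. */
	return "select_idle_sibling() around prev_cpu";
}

int main(void)
{
	printf("chosen path: %s\n", select_path());
	return 0;
}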

2015-07-10 13:42:53

by Josef Bacik

[permalink] [raw]
Subject: Re: [patch] sched: beef up wake_wide()

On 07/10/2015 01:19 AM, Mike Galbraith wrote:
> On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote:
>> On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
>>> static int wake_wide(struct task_struct *p)
>>> {
>>> + unsigned int waker_flips = current->wakee_flips;
>>> + unsigned int wakee_flips = p->wakee_flips;
>>> int factor = this_cpu_read(sd_llc_size);
>>>
>>> + if (waker_flips < wakee_flips)
>>> + swap(waker_flips, wakee_flips);
>>
>> This makes the wakee/waker names useless, the end result is more like
>> wakee_flips := client_flips, waker_flips := server_flips.
>
> I settled on master/slave plus a hopefully improved comment block.
>
>>> + if (wakee_flips < factor || waker_flips < wakee_flips * factor)
>>> + return 0;
>>
>> I don't get the first condition... why would the client ever flip? It
>> only talks to that one server.
>
> (tightening the heuristic up a bit by one means or another would be good,
> but "if it ain't broke, don't fix it" applies for this patchlet)
>
>>> @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
>>> {
>>> struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
>>> int cpu = smp_processor_id();
>>> + int new_cpu = prev_cpu;
>>> int want_affine = 0;
>>> int sync = wake_flags & WF_SYNC;
>>>
>>> rcu_read_lock();
>>> + if (sd_flag & SD_BALANCE_WAKE) {
>>> + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>>> + if (!want_affine)
>>> + goto select_idle;
>>> + }
>>
>> So this preserves/makes worse the bug Morten spotted, even without
>> want_affine we should still attempt SD_BALANCE_WAKE if set.
>
> Fixed. wake_wide() may override want_affine as before, want_affine may
> override other ->flags as before, but a surviving domain selection now
> results in a full balance instead of a select_idle_sibling() call.
>
> sched: beef up wake_wide()
>
> Josef Bacik reported that Facebook sees better performance with their
> 1:N load (1 dispatch/node, N workers/node) when carrying an old patch
> to try very hard to wake to an idle CPU. While looking at wake_wide(),
> I noticed that it doesn't pay attention to the wakeup of a many partner
> waker, returning 1 only when waking one of its many partners.
>
> Correct that, letting explicit domain flags override the heuristic.
>
> While at it, adjust task_struct bits, we don't need a 64bit counter.
>
> Signed-off-by: Mike Galbraith <[email protected]>
> Tested-by: Josef Bacik <[email protected]>


I'll give this new one a whirl and let you know how it goes. Thanks,

Josef

2015-07-10 21:00:42

by Josef Bacik

[permalink] [raw]
Subject: Re: [patch] sched: beef up wake_wide()

On 07/10/2015 01:19 AM, Mike Galbraith wrote:
> On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote:
>> On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
>>> static int wake_wide(struct task_struct *p)
>>> {
>>> + unsigned int waker_flips = current->wakee_flips;
>>> + unsigned int wakee_flips = p->wakee_flips;
>>> int factor = this_cpu_read(sd_llc_size);
>>>
>>> + if (waker_flips < wakee_flips)
>>> + swap(waker_flips, wakee_flips);
>>
>> This makes the wakee/waker names useless, the end result is more like
>> wakee_flips := client_flips, waker_flips := server_flips.
>
> I settled on master/slave plus a hopefully improved comment block.
>
>>> + if (wakee_flips < factor || waker_flips < wakee_flips * factor)
>>> + return 0;
>>
>> I don't get the first condition... why would the client ever flip? It
>> only talks to that one server.
>
> (tightening the heuristic up a bit by one means or another would be good,
> but "if it ain't broke, don't fix it" applies for this patchlet)
>
>>> @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
>>> {
>>> struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
>>> int cpu = smp_processor_id();
>>> + int new_cpu = prev_cpu;
>>> int want_affine = 0;
>>> int sync = wake_flags & WF_SYNC;
>>>
>>> rcu_read_lock();
>>> + if (sd_flag & SD_BALANCE_WAKE) {
>>> + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>>> + if (!want_affine)
>>> + goto select_idle;
>>> + }
>>
>> So this preserves/makes worse the bug Morten spotted, even without
>> want_affine we should still attempt SD_BALANCE_WAKE if set.
>
> Fixed. wake_wide() may override want_affine as before, want_affine may
> override other ->flags as before, but a surviving domain selection now
> results in a full balance instead of a select_idle_sibling() call.
>
> sched: beef up wake_wide()
>
> Josef Bacik reported that Facebook sees better performance with their
> 1:N load (1 dispatch/node, N workers/node) when carrying an old patch
> to try very hard to wake to an idle CPU. While looking at wake_wide(),
> I noticed that it doesn't pay attention to the wakeup of a many partner
> waker, returning 1 only when waking one of its many partners.
>
> Correct that, letting explicit domain flags override the heuristic.
>
> While at it, adjust task_struct bits, we don't need a 64bit counter.
>
> Signed-off-by: Mike Galbraith <[email protected]>
> Tested-by: Josef Bacik <[email protected]>

Not quite as awesome but still better than the baseline so we're good.
Thanks,

Josef

2015-07-11 06:10:28

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] sched: beef up wake_wide()

On Fri, 2015-07-10 at 16:59 -0400, Josef Bacik wrote:

> Not quite as awesome but still better than the baseline so we're good.
> Thanks,

I personally like the other much better. We're not doing the user any
favor by making the thing balance when SD_BALANCE_WAKE is set. Until
such time as it ceases to be horribly painful, we're just paying a few
cycles to help a theoretically existing masochist torture his box.

-Mike

2015-07-13 13:53:50

by Josef Bacik

[permalink] [raw]
Subject: Re: [patch] sched: beef up wake_wide()

On 07/10/2015 11:11 PM, Mike Galbraith wrote:
> On Fri, 2015-07-10 at 16:59 -0400, Josef Bacik wrote:
>
>> Not quite as awesome but still better than the baseline so we're good.
>> Thanks,
>
> I personally like the other much better. We're not doing the user any
> favor by making the thing balance when SD_BALANCE_WAKE is set. Until
> such time as it ceases to be horribly painful, we're just paying a few
> cycles to help a theoretically existing masochist torture his box.
>

Agreed, I'm all for the faster version ;),

Josef

2015-07-14 11:19:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] sched: beef up wake_wide()

On Sat, Jul 11, 2015 at 05:11:51AM +0200, Mike Galbraith wrote:
> On Fri, 2015-07-10 at 16:59 -0400, Josef Bacik wrote:
>
> > Not quite as awesome but still better than the baseline so we're good.
> > Thanks,
>
> I personally like the other much better. We're not doing the user any
> favor by making the thing balance when SD_BALANCE_WAKE is set. Until
> such time as it ceases to be horribly painful, we're just paying a few
> cycles to help a theoretically existing masochist torture his box.

OK, how about something like the below; it tightens things up by
applying two rules:

- We really should not continue looking for a balancing domain once
SD_LOAD_BALANCE is not set.

- SD (balance) flags should really be set in a single contiguous range,
always starting at the bottom.

The latter means that if !want_affine and the (first) sd doesn't have
BALANCE_WAKE set, we're done. Getting rid of (most of) that iteration
junk you didn't like..

Hmm?

---
Subject: sched: Beef up wake_wide()
From: Mike Galbraith <[email protected]>
Date: Fri, 10 Jul 2015 07:19:26 +0200

Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to the wakeup of a many partner
waker, returning 1 only when waking one of its many partners.

Correct that, letting explicit domain flags override the heuristic.

While at it, adjust task_struct bits, we don't need a 64bit counter.

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Mike Galbraith <[email protected]>
Tested-by: Josef Bacik <[email protected]>
[peterz: frobbings]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
include/linux/sched.h | 4 +-
kernel/sched/fair.c | 72 +++++++++++++++++++++++---------------------------
2 files changed, 36 insertions(+), 40 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1351,9 +1351,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
- struct task_struct *last_wakee;
- unsigned long wakee_flips;
+ unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
+ struct task_struct *last_wakee;

int wake_cpu;
#endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4726,26 +4726,29 @@ static long effective_load(struct task_g

#endif

+/*
+ * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolodating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of lls_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that the relationship is
+ * non-monogamous, with partner count exceeding socket size. Waker/wakee
+ * being client/server, worker/dispatcher, interrupt source or whatever is
+ * irrelevant, spread criteria is apparent partner count exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
+ unsigned int master = current->wakee_flips;
+ unsigned int slave = p->wakee_flips;
int factor = this_cpu_read(sd_llc_size);

- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
- }
-
- return 0;
+ if (master < slave)
+ swap(master, slave);
+ if (slave < factor || master < slave * factor)
+ return 0;
+ return 1;
}

static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4757,13 +4760,6 @@ static int wake_affine(struct sched_doma
unsigned long weight;
int balanced;

- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -5015,19 +5011,19 @@ static int get_cpu_usage(int cpu)
static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
{
- struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
+ struct sched_domain *tmp, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
+ int new_cpu = prev_cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;

if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));

rcu_read_lock();
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
- continue;
+ break;

/*
* If both cpu and prev_cpu are part of this domain,
@@ -5035,20 +5031,14 @@ select_task_rq_fair(struct task_struct *
*/
if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
- affine_sd = tmp;
- break;
+ /* Prefer affinity over any other flags */
+ goto do_affine;
}

if (tmp->flags & sd_flag)
sd = tmp;
- }
-
- if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
-
- if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
+ else if (!want_affine)
+ break;
}

while (sd) {
@@ -5087,8 +5077,14 @@ select_task_rq_fair(struct task_struct *
}
unlock:
rcu_read_unlock();
-
return new_cpu;
+
+do_affine:
+ if (cpu != prev_cpu && wake_affine(tmp, p, sync))
+ new_cpu = cpu;
+
+ new_cpu = select_idle_sibling(p, new_cpu);
+ goto unlock;
}

/*
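
A toy illustration of those two rules, using made-up SD_* values and a
three-level array instead of real sched domains: the walk stops at the
first level lacking SD_LOAD_BALANCE, and because the balance flags are
assumed contiguous from the bottom, it can also stop at the first level
lacking the wanted flag without missing a wider match.

#include <stdio.h>

/* Made-up flag values; they do not match the kernel's SD_* constants. */
#define SD_LOAD_BALANCE	0x01
#define SD_BALANCE_WAKE	0x02

int main(void)
{
	unsigned int domains[] = {
		SD_LOAD_BALANCE | SD_BALANCE_WAKE,	/* SMT  */
		SD_LOAD_BALANCE | SD_BALANCE_WAKE,	/* MC   */
		SD_LOAD_BALANCE,			/* NUMA */
	};
	int i, sd = -1;

	for (i = 0; i < 3; i++) {
		if (!(domains[i] & SD_LOAD_BALANCE))
			break;		/* rule 1: stop at the first !SD_LOAD_BALANCE */
		if (domains[i] & SD_BALANCE_WAKE)
			sd = i;		/* remember the widest matching domain */
		else
			break;		/* rule 2: flags are contiguous from the bottom */
	}
	printf("widest SD_BALANCE_WAKE domain: %d\n", sd);
	return 0;
}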

2015-07-14 13:49:25

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] sched: beef up wake_wide()

On Tue, 2015-07-14 at 13:19 +0200, Peter Zijlstra wrote:

> OK, how about something like the below; it tightens things up by
> applying two rules:
>
> - We really should not continue looking for a balancing domain once
> SD_LOAD_BALANCE is not set.
>
> - SD (balance) flags should really be set in a single contiguous range,
> always starting at the bottom.
>
> The latter means that if !want_affine and the (first) sd doesn't have
> BALANCE_WAKE set, we're done. Getting rid of (most of) that iteration
> junk you didn't like..
>
> Hmm?

Yeah, that's better. It's not a big hairy deal either way, it just bugged
me to knowingly toss those cycles out the window ;-)

select_idle_sibling() looks kinda funny down there, but otoh when the
day comes (hah) that we can just balance, it's closer to the exit.

-Mike

2015-07-14 14:07:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] sched: beef up wake_wide()

On Tue, Jul 14, 2015 at 03:49:17PM +0200, Mike Galbraith wrote:
> On Tue, 2015-07-14 at 13:19 +0200, Peter Zijlstra wrote:
>
> > OK, how about something like the below; it tightens things up by
> > applying two rules:
> >
> > - We really should not continue looking for a balancing domain once
> > SD_LOAD_BALANCE is not set.
> >
> > - SD (balance) flags should really be set in a single contiguous range,
> > always starting at the bottom.
> >
> > > The latter means that if !want_affine and the (first) sd doesn't have
> > BALANCE_WAKE set, we're done. Getting rid of (most of) that iteration
> > junk you didn't like..
> >
> > Hmm?
>
> Yeah, that's better. It's not a big hairy deal either way, it just bugged
> me to knowingly toss those cycles out the window ;-)
>
> select_idle_sibling() looks kinda funny down there, but otoh when the
> day comes (hah) that we can just balance, it's closer to the exit.

Right, not too pretty, does this look better?

---
Subject: sched: Beef up wake_wide()
From: Mike Galbraith <[email protected]>
Date: Fri, 10 Jul 2015 07:19:26 +0200

Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to the wakeup of a many partner
waker, returning 1 only when waking one of its many partners.

Correct that, letting explicit domain flags override the heuristic.

While at it, adjust task_struct bits, we don't need a 64bit counter.

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Mike Galbraith <[email protected]>
Tested-by: Josef Bacik <[email protected]>
[peterz: frobbings]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
include/linux/sched.h | 4 +-
kernel/sched/fair.c | 68 +++++++++++++++++++++++---------------------------
2 files changed, 34 insertions(+), 38 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1359,9 +1359,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
- struct task_struct *last_wakee;
- unsigned long wakee_flips;
+ unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
+ struct task_struct *last_wakee;

int wake_cpu;
#endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4726,26 +4726,29 @@ static long effective_load(struct task_g

#endif

+/*
+ * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolodating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of lls_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that the relationship is
+ * non-monogamous, with partner count exceeding socket size. Waker/wakee
+ * being client/server, worker/dispatcher, interrupt source or whatever is
+ * irrelevant, spread criteria is apparent partner count exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
+ unsigned int master = current->wakee_flips;
+ unsigned int slave = p->wakee_flips;
int factor = this_cpu_read(sd_llc_size);

- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
- }
-
- return 0;
+ if (master < slave)
+ swap(master, slave);
+ if (slave < factor || master < slave * factor)
+ return 0;
+ return 1;
}

static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4757,13 +4760,6 @@ static int wake_affine(struct sched_doma
unsigned long weight;
int balanced;

- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -5015,19 +5011,19 @@ static int get_cpu_usage(int cpu)
static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
{
- struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
+ struct sched_domain *tmp, affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
+ int new_cpu = prev_cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;

if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));

rcu_read_lock();
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
- continue;
+ break;

/*
* If both cpu and prev_cpu are part of this domain,
@@ -5041,17 +5037,17 @@ select_task_rq_fair(struct task_struct *

if (tmp->flags & sd_flag)
sd = tmp;
+ else if (!want_affine)
+ break;
}

- if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
+ if (affine_sd) { /* Prefer affinity over any other flags */
+ if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+ new_cpu = cpu;

- if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
- }
+ new_cpu = select_idle_sibling(p, new_cpu);

- while (sd) {
+ } else while (sd) {
struct sched_group *group;
int weight;

@@ -5085,10 +5081,10 @@ select_task_rq_fair(struct task_struct *
}
/* while loop will break here if sd == NULL */
}
-unlock:
- rcu_read_unlock();

+ rcu_read_unlock();
return new_cpu;
+
}

/*

2015-07-14 14:17:53

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] sched: beef up wake_wide()

On Tue, 2015-07-14 at 16:07 +0200, Peter Zijlstra wrote:
> On Tue, Jul 14, 2015 at 03:49:17PM +0200, Mike Galbraith wrote:
> > On Tue, 2015-07-14 at 13:19 +0200, Peter Zijlstra wrote:
> >
> > > OK, how about something like the below; it tightens things up by
> > > applying two rules:
> > >
> > > - We really should not continue looking for a balancing domain once
> > > SD_LOAD_BALANCE is not set.
> > >
> > > - SD (balance) flags should really be set in a single contiguous range,
> > > always starting at the bottom.
> > >
> > > The latter means that if !want_affine and the (first) sd doesn't have
> > > BALANCE_WAKE set, we're done. Getting rid of (most of) that iteration
> > > junk you didn't like..
> > >
> > > Hmm?
> >
> > Yeah, that's better. It's not a big hairy deal either way, it just bugged
> > me to knowingly toss those cycles out the window ;-)
> >
> > select_idle_sibling() looks kinda funny down there, but otoh when the
> > day comes (hah) that we can just balance, it's closer to the exit.
>
> Right, not too pretty, does this look better?

There's a buglet, I was just about to mention the inverse in the other.


> @@ -5041,17 +5037,17 @@ select_task_rq_fair(struct task_struct *
>
> if (tmp->flags & sd_flag)
> sd = tmp;
> + else if (!want_affine)
> + break;
> }
>
> - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> - prev_cpu = cpu;
> + if (affine_sd) { /* Prefer affinity over any other flags */
> + if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> + new_cpu = cpu;
>
> - if (sd_flag & SD_BALANCE_WAKE) {
> - new_cpu = select_idle_sibling(p, prev_cpu);
> - goto unlock;
> - }
> + new_cpu = select_idle_sibling(p, new_cpu);

We'll not look for an idle cpu when wake_wide() naks want_affine.

-Mike

2015-07-14 15:05:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] sched: beef up wake_wide()

On Tue, Jul 14, 2015 at 04:17:46PM +0200, Mike Galbraith wrote:
> There's a buglet,

> We'll not look for an idle cpu when wake_wide() naks want_affine.

*sigh* indeed.. fixing that'll bring us very close to what we started
out with..

The one XXX there raises the question of whether we should always do
select_idle_sibling() if we do not have a suitable balance flag, or only
on wakeups.



---
Subject: sched: Beef up wake_wide()
From: Mike Galbraith <[email protected]>
Date: Fri, 10 Jul 2015 07:19:26 +0200

Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to the wakeup of a many partner
waker, returning 1 only when waking one of its many partners.

Correct that, letting explicit domain flags override the heuristic.

While at it, adjust task_struct bits, we don't need a 64bit counter.

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Mike Galbraith <[email protected]>
Tested-by: Josef Bacik <[email protected]>
[peterz: frobbings]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
include/linux/sched.h | 4 +-
kernel/sched/fair.c | 69 ++++++++++++++++++++++++--------------------------
2 files changed, 36 insertions(+), 37 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1359,9 +1359,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
- struct task_struct *last_wakee;
- unsigned long wakee_flips;
+ unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
+ struct task_struct *last_wakee;

int wake_cpu;
#endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4726,26 +4726,29 @@ static long effective_load(struct task_g

#endif

+/*
+ * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolodating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of lls_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that the relationship is
+ * non-monogamous, with partner count exceeding socket size. Waker/wakee
+ * being client/server, worker/dispatcher, interrupt source or whatever is
+ * irrelevant, spread criteria is apparent partner count exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
+ unsigned int master = current->wakee_flips;
+ unsigned int slave = p->wakee_flips;
int factor = this_cpu_read(sd_llc_size);

- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
- }
-
- return 0;
+ if (master < slave)
+ swap(master, slave);
+ if (slave < factor || master < slave * factor)
+ return 0;
+ return 1;
}

static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4757,13 +4760,6 @@ static int wake_affine(struct sched_doma
unsigned long weight;
int balanced;

- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -5015,19 +5011,19 @@ static int get_cpu_usage(int cpu)
static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
{
- struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
+ struct sched_domain *tmp, affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
+ int new_cpu = prev_cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;

if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));

rcu_read_lock();
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
- continue;
+ break;

/*
* If both cpu and prev_cpu are part of this domain,
@@ -5041,17 +5037,21 @@ select_task_rq_fair(struct task_struct *

if (tmp->flags & sd_flag)
sd = tmp;
+ else if (!want_affine)
+ break;
}

- if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
-
- if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
+ if (affine_sd) {
+ sd = NULL; /* Prefer wake_affine over balance flags */
+ if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+ new_cpu = cpu;
}

- while (sd) {
+ if (!sd) {
+ if (sd_flags & SD_BALANCE_WAKE) /* XXX always ? */
+ new_cpu = select_idle_sibling(p, new_cpu);
+
+ } else while (sd) {
struct sched_group *group;
int weight;

@@ -5085,7 +5085,6 @@ select_task_rq_fair(struct task_struct *
}
/* while loop will break here if sd == NULL */
}
-unlock:
rcu_read_unlock();

return new_cpu;

2015-07-14 15:39:59

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] sched: beef up wake_wide()

On Tue, 2015-07-14 at 17:04 +0200, Peter Zijlstra wrote:
> On Tue, Jul 14, 2015 at 04:17:46PM +0200, Mike Galbraith wrote:
> > There's a buglet,
>
> > We'll not look for an idle cpu when wake_wide() naks want_affine.
>
> *sigh* indeed.. fixing that'll bring us very close to what we started
> out with..
>
> The one XXX there raises the question of whether we should always do
> select_idle_sibling() if we do not have a suitable balance flag, or only
> on wakeups.

That's what I've been sitting here waffling over, finally convinced
myself that should the user turn FORK/EXEC off, he shouldn't find that a
substitute quietly slipped in.. though otoh.. crap, guess I'm not done
waffling after all. Yeah, this will work just fine ;-)

(typos fixed)

---
Subject: sched: Beef up wake_wide()
From: Mike Galbraith <[email protected]>
Date: Fri, 10 Jul 2015 07:19:26 +0200

Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to the wakeup of a many partner
waker, returning 1 only when waking one of its many partners.

Correct that, letting explicit domain flags override the heuristic.

While at it, adjust task_struct bits, we don't need a 64bit counter.

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Mike Galbraith <[email protected]>
Tested-by: Josef Bacik <[email protected]>
[peterz: frobbings]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
include/linux/sched.h | 4 +-
kernel/sched/fair.c | 67 ++++++++++++++++++++++++--------------------------
2 files changed, 35 insertions(+), 36 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1351,9 +1351,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
- struct task_struct *last_wakee;
- unsigned long wakee_flips;
+ unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
+ struct task_struct *last_wakee;

int wake_cpu;
#endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4730,26 +4730,29 @@ static long effective_load(struct task_g

#endif

+/*
+ * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolodating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of lls_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that the relationship is
+ * non-monogamous, with partner count exceeding socket size. Waker/wakee
+ * being client/server, worker/dispatcher, interrupt source or whatever is
+ * irrelevant, spread criteria is apparent partner count exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
+ unsigned int master = current->wakee_flips;
+ unsigned int slave = p->wakee_flips;
int factor = this_cpu_read(sd_llc_size);

- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
- }
-
- return 0;
+ if (master < slave)
+ swap(master, slave);
+ if (slave < factor || master < slave * factor)
+ return 0;
+ return 1;
}

static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4764,6 @@ static int wake_affine(struct sched_doma
unsigned long weight;
int balanced;

- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -5021,17 +5017,17 @@ select_task_rq_fair(struct task_struct *
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
+ int new_cpu = prev_cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;

if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));

rcu_read_lock();
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
- continue;
+ break;

/*
* If both cpu and prev_cpu are part of this domain,
@@ -5045,17 +5041,21 @@ select_task_rq_fair(struct task_struct *

if (tmp->flags & sd_flag)
sd = tmp;
+ else if (!want_affine)
+ break;
}

- if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
-
- if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
+ if (affine_sd) {
+ sd = NULL; /* Prefer wake_affine over balance flags */
+ if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+ new_cpu = cpu;
}

- while (sd) {
+ if (!sd) {
+ if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
+ new_cpu = select_idle_sibling(p, new_cpu);
+
+ } else while (sd) {
struct sched_group *group;
int weight;

@@ -5089,7 +5089,6 @@ select_task_rq_fair(struct task_struct *
}
/* while loop will break here if sd == NULL */
}
-unlock:
rcu_read_unlock();

return new_cpu;

2015-07-14 16:02:29

by Josef Bacik

[permalink] [raw]
Subject: Re: [patch] sched: beef up wake_wide()

On 07/14/2015 11:39 AM, Mike Galbraith wrote:
> On Tue, 2015-07-14 at 17:04 +0200, Peter Zijlstra wrote:
>> On Tue, Jul 14, 2015 at 04:17:46PM +0200, Mike Galbraith wrote:
>>> There's a buglet,
>>
>>> We'll not look for an idle cpu when wake_wide() naks want_affine.
>>
>> *sigh* indeed.. fixing that'll bring us very close to what we started
>> out with..
>>
>> The one XXX there raises the question of whether we should always do
>> select_idle_sibling() if we do not have a suitable balance flag, or only
>> on wakeups.
>
> That's what I've been sitting here waffling over, finally convinced
> myself that should the user turn FORK/EXEC off, he shouldn't find that a
> substitute quietly slipped in.. though otoh.. crap, guess I'm not done
> waffling after all. Yeah, this will work just fine ;-)
>
> (typos fixed)
>

We happy with this or should I wait for more patches to fly by before I
test something ;)?

Josef

2015-07-14 17:59:08

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patch] sched: beef up wake_wide()

On Tue, 2015-07-14 at 12:01 -0400, Josef Bacik wrote:

> We happy with this or should I wait for more patches to fly by before I
> test something ;)?

Yeah, it should be The End time.

-Mike

2015-07-15 17:12:15

by Josef Bacik

[permalink] [raw]
Subject: Re: [patch] sched: beef up wake_wide()

On 07/14/2015 01:59 PM, Mike Galbraith wrote:
> On Tue, 2015-07-14 at 12:01 -0400, Josef Bacik wrote:
>
>> We happy with this or should I wait for more patches to fly by before I
>> test something ;)?
>
> Yeah, it should be The End time.

It's showing the same thing as one of the many earlier patches: it's
better at high RPS, but at low RPS there's a 3% regression. Thanks,

Josef

Subject: [tip:sched/core] sched/fair: Beef up wake_wide()

Commit-ID: 63b0e9edceec10fa41ec33393a1515a5ff444277
Gitweb: http://git.kernel.org/tip/63b0e9edceec10fa41ec33393a1515a5ff444277
Author: Mike Galbraith <[email protected]>
AuthorDate: Tue, 14 Jul 2015 17:39:50 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 3 Aug 2015 12:21:23 +0200

sched/fair: Beef up wake_wide()

Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to the wakeup of a many partner
waker, returning 1 only when waking one of its many partners.

Correct that, letting explicit domain flags override the heuristic.

While at it, adjust task_struct bits, we don't need a 64-bit counter.

Tested-by: Josef Bacik <[email protected]>
Signed-off-by: Mike Galbraith <[email protected]>
[ Tidy things up. ]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: kernel-team<[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 4 +--
kernel/sched/fair.c | 67 +++++++++++++++++++++++++--------------------------
2 files changed, 35 insertions(+), 36 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7412070..65a8a86 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1359,9 +1359,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
- struct task_struct *last_wakee;
- unsigned long wakee_flips;
+ unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
+ struct task_struct *last_wakee;

int wake_cpu;
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b384b8d..ea23f9f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4726,26 +4726,29 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)

#endif

+/*
+ * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolodating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of lls_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that the relationship is
+ * non-monogamous, with partner count exceeding socket size. Waker/wakee
+ * being client/server, worker/dispatcher, interrupt source or whatever is
+ * irrelevant, spread criteria is apparent partner count exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
+ unsigned int master = current->wakee_flips;
+ unsigned int slave = p->wakee_flips;
int factor = this_cpu_read(sd_llc_size);

- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
- }
-
- return 0;
+ if (master < slave)
+ swap(master, slave);
+ if (slave < factor || master < slave * factor)
+ return 0;
+ return 1;
}

static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4757,13 +4760,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
unsigned long weight;
int balanced;

- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -5017,17 +5013,17 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
+ int new_cpu = prev_cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;

if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));

rcu_read_lock();
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
- continue;
+ break;

/*
* If both cpu and prev_cpu are part of this domain,
@@ -5041,17 +5037,21 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f

if (tmp->flags & sd_flag)
sd = tmp;
+ else if (!want_affine)
+ break;
}

- if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
-
- if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
+ if (affine_sd) {
+ sd = NULL; /* Prefer wake_affine over balance flags */
+ if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+ new_cpu = cpu;
}

- while (sd) {
+ if (!sd) {
+ if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
+ new_cpu = select_idle_sibling(p, new_cpu);
+
+ } else while (sd) {
struct sched_group *group;
int weight;

@@ -5085,7 +5085,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
}
/* while loop will break here if sd == NULL */
}
-unlock:
rcu_read_unlock();

return new_cpu;
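
The changelogs above only say the counter doesn't need to be 64 bits; a
plausible side benefit of also moving last_wakee below wakee_flip_decay_ts
is that the 32-bit counter then packs into the alignment hole after on_cpu.
A stand-alone sketch of just those fields (not the real task_struct, where
the actual saving depends on the neighbouring members) shows the effect on
a typical 64-bit build, shrinking this fragment from 40 to 32 bytes.

#include <stdio.h>

/* Field order as it was before the patch: the pointer forces 4 bytes of
 * padding after on_cpu, and the flip counter occupies a full long. */
struct fields_before {
	int on_cpu;
	struct task_struct *last_wakee;
	unsigned long wakee_flips;
	unsigned long wakee_flip_decay_ts;
	int wake_cpu;
};

/* Field order as in the patch: the 32-bit counter sits in the hole
 * after on_cpu, and the pointer moves below the decay timestamp. */
struct fields_after {
	int on_cpu;
	unsigned int wakee_flips;
	unsigned long wakee_flip_decay_ts;
	struct task_struct *last_wakee;
	int wake_cpu;
};

int main(void)
{
	printf("before: %zu bytes, after: %zu bytes\n",
	       sizeof(struct fields_before), sizeof(struct fields_after));
	return 0;
}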