2017-06-30 00:19:17

by Joel Fernandes

Subject: wake_wide mechanism clarification

Dear Mike,

I wanted your kind help to understand your patch "sched: beef up
wake_wide()"[1] which is a modification to the original patch from
Michael Wang [2].

In particular, I didn't follow the following comment:
" to shared cache, we look for a minimum 'flip' frequency of llc_size
in one partner, and a factor of lls_size higher frequency in the
other."

Why do we want the master's flip frequency to be higher than the
slave's by that factor?

The code here is written as:

if (slave < factor || master < slave * factor)
return 0;

However I think we should just do (with my current and probably wrong
understanding):

if (slave < factor || master < factor)
return 0;

Basically, I didn't follow why we multiply the slave's flips by
llc_size. That makes it sound like the master has to have far more
flips than the slave for wake_wide() to return 0. Could you maybe give
an example to clarify? Thanks a lot for your help,
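
For context, here is a self-contained model of the wake_wide() check as
I understand it (factor is a plain parameter here purely for
illustration; in the kernel it comes from this_cpu_read(sd_llc_size)):

```c
#include <assert.h>

/* Standalone sketch of the wake_wide() decision being discussed.
 * 'factor' stands in for sd_llc_size; master/slave stand in for the
 * waker's and wakee's wakee_flips counts. */
static int wake_wide_model(unsigned int master, unsigned int slave,
                           unsigned int factor)
{
        /* wake_wide() first orders the pair so master is the bigger flipper */
        if (master < slave) {
                unsigned int tmp = master;
                master = slave;
                slave = tmp;
        }
        /* a minimum flip count for the slave, and a factor-of-llc_size
         * higher count in the master */
        if (slave < factor || master < slave * factor)
                return 0;       /* wake affine */
        return 1;               /* wake wide */
}
```

For example, with factor = 4, a master with 100 flips and a slave with
10 flips goes wide (10 >= 4 and 100 >= 40), while a slave with only 3
flips stays affine because of the first clause.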

I am also CC'ing Peter and some ARM folks for the discussion (and also
Josef, who discussed it with Mike on the mailing list a few years ago).

Thanks,
Joel

[1] https://patchwork.kernel.org/patch/6787941/
[2] https://lkml.org/lkml/2013/7/4/20


2017-06-30 00:49:17

by Josef Bacik

Subject: Re: wake_wide mechanism clarification

On Thu, Jun 29, 2017 at 05:19:14PM -0700, Joel Fernandes wrote:
> Dear Mike,
>
> I wanted your kind help to understand your patch "sched: beef up
> wake_wide()"[1] which is a modification to the original patch from
> Michael Wang [2].
>
> In particular, I didn't following the following comment:
> " to shared cache, we look for a minimum 'flip' frequency of llc_size
> in one partner, and a factor of lls_size higher frequency in the
> other."
>
> Why are wanting the master's flip frequency to be higher than the
> slaves by the factor?

(Responding from my personal email as my work email is outlook shit and
impossible to use)

Because we are trying to detect the case where the master is waking many
different processes while the 'slave' processes are only waking up the
master (or some other specific processes), which tells us we don't care
about cache locality.

>
> The code here is written as:
>
> if (slave < factor || master < slave * factor)
> return 0;
>
> However I think we should just do (with my current and probably wrong
> understanding):
>
> if (slave < factor || master < factor)
> return 0;
>

Actually I think both are wrong, but I need Mike to weigh in. In my example
above we'd return 0, because the 'producer' will definitely have a wakee_flip of
ridiculous values, but the 'consumer' could essentially have a wakee_flip of 1,
just the master to tell it that it's done. I _suppose_ in practice you have a
lock or something so the wakee_flip isn't going to be strictly 1, but some
significantly lower value than master. I'm skeptical of the slave < factor
test, I think it's too high of a bar in the case where cache locality doesn't
really matter, but the master < slave * factor makes sense, as slave is going to
be orders of magnitude lower than master.
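
To make that asymmetry concrete (flip counts here are made up for
illustration, with a hypothetical llc_size of 8):

```c
#include <assert.h>

/* Producer/consumer case: the producer's wakee_flips is huge, the
 * consumer's stays tiny because it mostly flips back to the producer.
 * A truthy result means wake_wide() would return 0 (stay affine). */
static int stays_affine(unsigned int master, unsigned int slave,
                        unsigned int llc_size)
{
        if (master < slave) {
                unsigned int tmp = master;
                master = slave;
                slave = tmp;
        }
        return slave < llc_size || master < slave * llc_size;
}
```

With master = 5000 and slave = 3, the slave < factor clause alone keeps
the pair affine; bump the consumer to 20 flips and the ratio test sends
it wide.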

> Basically, I didn't follow why we multiply the slave's flips with
> llc_size. That makes it sound like the master has to have way more
> flips than the slave to return 0 from wake_wide. Could you maybe give
> an example to clarify? Thanks a lot for your help,
>

It may be worth trying this with schedbench and tracing it to see how it
turns out in practice, as that's the workload that generated all this
discussion before. I imagine generally speaking this works out properly. The
small regression I reported before was at low RPS, so we wouldn't be waking
up as many tasks as often, so we would be returning 0 from wake_wide() and
we'd get screwed. This is where I think possibly dropping the slave < factor
part of the test would address that, but I'd have to trace it to say for
sure. Thanks,

Josef

2017-06-30 03:05:02

by Joel Fernandes

Subject: Re: wake_wide mechanism clarification

Hi Josef,

Thanks a lot for your reply, :-)

On Thu, Jun 29, 2017 at 5:49 PM, Josef Bacik <[email protected]> wrote:
> Because we are trying to detect the case that the master is waking many
> different processes, and the 'slave' processes are only waking up the
> master/some other specific processes to determine if we don't care about cache
> locality.
>
>>
>> The code here is written as:
>>
>> if (slave < factor || master < slave * factor)
>> return 0;
>>
>> However I think we should just do (with my current and probably wrong
>> understanding):
>>
>> if (slave < factor || master < factor)
>> return 0;
>>
>
> Actually I think both are wrong, but I need Mike to weigh in. In my example
> above we'd return 0, because the 'producer' will definitely have a wakee_flip of
> ridiculous values, but the 'consumer' could essentially have a wakee_flip of 1,
> just the master to tell it that it's done. I _suppose_ in practice you have a
> lock or something so the wakee_flip isn't going to be strictly 1, but some
> significantly lower value than master. I'm skeptical of the slave < factor
> test, I think it's too high of a bar in the case where cache locality doesn't
> really matter, but the master < slave * factor makes sense, as slave is going to
> be orders of magnitude lower than master.

That makes sense, that we multiply the slave's flips by a factor because
they're low, but I still didn't get why the factor is chosen to be
llc_size rather than something else for the multiplication with slave
(slave * factor). I know the slave's flips are probably low and need to
be scaled up, but why is it multiplied by llc_size rather than some other
number (in other words, why are we tying the size of a NUMA node or a
cluster to the number of wake-ups that the slave made)? Is this
because of an assumption that a master is expected to evenly
distribute work among the CPUs within its node, or something like that?

Moreover, this case only covers slave wakeups that are far lower than
the master's. But what about the case where the slave's wakeups are
greater than the factor but approximately the same as the master's?
Then it sounds like (master < slave * factor) can return true, in which
case wake_wide() will return 0. That sounds like a bad thing to do, I
think: pulling a busy slave onto a busy master.

>> Basically, I didn't follow why we multiply the slave's flips with
>> llc_size. That makes it sound like the master has to have way more
>> flips than the slave to return 0 from wake_wide. Could you maybe give
>> an example to clarify? Thanks a lot for your help,
>>
>
> It may be worth to try with schedbench and trace it to see how this turns out in
> practice, as that's the workload that generated all this discussion before. I
> imagine generally speaking this works out properly. The small regression I

I see. I will try to find some time to play with this tool. Thanks for
the pointer.

> reported before was at low RPS, so we wouldn't be waking up as many tasks as
> often, so we would be returning 0 from wake_wide() and we'd get screwed. This
> is where I think possibly dropping the slave < factor part of the test would
> address that, but I'd have to trace it to say for sure.

Ah, I see your problem. I guess it's a conflicting case/requirement? I
guess somehow, to handle your case, it has to be embedded into this
condition that cache locality doesn't matter, as that seems to be the
assumption behind slave < factor, as you pointed out.

Thanks,
Joel

2017-06-30 03:11:49

by Mike Galbraith

Subject: Re: wake_wide mechanism clarification

On Thu, 2017-06-29 at 20:49 -0400, Josef Bacik wrote:
> On Thu, Jun 29, 2017 at 05:19:14PM -0700, Joel Fernandes wrote:
>
> > Why are wanting the master's flip frequency to be higher than the
> > slaves by the factor?
>
> (Responding from my personal email as my work email is outlook shit and
> impossible to use)
>
> Because we are trying to detect the case that the master is waking many
> different processes, and the 'slave' processes are only waking up the
> master/some other specific processes to determine if we don't care about cache
> locality.

Yes, the heuristic (large delta implies waker/wakee are NOT 1:1, ie
filter out high frequency communication where eating misses doesn't
merely sting, it hurts like hell) just became bidirectional.

> Actually I think both are wrong, but I need Mike to weigh in.

My weigh in is this: if you have ideas to improve or replace that
heuristic, by all means go for it, just make damn sure it's dirt cheap.
 Heuristics all suck one way or another, problem is that nasty old
"perfect is the enemy of good" adage.  Make it perfect, it'll hurt.

-Mike

2017-06-30 13:11:29

by Matt Fleming

Subject: Re: wake_wide mechanism clarification

On Thu, 29 Jun, at 08:49:13PM, Josef Bacik wrote:
>
> It may be worth to try with schedbench and trace it to see how this turns out in
> practice, as that's the workload that generated all this discussion before. I
> imagine generally speaking this works out properly. The small regression I
> reported before was at low RPS, so we wouldn't be waking up as many tasks as
> often, so we would be returning 0 from wake_wide() and we'd get screwed. This
> is where I think possibly dropping the slave < factor part of the test would
> address that, but I'd have to trace it to say for sure. Thanks,

Just 2 weeks ago I was poking at wake_wide() because it's impacting
hackbench times now we're better at balancing on fork() (see commit
6b94780e45c1 ("sched/core: Use load_avg for selecting idlest group")).

What's happening is that occasionally the hackbench times will be
pretty large because the hackbench tasks are being pulled back and
forth across NUMA domains due to the wake_wide() logic.

Reproducing this issue does require a NUMA box with more CPUs than
hackbench tasks. I was using an 80-cpu 2 NUMA node box with 1
hackbench group (20 readers, 20 writers).

I did the following very quick hack,

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1f5efa51dc7..c1bc1b0434bd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5055,7 +5055,7 @@ static int wake_wide(struct task_struct *p)

 	if (master < slave)
 		swap(master, slave);
-	if (slave < factor || master < slave * factor)
+	if (master < slave * factor)
 		return 0;
 	return 1;
 }

Which produces the following results for the 1 group (40 tasks) on one
of SUSE's enterprise kernels:

hackbench-process-pipes
                           4.4.71                 4.4.71
                          patched  patched+wake-wide-fix
Min       1        0.7000 (  0.00%)       0.8480 (-21.14%)
Amean     1        1.0343 (  0.00%)       0.9073 ( 12.28%)
Stddev    1        0.2373 (  0.00%)       0.0447 ( 81.15%)
CoeffVar  1       22.9447 (  0.00%)       4.9300 ( 78.51%)
Max       1        1.2270 (  0.00%)       0.9560 ( 22.09%)

You'll see that the minimum value is worse with my change, but the
maximum is much better.

So the current wake_wide() code does help sometimes, but it also hurts
sometimes too.
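
The behavioral difference between the stock test and the hack can be
sketched like so (flip counts are illustrative, not traced values, and
factor = 40 assumes the LLC spans one 40-cpu node of the 80-cpu box):

```c
#include <assert.h>

/* Stock wake_wide() test vs. the hacked one above, side by side.
 * Inputs are assumed already ordered so master >= slave, as the
 * swap() in wake_wide() guarantees. Returns 1 for wake wide. */
static int stock_wide(unsigned int master, unsigned int slave,
                      unsigned int factor)
{
        if (slave < factor || master < slave * factor)
                return 0;
        return 1;
}

static int hacked_wide(unsigned int master, unsigned int slave,
                       unsigned int factor)
{
        if (master < slave * factor)
                return 0;
        return 1;
}
```

A pair with master = 500, slave = 10 stays affine under the stock test
(slave < 40) but goes wide under the hack (500 >= 10 * 40), which is
exactly the class of cross-node pulls the change avoids.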

I'm happy to gather performance data for any code suggestions.

2017-06-30 14:28:26

by Josef Bacik

Subject: Re: wake_wide mechanism clarification

On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
> Hi Josef,
>
> Thanks a lot for your reply, :-)
>
> On Thu, Jun 29, 2017 at 5:49 PM, Josef Bacik <[email protected]> wrote:
> > Because we are trying to detect the case that the master is waking many
> > different processes, and the 'slave' processes are only waking up the
> > master/some other specific processes to determine if we don't care about cache
> > locality.
> >
> >>
> >> The code here is written as:
> >>
> >> if (slave < factor || master < slave * factor)
> >> return 0;
> >>
> >> However I think we should just do (with my current and probably wrong
> >> understanding):
> >>
> >> if (slave < factor || master < factor)
> >> return 0;
> >>
> >
> > Actually I think both are wrong, but I need Mike to weigh in. In my example
> > above we'd return 0, because the 'producer' will definitely have a wakee_flip of
> > ridiculous values, but the 'consumer' could essentially have a wakee_flip of 1,
> > just the master to tell it that it's done. I _suppose_ in practice you have a
> > lock or something so the wakee_flip isn't going to be strictly 1, but some
> > significantly lower value than master. I'm skeptical of the slave < factor
> > test, I think it's too high of a bar in the case where cache locality doesn't
> > really matter, but the master < slave * factor makes sense, as slave is going to
> > be orders of magnitude lower than master.
>
> That makes sense that we multiply slave's flips by a factor because
> its low, but I still didn't get why the factor is chosen to be
> llc_size instead of something else for the multiplication with slave
> (slave * factor). I know slave's flips are probably low and need to be
> higher, but why its multiplied with llc_size than some other number
> (in other words - why we are tying the size of a NUMA node or a
> cluster size with the number of wake-ups that the slave made)? Is this
> because of an assumption that a master is expected to evenly
> distribute work amongs CPUs within its node or something like that?
>

Yeah, I don't know why llc_size was chosen, but I've been thinking about this
on and off since last night and I don't really know what the right thing to
use would be. We need some arbitrary number, and any arbitrary number is
going to be wrong for one workload while it's fine for another.

What we really want to know is how many different tasks we could potentially
wake up, to compare against, and that's kind of impossible. If we did it
based solely on actual wakee_flips values we could end up with the case that
one task with 2 worker tasks is always wake_wide, and that may not be as
helpful. With processes that are threaded, sure, we have a readily available
number to use, but for things that are discrete processes that talk over,
say, a pipe or shared memory, that's going to be impossible to tease out. I
haven't had my Mt. Dew this morning, so if you have a better idea for a
factor I'm all ears.

> More over, this case is only for when slave wakeups are far lower than
> the master. But, what about the case where slave wakes up are greater
> than the factor but approximately equal or around the same as the
> masters'. Then, it sounds like (master < slave * factor) can return
> true. In that case wake_wide() will be = 0. That sounds like a bad
> thing to do I think - pull a busy slave onto a busy master.
>

This I think is a flaw in our load balancing assumptions. I've been drowning in
this code recently because of a cgroups imbalance problem. We don't really
propagate load well for processes that are on the rq but haven't run yet, so I
feel this plays into this problem where we think the cpu we're pulling to has
plenty of capacity when in reality it doesn't.

I'm going to tool around with this logic some this morning and see if I can make
it a little bit smarter, I'll let you know how it goes. Thanks,

Josef

2017-06-30 17:02:35

by Mike Galbraith

Subject: Re: wake_wide mechanism clarification

On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>
> > That makes sense that we multiply slave's flips by a factor because
> > its low, but I still didn't get why the factor is chosen to be
> > llc_size instead of something else for the multiplication with slave
> > (slave * factor).

> Yeah I don't know why llc_size was chosen...

static void update_top_cache_domain(int cpu)
{
        struct sched_domain_shared *sds = NULL;
        struct sched_domain *sd;
        int id = cpu;
        int size = 1;

        sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
        if (sd) {
                id = cpumask_first(sched_domain_span(sd));
                size = cpumask_weight(sched_domain_span(sd));
                sds = sd->shared;
        }

        rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
        per_cpu(sd_llc_size, cpu) = size;

The goal of wake wide was to approximate when pulling would be a futile
consolidation effort and counterproductive to scaling.  'course with
ever increasing socket size, any 1:N waker is ever more likely to run
out of CPU for its one and only self (slamming into scaling wall)
before it needing to turn its minions loose to conquer the world.

Something else to consider: network interrupt waking multiple workers
at high frequency.  If the waking CPU is idle, do you really want to
place a worker directly in front of a tattoo artist, or is it better
off nearly anywhere but there?

If the box is virtual, with no topology exposed (or real but ancient)
to let select_idle_sibling() come to the rescue, two workers can even
get tattooed simultaneously (see sync wakeup). 

-Mike

2017-06-30 17:55:46

by Josef Bacik

Subject: Re: wake_wide mechanism clarification

On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote:
> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
> > On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
> >
> > > That makes sense that we multiply slave's flips by a factor because
> > > its low, but I still didn't get why the factor is chosen to be
> > > llc_size instead of something else for the multiplication with slave
> > > (slave * factor).
>
> > Yeah I don't know why llc_size was chosen...
>
> static void update_top_cache_domain(int cpu)
> {
>         struct sched_domain_shared *sds = NULL;
>         struct sched_domain *sd;
>         int id = cpu;
>         int size = 1;
>
>         sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>         if (sd) {
>                 id = cpumask_first(sched_domain_span(sd));
>                 size = cpumask_weight(sched_domain_span(sd));
>                 sds = sd->shared;
>         }
>
>         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>         per_cpu(sd_llc_size, cpu) = size;
>
> The goal of wake wide was to approximate when pulling would be a futile
> consolidation effort and counterproductive to scaling.  'course with
> ever increasing socket size, any 1:N waker is ever more likely to run
> out of CPU for its one and only self (slamming into scaling wall)
> before it needing to turn its minions loose to conquer the world.
>
> Something else to consider: network interrupt waking multiple workers
> at high frequency.  If the waking CPU is idle, do you really want to
> place a worker directly in front of a tattoo artist, or is it better
> off nearly anywhere but there?
>
> If the box is virtual, with no topology exposed (or real but ancient)
> to let select_idle_sibling() come to the rescue, two workers can even
> get tattooed simultaneously (see sync wakeup).
>

Heuristics are hard, news at 11. I think messing with wake_wide() itself is
too big of a hammer; we probably need a middle ground. I'm messing with it
right now so it's too early to say for sure, but I _suspect_ the bigger
latencies we see are not because we overload the cpu we're trying to pull to,
but because when we fail to do the wake_affine() we only look at siblings of
the affine_sd instead of doing the full "find the idlest cpu in the land!"
thing. I _think_ the answer is to make select_idle_sibling() try less hard to
find something workable, only use obviously idle cpus in the affine sd, and
fall back to the full load-balance-esque search.

This would make affine misses really expensive, but we can probably negate this
by tracking per task how often it misses the target, and use that to adjust when
we do wake_affine in the future for that task. Still experimenting some, I just
found out a few hours ago I need to rework some of this to fix my cpu imbalance
problem with cgroups, so once I get something working I'll throw it your way to
take a look. Thanks,

Josef

2017-07-29 08:01:29

by Joel Fernandes

Subject: Re: wake_wide mechanism clarification

Hi Mike,

I have spent some time understanding the email thread and previous
discussions. Unfortunately, the second condition we are checking for in
wake_wide() still didn't make sense to me (mentioned below) :-(

On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith
<[email protected]> wrote:
> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>
>> > That makes sense that we multiply slave's flips by a factor because
>> > its low, but I still didn't get why the factor is chosen to be
>> > llc_size instead of something else for the multiplication with slave
>> > (slave * factor).
>
>> Yeah I don't know why llc_size was chosen...
>
> static void update_top_cache_domain(int cpu)
> {
>         struct sched_domain_shared *sds = NULL;
>         struct sched_domain *sd;
>         int id = cpu;
>         int size = 1;
>
>         sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>         if (sd) {
>                 id = cpumask_first(sched_domain_span(sd));
>                 size = cpumask_weight(sched_domain_span(sd));
>                 sds = sd->shared;
>         }
>
>         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>         per_cpu(sd_llc_size, cpu) = size;
>
> The goal of wake wide was to approximate when pulling would be a futile
> consolidation effort and counterproductive to scaling. 'course with
> ever increasing socket size, any 1:N waker is ever more likely to run
> out of CPU for its one and only self (slamming into scaling wall)
> before it needing to turn its minions loose to conquer the world.

Actually, the original question was why we have the second condition as
"master < slave * factor" instead of "master < factor"; that's what
didn't make sense to me. Why don't we return 0 from wake_wide() if
master < factor?

In fact, as the factor is set to the llc_size, I think the condition
that makes sense to me is:

if ((master + slave) < llc_size)
return 0;

In other words, if the master flips and the slave flips in total are
higher than the llc_size, then we are most likely waking up too many
tasks as affine and should then switch to wide to prevent overloading.

Digging further into the original patch from Michael Wang (I also CC'd
him), this was the code (before you had changed it to master/slave):

wakee->nr_wakee_switch > factor &&
waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)

To explain the second condition above, Michael Wang said the following in [1]

"Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple
tasks rely on it, then waker's higher latency will damage all of them, pull
wakee seems to be a bad deal."

Again, I didn't follow why the second condition couldn't just be
waker->nr_wakee_switch > factor, or (waker->nr_wakee_switch +
wakee->nr_wakee_switch) > factor, based on the above explanation from
Michael Wang that I quoted, and why he's instead doing the whole
multiplication thing there that I was talking about earlier:
"factor * wakee->nr_wakee_switch".

Rephrasing my question another way: why are we taking the ratio of
master/slave instead of the sum when comparing against the factor? I am
surely missing something here.

Just taking an example:

Say we have llc_size = 3 and three masters M1, M2 and M3. M1 has 8
slaves, M2 has 4 slaves and M3 has 4 slaves. Only 1 slave is common
between all 3 masters. Also, to make it a bit more interesting, let s8
wake up some random task T0. A diagram to show the master/slave
relationships could look like:

M1 --> s1 s2 s3 s4 s5 s6 s7 s8
M2 --> s9 s10 s11 s15 (and shares s8)
M3 --> s12 s13 s14 s16 (and shares s8)
s8 --> T0


Let's consider the case of M1 waking up s8. As per the above diagram,
M1 has 8 flips and s8 has 4 flips.

With llc_size = 3, the condition (slave < factor) would return FALSE,
so we would then turn to the (master < slave * factor) condition. This
would be TRUE (8 < 4 * 3), so wake_wide() would return 0 and s8 would
be woken up affine with relation to M1's core.
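
The arithmetic of this walk-through can be checked mechanically:

```c
#include <assert.h>

/* The M1/s8 example from the diagram: llc_size = 3, M1 (master) has
 * 8 flips, s8 (slave) has 4 flips. Returns wake_wide()'s result. */
static int wake_wide_m1_s8(void)
{
        unsigned int factor = 3, master = 8, slave = 4;

        if (slave < factor || master < slave * factor)
                return 0;       /* 4 >= 3, but 8 < 4 * 3 = 12: affine */
        return 1;
}
```

It returns 0, i.e. s8 is woken affine to M1, matching the walk-through
above.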

So basically, it seems the heuristic is saying (with the help of the
second condition, master < slave * factor) that it's a good idea for
s8 to be affine-woken-up with respect to M1's core. Why is it a good
idea to do that? It seems to me M1 is already waking several tasks as
affine, so waking s8 affine as well could be harmful, and it may be a
better choice to wake it up elsewhere.

Thanks for your help!

-Joel

[1] https://lkml.org/lkml/2013/7/4/20


>
> Something else to consider: network interrupt waking multiple workers
> at high frequency. If the waking CPU is idle, do you really want to
> place a worker directly in front of a tattoo artist, or is it better
> off nearly anywhere but there?
>
> If the box is virtual, with no topology exposed (or real but ancient)
> to let select_idle_sibling() come to the rescue, two workers can even
> get tattooed simultaneously (see sync wakeup).
>
> -Mike

2017-07-29 08:13:50

by Joel Fernandes

Subject: Re: wake_wide mechanism clarification

+Michael Wang on his current email address (old one bounced). (my
reply was to Mike Galbraith but I also meant to CC Michael Wang for
the discussion). Thanks

On Sat, Jul 29, 2017 at 1:01 AM, Joel Fernandes <[email protected]> wrote:
> Hi Mike,
>
> I have take spent some time understanding the email thread and
> previous discussions. Unfortunately the second condition we are
> checking for in the wake_wide still didn't make sense to me (mentioned
> below) :-(
>
> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith
> <[email protected]> wrote:
>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>>
>>> > That makes sense that we multiply slave's flips by a factor because
>>> > its low, but I still didn't get why the factor is chosen to be
>>> > llc_size instead of something else for the multiplication with slave
>>> > (slave * factor).
>>
>>> Yeah I don't know why llc_size was chosen...
>>
>> static void update_top_cache_domain(int cpu)
>> {
>>         struct sched_domain_shared *sds = NULL;
>>         struct sched_domain *sd;
>>         int id = cpu;
>>         int size = 1;
>>
>>         sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>>         if (sd) {
>>                 id = cpumask_first(sched_domain_span(sd));
>>                 size = cpumask_weight(sched_domain_span(sd));
>>                 sds = sd->shared;
>>         }
>>
>>         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>         per_cpu(sd_llc_size, cpu) = size;
>>
>> The goal of wake wide was to approximate when pulling would be a futile
>> consolidation effort and counterproductive to scaling. 'course with
>> ever increasing socket size, any 1:N waker is ever more likely to run
>> out of CPU for its one and only self (slamming into scaling wall)
>> before it needing to turn its minions loose to conquer the world.
>
> Actually the original question was why do we have the second condition
> as "master < slave * factor", instead of "master < factor". that's
> what didn't make sense to me. Why don't we return 0 from wake_wide if
> master < factor ?
>
> Infact, as the factor is set to the llc_size, I think the condition
> that makes sense to me is:
>
> if ((master + slave) < llc_size)
> return 0;
>
> In other words, if the master flips and the slave flips are totally
> higher than the llc_size, then we are most likely waking up too many
> tasks as affine and should then switch to wide to prevent overloading.
>
> Digging further into the original patch from Michael Wang (I also CC'd
> him), this was the code (before you had changed it to master/slave):
>
> wakee->nr_wakee_switch > factor &&
> waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)
>
> To explain the second condition above, Michael Wang said the following in [1]
>
> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple
> tasks rely on it, then waker's higher latency will damage all of them, pull
> wakee seems to be a bad deal."
>
> Again I didn't follow why the second condition couldn't just be:
> waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch +
> wakee->nr_wakee_switch) > factor, based on the above explanation from
> Micheal Wang that I quoted.
> and why he's instead doing the whole multiplication thing there that I
> was talking about earlier: "factor * wakee->nr_wakee_switch".
>
> Rephrasing my question in another way, why are we talking the ratio of
> master/slave instead of the sum when comparing if its > factor? I am
> surely missing something here.
>
> Just taking an example:
>
> Say we have llc_size = 3, we have 3 masters M1, M2 and M3. M1 has 8
> slaves, M2 has 4 slaves and M3 has 4 slaves. Only 1 slave is common
> between all 3 masters. Also to make it a bit more interesting, let s8
> wake up some random task T0. A diagram to show the master/slave
> relation ships could look like:
>
> M1 --> s1 s2 s3 s4 s5 s6 s7 s8
> M2 --> s9 s10 s11 s15 (and shares s8)
> M3 --> s12 s13 s14 s16 (and shares s8)
> s8 --> T0
>
>
> Lets consider the case of M1 waking up s8. As per the above diagram,
> M1 has 8 flips and s8 has 4 flips.
>
> With llc_size = 3, the condition
>
> (slave < factor) would return FALSE, so then we would turn to the
> (master < slave * factor) condition. This would be TRUE (8 < 4 * 3),
> so wake_wide would return 0 and would cause s8 to be woken up as
> affine with relation to M1's core.
>
> So basically, it seems the heuristic is saying (with help of the
> second condition - master < slave * factor). that Its a good idea for
> s8 to be affine-woken-up with respect to M1's core. Why is it a good
> idea to do that? It seems to me M1 has already several tasks its
> waking as affine so causing s8 to be woken up affine could be harmful
> and it may be a better choice to wake it up elsewhere.
>
> Thanks for your help!
>
> -Joel
>
> [1] https://lkml.org/lkml/2013/7/4/20
>
>
>>
>> Something else to consider: network interrupt waking multiple workers
>> at high frequency. If the waking CPU is idle, do you really want to
>> place a worker directly in front of a tattoo artist, or is it better
>> off nearly anywhere but there?
>>
>> If the box is virtual, with no topology exposed (or real but ancient)
>> to let select_idle_sibling() come to the rescue, two workers can even
>> get tattooed simultaneously (see sync wakeup).
>>
>> -Mike

2017-07-29 15:07:32

by Mike Galbraith

Subject: Re: wake_wide mechanism clarification

On Sat, 2017-07-29 at 01:01 -0700, Joel Fernandes wrote:
> Hi Mike,
>
> I have take spent some time understanding the email thread and
> previous discussions. Unfortunately the second condition we are
> checking for in the wake_wide still didn't make sense to me (mentioned
> below) :-(
>
> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith
> <[email protected]> wrote:
> > On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
> >> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
> >>
> >> > That makes sense that we multiply slave's flips by a factor because
> >> > its low, but I still didn't get why the factor is chosen to be
> >> > llc_size instead of something else for the multiplication with slave
> >> > (slave * factor).
> >
> >> Yeah I don't know why llc_size was chosen...
> >
> > static void update_top_cache_domain(int cpu)
> > {
> >         struct sched_domain_shared *sds = NULL;
> >         struct sched_domain *sd;
> >         int id = cpu;
> >         int size = 1;
> >
> >         sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
> >         if (sd) {
> >                 id = cpumask_first(sched_domain_span(sd));
> >                 size = cpumask_weight(sched_domain_span(sd));
> >                 sds = sd->shared;
> >         }
> >
> >         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> >         per_cpu(sd_llc_size, cpu) = size;
> >
> > The goal of wake wide was to approximate when pulling would be a futile
> > consolidation effort and counterproductive to scaling. 'course with
> > ever increasing socket size, any 1:N waker is ever more likely to run
> > out of CPU for its one and only self (slamming into scaling wall)
> > before it needing to turn its minions loose to conquer the world.
>
> Actually the original question was why do we have the second condition
> as "master < slave * factor", instead of "master < factor". that's
> what didn't make sense to me. Why don't we return 0 from wake_wide if
> master < factor ?
>
> Infact, as the factor is set to the llc_size, I think the condition
> that makes sense to me is:
>
> if ((master + slave) < llc_size)
> return 0;

That says to me turn affine wakeups off for nearly everything.

> In other words, if the master flips and the slave flips are totally
> higher than the llc_size, then we are most likely waking up too many
> tasks as affine and should then switch to wide to prevent overloading.

The heuristic is mostly about the ratio of flip counts.

> Digging further into the original patch from Michael Wang (I also CC'd
> him), this was the code (before you had changed it to master/slave):
>
> wakee->nr_wakee_switch > factor &&
> waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)

Yeah, it was originally unidirectional.

> To explain the second condition above, Michael Wang said the following in [1]
>
> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple
> tasks rely on it, then waker's higher latency will damage all of them, pull
> wakee seems to be a bad deal."

Yes, "Furthermore". To detect 1:N, Michael chose llc_size as his N.  Is
the one flipping partners at least N/s, and the other about N times as
often?  If so, the two may be part of a too big to wisely pull 1:N.

If you have a better idea, by all means, pull it out.  Nobody is
attached to wake_wide(), in fact, I suspect Peter hates it.  I'm not
fond of it either, it having obvious holes.  The only thing it has
going for it is simplicity.  Bend it up, replace it, fire away.
  
> Again I didn't follow why the second condition couldn't just be:
> waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch +
> wakee->nr_wakee_switch) > factor, based on the above explanation from
> Micheal Wang that I quoted.
> and why he's instead doing the whole multiplication thing there that I
> was talking about earlier: "factor * wakee->nr_wakee_switch".
>
> Rephrasing my question in another way, why are we talking the ratio of
> master/slave instead of the sum when comparing if its > factor? I am
> surely missing something here.

Because the heuristic tries to not demolish 1:1 buddies.  Big partner
flip delta means the pair are unlikely to be a communicating pair,
perhaps at high frequency where misses hurt like hell.

> Just taking an example:
>
> Say we have llc_size = 3, we have 3 masters M1, M2 and M3. M1 has 8
> slaves, M2 has 4 slaves and M3 has 4 slaves. Only 1 slave is common
> between all 3 masters. Also to make it a bit more interesting, let s8
> wake up some random task T0. A diagram to show the master/slave
> relation ships could look like:

I don't need the artwork, as soon as you describe a squid orgie in a
bucket with no adult supervision, I can visualize what happens: the
load balancer tries to make pedantically perfect load numbers, caring
not one whit about any of the relationships in either text or graphic
description, stirs bucket vigorously, making squid soup.

IMHO, placement optimization of that is a job for a human, the
scheduler ain't that smart. (quick like bunny -> smart like bunny)

>
> (ASCII diagram: M1 wakes s1..s8; M2 wakes s9, s10, s11, s15; M3 wakes
>  s12, s13, s14, s16 and also s8; s8 in turn wakes T0)
>
>
> Lets consider the case of M1 waking up s8. As per the above diagram,
> M1 has 8 flips and s8 has 4 flips.
>
> With llc_size = 3, the condition
>
> (slave < factor) would return FALSE, so then we would turn to the
> (master < slave * factor) condition. This would be TRUE (8 < 4 * 3),
> so wake_wide would return 0 and would cause s8 to be woken up as
> affine with relation to M1's core.
>
> So basically, it seems the heuristic is saying (with help of the
> second condition - master < slave * factor). that Its a good idea for
> s8 to be affine-woken-up with respect to M1's core. Why is it a good
> idea to do that? It seems to me M1 has already several tasks its
> waking as affine so causing s8 to be woken up affine could be harmful
> and it may be a better choice to wake it up elsewhere.

No matter what you do with s8, you'll be wrong from some POV.. unless
of course s8 is a high frequency waker, then migration is a good bet.

-Mike
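
The decision being debated above boils down to a few lines. Here is a
standalone sketch, with the llc_size factor passed in explicitly and
plain integers standing in for the tasks' wakee_flips fields - an
illustration of the heuristic, not the kernel's actual function
signature:

```c
#include <assert.h>

/*
 * Standalone sketch of the wake_wide() test under discussion.  factor
 * stands in for sd_llc_size; waker_flips/wakee_flips stand in for the
 * tasks' wakee_flips counters.  Returns 1 to wake wide, 0 to attempt
 * an affine wakeup.
 */
static int wake_wide_sketch(unsigned int waker_flips,
                            unsigned int wakee_flips,
                            unsigned int factor)
{
        unsigned int master = waker_flips;
        unsigned int slave = wakee_flips;

        /* Treat whichever side flips more as the master (the 1:N waker). */
        if (master < slave) {
                unsigned int tmp = master;
                master = slave;
                slave = tmp;
        }

        /*
         * Stay affine unless the slave flips at least llc_size partners
         * and the master flips llc_size times as often as the slave,
         * i.e. unless the pair looks like part of a 1:N too big to pull.
         */
        if (slave < factor || master < slave * factor)
                return 0;
        return 1;
}
```

With Joel's numbers (master M1 = 8 flips, slave s8 = 4 flips, llc_size
= 3), 8 < 4 * 3 so the sketch returns 0 and s8 is woken affine; a light
1:1 pair (2 and 2 flips) also stays affine, while a pronounced 1:N pair
(say 100 and 10 flips) goes wide.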

2017-07-29 20:19:07

by Joel Fernandes

[permalink] [raw]
Subject: Re: wake_wide mechanism clarification

Hi Mike,

On Sat, Jul 29, 2017 at 8:07 AM, Mike Galbraith
<[email protected]> wrote:
> On Sat, 2017-07-29 at 01:01 -0700, Joel Fernandes wrote:
>> Hi Mike,
>>
>> I have take spent some time understanding the email thread and
>> previous discussions. Unfortunately the second condition we are
>> checking for in the wake_wide still didn't make sense to me (mentioned
>> below) :-(
>>
>> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith
>> <[email protected]> wrote:
>> > On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>> >> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>> >>
>> >> > That makes sense that we multiply slave's flips by a factor because
>> >> > its low, but I still didn't get why the factor is chosen to be
>> >> > llc_size instead of something else for the multiplication with slave
>> >> > (slave * factor).
>> >
>> >> Yeah I don't know why llc_size was chosen...
<snip>
>> >
>> > The goal of wake wide was to approximate when pulling would be a futile
>> > consolidation effort and counterproductive to scaling. 'course with
>> > ever increasing socket size, any 1:N waker is ever more likely to run
>> > out of CPU for its one and only self (slamming into scaling wall)
>> > before it needing to turn its minions loose to conquer the world.
>>
>> Actually the original question was why do we have the second condition
>> as "master < slave * factor", instead of "master < factor". that's
>> what didn't make sense to me. Why don't we return 0 from wake_wide if
>> master < factor ?
>>
>> Infact, as the factor is set to the llc_size, I think the condition
>> that makes sense to me is:
>>
>> if ((master + slave) < llc_size)
>> return 0;
>
> That says to me turn affine wakeups off for nearly everything.

Ok, I see that now. thanks

>> To explain the second condition above, Michael Wang said the following in [1]
>>
>> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple
>> tasks rely on it, then waker's higher latency will damage all of them, pull
>> wakee seems to be a bad deal."
>
> Yes, "Furthermore". To detect 1:N, Michael chose llc_size as his N. Is
> the one flipping partners at least N/s, and the other about N times as
> often? If so, the two may be part of a too big to wisely pull 1:N.
>
> If you have a better idea, by all means, pull it out. Nobody is

Sure yeah, first I'm trying to understand the heuristic itself which
I'm glad to be making progress with thanks to yours and others' help!

> attached to wake_wide(), in fact, I suspect Peter hates it. I'm not
> fond of it either, it having obvious holes. The only thing it has
> going for it is simplicity. Bend it up, replace it, fire away.
>

Ok, it makes much more sense to me now. Also, for the N:N case,
wouldn't the excessive wake-affine increase latency, and wouldn't
spreading be better? Say the slave and master flips are both much
greater than factor (llc_size): then "slave > factor && master <
slave * factor" would probably hold a lot of the time (and we would
return 0, causing an affine wakeup). That's probably a bad thing,
right, as it could overload the waker's CPU quickly? I guess the
heuristic tries to maximize cache-hits more than reduce latency?

>> Again I didn't follow why the second condition couldn't just be:
>> waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch +
>> wakee->nr_wakee_switch) > factor, based on the above explanation from
>> Micheal Wang that I quoted.
>> and why he's instead doing the whole multiplication thing there that I
>> was talking about earlier: "factor * wakee->nr_wakee_switch".
>>
>> Rephrasing my question in another way, why are we talking the ratio of
>> master/slave instead of the sum when comparing if its > factor? I am
>> surely missing something here.
>
> Because the heuristic tries to not demolish 1:1 buddies. Big partner
> flip delta means the pair are unlikely to be a communicating pair,
> perhaps at high frequency where misses hurt like hell.

But it does seem to me to demolish the N:N communicating pairs from a
latency/load-balancing standpoint. For the case of N readers and N
writers, the ratio (master/slave) comes down to 1:1 and we wake
affine. Hopefully I didn't miss something too obvious about that.

thanks,

-Joel

2017-07-29 22:28:59

by Joel Fernandes

[permalink] [raw]
Subject: Re: wake_wide mechanism clarification

Hi Mike,

On Sat, Jul 29, 2017 at 1:19 PM, Joel Fernandes <[email protected]> wrote:
<snip>
>
>>> To explain the second condition above, Michael Wang said the following in [1]
>>>
>>> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple
>>> tasks rely on it, then waker's higher latency will damage all of them, pull
>>> wakee seems to be a bad deal."
>>
>> Yes, "Furthermore". To detect 1:N, Michael chose llc_size as his N. Is
>> the one flipping partners at least N/s, and the other about N times as
>> often? If so, the two may be part of a too big to wisely pull 1:N.
>>
>> If you have a better idea, by all means, pull it out. Nobody is
>
> Sure yeah, first I'm trying to understand the heuristic itself which
> I'm glad to be making progress with thanks to yours and others' help!
>
>> attached to wake_wide(), in fact, I suspect Peter hates it. I'm not
>> fond of it either, it having obvious holes. The only thing it has
>> going for it is simplicity. Bend it up, replace it, fire away.
>>
>
> Ok, it makes much more sense to me now. Also for the N:N case,
> wouldn't the excessive wake-affine increase the latency and a
> spreading might be better? Say if slave and master flips are much
> greater than factor (llc_size), then slave > factor && master < slave
> * factor, would probably return true a lot (and we would return 0
> causing an affine wakeup). That's probably a bad thing right as it
> could overload the waker's CPU quickly? I guess the heuristic tries to
> maximize cache-hits more than reduce latency?
>
>>> Again I didn't follow why the second condition couldn't just be:
>>> waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch +
>>> wakee->nr_wakee_switch) > factor, based on the above explanation from
>>> Micheal Wang that I quoted.
>>> and why he's instead doing the whole multiplication thing there that I
>>> was talking about earlier: "factor * wakee->nr_wakee_switch".
>>>
>>> Rephrasing my question in another way, why are we talking the ratio of
>>> master/slave instead of the sum when comparing if its > factor? I am
>>> surely missing something here.
>>
>> Because the heuristic tries to not demolish 1:1 buddies. Big partner
>> flip delta means the pair are unlikely to be a communicating pair,
>> perhaps at high frequency where misses hurt like hell.
>
> But it does seem to me to demolish the N:N communicating pairs from a
> latency/load balancing standpoint. For he case of N readers and N
> writers, the ratio (master/slave) comes down to 1:1 and we wake
> affine. Hopefully I didn't miss something too obvious about that.

I think wake_affine() should correctly handle the overloading case I
bring up here, where wake_wide() is too conservative and affines a
lot (I don't have any data for this though, it's just from code
reading), so I take this comment back for this reason.

thanks,

-Joel

2017-07-29 22:41:59

by Joel Fernandes

[permalink] [raw]
Subject: Re: wake_wide mechanism clarification

On Sat, Jul 29, 2017 at 3:28 PM, Joel Fernandes <[email protected]> wrote:
<snip>
>>>> Again I didn't follow why the second condition couldn't just be:
>>>> waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch +
>>>> wakee->nr_wakee_switch) > factor, based on the above explanation from
>>>> Micheal Wang that I quoted.
>>>> and why he's instead doing the whole multiplication thing there that I
>>>> was talking about earlier: "factor * wakee->nr_wakee_switch".
>>>>
>>>> Rephrasing my question in another way, why are we talking the ratio of
>>>> master/slave instead of the sum when comparing if its > factor? I am
>>>> surely missing something here.
>>>
>>> Because the heuristic tries to not demolish 1:1 buddies. Big partner
>>> flip delta means the pair are unlikely to be a communicating pair,
>>> perhaps at high frequency where misses hurt like hell.
>>
>> But it does seem to me to demolish the N:N communicating pairs from a
>> latency/load balancing standpoint. For he case of N readers and N
>> writers, the ratio (master/slave) comes down to 1:1 and we wake
>> affine. Hopefully I didn't miss something too obvious about that.
>
> I think wake_affine() should correctly handle the case (of
> overloading) I bring up here where wake_wide() is too conservative and
> does affine a lot, (I don't have any data for this though, this just
> from code reading), so I take this comment back for this reason.

aargh, nope :( it still runs select_idle_sibling(), albeit on the
previous CPU, even if want_affine is 0 (and doesn't do the wider
wakeup), so the comment still applies. It's easy to get lost in the
code with so many if statements :-\ sorry about the noise :)

thanks,

-Joel

2017-07-31 12:21:53

by Josef Bacik

[permalink] [raw]
Subject: Re: wake_wide mechanism clarification

On Sat, Jul 29, 2017 at 03:41:56PM -0700, Joel Fernandes wrote:
> On Sat, Jul 29, 2017 at 3:28 PM, Joel Fernandes <[email protected]> wrote:
> <snip>
> >>>> Again I didn't follow why the second condition couldn't just be:
> >>>> waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch +
> >>>> wakee->nr_wakee_switch) > factor, based on the above explanation from
> >>>> Micheal Wang that I quoted.
> >>>> and why he's instead doing the whole multiplication thing there that I
> >>>> was talking about earlier: "factor * wakee->nr_wakee_switch".
> >>>>
> >>>> Rephrasing my question in another way, why are we talking the ratio of
> >>>> master/slave instead of the sum when comparing if its > factor? I am
> >>>> surely missing something here.
> >>>
> >>> Because the heuristic tries to not demolish 1:1 buddies. Big partner
> >>> flip delta means the pair are unlikely to be a communicating pair,
> >>> perhaps at high frequency where misses hurt like hell.
> >>
> >> But it does seem to me to demolish the N:N communicating pairs from a
> >> latency/load balancing standpoint. For he case of N readers and N
> >> writers, the ratio (master/slave) comes down to 1:1 and we wake
> >> affine. Hopefully I didn't miss something too obvious about that.
> >
> > I think wake_affine() should correctly handle the case (of
> > overloading) I bring up here where wake_wide() is too conservative and
> > does affine a lot, (I don't have any data for this though, this just
> > from code reading), so I take this comment back for this reason.
>
> aargh, nope :( it still runs select_idle_sibling although on the
> previous CPU even if want_affine is 0 (and doesn't do the wider
> wakeup..), so the comment still applies.. its easy to get lost into
> the code with so many if statements :-\ sorry about the noise :)
>

I've been working in this area recently because of a cpu imbalance problem.
Wake_wide() definitely makes it so we're waking affine way too often, but I
think messing with wake_wide() to solve that problem is the wrong solution. This
is just a heuristic to see if we should wake affine, the simpler the better. I
solved the problem of waking affine too often like this

https://marc.info/?l=linux-kernel&m=150003849602535&w=2

So why do you care about wake_wide() anyway? Are you observing some problem
that you suspect is affected by the affine wakeup stuff? Or are you just trying
to understand what is going on for fun? Cause if you are just doing this for
fun you are a very strange person, thanks,

Josef

2017-07-31 13:42:29

by Mike Galbraith

[permalink] [raw]
Subject: Re: wake_wide mechanism clarification

On Mon, 2017-07-31 at 12:21 +0000, Josef Bacik wrote:
>
> I've been working in this area recently because of a cpu imbalance problem.
> Wake_wide() definitely makes it so we're waking affine way too often, but I
> think messing with wake_waide to solve that problem is the wrong solution. This
> is just a heuristic to see if we should wake affine, the simpler the better. I
> solved the problem of waking affine too often like this
>
> https://marc.info/?l=linux-kernel&m=150003849602535&w=2

Wait a minute, that's not quite fair :)  Wake_wide() can't be blamed
for causing too frequent affine wakeups when what it does is filter
some out.  While it may not reject aggressively enough for you (why you
bent it up to be very aggressive), seems the problem from your loads
POV is the scheduler generally being too eager to bounce.

I've also played with rate limiting migration per task, but it had
negative effects too: when idle/periodic balance pulls buddies apart,
rate limiting inhibits them quickly finding each other again, making
undoing all that hard load balancer work a throughput win.  Sigh.

-Mike

2017-07-31 14:48:10

by Josef Bacik

[permalink] [raw]
Subject: Re: wake_wide mechanism clarification

On Mon, Jul 31, 2017 at 03:42:25PM +0200, Mike Galbraith wrote:
> On Mon, 2017-07-31 at 12:21 +0000, Josef Bacik wrote:
> >
> > I've been working in this area recently because of a cpu imbalance problem.
> > Wake_wide() definitely makes it so we're waking affine way too often, but I
> > think messing with wake_waide to solve that problem is the wrong solution. This
> > is just a heuristic to see if we should wake affine, the simpler the better. I
> > solved the problem of waking affine too often like this
> >
> > https://marc.info/?l=linux-kernel&m=150003849602535&w=2
>
> Wait a minute, that's not quite fair :)  Wake_wide() can't be blamed
> for causing too frequent affine wakeups when what it does is filter
> some out.  While it may not reject aggressively enough for you (why you
> bent it up to be very aggressive), seems the problem from your loads
> POV is the scheduler generally being too eager to bounce.
>

Yeah sorry, I hate this stuff because it's so hard to talk about without mixing
up different ideas. I should say the scheduler in general prefers to wake
affine super hard, and wake_wide() is conservative in its filtering of this
behavior. The rest still holds true, I think tinkering with it is just hard and
the wrong place to do it, it's a good first step, and we can be smarter further
down.

> I've also played with rate limiting migration per task, but it had
> negative effects too: when idle/periodic balance pulls buddies apart,
> rate limiting inhibits them quickly finding each other again, making
> undoing all that hard load balancer work a throughput win.  Sigh.
>

That's why I did the HZ thing, we don't touch the task for HZ to let things
settle out, and then allow affine wakeups after that. Now HZ may be an eternity
in scheduler time, but I think it's a good middle ground. For our case the box
is loaded constantly, so we basically never want affine wakeups for our app.
For the case where there's spikey behavior we'll return to normal affine wakeups
a short while later.

But from my admittedly limited testing it appears to be a win overall. Thanks,

Josef
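
The HZ idea Josef describes might look roughly like the following
sketch; the struct field and helper names here are made up for
illustration, not taken from his actual patch:

```c
#include <assert.h>

#define HZ 250  /* ticks per second; the real value is config-dependent */

/* Illustrative stand-in for the relevant task_struct state. */
struct task_sketch {
        unsigned long last_migration;   /* jiffies at the last migration */
};

/*
 * Don't consider a recently migrated task for affine wakeups until HZ
 * ticks (~1s) have passed: a constantly loaded box stops bouncing
 * tasks around, while spikey workloads fall back to normal affine
 * wakeups shortly after the spike.
 */
static int affine_wakeup_allowed(const struct task_sketch *p,
                                 unsigned long now)
{
        return now - p->last_migration >= HZ;
}
```

Wired into the wakeup path, this would gate the affine decision rather
than replace wake_wide().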

2017-07-31 16:21:49

by Joel Fernandes

[permalink] [raw]
Subject: Re: wake_wide mechanism clarification

Hi Josef,

On Mon, Jul 31, 2017 at 5:21 AM, Josef Bacik <[email protected]> wrote:
> On Sat, Jul 29, 2017 at 03:41:56PM -0700, Joel Fernandes wrote:
>> On Sat, Jul 29, 2017 at 3:28 PM, Joel Fernandes <[email protected]> wrote:
>> <snip>
>> >>>> Again I didn't follow why the second condition couldn't just be:
>> >>>> waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch +
>> >>>> wakee->nr_wakee_switch) > factor, based on the above explanation from
>> >>>> Micheal Wang that I quoted.
>> >>>> and why he's instead doing the whole multiplication thing there that I
>> >>>> was talking about earlier: "factor * wakee->nr_wakee_switch".
>> >>>>
>> >>>> Rephrasing my question in another way, why are we talking the ratio of
>> >>>> master/slave instead of the sum when comparing if its > factor? I am
>> >>>> surely missing something here.
>> >>>
>> >>> Because the heuristic tries to not demolish 1:1 buddies. Big partner
>> >>> flip delta means the pair are unlikely to be a communicating pair,
>> >>> perhaps at high frequency where misses hurt like hell.
>> >>
>> >> But it does seem to me to demolish the N:N communicating pairs from a
>> >> latency/load balancing standpoint. For he case of N readers and N
>> >> writers, the ratio (master/slave) comes down to 1:1 and we wake
>> >> affine. Hopefully I didn't miss something too obvious about that.
>> >
>> > I think wake_affine() should correctly handle the case (of
>> > overloading) I bring up here where wake_wide() is too conservative and
>> > does affine a lot, (I don't have any data for this though, this just
>> > from code reading), so I take this comment back for this reason.
>>
>> aargh, nope :( it still runs select_idle_sibling although on the
>> previous CPU even if want_affine is 0 (and doesn't do the wider
>> wakeup..), so the comment still applies.. its easy to get lost into
>> the code with so many if statements :-\ sorry about the noise :)
>>
>
> I've been working in this area recently because of a cpu imbalance problem.
> Wake_wide() definitely makes it so we're waking affine way too often, but I
> think messing with wake_waide to solve that problem is the wrong solution. This
> is just a heuristic to see if we should wake affine, the simpler the better. I
> solved the problem of waking affine too often like this
>
> https://marc.info/?l=linux-kernel&m=150003849602535&w=2

Thanks! Cool!

>
> So why do you care about wake_wide() anyway? Are you observing some problem
> that you suspect is affected by the affine wakeup stuff? Or are you just trying

I am dealing with an affine wake up issue, yes.

> to understand what is going on for fun? Cause if you are just doing this for
> fun you are a very strange person, thanks,

It's not just for fun :) Let me give you some background about me: I
work in the Android team, and one of the things I want to do is take
an out-of-tree patch that's been carried for some time and post a more
upstreamable solution - this is related to wake ups from the binder
driver, which does sync wake ups (WF_SYNC). I can't find the exact out
of tree patch publicly since it wasn't posted to a list, but the code
is here [1]. What's worse is I have recently found really bad issues
with this patch itself where runnable times are increased. I should
have provided this background earlier (sorry that I didn't; my plan
was to trigger a separate discussion about the binder sync wake up
thing as part of a patch/proposal I want to post - which I still plan
to do). Anyway, as a part of this effort, I want to understand
wake_wide() better and "respect" it since it sits in the wake up path,
and I wanted my proposal to work well with it, especially since I
want to solve this problem in an upstream-friendly way.

The other reason to trigger the discussion is that I have seen
wake_wide() enough times, and asked enough folks how it works, that it
seemed sensible to ask about it here (I was also advised to ask about
wake_wide on LKML since few people seem to understand how it works),
and hopefully now it's a bit better understood.

I agree with you that instead of spending insane amounts of time on
wake_wide() itself, it's better to directly work on a problem and
collect some data - which is also what I'm doing - but I still thought
it was worth doing some digging into wake_wide() during some free time
I had, thanks.

Cheers,

-Joel

[1] https://android.googlesource.com/kernel/msm.git/+/377e6e28b6097b3d6de7245d3d3def45fc8c9ffc/kernel/sched/fair.c#5492

2017-07-31 16:42:32

by Josef Bacik

[permalink] [raw]
Subject: Re: wake_wide mechanism clarification

On Mon, Jul 31, 2017 at 09:21:46AM -0700, Joel Fernandes wrote:
> Hi Josef,
>
> On Mon, Jul 31, 2017 at 5:21 AM, Josef Bacik <[email protected]> wrote:
> > On Sat, Jul 29, 2017 at 03:41:56PM -0700, Joel Fernandes wrote:
> >> On Sat, Jul 29, 2017 at 3:28 PM, Joel Fernandes <[email protected]> wrote:
> >> <snip>
> >> >>>> Again I didn't follow why the second condition couldn't just be:
> >> >>>> waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch +
> >> >>>> wakee->nr_wakee_switch) > factor, based on the above explanation from
> >> >>>> Micheal Wang that I quoted.
> >> >>>> and why he's instead doing the whole multiplication thing there that I
> >> >>>> was talking about earlier: "factor * wakee->nr_wakee_switch".
> >> >>>>
> >> >>>> Rephrasing my question in another way, why are we talking the ratio of
> >> >>>> master/slave instead of the sum when comparing if its > factor? I am
> >> >>>> surely missing something here.
> >> >>>
> >> >>> Because the heuristic tries to not demolish 1:1 buddies. Big partner
> >> >>> flip delta means the pair are unlikely to be a communicating pair,
> >> >>> perhaps at high frequency where misses hurt like hell.
> >> >>
> >> >> But it does seem to me to demolish the N:N communicating pairs from a
> >> >> latency/load balancing standpoint. For he case of N readers and N
> >> >> writers, the ratio (master/slave) comes down to 1:1 and we wake
> >> >> affine. Hopefully I didn't miss something too obvious about that.
> >> >
> >> > I think wake_affine() should correctly handle the case (of
> >> > overloading) I bring up here where wake_wide() is too conservative and
> >> > does affine a lot, (I don't have any data for this though, this just
> >> > from code reading), so I take this comment back for this reason.
> >>
> >> aargh, nope :( it still runs select_idle_sibling although on the
> >> previous CPU even if want_affine is 0 (and doesn't do the wider
> >> wakeup..), so the comment still applies.. its easy to get lost into
> >> the code with so many if statements :-\ sorry about the noise :)
> >>
> >
> > I've been working in this area recently because of a cpu imbalance problem.
> > Wake_wide() definitely makes it so we're waking affine way too often, but I
> > think messing with wake_waide to solve that problem is the wrong solution. This
> > is just a heuristic to see if we should wake affine, the simpler the better. I
> > solved the problem of waking affine too often like this
> >
> > https://marc.info/?l=linux-kernel&m=150003849602535&w=2
>
> Thanks! Cool!
>
> >
> > So why do you care about wake_wide() anyway? Are you observing some problem
> > that you suspect is affected by the affine wakeup stuff? Or are you just trying
>
> I am dealing with an affine wake up issue, yes.
>
> > to understand what is going on for fun? Cause if you are just doing this for
> > fun you are a very strange person, thanks,
>
> Its not just for fun :) Let me give you some background about me, I
> work in the Android team and one of the things I want to do is to take
> an out of tree patch that's been carried for some time and post a more
> upstreamable solution - this is related to wake ups from the binder
> driver which does sync wake ups (WF_SYNC). I can't find the exact out
> of tree patch publicly since it wasn't posted to a list, but the code
> is here [1]. What's worse is I have recently found really bad issues
> with this patch itself where runnable times are increased. I should
> have provided this background earlier (sorry that I didn't, my plan
> was to trigger a separate discussion about the binder sync wake up
> thing as a part of a patch/proposal I want to post - which I plan to
> do so). Anyway, as a part of this effort, I want to understand
> wake_wide() better and "respect" it since it sits in the wake up path
> and I wanted to my proposal to work well with it, especially since I
> want to solve this problem in an upstream-friendly way.
>
> The other reason to trigger the discussion, is, I have seen
> wake_wide() enough number of times and asked enough number of folks
> how it works that it seems sensible to ask about it here (I was also
> suggested to ask about wake_wide on LKML because since few people
> seemingly understand how it works) and hopefully now its a bit better
> understood.
>
> I agree with you that instead of spending insane amounts of time on
> wake_wide itself, its better to directly work on a problem and collect
> some data - which is also what I'm doing, but I still thought its
> worth doing some digging into wake_wide() during some free time I had,
> thanks.
>

Ok so your use case is to _always_ wake affine if we're doing a sync wakeup. I
_think_ for your case it's best to let wake_affine() make this decision, and
you don't want wake_wide() to filter out your wakeup as not-affine? So perhaps
just throw a test in there to not wake wide if WF_SYNC is set. This makes
logical sense to me as synchronous wakeups are probably going to want to be
affine wakeups, and then we can rely on wake_affine() to do the load checks to
make sure it really makes sense. How does that sound? Thanks,

Josef
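
Josef's suggestion could be sketched as a short-circuit at the top of
the wake_wide() test; the flag value and the standalone shape here are
illustrative, not a mainline diff:

```c
#include <assert.h>

#define WF_SYNC 0x01    /* synchronous wakeup flag (illustrative value) */

/*
 * wake_wide() sketch with the proposed WF_SYNC bypass: synchronous
 * wakeups are never filtered out as non-affine, leaving wake_affine()'s
 * load checks as the final arbiter.  Returns 1 to wake wide, 0 to
 * attempt an affine wakeup.
 */
static int wake_wide_sync_sketch(unsigned int master, unsigned int slave,
                                 unsigned int factor, int wake_flags)
{
        if (wake_flags & WF_SYNC)
                return 0;               /* sync wakeups stay affine */

        if (master < slave) {
                unsigned int tmp = master;
                master = slave;
                slave = tmp;
        }
        if (slave < factor || master < slave * factor)
                return 0;
        return 1;
}
```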

2017-07-31 17:23:50

by Mike Galbraith

[permalink] [raw]
Subject: Re: wake_wide mechanism clarification

On Mon, 2017-07-31 at 14:48 +0000, Josef Bacik wrote:
> On Mon, Jul 31, 2017 at 03:42:25PM +0200, Mike Galbraith wrote:
> > On Mon, 2017-07-31 at 12:21 +0000, Josef Bacik wrote:
> > >
> > > I've been working in this area recently because of a cpu imbalance problem.
> > > Wake_wide() definitely makes it so we're waking affine way too often, but I
> > > think messing with wake_waide to solve that problem is the wrong solution. This
> > > is just a heuristic to see if we should wake affine, the simpler the better. I
> > > solved the problem of waking affine too often like this
> > >
> > > https://marc.info/?l=linux-kernel&m=150003849602535&w=2
> >
> > Wait a minute, that's not quite fair :)  Wake_wide() can't be blamed
> > for causing too frequent affine wakeups when what it does is filter
> > some out.  While it may not reject aggressively enough for you (why you
> > bent it up to be very aggressive), seems the problem from your loads
> > POV is the scheduler generally being too eager to bounce.
> >
>
> Yeah sorry, I hate this stuff because it's so hard to talk about without mixing
> up different ideas. I should say the scheduler in general prefers to wake
> affine super hard, and wake_wide() is conservative in it's filtering of this
> behavior. The rest still holds true, I think tinkering with it is just hard and
> the wrong place to do it, it's a good first step, and we can be smarter further
> down.

Yeah, it's hard, and yeah, bottom line remains unchanged.

> > I've also played with rate limiting migration per task, but it had
> > negative effects too: when idle/periodic balance pulls buddies apart,
> > rate limiting inhibits them quickly finding each other again, making
> > undoing all that hard load balancer work a throughput win.  Sigh.
> >
>
> That's why I did the HZ thing, we don't touch the task for HZ to let things
> settle out, and then allow affine wakeups after that.

I kinda like the way you did it better than what I tried, but until a
means exists to _target_ the win, it's gonna be rob Peter to pay Paul,
swap roles, repeat endlessly.

-Mike

2017-07-31 17:55:07

by Joel Fernandes

Subject: Re: wake_wide mechanism clarification

On Mon, Jul 31, 2017 at 9:42 AM, Josef Bacik <[email protected]> wrote:
<snip>
>>
>> >
>> > So why do you care about wake_wide() anyway? Are you observing some problem
>> > that you suspect is affected by the affine wakeup stuff? Or are you just trying
>>
>> I am dealing with an affine wake up issue, yes.
>>
>> > to understand what is going on for fun? Cause if you are just doing this for
>> > fun you are a very strange person, thanks,
>>
>> Its not just for fun :) Let me give you some background about me, I
>> work in the Android team and one of the things I want to do is to take
>> an out of tree patch that's been carried for some time and post a more
>> upstreamable solution - this is related to wake ups from the binder
>> driver which does sync wake ups (WF_SYNC). I can't find the exact out
>> of tree patch publicly since it wasn't posted to a list, but the code
>> is here [1]. What's worse is I have recently found really bad issues
>> with this patch itself where runnable times are increased. I should
>> have provided this background earlier (sorry that I didn't, my plan
>> was to trigger a separate discussion about the binder sync wake up
>> thing as a part of a patch/proposal I want to post - which I plan to
>> do so). Anyway, as a part of this effort, I want to understand
>> wake_wide() better and "respect" it since it sits in the wake up path
>> and I wanted my proposal to work well with it, especially since I
>> want to solve this problem in an upstream-friendly way.
>>
>> The other reason to trigger the discussion, is, I have seen
>> wake_wide() enough number of times and asked enough number of folks
>> how it works that it seems sensible to ask about it here (I was also
>> suggested to ask about wake_wide on LKML since few people
>> seemingly understand how it works) and hopefully now it's a bit better
>> understood.
>>
>> I agree with you that instead of spending insane amounts of time on
>> wake_wide itself, its better to directly work on a problem and collect
>> some data - which is also what I'm doing, but I still thought its
>> worth doing some digging into wake_wide() during some free time I had,
>> thanks.
>>
>
> Ok so your usecase is to _always_ wake affine if we're doing a sync wakeup. I
> _think_ for your case it's best to make wake_affine() make this decision, and
> you don't want wake_wide() to filter out your wakeup as not-affine? So perhaps
> just throw a test in there to not wake wide if WF_SYNC is set. This makes

Hmm, I was actually thinking that since 'sync' is more of a hint, we
should do the wake_wide() check first anyway, since it's already so
conservative; for the times it does resort to wide, that's probably the
right decision from a scheduling standpoint, to avoid going affine and
piling up too many tasks too quickly. Do you think that's fair?

I tried a quick patch, and doing wake_wide() first and then checking for
sync does seem to work well.

> logical sense to me as synchronous wakeups are probably going to want to be
> affine wakeups, and then we can rely on wake_affine() to do the load checks to
> make sure it really makes sense. How does that sound? Thanks,

Yep that sounds good and I will try that.

What I was thinking was: do the regular wake_wide and wake_affine
checks, and then in select_task_rq_fair, before calling
select_idle_sibling, check whether only 1 task is running on the
waker's CPU, something like:

+ idle_sync = (sync && (new_cpu == cpu) &&
+ cpu_rq(cpu)->nr_running == 1);

and then in select_idle_sibling, something like:

+ /*
+ * If the previous and target CPU share cache and a sync wakeup
+ * is requested and the CPU is about to goto idle, because it
+ * has only the waker running which requested sync, the target
+ * is a better choice for cache affinity and keeping task's
+ * previous core idle and low power state.
+ */
+ if (idle_sync && cpus_share_cache(prev, target))
+ return target;
+

I haven't tested this patch though so I'm not sure it works well yet,
but just sharing the idea. What do you think?
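To make the shape of the idea concrete, here is a tiny userspace sketch of
the two conditions (all names here are invented for illustration; this is
not the actual kernel code):

```c
#include <stdbool.h>

/*
 * Sketch of the idle_sync idea above; names are invented.
 * "sync && new_cpu == waker_cpu && nr_running == 1" models a waker
 * that issued a sync wakeup and is about to go idle itself.
 */
static bool idle_sync_candidate(bool sync, int new_cpu, int waker_cpu,
                                unsigned int waker_nr_running)
{
    return sync && new_cpu == waker_cpu && waker_nr_running == 1;
}

/*
 * If prev and target share cache and the waker is about to go idle,
 * prefer the waker's CPU (target): the cache is warm there and prev
 * can stay idle in a low-power state.
 */
static int pick_target(bool idle_sync, bool share_cache, int prev, int target)
{
    if (idle_sync && share_cache)
        return target;
    return prev; /* stand-in for the normal select_idle_sibling() scan */
}
```

So for a sync wakeup from a CPU with nothing else runnable there,
pick_target() would keep the wakee on the waker's CPU rather than sending
it back to a cache-sharing prev CPU.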

thanks,

-Joel

2017-08-02 08:26:32

by Michael Wang

Subject: Re: wake_wide mechanism clarification

Hi, Joel

On 07/29/2017 10:13 AM, Joel Fernandes wrote:
> +Michael Wang on his current email address (old one bounced). (my
> reply was to Mike Galbraith but I also meant to CC Michael Wang for
> the discussion). Thanks

Just back from vacation and saw this long long discussion...

I think the guys explained the idea well: wake_wide() just tries to filter
out the cases that may congest the waker's domain, as much as possible
and (ideally) without side effects.

The factor at the very beginning was just a static number, picked through
enormous amounts of practical testing; Peter Zijlstra suggested we add the
domain size and make it a flexible factor, which in practice is not that bad.

So the simple answer is we use llc_size since there was no better option at
the time :-)
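For reference, the heuristic as discussed in this thread boils down to
something like the following userspace sketch (sd_llc_size is modeled as a
plain 'factor' parameter, and the wakee_flips counters as plain integers):

```c
/*
 * Userspace sketch of the wake_wide() flip heuristic from the thread.
 * 'factor' stands in for the waker CPU's sd_llc_size.
 */
static int wake_wide_sketch(unsigned int waker_flips,
                            unsigned int wakee_flips,
                            unsigned int factor)
{
    unsigned int master = waker_flips;
    unsigned int slave = wakee_flips;

    if (master < slave) {   /* master is whichever side flips more */
        unsigned int tmp = master;
        master = slave;
        slave = tmp;
    }

    /*
     * Stay affine (return 0) unless the slave flips at least llc_size
     * times and the master flips a further factor of llc_size more
     * often than the slave.
     */
    if (slave < factor || master < slave * factor)
        return 0;
    return 1;
}
```

With llc_size = 3, a waker with 8 flips waking a wakee with 4 flips stays
affine (8 < 4 * 3, Joel's example above), while a waker with 40 flips
against the same wakee would go wide.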

But things change very fast and new features can introduce new cases.
I'm pretty sure if we redid the testing the results would be very different;
however, the idea itself still makes sense to me, at least in theory.

Recently I've also been thinking about the scheduler issue: CFS tries to
find a general solution for all these cases, and the result is obvious, all
the cases suffer some damage and the scheduler itself gets bloated to
achieve the goal of 'taking care of all'.

So in order to achieve maximum performance for a particular workload, some
kind of user-defined scheduler would be an interesting idea :-P

Regards,
Michael Wang



>
> On Sat, Jul 29, 2017 at 1:01 AM, Joel Fernandes <[email protected]> wrote:
>> Hi Mike,
>>
>> I have spent some time understanding the email thread and
>> previous discussions. Unfortunately the second condition we are
>> checking for in the wake_wide still didn't make sense to me (mentioned
>> below) :-(
>>
>> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith
>> <[email protected]> wrote:
>>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>>>
>>>>> That makes sense that we multiply slave's flips by a factor because
>>>>> its low, but I still didn't get why the factor is chosen to be
>>>>> llc_size instead of something else for the multiplication with slave
>>>>> (slave * factor).
>>>
>>>> Yeah I don't know why llc_size was chosen...
>>>
>>> static void update_top_cache_domain(int cpu)
>>> {
>>> struct sched_domain_shared *sds = NULL;
>>> struct sched_domain *sd;
>>> int id = cpu;
>>> int size = 1;
>>>
>>> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>>> if (sd) {
>>> id = cpumask_first(sched_domain_span(sd));
>>> size = cpumask_weight(sched_domain_span(sd));
>>> sds = sd->shared;
>>> }
>>>
>>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>> per_cpu(sd_llc_size, cpu) = size;
>>>
>>> The goal of wake wide was to approximate when pulling would be a futile
>>> consolidation effort and counterproductive to scaling. 'course with
>>> ever increasing socket size, any 1:N waker is ever more likely to run
>>> out of CPU for its one and only self (slamming into scaling wall)
>>> before it needing to turn its minions loose to conquer the world.
>>
>> Actually the original question was why do we have the second condition
>> as "master < slave * factor", instead of "master < factor". that's
>> what didn't make sense to me. Why don't we return 0 from wake_wide if
>> master < factor ?
>>
>> Infact, as the factor is set to the llc_size, I think the condition
>> that makes sense to me is:
>>
>> if ((master + slave) < llc_size)
>> return 0;
>>
>> In other words, if the master flips and the slave flips are totally
>> higher than the llc_size, then we are most likely waking up too many
>> tasks as affine and should then switch to wide to prevent overloading.
>>
>> Digging further into the original patch from Michael Wang (I also CC'd
>> him), this was the code (before you had changed it to master/slave):
>>
>> wakee->nr_wakee_switch > factor &&
>> waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)
>>
>> To explain the second condition above, Michael Wang said the following in [1]
>>
>> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple
>> tasks rely on it, then waker's higher latency will damage all of them, pull
>> wakee seems to be a bad deal."
>>
>> Again I didn't follow why the second condition couldn't just be:
>> waker->nr_wakee_switch > factor, or, (waker->nr_wakee_switch +
>> wakee->nr_wakee_switch) > factor, based on the above explanation from
>> Michael Wang that I quoted,
>> and why he's instead doing the whole multiplication thing there that I
>> was talking about earlier: "factor * wakee->nr_wakee_switch".
>>
>> Rephrasing my question in another way, why are we taking the ratio of
>> master/slave instead of the sum when comparing if its > factor? I am
>> surely missing something here.
>>
>> Just taking an example:
>>
>> Say we have llc_size = 3, we have 3 masters M1, M2 and M3. M1 has 8
>> slaves, M2 has 4 slaves and M3 has 4 slaves. Only 1 slave is common
>> between all 3 masters. Also to make it a bit more interesting, let s8
>> wake up some random task T0. A diagram to show the master/slave
>> relation ships could look like:
>>
>> +-----+
>> | |
>> +------------------------+ +------+ M2 |
>> | | | | |
>> | M1 | | +--+--+----
>> | | | | | | |
>> | | | | | | s15
>> +--+--+--+--+--+--+---+--+---+ v v v
>> | | | | | | | | s9 s10 s11
>> v v v v v v v v
>> s1 s2 s3 s4 s5 s6 s7 s8 ---> T0
>> ^
>> |
>> +-+---+
>> | |
>> | M3 |
>> | |
>> +--+--+-----
>> | | | |
>> v v v v
>> s12 s13 s14 s16
>>
>>
>> Lets consider the case of M1 waking up s8. As per the above diagram,
>> M1 has 8 flips and s8 has 4 flips.
>>
>> With llc_size = 3, the condition
>>
>> (slave < factor) would return FALSE, so then we would turn to the
>> (master < slave * factor) condition. This would be TRUE (8 < 4 * 3),
>> so wake_wide would return 0 and would cause s8 to be woken up as
>> affine with relation to M1's core.
>>
>> So basically, it seems the heuristic is saying (with help of the
>> second condition - master < slave * factor). that Its a good idea for
>> s8 to be affine-woken-up with respect to M1's core. Why is it a good
>> idea to do that? It seems to me M1 has already several tasks its
>> waking as affine so causing s8 to be woken up affine could be harmful
>> and it may be a better choice to wake it up elsewhere.
>>
>> Thanks for your help!
>>
>> -Joel
>>
>> [1] https://lkml.org/lkml/2013/7/4/20
>>
>>
>>>
>>> Something else to consider: network interrupt waking multiple workers
>>> at high frequency. If the waking CPU is idle, do you really want to
>>> place a worker directly in front of a tattoo artist, or is it better
>>> off nearly anywhere but there?
>>>
>>> If the box is virtual, with no topology exposed (or real but ancient)
>>> to let select_idle_sibling() come to the rescue, two workers can even
>>> get tattooed simultaneously (see sync wakeup).
>>>
>>> -Mike

2017-08-03 10:53:26

by Brendan Jackman

Subject: Re: wake_wide mechanism clarification


Hi,

On Fri, Jun 30 2017 at 17:55, Josef Bacik wrote:
> On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote:
>> <snip>
>
> Heuristics are hard, news at 11. I think messing with wake_wide() itself is too
> big of a hammer, we probably need a middle ground. I'm messing with it right
> now so it's too early to say for sure, but i _suspect_ the bigger latencies we
> see are not because we overload the cpu we're trying to pull to, but because
> when we fail to do the wake_affine() we only look at siblings of the affine_sd
> instead of doing the full "find the idlest cpu in the land!" thing.

This is the problem I've been hitting lately. My use case is ARM
big.LITTLE (asymmetric CPU capacities). The workload is 1 task per
CPU: they all do X amount of work, then pthread_barrier_wait
(i.e. sleep until the last task finishes its X and hits the barrier). On
big.LITTLE, the tasks which get a "big" CPU finish faster, and then
those CPUs pull over the tasks that are still running:

v CPU v ->time->

-------------
0 (big) 11111 /333
-------------
1 (big) 22222 /444|
-------------
2 (LITTLE) 333333/
-------------
3 (LITTLE) 444444/
-------------

Now when task 4 hits the barrier (at |) and wakes the others up, there
are 4 tasks with prev_cpu=<big> and 0 tasks with
prev_cpu=<little>. Assuming that those wakeups happen on task 4's CPU
(CPU 1), regardless of wake_affine, want_affine means that we'll only look
in sd_llc (cpus 0 and 1), so tasks will be unnecessarily coscheduled on the
bigs until the next load balance, something like this:

v CPU v ->time->

------------------------
0 (big) 11111 /333 31313\33333
------------------------
1 (big) 22222 /444|424\4444444
------------------------
2 (LITTLE) 333333/ \222222
------------------------
3 (LITTLE) 444444/ \1111
------------------------
^^^
underutilization

> I _think_
> the answer is to make select_idle_sibling() try less hard to find something
> workable and only use obviously idle cpu's in the affine sd, and fall back to
> the full load balance esque search.

So this idea of allowing select_idle_sibling to fail, and falling back
to the slow path, would help me too, I think.

This is also why I was playing with your
don't-affine-recently-balanced-tasks patch[1], which also helps my case
since it prevents want_affine for tasks 3 and 4 (which were recently
moved by an active balance).

[1] https://marc.info/?l=linux-kernel&m=150003849602535&w=2
(also linked elsewhere in this thread)
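A minimal sketch of what that patch's idea amounts to, with invented field
and helper names (the real patch is at the link above):

```c
/*
 * Sketch of the "don't affine recently balanced tasks" idea: record
 * when the load balancer last moved a task, and refuse affine wakeups
 * until some settle time has passed. All names here are invented.
 */
struct task_info {
    unsigned long last_balance_stamp; /* jiffies at last LB migration */
};

static int want_affine_sketch(const struct task_info *t,
                              unsigned long now, unsigned long settle)
{
    /* not affine until 'settle' jiffies (e.g. HZ) after a migration */
    return now - t->last_balance_stamp >= settle;
}
```

In my example above, tasks 3 and 4 would have a fresh
last_balance_stamp after the active balance, so the barrier wakeup would
not consider them for affine placement.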

> This would make affine misses really expensive, but we can probably negate this
> by tracking per task how often it misses the target, and use that to adjust when
> we do wake_affine in the future for that task. Still experimenting some, I just
> found out a few hours ago I need to rework some of this to fix my cpu imbalance
> problem with cgroups, so once I get something working I'll throw it your way to
> take a look. Thanks,

Cheers,
Brendan

2017-08-03 13:15:42

by Josef Bacik

Subject: Re: wake_wide mechanism clarification

On Thu, Aug 03, 2017 at 11:53:19AM +0100, Brendan Jackman wrote:
>
> Hi,
>
> On Fri, Jun 30 2017 at 17:55, Josef Bacik wrote:
> > On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote:
> >> <snip>
> >
> > Heuristics are hard, news at 11. I think messing with wake_wide() itself is too
> > big of a hammer, we probably need a middle ground. I'm messing with it right
> > now so it's too early to say for sure, but i _suspect_ the bigger latencies we
> > see are not because we overload the cpu we're trying to pull to, but because
> > when we fail to do the wake_affine() we only look at siblings of the affine_sd
> > instead of doing the full "find the idlest cpu in the land!" thing.
>
> This is the problem I've been hitting lately. My use case is 1 task per
> CPU on ARM big.LITTLE (asymmetrical CPU capacity). The workload is 1
> task per CPU, they all do X amount of work then pthread_barrier_wait
> (i.e. sleep until the last task finishes its X and hits the barrier). On
> big.LITTLE, the tasks which get a "big" CPU finish faster, and then
> those CPUs pull over the tasks that are still running:
>
> v CPU v ->time->
>
> -------------
> 0 (big) 11111 /333
> -------------
> 1 (big) 22222 /444|
> -------------
> 2 (LITTLE) 333333/
> -------------
> 3 (LITTLE) 444444/
> -------------
>
> Now when task 4 hits the barrier (at |) and wakes the others up, there
> are 4 tasks with prev_cpu=<big> and 0 tasks with
> prev_cpu=<little>. Assuming that those wakeups happen on CPU4,
> regardless of wake_affine, want_affine means that we'll only look in
> sd_llc (cpus 0 and 1), so tasks will be unnecessarily coscheduled on the
> bigs until the next load balance, something like this:
>
> v CPU v ->time->
>
> ------------------------
> 0 (big) 11111 /333 31313\33333
> ------------------------
> 1 (big) 22222 /444|424\4444444
> ------------------------
> 2 (LITTLE) 333333/ \222222
> ------------------------
> 3 (LITTLE) 444444/ \1111
> ------------------------
> ^^^
> underutilization
>
> > I _think_
> > the answer is to make select_idle_sibling() try less hard to find something
> > workable and only use obviously idle cpu's in the affine sd, and fall back to
> > the full load balance esque search.
>
> So this idea of allowing select_idle_sibling to fail, and falling back
> to the slow path, would help me too, I think.

Unfortunately this statement of mine was wrong: I had it in my head that we
would fall back to a find-the-idlest-cpu search provided we failed to wake
affine, but we just do select_idle_sibling() and expect the load balancer to
move things around as needed.

>
> This is also why I was playing with your
> don't-affine-recently-balanced-tasks patch[1], which also helps my case
> since it prevents want_affine for tasks 3 and 4 (which were recently
> moved by an active balance).
>
> [1] https://marc.info/?l=linux-kernel&m=150003849602535&w=2
> (also linked elsewhere in this thread)
>

Would you try Peter's sched/experimental branch and see how that affects your
workload? I'm still messing with my patches and I may drop this one, as it now
appears to be too aggressive with the new set of patches. Thanks,

Josef

2017-08-03 15:05:18

by Brendan Jackman

Subject: Re: wake_wide mechanism clarification


On Thu, Aug 03 2017 at 13:15, Josef Bacik wrote:
> On Thu, Aug 03, 2017 at 11:53:19AM +0100, Brendan Jackman wrote:
>>
>> Hi,
>>
>> On Fri, Jun 30 2017 at 17:55, Josef Bacik wrote:
>> > On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote:
>> >> <snip>
>> >
>> > Heuristics are hard, news at 11. I think messing with wake_wide() itself is too
>> > big of a hammer, we probably need a middle ground. I'm messing with it right
>> > now so it's too early to say for sure, but i _suspect_ the bigger latencies we
>> > see are not because we overload the cpu we're trying to pull to, but because
>> > when we fail to do the wake_affine() we only look at siblings of the affine_sd
>> > instead of doing the full "find the idlest cpu in the land!" thing.
>>
>> This is the problem I've been hitting lately. My use case is 1 task per
>> CPU on ARM big.LITTLE (asymmetrical CPU capacity). The workload is 1
>> task per CPU, they all do X amount of work then pthread_barrier_wait
>> (i.e. sleep until the last task finishes its X and hits the barrier). On
>> big.LITTLE, the tasks which get a "big" CPU finish faster, and then
>> those CPUs pull over the tasks that are still running:
>>
>> v CPU v ->time->
>>
>> -------------
>> 0 (big) 11111 /333
>> -------------
>> 1 (big) 22222 /444|
>> -------------
>> 2 (LITTLE) 333333/
>> -------------
>> 3 (LITTLE) 444444/
>> -------------
>>
>> Now when task 4 hits the barrier (at |) and wakes the others up, there
>> are 4 tasks with prev_cpu=<big> and 0 tasks with
>> prev_cpu=<little>. Assuming that those wakeups happen on CPU4,
>> regardless of wake_affine, want_affine means that we'll only look in
>> sd_llc (cpus 0 and 1), so tasks will be unnecessarily coscheduled on the
>> bigs until the next load balance, something like this:
>>
>> v CPU v ->time->
>>
>> ------------------------
>> 0 (big) 11111 /333 31313\33333
>> ------------------------
>> 1 (big) 22222 /444|424\4444444
>> ------------------------
>> 2 (LITTLE) 333333/ \222222
>> ------------------------
>> 3 (LITTLE) 444444/ \1111
>> ------------------------
>> ^^^
>> underutilization
>>
>> > I _think_
>> > the answer is to make select_idle_sibling() try less hard to find something
>> > workable and only use obviously idle cpu's in the affine sd, and fall back to
>> > the full load balance esque search.
>>
>> So this idea of allowing select_idle_sibling to fail, and falling back
>> to the slow path, would help me too, I think.
>
> Unfortunately this statement of mine was wrong, I had it in my head that we
> would fall back to a find the idlest cpu thing provided we failed to wake
> affine, but we just do select_idle_sibling() and expect the load balancer to
> move things around as needed.

Ah yes, when wake_affine() returns false, we still do
select_idle_sibling (except in prev_cpu's sd_llc instead of
smp_processor_id()'s), and that is the problem faced by my workload. I
thought you were suggesting to change the flow so that
select_idle_sibling can say "I didn't find any idle siblings - go to the
find_idlest_group path".
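The flow I had in mind could be sketched like this (a toy model with
invented names; the idle[] and llc_id[] arrays stand in for real topology
and runqueue state):

```c
#define NCPUS 4

/* fast path: scan only CPUs sharing a last-level cache with 'target' */
static int scan_llc_for_idle(const int *idle, const int *llc_id, int target)
{
    int cpu;

    for (cpu = 0; cpu < NCPUS; cpu++)
        if (llc_id[cpu] == llc_id[target] && idle[cpu])
            return cpu;
    return -1; /* no idle sibling: signal the caller to take the slow path */
}

/* slow path: stand-in for a find_idlest_group()-style system-wide search */
static int find_idlest_sketch(const int *idle)
{
    int cpu;

    for (cpu = 0; cpu < NCPUS; cpu++)
        if (idle[cpu])
            return cpu;
    return 0;
}

static int select_cpu_sketch(const int *idle, const int *llc_id, int target)
{
    int cpu = scan_llc_for_idle(idle, llc_id, target);

    return cpu >= 0 ? cpu : find_idlest_sketch(idle);
}
```

In the big.LITTLE picture above: with both bigs busy and the LITTLEs idle,
a wakeup targeting a big CPU would fall through to a LITTLE instead of
being coscheduled on the bigs.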

>> This is also why I was playing with your
>> don't-affine-recently-balanced-tasks patch[1], which also helps my case
>> since it prevents want_affine for tasks 3 and 4 (which were recently
>> moved by an active balance).
>>
>> [1] https://marc.info/?l=linux-kernel&m=150003849602535&w=2
>> (also linked elsewhere in this thread)
>>
>
> Would you try peter's sched/experimental branch and see how that affects your
> workload? I'm still messing with my patches and I may drop this one as it now
> appears to be too aggressive with the new set of patches. Thanks,

Sure, I'll take a look at those, thanks. I guess the idea of caching
values in LB and then using them in wakeup[2] is a lighter-handed way of
achieving the same thing as last_balance_ts? It won't solve my problem
directly since we'll still only look in sd_llc, but I think it could be
a basis for a way to say "go find_idlest_group path on these tasks" at
the beginning of select_task_rq_fair.

[2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/experimental&id=5b4ed509027a5b6f495e6fe871cae850d5762bef

Thanks,
Brendan

2017-08-03 23:48:56

by Joel Fernandes

Subject: Re: wake_wide mechanism clarification

Hi Michael,

Thanks for your reply.

On Wed, Aug 2, 2017 at 1:26 AM, Michael Wang <[email protected]> wrote:
> Hi, Joel
>
> On 07/29/2017 10:13 AM, Joel Fernandes wrote:
>> +Michael Wang on his current email address (old one bounced). (my
>> reply was to Mike Galbraith but I also meant to CC Michael Wang for
>> the discussion). Thanks
>
> Just back from vacation and saw this long long discussion...
>
> I think guys explained well on the idea, wake_wide() just try to filter
> out the cases that may congest the waker's domain, as much as possible
> without side effect (ideally).
>
> The factor at very beginning is just a static number which picked by
> enormous times of practice testing, Peter Zijlstr suggest we add the
> domain size and make it a flexible factor, which by practice not that bad.

Yeah making it flexible might be better.

>
> So the simple answer is we use the llc_size since no better option at
> that time :-)

:-)

>
> But things changing very fast and new feature can introduce new cases,
> I'm pretty sure if we redo the testing the results will be very different,
> however, the idea itself still make sense to me, at least on theory.
>
> Recently I'm also thinking about the scheduler issue, cfs try to find out
> general solution for all these cases and the best answer is obviously, all
> the cases will suffer some damage and scheduler itself bloated to achieve
> the goal 'taking care of all'.

My team is definitely seeing some weirdness with wake_wide() itself
not working correctly. We're collecting more data and will do a more
thorough analysis of the behavior. I think it can be improved. Will
update as soon as we have something.

thanks,

-Joel

>
> Regards,
> Michael Wang
>
>
>
>>
>> On Sat, Jul 29, 2017 at 1:01 AM, Joel Fernandes <[email protected]> wrote:
>>> Hi Mike,
>>>
>>> I have take spent some time understanding the email thread and
>>> previous discussions. Unfortunately the second condition we are
>>> checking for in the wake_wide still didn't make sense to me (mentioned
>>> below) :-(
>>>
>>> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith
>>> <[email protected]> wrote:
>>>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>>>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>>>>
>>>>>> That makes sense that we multiply slave's flips by a factor because
>>>>>> its low, but I still didn't get why the factor is chosen to be
>>>>>> llc_size instead of something else for the multiplication with slave
>>>>>> (slave * factor).
>>>>
>>>>> Yeah I don't know why llc_size was chosen...
>>>>
>>>> static void update_top_cache_domain(int cpu)
>>>> {
>>>> struct sched_domain_shared *sds = NULL;
>>>> struct sched_domain *sd;
>>>> int id = cpu;
>>>> int size = 1;
>>>>
>>>> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>>>> if (sd) {
>>>> id = cpumask_first(sched_domain_span(sd));
>>>> size = cpumask_weight(sched_domain_span(sd));
>>>> sds = sd->shared;
>>>> }
>>>>
>>>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>>> per_cpu(sd_llc_size, cpu) = size;
>>>>
>>>> The goal of wake wide was to approximate when pulling would be a futile
>>>> consolidation effort and counterproductive to scaling. 'course with
>>>> ever increasing socket size, any 1:N waker is ever more likely to run
>>>> out of CPU for its one and only self (slamming into scaling wall)
>>>> before it needing to turn its minions loose to conquer the world.
>>>
>>> Actually the original question was why do we have the second condition
>>> as "master < slave * factor", instead of "master < factor". that's
>>> what didn't make sense to me. Why don't we return 0 from wake_wide if
>>> master < factor ?
>>>
>>> Infact, as the factor is set to the llc_size, I think the condition
>>> that makes sense to me is:
>>>
>>> if ((master + slave) < llc_size)
>>> return 0;
>>>
>>> In other words, if the master flips and the slave flips are totally
>>> higher than the llc_size, then we are most likely waking up too many
>>> tasks as affine and should then switch to wide to prevent overloading.
>>>
>>> Digging further into the original patch from Michael Wang (I also CC'd
>>> him), this was the code (before you had changed it to master/slave):
>>>
>>> wakee->nr_wakee_switch > factor &&
>>> waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)
>>>
>>> To explain the second condition above, Michael Wang said the following in [1]
>>>
>>> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple
>>> tasks rely on it, then waker's higher latency will damage all of them, pull
>>> wakee seems to be a bad deal."
>>>
>>> Again I didn't follow why the second condition couldn't just be
>>> waker->nr_wakee_switch > factor, or (waker->nr_wakee_switch +
>>> wakee->nr_wakee_switch) > factor, based on the above explanation from
>>> Michael Wang that I quoted, and why he's instead doing the whole
>>> multiplication thing there that I was talking about earlier:
>>> "factor * wakee->nr_wakee_switch".
>>>
>>> Rephrasing my question another way: why are we taking the ratio of
>>> master/slave instead of the sum when comparing against the factor? I
>>> am surely missing something here.
>>>
>>> Just taking an example:
>>>
>>> Say we have llc_size = 3, we have 3 masters M1, M2 and M3. M1 has 8
>>> slaves, M2 has 4 slaves and M3 has 4 slaves. Only 1 slave is common
>>> between all 3 masters. Also to make it a bit more interesting, let s8
>>> wake up some random task T0. A diagram to show the master/slave
>>> relation ships could look like:
>>>
>>> +-----+
>>> | |
>>> +------------------------+ +------+ M2 |
>>> | | | | |
>>> | M1 | | +--+--+----
>>> | | | | | | |
>>> | | | | | | s15
>>> +--+--+--+--+--+--+---+--+---+ v v v
>>> | | | | | | | | s9 s10 s11
>>> v v v v v v v v
>>> s1 s2 s3 s4 s5 s6 s7 s8 ---> T0
>>> ^
>>> |
>>> +-+---+
>>> | |
>>> | M3 |
>>> | |
>>> +--+--+-----
>>> | | | |
>>> v v v v
>>> s12 s13 s14 s16
>>>
>>>
>>> Lets consider the case of M1 waking up s8. As per the above diagram,
>>> M1 has 8 flips and s8 has 4 flips.
>>>
>>> With llc_size = 3, the condition
>>>
>>> (slave < factor) would return FALSE, so then we would turn to the
>>> (master < slave * factor) condition. This would be TRUE (8 < 4 * 3),
>>> so wake_wide would return 0 and would cause s8 to be woken up as
>>> affine with relation to M1's core.
>>>
>>> So basically, it seems the heuristic is saying (with the help of the
>>> second condition, master < slave * factor) that it's a good idea for
>>> s8 to be affine-woken-up with respect to M1's core. Why is it a good
>>> idea to do that? It seems to me M1 already has several tasks it is
>>> waking affine, so causing s8 to be woken up affine could be harmful,
>>> and it may be a better choice to wake it up elsewhere.
>>>
>>> Thanks for your help!
>>>
>>> -Joel
>>>
>>> [1] https://lkml.org/lkml/2013/7/4/20
>>>
>>>
>>>>
>>>> Something else to consider: network interrupt waking multiple workers
>>>> at high frequency. If the waking CPU is idle, do you really want to
>>>> place a worker directly in front of a tattoo artist, or is it better
>>>> off nearly anywhere but there?
>>>>
>>>> If the box is virtual, with no topology exposed (or real but ancient)
>>>> to let select_idle_sibling() come to the rescue, two workers can even
>>>> get tattooed simultaneously (see sync wakeup).
>>>>
>>>> -Mike
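For readers following along, the flip-frequency check being debated can be modeled as a small userspace function. This is a simplified sketch only: the real wake_wide() reads the waker's and wakee's wakee_flips from their task_structs and the factor from the per-cpu sd_llc_size.

```c
/* Userspace model of the wake_wide() heuristic: "master" and "slave"
 * are the flip counts of the waker/wakee pair, "factor" stands in for
 * llc_size.  Returns 1 to wake wide, 0 to allow an affine wakeup. */
static int wake_wide_model(unsigned int master, unsigned int slave,
                           unsigned int factor)
{
	if (master < slave) {		/* order so master flips at least as often */
		unsigned int tmp = master;
		master = slave;
		slave = tmp;
	}
	if (slave < factor || master < slave * factor)
		return 0;		/* stay affine */
	return 1;			/* both flip a lot: wake wide */
}
```

Plugging in Joel's example (factor = 3, master M1 = 8 flips, slave s8 = 4 flips): slave >= factor, but master < slave * factor (8 < 12), so the model returns 0 and the wakeup stays affine, as described above.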

2017-08-09 21:23:22

by Atish Patra

[permalink] [raw]
Subject: Re: wake_wide mechanism clarification

On 08/03/2017 10:05 AM, Brendan Jackman wrote:
>
> On Thu, Aug 03 2017 at 13:15, Josef Bacik wrote:
>> On Thu, Aug 03, 2017 at 11:53:19AM +0100, Brendan Jackman wrote:
>>>
>>> Hi,
>>>
>>> On Fri, Jun 30 2017 at 17:55, Josef Bacik wrote:
>>>> On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote:
>>>>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>>>>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>>>>>
>>>>>>> That makes sense that we multiply slave's flips by a factor because
>>>>>>> its low, but I still didn't get why the factor is chosen to be
>>>>>>> llc_size instead of something else for the multiplication with slave
>>>>>>> (slave * factor).
>>>>>
>>>>>> Yeah I don't know why llc_size was chosen...
>>>>>
>>>>> static void update_top_cache_domain(int cpu)
>>>>> {
>>>>> struct sched_domain_shared *sds = NULL;
>>>>> struct sched_domain *sd;
>>>>> int id = cpu;
>>>>> int size = 1;
>>>>>
>>>>> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>>>>> if (sd) {
>>>>> id = cpumask_first(sched_domain_span(sd));
>>>>> size = cpumask_weight(sched_domain_span(sd));
>>>>> sds = sd->shared;
>>>>> }
>>>>>
>>>>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>>>> per_cpu(sd_llc_size, cpu) = size;
>>>>>
>>>>> The goal of wake wide was to approximate when pulling would be a futile
>>>>> consolidation effort and counterproductive to scaling. 'course with
>>>>> ever increasing socket size, any 1:N waker is ever more likely to run
>>>>> out of CPU for its one and only self (slamming into scaling wall)
>>>>> before it needing to turn its minions loose to conquer the world.
>>>>>
>>>>> Something else to consider: network interrupt waking multiple workers
>>>>> at high frequency. If the waking CPU is idle, do you really want to
>>>>> place a worker directly in front of a tattoo artist, or is it better
>>>>> off nearly anywhere but there?
>>>>>
>>>>> If the box is virtual, with no topology exposed (or real but ancient)
>>>>> to let select_idle_sibling() come to the rescue, two workers can even
>>>>> get tattooed simultaneously (see sync wakeup).
>>>>>
>>>>
>>>> Heuristics are hard, news at 11. I think messing with wake_wide() itself is too
>>>> big of a hammer, we probably need a middle ground. I'm messing with it right
>>>> now so it's too early to say for sure, but i _suspect_ the bigger latencies we
>>>> see are not because we overload the cpu we're trying to pull to, but because
>>>> when we fail to do the wake_affine() we only look at siblings of the affine_sd
>>>> instead of doing the full "find the idlest cpu in the land!" thing.
>>>
>>> This is the problem I've been hitting lately. My use case is 1 task per
>>> CPU on ARM big.LITTLE (asymmetrical CPU capacity). The workload is 1
>>> task per CPU, they all do X amount of work then pthread_barrier_wait
>>> (i.e. sleep until the last task finishes its X and hits the barrier). On
>>> big.LITTLE, the tasks which get a "big" CPU finish faster, and then
>>> those CPUs pull over the tasks that are still running:
>>>
>>> v CPU v ->time->
>>>
>>> -------------
>>> 0 (big) 11111 /333
>>> -------------
>>> 1 (big) 22222 /444|
>>> -------------
>>> 2 (LITTLE) 333333/
>>> -------------
>>> 3 (LITTLE) 444444/
>>> -------------
>>>
>>> Now when task 4 hits the barrier (at |) and wakes the others up, there
>>> are 4 tasks with prev_cpu=<big> and 0 tasks with
>>> prev_cpu=<little>. Assuming that those wakeups happen on CPU4,
>>> regardless of wake_affine, want_affine means that we'll only look in
>>> sd_llc (cpus 0 and 1), so tasks will be unnecessarily coscheduled on the
>>> bigs until the next load balance, something like this:
>>>
>>> v CPU v ->time->
>>>
>>> ------------------------
>>> 0 (big) 11111 /333 31313\33333
>>> ------------------------
>>> 1 (big) 22222 /444|424\4444444
>>> ------------------------
>>> 2 (LITTLE) 333333/ \222222
>>> ------------------------
>>> 3 (LITTLE) 444444/ \1111
>>> ------------------------
>>> ^^^
>>> underutilization
>>>
>>>> I _think_
>>>> the answer is to make select_idle_sibling() try less hard to find something
>>>> workable and only use obviously idle cpu's in the affine sd, and fall back to
>>>> the full load balance esque search.
>>>
>>> So this idea of allowing select_idle_sibling to fail, and falling back
>>> to the slow path, would help me too, I think.
>>
>> Unfortunately this statement of mine was wrong, I had it in my head that we
>> would fall back to a find the idlest cpu thing provided we failed to wake
>> affine, but we just do select_idle_sibling() and expect the load balancer to
>> move things around as needed.
>
> Ah yes, when wake_affine() returns false, we still do
> select_idle_sibling (except in prev_cpu's sd_llc instead of
> smp_processor_id()'s), and that is the problem faced by my workload. I
> thought you were suggesting to change the flow so that
> select_idle_sibling can say "I didn't find any idle siblings - go to the
> find_idlest_group path".
>
>>> This is also why I was playing with your
>>> don't-affine-recently-balanced-tasks patch[1], which also helps my case
>>> since it prevents want_affine for tasks 3 and 4 (which were recently
>>> moved by an active balance).
>>>
>>> [1] https://marc.info/?l=linux-kernel&m=150003849602535&w=2
>>> (also linked elsewhere in this thread)
>>>
>>
>> Would you try peter's sched/experimental branch and see how that affects your
>> workload? I'm still messing with my patches and I may drop this one as it now
>> appears to be too aggressive with the new set of patches. Thanks,
>
> Sure, I'll take a look at those, thanks. I guess the idea of caching
> values in LB and then using them in wakeup[2] is a lighter-handed way of
> achieving the same thing as last_balance_ts? It won't solve my problem
> directly since we'll still only look in sd_llc, but I think it could be
> a basis for a way to say "go find_idlest_group path on these tasks" at
> the beginning of select_task_rq_fair.
>
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/experimental&id=5b4ed509027a5b6f495e6fe871cae850d5762bef
>
> Thanks,
> Brendan
>

Would it be better if it searched for idle cpus in the next higher
domain (if that is not NUMA) instead of doing find_idlest_group?

The above approach [2] will still try to search for an idle cpu in the
llc domain of new_cpu. If it finds one, great. Otherwise, let's search
for an idle cpu in the next higher domain (if that is not NUMA),
excluding the current domain which we have already searched. I think it
would help in cases where LLC < NUMA.
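A toy userspace model of that two-pass search might look as follows. All names and the flat CPU ranges are illustrative; the real code would walk sched_domain spans with cpumask operations:

```c
/* Toy model of the suggested search order: first the wakee's LLC
 * domain, then the next higher (non-NUMA) domain, skipping the CPUs
 * already checked in pass one.  idle[cpu] is nonzero for idle CPUs. */
static int find_idle_cpu_widened(const int *idle,
                                 int llc_start, int llc_len,
                                 int pkg_start, int pkg_len)
{
	int cpu;

	/* Pass 1: the LLC domain. */
	for (cpu = llc_start; cpu < llc_start + llc_len; cpu++)
		if (idle[cpu])
			return cpu;

	/* Pass 2: the next higher domain, excluding the LLC span. */
	for (cpu = pkg_start; cpu < pkg_start + pkg_len; cpu++) {
		if (cpu >= llc_start && cpu < llc_start + llc_len)
			continue;
		if (idle[cpu])
			return cpu;
	}
	return -1;	/* nothing idle below the NUMA level */
}
```

For example, with an LLC of cpus 0-1 inside a package of cpus 0-3, the widened pass can find an idle cpu 2 or 3 that the LLC-only search would miss.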


Regards,
Atish

2017-08-10 09:48:18

by Brendan Jackman

[permalink] [raw]
Subject: Re: wake_wide mechanism clarification


On Wed, Aug 09 2017 at 21:22, Atish Patra wrote:
> On 08/03/2017 10:05 AM, Brendan Jackman wrote:
>>
>> On Thu, Aug 03 2017 at 13:15, Josef Bacik wrote:
>>> On Thu, Aug 03, 2017 at 11:53:19AM +0100, Brendan Jackman wrote:
>>>>
>>>> Hi,
>>>>
>>>> On Fri, Jun 30 2017 at 17:55, Josef Bacik wrote:
>>>>> On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote:
>>>>>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>>>>>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>>>>>>
>>>>>>>> That makes sense that we multiply slave's flips by a factor because
>>>>>>>> its low, but I still didn't get why the factor is chosen to be
>>>>>>>> llc_size instead of something else for the multiplication with slave
>>>>>>>> (slave * factor).
>>>>>>
>>>>>>> Yeah I don't know why llc_size was chosen...
>>>>>>
>>>>>> static void update_top_cache_domain(int cpu)
>>>>>> {
>>>>>> struct sched_domain_shared *sds = NULL;
>>>>>> struct sched_domain *sd;
>>>>>> int id = cpu;
>>>>>> int size = 1;
>>>>>>
>>>>>> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>>>>>> if (sd) {
>>>>>> id = cpumask_first(sched_domain_span(sd));
>>>>>> size = cpumask_weight(sched_domain_span(sd));
>>>>>> sds = sd->shared;
>>>>>> }
>>>>>>
>>>>>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>>>>> per_cpu(sd_llc_size, cpu) = size;
>>>>>>
>>>>>> The goal of wake wide was to approximate when pulling would be a futile
>>>>>> consolidation effort and counterproductive to scaling. 'course with
>>>>>> ever increasing socket size, any 1:N waker is ever more likely to run
>>>>>> out of CPU for its one and only self (slamming into scaling wall)
>>>>>> before it needing to turn its minions loose to conquer the world.
>>>>>>
>>>>>> Something else to consider: network interrupt waking multiple workers
>>>>>> at high frequency. If the waking CPU is idle, do you really want to
>>>>>> place a worker directly in front of a tattoo artist, or is it better
>>>>>> off nearly anywhere but there?
>>>>>>
>>>>>> If the box is virtual, with no topology exposed (or real but ancient)
>>>>>> to let select_idle_sibling() come to the rescue, two workers can even
>>>>>> get tattooed simultaneously (see sync wakeup).
>>>>>>
>>>>>
>>>>> Heuristics are hard, news at 11. I think messing with wake_wide() itself is too
>>>>> big of a hammer, we probably need a middle ground. I'm messing with it right
>>>>> now so it's too early to say for sure, but i _suspect_ the bigger latencies we
>>>>> see are not because we overload the cpu we're trying to pull to, but because
>>>>> when we fail to do the wake_affine() we only look at siblings of the affine_sd
>>>>> instead of doing the full "find the idlest cpu in the land!" thing.
>>>>
>>>> This is the problem I've been hitting lately. My use case is 1 task per
>>>> CPU on ARM big.LITTLE (asymmetrical CPU capacity). The workload is 1
>>>> task per CPU, they all do X amount of work then pthread_barrier_wait
>>>> (i.e. sleep until the last task finishes its X and hits the barrier). On
>>>> big.LITTLE, the tasks which get a "big" CPU finish faster, and then
>>>> those CPUs pull over the tasks that are still running:
>>>>
>>>> v CPU v ->time->
>>>>
>>>> -------------
>>>> 0 (big) 11111 /333
>>>> -------------
>>>> 1 (big) 22222 /444|
>>>> -------------
>>>> 2 (LITTLE) 333333/
>>>> -------------
>>>> 3 (LITTLE) 444444/
>>>> -------------
>>>>
>>>> Now when task 4 hits the barrier (at |) and wakes the others up, there
>>>> are 4 tasks with prev_cpu=<big> and 0 tasks with
>>>> prev_cpu=<little>. Assuming that those wakeups happen on CPU4,
>>>> regardless of wake_affine, want_affine means that we'll only look in
>>>> sd_llc (cpus 0 and 1), so tasks will be unnecessarily coscheduled on the
>>>> bigs until the next load balance, something like this:
>>>>
>>>> v CPU v ->time->
>>>>
>>>> ------------------------
>>>> 0 (big) 11111 /333 31313\33333
>>>> ------------------------
>>>> 1 (big) 22222 /444|424\4444444
>>>> ------------------------
>>>> 2 (LITTLE) 333333/ \222222
>>>> ------------------------
>>>> 3 (LITTLE) 444444/ \1111
>>>> ------------------------
>>>> ^^^
>>>> underutilization
>>>>
>>>>> I _think_
>>>>> the answer is to make select_idle_sibling() try less hard to find something
>>>>> workable and only use obviously idle cpu's in the affine sd, and fall back to
>>>>> the full load balance esque search.
>>>>
>>>> So this idea of allowing select_idle_sibling to fail, and falling back
>>>> to the slow path, would help me too, I think.
>>>
>>> Unfortunately this statement of mine was wrong, I had it in my head that we
>>> would fall back to a find the idlest cpu thing provided we failed to wake
>>> affine, but we just do select_idle_sibling() and expect the load balancer to
>>> move things around as needed.
>>
>> Ah yes, when wake_affine() returns false, we still do
>> select_idle_sibling (except in prev_cpu's sd_llc instead of
>> smp_processor_id()'s), and that is the problem faced by my workload. I
>> thought you were suggesting to change the flow so that
>> select_idle_sibling can say "I didn't find any idle siblings - go to the
>> find_idlest_group path".
>>
>>>> This is also why I was playing with your
>>>> don't-affine-recently-balanced-tasks patch[1], which also helps my case
>>>> since it prevents want_affine for tasks 3 and 4 (which were recently
>>>> moved by an active balance).
>>>>
>>>> [1] https://marc.info/?l=linux-kernel&m=150003849602535&w=2
>>>> (also linked elsewhere in this thread)
>>>>
>>>
>>> Would you try peter's sched/experimental branch and see how that affects your
>>> workload? I'm still messing with my patches and I may drop this one as it now
>>> appears to be too aggressive with the new set of patches. Thanks,
>>
>> Sure, I'll take a look at those, thanks. I guess the idea of caching
>> values in LB and then using them in wakeup[2] is a lighter-handed way of
>> achieving the same thing as last_balance_ts? It won't solve my problem
>> directly since we'll still only look in sd_llc, but I think it could be
>> a basis for a way to say "go find_idlest_group path on these tasks" at
>> the beginning of select_task_rq_fair.
>>
>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/experimental&id=5b4ed509027a5b6f495e6fe871cae850d5762bef
>>
>> Thanks,
>> Brendan
>>
>
> Would it be better if it searched for idle cpus in the next higher
> domain (if that is not NUMA) instead of doing find_idlest_group?
>
> The above approach [2] will still try to search for an idle cpu in the
> llc domain of new_cpu. If it finds one, great. Otherwise, let's search
> for an idle cpu in the next higher domain (if that is not NUMA),
> excluding the current domain which we have already searched. I think
> it would help in cases where LLC < NUMA.

Well for my use case, "search for idle cpus in the next higher domain"
and "go to find_busiest_group & co" mean the same thing. But for NUMA
boxes (where I'm guessing there are >=3 sched_domains?) then yeah that
sounds like a good idea at least to my inexpert ears. Currently it's
either "only look at siblings of either prev_cpu or smp_processor_id()"
or "look through the whole system (or at least as far as sd_flag takes
> us)", so maybe you want a middle ground?

On the other hand, I'm guessing remote wakeups across NUMA nodes are
expensive, so that's why we only look at siblings? There's already a
choice between prev_cpu's and smp_processor_id()'s LLC siblings.

2017-08-10 17:41:52

by Atish Patra

[permalink] [raw]
Subject: Re: wake_wide mechanism clarification

On 08/10/2017 04:48 AM, Brendan Jackman wrote:
>
> On Wed, Aug 09 2017 at 21:22, Atish Patra wrote:
>> On 08/03/2017 10:05 AM, Brendan Jackman wrote:
>>>
>>> On Thu, Aug 03 2017 at 13:15, Josef Bacik wrote:
>>>> On Thu, Aug 03, 2017 at 11:53:19AM +0100, Brendan Jackman wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On Fri, Jun 30 2017 at 17:55, Josef Bacik wrote:
>>>>>> On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote:
>>>>>>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>>>>>>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>>>>>>>
>>>>>>>>> That makes sense that we multiply slave's flips by a factor because
>>>>>>>>> its low, but I still didn't get why the factor is chosen to be
>>>>>>>>> llc_size instead of something else for the multiplication with slave
>>>>>>>>> (slave * factor).
>>>>>>>
>>>>>>>> Yeah I don't know why llc_size was chosen...
>>>>>>>
>>>>>>> static void update_top_cache_domain(int cpu)
>>>>>>> {
>>>>>>> struct sched_domain_shared *sds = NULL;
>>>>>>> struct sched_domain *sd;
>>>>>>> int id = cpu;
>>>>>>> int size = 1;
>>>>>>>
>>>>>>> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>>>>>>> if (sd) {
>>>>>>> id = cpumask_first(sched_domain_span(sd));
>>>>>>> size = cpumask_weight(sched_domain_span(sd));
>>>>>>> sds = sd->shared;
>>>>>>> }
>>>>>>>
>>>>>>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>>>>>> per_cpu(sd_llc_size, cpu) = size;
>>>>>>>
>>>>>>> The goal of wake wide was to approximate when pulling would be a futile
>>>>>>> consolidation effort and counterproductive to scaling. 'course with
>>>>>>> ever increasing socket size, any 1:N waker is ever more likely to run
>>>>>>> out of CPU for its one and only self (slamming into scaling wall)
>>>>>>> before it needing to turn its minions loose to conquer the world.
>>>>>>>
>>>>>>> Something else to consider: network interrupt waking multiple workers
>>>>>>> at high frequency. If the waking CPU is idle, do you really want to
>>>>>>> place a worker directly in front of a tattoo artist, or is it better
>>>>>>> off nearly anywhere but there?
>>>>>>>
>>>>>>> If the box is virtual, with no topology exposed (or real but ancient)
>>>>>>> to let select_idle_sibling() come to the rescue, two workers can even
>>>>>>> get tattooed simultaneously (see sync wakeup).
>>>>>>>
>>>>>>
>>>>>> Heuristics are hard, news at 11. I think messing with wake_wide() itself is too
>>>>>> big of a hammer, we probably need a middle ground. I'm messing with it right
>>>>>> now so it's too early to say for sure, but i _suspect_ the bigger latencies we
>>>>>> see are not because we overload the cpu we're trying to pull to, but because
>>>>>> when we fail to do the wake_affine() we only look at siblings of the affine_sd
>>>>>> instead of doing the full "find the idlest cpu in the land!" thing.
>>>>>
>>>>> This is the problem I've been hitting lately. My use case is 1 task per
>>>>> CPU on ARM big.LITTLE (asymmetrical CPU capacity). The workload is 1
>>>>> task per CPU, they all do X amount of work then pthread_barrier_wait
>>>>> (i.e. sleep until the last task finishes its X and hits the barrier). On
>>>>> big.LITTLE, the tasks which get a "big" CPU finish faster, and then
>>>>> those CPUs pull over the tasks that are still running:
>>>>>
>>>>> v CPU v ->time->
>>>>>
>>>>> -------------
>>>>> 0 (big) 11111 /333
>>>>> -------------
>>>>> 1 (big) 22222 /444|
>>>>> -------------
>>>>> 2 (LITTLE) 333333/
>>>>> -------------
>>>>> 3 (LITTLE) 444444/
>>>>> -------------
>>>>>
>>>>> Now when task 4 hits the barrier (at |) and wakes the others up, there
>>>>> are 4 tasks with prev_cpu=<big> and 0 tasks with
>>>>> prev_cpu=<little>. Assuming that those wakeups happen on CPU4,
>>>>> regardless of wake_affine, want_affine means that we'll only look in
>>>>> sd_llc (cpus 0 and 1), so tasks will be unnecessarily coscheduled on the
>>>>> bigs until the next load balance, something like this:
>>>>>
>>>>> v CPU v ->time->
>>>>>
>>>>> ------------------------
>>>>> 0 (big) 11111 /333 31313\33333
>>>>> ------------------------
>>>>> 1 (big) 22222 /444|424\4444444
>>>>> ------------------------
>>>>> 2 (LITTLE) 333333/ \222222
>>>>> ------------------------
>>>>> 3 (LITTLE) 444444/ \1111
>>>>> ------------------------
>>>>> ^^^
>>>>> underutilization
>>>>>
>>>>>> I _think_
>>>>>> the answer is to make select_idle_sibling() try less hard to find something
>>>>>> workable and only use obviously idle cpu's in the affine sd, and fall back to
>>>>>> the full load balance esque search.
>>>>>
>>>>> So this idea of allowing select_idle_sibling to fail, and falling back
>>>>> to the slow path, would help me too, I think.
>>>>
>>>> Unfortunately this statement of mine was wrong, I had it in my head that we
>>>> would fall back to a find the idlest cpu thing provided we failed to wake
>>>> affine, but we just do select_idle_sibling() and expect the load balancer to
>>>> move things around as needed.
>>>
>>> Ah yes, when wake_affine() returns false, we still do
>>> select_idle_sibling (except in prev_cpu's sd_llc instead of
>>> smp_processor_id()'s), and that is the problem faced by my workload. I
>>> thought you were suggesting to change the flow so that
>>> select_idle_sibling can say "I didn't find any idle siblings - go to the
>>> find_idlest_group path".
>>>
>>>>> This is also why I was playing with your
>>>>> don't-affine-recently-balanced-tasks patch[1], which also helps my case
>>>>> since it prevents want_affine for tasks 3 and 4 (which were recently
>>>>> moved by an active balance).
>>>>>
>>>>> [1] https://marc.info/?l=linux-kernel&m=150003849602535&w=2
>>>>> (also linked elsewhere in this thread)
>>>>>
>>>>
>>>> Would you try peter's sched/experimental branch and see how that affects your
>>>> workload? I'm still messing with my patches and I may drop this one as it now
>>>> appears to be too aggressive with the new set of patches. Thanks,
>>>
>>> Sure, I'll take a look at those, thanks. I guess the idea of caching
>>> values in LB and then using them in wakeup[2] is a lighter-handed way of
>>> achieving the same thing as last_balance_ts? It won't solve my problem
>>> directly since we'll still only look in sd_llc, but I think it could be
>>> a basis for a way to say "go find_idlest_group path on these tasks" at
>>> the beginning of select_task_rq_fair.
>>>
>>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/experimental&id=5b4ed509027a5b6f495e6fe871cae850d5762bef
>>>
>>> Thanks,
>>> Brendan
>>>
>>
>> Would it be better if it searched for idle cpus in the next higher
>> domain (if that is not NUMA) instead of doing find_idlest_group?
>>
>> The above approach [2] will still try to search for an idle cpu in the
>> llc domain of new_cpu. If it finds one, great. Otherwise, let's search
>> for an idle cpu in the next higher domain (if that is not NUMA),
>> excluding the current domain which we have already searched. I think
>> it would help in cases where LLC < NUMA.
>
> Well for my use case, "search for idle cpus in the next higher domain"
> and "go to find_busiest_group & co" mean the same thing.
Yeah. It just may be a tad cheaper to do an idle cpu search than
find_idlest_group and find_idlest_cpu.

> But for NUMA
> boxes (where I'm guessing there are >=3 sched_domains?)
Yeah.
> then yeah that
> sounds like a good idea at least to my inexpert ears. Currently it's
> either "only look at siblings of either prev_cpu or smp_processor_id()"
> or "look through the whole system (or at least as far as sd_flag takes
> us)", so maybe you want a middle ground?
>
> On the other hand, I'm guessing remote wakeups across NUMA nodes are
> expensive, so that's why we only look at siblings?
Correct. But I am suggesting that the above modification affect only
archs where LLC < NUMA, i.e. where you have SMT, LLC, DIE/CPU, NUMA...
domains. In your case, I guess the domains would stop at CPU. Any
architecture with the above domain hierarchy can search wide to find an
idle cpu.

On other architectures, where NUMA is the next domain above LLC, it
won't do anything.

> There's already a
> choice between prev_cpu's and smp_processor_id()'s LLC siblings.
>
Yes. So we may benefit from searching for idle cpus in other LLC groups
as well. Initially, I was worried that it might be too expensive to
search wide, but Peter's recent patch (1ad3aaf3fcd2: "Implement new
approach to scale select_idle_cpu") should abandon the search if it
becomes too costly.

Did I miss any case where it may affect performance negatively?

Regards,
Atish