From: Joel Fernandes
Date: Sat, 29 Jul 2017 01:13:43 -0700
Subject: Re: wake_wide mechanism clarification
To: Mike Galbraith
Cc: Josef Bacik, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann,
    Patrick Bellasi, Brendan Jackman, Chris Redpath, Matt Fleming,
    Michael Wang

+Michael Wang on his current email address (the old one bounced).
(My reply was to Mike Galbraith, but I also meant to CC Michael Wang
for the discussion.)

Thanks

On Sat, Jul 29, 2017 at 1:01 AM, Joel Fernandes wrote:
> Hi Mike,
>
> I have spent some time understanding the email thread and the
> previous discussions. Unfortunately, the second condition we are
> checking for in wake_wide still didn't make sense to me (mentioned
> below) :-(
>
> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith wrote:
>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>>
>>> > That makes sense that we multiply slave's flips by a factor because
>>> > it's low, but I still didn't get why the factor is chosen to be
>>> > llc_size instead of something else for the multiplication with slave
>>> > (slave * factor).
>>
>>> Yeah, I don't know why llc_size was chosen...
>>
>> static void update_top_cache_domain(int cpu)
>> {
>>         struct sched_domain_shared *sds = NULL;
>>         struct sched_domain *sd;
>>         int id = cpu;
>>         int size = 1;
>>
>>         sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>>         if (sd) {
>>                 id = cpumask_first(sched_domain_span(sd));
>>                 size = cpumask_weight(sched_domain_span(sd));
>>                 sds = sd->shared;
>>         }
>>
>>         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>         per_cpu(sd_llc_size, cpu) = size;
>>
>> The goal of wake_wide was to approximate when pulling would be a futile
>> consolidation effort and counterproductive to scaling.  'course with
>> ever increasing socket size, any 1:N waker is ever more likely to run
>> out of CPU for its one and only self (slamming into the scaling wall)
>> before needing to turn its minions loose to conquer the world.
>
> Actually, the original question was why we have the second condition
> as "master < slave * factor" instead of "master < factor"; that's
> what didn't make sense to me.  Why don't we return 0 from wake_wide
> if master < factor?
>
> In fact, as the factor is set to llc_size, the condition that makes
> sense to me is:
>
> if ((master + slave) < llc_size)
>         return 0;
>
> In other words, if the master flips and the slave flips together
> exceed llc_size, then we are most likely waking up too many tasks
> as affine and should switch to wide to prevent overloading.
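>
> For reference, here is roughly what the current heuristic looks like
> (a paraphrased sketch of wake_wide() in kernel/sched/fair.c; the
> wakee_flips bookkeeping is left out):
>
> static int wake_wide(struct task_struct *p)
> {
>         unsigned int master = current->wakee_flips;
>         unsigned int slave = p->wakee_flips;
>         int factor = this_cpu_read(sd_llc_size);
>
>         /* Always compare the larger flip count (master) against
>          * the smaller one (slave). */
>         if (master < slave)
>                 swap(master, slave);
>
>         /* The two conditions in question; 0 means wake affine. */
>         if (slave < factor || master < slave * factor)
>                 return 0;
>         return 1;
> }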
>
> Digging further into the original patch from Michael Wang (I also CC'd
> him), this was the code (before you had changed it to master/slave):
>
> wakee->nr_wakee_switch > factor &&
> waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)
>
> To explain the second condition above, Michael Wang said the following
> in [1]:
>
> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple
> tasks rely on it, then waker's higher latency will damage all of them, pull
> wakee seems to be a bad deal."
>
> Again, I didn't follow why the second condition couldn't just be
> waker->nr_wakee_switch > factor, or (waker->nr_wakee_switch +
> wakee->nr_wakee_switch) > factor, based on the above explanation from
> Michael Wang that I quoted, and why he is instead doing the whole
> multiplication thing there that I was talking about earlier:
> "factor * wakee->nr_wakee_switch".
>
> Rephrasing my question another way: why are we taking the ratio of
> master/slave instead of their sum when comparing against the factor?
> I am surely missing something here.
>
> Just taking an example:
>
> Say we have llc_size = 3 and 3 masters M1, M2 and M3.  M1 has 8
> slaves, M2 has 4 slaves and M3 has 4 slaves.  Only 1 slave is common
> between all 3 masters.  Also, to make it a bit more interesting, let
> s8 wake up some random task T0.  A diagram showing the master/slave
> relationships could look like:
>
>                                         +-----+
>                                         |     |
>     +------------------------+   +------+ M2  |
>     |                        |   |      |     |
>     |           M1           |   |      +--+--+----
>     |                        |   |      |  |  |   |
>     |                        |   |      |  |  |  s15
>     +--+--+--+--+--+--+---+--+---+      v  v  v
>        |  |  |  |  |  |   |      |     s9 s10 s11
>        v  v  v  v  v  v   v      v
>       s1 s2 s3 s4 s5 s6  s7     s8 ---> T0
>                                  ^
>                                  |
>                               +-+---+
>                               |     |
>                               | M3  |
>                               |     |
>                               +--+--+-----
>                                  |  |  | |
>                                  v  v  v v
>                                s12 s13 s14 s16
>
> Let's consider the case of M1 waking up s8.  As per the above diagram,
> M1 has 8 flips and s8 has 4 flips.
>
> With llc_size = 3, the condition (slave < factor) would be FALSE, so
> we would then turn to the (master < slave * factor) condition.  This
> would be TRUE (8 < 4 * 3), so wake_wide would return 0 and s8 would be
> woken up affine with respect to M1's core.
>
> So basically, it seems the heuristic is saying (with the help of the
> second condition, master < slave * factor) that it is a good idea for
> s8 to be woken up affine with respect to M1's core.  Why is it a good
> idea to do that?  It seems to me that M1 is already waking several
> tasks as affine, so waking s8 affine as well could be harmful, and it
> may be a better choice to wake it up elsewhere.
>
> Thanks for your help!
>
> -Joel
>
> [1] https://lkml.org/lkml/2013/7/4/20
>
>
>> Something else to consider: network interrupt waking multiple workers
>> at high frequency.  If the waking CPU is idle, do you really want to
>> place a worker directly in front of a tattoo artist, or is it better
>> off nearly anywhere but there?
>>
>> If the box is virtual, with no topology exposed (or real but ancient)
>> to let select_idle_sibling() come to the rescue, two workers can even
>> get tattooed simultaneously (see sync wakeup).
>>
>> -Mike
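
P.S. To make the arithmetic in the example above easy to check, here is
a small stand-alone sketch (hypothetical user-space code, not the kernel
implementation) that plugs the M1/s8 numbers into the two conditions:

/* Evaluate the two wake_wide() conditions for the example numbers. */
#include <stdio.h>

static int wake_wide_sketch(unsigned int master, unsigned int slave,
                            unsigned int factor)
{
        if (master < slave) {           /* compare larger against smaller */
                unsigned int tmp = master;
                master = slave;
                slave = tmp;
        }
        if (slave < factor || master < slave * factor)
                return 0;               /* 0: wake affine */
        return 1;                       /* 1: wake wide */
}

int main(void)
{
        /* M1 (waker) has 8 flips, s8 (wakee) has 4, llc_size is 3. */
        printf("%d\n", wake_wide_sketch(8, 4, 3));
        /* Prints 0: slave < factor is false (4 >= 3), but
         * master < slave * factor is true (8 < 4 * 3), so s8 is
         * woken affine. */
        return 0;
}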