From: Joel Fernandes
Date: Sat, 29 Jul 2017 01:13:43 -0700
Subject: Re: wake_wide mechanism clarification
To: Mike Galbraith
Cc: Josef Bacik, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann,
    Patrick Bellasi, Brendan Jackman, Chris Redpath, Matt Fleming,
    Michael Wang

+Michael Wang on his current email address (the old one bounced).
(My reply was to Mike Galbraith, but I also meant to CC Michael Wang
for the discussion.)

Thanks

On Sat, Jul 29, 2017 at 1:01 AM, Joel Fernandes wrote:
> Hi Mike,
>
> I have spent some time understanding the email thread and the
> previous discussions. Unfortunately, the second condition we are
> checking for in wake_wide still didn't make sense to me (mentioned
> below) :-(
>
> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith wrote:
>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>>
>>> > That makes sense that we multiply slave's flips by a factor because
>>> > it's low, but I still didn't get why the factor is chosen to be
>>> > llc_size instead of something else for the multiplication with slave
>>> > (slave * factor).
>>
>>> Yeah, I don't know why llc_size was chosen...
>>
>> static void update_top_cache_domain(int cpu)
>> {
>>         struct sched_domain_shared *sds = NULL;
>>         struct sched_domain *sd;
>>         int id = cpu;
>>         int size = 1;
>>
>>         sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>>         if (sd) {
>>                 id = cpumask_first(sched_domain_span(sd));
>>                 size = cpumask_weight(sched_domain_span(sd));
>>                 sds = sd->shared;
>>         }
>>
>>         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>         per_cpu(sd_llc_size, cpu) = size;
>>
>> The goal of wake_wide was to approximate when pulling would be a futile
>> consolidation effort and counterproductive to scaling.  'course with
>> ever increasing socket size, any 1:N waker is ever more likely to run
>> out of CPU for its one and only self (slamming into the scaling wall)
>> before needing to turn its minions loose to conquer the world.
>
> Actually, the original question was why we have the second condition
> as "master < slave * factor" instead of "master < factor"; that's
> what didn't make sense to me.  Why don't we return 0 from wake_wide
> if master < factor?
>
> In fact, as the factor is set to llc_size, the condition that makes
> sense to me is:
>
> if ((master + slave) < llc_size)
>         return 0;
>
> In other words, if the master flips and the slave flips together
> exceed llc_size, then we are most likely waking up too many tasks
> as affine and should switch to wide to prevent overloading.
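>
> For reference, here is roughly what the current heuristic looks like
> (a paraphrased sketch of wake_wide() in kernel/sched/fair.c; the
> wakee_flips bookkeeping is left out):
>
> static int wake_wide(struct task_struct *p)
> {
>         unsigned int master = current->wakee_flips;
>         unsigned int slave = p->wakee_flips;
>         int factor = this_cpu_read(sd_llc_size);
>
>         /* Always compare the larger flip count (master) against
>          * the smaller one (slave). */
>         if (master < slave)
>                 swap(master, slave);
>
>         /* The two conditions in question; 0 means wake affine. */
>         if (slave < factor || master < slave * factor)
>                 return 0;
>         return 1;
> }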
>
> Digging further into the original patch from Michael Wang (I also CC'd
> him), this was the code (before you had changed it to master/slave):
>
> wakee->nr_wakee_switch > factor &&
> waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)
>
> To explain the second condition above, Michael Wang said the following
> in [1]:
>
> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple
> tasks rely on it, then waker's higher latency will damage all of them, pull
> wakee seems to be a bad deal."
>
> Again, I didn't follow why the second condition couldn't just be
> waker->nr_wakee_switch > factor, or (waker->nr_wakee_switch +
> wakee->nr_wakee_switch) > factor, based on the above explanation from
> Michael Wang that I quoted, and why he is instead doing the whole
> multiplication thing there that I was talking about earlier:
> "factor * wakee->nr_wakee_switch".
>
> Rephrasing my question another way: why are we taking the ratio of
> master/slave instead of their sum when comparing against the factor?
> I am surely missing something here.
>
> Just taking an example:
>
> Say we have llc_size = 3 and 3 masters M1, M2 and M3.  M1 has 8
> slaves, M2 has 4 slaves and M3 has 4 slaves.  Only 1 slave is common
> between all 3 masters.  Also, to make it a bit more interesting, let
> s8 wake up some random task T0.  A diagram showing the master/slave
> relationships could look like:
>
>                                         +-----+
>                                         |     |
>     +------------------------+   +------+ M2  |
>     |                        |   |      |     |
>     |           M1           |   |      +--+--+----
>     |                        |   |      |  |  |   |
>     |                        |   |      |  |  |  s15
>     +--+--+--+--+--+--+---+--+---+      v  v  v
>        |  |  |  |  |  |   |      |     s9 s10 s11
>        v  v  v  v  v  v   v      v
>       s1 s2 s3 s4 s5 s6  s7     s8 ---> T0
>                                  ^
>                                  |
>                               +-+---+
>                               |     |
>                               | M3  |
>                               |     |
>                               +--+--+-----
>                                  |  |  | |
>                                  v  v  v v
>                                s12 s13 s14 s16
>
> Let's consider the case of M1 waking up s8.  As per the above diagram,
> M1 has 8 flips and s8 has 4 flips.
>
> With llc_size = 3, the condition (slave < factor) would be FALSE, so
> we would then turn to the (master < slave * factor) condition.  This
> would be TRUE (8 < 4 * 3), so wake_wide would return 0 and s8 would be
> woken up affine with respect to M1's core.
>
> So basically, it seems the heuristic is saying (with the help of the
> second condition, master < slave * factor) that it is a good idea for
> s8 to be woken up affine with respect to M1's core.  Why is it a good
> idea to do that?  It seems to me that M1 is already waking several
> tasks as affine, so waking s8 affine as well could be harmful, and it
> may be a better choice to wake it up elsewhere.
>
> Thanks for your help!
>
> -Joel
>
> [1] https://lkml.org/lkml/2013/7/4/20
>
>
>> Something else to consider: network interrupt waking multiple workers
>> at high frequency.  If the waking CPU is idle, do you really want to
>> place a worker directly in front of a tattoo artist, or is it better
>> off nearly anywhere but there?
>>
>> If the box is virtual, with no topology exposed (or real but ancient)
>> to let select_idle_sibling() come to the rescue, two workers can even
>> get tattooed simultaneously (see sync wakeup).
>>
>> -Mike
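
P.S. To make the arithmetic in the example above easy to check, here is
a small stand-alone sketch (hypothetical user-space code, not the kernel
implementation) that plugs the M1/s8 numbers into the two conditions:

/* Evaluate the two wake_wide() conditions for the example numbers. */
#include <stdio.h>

static int wake_wide_sketch(unsigned int master, unsigned int slave,
                            unsigned int factor)
{
        if (master < slave) {           /* compare larger against smaller */
                unsigned int tmp = master;
                master = slave;
                slave = tmp;
        }
        if (slave < factor || master < slave * factor)
                return 0;               /* 0: wake affine */
        return 1;                       /* 1: wake wide */
}

int main(void)
{
        /* M1 (waker) has 8 flips, s8 (wakee) has 4, llc_size is 3. */
        printf("%d\n", wake_wide_sketch(8, 4, 3));
        /* Prints 0: slave < factor is false (4 >= 3), but
         * master < slave * factor is true (8 < 4 * 3), so s8 is
         * woken affine. */
        return 0;
}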