Subject: Re: wake_wide mechanism clarification
From: Mike Galbraith
To: Joel Fernandes
Cc: Josef Bacik, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann,
    Patrick Bellasi, Brendan Jackman, Chris Redpath, Michael Wang,
    Matt Fleming
Date: Sat, 29 Jul 2017 17:07:25 +0200

On Sat, 2017-07-29 at 01:01 -0700, Joel Fernandes wrote:
> Hi Mike,
>
> I have spent some time understanding the email thread and previous
> discussions. Unfortunately the second condition we are checking for
> in wake_wide still didn't make sense to me (mentioned below) :-(
>
> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith wrote:
> > On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
> >> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
> >>
> >> > That makes sense that we multiply slave's flips by a factor
> >> > because it's low, but I still didn't get why the factor is
> >> > chosen to be llc_size instead of something else for the
> >> > multiplication with slave (slave * factor).
> >>
> >> Yeah I don't know why llc_size was chosen...
> >
> > static void update_top_cache_domain(int cpu)
> > {
> >         struct sched_domain_shared *sds = NULL;
> >         struct sched_domain *sd;
> >         int id = cpu;
> >         int size = 1;
> >
> >         sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
> >         if (sd) {
> >                 id = cpumask_first(sched_domain_span(sd));
> >                 size = cpumask_weight(sched_domain_span(sd));
> >                 sds = sd->shared;
> >         }
> >
> >         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> >         per_cpu(sd_llc_size, cpu) = size;
> >
> > The goal of wake_wide was to approximate when pulling would be a
> > futile consolidation effort and counterproductive to scaling.
> > 'course with ever increasing socket size, any 1:N waker is ever
> > more likely to run out of CPU for its one and only self (slamming
> > into the scaling wall) before it needs to turn its minions loose to
> > conquer the world.
>
> Actually, the original question was why we have the second condition
> as "master < slave * factor" instead of "master < factor"; that's
> what didn't make sense to me. Why don't we return 0 from wake_wide
> if master < factor?
>
> In fact, as the factor is set to the llc_size, I think the condition
> that makes sense to me is:
>
>         if ((master + slave) < llc_size)
>                 return 0;

That says to me turn affine wakeups off for nearly everything.

> In other words, if the master flips and the slave flips together are
> higher than the llc_size, then we are most likely waking up too many
> tasks as affine and should then switch to wide to prevent
> overloading.

The heuristic is mostly about the ratio of flip counts.
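For reference, the check under discussion lives in kernel/sched/fair.c;
below is a rough sketch of how wake_wide() looked around the time of
this thread (minor details may differ between kernel versions):

/*
 * Detect M:N waker/wakee relationships via a switching-frequency
 * heuristic.  A "flip" is recorded each time a task wakes a different
 * partner than last time, so wakee_flips approximates how many
 * distinct partners a task is juggling.  If the waker/wakee flip
 * counts look like a 1:N relationship too large for one LLC, return 1
 * and skip the affine (pull-to-waker) wakeup path.
 */
static int wake_wide(struct task_struct *p)
{
        unsigned int master = current->wakee_flips;
        unsigned int slave = p->wakee_flips;
        int factor = this_cpu_read(sd_llc_size);

        /* Make "master" the side with the higher flip count. */
        if (master < slave)
                swap(master, slave);

        /*
         * Stay affine (return 0) unless the slave flips at least
         * llc_size times and the master flips at least llc_size times
         * as often as the slave.
         */
        if (slave < factor || master < slave * factor)
                return 0;
        return 1;
}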
> Digging further into the original patch from Michael Wang (I also
> CC'd him), this was the code (before you had changed it to
> master/slave):
>
>         wakee->nr_wakee_switch > factor &&
>         waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)

Yeah, it was originally unidirectional.

> To explain the second condition above, Michael Wang said the
> following in [1]:
>
> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that
> multiple tasks rely on it, then waker's higher latency will damage
> all of them, pull wakee seems to be a bad deal."

Yes, "Furthermore".  To detect 1:N, Michael chose llc_size as his N.
Is the one flipping partners at least N/s, and the other about N times
as often?  If so, the two may be part of a 1:N too big to wisely pull.

If you have a better idea, by all means, pull it out.  Nobody is
attached to wake_wide(); in fact, I suspect Peter hates it.  I'm not
fond of it either, it having obvious holes.  The only thing it has
going for it is simplicity.  Bend it up, replace it, fire away.

> Again, I didn't follow why the second condition couldn't just be
> waker->nr_wakee_switch > factor, or (waker->nr_wakee_switch +
> wakee->nr_wakee_switch) > factor, based on the above explanation
> from Michael Wang that I quoted, and why he's instead doing the
> whole multiplication thing there that I was talking about earlier:
> "factor * wakee->nr_wakee_switch".
>
> Rephrasing my question another way: why are we taking the ratio of
> master/slave instead of the sum when comparing against the factor?
> I am surely missing something here.

Because the heuristic tries not to demolish 1:1 buddies.  A big
partner-flip delta means the two are unlikely to be a communicating
pair, perhaps at high frequency where misses hurt like hell.

> Just taking an example:
>
> Say we have llc_size = 3, and we have 3 masters M1, M2 and M3. M1
> has 8 slaves, M2 has 4 slaves and M3 has 4 slaves. Only 1 slave is
> common between all 3 masters. Also, to make it a bit more
> interesting, let s8 wake up some random task T0. A diagram to show
> the master/slave relationships could look like:

I don't need the artwork; as soon as you describe a squid orgy in a
bucket with no adult supervision, I can visualize what happens: the
load balancer tries to make pedantically perfect load numbers, caring
not one whit about any of the relationships in either text or graphic
description, stirs the bucket vigorously, making squid soup.

IMHO, placement optimization of that is a job for a human; the
scheduler ain't that smart. (quick like bunny -> smart like bunny)

> [ASCII diagram of masters M1, M2 and M3 fanning out to their slaves
> s1..s16, with s8 also waking T0]
>
> Let's consider the case of M1 waking up s8. As per the above
> diagram, M1 has 8 flips and s8 has 4 flips.
>
> With llc_size = 3, the condition (slave < factor) would return
> FALSE, so we would then turn to the (master < slave * factor)
> condition. This would be TRUE (8 < 4 * 3), so wake_wide would return
> 0 and would cause s8 to be woken up as affine with respect to M1's
> core.
>
> So basically, it seems the heuristic is saying (with the help of the
> second condition - master < slave * factor) that it's a good idea
> for s8 to be affine-woken-up with respect to M1's core. Why is it a
> good idea to do that?
> It seems to me M1 already has several tasks it's waking as affine,
> so causing s8 to be woken up affine could be harmful, and it may be
> a better choice to wake it up elsewhere.

No matter what you do with s8, you'll be wrong from some POV... unless
of course s8 is a high frequency waker, in which case migration is a
good bet.

	-Mike
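For completeness, the flip counts that wake_wide() compares are
maintained at wakeup time by record_wakee() in kernel/sched/fair.c;
roughly, in the kernels of this era it looked like the sketch below
(again a sketch, not verbatim from any particular tree):

static void record_wakee(struct task_struct *p)
{
        /*
         * Decay wakee_flips by half about once per second, so the
         * count behaves roughly like a flips-per-second rate; tasks
         * with less than one wakeup per jiffy will not build up many
         * flips.
         */
        if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
                current->wakee_flips >>= 1;
                current->wakee_flip_decay_ts = jiffies;
        }

        /* Count a "flip" only when the wakee differs from the last one. */
        if (current->last_wakee != p) {
                current->last_wakee = p;
                current->wakee_flips++;
        }
}

The once-per-second decay is why the thread talks about flips per
second (the "at least N/s" above): wakee_flips only stays large for a
task that keeps switching wakees at a high rate.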