From: Joel Fernandes
Date: Sat, 29 Jul 2017 01:01:25 -0700
Subject: Re: wake_wide mechanism clarification
To: Mike Galbraith
Cc: Josef Bacik, Peter Zijlstra, LKML, Juri Lelli, Dietmar Eggemann,
    Patrick Bellasi, Brendan Jackman, Chris Redpath, Michael Wang,
    Matt Fleming

Hi Mike,

I have spent some time understanding this email thread and the previous
discussions. Unfortunately, the second condition we are checking for in
wake_wide still doesn't make sense to me (details below) :-(

On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith wrote:
> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>
>> > That makes sense, that we multiply the slave's flips by a factor
>> > because it is low, but I still didn't get why the factor is chosen
>> > to be llc_size instead of something else for the multiplication
>> > with the slave (slave * factor).
>
>> Yeah I don't know why llc_size was chosen...
>
> static void update_top_cache_domain(int cpu)
> {
>         struct sched_domain_shared *sds = NULL;
>         struct sched_domain *sd;
>         int id = cpu;
>         int size = 1;
>
>         sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>         if (sd) {
>                 id = cpumask_first(sched_domain_span(sd));
>                 size = cpumask_weight(sched_domain_span(sd));
>                 sds = sd->shared;
>         }
>
>         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>         per_cpu(sd_llc_size, cpu) = size;
>
> The goal of wake wide was to approximate when pulling would be a futile
> consolidation effort and counterproductive to scaling. 'course with
> ever increasing socket size, any 1:N waker is ever more likely to run
> out of CPU for its one and only self (slamming into the scaling wall)
> before needing to turn its minions loose to conquer the world.

Actually, the original question was why we have the second condition as
"master < slave * factor" instead of "master < factor"; that's what
didn't make sense to me. Why don't we return 0 from wake_wide if
master < factor?

In fact, as the factor is set to llc_size, the condition that makes
sense to me is:

    if ((master + slave) < llc_size)
            return 0;

In other words, if the master's flips and the slave's flips together
exceed llc_size, then we are most likely waking up too many tasks as
affine and should switch to wide wakeups to prevent overloading.

Digging further into the original patch from Michael Wang (I also CC'd
him), this was the code (before you changed it to master/slave):

    wakee->nr_wakee_switch > factor &&
    waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)

To explain the second condition above, Michael Wang said the following
in [1]:

"Furthermore, if waker also has a high 'nr_wakee_switch', imply that
multiple tasks rely on it, then waker's higher latency will damage all
of them, pull wakee seems to be a bad deal."
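To make the two forms concrete, here is roughly what the two predicates
look like. This is a sketch from my reading of the patches (comments are
mine), so the exact mainline code may differ slightly. Michael's
original form:

    static int wake_wide(struct task_struct *p)
    {
            int factor = this_cpu_read(sd_llc_size);

            /*
             * Wake wide only when the wakee itself flips between
             * enough wakees, and the waker flips "factor" times
             * more often still.
             */
            if (p->nr_wakee_switch > factor &&
                current->nr_wakee_switch > (factor * p->nr_wakee_switch))
                    return 1;       /* wake wide */

            return 0;               /* wake affine */
    }

And the master/slave form it was later changed to:

    static int wake_wide(struct task_struct *p)
    {
            unsigned int master = current->wakee_flips;
            unsigned int slave = p->wakee_flips;
            int factor = this_cpu_read(sd_llc_size);

            /* Whichever side flips more is treated as the master. */
            if (master < slave)
                    swap(master, slave);

            /* Both conditions must hold to wake wide. */
            if (slave < factor || master < slave * factor)
                    return 0;       /* wake affine */
            return 1;               /* wake wide */
    }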
Again, I didn't follow why the second condition couldn't just be
"waker->nr_wakee_switch > factor", or
"(waker->nr_wakee_switch + wakee->nr_wakee_switch) > factor", based on
the above explanation from Michael Wang that I quoted, and why he is
instead doing the whole multiplication thing there that I was talking
about earlier: "factor * wakee->nr_wakee_switch".

Rephrasing my question another way: why are we taking the ratio of
master to slave instead of their sum when comparing against the factor?
I am surely missing something here.

Just taking an example: say we have llc_size = 3 and three masters M1,
M2 and M3. M1 has 8 slaves, M2 has 4 slaves and M3 has 4 slaves. Only
one slave is common between all 3 masters. Also, to make it a bit more
interesting, let s8 wake up some random task T0. A diagram showing the
master/slave relationships could look like this (M2 and M3 additionally
wake s8, the one slave common to all three masters; those two cross
links are not drawn):

       +-------+            +------+            +------+
       |  M1   |            |  M2  |            |  M3  |
       +--+----+            +--+---+            +--+---+
          |                    |                   |
 +--+--+--+--+--+--+--+    +---+---+---+       +---+---+---+
 |  |  |  |  |  |  |  |    |   |   |   |       |   |   |   |
 v  v  v  v  v  v  v  v    v   v   v   v       v   v   v   v
 s1 s2 s3 s4 s5 s6 s7 s8  s9  s10 s11 s15     s12 s13 s14 s16
                      |
                      v
                      T0

Let's consider the case of M1 waking up s8. As per the above diagram,
M1 has 8 flips and s8 has 4 flips. With llc_size = 3, the condition
(slave < factor) is FALSE (4 < 3 fails), so we then turn to the
(master < slave * factor) condition. This is TRUE (8 < 4 * 3 = 12), so
wake_wide returns 0, which causes s8 to be woken up affine with respect
to M1's core.

So basically the heuristic seems to be saying, with the help of the
second condition (master < slave * factor), that it is a good idea for
s8 to be affine-woken-up with respect to M1's core. Why is that a good
idea? It seems to me that M1 is already waking up several tasks as
affine, so also waking s8 affine could be harmful, and it may be a
better choice to wake it up elsewhere.

Thanks for your help!

-Joel

[1] https://lkml.org/lkml/2013/7/4/20

>
> Something else to consider: network interrupt waking multiple workers
> at high frequency. If the waking CPU is idle, do you really want to
> place a worker directly in front of a tattoo artist, or is it better
> off nearly anywhere but there?
>
> If the box is virtual, with no topology exposed (or real but ancient)
> to let select_idle_sibling() come to the rescue, two workers can even
> get tattooed simultaneously (see sync wakeup).
>
> -Mike