Subject: Re: wake_wide mechanism clarification
From: Joel Fernandes
Date: Thu, 3 Aug 2017 16:48:53 -0700
To: Michael Wang
Cc: Mike Galbraith, Josef Bacik, Peter Zijlstra, LKML, Juri Lelli,
    Dietmar Eggemann, Patrick Bellasi, Brendan Jackman, Chris Redpath,
    Matt Fleming

Hi Michael,

Thanks for your reply.

On Wed, Aug 2, 2017 at 1:26 AM, Michael Wang wrote:
> Hi, Joel
>
> On 07/29/2017 10:13 AM, Joel Fernandes wrote:
>> +Michael Wang on his current email address (old one bounced). (My
>> reply was to Mike Galbraith, but I also meant to CC Michael Wang for
>> the discussion.) Thanks
>
> Just back from vacation and saw this long, long discussion...
>
> I think the idea has been explained well already: wake_wide() just
> tries to filter out the cases that may congest the waker's domain, as
> much as possible and (ideally) without side effects.
>
> The factor at the very beginning was just a static number picked
> through a lot of practical testing; Peter Zijlstra suggested we use
> the domain size to make it a flexible factor, which in practice turned
> out not bad.

Yeah, making it flexible might be better.

> So the simple answer is that we use llc_size since there was no better
> option at the time :-)

:-)

> But things change very fast and new features can introduce new cases.
> I'm pretty sure that if we redid the testing the results would be very
> different; however, the idea itself still makes sense to me, at least
> in theory.
>
> Recently I've also been thinking about the scheduler issue: CFS tries
> to find a general solution for all these cases, and the inevitable
> result is that all the cases suffer some damage while the scheduler
> itself gets bloated trying to achieve the goal of 'taking care of all'.

My team is definitely seeing some weirdness with wake_wide() itself not
working correctly. We're collecting more data and will do a more
thorough analysis of the behavior. I think it can be improved. Will
update as soon as we have something.

thanks,

-Joel

> Regards,
> Michael Wang
>
>>
>> On Sat, Jul 29, 2017 at 1:01 AM, Joel Fernandes wrote:
>>> Hi Mike,
>>>
>>> I have spent some time understanding the email thread and previous
>>> discussions. Unfortunately the second condition we are checking for
>>> in wake_wide() still didn't make sense to me (mentioned below) :-(
>>>
>>> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith wrote:
>>>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>>>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>>>>
>>>>>> That makes sense that we multiply slave's flips by a factor
>>>>>> because it's low, but I still didn't get why the factor is chosen
>>>>>> to be llc_size instead of something else for the multiplication
>>>>>> with slave (slave * factor).
>>>>
>>>>> Yeah, I don't know why llc_size was chosen...
>>>>
>>>> static void update_top_cache_domain(int cpu)
>>>> {
>>>>         struct sched_domain_shared *sds = NULL;
>>>>         struct sched_domain *sd;
>>>>         int id = cpu;
>>>>         int size = 1;
>>>>
>>>>         sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>>>>         if (sd) {
>>>>                 id = cpumask_first(sched_domain_span(sd));
>>>>                 size = cpumask_weight(sched_domain_span(sd));
>>>>                 sds = sd->shared;
>>>>         }
>>>>
>>>>         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>>>         per_cpu(sd_llc_size, cpu) = size;
>>>>
>>>> The goal of wake_wide was to approximate when pulling would be a
>>>> futile consolidation effort and counterproductive to scaling.
>>>> 'Course with ever-increasing socket size, any 1:N waker is ever more
>>>> likely to run out of CPU for its one and only self (slamming into
>>>> the scaling wall) before it needs to turn its minions loose to
>>>> conquer the world.
>>>
>>> Actually, the original question was why we have the second condition
>>> as "master < slave * factor" instead of "master < factor"; that's
>>> what didn't make sense to me. Why don't we return 0 from wake_wide if
>>> master < factor?
>>>
>>> In fact, as the factor is set to the llc_size, I think the condition
>>> that makes sense to me is:
>>>
>>>     if ((master + slave) < llc_size)
>>>             return 0;
>>>
>>> In other words, if the master flips and the slave flips together are
>>> higher than the llc_size, then we are most likely waking up too many
>>> tasks as affine and should switch to wide to prevent overloading.
>>>
>>> Digging further into the original patch from Michael Wang (I also
>>> CC'd him), this was the code (before you changed it to master/slave):
>>>
>>>     wakee->nr_wakee_switch > factor &&
>>>     waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)
>>>
>>> To explain the second condition above, Michael Wang said the
>>> following in [1]:
>>>
>>> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that
>>> multiple tasks rely on it, then waker's higher latency will damage
>>> all of them, pull wakee seems to be a bad deal."
>>>
>>> Again, I didn't follow why the second condition couldn't just be
>>> waker->nr_wakee_switch > factor, or (waker->nr_wakee_switch +
>>> wakee->nr_wakee_switch) > factor, based on the above explanation from
>>> Michael Wang that I quoted, and why he's instead doing the whole
>>> multiplication thing there that I was talking about earlier:
>>> "factor * wakee->nr_wakee_switch".
>>>
>>> Rephrasing my question another way: why are we taking the ratio of
>>> master/slave instead of the sum when comparing against the factor? I
>>> am surely missing something here.
>>>
>>> Just taking an example:
>>>
>>> Say we have llc_size = 3 and three masters M1, M2 and M3. M1 has 8
>>> slaves, M2 has 4 slaves and M3 has 4 slaves. Only 1 slave is common
>>> between all 3 masters. Also, to make it a bit more interesting, let
>>> s8 wake up some random task T0. A diagram showing the master/slave
>>> relationships could look like this:
>>>
>>>                                       +-----+
>>>                                       |     |
>>>    +------------------------+   +-----+ M2  |
>>>    |                        |   |     |     |
>>>    |           M1           |   |  +--+--+----
>>>    |                        |   |  |  |  |
>>>    |                        |   |  |  |  s15
>>>    +--+--+--+--+--+--+---+--+---+  v  v  v
>>>       |  |  |  |  |  |   |  |     s9 s10 s11
>>>       v  v  v  v  v  v   v  v
>>>       s1 s2 s3 s4 s5 s6  s7 s8 ---> T0
>>>                             ^
>>>                             |
>>>                          +--+--+
>>>                          |     |
>>>                          | M3  |
>>>                          |     |
>>>                          +--+--+-----
>>>                           |  |  |  |
>>>                           v  v  v  v
>>>                          s12 s13 s14 s16
>>>
>>> Let's consider the case of M1 waking up s8. As per the above diagram,
>>> M1 has 8 flips and s8 has 4 flips.
>>>
>>> With llc_size = 3, the condition (slave < factor) would return FALSE,
>>> so we would then turn to the (master < slave * factor) condition.
>>> This would be TRUE (8 < 4 * 3), so wake_wide would return 0 and cause
>>> s8 to be woken up as affine with respect to M1's core.
>>>
>>> So basically, it seems the heuristic is saying (with the help of the
>>> second condition, master < slave * factor) that it's a good idea for
>>> s8 to be woken up affine with respect to M1's core. Why is it a good
>>> idea to do that? It seems to me that M1 already has several tasks it
>>> is waking as affine, so waking s8 affine as well could be harmful,
>>> and it may be a better choice to wake it up elsewhere.
>>>
>>> Thanks for your help!
>>>
>>> -Joel
>>>
>>> [1] https://lkml.org/lkml/2013/7/4/20
>>>
>>>> Something else to consider: a network interrupt waking multiple
>>>> workers at high frequency. If the waking CPU is idle, do you really
>>>> want to place a worker directly in front of a tattoo artist, or is
>>>> it better off nearly anywhere but there?
>>>>
>>>> If the box is virtual, with no topology exposed (or real but
>>>> ancient) to let select_idle_sibling() come to the rescue, two
>>>> workers can even get tattooed simultaneously (see sync wakeup).
>>>>
>>>> -Mike
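
For reference, here is a small standalone sketch of the two conditions
being debated above. This is not the kernel function itself: wake_wide()
below takes the flip counts and the LLC size as plain parameters (the real
code reads them from the waker's and wakee's task structs and from the
per-CPU sd_llc_size), and the numbers in main() simply replay the M1 -> s8
example from the mail (master = 8 flips, slave = 4 flips, llc_size = 3):

#include <stdio.h>

/*
 * Sketch of the heuristic discussed in this thread: return 0 to wake
 * affine, 1 to wake wide. The parameters stand in for the kernel's
 * wakee_flips counters and sd_llc_size; they are illustrative only.
 */
static int wake_wide(unsigned int waker_flips, unsigned int wakee_flips,
                     unsigned int llc_size)
{
        unsigned int master = waker_flips;   /* the 1:N waker */
        unsigned int slave = wakee_flips;    /* the task being woken */
        unsigned int factor = llc_size;
        unsigned int tmp;

        /* Treat whichever side flips more as the master. */
        if (master < slave) {
                tmp = master;
                master = slave;
                slave = tmp;
        }

        /* Stay affine unless both flip counts look "wide" enough. */
        if (slave < factor || master < slave * factor)
                return 0;
        return 1;
}

int main(void)
{
        /* M1 waking s8: 8 flips vs 4 flips, llc_size = 3. */
        printf("wake_wide(M1 -> s8) = %d\n", wake_wide(8, 4, 3));
        return 0;
}

Run as is, this prints 0, i.e. the affine wakeup whose usefulness for M1's
already busy LLC the mail above is questioning.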