Date: Thu, 1 Feb 2018 13:33:35 +0100
From: Peter Zijlstra
To: subhra mazumdar
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com, steven.sistare@oracle.com,
	dhaval.giani@oracle.com
Subject: Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
Message-ID: <20180201123335.GV2249@hirez.programming.kicks-ass.net>
References: <20180129233102.19018-1-subhra.mazumdar@oracle.com>
In-Reply-To: <20180129233102.19018-1-subhra.mazumdar@oracle.com>
User-Agent: Mutt/1.9.2 (2017-12-15)

On Mon, Jan 29, 2018 at 03:31:02PM -0800, subhra mazumdar wrote:
> +#ifdef CONFIG_SCHED_SMT
> +
> +/*
> + * From sd_llc downward update the SMT utilization.

Please don't use utilization for this, we already use that word for
something else.

> + * Skip the lowest level 0.
> + */
> +void smt_util(struct rq *rq, int prev_busy, int next_busy)
> +{
> +	struct sched_domain *sd;
> +	struct sched_group *sg_cpu;
> +	int this_cpu = rq->cpu;
> +	int util;
> +
> +	if (rq->curr_util == UTIL_UNINITIALIZED)
> +		prev_busy = 0;
> +
> +	util = next_busy - prev_busy;
> +	rcu_read_lock();
> +	sd = rcu_dereference(per_cpu(sd_llc, this_cpu));
> +	if (util) {
> +		for_each_lower_domain(sd) {
> +			if (sd->level == 0)
> +				break;

afaict you really only need this for the core, and here you're assuming
everything below the LLC is cores. Would it not be much clearer if you
introduced sd_core?

As is, for_each_lower_domain includes the starting domain, sd->groups
then is the first core group for this cpu. But then you continue to the
SMT domain (on Intel; on other architectures there could be a cluster
domain in between) and then you bail using that sd->level == 0 hack,
because otherwise things would go *bang*.

Really rather messy, this.

I think you want to go allocate sched_domain_shared for the MC level
and use that, much like sd_llc_shared.

> +			sg_cpu = sd->groups;
> +
> +			/* atomically add/subtract the util */
> +			if (util > 0)
> +				atomic_inc((atomic_t *)&sg_cpu->utilization);
> +			else
> +				atomic_dec((atomic_t *)&sg_cpu->utilization);

*sigh*, wth do you need that cast? You didn't get the memo that spurious
casts are where bugs happen and are terrible style at the best of times?

> +		}
> +	}
> +
> +	if (sd)
> +		rq->curr_util = next_busy;
> +	rcu_read_unlock();
> +}

> +	smt_util(rq, 1, 0);

> +	smt_util(rq, 0, 1);

That all seems like an overly complex way to write inc/dec. You turned
what should be a boolean state space (2^1) into something like 2^64.

Also, I think if you count idle instead of busy, you'll do away with
the need for that whole uninitialized thing (see the sketch at the end
of the smt_should_migrate hunk below).

> +#define CPU_PSEUDO_RANDOM(cpu)	(cpu_rq(cpu)->rotor++)

That's a bit of a stretch, calling that pseudo random..

> +/*
> + * Find an idle cpu in the core starting search from random index.
> + */
> +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
>  {
> +	int i, rand_index, rand_cpu;
> +	int this_cpu = smp_processor_id();
>
> +	rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
> +	rand_cpu = sg->cp_array[rand_index];

Right, so yuck.. I know why you need that, but that extra array and
dereference is the reason I never went there.

How much difference does it really make vs the 'normal' wrapping search
from the last CPU?

This really should be a separate patch with separate performance
numbers on it.

> +	for_each_cpu_wrap(i, sched_group_span(sg), rand_cpu) {
> +		if (!cpumask_test_cpu(i, &p->cpus_allowed))
> +			continue;
> +		if (idle_cpu(i))
> +			return i;
> +	}
>
> +	return -1;
>  }
>
>  /*
> + * Compare the SMT utilization of the two groups and select the one which
> + * has more capacity left.
>   */
> +static int smt_should_migrate(struct sched_group *here,
> +			      struct sched_group *there, int self_util)
>  {
> +	int this_cpu = smp_processor_id();
> +	int here_util = here->utilization, there_util = there->utilization;
>
> +	/* Discount self utilization if it belongs to here or there */
> +	if (self_util > 0) {
> +		if (cpumask_test_cpu(this_cpu, sched_group_span(here)))
> +			here_util -= self_util;
> +		else if (cpumask_test_cpu(this_cpu, sched_group_span(there)))
> +			there_util -= self_util;
> +	}
>
> +	/* Return true if other group has more capacity left */
> +	return (there->group_weight - there_util >
> +		here->group_weight - here_util);
>  }

OK, so this effectively picks the idler (least busy) of the two SMT
groups.
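Concretely, the sort of thing I mean is below -- completely untested,
and sched_core_shared / sd_core_shared / nr_idle are invented names,
purely for illustration -- a plain per-core atomic idle count, touched
only from the idle entry/exit path, no casts, no UNINITIALIZED magic:

/* one instance per core, allocated like sd_llc_shared but at core level */
struct sched_core_shared {
	atomic_t	nr_idle;	/* number of idle SMT siblings in this core */
};

DECLARE_PER_CPU(struct sched_core_shared *, sd_core_shared);

static inline void core_idle_enter(int cpu)
{
	struct sched_core_shared *scs = per_cpu(sd_core_shared, cpu);

	if (scs)
		atomic_inc(&scs->nr_idle);
}

static inline void core_idle_exit(int cpu)
{
	struct sched_core_shared *scs = per_cpu(sd_core_shared, cpu);

	if (scs)
		atomic_dec(&scs->nr_idle);
}

/* 'has more capacity left' is then just a comparison of idle counts */
static inline bool core_idler(struct sched_core_shared *there,
			      struct sched_core_shared *here)
{
	return atomic_read(&there->nr_idle) > atomic_read(&here->nr_idle);
}

With something like that, the prev_busy/next_busy delta, the
UTIL_UNINITIALIZED special case and the cast all go away, and
smt_should_migrate() reduces to two atomic_read()s.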
>  /*
> + * SMT balancing works by comparing the target cpu group with a random group
> + * in llc domain. It calls smt_should_migrate to decide which group has more
> + * capacity left. The balancing starts top down fom sd_llc to SMT core level.
> + * Finally idle cpu search is only done in the core.
>   */
> +static int smt_balance(struct task_struct *p, struct sched_domain *sd, int cpu)
>  {
> +	struct sched_group *sg, *src_sg, *start_sg, *tgt_sg;
> +	struct cpumask *span;
> +	int rand_idx, weight;
> +	int cpu_orig = cpu;
> +	int rand_cpu;
> +	int this_cpu = smp_processor_id();
> +	struct rq *this_rq = cpu_rq(this_cpu);
> +	struct rq *rq = cpu_rq(cpu);
> +	int self_util = 0;
> +	int balanced = 0;
>
> +	if (p == this_rq->curr)
> +		self_util = rq->curr_util;
>
> +	for_each_lower_domain(sd) {

Again, I don't think we want to do that. You want to iterate all cores,
and this is a rather cumbersome way to go about doing that. Esp. if
there's a domain in between. Imagine an ARM big.LITTLE with a shared L3
growing SMT, or something along those lines.

> +		if (sd->level == 0)
> +			break;
>
> +		/* Pick a random group that has cpus where the thread can run */
> +		src_sg = sd->groups;
> +		tgt_sg = NULL;
> +		rand_idx = CPU_PSEUDO_RANDOM(this_cpu) % sd->sg_num;
> +		start_sg = sd->sg_array[rand_idx];
> +		sg = start_sg;
> +		do {
> +			span = sched_group_span(sg);
> +			if (sg != src_sg &&
> +			    cpumask_intersects(span, &p->cpus_allowed)) {
> +				tgt_sg = sg;
> +				break;
> +			}
> +			sg = sg->next;
> +		} while (sg != start_sg);

OK, so this picks a 'random' group that has CPUs where our task is
allowed to run (per the above, this group could be a cluster, not a
core).

> +		/*
> +		 * Compare the target group with random group and select the
> +		 * one which has more SMT capacity left. Choose a random CPU in
> +		 * the group to spread the load, then find the group's parent sd
> +		 * so the group's sd is used on the next main loop iteration.
> +		 */
> +		if (tgt_sg && smt_should_migrate(src_sg, tgt_sg, self_util)) {
> +			weight = tgt_sg->group_weight;
> +			rand_idx = CPU_PSEUDO_RANDOM(this_cpu) % weight;
> +			rand_cpu = tgt_sg->cp_array[rand_idx];
> +			for_each_cpu_wrap(cpu, span, rand_cpu) {
> +				if (cpumask_test_cpu(cpu, &p->cpus_allowed))
> +					break;
> +			}
> +			for_each_domain(cpu, sd) {
> +				if (weight < sd->span_weight)
> +					break;
> +			}
> +			balanced = 1;
> +		}

I really wonder how much that fake random stuff yields you vs the rest
of this.

In any case, this will not do for Facebook I'm afraid; as is, they
already disable SIS_AVG_CPU and SIS_PROP (iirc) and always take the hit
of doing a full LLC search. Yes, that is expensive, but they really
want to keep that tail latency down.

Also, only ever considering _one_ other core might affect other
workloads. If you look at those two features above, we should already
reduce the amount of searching we do when there's not a lot of idle
time.
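For reference, this is roughly what those two features already do in
select_idle_cpu() -- paraphrased from memory, so the details are
approximate, but the shape is right: the number of CPUs we're willing
to scan is proportional to the available idle time over the average
scan cost, so a saturated LLC only ever gets a very short search:

	struct sched_domain *this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
	u64 avg_idle = this_rq()->avg_idle / 512;
	u64 avg_cost = this_sd->avg_scan_cost + 1;
	int cpu, nr = INT_MAX;

	if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
		return -1;

	if (sched_feat(SIS_PROP)) {
		u64 span_avg = sd->span_weight * avg_idle;
		if (span_avg > 4 * avg_cost)
			nr = div_u64(span_avg, avg_cost);
		else
			nr = 4;
	}

	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
		if (!--nr)
			return -1;
		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
			continue;
		if (idle_cpu(cpu))
			break;
	}

That is the baseline any random-other-core scheme needs to be compared
against, on both search cost and tail latency.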
> +	}

> +	/* sd is now lowest level SMT core */
> +	if (sd->parent && (balanced || !idle_cpu(cpu_orig))) {
> +		cpu = select_idle_smt(p, sd->parent->groups);
> +		if (cpu >= 0)
>  			return cpu;
>  	}
>
> +	return cpu_orig;
>  }
>
> @@ -6302,6 +6266,31 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>  	return min_cap * 1024 < task_util(p) * capacity_margin;
>  }
>
> +static int select_idle_sibling(struct task_struct *p, int prev, int target)
> +{
> +	struct sched_domain *sd;
> +
> +	if (idle_cpu(target))
> +		return target;
> +
> +	/*
> +	 * If the previous cpu is cache affine and idle, don't be stupid.
> +	 */
> +	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
> +		return prev;
> +
> +	sd = rcu_dereference(per_cpu(sd_llc, target));
> +	if (!sd)
> +		return target;
> +
> +#ifdef CONFIG_SCHED_SMT
> +	return smt_balance(p, sd, target);
> +#else
> +
> +	return select_idle_cpu(p, sd, target);
> +#endif

And that is just wrong....

> +}

So while there are things in there that _might_ work, I really don't
want to see this one giant horrible patch again.
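FWIW, to spell out the 'just wrong' bit: with that #ifdef, the moment
CONFIG_SCHED_SMT is on, the LLC-wide select_idle_cpu() search is gone
entirely and everything funnels through smt_balance(). Whatever shape
this ends up taking, I'd expect select_idle_sibling() to keep the
existing fallback cascade, with the SMT-balance pass as one step in it
rather than a replacement for all of it. Rough sketch only, completely
untested, with smt_balance() used purely as a placeholder:

static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
	struct sched_domain *sd;
	int i;

	if (idle_cpu(target))
		return target;

	/* If the previous cpu is cache affine and idle, don't be stupid. */
	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
		return prev;

	sd = rcu_dereference(per_cpu(sd_llc, target));
	if (!sd)
		return target;

#ifdef CONFIG_SCHED_SMT
	/* SMT balance may suggest a better starting core... */
	i = smt_balance(p, sd, target);
	if ((unsigned int)i < nr_cpumask_bits && idle_cpu(i))
		return i;
#endif

	/* ...but never lose the bounded LLC-wide search. */
	i = select_idle_cpu(p, sd, target);
	if ((unsigned int)i < nr_cpumask_bits)
		return i;

	return target;
}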