Subject: Re: [PATCH] sched/fair: Fix wake_affine() for !NUMA_BALANCING
From: Chris Wilson
To: Peter Zijlstra, Josef Bacik
Cc: Rik van Riel, linux-kernel@vger.kernel.org, jhladky@redhat.com,
    mingo@kernel.org, mgorman@suse.de
Date: Thu, 24 Aug 2017 23:29:55 +0100
Message-ID: <150361379597.22505.2216754249003350303@mail.alporthouse.com>
In-Reply-To: <20170801214307.ig62jhe7smtisvlr@hirez.programming.kicks-ass.net>
References: <20170626144611.GA5775@worktop>
 <20170626150401.GC4941@worktop>
 <1498490454.13083.45.camel@redhat.com>
 <20170626161250.GD4941@worktop>
 <1498505689.13083.49.camel@redhat.com>
 <20170627053906.GA7287@worktop>
 <1498575358.20270.114.camel@redhat.com>
 <20170801121912.fnykqlq3r5jcbtn2@hirez.programming.kicks-ass.net>
 <20170801192650.GA27425@destiny>
 <20170801214307.ig62jhe7smtisvlr@hirez.programming.kicks-ass.net>

We've stumbled across the same regression on Broxton/Apollolake (J3455).

Quoting Peter Zijlstra (2017-08-01 22:43:07)
> On Tue, Aug 01, 2017 at 03:26:51PM -0400, Josef Bacik wrote:
> > > @@ -7574,6 +7607,18 @@ static inline void update_sd_lb_stats(st
> > >  		env->dst_rq->rd->overload = overload;
> > >  	}
> > > 
> > > +	/*
> > > +	 * Since these are sums over groups they can contain some CPUs
> > > +	 * multiple times for the NUMA domains.
> > > +	 *
> > > +	 * Currently only wake_affine_llc() and find_busiest_group()
> > > +	 * use these numbers, only the last is affected by this problem.
> > > +	 *
> > > +	 * XXX fix that.
> > > +	 */
> > > +	WRITE_ONCE(shared->nr_running, sds->total_running);
> > > +	WRITE_ONCE(shared->load, sds->total_load);
> > > +	WRITE_ONCE(shared->capacity, sds->total_capacity);
> > 
> > This panics on boot for me because shared is NULL.  Same happens in
> > select_task_rq_fair when it tries to do the READ_ONCE.  Here is my .config
> > in case it's something strange with my config.  Thanks,
> 
> Nah, it's just me being an idiot and booting the wrong kernel.  Unless I
> messed up again, this one boots.
> 
> There is state during boot and hotplug where there are no domains, and
> thus also no shared.  Simply ignoring things when that happens should be
> good enough I think.

This is still not as effective as the previous code in spreading across
siblings.

> +/*
> + * Can a task be moved from prev_cpu to this_cpu without causing a load
> + * imbalance that would trigger the load balancer?
> + *
> + * Since we're running on 'stale' values, we might in fact create an
> + * imbalance, but recomputing these values is expensive, as that'd mean
> + * iterating 2 cache domains' worth of CPUs.
> + */
> +static bool
> +wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
> +		int this_cpu, int prev_cpu, int sync)
> +{
> +	struct llc_stats prev_stats, this_stats;
> +	s64 this_eff_load, prev_eff_load;
> +	unsigned long task_load;
> +
> +	get_llc_stats(&prev_stats, prev_cpu);
> +	get_llc_stats(&this_stats, this_cpu);
> +
> +	/*
> +	 * If sync wakeup then subtract the (maximum possible)
> +	 * effect of the currently running task from the load
> +	 * of the current LLC.
> +	 */
> +	if (sync) {
> +		unsigned long current_load = task_h_load(current);
> +
> +		/* in this case load hits 0 and this LLC is considered 'idle' */
> +		if (current_load > this_stats.load)
> +			return true;
> +
> +		this_stats.load -= current_load;
> +	}
> +
> +	/*
> +	 * The has_capacity stuff is not SMT aware, but by trying to balance
> +	 * the nr_running on both ends we try and fill the domain at equal
> +	 * rates, thereby first consuming cores before siblings.
> +	 */
> +
> +	/* if the old cache has capacity, stay there */
> +	if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
> +		return false;
> +
> +	/* if this cache has capacity, come here */
> +	if (this_stats.has_capacity && this_stats.nr_running < prev_stats.nr_running+1)

I think you mean,

	if (this_stats.has_capacity && this_stats.nr_running + 1 < prev_stats.nr_running)

and with that our particular graphics benchmark behaves similarly to
before (the regression appears fixed).  But I'll let Eero double check.
(A quick tabulation of how the two tests differ follows the quoted hunk
below.)

> +		return true;
> +
> +
> +	/*
> +	 * Check to see if we can move the load without causing too much
> +	 * imbalance.
> +	 */
> +	task_load = task_h_load(p);
> +
> +	this_eff_load = 100;
> +	this_eff_load *= prev_stats.capacity;
> +
> +	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
> +	prev_eff_load *= this_stats.capacity;
> +
> +	this_eff_load *= this_stats.load + task_load;
> +	prev_eff_load *= prev_stats.load - task_load;
> +
> +	return this_eff_load <= prev_eff_load;
> +}
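
For illustration only, here's a throwaway userspace sketch (not kernel
code; the helper names are made up) that just tabulates what the posted
test and the suggested test decide for a few hypothetical nr_running
counts, assuming has_capacity is set on this LLC:

	#include <stdbool.h>
	#include <stdio.h>

	/* test as posted: pull the waking task over to this LLC? */
	static bool pull_as_posted(unsigned int this_nr, unsigned int prev_nr)
	{
		return this_nr < prev_nr + 1;
	}

	/* test as suggested above */
	static bool pull_as_suggested(unsigned int this_nr, unsigned int prev_nr)
	{
		return this_nr + 1 < prev_nr;
	}

	int main(void)
	{
		unsigned int this_nr, prev_nr;

		for (this_nr = 0; this_nr <= 2; this_nr++)
			for (prev_nr = 0; prev_nr <= 2; prev_nr++)
				printf("this=%u prev=%u posted=%d suggested=%d\n",
				       this_nr, prev_nr,
				       pull_as_posted(this_nr, prev_nr),
				       pull_as_suggested(this_nr, prev_nr));
		return 0;
	}

The interesting rows are the ties: with this_nr == prev_nr the posted
test still pulls the task over to the waker's LLC, whereas the suggested
one leaves ties (and the one-behind case) to the eff_load comparison
further down, which seems more in keeping with spreading across siblings.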
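
And a back-of-the-envelope check of the final comparison, with made-up
numbers (equal capacities on both LLCs, and imbalance_pct of 117 as I'd
expect on an MC domain -- both are assumptions, not taken from the patch):

	#include <stdio.h>

	int main(void)
	{
		/* all values invented purely for illustration */
		long capacity = 1024;		/* same on both LLCs */
		long imbalance_pct = 117;	/* assumed MC default */
		long this_load = 2048, prev_load = 2048, task_load = 512;

		long this_eff_load = 100 * capacity * (this_load + task_load);
		long prev_eff_load = (100 + (imbalance_pct - 100) / 2) *
				     capacity * (prev_load - task_load);

		/*
		 * Per unit of capacity: 100 * (2048 + 512) = 256000 vs
		 * 108 * (2048 - 512) = 165888, so with both LLCs equally
		 * loaded the move is rejected, as expected.
		 */
		printf("pull=%d\n", this_eff_load <= prev_eff_load);
		return 0;
	}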