Date: Tue, 27 Jun 2017 07:39:06 +0200
From: Peter Zijlstra
To: Rik van Riel
Cc: linux-kernel@vger.kernel.org, jhladky@redhat.com, mingo@kernel.org, mgorman@suse.de
Subject: Re: [PATCH 4/4] sched,fair: remove effective_load
Message-ID: <20170627053906.GA7287@worktop>
In-Reply-To: <1498505689.13083.49.camel@redhat.com>

On Mon, Jun 26, 2017 at 03:34:49PM -0400, Rik van Riel wrote:
> On Mon, 2017-06-26 at 18:12 +0200, Peter Zijlstra wrote:
> > On Mon, Jun 26, 2017 at 11:20:54AM -0400, Rik van Riel wrote:
> >
> > > Oh, indeed. I guess in wake_affine() we should test
> > > whether the CPUs are in the same NUMA node, rather than
> > > doing cpus_share_cache()?
> >
> > Well, since select_idle_sibling() works at the LLC level, the early
> > test on cpus_share_cache(prev, this) seems to actually make sense.
> >
> > But then cutting out all the other bits seems wrong. Not least
> > because !NUMA_BALANCING should also still keep working.
>
> Even when !NUMA_BALANCING, I suspect it makes little sense
> to compare the loads on just the two cores in question, since
> select_idle_sibling() will likely move the task somewhere
> else.
>
> I suspect we want to compare the load on the whole LLC
> for that reason, even with NUMA_BALANCING disabled.

But we don't have that data around :/ One thing we could do is try and
keep a copy of the last s*_lb_stats around in the sched_domain_shared
stuff or something and try and use that (rough sketch below).

That way we can keep things strictly at the LLC level and not confuse
things with NUMA.

Similarly, we could use that same data to avoid re-computing things for
the NUMA domain as well, and do away with numa_stats.

> > > Or, alternatively, have an update_numa_stats() variant
> > > for numa_wake_affine() that works on the LLC level?
> >
> > I think we want to retain the existing behaviour for everything
> > larger than LLC, and when NUMA_BALANCING, smaller than NUMA.
>
> What do you mean by this, exactly?

As you noted, when prev and this are in the same LLC, it doesn't matter
and select_idle_sibling() will do its thing. So anything smaller than
the LLC need not do anything.

When NUMA_BALANCING is enabled we have the numa_stats thing and we can,
as you propose, use that.

If LLC < NUMA, or with !NUMA_BALANCING, we have a region that needs to
do _something_.

> How does the "existing behaviour" of only looking at
> the load on two cores make sense when doing LLC-level
> task placement?

Right, it might not be ideal, but it's what we have now. Supposedly it's
better than not doing anything at all.

But see above for other ideas.
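The sched_domain_shared thing could look something like this
(completely untested; the llc_load/llc_stamp fields and the record hook
are invented for illustration):

/*
 * Cache the result of the last LLC-wide load-balance pass in
 * sched_domain_shared, so wake_affine() can compare whole-LLC load
 * instead of just the two rq loads.
 */
struct sched_domain_shared {
	atomic_t	ref;
	atomic_t	nr_busy_cpus;
	int		has_idle_cores;
	unsigned long	llc_load;	/* snapshot of sds->total_load */
	u64		llc_stamp;	/* sched_clock() at snapshot time */
};

/* After update_sd_lb_stats() has run on an LLC domain: */
static void record_llc_load(struct sched_domain *sd, struct sd_lb_stats *sds)
{
	if (sd->shared && (sd->flags & SD_SHARE_PKG_RESOURCES)) {
		WRITE_ONCE(sd->shared->llc_load, sds->total_load);
		WRITE_ONCE(sd->shared->llc_stamp, sched_clock());
	}
}

/* For wake_affine(); caller holds rcu_read_lock(). */
static unsigned long llc_load(int cpu)
{
	struct sched_domain_shared *sds;

	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
	if (sds)
		return READ_ONCE(sds->llc_load);

	return cpu_rq(cpu)->cfs.avg.load_avg;	/* no LLC domain */
}

wake_affine() could then compare llc_load(this_cpu) against
llc_load(prev_cpu) instead of just the two rq loads, with some staleness
check on llc_stamp.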
> > Also note that your use of task_h_load() in the new numa thing
> > suffers from exactly the problem effective_load() is trying to
> > solve.
>
> Are you saying task_h_load is wrong in task_numa_compare()
> too, then? Should both use effective_load()?

I need more than the few minutes I currently have, but probably. The
question is of course, how much does it matter and how painful will it
be to do it better.
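To recap the problem: much simplified, and ignoring the MIN_SHARES
clipping and the load-average details, the per-level step that
effective_load() did was roughly:

/*
 * shares:     tg->shares of this group
 * cpu_load:   the group's load on this CPU
 * total_load: the group's load summed over all CPUs
 * wl:         load change on this CPU (the task's weight, at the bottom)
 * wg:         change in the group's total load
 *
 * Returns the resulting change of the group's se weight as seen by
 * the parent, i.e. the 'wl' for the next level up.
 */
static long level_delta(long shares, long cpu_load, long total_load,
			long wl, long wg)
{
	long old_w = shares * cpu_load / total_load;
	long new_w = shares * (cpu_load + wl) / (total_load + wg);

	return new_w - old_w;
}

task_h_load() instead scales the task's load by the group's _current_
weight/load ratio, which amounts to assuming the denominator doesn't
change and the shares distribution is already exact; with nested groups
the error compounds. That is the part task_numa_compare() inherits.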