Date: Wed, 31 Jul 2013 17:11:41 +0100
From: Mel Gorman
To: Peter Zijlstra
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML
Subject: Re: [PATCH 0/18] Basic scheduler support for automatic NUMA balancing V5
Message-ID: <20130731161141.GX2296@suse.de>
In-Reply-To: <20130731153018.GD3008@twins.programming.kicks-ass.net>

On Wed, Jul 31, 2013 at 05:30:18PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 31, 2013 at 12:57:19PM +0100, Mel Gorman wrote:
> > > Right, so what Ingo did is have the scan rate depend on the convergence.
> > > What exactly did you dislike about that?
> >
> > It depended entirely on properly detecting whether we are converged or
> > not. As things like false sharing detection within THP are still not
> > there, I was worried that it was too easy to make the wrong decision
> > here and keep it pinned at the maximum scan rate.
> >
> > > We could define the convergence as all the faults inside the interleave
> > > mask vs the total faults, and then run at: min + (1 - c)*(max-min).
> >
> > And when we have such things properly in place then I think we can kick
> > away the current crutch.
>
> OK, so I'll go write that patch I suppose ;-)
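
Something like the following completely untested sketch is roughly how
I would read that proposal. p->numa_faults[] and p->numa_interleave_mask
are placeholder names for the per-node fault counters and the proposed
interleave mask rather than the current fields, and since the sysctls
express a scan *period* rather than a rate the formula is applied
inverted:

	/*
	 * c is the fraction of NUMA hinting faults that fell on nodes
	 * inside the task's interleave mask. min + (1 - c)*(max-min)
	 * is in terms of scan rate, so expressed as a period a fully
	 * converged task (c == 1) should sit at the maximum, i.e.
	 * period = min + c*(max - min).
	 */
	static unsigned int task_scan_period(struct task_struct *p)
	{
		unsigned long local = 0, total = 0;
		int nid;

		for_each_online_node(nid) {
			total += p->numa_faults[nid];
			if (node_isset(nid, p->numa_interleave_mask))
				local += p->numa_faults[nid];
		}

		/* No samples yet, stay at the slow end until we know better */
		if (!total)
			return sysctl_numa_balancing_scan_period_max;

		/* period = min + c * (max - min), in integer arithmetic */
		return sysctl_numa_balancing_scan_period_min +
		       (local * (sysctl_numa_balancing_scan_period_max -
				 sysctl_numa_balancing_scan_period_min)) / total;
	}

The degenerate case is the one you point out below: after a migration,
c collapses to 0, the period pins at the minimum and the task scans
flat out.
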
> > > Ah, well the reasoning on that was that all this NUMA business is
> > > 'expensive' so we'd better only bother with tasks that persist long
> > > enough for it to pay off.
> >
> > Which is fair enough, but tasks that lasted *just* longer than the
> > interval still got punished. Processes running on a slightly slower
> > CPU get hurt, meaning it would be a difficult bug report to digest.
> >
> > > In that regard it makes perfect sense to wait a fixed amount of runtime
> > > before we start scanning.
> > >
> > > So it was not a pure hack to make kbuild work again.. what it did was
> > > good though.
> >
> > Maybe we should reintroduce the delay then, but I really would prefer
> > that it was triggered by some sort of event.
>
> Humm:
>
> kernel/sched/fair.c:
>
> /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
> unsigned int sysctl_numa_balancing_scan_delay = 1000;
>
> kernel/sched/core.c:__sched_fork():
>
>	numa_scan_period = sysctl_numa_balancing_scan_delay
>
> It seems it's still there, no need to resuscitate.

Yes, reverting 5bca23035391928c4c7301835accca3551b96cc2 effectively
restores the behaviour you are looking for. It just seems very crude.
Then again, I also should not have left the scan delay on top of the
first_nid check.

> I share your preference for a clear event, although nothing really comes
> to mind. The entire multi-process space seems devoid of useful triggers.
> RSS was another option, but it felt as arbitrary as a plain delay.

Should I revert 5bca23035391928c4c7301835accca3551b96cc2 with an
explanation that it is potentially completely useless in the purely
multi-process shared case?

> > > On that rate-limit, this looks to be a hard-coded number unrelated to
> > > the actual hardware.
> >
> > Guesstimate.
> >
> > > I think we should at the very least make it a configurable number and
> > > preferably scale the number with the SLIT info. Or alternatively
> > > actually measure the node to node bandwidth.
> >
> > Ideally we should just kick it away once scan rate limiting works
> > properly. Let's not make it a tunable just yet so we can avoid having
> > to deprecate it later.
>
> I'm not seeing how the rate-limit as per the convergence is going to
> help here.

It should reduce the potential number of NUMA hinting faults that can
be incurred. However, I accept your point because even then it does not
directly avoid a large number of migration events.

> Suppose we migrate the task to another node and it's going to stay
> there. Then our convergence is going down to 0 (all our memory is
> remote) so we end up at the max scan rate, migrating every single page
> ASAP.
>
> This would completely and utterly saturate any interconnect.

Good point, and we'd arrive back at rate limiting the migration in an
attempt to avoid it.

> Also, in the case where we don't have a fully connected system the
> memory transfers will need multiple hops, which greatly complicates the
> entire accounting trick :-)

Also unfortunately true. The larger the machine, the more likely this
becomes.

-- 
Mel Gorman
SUSE Labs