Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751127AbbEFRGP (ORCPT ); Wed, 6 May 2015 13:06:15 -0400 Received: from mx1.redhat.com ([209.132.183.28]:50067 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750965AbbEFRGN (ORCPT ); Wed, 6 May 2015 13:06:13 -0400 Message-ID: <554A4A04.2010509@redhat.com> Date: Wed, 06 May 2015 13:06:12 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: Peter Zijlstra CC: Artem Bityutskiy , linux-kernel@vger.kernel.org, mgorman@suse.de, jhladky@redhat.com Subject: Re: [PATCH] numa,sched: only consider less busy nodes as numa balancing destination References: <1430908530.7444.145.camel@sauron.fi.intel.com> <20150506114128.0c846a37@cuia.bos.redhat.com> <20150506170038.GB23123@twins.programming.kicks-ass.net> In-Reply-To: <20150506170038.GB23123@twins.programming.kicks-ass.net> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3578 Lines: 95 On 05/06/2015 01:00 PM, Peter Zijlstra wrote: > On Wed, May 06, 2015 at 11:41:28AM -0400, Rik van Riel wrote: > >> Peter, Mel, I think it may be time to stop waiting for the impedance >> mismatch between the load balancer and NUMA balancing to be resolved, >> and try to just avoid the issue in the NUMA balancing code... > > That's a wee bit unfair since we 'all' decided to let the numa thing > rest for a while. So obviously that issue didn't get resolved. I'm not blaming anyone, I know I was involved in the decision to let the NUMA code rest for a while, too. After a year of just sitting there, this is the only big bug affecting the NUMA balancing code that I have heard about. >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index ffeaa4105e48..480e6a35ab35 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -1409,6 +1409,30 @@ static void task_numa_find_cpu(struct task_numa_env *env, >> } >> } >> >> +/* Only move tasks to a NUMA node less busy than the current node. */ >> +static bool numa_has_capacity(struct task_numa_env *env) >> +{ >> + struct numa_stats *src = &env->src_stats; >> + struct numa_stats *dst = &env->dst_stats; >> + >> + if (src->has_free_capacity && !dst->has_free_capacity) >> + return false; >> + >> + /* >> + * Only consider a task move if the source has a higher destination >> + * than the destination, corrected for CPU capacity on each node. >> + * >> + * src->load dst->load >> + * --------------------- vs --------------------- >> + * src->compute_capacity dst->compute_capacity >> + */ >> + if (src->load * dst->compute_capacity > >> + dst->load * src->compute_capacity) >> + return true; >> + >> + return false; >> +} >> + >> static int task_numa_migrate(struct task_struct *p) >> { >> struct task_numa_env env = { >> @@ -1463,7 +1487,8 @@ static int task_numa_migrate(struct task_struct *p) >> update_numa_stats(&env.dst_stats, env.dst_nid); >> >> /* Try to find a spot on the preferred nid. */ >> - task_numa_find_cpu(&env, taskimp, groupimp); >> + if (numa_has_capacity(&env)) >> + task_numa_find_cpu(&env, taskimp, groupimp); >> >> /* >> * Look at other nodes in these cases: >> @@ -1494,7 +1519,8 @@ static int task_numa_migrate(struct task_struct *p) >> env.dist = dist; >> env.dst_nid = nid; >> update_numa_stats(&env.dst_stats, env.dst_nid); >> - task_numa_find_cpu(&env, taskimp, groupimp); >> + if (numa_has_capacity(&env)) >> + task_numa_find_cpu(&env, taskimp, groupimp); >> } >> } > > Does this not 'duplicate' the logic that we tried for with > task_numa_compare():balance section? That is where we try to avoid > making a decision that the regular load-balancer will dislike and undo. > > Alternatively; you can view that as a cpu guard and the proposed as a > node guard, in which case, should it not live inside > task_numa_find_cpu()? Instead of guarding all call sites. > > In any case, should we mix a bit of imbalance_pct in there? > > /me goes ponder this a bit further.. Yes, there is some duplication between this code and the logic in task_numa_compare() At this point I am not sure how to resolve that; I am interested in seeing whether this patch solves the issue reported by Artem and Jirka. If it does, we can think about cleanups. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/