Date: Wed, 6 May 2015 11:41:28 -0400
From: Rik van Riel
To: Artem Bityutskiy
Cc: linux-kernel@vger.kernel.org, mgorman@suse.de, peterz@infradead.org, jhladky@redhat.com
Subject: [PATCH] numa,sched: only consider less busy nodes as numa balancing destination
Message-ID: <20150506114128.0c846a37@cuia.bos.redhat.com>
In-Reply-To: <1430908530.7444.145.camel@sauron.fi.intel.com>
References: <1430908530.7444.145.camel@sauron.fi.intel.com>

On Wed, 06 May 2015 13:35:30 +0300
Artem Bityutskiy wrote:

> we observe a tremendous regression between kernel versions 3.16 and 3.17
> (and up), and I've bisected it to this commit:
>
> a43455a sched/numa: Ensure task_numa_migrate() checks the preferred node

Artem, Jirka, does this patch fix (or at least improve) the issues you
have been seeing?  Does it introduce any new regressions?

Peter, Mel, I think it may be time to stop waiting for the impedance
mismatch between the load balancer and NUMA balancing to be resolved,
and to simply avoid the issue in the NUMA balancing code...
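For illustration only (not part of the patch below), here is a minimal
userspace sketch of the check the patch adds, with made-up numbers; the
struct and function names are invented for the example, and the struct is
a simplified stand-in for the kernel's struct numa_stats.  Comparing the
per-capacity loads by cross-multiplying avoids integer division and the
rounding that comes with it:

#include <stdbool.h>
#include <stdio.h>

struct stats {
        unsigned long load;             /* load on the node */
        unsigned long compute_capacity; /* CPU capacity of the node */
        bool has_free_capacity;         /* fewer running tasks than CPUs? */
};

/* Illustrative version of the patch's numa_has_capacity() check. */
static bool dst_less_busy(struct stats *src, struct stats *dst)
{
        /* Never move from a node with spare capacity to one without. */
        if (src->has_free_capacity && !dst->has_free_capacity)
                return false;

        /* src->load/src->capacity > dst->load/dst->capacity, division-free */
        return src->load * dst->compute_capacity >
               dst->load * src->compute_capacity;
}

int main(void)
{
        struct stats src = { .load = 900, .compute_capacity = 1024,
                             .has_free_capacity = false };
        struct stats dst = { .load = 700, .compute_capacity = 1024,
                             .has_free_capacity = true };

        /* 900 * 1024 > 700 * 1024, so the destination is less busy. */
        printf("move allowed: %d\n", dst_less_busy(&src, &dst));
        return 0;
}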
----8<----

Subject: numa,sched: only consider less busy nodes as numa balancing destination

Changeset a43455a1 ("sched/numa: Ensure task_numa_migrate() checks the
preferred node") fixes an issue where workloads would never converge
on a fully loaded (or overloaded) system.

However, it introduces a regression on less than fully loaded systems,
where workloads converge on a few NUMA nodes, instead of staying spread
out properly across the whole system.  This leads to a reduction in
available memory bandwidth and usable CPU cache, with predictable
performance problems.

The root cause appears to be an interaction between the load balancer
and NUMA balancing, where the short term load represented by the load
balancer differs from the long term load the NUMA balancing code would
like to base its decisions on.

Simply reverting a43455a1 would re-introduce the non-convergence of
workloads on fully loaded systems, so that is not a good option.  As
an aside, the check done before a43455a1 only applied to a task's
preferred node, not to other candidate nodes in the system, so the
converge-on-too-few-nodes problem still happened, just to a lesser
degree.

Instead, try to compensate for the impedance mismatch between the load
balancer and NUMA balancing by only ever considering a less loaded node
as a destination for NUMA balancing, regardless of whether the task is
trying to move to the preferred node or to another node.

Signed-off-by: Rik van Riel
Reported-by: Artem Bityutskiy
Reported-by: Jirka Hladky
---
 kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ffeaa4105e48..480e6a35ab35 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1409,6 +1409,30 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 	}
 }
 
+/* Only move tasks to a NUMA node less busy than the current node. */
+static bool numa_has_capacity(struct task_numa_env *env)
+{
+	struct numa_stats *src = &env->src_stats;
+	struct numa_stats *dst = &env->dst_stats;
+
+	if (src->has_free_capacity && !dst->has_free_capacity)
+		return false;
+
+	/*
+	 * Only consider a task move if the source has a higher load
+	 * than the destination, corrected for CPU capacity on each node.
+	 *
+	 *      src->load                dst->load
+	 * --------------------- vs ---------------------
+	 * src->compute_capacity     dst->compute_capacity
+	 */
+	if (src->load * dst->compute_capacity >
+	    dst->load * src->compute_capacity)
+		return true;
+
+	return false;
+}
+
 static int task_numa_migrate(struct task_struct *p)
 {
 	struct task_numa_env env = {
@@ -1463,7 +1487,8 @@ static int task_numa_migrate(struct task_struct *p)
 	update_numa_stats(&env.dst_stats, env.dst_nid);
 
 	/* Try to find a spot on the preferred nid. */
-	task_numa_find_cpu(&env, taskimp, groupimp);
+	if (numa_has_capacity(&env))
+		task_numa_find_cpu(&env, taskimp, groupimp);
 
 	/*
 	 * Look at other nodes in these cases:
@@ -1494,7 +1519,8 @@ static int task_numa_migrate(struct task_struct *p)
 			env.dist = dist;
 			env.dst_nid = nid;
 			update_numa_stats(&env.dst_stats, env.dst_nid);
-			task_numa_find_cpu(&env, taskimp, groupimp);
+			if (numa_has_capacity(&env))
+				task_numa_find_cpu(&env, taskimp, groupimp);
 		}
 	}
 
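A worked example of the capacity correction above, with made-up numbers
(and assuming the source node has no free capacity, so the first check
does not veto the move).  The comparison is between load per unit of
compute capacity, not absolute load, so a node with more CPU capacity is
allowed to carry more load:

    src->load = 500   src->compute_capacity =  512
    dst->load = 900   dst->compute_capacity = 1024

    500 * 1024 = 512000  >  900 * 512 = 460800

The move is allowed even though the destination carries more absolute
load: with twice the compute capacity, it is still the less busy node
per unit of capacity.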