From: Zhang Qiao
Subject: [PATCH] sched/numa: Correct NUMA imbalance calculation
Date: Fri, 24 May 2024 11:54:38 +0800
Message-ID: <20240524035438.2701479-1-zhangqiao22@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

When performing load balancing, a NUMA imbalance is allowed if the
number of busy CPUs is below the maximum threshold; a pair of
communicating tasks is then kept on the current node when the source
domain is lightly loaded. In many cases this prevents communicating
tasks from being pulled apart.

But when I ran the lmbench bw_pipe test case, the behaviour was not
quite consistent with the above expectation: the communicating tasks
were migrated to two different NUMA nodes.

There may be two reasons for this issue:

1. calculate_imbalance() uses local->sum_nr_running, which may not be
   accurate. The communicating tasks run in the busiest group, so
   busiest->sum_nr_running should be used instead.

2. In calculate_imbalance(), idle CPUs are used to calculate the
   imbalance, but group_weight may differ between the local and
   busiest groups (my server has 4 NUMA nodes and the kernel builds a
   3-level NUMA sched_domain hierarchy, so some sched_groups have
   different weights). In this case, even if both groups are nearly
   idle, the calculated imbalance can be very large; the difference in
   busy CPUs between the groups is a more appropriate imbalance value
   (see the illustrative sketch after the patch).

For lmbench bw_pipe (bw_pipe -P 1):

  v6.6:               1776.7533 MB/sec
  v6.6 + this patch:  4323 MB/sec

Signed-off-by: Zhang Qiao
---
 kernel/sched/fair.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..c6170cde9c14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1323,7 +1323,6 @@ static inline bool is_core_idle(int cpu)
 }
 
 #ifdef CONFIG_NUMA
-#define NUMA_IMBALANCE_MIN 2
 
 static inline long
 adjust_numa_imbalance(int imbalance, int dst_running, int imb_numa_nr)
@@ -1342,7 +1341,7 @@ adjust_numa_imbalance(int imbalance, int dst_running, int imb_numa_nr)
 	 * Allow a small imbalance based on a simple pair of communicating
 	 * tasks that remain local when the destination is lightly loaded.
 	 */
-	if (imbalance <= NUMA_IMBALANCE_MIN)
+	if (imbalance <= imb_numa_nr)
 		return 0;
 
 	return imbalance;
@@ -10727,14 +10726,15 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		 */
 		env->migration_type = migrate_task;
 		env->imbalance = max_t(long, 0,
-				       (local->idle_cpus - busiest->idle_cpus));
+				       (busiest->group_weight - busiest->idle_cpus) -
+				       (local->group_weight - local->idle_cpus));
 	}
 
 #ifdef CONFIG_NUMA
 	/* Consider allowing a small imbalance between NUMA groups */
 	if (env->sd->flags & SD_NUMA) {
 		env->imbalance = adjust_numa_imbalance(env->imbalance,
-						       local->sum_nr_running + 1,
+						       busiest->sum_nr_running,
 						       env->sd->imb_numa_nr);
 	}
 #endif
-- 
2.18.0.huawei.25
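
The following stand-alone sketch is not part of the patch; it only
models the arithmetic in point 2 with hypothetical group sizes (a
32-CPU local group and a 16-CPU busiest group, one busy CPU each) to
show how the pre-patch idle-CPU difference inflates the imbalance
while the busy-CPU difference does not.

/* numa_imb_sketch.c - toy model of the two imbalance formulas above.
 * The struct and group sizes are hypothetical and exist only for
 * illustration; this is userspace code, not kernel code.
 */
#include <stdio.h>

struct sg_stats {
	long group_weight;	/* CPUs in the sched_group */
	long idle_cpus;		/* idle CPUs in the sched_group */
};

static long max_long(long a, long b)
{
	return a > b ? a : b;
}

int main(void)
{
	/* groups of different weight, each running a single task */
	struct sg_stats local   = { .group_weight = 32, .idle_cpus = 31 };
	struct sg_stats busiest = { .group_weight = 16, .idle_cpus = 15 };

	/* before the patch: difference of idle CPUs */
	long old_imb = max_long(0, local.idle_cpus - busiest.idle_cpus);

	/* after the patch: difference of busy CPUs */
	long new_imb = max_long(0,
			(busiest.group_weight - busiest.idle_cpus) -
			(local.group_weight - local.idle_cpus));

	printf("idle-CPU based imbalance: %ld\n", old_imb);	/* prints 16 */
	printf("busy-CPU based imbalance: %ld\n", new_imb);	/* prints 0  */
	return 0;
}

With one busy CPU on each side, the idle-CPU formula still reports an
imbalance of 16 purely because the groups differ in size, which by the
reasoning above can be enough to pull a communicating pair apart; the
busy-CPU formula reports 0 and the pair stays local.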