From: Mel Gorman <mgorman@techsingularity.net>
To: Peter Zijlstra
Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, LKML, Mel Gorman
Subject: [PATCH 0/4] Mitigate inconsistent NUMA imbalance behaviour
Date: Wed, 11 May 2022 15:30:34 +0100
Message-Id: <20220511143038.4620-1-mgorman@techsingularity.net>
X-Mailing-List: linux-kernel@vger.kernel.org

A problem was reported privately related to inconsistent performance of
NAS when parallelised with MPICH. The root of the problem is that the
initial placement is unpredictable and there can be a larger imbalance
than expected between NUMA nodes. As there is spare capacity and the
faults are local, the imbalance persists for a long time and performance
suffers. This is not purely an "allowed imbalance" problem, as setting
the allowed imbalance to 0 does not fix the issue, but the allowed
imbalance contributes to the performance problem.

The unpredictable behaviour was most recently introduced by commit
c6f886546cb8 ("sched/fair: Trigger the update of blocked load on newly
idle cpu"). With MPICH, mpirun forks hydra_pmi_proxy helpers that go to
sleep before execing the target workload. As the new tasks are sleeping,
the potential imbalance is not observed because idle_cpus does not
reflect the tasks that will be running in the near future. How bad the
problem is depends on the timing of the forks and whether the new tasks
are still running. Consequently, a large initial imbalance may not be
detected until the workload is fully running. Once running, NUMA
Balancing picks the preferred node based on locality, and runtime load
balancing often ignores the tasks as can_migrate_task() fails for either
locality or task_hot reasons, picking unrelated tasks instead.

This is the min, max, range and mean of run time for mg.D parallelised
by MPICH using ~25% of the CPUs on a 2-socket machine (80 CPUs, 16
active for mg.D due to limitations of mg.D).
v5.3                          Min  95.84 Max  96.55 Range   0.71 Mean  96.16
v5.7                          Min  95.44 Max  96.51 Range   1.07 Mean  96.14
v5.8                          Min  96.02 Max 197.08 Range 101.06 Mean 154.70
v5.12                         Min 104.45 Max 111.03 Range   6.58 Mean 105.94
v5.13                         Min 104.38 Max 170.37 Range  65.99 Mean 117.35
v5.13-revert-c6f886546cb8     Min 104.40 Max 110.70 Range   6.30 Mean 105.68
v5.18rc4-baseline             Min 104.46 Max 169.04 Range  64.58 Mean 130.49
v5.18rc4-revert-c6f886546cb8  Min 113.98 Max 117.29 Range   3.31 Mean 114.71
v5.18rc4-this_series          Min  95.24 Max 175.33 Range  80.09 Mean 108.91
v5.18rc4-this_series+revert   Min  95.24 Max  99.87 Range   4.63 Mean  96.54

This shows that performance for this load has been unpredictable for a
long time. Instability was introduced somewhere between v5.7 and v5.8,
fixed in v5.12 and broken again since v5.13. The reverts against v5.13
and v5.18-rc4 show that c6f886546cb8 is the primary source of
instability, although the best case is still worse than v5.7. This
series addresses the allowed-imbalance problems, restoring the peak
performance of v5.7, although only some of the time due to the
instability problem. The series plus the revert is both stable and has
slightly better peak performance with similar average performance. I'm
not convinced commit c6f886546cb8 is wrong, but I haven't isolated
exactly why it's unstable, so for now I'm just noting that it has an
issue.

Patch 1 initialises numa_migrate_retry. While this resolves itself
eventually, it is unpredictable early in the lifetime of a task.

Patch 2 will not swap NUMA tasks in the same NUMA group, or tasks
without a NUMA group, if there is spare capacity. Swapping in that case
just punishes one task to help another.

Patch 3 fixes an issue where a larger imbalance can be created at fork
time than would be allowed at run time. This behaviour can help some
workloads that are short-lived and prefer to remain local, but it
punishes long-lived tasks that are memory intensive.
Patch 4 adjusts the threshold where a NUMA imbalance is allowed to
better approximate the number of memory channels, at least for x86-64.

 kernel/sched/fair.c     | 59 ++++++++++++++++++++++++++---------------
 kernel/sched/topology.c | 23 ++++++++++------
 2 files changed, 53 insertions(+), 29 deletions(-)

-- 
2.34.1
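[Not part of the series: as a quick sanity check on the mg.D figures
quoted above, the Range column is simply Max - Min for each kernel. A
short script recomputing it from the reported Min/Max pairs:]

```python
# Recompute the Range column (Max - Min) for the mg.D results
# reported in the cover letter. The (min, max) pairs are copied
# verbatim from the table above.
results = {
    "v5.3": (95.84, 96.55),
    "v5.7": (95.44, 96.51),
    "v5.8": (96.02, 197.08),
    "v5.12": (104.45, 111.03),
    "v5.13": (104.38, 170.37),
    "v5.13-revert-c6f886546cb8": (104.40, 110.70),
    "v5.18rc4-baseline": (104.46, 169.04),
    "v5.18rc4-revert-c6f886546cb8": (113.98, 117.29),
    "v5.18rc4-this_series": (95.24, 175.33),
    "v5.18rc4-this_series+revert": (95.24, 99.87),
}

for kernel, (lo, hi) in results.items():
    # Round to 2 decimal places to match the precision of the report.
    print(f"{kernel:30s} Range {round(hi - lo, 2):6.2f}")
```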