From: Steve Sistare <steven.sistare@oracle.com>
To: mingo@redhat.com, peterz@infradead.org
Cc: subhra.mazumdar@oracle.com, dhaval.giani@oracle.com,
    daniel.m.jordan@oracle.com, pavel.tatashin@microsoft.com,
    matt@codeblueprint.co.uk, umgwanakikbuti@gmail.com, riel@redhat.com,
    jbacik@fb.com, juri.lelli@redhat.com, valentin.schneider@arm.com,
    vincent.guittot@linaro.org, quentin.perret@arm.com,
    steven.sistare@oracle.com, linux-kernel@vger.kernel.org
Subject: [PATCH v2 09/10] sched/fair: disable stealing if too many NUMA nodes
Date: Mon, 5 Nov 2018 12:08:08 -0800
Message-Id: <1541448489-19692-10-git-send-email-steven.sistare@oracle.com>
X-Mailer: git-send-email 1.8.3.1
In-Reply-To: <1541448489-19692-1-git-send-email-steven.sistare@oracle.com>
References: <1541448489-19692-1-git-send-email-steven.sistare@oracle.com>

The STEAL feature causes regressions on hackbench on larger NUMA systems,
so disable it on systems with more than sched_steal_node_limit nodes
(default 2).  Note that the feature remains enabled as seen in features.h
and /sys/kernel/debug/sched_features, but stealing is only performed if
nodes <= sched_steal_node_limit.  This arrangement allows users to activate
stealing on reboot by setting the kernel parameter sched_steal_node_limit
on kernels built without CONFIG_SCHED_DEBUG.  The parameter is temporary
and will be deleted when the regression is fixed.

Details of the regression follow.  With the STEAL feature set, hackbench
is slower on many-node systems:

  X5-8: 8 sockets * 18 cores * 2 hyperthreads = 288 CPUs
  Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz
  Average of 10 runs of: hackbench <groups> processes 50000

            --- base --    --- new ---
  groups    time %stdev    time %stdev   %speedup
       1   3.627   15.8   3.876    7.3       -6.5
       2   4.545   24.7   5.583   16.7      -18.6
       3   5.716   25.0   7.367   14.2      -22.5
       4   6.901   32.9   7.718   14.5      -10.6
       8   8.604   38.5   9.111   16.0       -5.6
      16   7.734    6.8  11.007    8.2      -29.8

Total CPU time increases.  Profiling shows that CPU time increases
uniformly across all functions, suggesting a systemic increase in cache
or memory latency.  This may be due to NUMA migrations, as they cause a
loss of LLC cache footprint and incur remote memory latencies.

The domains for this system and their flags are:

  domain0 (SMT) : 1 core
    SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
    SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING SD_SHARE_CPUCAPACITY
    SD_WAKE_AFFINE

  domain1 (MC) : 1 socket
    SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
    SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING SD_WAKE_AFFINE

  domain2 (NUMA) : 4 sockets
    SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
    SD_SERIALIZE SD_OVERLAP SD_NUMA SD_WAKE_AFFINE

  domain3 (NUMA) : 8 sockets
    SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_SERIALIZE SD_OVERLAP SD_NUMA

Schedstats point to the root cause of the regression.  hackbench is run
10 times per group, and the average schedstat accumulation per run and
per CPU is shown below.  Note that domain3 moves are zero because
SD_WAKE_AFFINE is not set there.

NO_STEAL
                                        --- domain2 ---   --- domain3 ---
grp time %busy  sched   idle   wake  steal remote  move pull remote move pull
  1 20.3  10.3  28710  14346  14366      0    490  3378    0   4039    0    0
  2 26.4  18.8  56721  28258  28469      0    792  7026   12   9229    0    7
  3 29.9  28.3  90191  44933  45272      0   5380  7204   19  16481    0    3
  4 30.2  35.8 121324  60409  60933      0   7012  9372   27  21438    0    5
  8 27.7  64.2 229174 111917 117272      0  11991  1837  168  44006    0   32
 16 32.6  74.0 334615 146784 188043      0   3404  1468   49  61405    0    8

STEAL
                                        --- domain2 ---   --- domain3 ---
grp time %busy  sched   idle   wake  steal remote  move pull remote move pull
  1 20.6  10.2  28490  14232  14261     18      3  3525    0   4254    0    0
  2 27.9  18.8  56757  28203  28562    303   1675  7839    5   9690    0    2
  3 35.3  27.7  87337  43274  44085    698    741 12785   14  15689    0    3
  4 36.8  36.0 118630  58437  60216   1579   2973 14101   28  18732    0    7
  8 48.1  73.8 289374 133681 155600  18646  35340 10179  171  65889    0   34
 16 41.4  82.5 268925  91908 177172  47498  17206  6940  176  71776    0   20

Cross-numa-node migrations are caused by load balancing pulls and
wake_affine moves.  Pulls are small and similar for no_steal and steal.
However, moves are significantly higher for steal, and the rows above with
the highest moves have the worst regressions for time; see for example
grp=8.

Moves increase for steal due to the following logic in wake_affine_idle()
for a synchronous wakeup:

    if (sync && cpu_rq(this_cpu)->nr_running == 1)
        return this_cpu;        // move the task

The steal feature does a better job of smoothing the load between idle and
busy CPUs, so nr_running is 1 more often, and moves are performed more
often.  For hackbench, cross-node affine moves early in the run are good
because they colocate wakers and wakees from the same group on the same
node, but continued moves later in the run are bad, because the wakee is
moved away from its peers on its previous node.  Note that even no_steal
is far from optimal; binding an instance of "hackbench 2" to each of the
8 NUMA nodes runs much faster than running "hackbench 16" with no binding.
Clearing SD_WAKE_AFFINE for domain2 eliminates the affine cross-node
migrations and removes the performance difference between no_steal and
steal, but overall performance is then lower than with WA_IDLE because
some migrations are helpful, as explained above.

I have tried many heuristics in an attempt to optimize the number of
cross-node moves in all conditions, with limited success.  The fundamental
problem is that the scheduler does not track which groups of tasks talk to
each other.  Parts of several groups become entrenched on the same node,
filling it to capacity, leaving no room for either group to pull its peers
over, and there is neither the data nor a mechanism for the scheduler to
evict one group to make room for the other.

For now, disable STEAL on such systems until we can do better, or until it
is shown that hackbench is atypical and most workloads benefit from
stealing.
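For example, based on the check_node_limit() logic added below (a usage
note, not a new mechanism): on the 8-node X5-8 above, stealing can be
kept active by booting with the limit raised to at least the node count,
e.g. adding this to the kernel command line:

    sched_steal_node_limit=8

With the default limit of 2, any system where num_possible_nodes()
exceeds 2 disables the sched_steal_allow static branch, and
overload_clear(), overload_set(), and try_steal() then return early.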
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 kernel/sched/fair.c     | 16 +++++++++++++---
 kernel/sched/sched.h    |  2 +-
 kernel/sched/topology.c | 25 +++++++++++++++++++++++++
 3 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f12f56..56dce30 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3726,11 +3726,21 @@ static inline bool within_margin(int value, int margin)
 
 #define IF_SMP(statement)	statement
 
+static inline bool steal_enabled(void)
+{
+#ifdef CONFIG_NUMA
+	bool allow = static_branch_likely(&sched_steal_allow);
+#else
+	bool allow = true;
+#endif
+	return sched_feat(STEAL) && allow;
+}
+
 static void overload_clear(struct rq *rq)
 {
 	struct sparsemask *overload_cpus;
 
-	if (!sched_feat(STEAL))
+	if (!steal_enabled())
 		return;
 
 	rcu_read_lock();
@@ -3744,7 +3754,7 @@ static void overload_set(struct rq *rq)
 {
 	struct sparsemask *overload_cpus;
 
-	if (!sched_feat(STEAL))
+	if (!steal_enabled())
 		return;
 
 	rcu_read_lock();
@@ -9786,7 +9796,7 @@ static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
 	int stolen = 0;
 	struct sparsemask *overload_cpus;
 
-	if (!sched_feat(STEAL))
+	if (!steal_enabled())
 		return 0;
 
 	if (!cpu_active(dst_cpu))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aadfe68..5f181e9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -928,7 +928,6 @@ static inline int cpu_of(struct rq *rq)
 #endif
 }
 
-
 #ifdef CONFIG_SCHED_SMT
 
 extern struct static_key_false sched_smt_present;
@@ -1083,6 +1082,7 @@ enum numa_topology_type {
 #endif
 
 #ifdef CONFIG_NUMA
+extern struct static_key_true sched_steal_allow;
 extern void sched_init_numa(void);
 extern void sched_domains_numa_masks_set(unsigned int cpu);
 extern void sched_domains_numa_masks_clear(unsigned int cpu);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f18c416..e80c354 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1337,6 +1337,30 @@ static void init_numa_topology_type(void)
 	}
 }
 
+DEFINE_STATIC_KEY_TRUE(sched_steal_allow);
+static int sched_steal_node_limit;
+#define SCHED_STEAL_NODE_LIMIT_DEFAULT	2
+
+static int __init steal_node_limit_setup(char *buf)
+{
+	get_option(&buf, &sched_steal_node_limit);
+	return 0;
+}
+
+early_param("sched_steal_node_limit", steal_node_limit_setup);
+
+static void check_node_limit(void)
+{
+	int n = num_possible_nodes();
+
+	if (sched_steal_node_limit == 0)
+		sched_steal_node_limit = SCHED_STEAL_NODE_LIMIT_DEFAULT;
+	if (n > sched_steal_node_limit) {
+		static_branch_disable(&sched_steal_allow);
+		pr_debug("Suppressing sched STEAL. To enable, reboot with sched_steal_node_limit=%d", n);
+	}
+}
+
 void sched_init_numa(void)
 {
 	int next_distance, curr_distance = node_distance(0, 0);
@@ -1485,6 +1509,7 @@ void sched_init_numa(void)
 	sched_max_numa_distance = sched_domains_numa_distance[level - 1];
 
 	init_numa_topology_type();
+	check_node_limit();
 }
 
 void sched_domains_numa_masks_set(unsigned int cpu)
-- 
1.8.3.1