Received: by 2002:a25:ad19:0:0:0:0:0 with SMTP id y25csp8687975ybi; Tue, 23 Jul 2019 13:09:23 -0700 (PDT) X-Google-Smtp-Source: APXvYqxb4nJqa2Tb8+kcm+vizJxQIAdmEKfiM25CG5u8YjSJ0cB9D/cDKm5i+nO/IBaNDgYVQGCO X-Received: by 2002:aa7:97b7:: with SMTP id d23mr7820232pfq.203.1563912563879; Tue, 23 Jul 2019 13:09:23 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1563912563; cv=none; d=google.com; s=arc-20160816; b=r1prSSF5DXBwCNys/NDPj+krh0LgkM6Vm9cyb03hCLLpw5SQizldOzcTmEnlyZ/AvL kmWTw6oUyisQ+HdwxTrj5tDS5pLKx/1MTWdm9RWoAvIpHTVYIo38DG9A4mKw8p+qGmMl ymqWGIGLGwDeowW8ZaIlBQwslRTRp4p3UA0GgKyg2MLufxCJtgF10DyqYllsqpTjLSMU +V7PKYnl3TK448A11suBs8JGKH87P/LEqCtzF9FQLRJYS+TZU8rZANSc53tsB/qW05Ju R7JxaZ2KWW988AU9GCsSu1sFV9+cnqdUVkfTrmumm79OX43cry66uBIiCw8weWF/uSy9 XzXg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:message-id:date:subject:cc:to:from :dkim-signature; bh=/vbv8X7Ffa53TnjHVA0Vtm+sDJ1pMmI/u+My+FwraRA=; b=R3xdPoWQ3Pk2oicLU1XVsIj0Ha0lol0A44Qz81QkfjH0eCovYDTpLhsUG+3jkhfmzY rOMZOEreJ9QW2TqeiDflD4KwZLANJeug7l2SzeyvboU56grubebCyd48XtHGuebgOX3r El6F5syWPAx23sxbt9SuDGFV1pye8Dm10t5H73txeWjpshaA0lrmaEhQueX+qDIPI21Q kzt59jMgfno6hDLRjKyqwesoe4gO6yTKDriOHkG7Q8SwcofcuZU1q7r5TiM/tq0Z5wos vY7yrxs0yhi2+KfyTmYD6i8s1g372Tdh96aVMHOHEH+y/gAfS+zFMMRVMLpTtoFqeKSg PPQw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@codeblueprint-co-uk.20150623.gappssmtp.com header.s=20150623 header.b=m2QO0kdL; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id a25si11952774pfo.234.2019.07.23.13.09.08; Tue, 23 Jul 2019 13:09:23 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@codeblueprint-co-uk.20150623.gappssmtp.com header.s=20150623 header.b=m2QO0kdL; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730569AbfGWKsh (ORCPT + 99 others); Tue, 23 Jul 2019 06:48:37 -0400 Received: from mail-ed1-f68.google.com ([209.85.208.68]:38634 "EHLO mail-ed1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729671AbfGWKsg (ORCPT ); Tue, 23 Jul 2019 06:48:36 -0400 Received: by mail-ed1-f68.google.com with SMTP id r12so8591298edo.5 for ; Tue, 23 Jul 2019 03:48:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=codeblueprint-co-uk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id; bh=/vbv8X7Ffa53TnjHVA0Vtm+sDJ1pMmI/u+My+FwraRA=; b=m2QO0kdLiNSFb6rRE9L049cc1C6UBL+zUeXbF2OB0gNyiWkLdgK2F+0p/5cRkvTBUQ 74pku7WzeieXlWYJx25ab+fvHvpdkz1E2beMszzVrni1UJKMVtCeLJUhTml3OK/YNbD4 zipyUn5Sk4dObc2nwMW3vVsh13ut5V2K/L4kebI+ROvkdRi9fd8oR9FwKoA6Z6BiXWWs Tc3ne0LO6Jz8KLwoZeW59YHPBsZHFmRdKzOSQ7STyVxywIWsVT5I5UDUprg4In6ElKl2 s9Qo9mEwC0kaNQvyk1NIh9kLt8o3qSe6yvlyVLqQt75P2nw5ucmzRi3JFGg7/vbnPymh 1ocg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id; bh=/vbv8X7Ffa53TnjHVA0Vtm+sDJ1pMmI/u+My+FwraRA=; b=V/cDEO1yKfV6RdTki6FMDWFDD6iEWQ7Zcey3hI0JwuGROGnXBw8hyHltBCgoZY1yI9 /xgTwnsxgzlyk2/rjpXSHVYWoOIJMZD3qD0X1zeNp54cH1oQG2R9limyjSj4td2CEvoI cuqVRRutzbZHexzUSruYJt6HJqxcbKOGsTDuXjguRMCUW2yDbqZR9Jwv+lza8WVHWE9U ZVIHUP5q5fAkbOMg0jHXjyPJLGEOy9KyqUEcFeKj9nuk2PDFX4zJc4gpwDKbPrIoDyD0 4fjRInV67NA+IM15XxfIBLh31srvInk4D+Z770RWibAvgkonrW+z3eD9jAfZJ+vjf7us BCiw== X-Gm-Message-State: APjAAAVkpL6xU12foli6jt/jCQ+09OWLbvZhBU9fiD6iCkM15PeL5tUW YqmKJ49Rzmvq3eNpPt90E6A= X-Received: by 2002:a17:906:838a:: with SMTP id p10mr55331714ejx.237.1563878912114; Tue, 23 Jul 2019 03:48:32 -0700 (PDT) Received: from localhost (5ec096bd.skybroadband.com. [94.192.150.189]) by smtp.gmail.com with ESMTPSA id l1sm11632097edr.17.2019.07.23.03.48.31 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 23 Jul 2019 03:48:31 -0700 (PDT) From: Matt Fleming To: Peter Zijlstra Cc: linux-kernel@vger.kernel.org, Matt Fleming , "Suthikulpanit, Suravee" , Mel Gorman , "Lendacky, Thomas" , Borislav Petkov Subject: [PATCH v3] sched/topology: Improve load balancing on AMD EPYC Date: Tue, 23 Jul 2019 11:48:30 +0100 Message-Id: <20190723104830.26623-1-matt@codeblueprint.co.uk> X-Mailer: git-send-email 2.13.7 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init() for any sched domains with a NUMA distance greater than 2 hops (RECLAIM_DISTANCE). The idea being that it's expensive to balance across domains that far apart. However, as is rather unfortunately explained in commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30") the value for RECLAIM_DISTANCE is based on node distance tables from 2011-era hardware. Current AMD EPYC machines have the following NUMA node distances: node distances: node 0 1 2 3 4 5 6 7 0: 10 16 16 16 32 32 32 32 1: 16 10 16 16 32 32 32 32 2: 16 16 10 16 32 32 32 32 3: 16 16 16 10 32 32 32 32 4: 32 32 32 32 10 16 16 16 5: 32 32 32 32 16 10 16 16 6: 32 32 32 32 16 16 10 16 7: 32 32 32 32 16 16 16 10 where 2 hops is 32. The result is that the scheduler fails to load balance properly across NUMA nodes on different sockets -- 2 hops apart. For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4 (CPUs 32-39) like so, $ numactl -C 0-7,32-39 ./spinner 16 causes all threads to fork and remain on node 0 until the active balancer kicks in after a few seconds and forcibly moves some threads to node 4. Override node_reclaim_distance for AMD Zen. Signed-off-by: Matt Fleming Cc: "Suthikulpanit, Suravee" Cc: Mel Gorman Cc: "Lendacky, Thomas" Cc: Borislav Petkov --- arch/x86/kernel/cpu/amd.c | 3 +++ include/linux/topology.h | 14 ++++++++++++++ kernel/sched/topology.c | 3 ++- mm/khugepaged.c | 2 +- mm/page_alloc.c | 2 +- 5 files changed, 21 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c index 8d4e50428b68..d94bf83d5ee6 100644 --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include @@ -824,6 +825,8 @@ static void init_amd_zn(struct cpuinfo_x86 *c) { set_cpu_cap(c, X86_FEATURE_ZEN); + node_reclaim_distance = 32; + /* * Fix erratum 1076: CPB feature bit not being set in CPUID. * Always set it, except when running under a hypervisor. diff --git a/include/linux/topology.h b/include/linux/topology.h index 47a3e3c08036..579522ec446c 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -59,6 +59,20 @@ int arch_update_cpu_topology(void); */ #define RECLAIM_DISTANCE 30 #endif + +/* + * The following tunable allows platforms to override the default node + * reclaim distance (RECLAIM_DISTANCE) if remote memory accesses are + * sufficiently fast that the default value actually hurts + * performance. + * + * AMD EPYC machines use this because even though the 2-hop distance + * is 32 (3.2x slower than a local memory access) performance actually + * *improves* if allowed to reclaim memory and load balance tasks + * between NUMA nodes 2-hops apart. + */ +extern int __read_mostly node_reclaim_distance; + #ifndef PENALTY_FOR_NODE_WITH_CPUS #define PENALTY_FOR_NODE_WITH_CPUS (1) #endif diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index f751ce0b783e..f684fde00536 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -1284,6 +1284,7 @@ static int sched_domains_curr_level; int sched_max_numa_distance; static int *sched_domains_numa_distance; static struct cpumask ***sched_domains_numa_masks; +int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE; #endif /* @@ -1402,7 +1403,7 @@ sd_init(struct sched_domain_topology_level *tl, sd->flags &= ~SD_PREFER_SIBLING; sd->flags |= SD_SERIALIZE; - if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) { + if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) { sd->flags &= ~(SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE); diff --git a/mm/khugepaged.c b/mm/khugepaged.c index eaaa21b23215..ccede2425c3f 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -710,7 +710,7 @@ static bool khugepaged_scan_abort(int nid) for (i = 0; i < MAX_NUMNODES; i++) { if (!khugepaged_node_load[i]) continue; - if (node_distance(nid, i) > RECLAIM_DISTANCE) + if (node_distance(nid, i) > node_reclaim_distance) return true; } return false; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 272c6de1bf4e..0d54cd2c43a4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3522,7 +3522,7 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int order, static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) { return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <= - RECLAIM_DISTANCE; + node_reclaim_distance; } #else /* CONFIG_NUMA */ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) -- 2.13.7