Date: Thu, 10 Aug 2017 18:41:49 +0200
From: Peter Zijlstra
To: Suravee Suthikulpanit
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com, bp@suse.de
Subject: Re: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain
Message-ID: <20170810164149.2rkkp55km5cxcarg@hirez.programming.kicks-ass.net>
References: <1502378452-6632-1-git-send-email-suravee.suthikulpanit@amd.com>
In-Reply-To: <1502378452-6632-1-git-send-email-suravee.suthikulpanit@amd.com>

On Thu, Aug 10, 2017 at 10:20:52AM -0500, Suravee Suthikulpanit wrote:
> On an AMD Family 17h (EPYC) based system, a NUMA node can contain
> up to 8 cores (16 threads) with the following topology:
>
>        ----------------------------
>    C0  | T0 T1 |    ||    | T0 T1 | C4
>        --------|    ||    |--------
>    C1  | T0 T1 | L3 || L3 | T0 T1 | C5
>        --------|    ||    |--------
>    C2  | T0 T1 | #0 || #1 | T0 T1 | C6
>        --------|    ||    |--------
>    C3  | T0 T1 |    ||    | T0 T1 | C7
>        ----------------------------
>
> Here, there are 2 last-level (L3) caches per NUMA node. A socket can
> contain up to 4 NUMA nodes, and a system can support up to 2 sockets.
> With a full system configuration, the current scheduler creates 4
> sched domains:
>
>   domain0 SMT  (spans a core)
>   domain1 MC   (spans a last-level cache)

Right, so traditionally we'd have the DIE level do that, but because
x86_has_numa_in_package we don't generate that, right?

>   domain2 NUMA (spans a socket: 4 nodes)
>   domain3 NUMA (spans the system: 8 nodes)
>
> Note that there is no domain representing the cpus spanning a NUMA
> node. With this hierarchy of sched domains, the scheduler does not
> balance properly in the following cases:
>
> Case 1:
>    When running 8 tasks, a properly balanced system should
>    schedule one task per NUMA node. This is not the case with
>    the current scheduler.
>
> Case 2:
>    When running an 8-thread program under 'taskset -c 0-7',
>    a properly balanced system should schedule the 8 threads on
>    8 cpus (e.g. T0 of C0-C7). However, the current scheduler
>    schedules some threads on the same cpu while other cpus are
>    idle.

Sure.. could you amend with a few actual performance numbers?
(FWIW, a throwaway reproducer for Case 2 is sketched below, after the
changelog quote.)

> Introduce a NUMA identity node sched domain, based on how the
> SRAT/SLIT tables define a NUMA node. This results in the following
> hierarchy of sched domains on the same system described above:
>
>   domain0 SMT        (spans a core)
>   domain1 MC         (spans a last-level cache)
>   domain2 NUMA_IDEN  (spans a NUMA node)

Hate that name though..

>   domain3 NUMA       (spans a socket: 4 nodes)
>   domain4 NUMA       (spans the system: 8 nodes)
>
> This fixes the improper load balancing cases mentioned above.
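On the numbers: this is the sort of throwaway userspace reproducer I'd
use to eyeball Case 2 -- nothing scientific, and not part of the patch.
The file name, NR_THREADS and the spin() helper are made up for the
example; it assumes an otherwise idle machine. Run it as
'taskset -c 0-7 ./repro' and watch whether the 8 spinners spread over 8
distinct cpus or pile up on a few:

  /* repro.c -- build: gcc -O2 -pthread repro.c -o repro */
  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <stdio.h>
  #include <unistd.h>

  #define NR_THREADS	8

  static void *spin(void *arg)
  {
  	long id = (long)arg;

  	for (;;) {
  		/* burn cpu for a while, then report where we ended up */
  		for (volatile unsigned long i = 0; i < 400000000UL; i++)
  			;
  		printf("thread %ld on cpu %d\n", id, sched_getcpu());
  	}
  	return NULL;
  }

  int main(void)
  {
  	pthread_t tid[NR_THREADS];
  	long i;

  	for (i = 0; i < NR_THREADS; i++)
  		pthread_create(&tid[i], NULL, spin, (void *)i);

  	pause();	/* spinners run until ^C */
  	return 0;
  }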
>
> Signed-off-by: Suravee Suthikulpanit
> ---
>  kernel/sched/topology.c | 24 +++++++++++++++++++++---
>  1 file changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 79895ae..c57df98 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1335,6 +1335,10 @@ void sched_init_numa(void)
>  	if (!sched_domains_numa_distance)
>  		return;
>
> +	/* Includes NUMA identity node at level 0. */
> +	sched_domains_numa_distance[level++] = curr_distance;
> +	sched_domains_numa_levels = level;
> +
>  	/*
>  	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
>  	 * unique distances in the node_distance() table.
> @@ -1382,8 +1386,7 @@ void sched_init_numa(void)
>  		return;
>
>  	/*
> -	 * 'level' contains the number of unique distances, excluding the
> -	 * identity distance node_distance(i,i).
> +	 * 'level' contains the number of unique distances
>  	 *
>  	 * The sched_domains_numa_distance[] array includes the actual distance
>  	 * numbers.
> @@ -1445,9 +1448,24 @@ void sched_init_numa(void)
>  		tl[i] = sched_domain_topology[i];
>
>  	/*
> +	 * Ignore the NUMA identity level if it has the same cpumask
> +	 * as previous level. This is the case for:
> +	 * - System with last-level-cache (MC) sched domain span a NUMA node.
> +	 * - System with DIE sched domain span a NUMA node.
> +	 *
> +	 * Assume all NUMA nodes are identical, so only check node 0.
> +	 */
> +	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
> +		tl[i++] = (struct sched_domain_topology_level){
> +			.mask = sd_numa_mask,
> +			.numa_level = 0,
> +			SD_INIT_NAME(NUMA_IDEN)

Shall we make that: SD_INIT_NAME(NODE), instead?

> +		};

This misses a set of '{}'. While C doesn't require it, our coding style
warrants blocks around any multi-line statement (see the sketch at the
end of this mail).

So what you've forgotten to mention is that for those systems where
LLC == NODE, this now-superfluous level gets removed by the degenerate
code. Have you verified that does the right thing?

> +
> +	/*
>  	 * .. and append 'j' levels of NUMA goodness.
>  	 */
> -	for (j = 0; j < level; i++, j++) {
> +	for (j = 1; j < level; i++, j++) {
>  		tl[i] = (struct sched_domain_topology_level){
>  			.mask = sd_numa_mask,
>  			.sd_flags = cpu_numa_flags,
> --
> 2.7.4
>
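IOW, something like the below for that last hunk -- the same statement
as in your patch, only wrapped in a block and using the NODE name
suggested above. Untested, just to illustrate the style point:

	/* Only add the NODE level when it is not degenerate with the
	 * previous (MC/DIE) level. */
	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0))) {
		tl[i++] = (struct sched_domain_topology_level){
			.mask = sd_numa_mask,
			.numa_level = 0,
			SD_INIT_NAME(NODE)
		};
	}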