Date: Wed, 27 Sep 2017 13:11:28 +0200
From: Borislav Petkov
To: Suravee Suthikulpanit
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org
Subject: Re: [PATCH v3] sched/topology: Introduce NUMA identity node sched domain
Message-ID: <20170927111128.rh4hmlymqroulp4c@pd.tnic>
References: <1504768805-46716-1-git-send-email-suravee.suthikulpanit@amd.com>
In-Reply-To: <1504768805-46716-1-git-send-email-suravee.suthikulpanit@amd.com>

On Thu, Sep 07, 2017 at 02:20:05AM -0500, Suravee Suthikulpanit wrote:
> On AMD Family17h-based (EPYC) system, a logical NUMA node can contain

Let's simply spell it F17h like we did for the older families.

> upto 8 cores (16 threads) with the following topology.
>
>            ----------------------------
>     C0  | T0 T1 |    ||    | T0 T1 | C4
>         --------|    ||    |--------
>     C1  | T0 T1 | L3 || L3 | T0 T1 | C5
>         --------|    ||    |--------
>     C2  | T0 T1 | #0 || #1 | T0 T1 | C6
>         --------|    ||    |--------
>     C3  | T0 T1 |    ||    | T0 T1 | C7
>            ----------------------------
>
> Here, there are 2 last-level (L3) caches per logical NUMA node.
> A socket can contain upto 4 NUMA nodes, and a system can support
> upto 2 sockets. With full system configuration, current scheduler
> creates 4 sched domains:
>
>   domain0 SMT  (span a core)
>   domain1 MC   (span a last-level-cache)
>   domain2 NUMA (span a socket: 4 nodes)
>   domain3 NUMA (span a system: 8 nodes)
>
> Note that there is no domain to represent cpus spaning a logical

s/cpus/CPUs/

s/spaning/spanning/

Please introduce a spellchecker into your patch creation workflow.

> NUMA node. With this hierarchy of sched domains, the scheduler does
> not balance properly in the following cases:
>
> Case1:
>     When running 8 tasks, a properly balanced system should
>     schedule a task per logical NUMA node. This is not the case for
>     the current scheduler.

I'd like to have a sentence or two here saying what the problem is,
i.e., how do the 8 tasks get placed...

>
> Case2:
>     In some cases, threads are scheduled on the same cpu, while other

s/cpu/CPU/

> cpus are idle.

... like this sentence, for example, explaining what happens without
that patch.

> This results in run-to-run inconsistency. For example:
>
>   taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
>                           --cpu-max-prime=100000 run
>
> Total execution time ranges from 25.1s to 33.5s depending on threads
> placement, where 25.1s is when all 8 threads are balanced properly
> on 8 cpus.

s/cpus/CPUs/

Please check the whole patch (comments, etc).

> Introducing NUMA identity node sched domain, which is based on how

"Introduce... "

no ing form but plain procedural, do this, do that.

> SRAT/SLIT table define a logical NUMA node. This results in the following
> hierarchy of sched domains on the same system described above.
>
>   domain0 SMT  (span a core)
>   domain1 MC   (span a last-level-cache)
>   domain2 NODE (span a logical NUMA node)
>   domain3 NUMA (span a socket: 4 nodes)
>   domain4 NUMA (span a system: 8 nodes)
>
> This fixes the improper load balancing cases mentioned above.
>
> Note that in case cpumask of the last-level-cache and NODE domains
> are the same (e.g. on AMD family10h/15h servers), the NODE domain

As above.

> will be excluded. Therefore, this change will not affect those systems.

Right, and this is running on *all* machines, not only AMD or x86.

Why doesn't it affect others? The degenerate code maybe?

...

> @@ -1445,9 +1448,26 @@ void sched_init_numa(void)
>  		tl[i] = sched_domain_topology[i];
>
>  	/*
> +	 * Do not setup NUMA node level if it has the same cpumask
> +	 * as sched domain at previous level. This is the case for
> +	 * system with:
> +	 *   LLC == NODE : LLC (MC) sched domain span a NUMA node.
> +	 *   DIE == NODE : DIE sched domain span a NUMA node.
> +	 *
> +	 * Assume all NUMA nodes are identical, so only check node 0.
> +	 */
> +	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0))) {
> +		tl[i++] = (struct sched_domain_topology_level){
> +			.mask = sd_numa_mask,
> +			.numa_level = 0,
> +			SD_INIT_NAME(NODE)
> +		};
> +	}

Right, I think the issue wrt the degenerate code is not fully discussed
yet judging by:

https://lkml.kernel.org/r/f85d6d5d-64b7-7e08-939f-b321e5f05949@amd.com
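To be clear, what I mean with "the degenerate code" is the logic which
collapses a sched domain level whose CPU span is identical to that of
its child and which adds nothing on top. Roughly something like this
(a simplified sketch, not the actual implementation - the helper name
is made up for illustration):

  /*
   * Sketch only: a parent domain spanning the same CPUs as its child
   * and setting no additional flags contributes nothing to load
   * balancing and can be collapsed.
   */
  static bool parent_is_redundant(struct sched_domain *sd)
  {
          struct sched_domain *parent = sd->parent;

          if (!parent)
                  return false;

          return cpumask_equal(sched_domain_span(sd),
                               sched_domain_span(parent)) &&
                 !(parent->flags & ~sd->flags);
  }

If it is rather the cpumask_equal() check in sched_init_numa() above
that you're relying on, please spell that out in the commit message.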
-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 