Date: Wed, 28 Jun 2017 11:12:52 +0200
From: Borislav Petkov
To: Suravee Suthikulpanit
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, leo.duran@amd.com,
    yazen.ghannam@amd.com, Peter Zijlstra, "Lendacky, Thomas",
    Matt Fleming, Mel Gorman
Subject: Re: [PATCH 1/2] x86/CPU/AMD: Present package as die instead of socket
Message-ID: <20170628091252.vz6zebl36f4rw523@pd.tnic>
References: <1498545653-6755-1-git-send-email-suravee.suthikulpanit@amd.com>
 <1498545653-6755-2-git-send-email-suravee.suthikulpanit@amd.com>
 <20170627104803.wlhsqhaylbeqod37@pd.tnic>
 <20170627134251.5ztfqqlm3rg6y7ql@pd.tnic>
 <2f3bf9f2-cbca-ca72-219a-13c6faf9d314@amd.com>
 <20170627174411.gheip4jmra2ihuhq@pd.tnic>
List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

+ Matt and Mel.

On Wed, Jun 28, 2017 at 03:26:10AM +0700, Suravee Suthikulpanit wrote:
> So, from the definition above, we would like all those 16 threads to be in
> the same sched-domain, where threads from C0,1,2,3 are in the same
> sched-group, and threads in C4,5,6,7 are in another sched-group.

Figures, you want to have a sched group per L3.

> > Now that thing has a memory controller attached to it, correct?
>
> Yes.
>
> > If so, why is this thing not a logical NUMA node, as described in
> > SRAT/SLIT?
>
> Yes, this thing is a logical NUMA node and is represented correctly in the
> SRAT/SLIT.
>
> > Now, SRAT should contain the assignment which core belongs to which
> > node. Why is that not sufficient?
>
> Yes, SRAT provides the cpu-to-node mapping, which is sufficient to tell
> the scheduler which CPUs belong to a given NUMA node.
>
> However, looking at the current sched-domains below, notice that there is
> no sched-domain with 16 threads to represent a NUMA node:
>
>   cpu0
>   domain0 00000000,00000001,00000000,00000001 (SMT)
>   domain1 00000000,0000000f,00000000,0000000f (MC)
>   domain2 00000000,ffffffff,00000000,ffffffff (NUMA)
>   domain3 ffffffff,ffffffff,ffffffff,ffffffff (NUMA)
>
> sched-domain2 (which represents a sched-domain containing all CPUs within
> a socket) would have 8 sched-groups (based on the cpumasks from domain1).
> According to the documentation snippet above regarding balancing within a
> sched-domain, the scheduler will try to do (NUMA) load-balancing between
> these 8 groups (spanning 4 NUMA nodes). Here, IINM, it would be more
> beneficial if the scheduler tried to load-balance between the two groups
> within the same NUMA node first, before going across NUMA nodes, in order
> to minimize memory latency. This would require another sched-domain
> between domains 1 and 2 which represents all 16 threads within a NUMA
> node (i.e. a die sched-domain); that would allow the scheduler to
> load-balance within the NUMA node first, before going across NUMA nodes.
>
> However, since the current code decides that x86_has_numa_in_package is
> true, it omits the die sched-domain. In order to avoid this, we are
> proposing to represent cpuinfo_x86.phys_proc_id using the NUMA node ID
> (i.e. the die ID). And this is the main point of the patch series.
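
So, if I'm reading the proposal right, it boils down to something like the
below in the AMD topology setup path. This is only a rough sketch of my
understanding - the helper name is made up and it is *not* your actual
patch:

	/*
	 * Hypothetical sketch, not the actual patch: make the "package"
	 * id the scheduler sees be the NUMA node (die) id from SRAT, so
	 * that the topology code ends up building a per-die domain
	 * instead of a per-socket one.
	 */
	#include <linux/numa.h>
	#include <linux/smp.h>
	#include <asm/numa.h>
	#include <asm/processor.h>

	static void amd_present_die_as_package(struct cpuinfo_x86 *c)
	{
		int cpu = smp_processor_id();
		int node = numa_cpu_node(cpu);	/* die-granular node from SRAT */

		if (node != NUMA_NO_NODE)
			c->phys_proc_id = node;	/* die id becomes the "package" id */
	}

Whether that is the right knob to turn is a separate question, though.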
Ok, so here's what I'm reading from all this: a 16-thread package *is*
represented as a logical NUMA node properly by SRAT/SLIT, after all.

However, sched_init_numa() doesn't parse it properly, or something else
along that path keeps the scheduler from seeing the 4 NUMA nodes on the
physical socket. Because we do use the SLIT table to build the NUMA
domains.

So considering that the 16T package really is represented properly as a
logical NUMA node, why doesn't it get detected as such?

This is what needs answering first, and not some forceful fitting of the
topology into a DIE domain.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
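
P.S. For completeness, the reason I say we already use SLIT: sched_init_numa()
derives its extra NUMA domain levels from the distinct node_distance() values.
A simplified illustration of that idea - not the exact kernel code:

	/*
	 * Simplified illustration, not the real sched_init_numa():
	 * each distinct node_distance() value (i.e. SLIT entry) becomes
	 * one NUMA scheduling level, so if SRAT/SLIT describe the four
	 * dies, the levels should fall out of this walk.
	 */
	#include <linux/nodemask.h>
	#include <linux/topology.h>
	#include <linux/types.h>

	static int count_numa_distance_levels(void)
	{
		bool seen[256] = { false };	/* ACPI SLIT distances are single bytes */
		int i, j, levels = 0;

		for_each_online_node(i)
			for_each_online_node(j) {
				int d = node_distance(i, j);

				if (d < 256 && !seen[d]) {
					seen[d] = true;
					levels++;
				}
			}

		return levels;
	}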