Date: Tue, 7 Apr 2015 10:14:10 -0700
From: Nishanth Aravamudan
To: Peter Zijlstra
Cc: Ingo Molnar, linux-kernel@vger.kernel.org, Srikar Dronamraju,
	Boqun Feng, Anshuman Khandual, linuxppc-dev@lists.ozlabs.org,
	Benjamin Herrenschmidt, Anton Blanchard
Subject: Re: Topology updates and NUMA-level sched domains
Message-ID: <20150407171410.GA62529@linux.vnet.ibm.com>
References: <20150406214558.GA38501@linux.vnet.ibm.com>
 <20150407102147.GJ23123@twins.programming.kicks-ass.net>
In-Reply-To: <20150407102147.GJ23123@twins.programming.kicks-ass.net>

On 07.04.2015 [12:21:47 +0200], Peter Zijlstra wrote:
> On Mon, Apr 06, 2015 at 02:45:58PM -0700, Nishanth Aravamudan wrote:
> > Hi Peter,
> >
> > As you are very aware, I think, power has some odd NUMA topologies
> > (and changes to those topologies) at run-time. In particular, we can
> > see a topology at boot:
> >
> > Node 0: all CPUs
> > Node 7: no CPUs
> >
> > Then we get a notification from the hypervisor that a core (or two)
> > have moved from node 0 to node 7. This results in the:
> >
> > [...]
> >
> > or a re-init API (which won't try to reallocate various bits),
> > because the topology could be completely different now (e.g.,
> > sched_domains_numa_distance will also be inaccurate now). Really, a
> > topology update on power (not sure on s390x, but those are the only
> > two archs that return a positive value from
> > arch_update_cpu_topology() right now, afaics) is a lot like a
> > hotplug event, and we need to re-initialize any dependent
> > structures.
> >
> > I'm just sending out feelers, as we can limp by with the above
> > warning, it seems, but it is less than ideal. Any help or insight
> > you could provide would be greatly appreciated!
>
> So I think (and ISTR having stated this before) that dynamic
> cpu<->node maps are absolutely insane.

Sorry if I wasn't involved at the time. I agree that it's a bit of a
mess!

> There is a ton of stuff that assumes the cpu<->node relation is a
> boot time fixed one. Userspace being one of them. Per-cpu memory
> another.

Well, userspace already deals with CPU hotplug, right? And the
topology updates are, in a lot of ways, just like you've hotplugged a
CPU from one node and re-hotplugged it into another node.

I'll look into the per-cpu memory case. For what it's worth, our test
teams are stressing the kernel with these topology updates, and
hopefully we'll be able to resolve any issues that result.
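To make the hotplug analogy concrete, the handling I have in mind is
roughly the sketch below. Note it is only a sketch: the example_*
helpers are placeholders I'm inventing for illustration (on powerpc
they would correspond to fixing up numa_cpu_lookup_table and the
per-node cpumasks); rebuild_sched_domains(),
arch_update_cpu_topology() and the hotplug locking are the real
interfaces.

#include <linux/cpu.h>
#include <linux/cpuset.h>
#include <linux/workqueue.h>

/* Placeholder: arch-specific update of the cpu<->node lookup tables. */
static void example_remap_cpu_to_node(int cpu, int new_node)
{
	/* e.g. fix up the per-node cpumasks and cpu_to_node() data */
}

static int example_pending_cpu, example_pending_node;

static void example_topology_update_fn(struct work_struct *work)
{
	get_online_cpus();
	example_remap_cpu_to_node(example_pending_cpu,
				  example_pending_node);
	put_online_cpus();

	/*
	 * Ask the scheduler to rebuild its domains; this path calls
	 * arch_update_cpu_topology(), which has to return a positive
	 * value so the NUMA levels are actually recomputed.
	 */
	rebuild_sched_domains();
}

static DECLARE_WORK(example_topology_update_work,
		    example_topology_update_fn);

/* Called from the hypervisor-notification event handler. */
static void example_handle_hv_topology_event(int cpu, int new_node)
{
	example_pending_cpu = cpu;
	example_pending_node = new_node;
	schedule_work(&example_topology_update_work);
}

The per-cpu memory and userspace-binding questions are, of course,
exactly the parts this sketch glosses over.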
> You simply cannot do this without causing massive borkage.
>
> So please come up with a coherent plan to deal with the entire
> problem of dynamic cpu to memory relation and I might consider the
> scheduler impact. But we're not going to hack around and maybe make
> it not crash in a few corner cases while the entire thing is shite.

Well, it doesn't crash now. In fact, it stays up reasonably well and
seems to dtrt (from the kernel perspective), other than the sched
domain messages.

I will look into per-cpu memory, and also into another case I have
been thinking about: if a process is bound to a CPU/node combination
via numactl and the topology then changes, what exactly happens? In
theory, via these topology updates, a node could go from memoryless
to having memory (and vice versa), which seems like it might not be
well supported (but, again, it should not be much different from
hotplugging all the memory out of a node).

And, in fact, topologically speaking, I think I should be able to
reproduce the same sched domain warnings if I start off with a 2-node
system with all CPUs on one node and then hotplug a CPU onto the
second node, right? That has nothing to do with power, as far as I
can tell. I'll see if I can demonstrate it via a KVM guest.

Thanks for your quick response!

-Nish
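P.S. For the KVM experiment, something along the following lines
should give a guest whose online CPUs all start out on node 0, with
the hotpluggable CPUs assigned to node 1. This is untested and the
option spellings are from memory, so treat it as a sketch:

  qemu-system-x86_64 -enable-kvm -m 4G \
      -smp 2,maxcpus=4,sockets=4,cores=1,threads=1 \
      -numa node,nodeid=0,cpus=0-1,mem=2G \
      -numa node,nodeid=1,cpus=2-3,mem=2G \
      -monitor stdio \
      -hda guest.img

  # CPUs 0-1 come up on node 0; CPUs 2-3 belong to node 1 but are not
  # plugged in yet. Then, from the QEMU monitor:
  (qemu) cpu-add 2
  # ...and online it in the guest if that doesn't happen automatically:
  # echo 1 > /sys/devices/system/cpu/cpu2/online

If the same sched domain warnings show up there, that would confirm
this isn't power-specific.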