Date: Fri, 10 Apr 2015 12:48:39 -0700
From: Nishanth Aravamudan
To: Konstantin Khlebnikov
Cc: Konstantin Khlebnikov, Grant Likely, devicetree@vger.kernel.org,
    Rob Herring, Linux Kernel Mailing List, sparclinux@vger.kernel.org,
    "linux-mm@kvack.org", linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()
Message-ID: <20150410194839.GA31621@linux.vnet.ibm.com>
References: <20150408165920.25007.6869.stgit@buzz>
    <55255F84.6060608@yandex-team.ru>
    <20150408230740.GB53918@linux.vnet.ibm.com>
    <20150409225817.GI53918@linux.vnet.ibm.com>
    <5527B5EF.8090401@yandex-team.ru>
In-Reply-To: <5527B5EF.8090401@yandex-team.ru>

On 10.04.2015 [14:37:19 +0300], Konstantin Khlebnikov wrote:
> On 10.04.2015 01:58, Nishanth Aravamudan wrote:
> >On 09.04.2015 [07:27:28 +0300], Konstantin Khlebnikov wrote:
> >>On Thu, Apr 9, 2015 at 2:07 AM, Nishanth Aravamudan wrote:
> >>>On 08.04.2015 [20:04:04 +0300], Konstantin Khlebnikov wrote:
> >>>>On 08.04.2015 19:59, Konstantin Khlebnikov wrote:
> >>>>>Node 0 might be offline as well as any other numa node,
> >>>>>in this case kernel cannot handle memory allocation and crashes.
> >>>
> >>>Isn't the bug that numa_node_id() returned an offline node?
> >>>That shouldn't happen.
> >>
> >>Offline node 0 came from static-inline copy of that function from of.h
> >>I've patched weak function for keeping consistency.
> >
> >Got it, that's not necessarily clear in the original commit message.
>
> Sorry.
>
> >
> >>>#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
> >>>...
> >>>#ifndef numa_node_id
> >>>/* Returns the number of the current Node. */
> >>>static inline int numa_node_id(void)
> >>>{
> >>>	return raw_cpu_read(numa_node);
> >>>}
> >>>#endif
> >>>...
> >>>#else /* !CONFIG_USE_PERCPU_NUMA_NODE_ID */
> >>>
> >>>/* Returns the number of the current Node. */
> >>>#ifndef numa_node_id
> >>>static inline int numa_node_id(void)
> >>>{
> >>>	return cpu_to_node(raw_smp_processor_id());
> >>>}
> >>>#endif
> >>>...
> >>>
> >>>So that's either the per-cpu numa_node value, right? Or the result of
> >>>cpu_to_node on the current processor.
> >>>
> >>>>Example:
> >>>>
> >>>>[ 0.027133] ------------[ cut here ]------------
> >>>>[ 0.027938] kernel BUG at include/linux/gfp.h:322!
> >>>
> >>>This is
> >>>
> >>>VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
> >>>
> >>>in
> >>>
> >>>alloc_pages_exact_node().
> >>>
> >>>And based on the trace below, that's
> >>>
> >>>__slab_alloc -> alloc
> >>>
> >>>alloc_pages_exact_node
> >>>	<- alloc_slab_page
> >>>		<- allocate_slab
> >>>			<- new_slab
> >>>				<- new_slab_objects
> >>>					<- __slab_alloc?
> >>>
> >>>which is just passing the node value down, right? Which I think was
> >>>from:
> >>>
> >>>	domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size),
> >>>			      GFP_KERNEL, of_node_to_nid(of_node));
> >>>
> >>>?
> >>>
> >>>
> >>>What platform is this on, looks to be x86? qemu emulation of a
> >>>pathological topology? What was the topology?
> >>
> >>qemu x86_64, 2 cpu, 2 numa nodes, all memory in second.
> >
> >Ok, this worked before? That is, this is a regression?
>
> Seems like that worked before 3.17 where
> bug was exposed by commit 44767bfaaed782d6d635ecbb13f3980041e6f33e
> (x86, irq: Enhance mp_register_ioapic() to support irqdomain)
> this is first usage of *irq_domain_add*() in x86.

Ok.

> >> I've slightly patched it to allow that setup (in qemu hardcoded 1Mb
> >>of memory connected to node 0) And I've found unrelated bug --
> >>if numa node has less than 4Mb ram then kernel crashes even
> >>earlier because numa code ignores that node
> >>but buddy allocator still tries to use those pages.
> >
> >So this isn't an actually supported topology by qemu?
>
> Qemu easily created memoryless numa nodes but node 0 has hardcoded
> 1Mb of ram. This seems like legacy prop for DOS era software.

Well, the problem is that x86 doesn't support memoryless nodes.

git grep MEMORYLESS_NODES
arch/ia64/Kconfig:config HAVE_MEMORYLESS_NODES
arch/powerpc/Kconfig:config HAVE_MEMORYLESS_NODES

> >>>Note that there is a ton of code that seems to assume node 0 is online.
> >>>I started working on removing this assumption myself and it just led
> >>>down a rathole (on power, we always have node 0 online, even if it is
> >>>memoryless and cpuless, as a result).
> >>>
> >>>I am guessing this is just happening early in boot before the per-cpu
> >>>areas are setup? That's why (I think) x86 has the early_cpu_to_node()
> >>>function...
> >>>
> >>>Or do you not have CONFIG_OF set? So isn't the only change necessary to
> >>>the include file, and it should just return first_online_node rather
> >>>than 0?
> >>>
> >>>Ah and there's more of those node 0 assumptions :)
> >>
> >>That was x86 where there is no CONFIG_OF at all.
> >>
> >>I don't know what's wrong with that machine but ACPI reports
> >>cpus and memory from node 0 as connected to node 1 and everything
> >>seems to have worked fine until latest upgrade -- seems like buggy static-inline
> >>of_node_to_nid was introduced in 3.13 but x86 ioapic uses it during
> >>early allocations only since 3.17.
> >>Machine owner tells that 3.15 worked fine.
> >
> >So, this was a qemu emulation of this actual physical machine without a
> >node 0?
>
> Yep. Also I have crash from real machine but that stacktrace is messy
> because CONFIG_DEBUG_VM wasn't enabled and kernel crashed inside
> buddy allocator when it tried to touch unallocated numa node structure.
>
> >
> >As I mentioned, there are lots of node 0 assumptions through the kernel.
> >You might run into more issues at runtime.
>
> I think it's possible to trigger kernel crash for any memoryless numa
> node (not just for 0) if some device (like ioapic in my case) points to
> it in its acpi tables. At runtime numa affinity configured by user
> is usually validated by the kernel, while numbers from firmware might
> be used without proper validation.
>
> Anyway seems like at least one x86 machine works fine without
> memory in node 0.

You're going to run into more issues, without adding proper memoryless
node support, I think.

-Nish
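
For readers following the thread, the failure mode discussed here can be modelled in user space. This is only a sketch under stated assumptions, not kernel code: the topology (node 0 offline, node 1 online) is taken from the qemu setup described in the thread, the check mirrors the VM_BUG_ON condition quoted from alloc_pages_exact_node(), and the names node_online_map, of_node_to_nid_old(), of_node_to_nid_new(), resolve_node() and would_trip_bug() are hypothetical. It also assumes that a node-aware allocator treats NUMA_NO_NODE as "use the local node", and that the local node is node 1.

```c
#include <stdbool.h>

#define MAX_NUMNODES 4
#define NUMA_NO_NODE (-1)

/* Hypothetical topology from the thread: node 0 offline, node 1 online. */
static const bool node_online_map[MAX_NUMNODES] = { false, true, false, false };

/* User-space stand-in for the kernel's node_online(); caller checks the range. */
bool node_online(int nid)
{
	return node_online_map[nid];
}

/* Stand-in for numa_node_id(): assume the running CPU sits on node 1. */
int numa_node_id(void)
{
	return 1;
}

/* Fallback before the patch: always reports node 0. */
int of_node_to_nid_old(void)
{
	return 0;
}

/* Fallback after the patch: reports "no preference". */
int of_node_to_nid_new(void)
{
	return NUMA_NO_NODE;
}

/* Assumed allocator behaviour: NUMA_NO_NODE falls back to the local node. */
int resolve_node(int nid)
{
	return nid == NUMA_NO_NODE ? numa_node_id() : nid;
}

/* Same condition as the quoted VM_BUG_ON; true means the check would fire. */
bool would_trip_bug(int nid)
{
	return nid < 0 || nid >= MAX_NUMNODES || !node_online(nid);
}
```

In this model, would_trip_bug(resolve_node(of_node_to_nid_old())) is true because node 0 is offline, while the same call with of_node_to_nid_new() is false because the request falls back to the online local node.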