Received: by 2002:a25:e7d8:0:0:0:0:0 with SMTP id e207csp1525608ybh; Fri, 13 Mar 2020 02:50:23 -0700 (PDT) X-Google-Smtp-Source: ADFU+vsbCb9ZmNTtsIjda/cpZCo6nM3AyG4FEjJlEDb8u0QUo+NGVmg+7I5fCDhpq8jNoaE9jPM+ X-Received: by 2002:a9d:4f0:: with SMTP id 103mr6680206otm.336.1584093023281; Fri, 13 Mar 2020 02:50:23 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1584093023; cv=none; d=google.com; s=arc-20160816; b=XiHIlVAg3tFUghlDhKCDyfp7r9+ok6NImLqI6yJrGA8Jo8FkIRxkfnYJow7iXlhNCh Mx5a9gHY8Yox0jyE0w5DEW4YwTTylBG3HnXD8sjoDaONz7L6pZuXfbAKBCgHsFuTweii xb1d1SeQq/buFg0J+5x3rcpuslaI0JSFOgUdA426X0hkKYM3CjvbFbLyrIpANQQqe16y V49zWwYq3+WgppOopuFNvabRYSCRuG03kQoBGrf3rqG34yqcQh8QwZ4tRTyldctzSl+u /A63yhLYiFU9bhWUqHL/pPyZBP0fPrkcrKP7T2x840ki3GD2QV4o0grRJakC99I4MtPN SWzg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=8ype9a2Pgy0TDkVr+jjtxMr6KTyOlymmisvMN8Riefk=; b=qVPcRz7iTal2Ng8j5qpbG+1BU4qktLi55CD4FbbMj11Tz6ZeAB0KBTVEDZ1FT0myXO ZjF1HDxzLkPTIuHUq9Ahq4QXRTfVJQqt+zMFxAYy78CAd110Rs1PbrsjucVfnOs+8mZB cQHURYTRKW8rYdO8JPLV2u5P75UYXBTDFm4Q78ToyqqhN93faPE8QEbcuptnTm5140mt rIz2TEo4cxRsIvNUxOemk36agI8Cz0y/BkSvTe0lG7dSAllWAezY7xQNida4w4JxHre3 gR51DuNbvlBtQEQiWn4eDtrOASQWovyjbW1VY/D1NH8UMSUVBeJT17guakEFAI2ZCRsZ mP/w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=idV80ylM; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id v22si4200462oic.271.2020.03.13.02.50.10; Fri, 13 Mar 2020 02:50:23 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=idV80ylM; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726621AbgCMJsE (ORCPT + 99 others); Fri, 13 Mar 2020 05:48:04 -0400 Received: from mail-qt1-f195.google.com ([209.85.160.195]:43225 "EHLO mail-qt1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726512AbgCMJsC (ORCPT ); Fri, 13 Mar 2020 05:48:02 -0400 Received: by mail-qt1-f195.google.com with SMTP id l13so6922286qtv.10 for ; Fri, 13 Mar 2020 02:48:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=8ype9a2Pgy0TDkVr+jjtxMr6KTyOlymmisvMN8Riefk=; b=idV80ylMvEIMsvllcp2OWlep7Wg7lp0Ru/uKDyRC6qmc69JFzVeRgtSHUhR+D59WlM AoU7d3uKDNnIU1L8At21wB0rJIG17hd5BoXi5lLqgMBGnmwOMs8blHzdVC04jvi6xYui VIoMUsD22lhwAoWHU8w/iRpsazyrL77UIBOQZ7O0YewBF3lH5V3AYjrSvFPWYyu6I4Dz lmIpRCuLy0T/8xJY2bLzVqt7BVGuMqfPhBVGymY4PMzaxnf3FAOI9ltIzrUHPQFGt6xC foq4mOfumN0EfS2Z3VkXuD11EFgNuPjWhxsAjPbmAMIGuFGisI46U2A4qBu9rYkKJwnb bZ6w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=8ype9a2Pgy0TDkVr+jjtxMr6KTyOlymmisvMN8Riefk=; b=ZewLMj73eFierun+idORHVHh7UvbRl6wkze2xjs00NZ4vRiodGTYKZm3frsQloRKRm z96a8nxh3q3s2DAhNQAhrotgzdjVd+SbOc30ZIN3ZzN5g+KmQtUrU5nyJ2jjJAdMez/q Sjqd1m6BrKkdKTIUD8xmqZoJffHgTH1osC5/7/1AyD/uyWkSSrPia4RkgzygWQcYT2Mu zxpDrooNbJ9t4PMEkcgtGQ54d6qoQWAiO4LON7bIMVBIZvbTxcgc75VnLPsojQ15Ysd7 wJZqMW6THmRVVH/blmoZDV7aiHBf+f27GYKVkN8eK5QWj5fq+VexeTPRF6mcm4zzj6ER f0Bg== X-Gm-Message-State: ANhLgQ2v9GjBcqPuZt6r8u09mpjPPRZXwjWwk8SmXw3lEZ8grfON/Q+b Mz36T1oLbYrGYE8Wq/oHXumKFUs1OyRFULN+8jM= X-Received: by 2002:aed:3346:: with SMTP id u64mr4634827qtd.333.1584092880504; Fri, 13 Mar 2020 02:48:00 -0700 (PDT) MIME-Version: 1.0 References: <20200311110237.5731-1-srikar@linux.vnet.ibm.com> <20200311110237.5731-2-srikar@linux.vnet.ibm.com> <20200311115735.GM23944@dhcp22.suse.cz> <20200312052707.GA3277@linux.vnet.ibm.com> <5e5c736a-a88c-7c76-fc3d-7bc765e8dcba@suse.cz> <20200312131438.GB3277@linux.vnet.ibm.com> <61437352-8b54-38fa-4471-044a65c9d05a@suse.cz> <20200312161310.GC3277@linux.vnet.ibm.com> In-Reply-To: From: Joonsoo Kim Date: Fri, 13 Mar 2020 18:47:49 +0900 Message-ID: Subject: Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus To: Vlastimil Babka Cc: Srikar Dronamraju , Sachin Sant , Michal Hocko , Linus Torvalds , LKML , Linux Memory Management List , Mel Gorman , "Kirill A. Shutemov" , Andrew Morton , linuxppc-dev , Christopher Lameter , Joonsoo Kim , Kirill Tkhai Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org 2020=EB=85=84 3=EC=9B=94 13=EC=9D=BC (=EA=B8=88) =EC=98=A4=EC=A0=84 1:42, V= lastimil Babka =EB=8B=98=EC=9D=B4 =EC=9E=91=EC=84=B1: > > On 3/12/20 5:13 PM, Srikar Dronamraju wrote: > > * Vlastimil Babka [2020-03-12 14:51:38]: > > > >> > * Vlastimil Babka [2020-03-12 10:30:50]: > >> > > >> >> On 3/12/20 9:23 AM, Sachin Sant wrote: > >> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju wrote: > >> >> >> * Michal Hocko [2020-03-11 12:57:35]: > >> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote: > >> >> >>>> To ensure a cpuless, memoryless dummy node is not online, powe= rpc need > >> >> >>>> to make sure all possible but not present cpu_to_node are set = to a > >> >> >>>> proper node. > >> >> >>> > >> >> >>> Just curious, is this somehow related to > >> >> >>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz? > >> >> >>> > >> >> >> > >> >> >> The issue I am trying to fix is a known issue in Powerpc since m= any years. > >> >> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c= : allocate > >> >> >> shrinker_map on appropriate NUMA node"). > >> >> >> > >> > > >> > While I am not an expert in the slub area, I looked at the patch > >> > a75056fc1e7c and had some thoughts on why this could be causing this= issue. > >> > > >> > On the system where the crash happens, the possible number of nodes = is much > >> > greater than the number of onlined nodes. The pdgat or the NODE_DATA= is only > >> > available for onlined nodes. > >> > > >> > With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzall= oc_node > >> > for all possible nodes and in ___slab_alloc we end up looking at the > >> > node_present_pages which is NODE_DATA(nid)->node_present_pages. > >> > i.e for a node whose pdgat struct is not allocated, we are trying to > >> > dereference. > >> > >> From what we saw, the pgdat does exist, the problem is that slab's per= -node data > >> doesn't exist for a node that doesn't have present pages, as it would = be a waste > >> of memory. > > > > Just to be clear > > Before my 3 patches to fix dummy node: > > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible > > 0-31 > > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online > > 0-1 > > OK > > >> > >> Uh actually you are probably right, the NODE_DATA doesn't exist anymor= e? In > >> Sachin's first report [1] we have > >> > >> [ 0.000000] numa: NODE_DATA [mem 0x8bfedc900-0x8bfee3fff] > >> [ 0.000000] numa: NODE_DATA(0) on node 1 > >> [ 0.000000] numa: NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff] > >> > > > > So even if pgdat would exist for nodes 0 and 1, there is no pgdat for t= he > > rest 30 nodes. > > I see. Perhaps node_present_pages(node) is not safe in SLUB then and it s= hould > check online first, as you suggested. > > >> But in this thread, with your patches Sachin reports: > > > > and with my patches > > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible > > 0-31 > > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online > > 1 > > > >> > >> [ 0.000000] numa: NODE_DATA [mem 0x8bfedc900-0x8bfee3fff] > >> > > > > so we only see one pgdat. > > > >> So I assume it's just node 1. In that case, node_present_pages is real= ly dangerous. > >> > >> [1] > >> https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A6396= 8@linux.vnet.ibm.com/ > >> > >> > Also for a memoryless/cpuless node or possible but not present nodes= , > >> > node_to_mem_node(node) will still end up as node (atleast on powerpc= ). > >> > >> I think that's the place where this would be best to fix. > >> > > > > Maybe. I thought about it but the current set_numa_mem semantics are ap= t > > for memoryless cpu node and not for possible nodes. We could have upto= 256 > > possible nodes and only 2 nodes (1,2) with cpu and 1 node (1) with memo= ry. > > node_to_mem_node seems to return what is set in set_numa_mem(). > > set_numa_mem() seems to say set my numa_mem node for the current memory= less > > node to the param passed. > > > > But how do we set numa_mem for all the other 253 possible nodes, which > > probably will have 0 as default? > > > > Should we introduce another API such that we could update for all possi= ble > > nodes? > > If we want to rely on node_to_mem_node() to give us something safe for ea= ch > possible node, then probably it would have to be like that, yeah. > > >> > I tried with this hunk below and it works. > >> > > >> > But I am not sure if we need to check at other places were > >> > node_present_pages is being called. > >> > >> I think this seems to defeat the purpose of node_to_mem_node()? Should= n't it > >> return only nodes that are online with present memory? > >> CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topo= logy: add > >> support for node_to_mem_node() to determine the fallback node") > >> > > > > Agree I lost all the memory about it. :) Anyway, how about this? 1. make node_present_pages() safer static inline node_present_pages(nid) { if (!node_online(nid)) return 0; return (NODE_DATA(nid)->node_present_pages); } 2. make node_to_mem_node() safer for all cases In ppc arch's mem_topology_setup(void) for_each_present_cpu(cpu) { numa_setup_cpu(cpu); mem_node =3D node_to_mem_node(numa_mem_id()); if (!node_present_pages(mem_node)) { _node_numa_mem_[numa_mem_id()] =3D first_online_node; } } With these two changes, we can uses node_present_pages() and node_to_mem_no= de() as intended. Thanks. > >> I think we do need well defined and documented rules around node_to_me= m_node(), > >> cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc= so > >> everyone handles it the same, safe way. > > So let's try to brainstorm how this would look like? What I mean are some= rules > like below, even if some details in my current understanding are most lik= ely > incorrect: > > with nid present in: > N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some o= nline > node with memory so that we don't require everyone to search for it in sl= ightly > different ways > N_ONLINE - pgdat must exist, there doesn't have to be present memory, > node_to_mem_node() still has to return something else (?) > N_NORMAL_MEMORY - there is present memory, node_to_mem_node() returns its= elf > N_HIGH_MEMORY - node has present high memory