Date: Tue, 2 Nov 2021 14:41:25 +0100
From: David Hildenbrand
Organization: Red Hat
To: Michal Hocko
Cc: Alexey Makhalov, linux-mm@kvack.org, Andrew Morton,
 linux-kernel@vger.kernel.org, stable@vger.kernel.org, Oscar Salvador
Subject: Re: [PATCH] mm: fix panic in __alloc_pages
References: <7136c959-63ff-b866-b8e4-f311e0454492@redhat.com>
 <42abfba6-b27e-ca8b-8cdf-883a9398b506@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 02.11.21 14:25, Michal Hocko wrote:
> On Tue 02-11-21 13:39:06, David Hildenbrand wrote:
>>>> Yes, but a zonelist cannot be correct for an offline node, where we
>>>> might not even have an allocated pgdat yet. No pgdat, no zonelist.
>>>> So as soon as we allocate the pgdat and set the node online
>>>> (->hotadd_new_pgdat()), the zonelists have to be correct. And I can
>>>> spot a build_all_zonelists() in hotadd_new_pgdat().
>>>
>>> Yes, that is what I had in mind. We are talking about two things
>>> here: memoryless nodes and offline nodes. The latter sounds like a
>>> bug to me.
>>
>> Agreed. Memoryless nodes should just have proper zonelists -- which
>> seems to be the case.
>>
>>>> Maybe __alloc_pages_bulk() and alloc_pages_node() should bail out
>>>> directly (VM_BUG_ON()) in case we're providing an offline node,
>>>> with possibly no or a stale pgdat, as the preferred nid.
>>>
>>> Historically, those allocation interfaces were not trying to be
>>> robust against wrong inputs, because that adds cpu cycles for
>>> everybody for "what if buggy" code. This has worked (surprisingly)
>>> well. Memoryless nodes have brought in some confusion, but this is
>>> still something that we can address on a higher level. Nobody gives
>>> arbitrary nodes as input. cpu_to_node might be tricky because it can
>>> point to a memoryless node, which along with __GFP_THISNODE is very
>>> likely not something anybody wants. Hence cpu_to_mem should be used
>>> for allocations. I hate that we have two very similar APIs...
>>
>> To be precise, I'm wondering if we should do:
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index 55b2ec1f965a..8c49b88336ee 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -565,7 +565,7 @@ static inline struct page *
>>  __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>>  {
>>  	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>> -	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
>> +	VM_WARN_ON(!node_online(nid));
>>  
>>  	return __alloc_pages(gfp_mask, order, nid, NULL);
>>  }
>>
>> (Or maybe VM_BUG_ON.)
>>
>> Because it cannot possibly work and we'll dereference NULL later.
>
> VM_BUG_ON would be silent for most configurations, and the crash would
> happen even without it, so I am not sure about the additional value.
> VM_WARN_ON doesn't really add much on top -- except it would crash in
> some configurations. If we really care to catch this case, then we
> would have to do a reasonable fallback with a printk note and a
> dump_stack().

As I learned, VM_BUG_ON and friends are active on e.g. Fedora, which can
catch quite some issues early, before they end up in enterprise distro
kernels. I think it has value.
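For context, a condensed sketch of the path in question -- simplified
from include/linux/gfp.h around that time, not a verbatim copy:

static inline struct page *
__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
{
	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
	/* Proposed change: warn about any offline nid, not only with __GFP_THISNODE. */
	VM_WARN_ON(!node_online(nid));

	return __alloc_pages(gfp_mask, order, nid, NULL);
}

/*
 * Further down, the preferred nid is turned into a zonelist through its
 * pgdat. For an offline node that never got a pgdat allocated,
 * NODE_DATA(nid) is NULL, which is where we dereference NULL.
 */
static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}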
> Something like
>
> 	if (unlikely(!node_online(nid))) {
> 		pr_err("%d is an offline numa node and using it is a bug in a caller. Please report...\n", nid);
> 		dump_stack();
> 		nid = numa_mem_id();
> 	}
>
> But again, this is adding quite some cycles to a hot path of the page
> allocator. Is this worth it?

I don't think a fallback makes sense.

>>> But something seems wrong in this case. cpu_to_node shouldn't return
>>> offline nodes. That is just a land mine. It is not clear to me how
>>> the cpu has been brought up so that the numa node allocation was
>>> left behind. As pointed out in the other email, add_cpu resp. cpu_up
>>> is not it. Is it possible that the cpu bring-up was only done half
>>> way?
>>
>> I tried to follow the code (what sets a CPU present, what sets a CPU
>> online, when do we update the cpu_to_node() mapping), and IMHO it's
>> all a big mess. Maybe it's clearer to people familiar with that code,
>> but CPU hotplug in general seems to be a confusing piece of
>> (arch-specific) code.
>
> Yes, there are different arch-specific parts that make this quite hard
> to follow.
>
> I think we want to learn how exactly Alexey brought that cpu up,
> because his initial thought on add_cpu resp. cpu_up doesn't seem to be
> correct. Or I am just not following the code properly. Once we know
> all those details, we can get in touch with the cpu hotplug
> maintainers and see what we can do.

Yes.

> Btw. do you plan to send a patch for the pcp allocator to use
> cpu_to_mem?

You mean s/cpu_to_node/cpu_to_mem/, or also handling offline nids?

cpu_to_mem() corresponds to cpu_to_node() unless on ia64+ppc IIUC, so it
won't help for this very report.

> One last thing: there were some mentions of __GFP_THISNODE, but I fail
> to see the connection with the pcp allocator...

Me too. If pcpu were using __GFP_THISNODE, we'd be hitting the
VM_WARN_ON but still crash.

-- 
Thanks,

David / dhildenb
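For reference, the cpu_to_mem()/cpu_to_node() distinction mentioned
above, roughly as it looks in include/linux/topology.h (a simplified
sketch, not verbatim):

#ifdef CONFIG_HAVE_MEMORYLESS_NODES
static inline int cpu_to_mem(int cpu)
{
	/* Nearest node that actually has memory for this CPU. */
	return per_cpu(_numa_mem_, cpu);
}
#else
static inline int cpu_to_mem(int cpu)
{
	/* Without memoryless-node support, the two APIs are the same. */
	return cpu_to_node(cpu);
}
#endif

Per the ia64+ppc remark above, only those architectures select
CONFIG_HAVE_MEMORYLESS_NODES, which is why switching to cpu_to_mem()
would not change anything for this particular report.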