From: Ard Biesheuvel
Date: Thu, 15 Mar 2018 07:59:16 +0000
Subject: Re: [PATCH] Revert "mm/page_alloc: fix memmap_init_zone pageblock alignment"
To: Daniel Vacek
Cc: Michal Hocko, linux-arm-kernel, Linux Kernel Mailing List,
 Mark Rutland, Will Deacon, Catalin Marinas, Marc Zyngier, Mel Gorman,
 Paul Burton, Pavel Tatashin, Vlastimil Babka, Andrew Morton,
 Linus Torvalds
References: <20180314134431.13241-1-ard.biesheuvel@linaro.org>
 <20180314141323.GD23100@dhcp22.suse.cz>
 <20180314145450.GI23100@dhcp22.suse.cz>
X-Mailing-List: linux-kernel@vger.kernel.org

On 15 March 2018 at 07:42, Daniel Vacek wrote:
> On Thu, Mar 15, 2018 at 7:39 AM, Ard Biesheuvel wrote:
>> On 15 March 2018 at 02:32, Daniel Vacek wrote:
>>> On Wed, Mar 14, 2018 at 6:36 PM, Ard Biesheuvel wrote:
>>>> On 14 March 2018 at 16:41, Ard Biesheuvel wrote:
>>>>> On 14 March 2018 at 15:54, Ard Biesheuvel wrote:
>>>>>> On 14 March 2018 at 14:54, Michal Hocko wrote:
>>>>>>> On Wed 14-03-18 14:35:12, Ard Biesheuvel wrote:
>>>>>>>> On 14 March 2018 at 14:13, Michal Hocko wrote:
>>>>>>>> > Does http://lkml.kernel.org/r/20180313224240.25295-1-neelx@redhat.com
>>>>>>>> > fix your issue? From the debugging info you provided it should,
>>>>>>>> > because the patch prevents jumping backwards.
>>>>>>>>
>>>>>>>> The patch does fix the boot hang.
>>>>>>>>
>>>>>>>> But I am concerned that we are papering over a fundamental flaw in
>>>>>>>> memblock_next_valid_pfn().
>>>>>>>
>>>>>>> It seems that memblock_next_valid_pfn is doing the right thing here. It
>>>>>>> is the alignment which moves the pfn back, AFAICS. I am not really
>>>>>>> impressed by the original patch either, to be completely honest.
>>>>>>> It just looks awfully tricky. I still haven't managed to wrap my head
>>>>>>> around the original issue, though, so I do not have much better ideas
>>>>>>> to be honest.
>>>>>>
>>>>>> So first of all, memblock_next_valid_pfn() never refers to its max_pfn
>>>>>> argument, which is odd but easily fixed.
>>>>>> Then, the whole idea of subtracting one so that the pfn++ will
>>>>>> produce the expected value is rather hacky.
>>>>>>
>>>>>> But the real problem is that rounding down pfn for the next iteration
>>>>>> is dodgy, because early_pfn_valid() isn't guaranteed to return true
>>>>>> for the rounded-down value. I know it is probably fine in reality, but
>>>>>> it's dodgy as hell. The same applies to the call to early_pfn_in_nid(),
>>>>>> btw.
>>>>>>
>>>>>> So how about something like this (apologies on Gmail's behalf for the
>>>>>> whitespace damage; I can resend it as a proper patch)?
>>>>>>
>>>>>> ---------8<-----------
>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>> index 3d974cb2a1a1..b89ca999ee3b 100644
>>>>>> --- a/mm/page_alloc.c
>>>>>> +++ b/mm/page_alloc.c
>>>>>> @@ -5352,28 +5352,29 @@
>>>>>>           * function. They do not exist on hotplugged memory.
>>>>>>           */
>>>>>>          if (context != MEMMAP_EARLY)
>>>>>>                  goto not_early;
>>>>>>
>>>>>> -        if (!early_pfn_valid(pfn)) {
>>>>>> +        if (!early_pfn_valid(pfn) || !early_pfn_in_nid(pfn, nid)) {
>>>>>>  #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>>>>>>                  /*
>>>>>>                   * Skip to the pfn preceding the next valid one (or
>>>>>>                   * end_pfn), such that we hit a valid pfn (or end_pfn)
>>>>>>                   * on our next iteration of the loop. Note that it needs
>>>>>>                   * to be pageblock aligned even when the region itself
>>>>>>                   * is not. move_freepages_block() can shift ahead of
>>>>>>                   * the valid region but still depends on correct page
>>>>>>                   * metadata.
>>>>>>                   */
>>>>>> -                pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
>>>>>> -                       ~(pageblock_nr_pages-1)) - 1;
>>>>>> -#endif
>>>>>> +                pfn = memblock_next_valid_pfn(pfn, end_pfn);
>>>>>> +                if (pfn >= end_pfn)
>>>>>> +                        break;
>>>>>> +                pfn &= ~(pageblock_nr_pages - 1);
>>>>>> +#else
>>>>>>                  continue;
>>>>>> +#endif
>>>>>>          }
>>>>>> -        if (!early_pfn_in_nid(pfn, nid))
>>>>>> -                continue;
>>>>>>          if (!update_defer_init(pgdat, pfn, end_pfn, &nr_initialised))
>>>>>>                  break;
>>>>>>
>>>>>>  #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>>>>>>          /*
>>>>>> ---------8<-----------
>>>>>>
>>>>>> This ensures that we enter the remainder of the loop with a properly
>>>>>> aligned pfn, rather than tweaking the value of pfn so it assumes the
>>>>>> expected value after 'pfn++'.
>>>>>
>>>>> Um, this does not actually solve the issue. I guess this is due to the
>>>>> fact that a single pageblock-sized chunk could have both valid and
>>>>> invalid PFNs, and so rounding down the first PFN of the second valid
>>>>> chunk moves you back to the first chunk.
>>>> OK, so the original patch attempted to ensure that, of each pageblock,
>>>> at least the first struct page gets initialized, even though the PFN
>>>> may not be valid. Unfortunately, this code is not complete, given that
>>>> start_pfn itself may be misaligned, and so the issue it attempts to
>>>> solve may still occur.
>>>
>>> You're wrong here.
>>>
>>
>> You only align down after encountering an invalid PFN. If start_pfn
>> itself is not pageblock aligned, how do you initialize the first
>> struct page of the pageblock?
>>
>>>> Then, I think it is absolutely dodgy to settle for only initializing
>>>> the first struct page, rather than all of them, only because a
>>>> specific VM_BUG_ON() references the flags field of the first struct
>>>> page.
>>>> IMO, we should fix this by initializing all struct page entries for
>>>> each pageblock-sized chunk that has any valid PFNs.
>>>
>>> That's precisely what my patch does. At least with
>>> CONFIG_HAVE_ARCH_PFN_VALID disabled. And it looks like only arm
>>> implements an arch pfn_valid(), which I was not testing with, and I am
>>> not sure it's correct. Check my other email.
>>
>> No, your patch only initializes the first struct page of a pageblock.
>> If the next one is invalid, we will skip to the next valid one.
>
> I believe you're pretty puzzled here.
>

It depends on your implementation of early_pfn_valid(). You keep making
the assumption that your architecture's interpretation of pfn_valid()
is in some way more correct, and so that it is justified to make
assumptions about its implementation, i.e., to round down to the start
of a pageblock since pfn_valid() is known to return the same result for
all PFNs covered by the same pageblock.

early_pfn_valid() may return false for the rounded-down PFN value, and
for any subsequent ones. So whether the rounding has the desired effect
(or any effect at all, other than hanging the box) is uncertain.

As I said, if memblock_next_valid_pfn() skips PFNs for which
pfn_valid() returns true, we should fix that instead.

>> You are making the assumption that pfn_valid() will return true for
>> all pages in a pageblock if it returns true for one of them, and this
>> does not hold on other architectures.
>
> It does. At least the generic version defined in
> include/linux/mmzone.h. And this seems to be used by all arches but
> arm with CONFIG_HAVE_ARCH_PFN_VALID. With that config disabled, I
> guess even arm behaves the same. Though I could be wrong.
>

None of that matters. It is not part of the contract of pfn_valid()
that you may infer that arbitrary PFNs in the vicinity of the pfn value
you gave it are valid as well.
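
To make the crux of the disagreement concrete, here is a minimal
standalone userspace sketch. It is not taken from the thread: the
region layout, the section and pageblock sizes, and the two
pfn_valid() mock-ups below are made-up simplifications standing in
for the generic section-granular sparsemem check and for arm's
memblock-granular CONFIG_HAVE_ARCH_PFN_VALID check; they are not the
real mm/ or arch/arm implementations.

---------8<-----------
#include <stdbool.h>
#include <stdio.h>

#define PAGEBLOCK_NR_PAGES 512UL          /* e.g. 2 MiB blocks of 4 KiB pages */
#define SECTION_NR_PAGES   (1UL << 15)    /* sections are much coarser */

/* one memory region that starts mid-pageblock: PFNs [0x2f0, 0x800) */
static const unsigned long region_start = 0x2f0, region_end = 0x800;

/* memblock-granular check, in the spirit of arm's arch pfn_valid() */
static bool arch_pfn_valid(unsigned long pfn)
{
        return pfn >= region_start && pfn < region_end;
}

/*
 * section-granular check, in the spirit of the generic sparsemem
 * version: a whole section counts as valid if the region touches
 * any part of it
 */
static bool generic_pfn_valid(unsigned long pfn)
{
        unsigned long sec = pfn / SECTION_NR_PAGES;

        return sec >= region_start / SECTION_NR_PAGES &&
               sec <= (region_end - 1) / SECTION_NR_PAGES;
}

int main(void)
{
        unsigned long pfn = region_start;                     /* a valid PFN */
        unsigned long down = pfn & ~(PAGEBLOCK_NR_PAGES - 1); /* rounded down
                                                                 to pageblock */

        printf("pfn 0x%lx:     arch=%d generic=%d\n",
               pfn, arch_pfn_valid(pfn), generic_pfn_valid(pfn));
        printf("rounded 0x%lx: arch=%d generic=%d\n",
               down, arch_pfn_valid(down), generic_pfn_valid(down));
        return 0;
}
---------8<-----------

Compiled and run, this prints arch=1 generic=1 for the valid start PFN
but arch=0 generic=1 for its pageblock round-down: the round-down that
is harmless under a section-granular pfn_valid() lands on an invalid
PFN under a memblock-granular one, which is the behaviour being
objected to above.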