Subject: Re: [PATCH] Revert "mm/page_alloc: fix memmap_init_zone pageblock alignment"
From: Daniel Vacek
Date: Thu, 15 Mar 2018 08:42:57 +0100
To: Ard Biesheuvel
Cc: Michal Hocko, linux-arm-kernel, Linux Kernel Mailing List, Mark Rutland,
    Will Deacon, Catalin Marinas, Marc Zyngier, Mel Gorman, Paul Burton,
    Pavel Tatashin, Vlastimil Babka, Andrew Morton, Linus Torvalds
References: <20180314134431.13241-1-ard.biesheuvel@linaro.org>
 <20180314141323.GD23100@dhcp22.suse.cz>
 <20180314145450.GI23100@dhcp22.suse.cz>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Mar 15, 2018 at 7:39 AM, Ard Biesheuvel wrote:
> On 15 March 2018 at 02:32, Daniel Vacek wrote:
>> On Wed, Mar 14, 2018 at 6:36 PM, Ard Biesheuvel
>> wrote:
>>> On 14 March 2018 at 16:41, Ard Biesheuvel wrote:
>>>> On 14 March 2018 at 15:54, Ard Biesheuvel wrote:
>>>>> On 14 March 2018 at 14:54, Michal Hocko wrote:
>>>>>> On Wed 14-03-18 14:35:12, Ard Biesheuvel wrote:
>>>>>>> On 14 March 2018 at 14:13, Michal Hocko wrote:
>>>>>>> > Does http://lkml.kernel.org/r/20180313224240.25295-1-neelx@redhat.com
>>>>>>> > fix your issue? From the debugging info you provided it should because
>>>>>>> > the patch prevents jumping backwards.
>>>>>>> >
>>>>>>>
>>>>>>> The patch does fix the boot hang.
>>>>>>>
>>>>>>> But I am concerned that we are papering over a fundamental flaw in
>>>>>>> memblock_next_valid_pfn().
>>>>>>
>>>>>> It seems that memblock_next_valid_pfn is doing the right thing here. It
>>>>>> is the alignment which moves the pfn back AFAICS. I am not really
>>>>>> impressed about the original patch either, to be completely honest.
>>>>>> It just looks awfully tricky. I still didn't manage to wrap my head
>>>>>> around the original issue though so I do not have much better ideas to
>>>>>> be honest.
>>>>>
>>>>> So first of all, memblock_next_valid_pfn() never refers to its max_pfn
>>>>> argument, which is odd but easily fixed.
>>>>> Then, the whole idea of subtracting one so that the pfn++ will
>>>>> produce the expected value is rather hacky.
>>>>>
>>>>> But the real problem is that rounding down pfn for the next iteration
>>>>> is dodgy, because early_pfn_valid() isn't guaranteed to return true
>>>>> for the rounded down value. I know it is probably fine in reality, but
>>>>> dodgy as hell. The same applies to the call to early_pfn_in_nid() btw.
>>>>>
>>>>> So how about something like this (apologies on Gmail's behalf for the
>>>>> whitespace damage, I can resend it as a proper patch)
>>>>>
>>>>>
>>>>> ---------8<-----------
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index 3d974cb2a1a1..b89ca999ee3b 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -5352,28 +5352,29 @@
>>>>>                  * function.  They do not exist on hotplugged memory.
>>>>>                  */
>>>>>                 if (context != MEMMAP_EARLY)
>>>>>                         goto not_early;
>>>>>
>>>>> -               if (!early_pfn_valid(pfn)) {
>>>>> +               if (!early_pfn_valid(pfn) || !early_pfn_in_nid(pfn, nid)) {
>>>>>  #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>>>>>                         /*
>>>>>                          * Skip to the pfn preceding the next valid one (or
>>>>>                          * end_pfn), such that we hit a valid pfn (or end_pfn)
>>>>>                          * on our next iteration of the loop. Note that it needs
>>>>>                          * to be pageblock aligned even when the region itself
>>>>>                          * is not. move_freepages_block() can shift ahead of
>>>>>                          * the valid region but still depends on correct page
>>>>>                          * metadata.
>>>>>                          */
>>>>> -                       pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
>>>>> -                               ~(pageblock_nr_pages-1)) - 1;
>>>>> -#endif
>>>>> +                       pfn = memblock_next_valid_pfn(pfn, end_pfn);
>>>>> +                       if (pfn >= end_pfn)
>>>>> +                               break;
>>>>> +                       pfn &= ~(pageblock_nr_pages - 1);
>>>>> +#else
>>>>>                         continue;
>>>>> +#endif
>>>>>                 }
>>>>> -               if (!early_pfn_in_nid(pfn, nid))
>>>>> -                       continue;
>>>>>                 if (!update_defer_init(pgdat, pfn, end_pfn, &nr_initialised))
>>>>>                         break;
>>>>>
>>>>>  #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>>>>>                 /*
>>>>> ---------8<-----------
>>>>>
>>>>> This ensures that we enter the remainder of the loop with a properly
>>>>> aligned pfn, rather than tweaking the value of pfn so it assumes the
>>>>> expected value after 'pfn++'.
>>>>
>>>> Um, this does not actually solve the issue. I guess this is due to the
>>>> fact that a single pageblock size chunk could have both valid and
>>>> invalid PFNs, and so rounding down the first PFN of the second valid
>>>> chunk moves you back to the first chunk.
>>>
>>> OK, so the original patch attempted to ensure that of each pageblock,
>>> at least the first struct page gets initialized, even though the PFN
>>> may not be valid. Unfortunately, this code is not complete, given that
>>> start_pfn itself may be misaligned, and so the issue it attempts to
>>> solve may still occur.
>>
>> You're wrong here.
>>
>
> You only align down after encountering an invalid PFN. If start_pfn
> itself is not pageblock aligned, how do you initialize the first
> struct page of the pageblock?
>
>>> Then, I think it is absolutely dodgy to settle for only initializing
>>> the first struct page, rather than all of them, only because a
>>> specific VM_BUG_ON() references the flag field of the first struct
>>> page.
>>> IMO, we should fix this by initializing all struct page entries for
>>> each pageblock sized chunk that has any valid PFNs.
>>
>> That's precisely what my patch does. At least with
>> CONFIG_HAVE_ARCH_PFN_VALID disabled. And it looks like only arm
>> implements arch pfn_valid(), which I was not testing with, and I am
>> not sure it's correct. Check my other email.
>>
>
> No, your patch only initializes the first struct page of a pageblock.
> If the next one is invalid, we will skip to the next valid one.

I believe you're pretty puzzled here.

> You are making the assumption that pfn_valid() will return true for
> all pages in a pageblock if it returns true for one of them, and this
> does not hold on other architectures.

It does. At least the generic version defined in include/linux/mmzone.h.
And this seems to be used by all arches but arm with
CONFIG_HAVE_ARCH_PFN_VALID. With that config disabled I guess even arm
behaves the same. Though I could be wrong.

--nX
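
For context: the generic pfn_valid() referenced above decides validity at
memory-section granularity, and a pageblock never spans a section boundary,
which is why one valid pfn implies the whole surrounding pageblock is valid
(as long as a pageblock is no larger than a section). Below is a minimal
userspace sketch of that property, not the kernel implementation; the
constants and the section_present[] map are made up for illustration.

---------8<-----------
/*
 * Minimal userspace sketch (not the kernel's code): a SPARSEMEM-style
 * pfn_valid() only looks at the memory section containing the pfn, and
 * a pageblock never straddles a section, so one valid pfn implies the
 * rest of its pageblock is valid too.  Constants are illustrative.
 */
#include <stdbool.h>
#include <stdio.h>

#define PAGES_PER_SECTION	(1UL << 15)	/* e.g. 128 MiB sections, 4 KiB pages */
#define PAGEBLOCK_NR_PAGES	(1UL << 9)	/* e.g. 2 MiB pageblocks */
#define NR_SECTIONS		16

/* Which sections have memory; section 2 is a hole in this example. */
static bool section_present[NR_SECTIONS] = {
	[0] = true, [1] = true, [3] = true,
};

static bool pfn_valid(unsigned long pfn)
{
	unsigned long section = pfn / PAGES_PER_SECTION;

	return section < NR_SECTIONS && section_present[section];
}

int main(void)
{
	/* Pick any pfn and check that its whole pageblock agrees with it. */
	unsigned long pfn = 3 * PAGES_PER_SECTION + 5;
	unsigned long start = pfn & ~(PAGEBLOCK_NR_PAGES - 1);
	bool first = pfn_valid(start);
	unsigned long p;

	for (p = start; p < start + PAGEBLOCK_NR_PAGES; p++)
		if (pfn_valid(p) != first)
			printf("mismatch at pfn %lu\n", p);

	printf("pageblock at pfn %lu is uniformly %s\n",
	       start, first ? "valid" : "invalid");
	return 0;
}
---------8<-----------

An arch-specific pfn_valid() (as arm provides with CONFIG_HAVE_ARCH_PFN_VALID)
may check at a finer granularity than a section, which is exactly the case the
thread is debating.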