From: Daniel Vacek
Date: Thu, 15 Mar 2018 02:30:41 +0100
Subject: Re: [PATCH] mm/page_alloc: fix boot hang in memmap_init_zone
To: Michal Hocko, Naresh Kamboju, Sudeep Holla, Ard Biesheuvel
Cc: open list, linux-mm@kvack.org, Andrew Morton, Mel Gorman, Paul Burton,
 Pavel Tatashin, Vlastimil Babka, stable
In-Reply-To: <20180314141727.GE23100@dhcp22.suse.cz>
References: <20180313224240.25295-1-neelx@redhat.com> <20180314141727.GE23100@dhcp22.suse.cz>

On Wed, Mar 14, 2018 at 3:17 PM, Michal Hocko wrote:
> On Tue 13-03-18 23:42:40, Daniel Vacek wrote:
>> On some architectures (reported on arm64) commit 864b75f9d6b01
>> ("mm/page_alloc: fix memmap_init_zone pageblock alignment") causes a
>> boot hang. This patch fixes the hang making sure the alignment
>> never steps back.
>
> I am sorry to be complaining again, but the code is so obscure that I

No worries, I'm glad for any review. Which code exactly do you find
obscure? This patch, my former fix, or the original commit introducing
memblock_next_valid_pfn()? Because I'd agree the original commit looks
pretty obscure...

> would _really_ appreciate some more information about what is going
> on here. memblock_next_valid_pfn will most likely return a pfn within
> the same memblock and the alignment will move it before the old pfn
> which is not valid - so the block has some holes. Is that correct?

I do not understand what you mean by 'pfn within the same memblock' or
by 'the block has some holes'.

memblock has the types 'memory' (for usable memory) and 'reserved' (for
unusable memory), if I understand correctly. Both of them have an array
of regions (an abstraction of ranges, IIUC), and memblock is a global
symbol which contains all of this. So one could say all pfns are within
the same memblock. Did you perhaps mean the same memory region? A region
is contiguous by definition, I believe, so there should be no holes
within a single region anyway. Or am I wrong here?

Now, memblock_next_valid_pfn(pfn) is called only conditionally, when the
old pfn is already 'invalid' in the first place. The meaning of 'invalid'
here is arch/config dependent. With my former fix, which was based on the
x86_64 config, I was working under the assumption that 'invalid' means
the mem section does not have a memmap:

  early_pfn_valid(pfn)
    pfn_valid(pfn)
      valid_section(__nr_to_section(pfn_to_section_nr(pfn)))
        return (section && (section->section_mem_map & SECTION_HAS_MEM_MAP))

From this I implied that when memblock_next_valid_pfn(pfn) is called, the
pfn is the first one in a section (i.e. section aligned) and the whole
section is going to be skipped, so the returned pfn will be at least one
full section away from the old pfn. That is also exactly what I can see
in the memory dumps from the bug reports which resulted in my former fix.

Note that with next_pfn being at least a full section away from the old
pfn, there is no need to check whether it steps back or not, even when
pageblock aligned. That's why I did not include the hunk from this patch
in my former fix.

Now I have no idea why the above does not hold true for arm64 or what
config was used there.
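To make the step-back scenario concrete, here is a minimal userspace
sketch. The numbers are invented (not taken from any report) and it
assumes 4K pages with pageblock_order 9, i.e. pageblock_nr_pages == 512;
next_valid_pfn() below is just a stand-in for what
memblock_next_valid_pfn() could return on an arch where pfn_valid() is
memblock based:

/*
 * Hypothetical numbers: the current pfn 0x2080 is invalid, and the next
 * valid pfn reported by memblock is 0x2100, i.e. still within the same
 * pageblock (0x2000-0x21ff).
 */
#include <stdio.h>

#define PAGEBLOCK_NR_PAGES 0x200UL	/* assumed: 4K pages, pageblock_order 9 */

/* stand-in for memblock_next_valid_pfn(pfn, end_pfn) in this scenario */
static unsigned long next_valid_pfn(unsigned long pfn)
{
	return 0x2100UL;
}

int main(void)
{
	unsigned long pfn = 0x2080UL;
	unsigned long next_pfn = (next_valid_pfn(pfn) &
			~(PAGEBLOCK_NR_PAGES - 1)) - 1;

	/* prints "pfn 0x2080 -> next pfn 0x1fff": the aligned "next" pfn
	 * went backwards. After continue and pfn++ the loop retries
	 * 0x2000, which is invalid again, and it spins forever. With the
	 * "if (next_pfn > pfn)" guard, pfn simply stays at 0x2080 and the
	 * loop keeps making forward progress. */
	printf("pfn %#lx -> next pfn %#lx\n", pfn, next_pfn);
	return 0;
}

Of course the real numbers on the affected boards will differ, but the
mechanism should be the same.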
I did not have a chance nor the time to get my hands on any arm hardware
where it broke. In the same way, I did not test any architecture but
x86_64, where I had the original reports of crashes, before applying my
former fix. Also, I am not deeply experienced with mm details and
internals and how everything works on every architecture, even less so
when we speak about early boot init. I mostly only considered the
algorithm abstractions.

I bisected and reviewed the patch which regressed x86_64. It clearly
skips some pfns based on memblock memory ranges (if available) and the
skipped pages are not initialized, causing the crash. This is clearly
visible in the memory dumps I got from the reports and hopefully I
already explained it clearly. I fixed this by applying the alignment to
keep at least the bare minimum of pages initialized (so skipping a bit
less than up to the beginning of the next usable range in case that
range is not section/memblock/pageblock/whatever aligned - with the
pageblock alignment being the significant one here).

Since the next pfn will always be from a different section on x86_64 (at
least with the configurations I checked and with my limited knowledge of
this stuff), the hunk from this patch was not originally included as it
seemed redundant to me at the time. It was only theoretically possible
for the algorithm to start looping if the next pfn fell below the
original one, but that seemed impossible to happen.

Since this is generic code and not arch specific, I expected other
arches to behave in a similar manner. Though it seems arm has its own
view of pfn_valid(), based on memblock instead of mem sections:

neelx@metal:~/nX/src/linux$ global -g CONFIG_HAVE_ARCH_PFN_VALID
arch/arm/include/asm/page.h
arch/arm/mm/init.c
arch/arm64/include/asm/page.h
arch/arm64/mm/init.c
include/linux/mmzone.h
neelx@metal:~/nX/src/linux$ sed '287,293!d' arch/arm64/mm/init.c
#ifdef CONFIG_HAVE_ARCH_PFN_VALID
int pfn_valid(unsigned long pfn)
{
	return memblock_is_map_memory(pfn << PAGE_SHIFT);
}
EXPORT_SYMBOL(pfn_valid);
#endif
neelx@metal:~/nX/src/linux$

ARM seems to be a bit unique here, and that's what I missed and why my
former fix broke arm. I was simply not expecting this. And that's why
this patch is needed (exclusively for arm, it seems).

Now we can start discussing whether pfn_valid() based on mem sections is
fundamentally broken, or whether the arm pfn_valid() based on memblock
(when CONFIG_HAVE_ARCH_PFN_VALID is enabled) is the broken one.
Personally, I do not like the double meaning of it.

Ard, Naresh, Sudeep: is CONFIG_HAVE_ARCH_PFN_VALID actually enabled for
your builds?

--nX

> If yes then please put it into the changelog. Maybe reuse data provided
> by Arnd http://lkml.kernel.org/r/20180314134431.13241-1-ard.biesheuvel@linaro.org
>
>> Link: http://lkml.kernel.org/r/0485727b2e82da7efbce5f6ba42524b429d0391a.1520011945.git.neelx@redhat.com
>> Fixes: 864b75f9d6b01 ("mm/page_alloc: fix memmap_init_zone pageblock alignment")
>> Signed-off-by: Daniel Vacek
>> Tested-by: Sudeep Holla
>> Tested-by: Naresh Kamboju
>> Cc: Andrew Morton
>> Cc: Mel Gorman
>> Cc: Michal Hocko
>> Cc: Paul Burton
>> Cc: Pavel Tatashin
>> Cc: Vlastimil Babka
>> Cc:
>> ---
>>  mm/page_alloc.c | 7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3d974cb2a1a1..e033a6895c6f 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5364,9 +5364,14 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>  			 * is not. move_freepages_block() can shift ahead of
>>  			 * the valid region but still depends on correct page
>>  			 * metadata.
>> +			 * Also make sure we never step back.
>>  			 */
>> -			pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
>> +			unsigned long next_pfn;
>> +
>> +			next_pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
>>  					~(pageblock_nr_pages-1)) - 1;
>> +			if (next_pfn > pfn)
>> +				pfn = next_pfn;
>>  #endif
>>  			continue;
>>  		}
>> --
>> 2.16.2
>>
>
> --
> Michal Hocko
> SUSE Labs