Date: Wed, 25 Nov 2020 16:38:16 -0500
From: Andrea Arcangeli
To: Mike Rapoport
Cc: David Hildenbrand, Vlastimil Babka, Mel Gorman, Andrew Morton,
    linux-mm@kvack.org, Qian Cai, Michal Hocko,
    linux-kernel@vger.kernel.org, Baoquan He
Subject: Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
References: <35F8AADA-6CAA-4BD6-A4CF-6F29B3F402A4@redhat.com> <20201125210414.GO123287@linux.ibm.com>
In-Reply-To: <20201125210414.GO123287@linux.ibm.com>

On Wed, Nov 25, 2020 at 11:04:14PM +0200, Mike Rapoport wrote:
> I think the very root cause is how e820__memblock_setup() registers
> memory with memblock:
>
> 		if (entry->type == E820_TYPE_SOFT_RESERVED)
> 			memblock_reserve(entry->addr, entry->size);
>
> 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> 			continue;
>
> 		memblock_add(entry->addr, entry->size);
>
> From that point the system has inconsistent view of RAM in both
> memblock.memory and memblock.reserved and, which is then translated to
> memmap etc.
>
> Unfortunately, simply adding all RAM to memblock is not possible as
> there are systems that for them "the addresses listed in the reserved
> range must never be accessed, or (as we discovered) even be reachable by
> an active page table entry" [1].
>
> [1] https://lore.kernel.org/lkml/20200528151510.GA6154@raspberrypi/

It looks like what's missing is a memblock_reserve call, which I don't
think would interfere at all with the issue above, since it won't
create a direct mapping and it'll simply invoke the second stage that
wasn't invoked here.

I guess this would have a better chance of getting the second
initialization stage to run in reserve_bootmem_region, and it would
likely solve the problem without breaking E820_TYPE_RESERVED, which is
known by the kernel:

> 		if (entry->type == E820_TYPE_SOFT_RESERVED)
> 			memblock_reserve(entry->addr, entry->size);
>

+		if (entry->type == 20)
+			memblock_reserve(entry->addr, entry->size);

> 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> 			continue;
>

This is however just to show the problem, I didn't check what type 20
is.

To me it doesn't look like the root cause though. The root cause is
that if you don't call memblock_reserve, page->flags remains
uninitialized.

I think page_alloc.c needs to be more robust and detect at least
whether holes within zones (but ideally every pfn_valid struct page in
the system, even beyond the end of the zone) aren't being initialized
in the second stage, without relying on the arch code to remember to
call memblock_reserve.

In fact it's not clear why memblock_reserve even exists: that
information can be calculated reliably by page_alloc as a function of
memblock.memory alone, by walking all nodes and all zones. It doesn't
even seem to help in destroying the direct mapping:
reserve_bootmem_region just initializes the struct pages, so it doesn't
need a special memblock.reserved to find those ranges.
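
A rough userspace sketch of that last idea, not kernel code: given only
a sorted list of memory ranges (a stand-in for memblock.memory) and a
zone span, the pfns that fall in holes can be found without any
separate reserved list. The names struct pfn_range and count_hole_pfns
are made up for illustration, and the sketch assumes the ranges are
sorted and non-overlapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one memblock.memory region, in pfns. */
struct pfn_range {
	unsigned long start;	/* first pfn in the range */
	unsigned long end;	/* one past the last pfn */
};

/*
 * Count the pfns in [zone_start, zone_end) that no memory range covers.
 * 'mem' is assumed sorted by start and non-overlapping.
 */
unsigned long count_hole_pfns(const struct pfn_range *mem, size_t n,
			      unsigned long zone_start,
			      unsigned long zone_end)
{
	unsigned long holes = 0;
	unsigned long cursor = zone_start;	/* next pfn not yet accounted for */

	for (size_t i = 0; i < n && cursor < zone_end; i++) {
		assert(mem[i].start <= mem[i].end);
		if (mem[i].end <= cursor)
			continue;		/* range is entirely below the cursor */
		if (mem[i].start >= zone_end)
			break;			/* range is entirely above the zone */
		if (mem[i].start > cursor)
			holes += mem[i].start - cursor;	/* gap before this range */
		cursor = mem[i].end;
	}
	if (cursor < zone_end)
		holes += zone_end - cursor;	/* gap between last range and zone end */
	return holes;
}
```

In the kernel the equivalent walk would have to initialize the struct
pages found in such gaps rather than just count them; the sketch only
shows that the gaps are derivable from the memory ranges alone.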

In fact it's scary that code then does stuff like this, trusting that
memblock_reserve holds nearly complete information (which it obviously
doesn't, given type 20 doesn't get queued and I got that type 20 in all
my systems):

	for_each_reserved_mem_region(i, &start, &end) {
		if (addr >= start && addr_end <= end)
			return true;
	}

That code in irq-gic-v3-its.c should stop using
for_each_reserved_mem_region and start checking
pfn_valid(addr >> PAGE_SHIFT) and, if that holds,
PageReserved(pfn_to_page(addr >> PAGE_SHIFT)) instead.

At best memblock.reserved should be calculated automatically by
page_alloc.c based on the zone_start_pfn/zone_end_pfn and not passed in
by the e820 caller. Instead of adding the memblock_reserve call for
type 20, we should delete the memblock_reserve function.

Thanks,
Andrea