Date: Fri, 18 May 2018 20:14:55 +0800
From: Baoquan He <bhe@redhat.com>
To: Ingo Molnar
Cc: linux-kernel@vger.kernel.org, lcapitulino@redhat.com, keescook@chromium.org,
	tglx@linutronix.de, x86@kernel.org, hpa@zytor.com, fanc.fnst@cn.fujitsu.com,
	yasu.isimatu@gmail.com, indou.takao@jp.fujitsu.com, douly.fnst@cn.fujitsu.com
Subject: Re: [PATCH 0/2] x86/boot/KASLR: Skip specified number of 1GB huge pages when do physical randomization
Message-ID: <20180518121455.GT24627@MiWiFi-R3L-srv>
References: <20180516100532.14083-1-bhe@redhat.com>
	<20180518070046.GA18660@gmail.com>
	<20180518074359.GR24627@MiWiFi-R3L-srv>
	<20180518081919.GB11379@gmail.com>
	<20180518112836.GS24627@MiWiFi-R3L-srv>
In-Reply-To: <20180518112836.GS24627@MiWiFi-R3L-srv>

On 05/18/18 at 07:28pm, Baoquan He wrote:
> On 05/18/18 at 10:19am, Ingo Molnar wrote:
> > 
> > * Baoquan He wrote:
> > 
> > > OK, I realize my statement above was misleading because I didn't
> > > explain the background clearly. Let me add it:
> > > 
> > > Previously, FJ reported the movable_node issue: KASLR may put the
> > > kernel into a movable node, and those movable nodes then can't be
> > > hot-plugged any more. So we finally planned to solve it by adding a
> > > new kernel parameter:
> > > 
> > > 	kaslr_boot_mem=nn[KMG]@ss[KMG]
> > > 
> > > We want customers to specify the memory regions which KASLR is
> > > allowed to randomize the kernel into.
> > 
> > *WHY* should the "customer" care?
> > 
> > This is a _bug_: movable, hotpluggable zones of physical memory should not be
> > randomized into.
> 
> Yes, for movable zones, agreed.
> 
> But for huge pages, it's only related to the memory layout.
> 
> > 
> > > [...] Outside of the specified regions, we need to avoid putting the
> > > kernel into those regions even though they are also available RAM.
> > > As for the movable_node issue, we can add the immovable regions via
> > > kaslr_boot_mem=nn[KMG]@ss[KMG].
> > > 
> > > During the review of this hotplug issue, Luiz's team reported this
> > > 1GB hugepages regression bug. I reproduced the bug and found out the
> > > root cause, then realized that I could utilize the
> > > kaslr_boot_mem=nn[KMG]@ss[KMG] parameter to fix it too. E.g. on a KVM
> > > guest with 4GB RAM, we have a good 1GB huge page; we can then add
> > > "kaslr_boot_mem=1G@0, kaslr_boot_mem=3G@2G" to the kernel command
> > > line, so that the good 1GB region [1G, 2G) won't be taken into
> > > account for kernel physical randomization.
> > > 
> > > Later, you pointed out that the 'kaslr_boot_mem=' way requires the
> > > user to specify memory regions manually, which is not good, and
> > > suggested solving both problems by gathering the information in the
> > > KASLR boot code itself. So they are two issues now: for the
> > > movable_node issue, we need to get the hotplug information from the
> > > SRAT table and then avoid those regions; for this 1GB hugepage issue,
> > > we need to get the information from the kernel command line, then
> > > avoid them.
> > > 
> > > This patch is for the hugepage issue only. Since FJ reported the
> > > hotplug issue and they have assigned engineers to work on it, I would
> > > like to wait for them to post patches according to your suggestion.
> > 
> > All of this is handling it the wrong way about. This is *not* primarily about
> > KASLR at all, and the user should not be required to specify some weird KASLR
> > parameters.
> > 
> > This is a basic _memory map enumeration_ problem in both cases:
> > 
> >  - in the hotplug case KASLR doesn't know that it's a movable zone and
> >    relocates into it,
> 
> Yes, the boot-stage KASLR code hasn't parsed the ACPI tables yet, so it
> has no hotplug information. If we decide to read the SRAT table, we can
> tell whether a memory region is hotpluggable and then avoid it. This
> would be consistent with what the kernel proper does later.
> 
> > 
> >  - and in the KVM case KASLR doesn't know that it's a valuable 1GB page
> >    that shouldn't be broken up.
> > 
> > Note that it's not KASLR specific: if we had some other kernel feature that
> > tried to allocate a piece of memory from what appears to be perfectly usable
> > generic RAM we'd have the same problems!
> 
> Hmm, this may not be the situation for 1GB huge pages. For 1GB huge
> pages, the bug is that on a KVM guest with 4GB of RAM, when the user
> adds 'default_hugepagesz=1G hugepagesz=1G hugepages=1' to the kernel
> command line and 'nokaslr' is specified, the 1GB huge page is allocated
> successfully. If 'nokaslr' is removed, namely KASLR is enabled, the 1GB
> huge page allocation fails.
> 
> In hugetlb_nrpages_setup(), you can see that the current huge page code
> relies on memblock to get 1GB huge pages. Below is the e820 memory map
> from Luiz's bug report. In fact there are two good 1GB huge pages: one
> is [0x40000000, 0x7fffffff], the 2nd one is [0x100000000, 0x13fffffff].
> By default memblock will allocate top-down if movable_node is not set,
> then [0x100000000, 0x13fffffff] will be broken when system
> initialization goes into the hugetlb_nrpages_setup() invocation.

void __init setup_arch(char **cmdline_p)
{
	...
#ifdef CONFIG_MEMORY_HOTPLUG
	if (movable_node_is_enabled())
		memblock_set_bottom_up(true);
#endif
	...
}

> So normally the huge page code can only get one good 1GB huge page,
> whether KASLR is enabled or not. This is not a bug, but is decided by
> the current huge page implementation. In this case, the KASLR boot code
> can see two good 1GB huge pages and try to avoid them. Besides, whether
> a region is a good 1GB huge page is not defined in the memory map and
> is not an attribute of the memory; it's decided only by the memory
> layout and by the memory usage situation in the running system. If we
> want to keep all good 1GB huge pages untouched, we may need to adjust
> the current memblock allocation code to avoid any possibility of
> stepping into a good 1GB huge page before the huge page allocation.
However, that falls into the area of improving the huge page
implementation; it is not related to KASLR.

> [ +0.000000] e820: BIOS-provided physical RAM map:
> [ +0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> [ +0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> [ +0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> [ +0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
> [ +0.000000] BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
> [ +0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> [ +0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> [ +0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000013fffffff] usable
> 
> Furthermore, on bare metal with large memory, e.g. with 100GB of
> memory, if the user specifies 'default_hugepagesz=1G hugepagesz=1G
> hugepages=2' to expect only two 1GB huge pages to be reserved, keeping
> all those tens of good 1GB huge pages untouched seems to be overkill.
> 
> I'm not sure if I understand your point correctly; this is my thought
> about the huge page issue. Please help to point out anything wrong.
> 
> Thanks
> Baoquan
> 
> > We need to fix the real root problem, which is the lack of knowledge about
> > crucial attributes of physical memory. Once that knowledge is properly
> > represented at this early boot stage both KASLR and other memory allocators
> > can make use of it to avoid those regions.
> > 
> > Thanks,
> > 
> > 	Ingo