Date: Fri, 18 May 2018 20:14:55 +0800
From: Baoquan He <bhe@redhat.com>
To: Ingo Molnar
Cc: linux-kernel@vger.kernel.org, lcapitulino@redhat.com, keescook@chromium.org,
	tglx@linutronix.de, x86@kernel.org, hpa@zytor.com, fanc.fnst@cn.fujitsu.com,
	yasu.isimatu@gmail.com, indou.takao@jp.fujitsu.com, douly.fnst@cn.fujitsu.com
Subject: Re: [PATCH 0/2] x86/boot/KASLR: Skip specified number of 1GB huge pages when do physical randomization
Message-ID: <20180518121455.GT24627@MiWiFi-R3L-srv>
References: <20180516100532.14083-1-bhe@redhat.com>
	<20180518070046.GA18660@gmail.com>
	<20180518074359.GR24627@MiWiFi-R3L-srv>
	<20180518081919.GB11379@gmail.com>
	<20180518112836.GS24627@MiWiFi-R3L-srv>
In-Reply-To: <20180518112836.GS24627@MiWiFi-R3L-srv>

On 05/18/18 at 07:28pm, Baoquan He wrote:
> On 05/18/18 at 10:19am, Ingo Molnar wrote:
> > 
> > * Baoquan He wrote:
> > 
> > > OK, I realize my statement above was misleading because I didn't
> > > explain the background clearly. Let me add it:
> > > 
> > > Previously, FJ reported the movable_node issue: KASLR may put the
> > > kernel into a movable node, and those movable nodes then can't be
> > > hot-plugged any more. So we finally planned to solve it by adding a
> > > new kernel parameter:
> > > 
> > > 	kaslr_boot_mem=nn[KMG]@ss[KMG]
> > > 
> > > We want customers to specify the memory regions which KASLR is
> > > allowed to randomize the kernel into.
> > 
> > *WHY* should the "customer" care?
> > 
> > This is a _bug_: movable, hotpluggable zones of physical memory should not be
> > randomized into.
> 
> Yes, for movable zones, agreed.
> 
> But for huge pages, it's only related to the memory layout.
> 
> > 
> > > [...] Outside of the specified regions, we need to avoid putting the
> > > kernel into those regions even though they are also available RAM.
> > > As for the movable_node issue, we can add the immovable regions via
> > > kaslr_boot_mem=nn[KMG]@ss[KMG].
> > > 
> > > During the review of this hotplug issue, Luiz's team reported this
> > > 1GB hugepages regression bug. I reproduced the bug and found out the
> > > root cause, then realized that I could utilize the
> > > kaslr_boot_mem=nn[KMG]@ss[KMG] parameter to fix it too. E.g. on a KVM
> > > guest with 4GB RAM, we have a good 1GB huge page; we can then add
> > > "kaslr_boot_mem=1G@0, kaslr_boot_mem=3G@2G" to the kernel command
> > > line, so that the good 1GB region [1G, 2G) won't be taken into
> > > account for kernel physical randomization.
> > > 
> > > Later, you pointed out that the 'kaslr_boot_mem=' way requires the
> > > user to specify memory regions manually, which is not good, and
> > > suggested solving both problems by gathering the information in the
> > > KASLR boot code itself. So they are two issues now: for the
> > > movable_node issue, we need to get the hotplug information from the
> > > SRAT table and then avoid those regions; for this 1GB hugepage issue,
> > > we need to get the information from the kernel command line, then
> > > avoid them.
> > > 
> > > This patch is for the hugepage issue only. Since FJ reported the
> > > hotplug issue and they have assigned engineers to work on it, I would
> > > like to wait for them to post patches according to your suggestion.
> > 
> > All of this is handling it the wrong way about. This is *not* primarily about
> > KASLR at all, and the user should not be required to specify some weird KASLR
> > parameters.
> > 
> > This is a basic _memory map enumeration_ problem in both cases:
> > 
> >  - in the hotplug case KASLR doesn't know that it's a movable zone and
> >    relocates into it,
> 
> Yes, the boot-stage KASLR code hasn't parsed the ACPI tables yet, so it
> has no hotplug information. If we decide to read the SRAT table, we can
> tell whether a memory region is hotpluggable and then avoid it. This
> would be consistent with what the kernel proper does later.
> 
> > 
> >  - and in the KVM case KASLR doesn't know that it's a valuable 1GB page
> >    that shouldn't be broken up.
> > 
> > Note that it's not KASLR specific: if we had some other kernel feature that
> > tried to allocate a piece of memory from what appears to be perfectly usable
> > generic RAM we'd have the same problems!
> 
> Hmm, this may not be the situation for 1GB huge pages. For 1GB huge
> pages, the bug is that on a KVM guest with 4GB of RAM, when the user
> adds 'default_hugepagesz=1G hugepagesz=1G hugepages=1' to the kernel
> command line and 'nokaslr' is specified, the 1GB huge page is allocated
> successfully. If 'nokaslr' is removed, namely KASLR is enabled, the 1GB
> huge page allocation fails.
> 
> In hugetlb_nrpages_setup(), you can see that the current huge page code
> relies on memblock to get 1GB huge pages. Below is the e820 memory map
> from Luiz's bug report. In fact there are two good 1GB huge pages: one
> is [0x40000000, 0x7fffffff], the 2nd one is [0x100000000, 0x13fffffff].
> By default memblock will allocate top-down if movable_node is not set,
> then [0x100000000, 0x13fffffff] will be broken when system
> initialization goes into the hugetlb_nrpages_setup() invocation.

void __init setup_arch(char **cmdline_p)
{
	...
#ifdef CONFIG_MEMORY_HOTPLUG
	if (movable_node_is_enabled())
		memblock_set_bottom_up(true);
#endif
	...
}

> So normally the huge page code can only get one good 1GB huge page,
> whether KASLR is enabled or not. This is not a bug, but is decided by
> the current huge page implementation. In this case, the KASLR boot code
> can see two good 1GB huge pages and try to avoid them. Besides, whether
> a region is a good 1GB huge page is not defined in the memory map and
> is not an attribute of the memory; it's decided only by the memory
> layout and by the memory usage situation in the running system. If we
> want to keep all good 1GB huge pages untouched, we may need to adjust
> the current memblock allocation code to avoid any possibility of
> stepping into a good 1GB huge page before the huge page allocation.
However, that falls into the area of improving the huge page
implementation; it is not related to KASLR.

> [ +0.000000] e820: BIOS-provided physical RAM map:
> [ +0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> [ +0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> [ +0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> [ +0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
> [ +0.000000] BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
> [ +0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> [ +0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> [ +0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000013fffffff] usable
> 
> Furthermore, on bare metal with large memory, e.g. with 100GB of
> memory, if the user specifies 'default_hugepagesz=1G hugepagesz=1G
> hugepages=2' to expect only two 1GB huge pages to be reserved, keeping
> all those tens of good 1GB huge pages untouched seems to be overkill.
> 
> I'm not sure if I understand your point correctly; this is my thought
> about the huge page issue. Please help to point out anything wrong.
> 
> Thanks
> Baoquan
> 
> > We need to fix the real root problem, which is the lack of knowledge about
> > crucial attributes of physical memory. Once that knowledge is properly
> > represented at this early boot stage both KASLR and other memory allocators
> > can make use of it to avoid those regions.
> > 
> > Thanks,
> > 
> > 	Ingo