2018-06-05 00:56:46

by Naoya Horiguchi

Subject: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM

Hi everyone,

I detected a kernel panic in mmotm on 5/25.

[ 33.316734] BUG: unable to handle kernel paging request at fffffffffffffffe
[ 33.318157] PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
[ 33.319201] Oops: 0000 [#1] SMP PTI
[ 33.319876] CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
[ 33.321952] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
[ 33.323626] RIP: 0010:stable_page_flags+0x27/0x3c0
[ 33.324550] Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
[ 33.328149] RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
[ 33.329152] RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
[ 33.330509] RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
[ 33.331864] RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
[ 33.333223] R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
[ 33.334575] R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
[ 33.335941] FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
[ 33.337478] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 33.338580] CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
[ 33.339946] Call Trace:
[ 33.340433] kpageflags_read+0xc7/0x120
[ 33.341182] proc_reg_read+0x3c/0x60
[ 33.341874] __vfs_read+0x36/0x170
[ 33.342540] vfs_read+0x89/0x130
[ 33.343172] ksys_pread64+0x71/0x90
[ 33.343851] do_syscall_64+0x5b/0x160
[ 33.344567] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 33.345538] RIP: 0033:0x7efc42e75e23
[ 33.346235] Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
[ 33.349813] RSP: 002b:00007ffc33c35208 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
[ 33.351255] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007efc42e75e23
[ 33.352628] RDX: 0000000000080000 RSI: 00007ffc33cb52b0 RDI: 0000000000000003
[ 33.353994] RBP: 00007ffc33c35250 R08: 0000000000010000 R09: 0000000000000000
[ 33.355352] R10: 0000000000580000 R11: 0000000000000246 R12: 0000000000401560
[ 33.356710] R13: 00007ffc33d35410 R14: 0000000000000000 R15: 0000000000000000
[ 33.358072] Modules linked in: fuse btrfs xor zstd_compress raid6_pq zstd_decompress xxhash ufs hfsplus hfs minix vfat msdos fat jfs ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc dax_pmem device_dax joydev virtio_balloon pcspkr i2c_piix4 xfs libcrc32c virtio_blk 8139too crc32c_intel floppy serio_raw virtio_pci virtio_ring virtio 8139cp mii ata_generic pata_acpi qemu_fw_cfg
[ 33.370035] CR2: fffffffffffffffe
[ 33.370680] ---[ end trace d6df609ab41af1c5 ]---

The reproduction procedure is like this:
- enable RAM-based PMEM (with a kernel boot parameter like memmap=1G!4G)
- read /proc/kpageflags (or call tools/vm/page-types with no arguments; see the sketch below)
(- my kernel config is attached)
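
Each entry in /proc/kpageflags is one 64-bit flags word per pfn, read at byte
offset pfn * 8. A minimal reader, a rough sketch of what page-types does
internally (not its actual code), looks like:

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		uint64_t flags;
		unsigned long pfn = 0xbffd7;	/* first pfn of the 'reserved' range below */
		int fd = open("/proc/kpageflags", O_RDONLY);

		if (fd < 0)
			return 1;
		/* one u64 of flags per pfn, at byte offset pfn * 8 */
		if (pread(fd, &flags, sizeof(flags), pfn * sizeof(flags)) == sizeof(flags))
			printf("pfn 0x%lx flags 0x%llx\n", pfn, (unsigned long long)flags);
		close(fd);
		return 0;
	}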

I spent a few days on this but didn't reach a solution, so let me report it
with some details below ...

In the critical page request, stable_page_flags() is called with an argument
page whose ->compound_head was somehow filled with '0xffffffffffffffff'.
compound_head() then returns (struct page *)(head - 1), which explains the
address 0xfffffffffffffffe in the above message.
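
For reference, compound_head() (as in include/linux/page-flags.h around this
version) treats bit 0 of ->compound_head as the tail-page marker, so an
all-ones value is misread as a tail-page link:

	static inline struct page *compound_head(struct page *page)
	{
		unsigned long head = READ_ONCE(page->compound_head);

		/* bit 0 set means "tail page"; head - 1 is the head page */
		if (unlikely(head & 1))
			return (struct page *)(head - 1);
		return page;
	}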

It seems that this kernel panic happens when reading kpageflags of pfn range
[0xbffd7, 0xc0000), which corresponds to a 'reserved' range.

[ 0.000000] user-defined physical RAM map:
[ 0.000000] user: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] user: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000] user: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] user: [mem 0x0000000000100000-0x00000000bffd6fff] usable
[ 0.000000] user: [mem 0x00000000bffd7000-0x00000000bfffffff] reserved
[ 0.000000] user: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[ 0.000000] user: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000] user: [mem 0x0000000100000000-0x000000013fffffff] persistent (type 12)

So I guess the 'memmap=' parameter might badly affect the memory initialization process.

This problem doesn't reproduce on v4.17, so some not-yet-released patch introduces it.
I hope this info helps you find the solution/workaround.

Thanks,
Naoya Horiguchi


Attachments:
config.gz (39.38 kB)

2018-06-05 01:19:23

by Matthew Wilcox

Subject: Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM

On Tue, Jun 05, 2018 at 12:54:03AM +0000, Naoya Horiguchi wrote:
> The reproduction procedure is like this:
> [...]
>
> In the critical page request, stable_page_flags() is called with an argument
> page whose ->compound_head was somehow filled with '0xffffffffffffffff'.
> compound_head() then returns (struct page *)(head - 1), which explains the
> address 0xfffffffffffffffe in the above message.

Hm. compound_head shares with:

	struct list_head lru;
	struct list_head slab_list;		/* uses lru */
	struct {				/* Partial pages */
		struct page *next;
	unsigned long _compound_pad_1;		/* compound_head */
	unsigned long _pt_pad_1;		/* compound_head */
	struct dev_pagemap *pgmap;
	struct rcu_head rcu_head;

None of them should be -1.

> It seems that this kernel panic happens when reading kpageflags of pfn range
> [0xbffd7, 0xc0000), which corresponds to a 'reserved' range.
>
> [...]
>
> So I guess the 'memmap=' parameter might badly affect the memory initialization process.
>
> This problem doesn't reproduce on v4.17, so some not-yet-released patch introduces it.
> I hope this info helps you find the solution/workaround.

Can you try bisecting this? It could be one of my patches to reorder struct
page, or it could be one of Pavel's deferred page initialisation patches.
Or something else ;-)


2018-06-05 07:39:13

by Naoya Horiguchi

Subject: Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM

On Mon, Jun 04, 2018 at 06:18:36PM -0700, Matthew Wilcox wrote:
> On Tue, Jun 05, 2018 at 12:54:03AM +0000, Naoya Horiguchi wrote:
> [...]
>
> Can you try bisecting this? It could be one of my patches to reorder struct
> page, or it could be one of Pavel's deferred page initialisation patches.
> Or something else ;-)

Thank you for the comment. I'm trying to bisect now; I'll let you know the result later.

And I found that my statement "not reproduce on v4.17" was wrong (I used
different kvm guests, which created a different test condition and misled me);
this seems to be an older (at least < 4.15) bug.

Thanks,
Naoya Horiguchi

2018-06-06 05:18:21

by Naoya Horiguchi

Subject: Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM

On Tue, Jun 05, 2018 at 07:35:01AM +0000, Horiguchi Naoya(堀口 直也) wrote:
> On Mon, Jun 04, 2018 at 06:18:36PM -0700, Matthew Wilcox wrote:
> [...]
>
> Thank you for the comment. I'm trying to bisect now; I'll let you know the result later.
>
> And I found that my statement "not reproduce on v4.17" was wrong (I used
> different kvm guests, which created a different test condition and misled me);
> this seems to be an older (at least < 4.15) bug.

(Cc: Pavel)

Bisection showed that the following commit introduced this issue:

commit f7f99100d8d95dbcf09e0216a143211e79418b9f
Author: Pavel Tatashin <[email protected]>
Date: Wed Nov 15 17:36:44 2017 -0800

    mm: stop zeroing memory during allocation in vmemmap

This patch postpones struct page zeroing to a later stage of memory initialization.
My kernel config disables CONFIG_DEFERRED_STRUCT_PAGE_INIT, so two call sites of
__init_single_page() are never reached. In that case, could struct pages populated
by vmemmap_pte_populate() be left uninitialized?
And I'm not sure yet how this issue becomes visible with the memmap= setting.
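
To make that concrete, here is a simplified paraphrase (from memory, not a
verbatim quote) of __init_single_page() after that commit; the point is that
a struct page is only zeroed if this function actually runs on it:

	static void __meminit __init_single_page(struct page *page, unsigned long pfn,
						 unsigned long zone, int nid)
	{
		mm_zero_struct_page(page);	/* the postponed zeroing */
		set_page_links(page, zone, nid, pfn);
		init_page_count(page);
		page_mapcount_reset(page);

		INIT_LIST_HEAD(&page->lru);
		/* ... remaining per-page init elided ... */
	}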

Thanks,
Naoya Horiguchi

2018-06-06 08:05:09

by Oscar Salvador

Subject: Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM

On Wed, Jun 06, 2018 at 05:16:24AM +0000, Naoya Horiguchi wrote:
> [...]
>
> Bisection showed that the following commit introduced this issue:
>
> commit f7f99100d8d95dbcf09e0216a143211e79418b9f
> Author: Pavel Tatashin <[email protected]>
> Date: Wed Nov 15 17:36:44 2017 -0800
>
>     mm: stop zeroing memory during allocation in vmemmap
>
> This patch postpones struct page zeroing to a later stage of memory initialization.
> My kernel config disables CONFIG_DEFERRED_STRUCT_PAGE_INIT, so two call sites of
> __init_single_page() are never reached. In that case, could struct pages populated
> by vmemmap_pte_populate() be left uninitialized?
> And I'm not sure yet how this issue becomes visible with the memmap= setting.

I think that this becomes visible because memmap=x!y creates a persistent memory region:

parse_memmap_one()
{
	...
	} else if (*p == '!') {
		start_at = memparse(p+1, &p);
		e820__range_add(start_at, mem_size, E820_TYPE_PRAM);
	...
}

and this region is added neither to memblock.memory nor to memblock.reserved.
Ranges in memblock.memory get zeroed in memmap_init_zone(), while ranges in
memblock.reserved get zeroed via free_low_memory_core_early():

static unsigned long __init free_low_memory_core_early(void)
{
	...
	for_each_reserved_mem_region(i, &start, &end)
		reserve_bootmem_region(start, end);
	...
}
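
For context, reserve_bootmem_region() is what initializes the struct pages of
reserved ranges; a simplified paraphrase (not verbatim) of the mm/page_alloc.c
code:

	void __init reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
	{
		unsigned long start_pfn = PFN_DOWN(start);
		unsigned long end_pfn = PFN_UP(end);

		for (; start_pfn < end_pfn; start_pfn++) {
			if (pfn_valid(start_pfn)) {
				struct page *page = pfn_to_page(start_pfn);

				/*
				 * With CONFIG_DEFERRED_STRUCT_PAGE_INIT this
				 * calls __init_single_page(); otherwise it is
				 * a no-op.
				 */
				init_reserved_page(start_pfn);

				/* Avoid false-positive PageTail() */
				INIT_LIST_HEAD(&page->lru);

				SetPageReserved(page);
			}
		}
	}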


Maybe I am mistaken, but I think that persistent memory regions should be marked as reserved.
A comment in do_mark_busy() suggests this:

static bool __init do_mark_busy(enum e820_type type, struct resource *res)
{
	...
	/*
	 * Treat persistent memory like device memory, i.e. reserve it
	 * for exclusive use of a driver
	 */
	...
}


I wonder if something like this could work, and if so, whether it is right (I haven't tested it yet):

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 71c11ad5643e..3c9686ef74e5 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1247,6 +1247,11 @@ void __init e820__memblock_setup(void)
 		if (end != (resource_size_t)end)
 			continue;
 
+		if (entry->type == E820_TYPE_PRAM || entry->type == E820_TYPE_PMEM) {
+			memblock_reserve(entry->addr, entry->size);
+			continue;
+		}
+
 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
 			continue;

Best Regards
Oscar Salvador

2018-06-06 08:55:19

by Oscar Salvador

Subject: Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM

On Wed, Jun 06, 2018 at 10:04:08AM +0200, Oscar Salvador wrote:
> On Wed, Jun 06, 2018 at 05:16:24AM +0000, Naoya Horiguchi wrote:
> [...]
>
> I wonder if something like this could work, and if so, whether it is right (I haven't tested it yet):
>
> [...]

It does not seem to work, so the reasoning might be incorrect.

Best Regards
Oscar Salvador

2018-06-06 09:08:15

by Naoya Horiguchi

Subject: Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM

On Wed, Jun 06, 2018 at 10:53:19AM +0200, Oscar Salvador wrote:
> [...]
>
> It does not seem to work, so the reasoning might be incorrect.

Thank you for the comment.

One note is that the memory region with the "broken struct page" is a typical
reserved region, not a pmem region. Strangely, reading offset 0xbffd7 of
/proc/kpageflags is OK if the pmem region does not exist, but crashes if the
pmem region exists. Reading an offset like 0x100000 (on the pmem region) does
not cause the crash, so the pmem region itself seems properly set up.

[ 0.000000] user-defined physical RAM map:
[ 0.000000] user: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] user: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000] user: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] user: [mem 0x0000000000100000-0x00000000bffd6fff] usable
[ 0.000000] user: [mem 0x00000000bffd7000-0x00000000bfffffff] reserved ===> "broken struct page" region
[ 0.000000] user: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[ 0.000000] user: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000] user: [mem 0x0000000100000000-0x000000013fffffff] persistent (type 12) => pmem region
[ 0.000000] user: [mem 0x0000000140000000-0x000000023fffffff] usable

Thanks,
Naoya Horiguchi

2018-06-06 09:27:46

by Naoya Horiguchi

Subject: Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM

On Wed, Jun 06, 2018 at 09:06:30AM +0000, Horiguchi Naoya(堀口 直也) wrote:
> [...]
>
> One note is that the memory region with the "broken struct page" is a typical
> reserved region, not a pmem region. Strangely, reading offset 0xbffd7 of
> /proc/kpageflags is OK if the pmem region does not exist, but crashes if the
> pmem region exists. Reading an offset like 0x100000 (on the pmem region) does
> not cause the crash, so the pmem region itself seems properly set up.

I have another note:

> My kernel config disables CONFIG_DEFERRED_STRUCT_PAGE_INIT, so two call sites of
> __init_single_page() are never reached. In that case, could struct pages populated
> by vmemmap_pte_populate() be left uninitialized?

I quickly checked whether enabling CONFIG_DEFERRED_STRUCT_PAGE_INIT affects
the issue, and found that the kernel panic happens even with this config enabled.
So I'm still confused...

- Naoya

2018-06-07 06:35:09

by Naoya Horiguchi

Subject: Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM

On Wed, Jun 06, 2018 at 09:24:05AM +0000, Horiguchi Naoya(堀口 直也) wrote:
> [...]
>
> I quickly checked whether enabling CONFIG_DEFERRED_STRUCT_PAGE_INIT affects
> the issue, and found that the kernel panic happens even with this config enabled.
> So I'm still confused...

Let me share some new facts:

I accidentally gave an inconvenient memmap layout, 'memmap=1G!4G', on a
2-node NUMA guest with 8 GB of memory. I didn't intend this, but 4 GB happens
to be the start address of a memory block when no "memmap=" option is provided.

(messages from free_area_init_nodes() for the no-"memmap=" case)
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.000000] node 0: [mem 0x0000000000100000-0x00000000bffd6fff]
[ 0.000000] node 0: [mem 0x0000000100000000-0x000000013fffffff] // <---
[ 0.000000] node 1: [mem 0x0000000140000000-0x000000023fffffff]

When "memmap=1G!4G" is given, the range [0x0000000100000000-0x000000013fffffff]
disappears and kernel messages are like below:

(messages from free_area_init_nodes() for the "memmap=1G!4G" case)
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.000000] node 0: [mem 0x0000000000100000-0x00000000bffd6fff]
[ 0.000000] node 1: [mem 0x0000000140000000-0x000000023fffffff]

This makes the kernel think that the end pfn of node 0 is 0xbffd7
instead of 0x140000, which affects the memory initialization process.
memmap_init_zone() calls __init_single_page() for each page within a zone,
so if zone->spanned_pages is underestimated, some pages are left uninitialized.
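
Roughly, the loop in question looks like this (a simplified paraphrase of
memmap_init_zone() with the pfn_valid/deferred-init checks elided); a pfn
beyond the underestimated zone end never reaches __init_single_page():

	void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
					unsigned long start_pfn, enum memmap_context context,
					struct vmem_altmap *altmap)
	{
		unsigned long end_pfn = start_pfn + size;	/* size comes from zone->spanned_pages */
		unsigned long pfn;

		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
			/* ... pfn_valid / deferred-init checks elided ... */
			__init_single_page(pfn_to_page(pfn), pfn, zone, nid);
		}
	}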

If I provide 'memmap=1G!7G', the kernel panic does not reproduce, and the
kernel messages look like this:

(messages from free_area_init_nodes() for the "memmap=1G!7G" case)
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.000000] node 0: [mem 0x0000000000100000-0x00000000bffd6fff]
[ 0.000000] node 0: [mem 0x0000000100000000-0x000000013fffffff]
[ 0.000000] node 1: [mem 0x0000000140000000-0x00000001bfffffff]
[ 0.000000] node 1: [mem 0x0000000200000000-0x000000023fffffff]


I think that in order to fix this we need some conditions and/or prechecks
on the memblock layout. Does that make sense? Or is there a better approach?

Thanks,
Naoya Horiguchi

2018-06-07 07:00:39

by Oscar Salvador

Subject: Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM

On Thu, Jun 07, 2018 at 06:22:19AM +0000, Naoya Horiguchi wrote:
> On Wed, Jun 06, 2018 at 09:24:05AM +0000, Horiguchi Naoya(堀口 直也) wrote:
> > On Wed, Jun 06, 2018 at 09:06:30AM +0000, Horiguchi Naoya(堀口 直也) wrote:
> > > On Wed, Jun 06, 2018 at 10:53:19AM +0200, Oscar Salvador wrote:
> > > > On Wed, Jun 06, 2018 at 10:04:08AM +0200, Oscar Salvador wrote:
> > > > > On Wed, Jun 06, 2018 at 05:16:24AM +0000, Naoya Horiguchi wrote:
> > > > > > On Tue, Jun 05, 2018 at 07:35:01AM +0000, Horiguchi Naoya(堀口 直也) wrote:
> > > > > > > On Mon, Jun 04, 2018 at 06:18:36PM -0700, Matthew Wilcox wrote:
> > > > > > > > On Tue, Jun 05, 2018 at 12:54:03AM +0000, Naoya Horiguchi wrote:
> > > > > > > > > Reproduction precedure is like this:
> > > > > > > > > - enable RAM based PMEM (with a kernel boot parameter like memmap=1G!4G)
> > > > > > > > > - read /proc/kpageflags (or call tools/vm/page-types with no arguments)
> > > > > > > > > (- my kernel config is attached)
> > > > > > > > >
> > > > > > > > > I spent a few days on this, but didn't reach any solutions.
> > > > > > > > > So let me report this with some details below ...
> > > > > > > > >
> > > > > > > > > In the critial page request, stable_page_flags() is called with an argument
> > > > > > > > > page whose ->compound_head was somehow filled with '0xffffffffffffffff'.
> > > > > > > > > And compound_head() returns (struct page *)(head - 1), which explains the
> > > > > > > > > address 0xfffffffffffffffe in the above message.
> > > > > > > >
> > > > > > > > Hm. compound_head shares with:
> > > > > > > >
> > > > > > > > struct list_head lru;
> > > > > > > > struct list_head slab_list; /* uses lru */
> > > > > > > > struct { /* Partial pages */
> > > > > > > > struct page *next;
> > > > > > > > unsigned long _compound_pad_1; /* compound_head */
> > > > > > > > unsigned long _pt_pad_1; /* compound_head */
> > > > > > > > struct dev_pagemap *pgmap;
> > > > > > > > struct rcu_head rcu_head;
> > > > > > > >
> > > > > > > > None of them should be -1.
> > > > > > > >
> > > > > > > > > It seems that this kernel panic happens when reading kpageflags of pfn range
> > > > > > > > > [0xbffd7, 0xc0000), which coresponds to a 'reserved' range.
> > > > > > > > >
> > > > > > > > > [ 0.000000] user-defined physical RAM map:
> > > > > > > > > [ 0.000000] user: [mem 0x0000000000000000-0x000000000009fbff] usable
> > > > > > > > > [ 0.000000] user: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> > > > > > > > > [ 0.000000] user: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> > > > > > > > > [ 0.000000] user: [mem 0x0000000000100000-0x00000000bffd6fff] usable
> > > > > > > > > [ 0.000000] user: [mem 0x00000000bffd7000-0x00000000bfffffff] reserved
> > > > > > > > > [ 0.000000] user: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> > > > > > > > > [ 0.000000] user: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> > > > > > > > > [ 0.000000] user: [mem 0x0000000100000000-0x000000013fffffff] persistent (type 12)
> > > > > > > > >
> > > > > > > > > So I guess 'memmap=' parameter might badly affect the memory initialization process.
> > > > > > > > >
> > > > > > > > > This problem doesn't reproduce on v4.17, so some pre-released patch introduces it.
> > > > > > > > > I hope this info helps you find the solution/workaround.
> > > > > > > >
> > > > > > > > Can you try bisecting this? It could be one of my patches to reorder struct
> > > > > > > > page, or it could be one of Pavel's deferred page initialisation patches.
> > > > > > > > Or something else ;-)
> > > > > > >
> > > > > > > Thank you for the comment. I'm trying bisecting now, let you know the result later.
> > > > > > >
> > > > > > > And I found that my statement "not reproduce on v4.17" was wrong (I used
> > > > > > > different kvm guests, which made some different test condition and misguided me),
> > > > > > > this seems an older (at least < 4.15) bug.
> > > > > >
> > > > > > (Cc: Pavel)
> > > > > >
> > > > > > Bisection showed that the following commit introduced this issue:
> > > > > >
> > > > > > commit f7f99100d8d95dbcf09e0216a143211e79418b9f
> > > > > > Author: Pavel Tatashin <[email protected]>
> > > > > > Date: Wed Nov 15 17:36:44 2017 -0800
> > > > > >
> > > > > > mm: stop zeroing memory during allocation in vmemmap
> > > > > >
> > > > > > This patch postpones struct page zeroing to later stage of memory initialization.
> > > > > > My kernel config disabled CONFIG_DEFERRED_STRUCT_PAGE_INIT so two callsites of
> > > > > > __init_single_page() were never reached. So in such case, struct pages populated
> > > > > > by vmemmap_pte_populate() could be left uninitialized?
> > > > > > And I'm not sure yet how this issue becomes visible with memmap= setting.
> > > > >
> > > > > I think that this becomes visible because memmap=x!y creates a persistent memory region:
> > > > >
> > > > > parse_memmap_one
> > > > > {
> > > > > ...
> > > > > } else if (*p == '!') {
> > > > > start_at = memparse(p+1, &p);
> > > > > e820__range_add(start_at, mem_size, E820_TYPE_PRAM);
> > > > > ...
> > > > > }
> > > > >
> > > > > and this region it is not added neither in memblock.memory nor in memblock.reserved.
> > > > > Ranges in memblock.memory get zeroed in memmap_init_zone(), while memblock.reserved get zeroed
> > > > > in free_low_memory_core_early():
> > > > >
> > > > > static unsigned long __init free_low_memory_core_early(void)
> > > > > {
> > > > > ...
> > > > > for_each_reserved_mem_region(i, &start, &end)
> > > > > reserve_bootmem_region(start, end);
> > > > > ...
> > > > > }
> > > > >
> > > > >
> > > > > Maybe I am mistaken, but I think that persistent memory regions should be marked as reserved.
> > > > > A comment in do_mark_busy() suggests this:
> > > > >
> > > > > static bool __init do_mark_busy(enum e820_type type, struct resource *res)
> > > > > {
> > > > >
> > > > > ...
> > > > > /*
> > > > > * Treat persistent memory like device memory, i.e. reserve it
> > > > > * for exclusive use of a driver
> > > > > */
> > > > > ...
> > > > > }
> > > > >
> > > > >
> > > > > I wonder if something like this could work and if so, if it is right (i haven't tested it yet):
> > > > >
> > > > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > > > > index 71c11ad5643e..3c9686ef74e5 100644
> > > > > --- a/arch/x86/kernel/e820.c
> > > > > +++ b/arch/x86/kernel/e820.c
> > > > > @@ -1247,6 +1247,11 @@ void __init e820__memblock_setup(void)
> > > > > 		if (end != (resource_size_t)end)
> > > > > 			continue;
> > > > >
> > > > > +		if (entry->type == E820_TYPE_PRAM || entry->type == E820_TYPE_PMEM) {
> > > > > +			memblock_reserve(entry->addr, entry->size);
> > > > > +			continue;
> > > > > +		}
> > > > > +
> > > > > 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > > > > 			continue;
> > > >
> > > > It does not seem to work, so the reasoning might be incorrect.
> > >
> > > Thank you for the comment.
> > >
> > > One note is that the memory region with the "broken struct page" is a typical
> > > reserved region, not a pmem region. Strangely, reading offset 0xbffd7 of
> > > /proc/kpageflags is OK if the pmem region does not exist, but fails if the
> > > pmem region exists. Reading an offset like 0x100000 (in the pmem region) does
> > > not cause the crash, so the pmem region seems properly set up.
> > >
> > > [ 0.000000] user-defined physical RAM map:
> > > [ 0.000000] user: [mem 0x0000000000000000-0x000000000009fbff] usable
> > > [ 0.000000] user: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> > > [ 0.000000] user: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> > > [ 0.000000] user: [mem 0x0000000000100000-0x00000000bffd6fff] usable
> > > [ 0.000000] user: [mem 0x00000000bffd7000-0x00000000bfffffff] reserved ===> "broken struct page" region
> > > [ 0.000000] user: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> > > [ 0.000000] user: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> > > [ 0.000000] user: [mem 0x0000000100000000-0x000000013fffffff] persistent (type 12) => pmem region
> > > [ 0.000000] user: [mem 0x0000000140000000-0x000000023fffffff] usable
> > >
> >
> > I have another note:
> >
> > > My kernel config disabled CONFIG_DEFERRED_STRUCT_PAGE_INIT, so two call sites of
> > > __init_single_page() were never reached. In that case, could struct pages populated
> > > by vmemmap_pte_populate() be left uninitialized?
> >
> > I quickly checked whether enabling CONFIG_DEFERRED_STRUCT_PAGE_INIT affects
> > the issue, and found that the kernel panic happens even with this config enabled.
> > So I'm still confused...
>
> Let me share some new facts:
>
> I accidentally gave an inconvenient memmap layout, 'memmap=1G!4G', on a
> 2-NUMA-node VM with 8 GB of memory.
> I didn't intend this, but 4GB happens to be the start address of one of the
> memory blocks when no "memmap=" option is provided.
>
> (messages from free_area_init_nodes() for the no-"memmap=" case)
> [ 0.000000] Early memory node ranges
> [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
> [ 0.000000] node 0: [mem 0x0000000000100000-0x00000000bffd6fff]
> [ 0.000000] node 0: [mem 0x0000000100000000-0x000000013fffffff] // <---
> [ 0.000000] node 1: [mem 0x0000000140000000-0x000000023fffffff]
>
> When "memmap=1G!4G" is given, the range [0x0000000100000000-0x000000013fffffff]
> disappears, and the kernel messages look like this:
>
> (messages from free_area_init_nodes() for the "memmap=1G!4G" case)
> [ 0.000000] Early memory node ranges
> [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
> [ 0.000000] node 0: [mem 0x0000000000100000-0x00000000bffd6fff]
> [ 0.000000] node 1: [mem 0x0000000140000000-0x000000023fffffff]
>
> This makes the kernel think that the end pfn of node 0 is 0xbffd7
> instead of 0x140000, which affects the memory initialization process.
> memmap_init_zone() calls __init_single_page() for each page within a zone,
> so if zone->spanned_pages is underestimated, some pages are left uninitialized.
>
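To spell out the point just quoted: the init loop only ever visits pfns
inside a zone's span. A simplified sketch of the relevant part of
memmap_init_zone() in mm/page_alloc.c (arguments and checks trimmed for
illustration; not the verbatim source):

	/* start_pfn/size describe the zone span computed at boot */
	for (pfn = start_pfn; pfn < start_pfn + size; pfn++) {
		...
		__init_single_page(pfn_to_page(pfn), pfn, zone, nid);
	}

A pfn that drops out of every node's span is never passed to
__init_single_page() at all.
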
> If I provide 'memmap=1G!7G', the kernel panic does not reproduce, and the
> kernel messages look like this:
>
> (messages from free_area_init_nodes() for the "memmap=1G!7G" case)
> [ 0.000000] Early memory node ranges
> [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
> [ 0.000000] node 0: [mem 0x0000000000100000-0x00000000bffd6fff]
> [ 0.000000] node 0: [mem 0x0000000100000000-0x000000013fffffff]
> [ 0.000000] node 1: [mem 0x0000000140000000-0x00000001bfffffff]
> [ 0.000000] node 1: [mem 0x0000000200000000-0x000000023fffffff]
>
>
> I think that in order to fix this, we need some conditions and/or prechecks
> for the memblock layout. Does that make sense? Or is there a better approach?

This is what I am seeing too: some memory just vanishes and is left uninitialized.
All this is handled in parse_memmap_one(), so I wonder if the right thing to do
would be to detect that a user-specified map falls within a usable map, and in
that case back off and not insert it.

Maybe a subroutine that checks for that kind of overlapping map before calling e820__range_add()?
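
Something along these lines, perhaps. A hypothetical, untested sketch of that
check (the function name is made up; the e820_table iteration mirrors
arch/x86/kernel/e820.c):

	/* Hypothetical: does a user-specified memmap= range overlap E820 RAM? */
	static bool __init memmap_overlaps_ram(u64 start, u64 size)
	{
		int i;

		for (i = 0; i < e820_table->nr_entries; i++) {
			struct e820_entry *entry = &e820_table->entries[i];

			if (entry->type != E820_TYPE_RAM)
				continue;
			if (start < entry->addr + entry->size &&
			    entry->addr < start + size)
				return true;
		}
		return false;
	}

parse_memmap_one() could then back off before calling e820__range_add() when
this returns true.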

Oscar Salvador

2018-06-07 10:33:15

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM

On Thu, Jun 07, 2018 at 11:49:21AM +0200, Oscar Salvador wrote:
> [...]
>
> Could you share the "e820: BIOS-provided physical RAM map" and "e820: user-defined physical RAM map"
> output with both memmap= args (1G!4G and 1G!7G)?

Sure, here it is:

# memmap=1G!4G

BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x00000000bffd6fff] usable
BIOS-e820: [mem 0x00000000bffd7000-0x00000000bfffffff] reserved
BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000023fffffff] usable
NX (Execute Disable) protection: active
user-defined physical RAM map:
user: [mem 0x0000000000000000-0x000000000009fbff] usable
user: [mem 0x000000000009fc00-0x000000000009ffff] reserved
user: [mem 0x00000000000f0000-0x00000000000fffff] reserved
user: [mem 0x0000000000100000-0x00000000bffd6fff] usable
user: [mem 0x00000000bffd7000-0x00000000bfffffff] reserved
user: [mem 0x00000000feffc000-0x00000000feffffff] reserved
user: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
user: [mem 0x0000000100000000-0x000000013fffffff] persistent (type 12)
user: [mem 0x0000000140000000-0x000000023fffffff] usable

# memmap=1G!7G

BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x00000000bffd6fff] usable
BIOS-e820: [mem 0x00000000bffd7000-0x00000000bfffffff] reserved
BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000023fffffff] usable
NX (Execute Disable) protection: active
user-defined physical RAM map:
user: [mem 0x0000000000000000-0x000000000009fbff] usable
user: [mem 0x000000000009fc00-0x000000000009ffff] reserved
user: [mem 0x00000000000f0000-0x00000000000fffff] reserved
user: [mem 0x0000000000100000-0x00000000bffd6fff] usable
user: [mem 0x00000000bffd7000-0x00000000bfffffff] reserved
user: [mem 0x00000000feffc000-0x00000000feffffff] reserved
user: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
user: [mem 0x0000000100000000-0x00000001bfffffff] usable
user: [mem 0x00000001c0000000-0x00000001ffffffff] persistent (type 12)
user: [mem 0x0000000200000000-0x000000023fffffff] usable

I attached full dmesgs (including some printk output for debugging) just in case.

Thanks,
Naoya Horiguchi


Attachments:
dmesg.1G_4G.gz (13.77 kB)
dmesg.1G_7G.gz (14.13 kB)

2018-06-07 10:55:47

by Oscar Salvador

[permalink] [raw]
Subject: Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM

On Thu, Jun 07, 2018 at 08:59:40AM +0200, Oscar Salvador wrote:
> > [...]
> > I think that in order to fix this, we need some conditions and/or prechecks
> > for the memblock layout. Does that make sense? Or is there a better approach?

Could you share the "e820: BIOS-provided physical RAM map" and "e820: user-defined physical RAM map"
output with both memmap= args (1G!4G and 1G!7G)?

thanks
Oscar Salvador

2018-06-11 09:17:21

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM

On Thu, Jun 07, 2018 at 10:02:56AM +0000, Horiguchi Naoya(堀口 直也) wrote:
> On Thu, Jun 07, 2018 at 11:49:21AM +0200, Oscar Salvador wrote:
> > [...]
>
> All this is handled in parse_memmap_one(), so I wonder if the right thing to do
> would be to detect that a user-specified map falls within a usable map, and in
> that case back off and not insert it.
>
> Maybe a subroutine that checks for that kind of overlapping map before calling
> e820__range_add()?
>

This problem seems to happen when the end address of the user-defined memmap
is equal to the end address of a NUMA node (0x140000000, or 5GB, in this case).
So that's the condition to be checked, I think.
However, we don't initialize NUMA at parse_memmap_one() time, so we need to
identify the right place to do this check; I'll work on that next.
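
The arithmetic for the two layouts makes that condition visible:

	memmap=1G!4G carves out [0x100000000, 0x140000000): its end coincides
	with the end of node 0's last memory range, so node 0's span shrinks
	to pfn 0xbffd7 and the struct pages in the gap stay uninitialized.

	memmap=1G!7G carves out [0x1c0000000, 0x200000000): a hole in the
	middle of node 1, which still spans up to 0x240000000, so no node
	span is truncated and the panic does not reproduce.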

Thanks,
Naoya Horiguchi

2018-06-13 05:43:47

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH v1] mm: zero remaining unavailable struct pages (Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM)

Hi everyone,

I wrote a patch for this issue.
There was a discussion about a prechecking approach, but I finally found
out that it's hard to change memblock after numa_init, so I took another
approach (see the patch description).

I'd be glad if you could check that it works for you.

Thanks,
Naoya Horiguchi
---
From: Naoya Horiguchi <[email protected]>
Date: Wed, 13 Jun 2018 12:43:27 +0900
Subject: [PATCH] mm: zero remaining unavailable struct pages

There is a kernel panic that is triggered when reading /proc/kpageflags
on a kernel booted with the parameter 'memmap=nn[KMG]!ss[KMG]':

BUG: unable to handle kernel paging request at fffffffffffffffe
PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
RIP: 0010:stable_page_flags+0x27/0x3c0
Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
Call Trace:
kpageflags_read+0xc7/0x120
proc_reg_read+0x3c/0x60
__vfs_read+0x36/0x170
vfs_read+0x89/0x130
ksys_pread64+0x71/0x90
do_syscall_64+0x5b/0x160
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7efc42e75e23
Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24

According to kernel bisection, this problem became visible due to commit
f7f99100d8d9, which changed how struct pages are initialized.

Memblock layout affects the pfn ranges covered by each node/zone. Consider
a VM with 2 NUMA nodes, each with 4GB of memory, where the default
(no memmap= given) memblock layout is as below:

MEMBLOCK configuration:
memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
memory.cnt = 0x4
memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
...

If you give memmap=1G!4G (so it just covers memory[0x2]),
the range [0x100000000-0x13fffffff] is gone:

MEMBLOCK configuration:
memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
memory.cnt = 0x3
memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
...

This shrinks node 0's pfn range, because that range is calculated from the
address ranges in memblock.memory. As a result, some of the struct pages
in the gap are left uninitialized.
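
Concretely, with the node 0 ranges above:

	node 0 span, no memmap=:   pfn [0x1, 0x140000)
	node 0 span, memmap=1G!4G: pfn [0x1, 0xbffd7)

The pfns in [0xbffd7, 0xc0000) still pass pfn_valid() (their section has a
memmap), but they now lie outside every zone, so nothing ever calls
__init_single_page() or mm_zero_struct_page() on their struct pages.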

We have a function zero_resv_unavail() which zeroes the struct pages
outside memblock.memory, but currently it covers only the reserved
unavailable ranges (i.e. memblock.reserved && !memblock.memory).
This patch extends it to cover all unavailable ranges, which fixes
the reported issue.

Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Naoya Horiguchi <[email protected]>
---
 include/linux/memblock.h | 16 ----------------
 mm/page_alloc.c          | 33 ++++++++++++++++++++++++---------
 2 files changed, 24 insertions(+), 25 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index ca59883c8364..f191e51c5d2a 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -236,22 +236,6 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
 	for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved,	\
 			       nid, flags, p_start, p_end, p_nid)
 
-/**
- * for_each_resv_unavail_range - iterate through reserved and unavailable memory
- * @i: u64 used as loop variable
- * @flags: pick from blocks based on memory attributes
- * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
- * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
- *
- * Walks over unavailable but reserved (reserved && !memory) areas of memblock.
- * Available as soon as memblock is initialized.
- * Note: because this memory does not belong to any physical node, flags and
- * nid arguments do not make sense and thus not exported as arguments.
- */
-#define for_each_resv_unavail_range(i, p_start, p_end)			\
-	for_each_mem_range(i, &memblock.reserved, &memblock.memory,	\
-			   NUMA_NO_NODE, MEMBLOCK_NONE, p_start, p_end, NULL)
-
 static inline void memblock_set_region_flags(struct memblock_region *r,
 					     unsigned long flags)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1772513358e9..098f7c2c127b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6487,25 +6487,40 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
  * struct pages which are reserved in memblock allocator and their fields
  * may be accessed (for example page_to_pfn() on some configuration accesses
  * flags). We must explicitly zero those struct pages.
+ *
+ * This function also addresses a similar issue where struct pages are left
+ * uninitialized because the physical address range is not covered by
+ * memblock.memory or memblock.reserved. That could happen when memblock
+ * layout is manually configured via memmap=.
  */
 void __paginginit zero_resv_unavail(void)
 {
 	phys_addr_t start, end;
 	unsigned long pfn;
 	u64 i, pgcnt;
+	phys_addr_t next = 0;
 
 	/*
-	 * Loop through ranges that are reserved, but do not have reported
-	 * physical memory backing.
+	 * Loop through unavailable ranges not covered by memblock.memory.
 	 */
 	pgcnt = 0;
-	for_each_resv_unavail_range(i, &start, &end) {
-		for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
-			if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
-				continue;
-			mm_zero_struct_page(pfn_to_page(pfn));
-			pgcnt++;
+	for_each_mem_range(i, &memblock.memory, NULL,
+			NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL) {
+		if (next < start) {
+			for (pfn = PFN_DOWN(next); pfn < PFN_UP(start); pfn++) {
+				if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
+					continue;
+				mm_zero_struct_page(pfn_to_page(pfn));
+				pgcnt++;
+			}
 		}
+		next = end;
+	}
+	for (pfn = PFN_DOWN(next); pfn < max_pfn; pfn++) {
+		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
+			continue;
+		mm_zero_struct_page(pfn_to_page(pfn));
+		pgcnt++;
 	}
 
 	/*
@@ -6516,7 +6531,7 @@ void __paginginit zero_resv_unavail(void)
 	 * this code can be removed.
 	 */
 	if (pgcnt)
-		pr_info("Reserved but unavailable: %lld pages", pgcnt);
+		pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt);
 }
 #endif /* CONFIG_HAVE_MEMBLOCK */
 
--
2.7.4


2018-06-13 08:41:23

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v1] mm: zero remaining unavailable struct pages (Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM)

On Wed, Jun 13, 2018 at 05:41:08AM +0000, Naoya Horiguchi wrote:
> [...]
> void __paginginit zero_resv_unavail(void)
> {
> phys_addr_t start, end;
> unsigned long pfn;
> u64 i, pgcnt;
> + phys_addr_t next = 0;
>
> /*
> - * Loop through ranges that are reserved, but do not have reported
> - * physical memory backing.
> + * Loop through unavailable ranges not covered by memblock.memory.
> */
> pgcnt = 0;
> - for_each_resv_unavail_range(i, &start, &end) {
> - for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
> - if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
> - continue;
> - mm_zero_struct_page(pfn_to_page(pfn));
> - pgcnt++;
> + for_each_mem_range(i, &memblock.memory, NULL,
> + NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL) {
> + if (next < start) {
> + for (pfn = PFN_DOWN(next); pfn < PFN_UP(start); pfn++) {
> + if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
> + continue;
> + mm_zero_struct_page(pfn_to_page(pfn));
> + pgcnt++;
> + }
> }
> + next = end;
> + }
> + for (pfn = PFN_DOWN(next); pfn < max_pfn; pfn++) {
> + if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
> + continue;
> + mm_zero_struct_page(pfn_to_page(pfn));
> + pgcnt++;
> }

Hi Naoya,

Is the second loop really needed?

AFAIK, max_pfn is set to the last pfn of the last E820_TYPE_RAM entry, and since
you go through all memory ranges within memblock.memory and then assign next = end,
I think that by the time we are done with the first loop, next will always point
to max_pfn (I only checked this for x86).
Am I right, or did I overlook something?
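
To make the reasoning above concrete, here is a minimal userspace sketch
(plain C, not kernel code; the range values are copied from the changelog's
no-memmap= example, everything else is illustrative) that replays the gap
walk and prints where 'next' ends up:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* memblock.memory as half-open [start, end) byte ranges */
	uint64_t ranges[][2] = {
		{ 0x1000,      0x9f000 },
		{ 0x100000,    0xbffd7000 },
		{ 0x100000000, 0x140000000 },
		{ 0x140000000, 0x240000000 },
	};
	uint64_t next = 0;

	for (unsigned i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++) {
		if (next < ranges[i][0])	/* hole before this region */
			printf("gap: [%#jx, %#jx)\n",
			       (uintmax_t)next, (uintmax_t)ranges[i][0]);
		next = ranges[i][1];
	}
	/*
	 * On x86, max_pfn is derived from the end of the last E820_TYPE_RAM
	 * range, so PFN_DOWN(next) == max_pfn here and the trailing loop in
	 * the patch would find nothing to zero.
	 */
	printf("next = %#jx\n", (uintmax_t)next);
	return 0;
}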

Besides that, I did some tests and I can no longer reproduce the error.
So feel free to add:

Tested-by: Oscar Salvador <[email protected]>

>
> /*
> @@ -6516,7 +6531,7 @@ void __paginginit zero_resv_unavail(void)
> * this code can be removed.
> */
> if (pgcnt)
> - pr_info("Reserved but unavailable: %lld pages", pgcnt);
> + pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt);
> }
> #endif /* CONFIG_HAVE_MEMBLOCK */
>
> --
> 2.7.4
>

Thanks

Best Regards
Oscar Salvador

2018-06-13 09:07:55

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v1] mm: zero remaining unavailable struct pages (Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM)

On Wed 13-06-18 05:41:08, Naoya Horiguchi wrote:
[...]
> From: Naoya Horiguchi <[email protected]>
> Date: Wed, 13 Jun 2018 12:43:27 +0900
> Subject: [PATCH] mm: zero remaining unavailable struct pages
>
> There is a kernel panic that is triggered when reading /proc/kpageflags
> on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
>
> [...]
>
> This shrinks node 0's pfn range, because the node range is calculated
> from the address range of memblock.memory. As a result, some of the
> struct pages in the gap range are left uninitialized.
>
> We have a function zero_resv_unavail() which zeroes the struct
> pages outside memblock.memory, but currently it covers only the reserved
> unavailable ranges (i.e. memblock.reserved && !memblock.memory).
> This patch extends it to cover all unavailable ranges, which fixes
> the reported issue.

Thanks for pinpointing this, Naoya! I am wondering why we cannot
simply mark the excluded ranges as reserved instead.
--
Michal Hocko
SUSE Labs

2018-06-14 05:03:19

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v1] mm: zero remaining unavailable struct pages (Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM)

On Wed, Jun 13, 2018 at 10:40:32AM +0200, Oscar Salvador wrote:
> On Wed, Jun 13, 2018 at 05:41:08AM +0000, Naoya Horiguchi wrote:
> > Hi everyone,
> >
> > I wrote a patch for this issue.
> > There was a discussion about a prechecking approach, but I finally found
> > out that it's hard to make changes to memblock after numa_init, so I took
> > another approach (see the patch description).
> >
> > I'd be glad if you could check that it works for you.
> >
> > Thanks,
> > Naoya Horiguchi
> > ---
> > From: Naoya Horiguchi <[email protected]>
> > Date: Wed, 13 Jun 2018 12:43:27 +0900
> > Subject: [PATCH] mm: zero remaining unavailable struct pages
> >
> > There is a kernel panic that is triggered when reading /proc/kpageflags
> > on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
> >
> > [...]
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 1772513358e9..098f7c2c127b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6487,25 +6487,40 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
> > * struct pages which are reserved in memblock allocator and their fields
> > * may be accessed (for example page_to_pfn() on some configuration accesses
> > * flags). We must explicitly zero those struct pages.
> > + *
> > + * This function also addresses a similar issue where struct pages are left
> > + * uninitialized because the physical address range is not covered by
> > + * memblock.memory or memblock.reserved. That could happen when memblock
> > + * layout is manually configured via memmap=.
> > */
> > void __paginginit zero_resv_unavail(void)
> > {
> > phys_addr_t start, end;
> > unsigned long pfn;
> > u64 i, pgcnt;
> > + phys_addr_t next = 0;
> >
> > /*
> > - * Loop through ranges that are reserved, but do not have reported
> > - * physical memory backing.
> > + * Loop through unavailable ranges not covered by memblock.memory.
> > */
> > pgcnt = 0;
> > - for_each_resv_unavail_range(i, &start, &end) {
> > - for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
> > - if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
> > - continue;
> > - mm_zero_struct_page(pfn_to_page(pfn));
> > - pgcnt++;
> > + for_each_mem_range(i, &memblock.memory, NULL,
> > + NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL) {
> > + if (next < start) {
> > + for (pfn = PFN_DOWN(next); pfn < PFN_UP(start); pfn++) {
> > + if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
> > + continue;
> > + mm_zero_struct_page(pfn_to_page(pfn));
> > + pgcnt++;
> > + }
> > }
> > + next = end;
> > + }
> > + for (pfn = PFN_DOWN(next); pfn < max_pfn; pfn++) {
> > + if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
> > + continue;
> > + mm_zero_struct_page(pfn_to_page(pfn));
> > + pgcnt++;
> > }
>
> Hi Naoya,
>
> Is the second loop really needed?
>
> AFAIK, max_pfn is set to the last pfn of the last E820_TYPE_RAM entry, and since
> you go through all memory ranges within memblock.memory and then assign next = end,
> I think that by the time we are done with the first loop, next will always point
> to max_pfn (I only checked this for x86).
> Am I right, or did I overlook something?

Hi Oscar,

Thank you for the comment.
Some archs do set max_pfn to the end pfn of E820_TYPE_RAM, but other archs
(s390, arm, mips, ...) seem to determine max_pfn in their own way.
I'm not sure whether this problem is visible on such archs.

>
> Besides that, I did some tests and I can no longer reproduce the error.
> So feel free to add:
>
> Tested-by: Oscar Salvador <[email protected]>

Thank you!
- Naoya


2018-06-14 05:18:53

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v1] mm: zero remaining unavailable struct pages (Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM)

On Wed, Jun 13, 2018 at 11:07:00AM +0200, Michal Hocko wrote:
> On Wed 13-06-18 05:41:08, Naoya Horiguchi wrote:
> [...]
> > From: Naoya Horiguchi <[email protected]>
> > Date: Wed, 13 Jun 2018 12:43:27 +0900
> > Subject: [PATCH] mm: zero remaining unavailable struct pages
> >
> > There is a kernel panic that is triggered when reading /proc/kpageflags
> > on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
> >
> > [...]
>
> Thanks for pinpointing this, Naoya! I am wondering why we cannot
> simply mark the excluded ranges as reserved instead.

I tried your idea with the change below, and it also fixes the kernel panic.

---
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index d1f25c831447..2cef120535d4 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1248,6 +1248,7 @@ void __init e820__memblock_setup(void)
{
int i;
u64 end;
+ u64 addr = 0;

/*
* The bootstrap memblock region count maximum is 128 entries
@@ -1264,13 +1265,16 @@ void __init e820__memblock_setup(void)
struct e820_entry *entry = &e820_table->entries[i];

end = entry->addr + entry->size;
+ if (addr < entry->addr)
+ memblock_reserve(addr, entry->addr - addr);
+ addr = end;
if (end != (resource_size_t)end)
continue;

if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
- continue;
-
- memblock_add(entry->addr, entry->size);
+ memblock_reserve(entry->addr, entry->size);
+ else
+ memblock_add(entry->addr, entry->size);
}

/* Throw away partial pages: */


My concern is that there are a few E820 memory types other than
E820_TYPE_RAM and E820_TYPE_RESERVED, and I'm not sure that putting them
all into memblock.reserved is really acceptable.
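
For reference, the set of types in question looks roughly like this (a
sketch paraphrased from arch/x86/include/asm/e820/types.h of this kernel
generation; double-check the real header before relying on the exact values):

/* E820 range types as x86 sees them */
enum e820_type {
	E820_TYPE_RAM		= 1,	/* usable RAM */
	E820_TYPE_RESERVED	= 2,	/* firmware-reserved */
	E820_TYPE_ACPI		= 3,	/* ACPI reclaimable */
	E820_TYPE_NVS		= 4,	/* ACPI NVS, must be preserved */
	E820_TYPE_UNUSABLE	= 5,	/* known-bad RAM */
	E820_TYPE_PMEM		= 7,	/* persistent memory */
	E820_TYPE_PRAM		= 12,	/* legacy pmem, e.g. memmap=nn!ss */
	E820_TYPE_RESERVED_KERN	= 128,	/* RAM reserved by the kernel itself */
};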

Thanks,
Naoya Horiguchi

2018-06-14 05:39:46

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v1] mm: zero remaining unavailable struct pages (Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM)

On Thu, Jun 14, 2018 at 05:16:18AM +0000, Naoya Horiguchi wrote:
> On Wed, Jun 13, 2018 at 11:07:00AM +0200, Michal Hocko wrote:
> > On Wed 13-06-18 05:41:08, Naoya Horiguchi wrote:
> > [...]
> > > From: Naoya Horiguchi <[email protected]>
> > > Date: Wed, 13 Jun 2018 12:43:27 +0900
> > > Subject: [PATCH] mm: zero remaining unavailable struct pages
> > >
> > > There is a kernel panic that is triggered when reading /proc/kpageflags
> > > on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
> > >
> > > [...]
> >
> > Thanks for pinpointing this, Naoya! I am wondering why we cannot
> > simply mark the excluded ranges as reserved instead.
>
> I tried your idea with the change below, and it also fixes the kernel panic.
>
> ---
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index d1f25c831447..2cef120535d4 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1248,6 +1248,7 @@ void __init e820__memblock_setup(void)
> {
> int i;
> u64 end;
> + u64 addr = 0;
>
> /*
> * The bootstrap memblock region count maximum is 128 entries
> @@ -1264,13 +1265,16 @@ void __init e820__memblock_setup(void)
> struct e820_entry *entry = &e820_table->entries[i];
>
> end = entry->addr + entry->size;
> + if (addr < entry->addr)
> + memblock_reserve(addr, entry->addr - addr);
> + addr = end;
> if (end != (resource_size_t)end)
> continue;
>
> if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> - continue;
> -
> - memblock_add(entry->addr, entry->size);
> + memblock_reserve(entry->addr, entry->size);
> + else
> + memblock_add(entry->addr, entry->size);
> }
>
> /* Throw away partial pages: */
>
>
> My concern is that there are a few E820 memory types other than
> E820_TYPE_RAM and E820_TYPE_RESERVED, and I'm not sure that putting them
> all into memblock.reserved is really acceptable.

Hi Naoya,

Maybe you could just add to memblock.reserved all unavailable ranges within
E820_TYPE_RAM.
Actually, in your original patch, you are walking memblock.memory, which should
only contain E820_TYPE_RAM ranges (talking about x86).

So I think the below would do the trick as well?

@@ -1248,6 +1276,7 @@ void __init e820__memblock_setup(void)
{
int i;
u64 end;
+ u64 next = 0;

/*
* The bootstrap memblock region count maximum is 128 entries

@@ -1269,6 +1299,14 @@ void __init e820__memblock_setup(void)

if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
continue;
+
+ if (entry->type == E820_TYPE_RAM) {
+ if (next < entry->addr)
+ memblock_reserve(next, entry->addr - next);
+ next = end;
+ }

With the above patch, I can no longer see the issues either.

There is a difference between this and your original patch, though.
In your original patch you are just zeroing the pages, while with this one
(or with your second patch) we will zero the page in reserve_bootmem_region(),
but that function also initializes some other fields of the struct page:

mm_zero_struct_page(page);
set_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);

So I am not sure we want to bother doing that for pages that are really unreachable.

Best Regards
Oscar Salvador

2018-06-14 06:40:59

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH v2] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

On Thu, Jun 14, 2018 at 07:38:59AM +0200, Oscar Salvador wrote:
> On Thu, Jun 14, 2018 at 05:16:18AM +0000, Naoya Horiguchi wrote:
...
> >
> > My concern is that there are a few E820 memory types other than
> > E820_TYPE_RAM and E820_TYPE_RESERVED, and I'm not sure that putting them
> > all into memblock.reserved is really acceptable.
>
> Hi Naoya,
>
> Maybe you could just add to memblock.reserved all unavailable ranges within
> E820_TYPE_RAM.
> Actually, in your original patch, you are walking memblock.memory, which should
> only contain E820_TYPE_RAM ranges (talking about x86).
>
> So I think the below would do the trick as well?
>
> @@ -1248,6 +1276,7 @@ void __init e820__memblock_setup(void)
> {
> int i;
> u64 end;
> + u64 next = 0;
>
> /*
> * The bootstrap memblock region count maximum is 128 entries
>
> @@ -1269,6 +1299,14 @@ void __init e820__memblock_setup(void)
>
> if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> continue;
>
> +
> + if (entry->type == E820_TYPE_RAM) {
> + if (next < entry->addr)
> + memblock_reserve(next, entry->addr - next);
> + next = end;
> + }
>
> With the above patch, I can no longer see the issues either.

I double-checked and this change looks good to me.

>
> There is a difference between this and your original patch, though.
> In your original patch you are just zeroing the pages, while with this one
> (or with your second patch) we will zero the page in reserve_bootmem_region(),
> but that function also initializes some other fields of the struct page:
>
> mm_zero_struct_page(page);
> set_page_links(page, zone, nid, pfn);
> init_page_count(page);
> page_mapcount_reset(page);
> page_cpupid_reset_last(page);
>
> So I am not sure we want to bother doing that for pages that are really unreachable.

I think that, given that /proc/kpageflags can check them, some data
(even if it's trivial) might be better than just zeros.
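
(For anyone following along: /proc/kpageflags exposes one u64 of flags per
page frame, indexed by pfn. A minimal reader for a single pfn, shown purely
as an illustrative sketch and not as a replacement for tools/vm/page-types,
could look like this:)

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	uint64_t pfn, flags;
	int fd;

	if (argc < 2)
		return 1;
	pfn = strtoull(argv[1], NULL, 0);

	fd = open("/proc/kpageflags", O_RDONLY);
	if (fd < 0)
		return 1;
	/* one 64-bit flags word per page frame, at offset pfn * 8 */
	if (pread(fd, &flags, sizeof(flags), pfn * sizeof(flags)) != sizeof(flags))
		return 1;
	printf("pfn %#jx: flags %#jx\n", (uintmax_t)pfn, (uintmax_t)flags);
	close(fd);
	return 0;
}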

Here's the updated patch.
Thanks for the suggestion and testing!

---
From: Naoya Horiguchi <[email protected]>
Date: Thu, 14 Jun 2018 14:44:36 +0900
Subject: [PATCH] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

There is a kernel panic that is triggered when reading /proc/kpageflags
on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':

BUG: unable to handle kernel paging request at fffffffffffffffe
PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
RIP: 0010:stable_page_flags+0x27/0x3c0
Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
Call Trace:
kpageflags_read+0xc7/0x120
proc_reg_read+0x3c/0x60
__vfs_read+0x36/0x170
vfs_read+0x89/0x130
ksys_pread64+0x71/0x90
do_syscall_64+0x5b/0x160
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7efc42e75e23
Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24

According to a kernel bisection, this problem became visible due to commit
f7f99100d8d9, which changed how struct pages are initialized.

Memblock layout affects the pfn ranges covered by node/zone. Consider
that we have a VM with 2 NUMA nodes and each node has 4GB memory, and
the default (no memmap= given) memblock layout is as follows:

MEMBLOCK configuration:
memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
memory.cnt = 0x4
memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
...

If you give memmap=1G!4G (so it just covers memory[0x2]),
the range [0x100000000-0x13fffffff] is gone:

MEMBLOCK configuration:
memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
memory.cnt = 0x3
memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
...

This shrinks node 0's pfn range, because the node range is calculated
from the address range of memblock.memory. As a result, some of the
struct pages in the gap range are left uninitialized.

We have a function zero_resv_unavail() which zeroes the struct
pages within the reserved unavailable range (i.e. memblock.reserved &&
!memblock.memory). This patch utilizes it to cover all unavailable
ranges by putting them into memblock.reserved.

Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Naoya Horiguchi <[email protected]>
Suggested-by: Oscar Salvador <[email protected]>
Tested-by: Oscar Salvador <[email protected]>
---
arch/x86/kernel/e820.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index d1f25c831447..d15ef47ea354 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1248,6 +1248,7 @@ void __init e820__memblock_setup(void)
{
int i;
u64 end;
+ u64 next = 0;

/*
* The bootstrap memblock region count maximum is 128 entries
@@ -1270,6 +1271,17 @@ void __init e820__memblock_setup(void)
if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
continue;

+ /*
+ * Gaps between E820_TYPE_RAM ranges are put into
+ * memblock.reserved, to make sure that struct pages in such
+ * regions are not left uninitialized after bootup.
+ */
+ if (entry->type == E820_TYPE_RAM) {
+ if (next < entry->addr)
+ memblock_reserve(next, entry->addr - next);
+ next = end;
+ }
+
memblock_add(entry->addr, entry->size);
}

--
2.7.4


2018-06-14 07:02:37

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v1] mm: zero remaining unavailable struct pages (Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM)

On Thu 14-06-18 05:16:18, Naoya Horiguchi wrote:
> On Wed, Jun 13, 2018 at 11:07:00AM +0200, Michal Hocko wrote:
> > On Wed 13-06-18 05:41:08, Naoya Horiguchi wrote:
> > [...]
> > > From: Naoya Horiguchi <[email protected]>
> > > Date: Wed, 13 Jun 2018 12:43:27 +0900
> > > Subject: [PATCH] mm: zero remaining unavailable struct pages
> > >
> > > There is a kernel panic that is triggered when reading /proc/kpageflags
> > > on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
> > >
> > > [...]
> >
> > Thanks for pinpointing this, Naoya! I am wondering why we cannot
> > simply mark the excluded ranges as reserved instead.
>
> I tried your idea with the change below, and it also fixes the kernel panic.
>
> ---
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index d1f25c831447..2cef120535d4 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1248,6 +1248,7 @@ void __init e820__memblock_setup(void)
> {
> int i;
> u64 end;
> + u64 addr = 0;
>
> /*
> * The bootstrap memblock region count maximum is 128 entries
> @@ -1264,13 +1265,16 @@ void __init e820__memblock_setup(void)
> struct e820_entry *entry = &e820_table->entries[i];
>
> end = entry->addr + entry->size;
> + if (addr < entry->addr)
> + memblock_reserve(addr, entry->addr - addr);
> + addr = end;
> if (end != (resource_size_t)end)
> continue;
>
> if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> - continue;
> -
> - memblock_add(entry->addr, entry->size);
> + memblock_reserve(entry->addr, entry->size);
> + else
> + memblock_add(entry->addr, entry->size);
> }
>
> /* Throw away partial pages: */

Yes, this looks much better, although I was focusing more on
e820__range_remove.

> My concern is that there are a few E820 memory types other than
> E820_TYPE_RAM and E820_TYPE_RESERVED, and I'm not sure that putting them
> all into memblock.reserved is really acceptable.

Why wouldn't it be? I mean, reserved ranges are never to be touched unless
the owner of that range frees them up.
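
(For background, a rough sketch of the boot-time flow this relies on,
paraphrased from mm/nobootmem.c and mm/page_alloc.c of this kernel
generation; the function names are real, the bodies are elided:)

/*
 * free_all_bootmem()
 *   free_low_memory_core_early()
 *     for_each_reserved_mem_region(...)       // memblock.reserved
 *       reserve_bootmem_region(start, end);   // init struct pages, set PG_reserved
 *     for_each_free_mem_range(...)            // memory && !reserved
 *       __free_memory_core(start, end);       // hand pages to the buddy allocator
 *
 * So anything put into memblock.reserved both gets initialized struct
 * pages and stays out of the page allocator.
 */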
--
Michal Hocko
SUSE Labs

2018-06-14 07:21:42

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v2] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

On Thu, Jun 14, 2018 at 06:34:55AM +0000, Naoya Horiguchi wrote:
> On Thu, Jun 14, 2018 at 07:38:59AM +0200, Oscar Salvador wrote:
> > On Thu, Jun 14, 2018 at 05:16:18AM +0000, Naoya Horiguchi wrote:
> [...]

Thanks Naoya!

Andrew: In case you consider taking this patch instead of the first one,
could you please replace "[email protected]" with "[email protected]"?

Thanks

Best Regards
Oscar Salvador

2018-06-14 11:25:59

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v2] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

On Thu, Jun 14, 2018 at 09:21:03AM +0200, Oscar Salvador wrote:
> On Thu, Jun 14, 2018 at 06:34:55AM +0000, Naoya Horiguchi wrote:
> > On Thu, Jun 14, 2018 at 07:38:59AM +0200, Oscar Salvador wrote:
> > > On Thu, Jun 14, 2018 at 05:16:18AM +0000, Naoya Horiguchi wrote:
> > [...]
> >
> > We have a function zero_resv_unavail() which zeroes the struct
> > pages within the reserved unavailable range (i.e. memblock.reserved &&
> > !memblock.memory). This patch utilizes it to cover all unavailable
> > ranges by putting them into memblock.reserved.

I just spotted this.
It seems that the changelog has not been updated.
It still refers to zero_resv_unavail(), while this patch takes
a different approach.

> >
> > Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
> > Signed-off-by: Naoya Horiguchi <[email protected]>
> > Suggested-by: Oscar Salvador <[email protected]>
> > Tested-by: Oscar Salvador <[email protected]>
> > ---
> > arch/x86/kernel/e820.c | 12 ++++++++++++
> > 1 file changed, 12 insertions(+)
> >
> > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > index d1f25c831447..d15ef47ea354 100644
> > --- a/arch/x86/kernel/e820.c
> > +++ b/arch/x86/kernel/e820.c
> > @@ -1248,6 +1248,7 @@ void __init e820__memblock_setup(void)
> > {
> > int i;
> > u64 end;
> > + u64 next = 0;
> >
> > /*
> > * The bootstrap memblock region count maximum is 128 entries
> > @@ -1270,6 +1271,17 @@ void __init e820__memblock_setup(void)
> > if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > continue;
> >
> > + /*
> > + * Ranges unavailable in E820_TYPE_RAM are put into
> > + * memblock.reserved, to make sure that struct pages in such
> > + * regions are not left uninitialized after bootup.
> > + */
> > + if (entry->type == E820_TYPE_RAM)
> > + if (next < entry->addr) {
> > + memblock_reserve (next, next + (entry->addr - next));
> > + next = end;
> > + }
> > +
> > memblock_add(entry->addr, entry->size);
> > }
>
> Thanks Naoya!
>
> Andrew: In case you consider taking this patch instead of the first one,
> could you please replace "[email protected]" with "[email protected]"?
>
> Thanks
>
> Best Regards
> Oscar Salvador
>

Best Regards
Oscar Salvador

2018-06-14 21:32:12

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v2] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

On Thu, Jun 14, 2018 at 06:34:55AM +0000, Naoya Horiguchi wrote:
> On Thu, Jun 14, 2018 at 07:38:59AM +0200, Oscar Salvador wrote:
> > On Thu, Jun 14, 2018 at 05:16:18AM +0000, Naoya Horiguchi wrote:
> ...
> > >
> > > My concern is that there are a few E820 memory types other than
> > > E820_TYPE_RAM and E820_TYPE_RESERVED, and I'm not sure that putting them
> > > all into memblock.reserved is really acceptable.
> >
> > Hi Naoya,
> >
> > Maybe you could just add to memblock.reserved all unavailable ranges within
> > E820_TYPE_RAM.
> > Actually, in your original patch, you are walking memblock.memory, which should
> > only contain E820_TYPE_RAM ranges (talking about x86).
> >
> > So I think the below would do the trick as well?
> >
> > @@ -1248,6 +1276,7 @@ void __init e820__memblock_setup(void)
> > {
> > int i;
> > u64 end;
> > + u64 next = 0;
> >
> > /*
> > * The bootstrap memblock region count maximum is 128 entries
> >
> > @@ -1269,6 +1299,14 @@ void __init e820__memblock_setup(void)
> >
> > if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > continue;
> >
> > +
> > + if (entry->type == E820_TYPE_RAM)
> > + if (next < entry->addr) {
> > + memblock_reserve (next, next + (entry->addr - next));
> > + next = end;
> > + }
> >
> > With the above patch, I can no longer see the issues either.
>
> I double-checked and this change looks good to me.
>
> >
> > There is a difference, though, between this and your original patch.
> > In your original patch, you are just zeroing the pages, while with this one (or with your second patch),
> > we will zero the page in reserve_bootmem_region(), but that function also initializes
> > some other fields of the struct page:
> >
> > mm_zero_struct_page(page);
> > set_page_links(page, zone, nid, pfn);
> > init_page_count(page);
> > page_mapcount_reset(page);
> > page_cpupid_reset_last(page);
> >
> > So I am not sure we want to bother doing that for pages that are really unreachable.
>
> I think that, considering /proc/kpageflags can inspect them, some data
> (even if it's trivial) might be better than just zeros.
>
> Here's the updated patch.
> Thanks for the suggestion and testing!
>
> ---
> From: Naoya Horiguchi <[email protected]>
> Date: Thu, 14 Jun 2018 14:44:36 +0900
> Subject: [PATCH] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved
>
> There is a kernel panic that is triggered when reading /proc/kpageflags
> on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
>
> BUG: unable to handle kernel paging request at fffffffffffffffe
> PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
> Oops: 0000 [#1] SMP PTI
> CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
> RIP: 0010:stable_page_flags+0x27/0x3c0
> Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
> RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
> RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
> RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
> RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
> R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
> R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
> FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
> Call Trace:
> kpageflags_read+0xc7/0x120
> proc_reg_read+0x3c/0x60
> __vfs_read+0x36/0x170
> vfs_read+0x89/0x130
> ksys_pread64+0x71/0x90
> do_syscall_64+0x5b/0x160
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7efc42e75e23
> Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
>
> According to kernel bisection, this problem became visible due to commit
> f7f99100d8d9 which changes how struct pages are initialized.
>
> Memblock layout affects the pfn ranges covered by node/zone. Consider
> that we have a VM with 2 NUMA nodes and each node has 4GB memory, and
> the default (no memmap= given) memblock layout is like below:
>
> MEMBLOCK configuration:
> memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
> memory.cnt = 0x4
> memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
> memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
> memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
> memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
> ...
>
> If you give memmap=1G!4G (so it just covers memory[0x2]),
> the range [0x100000000-0x13fffffff] is gone:
>
> MEMBLOCK configuration:
> memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
> memory.cnt = 0x3
> memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
> memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
> memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
> ...
>
> This shrinks node 0's pfn range, because that range is calculated from
> the address range of memblock.memory. So some of the struct pages in
> the gap range are left uninitialized.
>
> We have a function zero_resv_unavail() which zeroes the struct pages
> within the reserved unavailable ranges (i.e. memblock.reserved &&
> !memblock.memory). This patch utilizes it to cover all unavailable
> ranges by putting them into memblock.reserved.
>
> Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
> Signed-off-by: Naoya Horiguchi <[email protected]>
> Suggested-by: Oscar Salvador <[email protected]>
> Tested-by: Oscar Salvador <[email protected]>
> ---
> arch/x86/kernel/e820.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index d1f25c831447..d15ef47ea354 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1248,6 +1248,7 @@ void __init e820__memblock_setup(void)
> {
> int i;
> u64 end;
> + u64 next = 0;
>
> /*
> * The bootstrap memblock region count maximum is 128 entries
> @@ -1270,6 +1271,17 @@ void __init e820__memblock_setup(void)
> if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> continue;
>
> + /*
> + * Ranges unavailable in E820_TYPE_RAM are put into
> + * memblock.reserved, to make sure that struct pages in such
> + * regions are not left uninitialized after bootup.
> + */
> + if (entry->type == E820_TYPE_RAM)
> + if (next < entry->addr) {
> + memblock_reserve (next, next + (entry->addr - next));
> + next = end;
> + }
> +
> memblock_add(entry->addr, entry->size);
> }

Sorry, but this patch is broken.
While I no longer see the original failure, it somehow cuts the available
memory down. I did not have time to check why.
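
(A guess at the cause, for what it's worth: memblock_reserve() takes a base
and a size, but the hunk passes "next + (entry->addr - next)" -- which
simplifies to entry->addr, an end address -- as the size, so each call
reserves far beyond the intended gap. A minimal sketch of what the call
presumably meant, reserving just the hole [next, entry->addr):

	if (entry->type == E820_TYPE_RAM) {
		if (next < entry->addr)
			/* second argument is a size, not an end address */
			memblock_reserve(next, entry->addr - next);
		next = end;
	}

Note that the version below also advances its cursor on every entry, not
only when a gap is found, which avoids this kind of bookkeeping slip.)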

So I think that for now we should stick to your patch that touches the same code:

=======
@@ -1248,6 +1276,8 @@ void __init e820__memblock_setup(void)
{
int i;
u64 end;
+ u64 next;
+ u64 addr = 0;

/*
* The bootstrap memblock region count maximum is 128 entries
@@ -1260,17 +1290,21 @@ void __init e820__memblock_setup(void)
*/
memblock_allow_resize();

for (i = 0; i < e820_table->nr_entries; i++) {
struct e820_entry *entry = &e820_table->entries[i];

end = entry->addr + entry->size;
+ if (addr < entry->addr)
+ memblock_reserve(addr, entry->addr - addr);
+ addr = end;
if (end != (resource_size_t)end)
continue;

if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
- continue;
-
- memblock_add(entry->addr, entry->size);
+ memblock_reserve(entry->addr, entry->size);
+ else
+ memblock_add(entry->addr, entry->size);
=======

I checked it, and with that version everything looks fine.

>
> --
> 2.7.4
>

Best Regards
Oscar Salvador


2018-06-15 01:09:09

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v1] mm: zero remaining unavailable struct pages (Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM)

On Thu, Jun 14, 2018 at 09:00:50AM +0200, Michal Hocko wrote:
> On Thu 14-06-18 05:16:18, Naoya Horiguchi wrote:
> > On Wed, Jun 13, 2018 at 11:07:00AM +0200, Michal Hocko wrote:
> > > On Wed 13-06-18 05:41:08, Naoya Horiguchi wrote:
> > > [...]
> > > > From: Naoya Horiguchi <[email protected]>
> > > > Date: Wed, 13 Jun 2018 12:43:27 +0900
> > > > Subject: [PATCH] mm: zero remaining unavailable struct pages
> > > >
> > > > There is a kernel panic that is triggered when reading /proc/kpageflags
> > > > on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
> > > >
> > > > BUG: unable to handle kernel paging request at fffffffffffffffe
> > > > PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
> > > > Oops: 0000 [#1] SMP PTI
> > > > CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
> > > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
> > > > RIP: 0010:stable_page_flags+0x27/0x3c0
> > > > Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
> > > > RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
> > > > RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
> > > > RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
> > > > RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
> > > > R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
> > > > R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
> > > > FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
> > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
> > > > Call Trace:
> > > > kpageflags_read+0xc7/0x120
> > > > proc_reg_read+0x3c/0x60
> > > > __vfs_read+0x36/0x170
> > > > vfs_read+0x89/0x130
> > > > ksys_pread64+0x71/0x90
> > > > do_syscall_64+0x5b/0x160
> > > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > RIP: 0033:0x7efc42e75e23
> > > > Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
> > > >
> > > > According to kernel bisection, this problem became visible due to commit
> > > > f7f99100d8d9 which changes how struct pages are initialized.
> > > >
> > > > Memblock layout affects the pfn ranges covered by node/zone. Consider
> > > > that we have a VM with 2 NUMA nodes and each node has 4GB memory, and
> > > > the default (no memmap= given) memblock layout is like below:
> > > >
> > > > MEMBLOCK configuration:
> > > > memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
> > > > memory.cnt = 0x4
> > > > memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
> > > > memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
> > > > memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
> > > > memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
> > > > ...
> > > >
> > > > If you give memmap=1G!4G (so it just covers memory[0x2]),
> > > > the range [0x100000000-0x13fffffff] is gone:
> > > >
> > > > MEMBLOCK configuration:
> > > > memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
> > > > memory.cnt = 0x3
> > > > memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
> > > > memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
> > > > memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
> > > > ...
> > > >
> > > > This shrinks node 0's pfn range, because that range is calculated from
> > > > the address range of memblock.memory. So some of the struct pages in
> > > > the gap range are left uninitialized.
> > > >
> > > > We have a function zero_resv_unavail() which zeroes the struct
> > > > pages outside memblock.memory, but currently it covers only the reserved
> > > > unavailable ranges (i.e. memblock.reserved && !memblock.memory).
> > > > This patch extends it to cover all unavailable ranges, which fixes
> > > > the reported issue.
> > >
> > > Thanks for pinpointing this, Naoya! I am wondering why we cannot
> > > simply mark the excluded ranges as reserved instead.
> >
> > I tried your idea with the change below, and it also fixes the kernel panic.
> >
> > ---
> > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > index d1f25c831447..2cef120535d4 100644
> > --- a/arch/x86/kernel/e820.c
> > +++ b/arch/x86/kernel/e820.c
> > @@ -1248,6 +1248,7 @@ void __init e820__memblock_setup(void)
> > {
> > int i;
> > u64 end;
> > + u64 addr = 0;
> >
> > /*
> > * The bootstrap memblock region count maximum is 128 entries
> > @@ -1264,13 +1265,16 @@ void __init e820__memblock_setup(void)
> > struct e820_entry *entry = &e820_table->entries[i];
> >
> > end = entry->addr + entry->size;
> > + if (addr < entry->addr)
> > + memblock_reserve(addr, entry->addr - addr);
> > + addr = end;
> > if (end != (resource_size_t)end)
> > continue;
> >
> > if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > - continue;
> > -
> > - memblock_add(entry->addr, entry->size);
> > + memblock_reserve(entry->addr, entry->size);
> > + else
> > + memblock_add(entry->addr, entry->size);
> > }
> >
> > /* Throw away partial pages: */
>
> Yes, this looks so much better. Although I was focusing more on
> e820__range_remove.

Could you go into more detail on e820__range_remove?
Would it help fix the issue in a better way?

>
> > My concern is that there are a few E820 memory types other than
> > E820_TYPE_RAM and E820_TYPE_RESERVED, and I'm not sure that putting them
> > all into memblock.reserved is really acceptable.
>
> Why wouldn't it? I mean, reserved ranges are never to be touched unless
> the owner of that range frees them up.

I was just unfamiliar with how E820_TYPE_* ranges are supposed to be handled.
So this kind of confirmation is helpful for me, thank you.
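
For anyone following the type question, this is roughly the full list from
arch/x86/include/asm/e820/types.h around this kernel version (quoted from
memory, so double-check the exact values):

	enum e820_type {
		E820_TYPE_RAM		= 1,
		E820_TYPE_RESERVED	= 2,
		E820_TYPE_ACPI		= 3,
		E820_TYPE_NVS		= 4,
		E820_TYPE_UNUSABLE	= 5,
		E820_TYPE_PMEM		= 7,
		E820_TYPE_PRAM		= 12,	/* what memmap=nn!ss creates */
		E820_TYPE_RESERVED_KERN	= 128,
	};

Only E820_TYPE_RAM and E820_TYPE_RESERVED_KERN entries end up in
memblock.memory, which is why all the other types (and the holes) need
explicit handling.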

Naoya Horiguchi

2018-06-15 01:09:10

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v2] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

On Thu, Jun 14, 2018 at 01:24:37PM +0200, Oscar Salvador wrote:
> On Thu, Jun 14, 2018 at 09:21:03AM +0200, Oscar Salvador wrote:
> > On Thu, Jun 14, 2018 at 06:34:55AM +0000, Naoya Horiguchi wrote:
> > > On Thu, Jun 14, 2018 at 07:38:59AM +0200, Oscar Salvador wrote:
> > > > On Thu, Jun 14, 2018 at 05:16:18AM +0000, Naoya Horiguchi wrote:
> > > ...
> > > > >
> > > > > My concern is that there are a few E820 memory types other than
> > > > > E820_TYPE_RAM and E820_TYPE_RESERVED, and I'm not sure that putting them
> > > > > all into memblock.reserved is really acceptable.
> > > >
> > > > Hi Naoya,
> > > >
> > > > Maybe you could just add to memblock.reserved all unavailable ranges within
> > > > E820_TYPE_RAM.
> > > > Actually, in your original patch, you are walking memblock.memory, which should
> > > > only contain E820_TYPE_RAM ranges (talking about x86).
> > > >
> > > > So I think the below would do the trick as well?
> > > >
> > > > @@ -1248,6 +1276,7 @@ void __init e820__memblock_setup(void)
> > > > {
> > > > int i;
> > > > u64 end;
> > > > + u64 next = 0;
> > > >
> > > > /*
> > > > * The bootstrap memblock region count maximum is 128 entries
> > > >
> > > > @@ -1269,6 +1299,14 @@ void __init e820__memblock_setup(void)
> > > >
> > > > if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > > > continue;
> > > >
> > > > +
> > > > + if (entry->type == E820_TYPE_RAM)
> > > > + if (next < entry->addr) {
> > > > + memblock_reserve (next, next + (entry->addr - next));
> > > > + next = end;
> > > > + }
> > > >
> > > > With the above patch, I can no longer see the issues either.
> > >
> > > I double-checked and this change looks good to me.
> > >
> > > >
> > > > There is a difference, though, between this and your original patch.
> > > > In your original patch, you are just zeroing the pages, while with this one (or with your second patch),
> > > > we will zero the page in reserve_bootmem_region(), but that function also initializes
> > > > some other fields of the struct page:
> > > >
> > > > mm_zero_struct_page(page);
> > > > set_page_links(page, zone, nid, pfn);
> > > > init_page_count(page);
> > > > page_mapcount_reset(page);
> > > > page_cpupid_reset_last(page);
> > > >
> > > > So I am not sure we want to bother doing that for pages that are really unreachable.
> > >
> > > I think that, considering /proc/kpageflags can inspect them, some data
> > > (even if it's trivial) might be better than just zeros.
> > >
> > > Here's the updated patch.
> > > Thanks for the suggestion and testing!
> > >
> > > ---
> > > From: Naoya Horiguchi <[email protected]>
> > > Date: Thu, 14 Jun 2018 14:44:36 +0900
> > > Subject: [PATCH] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved
> > >
> > > There is a kernel panic that is triggered when reading /proc/kpageflags
> > > on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
> > >
> > > BUG: unable to handle kernel paging request at fffffffffffffffe
> > > PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
> > > Oops: 0000 [#1] SMP PTI
> > > CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
> > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
> > > RIP: 0010:stable_page_flags+0x27/0x3c0
> > > Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
> > > RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
> > > RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
> > > RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
> > > RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
> > > R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
> > > R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
> > > FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
> > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
> > > Call Trace:
> > > kpageflags_read+0xc7/0x120
> > > proc_reg_read+0x3c/0x60
> > > __vfs_read+0x36/0x170
> > > vfs_read+0x89/0x130
> > > ksys_pread64+0x71/0x90
> > > do_syscall_64+0x5b/0x160
> > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > RIP: 0033:0x7efc42e75e23
> > > Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
> > >
> > > According to kernel bisection, this problem became visible due to commit
> > > f7f99100d8d9 which changes how struct pages are initialized.
> > >
> > > Memblock layout affects the pfn ranges covered by node/zone. Consider
> > > that we have a VM with 2 NUMA nodes and each node has 4GB memory, and
> > > the default (no memmap= given) memblock layout is like below:
> > >
> > > MEMBLOCK configuration:
> > > memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
> > > memory.cnt = 0x4
> > > memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
> > > memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
> > > memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
> > > memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
> > > ...
> > >
> > > If you give memmap=1G!4G (so it just covers memory[0x2]),
> > > the range [0x100000000-0x13fffffff] is gone:
> > >
> > > MEMBLOCK configuration:
> > > memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
> > > memory.cnt = 0x3
> > > memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
> > > memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
> > > memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
> > > ...
> > >
> > > This shrinks node 0's pfn range, because that range is calculated from
> > > the address range of memblock.memory. So some of the struct pages in
> > > the gap range are left uninitialized.
> > >
> > > We have a function zero_resv_unavail() which zeroes the struct pages
> > > within the reserved unavailable ranges (i.e. memblock.reserved &&
> > > !memblock.memory). This patch utilizes it to cover all unavailable
> > > ranges by putting them into memblock.reserved.
>
> I just spotted this.
> It seems that the changelog has not been updated.
> It still refers to zero_resv_unavail(), while this patch takes
> a different approach.

Actually I updated this paragraph a little. v1 changed zero_resv_unavail()
itself to zero every range outside memblock.memory.
v2 keeps zero_resv_unavail() as is, but by newly putting some ranges
into memblock.reserved, those ranges are now handled by zero_resv_unavail(),
so I still mention this function.

It seems that with the current patch we zero twice, in zero_resv_unavail()
and reserve_bootmem_region(), so there might be room for improvement in
removing the duplication.
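
Roughly, the two paths look like this (a simplified sketch of the current
mm/page_alloc.c code, not exact):

	/* walks memblock.reserved && !memblock.memory */
	void zero_resv_unavail(void)
	{
		for_each_resv_unavail_range(i, &start, &end)
			for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++)
				mm_zero_struct_page(pfn_to_page(pfn));	/* zeroing #1 */
	}

	/* called for every memblock.reserved range */
	void reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
	{
		for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++)
			init_reserved_page(pfn);	/* zeroing #2, via
							 * __init_single_page() */
	}

Ranges that this patch newly puts into memblock.reserved are visited by
both.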

Thanks,
Naoya Horiguchi

>
> > >
> > > Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
> > > Signed-off-by: Naoya Horiguchi <[email protected]>
> > > Suggested-by: Oscar Salvador <[email protected]>
> > > Tested-by: Oscar Salvador <[email protected]>
> > > ---
> > > arch/x86/kernel/e820.c | 12 ++++++++++++
> > > 1 file changed, 12 insertions(+)
> > >
> > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > > index d1f25c831447..d15ef47ea354 100644
> > > --- a/arch/x86/kernel/e820.c
> > > +++ b/arch/x86/kernel/e820.c
> > > @@ -1248,6 +1248,7 @@ void __init e820__memblock_setup(void)
> > > {
> > > int i;
> > > u64 end;
> > > + u64 next = 0;
> > >
> > > /*
> > > * The bootstrap memblock region count maximum is 128 entries
> > > @@ -1270,6 +1271,17 @@ void __init e820__memblock_setup(void)
> > > if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > > continue;
> > >
> > > + /*
> > > + * Ranges unavailable in E820_TYPE_RAM are put into
> > > + * memblock.reserved, to make sure that struct pages in such
> > > + * regions are not left uninitialized after bootup.
> > > + */
> > > + if (entry->type == E820_TYPE_RAM)
> > > + if (next < entry->addr) {
> > > + memblock_reserve (next, next + (entry->addr - next));
> > > + next = end;
> > > + }
> > > +
> > > memblock_add(entry->addr, entry->size);
> > > }
> >
> > Thanks Naoya!
> >
> > Andrew: In case you consider taking this patch instead of the first one,
> > could you please replace "[email protected]" with "[email protected]"?
> >
> > Thanks
> >
> > Best Regards
> > Oscar Salvador
> >
>
> Best Regards
> Oscar Salvador
>

2018-06-15 01:13:00

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v2] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

On Thu, Jun 14, 2018 at 11:30:34PM +0200, Oscar Salvador wrote:
> On Thu, Jun 14, 2018 at 06:34:55AM +0000, Naoya Horiguchi wrote:
> > On Thu, Jun 14, 2018 at 07:38:59AM +0200, Oscar Salvador wrote:
> > > On Thu, Jun 14, 2018 at 05:16:18AM +0000, Naoya Horiguchi wrote:
> > ...
> > > >
> > > > My concern is that there are a few E820 memory types other than
> > > > E820_TYPE_RAM and E820_TYPE_RESERVED, and I'm not sure that putting them
> > > > all into memblock.reserved is really acceptable.
> > >
> > > Hi Naoya,
> > >
> > > Maybe you could just add to memblock.reserved all unavailable ranges within
> > > E820_TYPE_RAM.
> > > Actually, in your original patch, you are walking memblock.memory, which should
> > > only contain E820_TYPE_RAM ranges (talking about x86).
> > >
> > > So I think the below would do the trick as well?
> > >
> > > @@ -1248,6 +1276,7 @@ void __init e820__memblock_setup(void)
> > > {
> > > int i;
> > > u64 end;
> > > + u64 next = 0;
> > >
> > > /*
> > > * The bootstrap memblock region count maximum is 128 entries
> > >
> > > @@ -1269,6 +1299,14 @@ void __init e820__memblock_setup(void)
> > >
> > > if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > > continue;
> > >
> > > +
> > > + if (entry->type == E820_TYPE_RAM)
> > > + if (next < entry->addr) {
> > > + memblock_reserve (next, next + (entry->addr - next));
> > > + next = end;
> > > + }
> > >
> > > With the above patch, I can no longer see the issues either.
> >
> > I double-checked and this change looks good to me.
> >
> > >
> > > There is a difference, though, between this and your original patch.
> > > In your original patch, you are just zeroing the pages, while with this one (or with your second patch),
> > > we will zero the page in reserve_bootmem_region(), but that function also initializes
> > > some other fields of the struct page:
> > >
> > > mm_zero_struct_page(page);
> > > set_page_links(page, zone, nid, pfn);
> > > init_page_count(page);
> > > page_mapcount_reset(page);
> > > page_cpupid_reset_last(page);
> > >
> > > So I am not sure we want to bother doing that for pages that are really unreachable.
> >
> > I think that, considering /proc/kpageflags can inspect them, some data
> > (even if it's trivial) might be better than just zeros.
> >
> > Here's the updated patch.
> > Thanks for the suggestion and testing!
> >
> > ---
> > From: Naoya Horiguchi <[email protected]>
> > Date: Thu, 14 Jun 2018 14:44:36 +0900
> > Subject: [PATCH] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved
> >
> > There is a kernel panic that is triggered when reading /proc/kpageflags
> > on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
> >
> > BUG: unable to handle kernel paging request at fffffffffffffffe
> > PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
> > Oops: 0000 [#1] SMP PTI
> > CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
> > RIP: 0010:stable_page_flags+0x27/0x3c0
> > Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
> > RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
> > RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
> > RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
> > RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
> > R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
> > R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
> > FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
> > Call Trace:
> > kpageflags_read+0xc7/0x120
> > proc_reg_read+0x3c/0x60
> > __vfs_read+0x36/0x170
> > vfs_read+0x89/0x130
> > ksys_pread64+0x71/0x90
> > do_syscall_64+0x5b/0x160
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > RIP: 0033:0x7efc42e75e23
> > Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
> >
> > According to kernel bisection, this problem became visible due to commit
> > f7f99100d8d9 which changes how struct pages are initialized.
> >
> > Memblock layout affects the pfn ranges covered by node/zone. Consider
> > that we have a VM with 2 NUMA nodes and each node has 4GB memory, and
> > the default (no memmap= given) memblock layout is like below:
> >
> > MEMBLOCK configuration:
> > memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
> > memory.cnt = 0x4
> > memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
> > memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
> > memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
> > memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
> > ...
> >
> > If you give memmap=1G!4G (so it just covers memory[0x2]),
> > the range [0x100000000-0x13fffffff] is gone:
> >
> > MEMBLOCK configuration:
> > memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
> > memory.cnt = 0x3
> > memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
> > memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
> > memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
> > ...
> >
> > This shrinks node 0's pfn range, because that range is calculated from
> > the address range of memblock.memory. So some of the struct pages in
> > the gap range are left uninitialized.
> >
> > We have a function zero_resv_unavail() which zeroes the struct pages
> > within the reserved unavailable ranges (i.e. memblock.reserved &&
> > !memblock.memory). This patch utilizes it to cover all unavailable
> > ranges by putting them into memblock.reserved.
> >
> > Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
> > Signed-off-by: Naoya Horiguchi <[email protected]>
> > Suggested-by: Oscar Salvador <[email protected]>
> > Tested-by: Oscar Salvador <[email protected]>
> > ---
> > arch/x86/kernel/e820.c | 12 ++++++++++++
> > 1 file changed, 12 insertions(+)
> >
> > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > index d1f25c831447..d15ef47ea354 100644
> > --- a/arch/x86/kernel/e820.c
> > +++ b/arch/x86/kernel/e820.c
> > @@ -1248,6 +1248,7 @@ void __init e820__memblock_setup(void)
> > {
> > int i;
> > u64 end;
> > + u64 next = 0;
> >
> > /*
> > * The bootstrap memblock region count maximum is 128 entries
> > @@ -1270,6 +1271,17 @@ void __init e820__memblock_setup(void)
> > if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > continue;
> >
> > + /*
> > + * Ranges unavailable in E820_TYPE_RAM are put into
> > + * memblock.reserved, to make sure that struct pages in such
> > + * regions are not left uninitialized after bootup.
> > + */
> > + if (entry->type == E820_TYPE_RAM)
> > + if (next < entry->addr) {
> > + memblock_reserve (next, next + (entry->addr - next));
> > + next = end;
> > + }
> > +
> > memblock_add(entry->addr, entry->size);
> > }
>
> Sorry, but this patch is broken.
> While I no longer see the original failure, it somehow cuts the available
> memory down. I did not have time to check why.
>
> So I think that for now we should stick to your patch that touches the same code:
>
> =======
> @@ -1248,6 +1276,8 @@ void __init e820__memblock_setup(void)
> {
> int i;
> u64 end;
> + u64 next;
> + u64 addr = 0;
>
> /*
> * The bootstrap memblock region count maximum is 128 entries
> @@ -1260,17 +1290,21 @@ void __init e820__memblock_setup(void)
> */
> memblock_allow_resize();
>
> for (i = 0; i < e820_table->nr_entries; i++) {
> struct e820_entry *entry = &e820_table->entries[i];
>
> end = entry->addr + entry->size;
> + if (addr < entry->addr)
> + memblock_reserve(addr, entry->addr - addr);
> + addr = end;
> if (end != (resource_size_t)end)
> continue;
>
> if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> - continue;
> -
> - memblock_add(entry->addr, entry->size);
> + memblock_reserve(entry->addr, entry->size);
> + else
> + memblock_add(entry->addr, entry->size);
> =======
>
> I checked it, and with that version everything looks fine.

OK, I'll go back to this version.

Thanks,
Naoya Horiguchi

2018-06-15 07:33:58

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH v3] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

Hi,
I updated the patch, so let me share it.

# I'll be offline early in the next week, maybe come back on Wednesday.
# Have a nice weekend.

Thanks,
Naoya Horiguchi
---
From: Naoya Horiguchi <[email protected]>
Date: Thu, 14 Jun 2018 16:04:36 +0900
Subject: [PATCH] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

There is a kernel panic that is triggered when reading /proc/kpageflags
on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':

BUG: unable to handle kernel paging request at fffffffffffffffe
PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
RIP: 0010:stable_page_flags+0x27/0x3c0
Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
Call Trace:
kpageflags_read+0xc7/0x120
proc_reg_read+0x3c/0x60
__vfs_read+0x36/0x170
vfs_read+0x89/0x130
ksys_pread64+0x71/0x90
do_syscall_64+0x5b/0x160
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7efc42e75e23
Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24

According to kernel bisection, this problem became visible due to commit
f7f99100d8d9 which changes how struct pages are initialized.

Memblock layout affects the pfn ranges covered by node/zone. Consider
that we have a VM with 2 NUMA nodes and each node has 4GB memory, and
the default (no memmap= given) memblock layout is like below:

MEMBLOCK configuration:
memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
memory.cnt = 0x4
memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
...

If you give memmap=1G!4G (so it just covers memory[0x2]),
the range [0x100000000-0x13fffffff] is gone:

MEMBLOCK configuration:
memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
memory.cnt = 0x3
memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
...

This shrinks node 0's pfn range, because that range is calculated from
the address range of memblock.memory. So some of the struct pages in
the gap range are left uninitialized.

We have a function zero_resv_unavail() which zeroes the struct pages
within the reserved unavailable ranges (i.e. memblock.reserved &&
!memblock.memory). This patch utilizes it to cover all unavailable
ranges by putting them into memblock.reserved.

Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Naoya Horiguchi <[email protected]>
Tested-by: Oscar Salvador <[email protected]>
---
arch/x86/kernel/e820.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index d1f25c831447..c88c23c658c1 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1248,6 +1248,7 @@ void __init e820__memblock_setup(void)
{
int i;
u64 end;
+ u64 addr = 0;

/*
* The bootstrap memblock region count maximum is 128 entries
@@ -1264,13 +1265,21 @@ void __init e820__memblock_setup(void)
struct e820_entry *entry = &e820_table->entries[i];

end = entry->addr + entry->size;
+ if (addr < entry->addr)
+ memblock_reserve(addr, entry->addr - addr);
+ addr = end;
if (end != (resource_size_t)end)
continue;

+ /*
+ * all !E820_TYPE_RAM ranges (including gap ranges) are put
+ * into memblock.reserved to make sure that struct pages in
+ * such regions are not left uninitialized after bootup.
+ */
if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
- continue;
-
- memblock_add(entry->addr, entry->size);
+ memblock_reserve(entry->addr, entry->size);
+ else
+ memblock_add(entry->addr, entry->size);
}

/* Throw away partial pages: */
--
2.7.4
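
To make the new flow concrete, here is an illustrative walk of the loop for
the memmap=1G!4G layout from the changelog (entry boundaries simplified;
memmap=nn!ss turns the range into an E820_TYPE_PRAM entry):

	/* cursor addr starts at 0 */
	[0x1000,        0x9f000)     RAM      -> memblock_add(),  addr = 0x9f000
	[0x9f000,       0x100000)    hole/!RAM -> memblock_reserve()
	[0x100000,      0xbffd7000)  RAM      -> memblock_add(),  addr = 0xbffd7000
	[0xbffd7000,    0x100000000) hole/!RAM -> memblock_reserve()
	[0x100000000,   0x140000000) PRAM (memmap=1G!4G) -> memblock_reserve()
	[0x140000000,   0x240000000) RAM      -> memblock_add()

Every physical range now ends up in either memblock.memory or
memblock.reserved, so zero_resv_unavail() initializes the struct pages
that the shrunken node 0 span would otherwise leave untouched.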


2018-06-15 08:41:02

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v1] mm: zero remaining unavailable struct pages (Re: kernel panic in reading /proc/kpageflags when enabling RAM-simulated PMEM)

On Fri 15-06-18 01:07:22, Naoya Horiguchi wrote:
> On Thu, Jun 14, 2018 at 09:00:50AM +0200, Michal Hocko wrote:
> > On Thu 14-06-18 05:16:18, Naoya Horiguchi wrote:
> > > On Wed, Jun 13, 2018 at 11:07:00AM +0200, Michal Hocko wrote:
> > > > On Wed 13-06-18 05:41:08, Naoya Horiguchi wrote:
> > > > [...]
> > > > > From: Naoya Horiguchi <[email protected]>
> > > > > Date: Wed, 13 Jun 2018 12:43:27 +0900
> > > > > Subject: [PATCH] mm: zero remaining unavailable struct pages
> > > > >
> > > > > There is a kernel panic that is triggered when reading /proc/kpageflags
> > > > > on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
> > > > >
> > > > > BUG: unable to handle kernel paging request at fffffffffffffffe
> > > > > PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
> > > > > Oops: 0000 [#1] SMP PTI
> > > > > CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
> > > > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
> > > > > RIP: 0010:stable_page_flags+0x27/0x3c0
> > > > > Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
> > > > > RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
> > > > > RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
> > > > > RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
> > > > > RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
> > > > > R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
> > > > > R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
> > > > > FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
> > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
> > > > > Call Trace:
> > > > > kpageflags_read+0xc7/0x120
> > > > > proc_reg_read+0x3c/0x60
> > > > > __vfs_read+0x36/0x170
> > > > > vfs_read+0x89/0x130
> > > > > ksys_pread64+0x71/0x90
> > > > > do_syscall_64+0x5b/0x160
> > > > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > RIP: 0033:0x7efc42e75e23
> > > > > Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
> > > > >
> > > > > According to kernel bisection, this problem became visible due to commit
> > > > > f7f99100d8d9 which changes how struct pages are initialized.
> > > > >
> > > > > Memblock layout affects the pfn ranges covered by node/zone. Consider
> > > > > that we have a VM with 2 NUMA nodes and each node has 4GB memory, and
> > > > > the default (no memmap= given) memblock layout is like below:
> > > > >
> > > > > MEMBLOCK configuration:
> > > > > memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
> > > > > memory.cnt = 0x4
> > > > > memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
> > > > > memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
> > > > > memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
> > > > > memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
> > > > > ...
> > > > >
> > > > > If you give memmap=1G!4G (so it just covers memory[0x2]),
> > > > > the range [0x100000000-0x13fffffff] is gone:
> > > > >
> > > > > MEMBLOCK configuration:
> > > > > memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
> > > > > memory.cnt = 0x3
> > > > > memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
> > > > > memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
> > > > > memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
> > > > > ...
> > > > >
> > > > > This shrinks node 0's pfn range, because that range is calculated from
> > > > > the address range of memblock.memory. So some of the struct pages in
> > > > > the gap range are left uninitialized.
> > > > >
> > > > > We have a function zero_resv_unavail() which zeroes the struct
> > > > > pages outside memblock.memory, but currently it covers only the reserved
> > > > > unavailable ranges (i.e. memblock.reserved && !memblock.memory).
> > > > > This patch extends it to cover all unavailable ranges, which fixes
> > > > > the reported issue.
> > > >
> > > > Thanks for pinpointing this, Naoya! I am wondering why we cannot
> > > > simply mark the excluded ranges as reserved instead.
> > >
> > > I tried your idea with the change below, and it also fixes the kernel panic.
> > >
> > > ---
> > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > > index d1f25c831447..2cef120535d4 100644
> > > --- a/arch/x86/kernel/e820.c
> > > +++ b/arch/x86/kernel/e820.c
> > > @@ -1248,6 +1248,7 @@ void __init e820__memblock_setup(void)
> > > {
> > > int i;
> > > u64 end;
> > > + u64 addr = 0;
> > >
> > > /*
> > > * The bootstrap memblock region count maximum is 128 entries
> > > @@ -1264,13 +1265,16 @@ void __init e820__memblock_setup(void)
> > > struct e820_entry *entry = &e820_table->entries[i];
> > >
> > > end = entry->addr + entry->size;
> > > + if (addr < entry->addr)
> > > + memblock_reserve(addr, entry->addr - addr);
> > > + addr = end;
> > > if (end != (resource_size_t)end)
> > > continue;
> > >
> > > if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > > - continue;
> > > -
> > > - memblock_add(entry->addr, entry->size);
> > > + memblock_reserve(entry->addr, entry->size);
> > > + else
> > > + memblock_add(entry->addr, entry->size);
> > > }
> > >
> > > /* Throw away partial pages: */
> >
> > Yes, this looks so much better. Although I was focusing more on
> > e820__range_remove.
>
> Could you go into more detail on e820__range_remove?
> Would it help fix the issue in a better way?

I thought that this is a general method for removing invalid ranges of
memory, so it would be a better and more obvious fix. I am not entirely
sure, though.
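
For reference, a rough sketch of that direction -- the helper exists in
arch/x86/kernel/e820.c with (as far as I remember) this signature, and the
call below is purely illustrative, with placeholder arguments, not a tested
alternative fix:

	u64 __init e820__range_remove(u64 start, u64 size,
				      enum e820_type old_type, bool check_type);

	/* hypothetically: drop a known-bad range from the e820 table up
	 * front, instead of reserving it during memblock setup
	 */
	e820__range_remove(bad_start, bad_size, E820_TYPE_RAM, true);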

--
Michal Hocko
SUSE Labs

2018-06-15 08:42:31

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v3] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

On Fri 15-06-18 07:29:48, Naoya Horiguchi wrote:
[...]
> From: Naoya Horiguchi <[email protected]>
> Date: Thu, 14 Jun 2018 16:04:36 +0900
> Subject: [PATCH] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved
>
> There is a kernel panic that is triggered when reading /proc/kpageflags
> on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
>
> BUG: unable to handle kernel paging request at fffffffffffffffe
> PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
> Oops: 0000 [#1] SMP PTI
> CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
> RIP: 0010:stable_page_flags+0x27/0x3c0
> Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
> RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
> RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
> RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
> RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
> R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
> R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
> FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
> Call Trace:
> kpageflags_read+0xc7/0x120
> proc_reg_read+0x3c/0x60
> __vfs_read+0x36/0x170
> vfs_read+0x89/0x130
> ksys_pread64+0x71/0x90
> do_syscall_64+0x5b/0x160
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7efc42e75e23
> Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
>
> According to kernel bisection, this problem became visible due to commit
> f7f99100d8d9 which changes how struct pages are initialized.
>
> Memblock layout affects the pfn ranges covered by node/zone. Consider
> that we have a VM with 2 NUMA nodes and each node has 4GB memory, and
> the default (no memmap= given) memblock layout is like below:
>
> MEMBLOCK configuration:
> memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
> memory.cnt = 0x4
> memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
> memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
> memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
> memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
> ...
>
> If you give memmap=1G!4G (so it just covers memory[0x2]),
> the range [0x100000000-0x13fffffff] is gone:
>
> MEMBLOCK configuration:
> memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
> memory.cnt = 0x3
> memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
> memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
> memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
> ...
>
> This shrinks node 0's pfn range, because that range is calculated from
> the address range of memblock.memory. So some of the struct pages in
> the gap range are left uninitialized.
>
> We have a function zero_resv_unavail() which zeroes the struct pages
> within the reserved unavailable ranges (i.e. memblock.reserved &&
> !memblock.memory). This patch utilizes it to cover all unavailable
> ranges by putting them into memblock.reserved.
>
> Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
> Signed-off-by: Naoya Horiguchi <[email protected]>
> Tested-by: Oscar Salvador <[email protected]>

OK, this makes sense to me. It is definitely much better than the
original attempt.

Unless I am missing something, this should be correct:
Acked-by: Michal Hocko <[email protected]>

> ---
> arch/x86/kernel/e820.c | 15 ++++++++++++---
> 1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index d1f25c831447..c88c23c658c1 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1248,6 +1248,7 @@ void __init e820__memblock_setup(void)
> {
> int i;
> u64 end;
> + u64 addr = 0;
>
> /*
> * The bootstrap memblock region count maximum is 128 entries
> @@ -1264,13 +1265,21 @@ void __init e820__memblock_setup(void)
> struct e820_entry *entry = &e820_table->entries[i];
>
> end = entry->addr + entry->size;
> + if (addr < entry->addr)
> + memblock_reserve(addr, entry->addr - addr);
> + addr = end;
> if (end != (resource_size_t)end)
> continue;
>
> + /*
> + * all !E820_TYPE_RAM ranges (including gap ranges) are put
> + * into memblock.reserved to make sure that struct pages in
> + * such regions are not left uninitialized after bootup.
> + */
> if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> - continue;
> -
> - memblock_add(entry->addr, entry->size);
> + memblock_reserve(entry->addr, entry->size);
> + else
> + memblock_add(entry->addr, entry->size);
> }
>
> /* Throw away partial pages: */
> --
> 2.7.4

--
Michal Hocko
SUSE Labs

2018-06-15 14:03:30

by Pavel Tatashin

[permalink] [raw]
Subject: Re: [PATCH v3] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

On 18-06-15 10:41:42, Michal Hocko wrote:
> On Fri 15-06-18 07:29:48, Naoya Horiguchi wrote:
> [...]
> > From: Naoya Horiguchi <[email protected]>
> > Date: Thu, 14 Jun 2018 16:04:36 +0900
> > Subject: [PATCH] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved
> >
> > There is a kernel panic that is triggered when reading /proc/kpageflags
> > on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
> >
> > BUG: unable to handle kernel paging request at fffffffffffffffe
> > PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
> > Oops: 0000 [#1] SMP PTI
> > CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
> > RIP: 0010:stable_page_flags+0x27/0x3c0
> > Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
> > RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
> > RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
> > RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
> > RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
> > R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
> > R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
> > FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
> > Call Trace:
> > kpageflags_read+0xc7/0x120
> > proc_reg_read+0x3c/0x60
> > __vfs_read+0x36/0x170
> > vfs_read+0x89/0x130
> > ksys_pread64+0x71/0x90
> > do_syscall_64+0x5b/0x160
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > RIP: 0033:0x7efc42e75e23
> > Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
> >
> > According to kernel bisection, this problem became visible due to commit
> > f7f99100d8d9 which changes how struct pages are initialized.
> >
> > Memblock layout affects the pfn ranges covered by node/zone. Consider
> > that we have a VM with 2 NUMA nodes and each node has 4GB memory, and
> > the default (no memmap= given) memblock layout is like below:
> >
> > MEMBLOCK configuration:
> > memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
> > memory.cnt = 0x4
> > memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
> > memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
> > memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
> > memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
> > ...
> >
> > If you give memmap=1G!4G (so it just covers memory[0x2]),
> > the range [0x100000000-0x13fffffff] is gone:
> >
> > MEMBLOCK configuration:
> > memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
> > memory.cnt = 0x3
> > memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
> > memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
> > memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
> > ...
> >
> > > This shrinks node 0's pfn range, because that range is calculated from
> > > the address range of memblock.memory. So some of the struct pages in
> > > the gap range are left uninitialized.
> > >
> > > We have a function zero_resv_unavail() which zeroes the struct pages
> > > within the reserved unavailable ranges (i.e. memblock.reserved &&
> > > !memblock.memory). This patch utilizes it to cover all unavailable
> > > ranges by putting them into memblock.reserved.
> >
> > Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
> > Signed-off-by: Naoya Horiguchi <[email protected]>
> > Tested-by: Oscar Salvador <[email protected]>
>
> OK, this makes sense to me. It is definitely much better than the
> original attempt.
>
> Unless I am missing something, this should be correct:
> Acked-by: Michal Hocko <[email protected]>

First of all, thank you Naoya for finding and root-causing this issue.

So, with this fix we reserve every hole, as well as every e820 range that
is neither E820_TYPE_RAM nor E820_TYPE_RESERVED_KERN. I think this will
work, because we do a pfn_valid() check in zero_resv_unavail(), so ranges
that do not have backing struct pages will be skipped. But I am worried
about the performance implications when the holes of invalid memory are
rather large: we would have to loop through them in zero_resv_unavail()
one pfn at a time.

Therefore, we might also need to optimize zero_resv_unavail() a little like
this:

6407 if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
6408 continue;

Add before "continue":
pfn = ALIGN_DOWN(pfn, pageblock_nr_pages) + pageblock_nr_pages - 1;
At least this way we would skip a pageblock of invalid memory at a time.
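
Applied to the sketch above, the change would look something like this
(untested, just to illustrate the idea):

		for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
			if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
				/*
				 * Jump to the last pfn of the invalid
				 * pageblock; the loop increment then moves
				 * on to the next pageblock.
				 */
				pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
					+ pageblock_nr_pages - 1;
				continue;
			}
			mm_zero_struct_page(pfn_to_page(pfn));
		}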

For the patch above:
Reviewed-by: Pavel Tatashin <[email protected]>

But, I think the 2nd patch with the optimization above should go along
with this fix.

Thank you,
Pasha

>
> > ---
> > arch/x86/kernel/e820.c | 15 ++++++++++++---
> > 1 file changed, 12 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > index d1f25c831447..c88c23c658c1 100644
> > --- a/arch/x86/kernel/e820.c
> > +++ b/arch/x86/kernel/e820.c
> > @@ -1248,6 +1248,7 @@ void __init e820__memblock_setup(void)
> > {
> > int i;
> > u64 end;
> > + u64 addr = 0;
> >
> > /*
> > * The bootstrap memblock region count maximum is 128 entries
> > @@ -1264,13 +1265,21 @@ void __init e820__memblock_setup(void)
> > struct e820_entry *entry = &e820_table->entries[i];
> >
> > end = entry->addr + entry->size;
> > + if (addr < entry->addr)
> > + memblock_reserve(addr, entry->addr - addr);
> > + addr = end;
> > if (end != (resource_size_t)end)
> > continue;
> >
> > + /*
> > + * all !E820_TYPE_RAM ranges (including gap ranges) are put
> > + * into memblock.reserved to make sure that struct pages in
> > + * such regions are not left uninitialized after bootup.
> > + */
> > if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > - continue;
> > -
> > - memblock_add(entry->addr, entry->size);
> > + memblock_reserve(entry->addr, entry->size);
> > + else
> > + memblock_add(entry->addr, entry->size);
> > }
> >
> > /* Throw away partial pages: */
> > --
> > 2.7.4
>
> --
> Michal Hocko
> SUSE Labs
>

2018-06-15 14:12:13

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v3] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

On Fri 15-06-18 10:00:00, Pavel Tatashin wrote:
[...]
> But, I think the 2nd patch with the optimization above should go along
> with this fix.

Yes, ideally with some numbers.
--
Michal Hocko
SUSE Labs

2018-06-15 14:34:06

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v3] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

On Fri, Jun 15, 2018 at 10:00:00AM -0400, Pavel Tatashin wrote:
> On 18-06-15 10:41:42, Michal Hocko wrote:
> > On Fri 15-06-18 07:29:48, Naoya Horiguchi wrote:
> > [...]
> > > From: Naoya Horiguchi <[email protected]>
> > > Date: Thu, 14 Jun 2018 16:04:36 +0900
> > > Subject: [PATCH] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved
> > >
> > > [...]
> >
> > OK, this makes sense to me. It is definitely much better than the
> > original attempt.
> >
> > Unless I am missing something, this should be correct
> > Acked-by: Michal Hocko <[email protected]>
>
> First of all, thank you Naoya for finding and root-causing this issue.
>
> So, with this fix we reserve any hole and any !E820_TYPE_RAM or
> !E820_TYPE_RESERVED_KERN range in e820. I think this will work, because we
> do a pfn_valid() check in zero_resv_unavail(), so the ranges that do not
> have backing struct pages will be skipped. But I am worried about the
> performance implications when the holes of invalid memory are rather
> large: we would have to loop through them in zero_resv_unavail() one pfn
> at a time.
>
> Therefore, we might also need to optimize zero_resv_unavail() a little,
> like this:
>
> 6407                 if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
> 6408                         continue;
>
> Add before "continue":
> 	pfn = ALIGN_DOWN(pfn, pageblock_nr_pages) + pageblock_nr_pages - 1;
> At least this way we would skip a pageblock of invalid memory at a time.
>
> For the patch above:
> Reviewed-by: Pavel Tatashin <[email protected]>
>
> But, I think the 2nd patch with the optimization above should go along
> with this fix.

Hi Pavel,

I think this makes a lot of sense.
Since Naoya is out until Wednesday, maybe I can give it a shot next week
and see if I can gather some numbers.

> [...]

Best Regards
Oscar Salvador

2018-06-15 16:03:37

by Pavel Tatashin

[permalink] [raw]
Subject: Re: [PATCH v3] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

> Hi Pavel,
>
> I think this makes a lot of sense.
> Since Naoya is out until Wednesday, maybe I give it a shot next week and see if I can gather some numbers.

Hi Oscar,

Thank you for the offer to do this. Since sched_clock() is not yet
initialized at the time zero_resv_unavail() is called, it is difficult
to measure it during boot. But I had my x86 early boot timestamps
patches handy, so I could measure it, and thus decided to submit the patch.
http://lkml.kernel.org/r/[email protected]
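
(Without those patches, a rough number could still be taken this early
by reading the TSC directly; a hypothetical sketch around the call site:)

	/*
	 * Hypothetical timing sketch: sched_clock() is not set up this
	 * early in boot, so read the TSC around the call instead.
	 */
	u64 t0 = rdtsc();
	zero_resv_unavail();
	u64 t1 = rdtsc();
	pr_info("zero_resv_unavail() took %llu TSC cycles\n", t1 - t0);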

Thank you,
Pavel

>
> >
> > Thank you,
> > Pasha
> >
> > >
> > > > ---
> > > > arch/x86/kernel/e820.c | 15 ++++++++++++---
> > > > 1 file changed, 12 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > > > index d1f25c831447..c88c23c658c1 100644
> > > > --- a/arch/x86/kernel/e820.c
> > > > +++ b/arch/x86/kernel/e820.c
> > > > @@ -1248,6 +1248,7 @@ void __init e820__memblock_setup(void)
> > > > {
> > > > int i;
> > > > u64 end;
> > > > + u64 addr = 0;
> > > >
> > > > /*
> > > > * The bootstrap memblock region count maximum is 128 entries
> > > > @@ -1264,13 +1265,21 @@ void __init e820__memblock_setup(void)
> > > > struct e820_entry *entry = &e820_table->entries[i];
> > > >
> > > > end = entry->addr + entry->size;
> > > > + if (addr < entry->addr)
> > > > + memblock_reserve(addr, entry->addr - addr);
> > > > + addr = end;
> > > > if (end != (resource_size_t)end)
> > > > continue;
> > > >
> > > > + /*
> > > > + * all !E820_TYPE_RAM ranges (including gap ranges) are put
> > > > + * into memblock.reserved to make sure that struct pages in
> > > > + * such regions are not left uninitialized after bootup.
> > > > + */
> > > > if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > > > - continue;
> > > > -
> > > > - memblock_add(entry->addr, entry->size);
> > > > + memblock_reserve(entry->addr, entry->size);
> > > > + else
> > > > + memblock_add(entry->addr, entry->size);
> > > > }
> > > >
> > > > /* Throw away partial pages: */
> > > > --
> > > > 2.7.4
> > >
> > > --
> > > Michal Hocko
> > > SUSE Labs
> > >
> >
>
> Best Regards
> Oscar Salvador
>

2018-06-18 23:38:05

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v3] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

On Fri, 15 Jun 2018 10:00:00 -0400 Pavel Tatashin <[email protected]> wrote:

> [...]
>
> Therefore, we might also need to optimize zero_resv_unavail() a little,
> like this:
>
> 6407                 if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
> 6408                         continue;
>
> Add before "continue":
> 	pfn = ALIGN_DOWN(pfn, pageblock_nr_pages) + pageblock_nr_pages - 1;
> At least this way we would skip a pageblock of invalid memory at a time.
>
> For the patch above:
> Reviewed-by: Pavel Tatashin <[email protected]>
>
> But, I think the 2nd patch with the optimization above should go along
> with this fix.

So I expect this patch needs a cc:stable, which I'll add.

The optimization patch seems less important and I'd like to hold that
off for 4.19-rc1?


2018-06-19 00:52:03

by Pavel Tatashin

[permalink] [raw]
Subject: Re: [PATCH v3] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

> So I expect this patch needs a cc:stable, which I'll add.
Yes.
> The optimization patch seems less important and I'd like to hold that
> off for 4.19-rc1?
I agree, the optimization is not as important, and can wait for 4.19.

2018-07-02 20:06:36

by Pavel Tatashin

[permalink] [raw]
Subject: Re: [PATCH v3] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

> So I expect this patch needs a cc:stable, which I'll add.
>
> The optimization patch seems less important and I'd like to hold that
> off for 4.19-rc1?

Hi Andrew,

Should I resend the optimization patch [1] once 4.18 is released, or
will you include it so that I do not need to do anything?

[1] http://lkml.kernel.org/r/[email protected]

Thank you,
Pavel

2018-07-02 20:30:22

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v3] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

On Mon, 2 Jul 2018 16:05:04 -0400 Pavel Tatashin <[email protected]> wrote:

> > So I expect this patch needs a cc:stable, which I'll add.
> >
> > The optimization patch seems less important and I'd like to hold that
> > off for 4.19-rc1?
>
> Hi Andrew,
>
> Should I resend the optimization patch [1] once 4.18 is released, or
> will you include it so that I do not need to do anything?
>
> [1] http://lkml.kernel.org/r/[email protected]
>

http://ozlabs.org/~akpm/mmots/broken-out/mm-skip-invalid-pages-block-at-a-time-in-zero_resv_unresv.patch
has been in -mm since Jun 18, so all is well.