2019-10-09 14:25:40

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 0/2] mm: Don't access uninitialized memmaps in PFN walkers

This is the follow-up of:
[PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c

We have multiple places where we might access uninitialized memmaps and
trigger kernel BUGs. Make sure to only access initialized memmaps.

Some of these places got easier to trigger with:
[PATCH v6 00/10] mm/memory_hotplug: Shrink zones before removing memory
As memmaps are now also poisoned when memory is offlined, before it is
actually removed.

v1 -> v2:
- Drop ZONE_DEVICE support from the /proc/k... files as requested by Michal
- Further simplify the code
- Split up into two patches

David Hildenbrand (2):
mm: Don't access uninitialized memmaps in fs/proc/page.c
mm/memory-failure.c: Don't access uninitialized memmaps in
memory_failure()

fs/proc/page.c | 28 ++++++++++++++++------------
mm/memory-failure.c | 14 ++++++++------
2 files changed, 24 insertions(+), 18 deletions(-)

--
2.21.0


2019-10-09 14:28:20

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 1/2] mm: Don't access uninitialized memmaps in fs/proc/page.c

There are three places where we access uninitialized memmaps, namely:
- /proc/kpagecount
- /proc/kpageflags
- /proc/kpagecgroup

We have initialized memmaps either when the section is online or when
the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain
garbage and in the worst case trigger kernel BUGs, especially with
CONFIG_PAGE_POISONING.

For example, not onlining a DIMM during boot and calling /proc/kpagecount
with CONFIG_PAGE_POISONING:
:/# cat /proc/kpagecount > tmp.test
[ 95.600592] BUG: unable to handle page fault for address: fffffffffffffffe
[ 95.601238] #PF: supervisor read access in kernel mode
[ 95.601675] #PF: error_code(0x0000) - not-present page
[ 95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
[ 95.602596] Oops: 0000 [#1] SMP NOPTI
[ 95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11
[ 95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
[ 95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0
[ 95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
[ 95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202
[ 95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000
[ 95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000
[ 95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000
[ 95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
[ 95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08
[ 95.609924] FS: 00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000
[ 95.610599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0
[ 95.611686] Call Trace:
[ 95.611906] proc_reg_read+0x3c/0x60
[ 95.612228] vfs_read+0xc5/0x180
[ 95.612505] ksys_read+0x68/0xe0
[ 95.612785] do_syscall_64+0x5c/0xa0
[ 95.613092] entry_SYSCALL_64_after_hwframe+0x49/0xbe

For now, let's drop support for ZONE_DEVICE from the three pseudo files
in order to fix this. To distinguish offline memory (with garbage memmap)
from ZONE_DEVICE memory with properly initialized memmaps, we would have to
check get_dev_pagemap() and pfn_zone_device_reserved() right now. The usage
of both (especially, special casing devmem) is frowned upon and needs to
be reworked. The fundamental issue we have is:

if (pfn_to_online_page(pfn)) {
/* memmap initialized */
} else if (pfn_valid(pfn)) {
/*
* ???
* a) offline memory. memmap garbage.
* b) devmem: memmap initialized to ZONE_DEVICE.
* c) devmem: reserved for driver. memmap garbage.
* (d) devmem: memmap currently initializing - garbage)
*/
}

We'll leave the pfn_zone_device_reserved() check in stable_page_flags()
in place as that function is also used from memory failure. We now
no longer dump information about pages that are not in use anymore -
offline.

Reported-by: Qian Cai <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Toshiki Fukasawa <[email protected]>
Cc: Pankaj gupta <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Anthony Yznaga <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: [email protected]
Signed-off-by: David Hildenbrand <[email protected]>
---
fs/proc/page.c | 28 ++++++++++++++++------------
1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/fs/proc/page.c b/fs/proc/page.c
index decd3fe39674..e40dbfe1168e 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -42,10 +42,12 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf,
return -EINVAL;

while (count > 0) {
- if (pfn_valid(pfn))
- ppage = pfn_to_page(pfn);
- else
- ppage = NULL;
+ /*
+ * TODO: ZONE_DEVICE support requires to identify
+ * memmaps that were actually initialized.
+ */
+ ppage = pfn_to_online_page(pfn);
+
if (!ppage || PageSlab(ppage) || page_has_type(ppage))
pcount = 0;
else
@@ -218,10 +220,11 @@ static ssize_t kpageflags_read(struct file *file, char __user *buf,
return -EINVAL;

while (count > 0) {
- if (pfn_valid(pfn))
- ppage = pfn_to_page(pfn);
- else
- ppage = NULL;
+ /*
+ * TODO: ZONE_DEVICE support requires to identify
+ * memmaps that were actually initialized.
+ */
+ ppage = pfn_to_online_page(pfn);

if (put_user(stable_page_flags(ppage), out)) {
ret = -EFAULT;
@@ -263,10 +266,11 @@ static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
return -EINVAL;

while (count > 0) {
- if (pfn_valid(pfn))
- ppage = pfn_to_page(pfn);
- else
- ppage = NULL;
+ /*
+ * TODO: ZONE_DEVICE support requires to identify
+ * memmaps that were actually initialized.
+ */
+ ppage = pfn_to_online_page(pfn);

if (ppage)
ino = page_cgroup_ino(ppage);
--
2.21.0

2019-10-09 14:28:23

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

We should check for pfn_to_online_page() to not access uninitialized
memmaps. Reshuffle the code so we don't have to duplicate the error
message.

Cc: Naoya Horiguchi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory-failure.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7ef849da8278..e866e6e5660b 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1253,17 +1253,19 @@ int memory_failure(unsigned long pfn, int flags)
if (!sysctl_memory_failure_recovery)
panic("Memory failure on page %lx", pfn);

- if (!pfn_valid(pfn)) {
+ p = pfn_to_online_page(pfn);
+ if (!p) {
+ if (pfn_valid(pfn)) {
+ pgmap = get_dev_pagemap(pfn, NULL);
+ if (pgmap)
+ return memory_failure_dev_pagemap(pfn, flags,
+ pgmap);
+ }
pr_err("Memory failure: %#lx: memory outside kernel control\n",
pfn);
return -ENXIO;
}

- pgmap = get_dev_pagemap(pfn, NULL);
- if (pgmap)
- return memory_failure_dev_pagemap(pfn, flags, pgmap);
-
- p = pfn_to_page(pfn);
if (PageHuge(p))
return memory_failure_hugetlb(pfn, flags);
if (TestSetPageHWPoison(p)) {
--
2.21.0

2019-10-09 14:44:31

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

On Wed 09-10-19 16:24:35, David Hildenbrand wrote:
> We should check for pfn_to_online_page() to not access uninitialized
> memmaps. Reshuffle the code so we don't have to duplicate the error
> message.
>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/memory-failure.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 7ef849da8278..e866e6e5660b 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1253,17 +1253,19 @@ int memory_failure(unsigned long pfn, int flags)
> if (!sysctl_memory_failure_recovery)
> panic("Memory failure on page %lx", pfn);
>
> - if (!pfn_valid(pfn)) {
> + p = pfn_to_online_page(pfn);
> + if (!p) {
> + if (pfn_valid(pfn)) {
> + pgmap = get_dev_pagemap(pfn, NULL);
> + if (pgmap)
> + return memory_failure_dev_pagemap(pfn, flags,
> + pgmap);
> + }
> pr_err("Memory failure: %#lx: memory outside kernel control\n",
> pfn);
> return -ENXIO;

Don't we need that earlier at hwpoison_inject level?

--
Michal Hocko
SUSE Labs

2019-10-09 14:44:46

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] mm: Don't access uninitialized memmaps in fs/proc/page.c

On Wed 09-10-19 16:24:34, David Hildenbrand wrote:
> There are three places where we access uninitialized memmaps, namely:
> - /proc/kpagecount
> - /proc/kpageflags
> - /proc/kpagecgroup
>
> We have initialized memmaps either when the section is online or when
> the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain
> garbage and in the worst case trigger kernel BUGs, especially with
> CONFIG_PAGE_POISONING.
>
> For example, not onlining a DIMM during boot and calling /proc/kpagecount
> with CONFIG_PAGE_POISONING:
> :/# cat /proc/kpagecount > tmp.test
> [ 95.600592] BUG: unable to handle page fault for address: fffffffffffffffe
> [ 95.601238] #PF: supervisor read access in kernel mode
> [ 95.601675] #PF: error_code(0x0000) - not-present page
> [ 95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
> [ 95.602596] Oops: 0000 [#1] SMP NOPTI
> [ 95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11
> [ 95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
> [ 95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0
> [ 95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
> [ 95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202
> [ 95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000
> [ 95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000
> [ 95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000
> [ 95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
> [ 95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08
> [ 95.609924] FS: 00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000
> [ 95.610599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0
> [ 95.611686] Call Trace:
> [ 95.611906] proc_reg_read+0x3c/0x60
> [ 95.612228] vfs_read+0xc5/0x180
> [ 95.612505] ksys_read+0x68/0xe0
> [ 95.612785] do_syscall_64+0x5c/0xa0
> [ 95.613092] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> For now, let's drop support for ZONE_DEVICE from the three pseudo files
> in order to fix this. To distinguish offline memory (with garbage memmap)
> from ZONE_DEVICE memory with properly initialized memmaps, we would have to
> check get_dev_pagemap() and pfn_zone_device_reserved() right now. The usage
> of both (especially, special casing devmem) is frowned upon and needs to
> be reworked. The fundamental issue we have is:
>
> if (pfn_to_online_page(pfn)) {
> /* memmap initialized */
> } else if (pfn_valid(pfn)) {
> /*
> * ???
> * a) offline memory. memmap garbage.
> * b) devmem: memmap initialized to ZONE_DEVICE.
> * c) devmem: reserved for driver. memmap garbage.
> * (d) devmem: memmap currently initializing - garbage)
> */
> }
>
> We'll leave the pfn_zone_device_reserved() check in stable_page_flags()
> in place as that function is also used from memory failure. We now
> no longer dump information about pages that are not in use anymore -
> offline.
>
> Reported-by: Qian Cai <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Alexey Dobriyan <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Stephen Rothwell <[email protected]>
> Cc: Toshiki Fukasawa <[email protected]>
> Cc: Pankaj gupta <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Anthony Yznaga <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: "Aneesh Kumar K.V" <[email protected]>
> Cc: [email protected]
> Signed-off-by: David Hildenbrand <[email protected]>
Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e86b319

usage of pfn_to_online_page is the right way to dereference the pfn
here. My understanding of the zone device internals is limited to see
whether this fixes it as well.

Acked-by: Michal Hocko <[email protected]>

> ---
> fs/proc/page.c | 28 ++++++++++++++++------------
> 1 file changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index decd3fe39674..e40dbfe1168e 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -42,10 +42,12 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf,
> return -EINVAL;
>
> while (count > 0) {
> - if (pfn_valid(pfn))
> - ppage = pfn_to_page(pfn);
> - else
> - ppage = NULL;
> + /*
> + * TODO: ZONE_DEVICE support requires to identify
> + * memmaps that were actually initialized.
> + */
> + ppage = pfn_to_online_page(pfn);
> +
> if (!ppage || PageSlab(ppage) || page_has_type(ppage))
> pcount = 0;
> else
> @@ -218,10 +220,11 @@ static ssize_t kpageflags_read(struct file *file, char __user *buf,
> return -EINVAL;
>
> while (count > 0) {
> - if (pfn_valid(pfn))
> - ppage = pfn_to_page(pfn);
> - else
> - ppage = NULL;
> + /*
> + * TODO: ZONE_DEVICE support requires to identify
> + * memmaps that were actually initialized.
> + */
> + ppage = pfn_to_online_page(pfn);
>
> if (put_user(stable_page_flags(ppage), out)) {
> ret = -EFAULT;
> @@ -263,10 +266,11 @@ static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
> return -EINVAL;
>
> while (count > 0) {
> - if (pfn_valid(pfn))
> - ppage = pfn_to_page(pfn);
> - else
> - ppage = NULL;
> + /*
> + * TODO: ZONE_DEVICE support requires to identify
> + * memmaps that were actually initialized.
> + */
> + ppage = pfn_to_online_page(pfn);
>
> if (ppage)
> ino = page_cgroup_ino(ppage);
> --
> 2.21.0

--
Michal Hocko
SUSE Labs

2019-10-10 00:28:43

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

On Wed, Oct 09, 2019 at 04:24:35PM +0200, David Hildenbrand wrote:
> We should check for pfn_to_online_page() to not access uninitialized
> memmaps. Reshuffle the code so we don't have to duplicate the error
> message.
>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/memory-failure.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 7ef849da8278..e866e6e5660b 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1253,17 +1253,19 @@ int memory_failure(unsigned long pfn, int flags)
> if (!sysctl_memory_failure_recovery)
> panic("Memory failure on page %lx", pfn);
>
> - if (!pfn_valid(pfn)) {
> + p = pfn_to_online_page(pfn);
> + if (!p) {
> + if (pfn_valid(pfn)) {
> + pgmap = get_dev_pagemap(pfn, NULL);
> + if (pgmap)
> + return memory_failure_dev_pagemap(pfn, flags,
> + pgmap);
> + }
> pr_err("Memory failure: %#lx: memory outside kernel control\n",
> pfn);
> return -ENXIO;
> }
>
> - pgmap = get_dev_pagemap(pfn, NULL);
> - if (pgmap)
> - return memory_failure_dev_pagemap(pfn, flags, pgmap);
> -
> - p = pfn_to_page(pfn);

This change seems to assume that memory_failure_dev_pagemap() is never
called for online pages. Is it an intended behavior?
Or the concept "online pages" is not applicable to zone device pages?

Thanks,
Naoya Horiguchi

2019-10-10 07:22:54

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

On 10.10.19 02:26, Naoya Horiguchi wrote:
> On Wed, Oct 09, 2019 at 04:24:35PM +0200, David Hildenbrand wrote:
>> We should check for pfn_to_online_page() to not access uninitialized
>> memmaps. Reshuffle the code so we don't have to duplicate the error
>> message.
>>
>> Cc: Naoya Horiguchi <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> mm/memory-failure.c | 14 ++++++++------
>> 1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 7ef849da8278..e866e6e5660b 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1253,17 +1253,19 @@ int memory_failure(unsigned long pfn, int flags)
>> if (!sysctl_memory_failure_recovery)
>> panic("Memory failure on page %lx", pfn);
>>
>> - if (!pfn_valid(pfn)) {
>> + p = pfn_to_online_page(pfn);
>> + if (!p) {
>> + if (pfn_valid(pfn)) {
>> + pgmap = get_dev_pagemap(pfn, NULL);
>> + if (pgmap)
>> + return memory_failure_dev_pagemap(pfn, flags,
>> + pgmap);
>> + }
>> pr_err("Memory failure: %#lx: memory outside kernel control\n",
>> pfn);
>> return -ENXIO;
>> }
>>
>> - pgmap = get_dev_pagemap(pfn, NULL);
>> - if (pgmap)
>> - return memory_failure_dev_pagemap(pfn, flags, pgmap);
>> -
>> - p = pfn_to_page(pfn);
>
> This change seems to assume that memory_failure_dev_pagemap() is never
> called for online pages. Is it an intended behavior?
> Or the concept "online pages" is not applicable to zone device pages?

Yes, that's the real culprit. ZONE_DEVICE/devmem pages are never online
(SECTION_IS_ONLINE). The terminology "online" only applies to pages that
were given to the buddy. And as we support sup-section hotadd for
devmem, we cannot easily make use of the section flag it. I already
proposed somewhere to convert SECTION_IS_ONLINE to a subsection bitmap
and call it something like pfn_active().

pfn_online() would then be "pfn_active() && zone != ZONE_DEVICE". And we
could use pfn_active() everywhere to test for initialized memmaps (well,
besides some special cases like device reserved memory that does not
span full sub-sections). Until now, nobody volunteered and I have other
things to do.

--

Thanks,

David / dhildenb

2019-10-10 07:39:20

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

On Thu 10-10-19 09:27:32, David Hildenbrand wrote:
> On 09.10.19 16:43, Michal Hocko wrote:
> > On Wed 09-10-19 16:24:35, David Hildenbrand wrote:
> >> We should check for pfn_to_online_page() to not access uninitialized
> >> memmaps. Reshuffle the code so we don't have to duplicate the error
> >> message.
> >>
> >> Cc: Naoya Horiguchi <[email protected]>
> >> Cc: Andrew Morton <[email protected]>
> >> Cc: Michal Hocko <[email protected]>
> >> Signed-off-by: David Hildenbrand <[email protected]>
> >> ---
> >> mm/memory-failure.c | 14 ++++++++------
> >> 1 file changed, 8 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >> index 7ef849da8278..e866e6e5660b 100644
> >> --- a/mm/memory-failure.c
> >> +++ b/mm/memory-failure.c
> >> @@ -1253,17 +1253,19 @@ int memory_failure(unsigned long pfn, int flags)
> >> if (!sysctl_memory_failure_recovery)
> >> panic("Memory failure on page %lx", pfn);
> >>
> >> - if (!pfn_valid(pfn)) {
> >> + p = pfn_to_online_page(pfn);
> >> + if (!p) {
> >> + if (pfn_valid(pfn)) {
> >> + pgmap = get_dev_pagemap(pfn, NULL);
> >> + if (pgmap)
> >> + return memory_failure_dev_pagemap(pfn, flags,
> >> + pgmap);
> >> + }
> >> pr_err("Memory failure: %#lx: memory outside kernel control\n",
> >> pfn);
> >> return -ENXIO;
> >
> > Don't we need that earlier at hwpoison_inject level?
> >
>
> Theoretically yes, this is another instance. But pfn_to_online_page(pfn)
> alone would not be sufficient as discussed. We would, again, have to
> special-case ZONE_DEVICE via things like get_dev_pagemap() ...
>
> But mm/hwpoison-inject.c:hwpoison_inject() is a pure debug feature either way:
>
> /*
> * Note that the below poison/unpoison interfaces do not involve
> * hardware status change, hence do not require hardware support.
> * They are mainly for testing hwpoison in software level.
> */
>
> So it's not that bad compared to memory_failure() called from real HW or
> drivers/base/memory.c:soft_offline_page_store()/hard_offline_page_store()

Yes, this is just a toy. And yes we need to handle zone device pages
here because a) people likely want to test MCE behavior even on these
pages and b) HW can really trigger MCEs there as well. I was just
pointing that the patch is likely incomplete.

--
Michal Hocko
SUSE Labs

2019-10-10 07:40:42

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

On 09.10.19 16:43, Michal Hocko wrote:
> On Wed 09-10-19 16:24:35, David Hildenbrand wrote:
>> We should check for pfn_to_online_page() to not access uninitialized
>> memmaps. Reshuffle the code so we don't have to duplicate the error
>> message.
>>
>> Cc: Naoya Horiguchi <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> mm/memory-failure.c | 14 ++++++++------
>> 1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 7ef849da8278..e866e6e5660b 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1253,17 +1253,19 @@ int memory_failure(unsigned long pfn, int flags)
>> if (!sysctl_memory_failure_recovery)
>> panic("Memory failure on page %lx", pfn);
>>
>> - if (!pfn_valid(pfn)) {
>> + p = pfn_to_online_page(pfn);
>> + if (!p) {
>> + if (pfn_valid(pfn)) {
>> + pgmap = get_dev_pagemap(pfn, NULL);
>> + if (pgmap)
>> + return memory_failure_dev_pagemap(pfn, flags,
>> + pgmap);
>> + }
>> pr_err("Memory failure: %#lx: memory outside kernel control\n",
>> pfn);
>> return -ENXIO;
>
> Don't we need that earlier at hwpoison_inject level?
>

Theoretically yes, this is another instance. But pfn_to_online_page(pfn)
alone would not be sufficient as discussed. We would, again, have to
special-case ZONE_DEVICE via things like get_dev_pagemap() ...

But mm/hwpoison-inject.c:hwpoison_inject() is a pure debug feature either way:

/*
* Note that the below poison/unpoison interfaces do not involve
* hardware status change, hence do not require hardware support.
* They are mainly for testing hwpoison in software level.
*/

So it's not that bad compared to memory_failure() called from real HW or
drivers/base/memory.c:soft_offline_page_store()/hard_offline_page_store()

--

Thanks,

David / dhildenb

2019-10-10 07:53:04

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

On 10.10.19 09:35, Michal Hocko wrote:
> On Thu 10-10-19 09:27:32, David Hildenbrand wrote:
>> On 09.10.19 16:43, Michal Hocko wrote:
>>> On Wed 09-10-19 16:24:35, David Hildenbrand wrote:
>>>> We should check for pfn_to_online_page() to not access uninitialized
>>>> memmaps. Reshuffle the code so we don't have to duplicate the error
>>>> message.
>>>>
>>>> Cc: Naoya Horiguchi <[email protected]>
>>>> Cc: Andrew Morton <[email protected]>
>>>> Cc: Michal Hocko <[email protected]>
>>>> Signed-off-by: David Hildenbrand <[email protected]>
>>>> ---
>>>> mm/memory-failure.c | 14 ++++++++------
>>>> 1 file changed, 8 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>> index 7ef849da8278..e866e6e5660b 100644
>>>> --- a/mm/memory-failure.c
>>>> +++ b/mm/memory-failure.c
>>>> @@ -1253,17 +1253,19 @@ int memory_failure(unsigned long pfn, int flags)
>>>> if (!sysctl_memory_failure_recovery)
>>>> panic("Memory failure on page %lx", pfn);
>>>>
>>>> - if (!pfn_valid(pfn)) {
>>>> + p = pfn_to_online_page(pfn);
>>>> + if (!p) {
>>>> + if (pfn_valid(pfn)) {
>>>> + pgmap = get_dev_pagemap(pfn, NULL);
>>>> + if (pgmap)
>>>> + return memory_failure_dev_pagemap(pfn, flags,
>>>> + pgmap);
>>>> + }
>>>> pr_err("Memory failure: %#lx: memory outside kernel control\n",
>>>> pfn);
>>>> return -ENXIO;
>>>
>>> Don't we need that earlier at hwpoison_inject level?
>>>
>>
>> Theoretically yes, this is another instance. But pfn_to_online_page(pfn)
>> alone would not be sufficient as discussed. We would, again, have to
>> special-case ZONE_DEVICE via things like get_dev_pagemap() ...
>>
>> But mm/hwpoison-inject.c:hwpoison_inject() is a pure debug feature either way:
>>
>> /*
>> * Note that the below poison/unpoison interfaces do not involve
>> * hardware status change, hence do not require hardware support.
>> * They are mainly for testing hwpoison in software level.
>> */
>>
>> So it's not that bad compared to memory_failure() called from real HW or
>> drivers/base/memory.c:soft_offline_page_store()/hard_offline_page_store()
>
> Yes, this is just a toy. And yes we need to handle zone device pages
> here because a) people likely want to test MCE behavior even on these
> pages and b) HW can really trigger MCEs there as well. I was just
> pointing that the patch is likely incomplete.
>

I rather think this deserves a separate patch as it is a separate
interface :)

I do wonder why hwpoison_inject() has to perform so much extra work
compared to other memory_failure() users. This smells like legacy
leftovers to me, but I might be wrong. The interface is fairly old,
though. Does anybody know why we need this magic? I can spot quite some
duplicate checks/things getting performed.

Naiive me would just make the interface perform the same as
hard_offline_page_store(). But most probably I am not getting the real
purpose of both different interfaces.

HWPOISON_INJECT is only selected for DEBUG_KERNEL, so I would have
guessed that fixing this is not urgent.

BTW: mm/memory-failure.c:soft_offline_page() also looks wrong and needs
fixing to make sure we access initialized memmaps.

--

Thanks,

David / dhildenb

2019-10-10 08:01:54

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

On 10.10.19 09:52, David Hildenbrand wrote:
> On 10.10.19 09:35, Michal Hocko wrote:
>> On Thu 10-10-19 09:27:32, David Hildenbrand wrote:
>>> On 09.10.19 16:43, Michal Hocko wrote:
>>>> On Wed 09-10-19 16:24:35, David Hildenbrand wrote:
>>>>> We should check for pfn_to_online_page() to not access uninitialized
>>>>> memmaps. Reshuffle the code so we don't have to duplicate the error
>>>>> message.
>>>>>
>>>>> Cc: Naoya Horiguchi <[email protected]>
>>>>> Cc: Andrew Morton <[email protected]>
>>>>> Cc: Michal Hocko <[email protected]>
>>>>> Signed-off-by: David Hildenbrand <[email protected]>
>>>>> ---
>>>>> mm/memory-failure.c | 14 ++++++++------
>>>>> 1 file changed, 8 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>>> index 7ef849da8278..e866e6e5660b 100644
>>>>> --- a/mm/memory-failure.c
>>>>> +++ b/mm/memory-failure.c
>>>>> @@ -1253,17 +1253,19 @@ int memory_failure(unsigned long pfn, int flags)
>>>>> if (!sysctl_memory_failure_recovery)
>>>>> panic("Memory failure on page %lx", pfn);
>>>>>
>>>>> - if (!pfn_valid(pfn)) {
>>>>> + p = pfn_to_online_page(pfn);
>>>>> + if (!p) {
>>>>> + if (pfn_valid(pfn)) {
>>>>> + pgmap = get_dev_pagemap(pfn, NULL);
>>>>> + if (pgmap)
>>>>> + return memory_failure_dev_pagemap(pfn, flags,
>>>>> + pgmap);
>>>>> + }
>>>>> pr_err("Memory failure: %#lx: memory outside kernel control\n",
>>>>> pfn);
>>>>> return -ENXIO;
>>>>
>>>> Don't we need that earlier at hwpoison_inject level?
>>>>
>>>
>>> Theoretically yes, this is another instance. But pfn_to_online_page(pfn)
>>> alone would not be sufficient as discussed. We would, again, have to
>>> special-case ZONE_DEVICE via things like get_dev_pagemap() ...
>>>
>>> But mm/hwpoison-inject.c:hwpoison_inject() is a pure debug feature either way:
>>>
>>> /*
>>> * Note that the below poison/unpoison interfaces do not involve
>>> * hardware status change, hence do not require hardware support.
>>> * They are mainly for testing hwpoison in software level.
>>> */
>>>
>>> So it's not that bad compared to memory_failure() called from real HW or
>>> drivers/base/memory.c:soft_offline_page_store()/hard_offline_page_store()
>>
>> Yes, this is just a toy. And yes we need to handle zone device pages
>> here because a) people likely want to test MCE behavior even on these
>> pages and b) HW can really trigger MCEs there as well. I was just
>> pointing that the patch is likely incomplete.
>>
>
> I rather think this deserves a separate patch as it is a separate
> interface :)
>
> I do wonder why hwpoison_inject() has to perform so much extra work
> compared to other memory_failure() users. This smells like legacy
> leftovers to me, but I might be wrong. The interface is fairly old,
> though. Does anybody know why we need this magic? I can spot quite some
> duplicate checks/things getting performed.
>
> Naiive me would just make the interface perform the same as
> hard_offline_page_store(). But most probably I am not getting the real
> purpose of both different interfaces.
>
> HWPOISON_INJECT is only selected for DEBUG_KERNEL, so I would have
> guessed that fixing this is not urgent.
>
> BTW: mm/memory-failure.c:soft_offline_page() also looks wrong and needs
> fixing to make sure we access initialized memmaps.
>

To be more precise, soft_offline_page_store() needs a
pfn_to_online_page() check. Will send a patch.

--

Thanks,

David / dhildenb

2019-10-11 06:16:38

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

On Thu, Oct 10, 2019 at 09:58:40AM +0200, David Hildenbrand wrote:
> On 10.10.19 09:52, David Hildenbrand wrote:
> > On 10.10.19 09:35, Michal Hocko wrote:
> >> On Thu 10-10-19 09:27:32, David Hildenbrand wrote:
> >>> On 09.10.19 16:43, Michal Hocko wrote:
> >>>> On Wed 09-10-19 16:24:35, David Hildenbrand wrote:
> >>>>> We should check for pfn_to_online_page() to not access uninitialized
> >>>>> memmaps. Reshuffle the code so we don't have to duplicate the error
> >>>>> message.
> >>>>>
> >>>>> Cc: Naoya Horiguchi <[email protected]>
> >>>>> Cc: Andrew Morton <[email protected]>
> >>>>> Cc: Michal Hocko <[email protected]>
> >>>>> Signed-off-by: David Hildenbrand <[email protected]>
> >>>>> ---
> >>>>> mm/memory-failure.c | 14 ++++++++------
> >>>>> 1 file changed, 8 insertions(+), 6 deletions(-)
> >>>>>
> >>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >>>>> index 7ef849da8278..e866e6e5660b 100644
> >>>>> --- a/mm/memory-failure.c
> >>>>> +++ b/mm/memory-failure.c
> >>>>> @@ -1253,17 +1253,19 @@ int memory_failure(unsigned long pfn, int flags)
> >>>>> if (!sysctl_memory_failure_recovery)
> >>>>> panic("Memory failure on page %lx", pfn);
> >>>>>
> >>>>> - if (!pfn_valid(pfn)) {
> >>>>> + p = pfn_to_online_page(pfn);
> >>>>> + if (!p) {
> >>>>> + if (pfn_valid(pfn)) {
> >>>>> + pgmap = get_dev_pagemap(pfn, NULL);
> >>>>> + if (pgmap)
> >>>>> + return memory_failure_dev_pagemap(pfn, flags,
> >>>>> + pgmap);
> >>>>> + }
> >>>>> pr_err("Memory failure: %#lx: memory outside kernel control\n",
> >>>>> pfn);
> >>>>> return -ENXIO;
> >>>>
> >>>> Don't we need that earlier at hwpoison_inject level?
> >>>>
> >>>
> >>> Theoretically yes, this is another instance. But pfn_to_online_page(pfn)
> >>> alone would not be sufficient as discussed. We would, again, have to
> >>> special-case ZONE_DEVICE via things like get_dev_pagemap() ...
> >>>
> >>> But mm/hwpoison-inject.c:hwpoison_inject() is a pure debug feature either way:
> >>>
> >>> /*
> >>> * Note that the below poison/unpoison interfaces do not involve
> >>> * hardware status change, hence do not require hardware support.
> >>> * They are mainly for testing hwpoison in software level.
> >>> */
> >>>
> >>> So it's not that bad compared to memory_failure() called from real HW or
> >>> drivers/base/memory.c:soft_offline_page_store()/hard_offline_page_store()
> >>
> >> Yes, this is just a toy. And yes we need to handle zone device pages
> >> here because a) people likely want to test MCE behavior even on these
> >> pages and b) HW can really trigger MCEs there as well. I was just
> >> pointing that the patch is likely incomplete.
> >>
> >
> > I rather think this deserves a separate patch as it is a separate
> > interface :)
> >
> > I do wonder why hwpoison_inject() has to perform so much extra work
> > compared to other memory_failure() users. This smells like legacy
> > leftovers to me, but I might be wrong. The interface is fairly old,
> > though. Does anybody know why we need this magic? I can spot quite some
> > duplicate checks/things getting performed.

It concerns me too, this *is* an old legacy code. I guess it was left as-is
because no one complained about it. That's not good, so I'll do some cleanup.

> >
> > Naiive me would just make the interface perform the same as
> > hard_offline_page_store(). But most probably I am not getting the real
> > purpose of both different interfaces.

Maybe for historical reason, we have these slightly different interfaces:

- corrupt-pfn
- purely for debugging purpose
- paired with unpoison-pfn
- used by in-kernel tool tools/vm/page-types.c
- hard_offline_page
- paired with soft_offline_page
- used by other userspace tools like mcelog

But these don't explain why implemented differently, so I think that both
should be written in the same manner.

> >
> > HWPOISON_INJECT is only selected for DEBUG_KERNEL, so I would have
> > guessed that fixing this is not urgent.
> >
> > BTW: mm/memory-failure.c:soft_offline_page() also looks wrong and needs
> > fixing to make sure we access initialized memmaps.
> >
>
> To be more precise, soft_offline_page_store() needs a
> pfn_to_online_page() check. Will send a patch.

Thanks for finding this.

- Naoya Horiguchi

2019-10-11 06:54:49

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

On Thu, Oct 10, 2019 at 09:17:42AM +0200, David Hildenbrand wrote:
> On 10.10.19 02:26, Naoya Horiguchi wrote:
> > On Wed, Oct 09, 2019 at 04:24:35PM +0200, David Hildenbrand wrote:
> >> We should check for pfn_to_online_page() to not access uninitialized
> >> memmaps. Reshuffle the code so we don't have to duplicate the error
> >> message.
> >>
> >> Cc: Naoya Horiguchi <[email protected]>
> >> Cc: Andrew Morton <[email protected]>
> >> Cc: Michal Hocko <[email protected]>
> >> Signed-off-by: David Hildenbrand <[email protected]>
> >> ---
> >> mm/memory-failure.c | 14 ++++++++------
> >> 1 file changed, 8 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >> index 7ef849da8278..e866e6e5660b 100644
> >> --- a/mm/memory-failure.c
> >> +++ b/mm/memory-failure.c
> >> @@ -1253,17 +1253,19 @@ int memory_failure(unsigned long pfn, int flags)
> >> if (!sysctl_memory_failure_recovery)
> >> panic("Memory failure on page %lx", pfn);
> >>
> >> - if (!pfn_valid(pfn)) {
> >> + p = pfn_to_online_page(pfn);
> >> + if (!p) {
> >> + if (pfn_valid(pfn)) {
> >> + pgmap = get_dev_pagemap(pfn, NULL);
> >> + if (pgmap)
> >> + return memory_failure_dev_pagemap(pfn, flags,
> >> + pgmap);
> >> + }
> >> pr_err("Memory failure: %#lx: memory outside kernel control\n",
> >> pfn);
> >> return -ENXIO;
> >> }
> >>
> >> - pgmap = get_dev_pagemap(pfn, NULL);
> >> - if (pgmap)
> >> - return memory_failure_dev_pagemap(pfn, flags, pgmap);
> >> -
> >> - p = pfn_to_page(pfn);
> >
> > This change seems to assume that memory_failure_dev_pagemap() is never
> > called for online pages. Is it an intended behavior?
> > Or the concept "online pages" is not applicable to zone device pages?
>
> Yes, that's the real culprit. ZONE_DEVICE/devmem pages are never online
> (SECTION_IS_ONLINE). The terminology "online" only applies to pages that
> were given to the buddy. And as we support sup-section hotadd for
> devmem, we cannot easily make use of the section flag it. I already
> proposed somewhere to convert SECTION_IS_ONLINE to a subsection bitmap
> and call it something like pfn_active().
>
> pfn_online() would then be "pfn_active() && zone != ZONE_DEVICE". And we
> could use pfn_active() everywhere to test for initialized memmaps (well,
> besides some special cases like device reserved memory that does not
> span full sub-sections). Until now, nobody volunteered and I have other
> things to do.

Thank you for explanation. I'm not sure how hard now but we try to do this.
You fix the problem from online/offline viewpoint now, and leave remaining
part untouched. I agree with that approach, and the suggested change looks
good to me.

Acked-by: Naoya Horiguchi <[email protected]>

2019-10-11 10:14:27

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

On 11.10.19 08:02, Naoya Horiguchi wrote:
> On Thu, Oct 10, 2019 at 09:58:40AM +0200, David Hildenbrand wrote:
>> On 10.10.19 09:52, David Hildenbrand wrote:
>>> On 10.10.19 09:35, Michal Hocko wrote:
>>>> On Thu 10-10-19 09:27:32, David Hildenbrand wrote:
>>>>> On 09.10.19 16:43, Michal Hocko wrote:
>>>>>> On Wed 09-10-19 16:24:35, David Hildenbrand wrote:
>>>>>>> We should check for pfn_to_online_page() to not access uninitialized
>>>>>>> memmaps. Reshuffle the code so we don't have to duplicate the error
>>>>>>> message.
>>>>>>>
>>>>>>> Cc: Naoya Horiguchi <[email protected]>
>>>>>>> Cc: Andrew Morton <[email protected]>
>>>>>>> Cc: Michal Hocko <[email protected]>
>>>>>>> Signed-off-by: David Hildenbrand <[email protected]>
>>>>>>> ---
>>>>>>> mm/memory-failure.c | 14 ++++++++------
>>>>>>> 1 file changed, 8 insertions(+), 6 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>>>>> index 7ef849da8278..e866e6e5660b 100644
>>>>>>> --- a/mm/memory-failure.c
>>>>>>> +++ b/mm/memory-failure.c
>>>>>>> @@ -1253,17 +1253,19 @@ int memory_failure(unsigned long pfn, int flags)
>>>>>>> if (!sysctl_memory_failure_recovery)
>>>>>>> panic("Memory failure on page %lx", pfn);
>>>>>>>
>>>>>>> - if (!pfn_valid(pfn)) {
>>>>>>> + p = pfn_to_online_page(pfn);
>>>>>>> + if (!p) {
>>>>>>> + if (pfn_valid(pfn)) {
>>>>>>> + pgmap = get_dev_pagemap(pfn, NULL);
>>>>>>> + if (pgmap)
>>>>>>> + return memory_failure_dev_pagemap(pfn, flags,
>>>>>>> + pgmap);
>>>>>>> + }
>>>>>>> pr_err("Memory failure: %#lx: memory outside kernel control\n",
>>>>>>> pfn);
>>>>>>> return -ENXIO;
>>>>>>
>>>>>> Don't we need that earlier at hwpoison_inject level?
>>>>>>
>>>>>
>>>>> Theoretically yes, this is another instance. But pfn_to_online_page(pfn)
>>>>> alone would not be sufficient as discussed. We would, again, have to
>>>>> special-case ZONE_DEVICE via things like get_dev_pagemap() ...
>>>>>
>>>>> But mm/hwpoison-inject.c:hwpoison_inject() is a pure debug feature either way:
>>>>>
>>>>> /*
>>>>> * Note that the below poison/unpoison interfaces do not involve
>>>>> * hardware status change, hence do not require hardware support.
>>>>> * They are mainly for testing hwpoison in software level.
>>>>> */
>>>>>
>>>>> So it's not that bad compared to memory_failure() called from real HW or
>>>>> drivers/base/memory.c:soft_offline_page_store()/hard_offline_page_store()
>>>>
>>>> Yes, this is just a toy. And yes we need to handle zone device pages
>>>> here because a) people likely want to test MCE behavior even on these
>>>> pages and b) HW can really trigger MCEs there as well. I was just
>>>> pointing that the patch is likely incomplete.
>>>>
>>>
>>> I rather think this deserves a separate patch as it is a separate
>>> interface :)
>>>
>>> I do wonder why hwpoison_inject() has to perform so much extra work
>>> compared to other memory_failure() users. This smells like legacy
>>> leftovers to me, but I might be wrong. The interface is fairly old,
>>> though. Does anybody know why we need this magic? I can spot quite some
>>> duplicate checks/things getting performed.
>
> It concerns me too, this *is* an old legacy code. I guess it was left as-is
> because no one complained about it. That's not good, so I'll do some cleanup.

Most of that stuff was introduced in

commit 31d3d3484f9bd263925ecaa341500ac2df3a5d9b
Author: Wu Fengguang <[email protected]>
Date: Wed Dec 16 12:19:59 2009 +0100

HWPOISON: limit hwpoison injector to known page types

__memory_failure()'s workflow is

set PG_hwpoison
//...
unset PG_hwpoison if didn't pass hwpoison filter

That could kill unrelated process if it happens to page fault on the
page with the (temporary) PG_hwpoison. The race should be big enough to
appear in stress tests.

Fix it by grabbing the page and checking filter at inject time. This
also avoids the very noisy "Injecting memory failure..." messages.


Now, we still have the same "issue" in memory_failure() today:


if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
pfn);
return 0;
}
[...]
if (hwpoison_filter(p)) {
if (TestClearPageHWPoison(p))
num_poisoned_pages_dec();
unlock_page(p);
put_hwpoison_page(p);
return 0;
}

However, I don't understand why we need that special handling only for this
debug interface and not the other users.

I'd vote for ripping out that legacy crap (so the interface works correctly
with ZONE_DEVICE) and instead (if really required) rework memory_failure()
to not produce such side effects.

>
>>>
>>> Naiive me would just make the interface perform the same as
>>> hard_offline_page_store(). But most probably I am not getting the real
>>> purpose of both different interfaces.
>
> Maybe for historical reason, we have these slightly different interfaces:
>
> - corrupt-pfn
> - purely for debugging purpose
> - paired with unpoison-pfn
> - used by in-kernel tool tools/vm/page-types.c
> - hard_offline_page
> - paired with soft_offline_page
> - used by other userspace tools like mcelog
>
> But these don't explain why implemented differently, so I think that both
> should be written in the same manner.
>
>>>
>>> HWPOISON_INJECT is only selected for DEBUG_KERNEL, so I would have
>>> guessed that fixing this is not urgent.
>>>
>>> BTW: mm/memory-failure.c:soft_offline_page() also looks wrong and needs
>>> fixing to make sure we access initialized memmaps.
>>>
>>
>> To be more precise, soft_offline_page_store() needs a
>> pfn_to_online_page() check. Will send a patch.
>
> Thanks for finding this.
>
> - Naoya Horiguchi
>


--

Thanks,

David / dhildenb

2019-10-14 08:45:13

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

On 09.10.19 16:24, David Hildenbrand wrote:
> We should check for pfn_to_online_page() to not access uninitialized
> memmaps. Reshuffle the code so we don't have to duplicate the error
> message.
>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/memory-failure.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 7ef849da8278..e866e6e5660b 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1253,17 +1253,19 @@ int memory_failure(unsigned long pfn, int flags)
> if (!sysctl_memory_failure_recovery)
> panic("Memory failure on page %lx", pfn);
>
> - if (!pfn_valid(pfn)) {
> + p = pfn_to_online_page(pfn);
> + if (!p) {
> + if (pfn_valid(pfn)) {
> + pgmap = get_dev_pagemap(pfn, NULL);
> + if (pgmap)
> + return memory_failure_dev_pagemap(pfn, flags,
> + pgmap);
> + }
> pr_err("Memory failure: %#lx: memory outside kernel control\n",
> pfn);
> return -ENXIO;
> }
>
> - pgmap = get_dev_pagemap(pfn, NULL);
> - if (pgmap)
> - return memory_failure_dev_pagemap(pfn, flags, pgmap);
> -
> - p = pfn_to_page(pfn);
> if (PageHuge(p))
> return memory_failure_hugetlb(pfn, flags);
> if (TestSetPageHWPoison(p)) {
>

@Andrew, can you add

Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e86b319

And

Cc: [email protected] # v4.13+

The stable backports won't be clean cherry-picks AFAIKS, but do-able.

--

Thanks,

David / dhildenb

2019-10-14 08:47:04

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] mm: Don't access uninitialized memmaps in fs/proc/page.c

On 09.10.19 16:24, David Hildenbrand wrote:
> There are three places where we access uninitialized memmaps, namely:
> - /proc/kpagecount
> - /proc/kpageflags
> - /proc/kpagecgroup
>
> We have initialized memmaps either when the section is online or when
> the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain
> garbage and in the worst case trigger kernel BUGs, especially with
> CONFIG_PAGE_POISONING.
>
> For example, not onlining a DIMM during boot and calling /proc/kpagecount
> with CONFIG_PAGE_POISONING:
> :/# cat /proc/kpagecount > tmp.test
> [ 95.600592] BUG: unable to handle page fault for address: fffffffffffffffe
> [ 95.601238] #PF: supervisor read access in kernel mode
> [ 95.601675] #PF: error_code(0x0000) - not-present page
> [ 95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
> [ 95.602596] Oops: 0000 [#1] SMP NOPTI
> [ 95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11
> [ 95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
> [ 95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0
> [ 95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
> [ 95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202
> [ 95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000
> [ 95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000
> [ 95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000
> [ 95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
> [ 95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08
> [ 95.609924] FS: 00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000
> [ 95.610599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0
> [ 95.611686] Call Trace:
> [ 95.611906] proc_reg_read+0x3c/0x60
> [ 95.612228] vfs_read+0xc5/0x180
> [ 95.612505] ksys_read+0x68/0xe0
> [ 95.612785] do_syscall_64+0x5c/0xa0
> [ 95.613092] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> For now, let's drop support for ZONE_DEVICE from the three pseudo files
> in order to fix this. To distinguish offline memory (with garbage memmap)
> from ZONE_DEVICE memory with properly initialized memmaps, we would have to
> check get_dev_pagemap() and pfn_zone_device_reserved() right now. The usage
> of both (especially, special casing devmem) is frowned upon and needs to
> be reworked. The fundamental issue we have is:
>
> if (pfn_to_online_page(pfn)) {
> /* memmap initialized */
> } else if (pfn_valid(pfn)) {
> /*
> * ???
> * a) offline memory. memmap garbage.
> * b) devmem: memmap initialized to ZONE_DEVICE.
> * c) devmem: reserved for driver. memmap garbage.
> * (d) devmem: memmap currently initializing - garbage)
> */
> }
>
> We'll leave the pfn_zone_device_reserved() check in stable_page_flags()
> in place as that function is also used from memory failure. We now
> no longer dump information about pages that are not in use anymore -
> offline.
>
> Reported-by: Qian Cai <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Alexey Dobriyan <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Stephen Rothwell <[email protected]>
> Cc: Toshiki Fukasawa <[email protected]>
> Cc: Pankaj gupta <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Anthony Yznaga <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: "Aneesh Kumar K.V" <[email protected]>
> Cc: [email protected]
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> fs/proc/page.c | 28 ++++++++++++++++------------
> 1 file changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index decd3fe39674..e40dbfe1168e 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -42,10 +42,12 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf,
> return -EINVAL;
>
> while (count > 0) {
> - if (pfn_valid(pfn))
> - ppage = pfn_to_page(pfn);
> - else
> - ppage = NULL;
> + /*
> + * TODO: ZONE_DEVICE support requires to identify
> + * memmaps that were actually initialized.
> + */
> + ppage = pfn_to_online_page(pfn);
> +
> if (!ppage || PageSlab(ppage) || page_has_type(ppage))
> pcount = 0;
> else
> @@ -218,10 +220,11 @@ static ssize_t kpageflags_read(struct file *file, char __user *buf,
> return -EINVAL;
>
> while (count > 0) {
> - if (pfn_valid(pfn))
> - ppage = pfn_to_page(pfn);
> - else
> - ppage = NULL;
> + /*
> + * TODO: ZONE_DEVICE support requires to identify
> + * memmaps that were actually initialized.
> + */
> + ppage = pfn_to_online_page(pfn);
>
> if (put_user(stable_page_flags(ppage), out)) {
> ret = -EFAULT;
> @@ -263,10 +266,11 @@ static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
> return -EINVAL;
>
> while (count > 0) {
> - if (pfn_valid(pfn))
> - ppage = pfn_to_page(pfn);
> - else
> - ppage = NULL;
> + /*
> + * TODO: ZONE_DEVICE support requires to identify
> + * memmaps that were actually initialized.
> + */
> + ppage = pfn_to_online_page(pfn);
>
> if (ppage)
> ino = page_cgroup_ino(ppage);
>

@Andrew, can you add

Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e86b319

And

Cc: [email protected] # v4.13+

Thanks!

--

Thanks,

David / dhildenb

2019-10-14 13:37:14

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

[Cc Oscar]

On Fri 11-10-19 12:13:17, David Hildenbrand wrote:
> On 11.10.19 08:02, Naoya Horiguchi wrote:
> > On Thu, Oct 10, 2019 at 09:58:40AM +0200, David Hildenbrand wrote:
> >> On 10.10.19 09:52, David Hildenbrand wrote:
> >>> On 10.10.19 09:35, Michal Hocko wrote:
> >>>> On Thu 10-10-19 09:27:32, David Hildenbrand wrote:
> >>>>> On 09.10.19 16:43, Michal Hocko wrote:
> >>>>>> On Wed 09-10-19 16:24:35, David Hildenbrand wrote:
> >>>>>>> We should check for pfn_to_online_page() to not access uninitialized
> >>>>>>> memmaps. Reshuffle the code so we don't have to duplicate the error
> >>>>>>> message.
> >>>>>>>
> >>>>>>> Cc: Naoya Horiguchi <[email protected]>
> >>>>>>> Cc: Andrew Morton <[email protected]>
> >>>>>>> Cc: Michal Hocko <[email protected]>
> >>>>>>> Signed-off-by: David Hildenbrand <[email protected]>
> >>>>>>> ---
> >>>>>>> mm/memory-failure.c | 14 ++++++++------
> >>>>>>> 1 file changed, 8 insertions(+), 6 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >>>>>>> index 7ef849da8278..e866e6e5660b 100644
> >>>>>>> --- a/mm/memory-failure.c
> >>>>>>> +++ b/mm/memory-failure.c
> >>>>>>> @@ -1253,17 +1253,19 @@ int memory_failure(unsigned long pfn, int flags)
> >>>>>>> if (!sysctl_memory_failure_recovery)
> >>>>>>> panic("Memory failure on page %lx", pfn);
> >>>>>>>
> >>>>>>> - if (!pfn_valid(pfn)) {
> >>>>>>> + p = pfn_to_online_page(pfn);
> >>>>>>> + if (!p) {
> >>>>>>> + if (pfn_valid(pfn)) {
> >>>>>>> + pgmap = get_dev_pagemap(pfn, NULL);
> >>>>>>> + if (pgmap)
> >>>>>>> + return memory_failure_dev_pagemap(pfn, flags,
> >>>>>>> + pgmap);
> >>>>>>> + }
> >>>>>>> pr_err("Memory failure: %#lx: memory outside kernel control\n",
> >>>>>>> pfn);
> >>>>>>> return -ENXIO;
> >>>>>>
> >>>>>> Don't we need that earlier at hwpoison_inject level?
> >>>>>>
> >>>>>
> >>>>> Theoretically yes, this is another instance. But pfn_to_online_page(pfn)
> >>>>> alone would not be sufficient as discussed. We would, again, have to
> >>>>> special-case ZONE_DEVICE via things like get_dev_pagemap() ...
> >>>>>
> >>>>> But mm/hwpoison-inject.c:hwpoison_inject() is a pure debug feature either way:
> >>>>>
> >>>>> /*
> >>>>> * Note that the below poison/unpoison interfaces do not involve
> >>>>> * hardware status change, hence do not require hardware support.
> >>>>> * They are mainly for testing hwpoison in software level.
> >>>>> */
> >>>>>
> >>>>> So it's not that bad compared to memory_failure() called from real HW or
> >>>>> drivers/base/memory.c:soft_offline_page_store()/hard_offline_page_store()
> >>>>
> >>>> Yes, this is just a toy. And yes we need to handle zone device pages
> >>>> here because a) people likely want to test MCE behavior even on these
> >>>> pages and b) HW can really trigger MCEs there as well. I was just
> >>>> pointing that the patch is likely incomplete.
> >>>>
> >>>
> >>> I rather think this deserves a separate patch as it is a separate
> >>> interface :)
> >>>
> >>> I do wonder why hwpoison_inject() has to perform so much extra work
> >>> compared to other memory_failure() users. This smells like legacy
> >>> leftovers to me, but I might be wrong. The interface is fairly old,
> >>> though. Does anybody know why we need this magic? I can spot quite some
> >>> duplicate checks/things getting performed.
> >
> > It concerns me too, this *is* an old legacy code. I guess it was left as-is
> > because no one complained about it. That's not good, so I'll do some cleanup.
>
> Most of that stuff was introduced in
>
> commit 31d3d3484f9bd263925ecaa341500ac2df3a5d9b
> Author: Wu Fengguang <[email protected]>
> Date: Wed Dec 16 12:19:59 2009 +0100
>
> HWPOISON: limit hwpoison injector to known page types
>
> __memory_failure()'s workflow is
>
> set PG_hwpoison
> //...
> unset PG_hwpoison if didn't pass hwpoison filter
>
> That could kill unrelated process if it happens to page fault on the
> page with the (temporary) PG_hwpoison. The race should be big enough to
> appear in stress tests.
>
> Fix it by grabbing the page and checking filter at inject time. This
> also avoids the very noisy "Injecting memory failure..." messages.
>
>
> Now, we still have the same "issue" in memory_failure() today:
>
>
> if (TestSetPageHWPoison(p)) {
> pr_err("Memory failure: %#lx: already hardware poisoned\n",
> pfn);
> return 0;
> }
> [...]
> if (hwpoison_filter(p)) {
> if (TestClearPageHWPoison(p))
> num_poisoned_pages_dec();
> unlock_page(p);
> put_hwpoison_page(p);
> return 0;
> }
>
> However, I don't understand why we need that special handling only for this
> debug interface and not the other users.
>
> I'd vote for ripping out that legacy crap (so the interface works correctly
> with ZONE_DEVICE) and instead (if really required) rework memory_failure()
> to not produce such side effects.

I do agree. The two should be really using the same code. My
understanding was that MADV_HWPOISON was there to test the actual MCE
behavior (and the man page seems to agree with that).

Oscar is working on a rewrite. Not sure he has considered this as well.
--
Michal Hocko
SUSE Labs

2019-10-15 14:24:28

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

On Mon, Oct 14, 2019 at 03:36:17PM +0200, Michal Hocko wrote:
> I do agree. The two should be really using the same code. My
> understanding was that MADV_HWPOISON was there to test the actual MCE
> behavior (and the man page seems to agree with that).
>
> Oscar is working on a rewrite. Not sure he has considered this as well.

Yeah, I came across hwpoison-inject module when doing my re-write,
and I felt like this is begging for a clean up.

Since unpoison_memory needs some adjustments after my re-write, I will go ahead
and clean that up, otherwise it will be inconsistent.

I expect to be ready fo send the v2 by the end of this week.

Thanks
--
Oscar Salvador
SUSE L3

2019-10-19 09:54:09

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

On Thu, 10 Oct 2019 09:17:42 +0200 David Hildenbrand <[email protected]> wrote:

> >> - pgmap = get_dev_pagemap(pfn, NULL);
> >> - if (pgmap)
> >> - return memory_failure_dev_pagemap(pfn, flags, pgmap);
> >> -
> >> - p = pfn_to_page(pfn);
> >
> > This change seems to assume that memory_failure_dev_pagemap() is never
> > called for online pages. Is it an intended behavior?
> > Or the concept "online pages" is not applicable to zone device pages?
>
> Yes, that's the real culprit. ZONE_DEVICE/devmem pages are never online
> (SECTION_IS_ONLINE). The terminology "online" only applies to pages that
> were given to the buddy. And as we support sup-section hotadd for
> devmem, we cannot easily make use of the section flag it. I already
> proposed somewhere to convert SECTION_IS_ONLINE to a subsection bitmap
> and call it something like pfn_active().
>
> pfn_online() would then be "pfn_active() && zone != ZONE_DEVICE". And we
> could use pfn_active() everywhere to test for initialized memmaps (well,
> besides some special cases like device reserved memory that does not
> span full sub-sections). Until now, nobody volunteered and I have other
> things to do.

Is it worth a code comment or two to make this clearer?

2019-10-21 09:45:41

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

On 19.10.19 04:05, Andrew Morton wrote:
> On Thu, 10 Oct 2019 09:17:42 +0200 David Hildenbrand <[email protected]> wrote:
>
>>>> - pgmap = get_dev_pagemap(pfn, NULL);
>>>> - if (pgmap)
>>>> - return memory_failure_dev_pagemap(pfn, flags, pgmap);
>>>> -
>>>> - p = pfn_to_page(pfn);
>>>
>>> This change seems to assume that memory_failure_dev_pagemap() is never
>>> called for online pages. Is it an intended behavior?
>>> Or the concept "online pages" is not applicable to zone device pages?
>>
>> Yes, that's the real culprit. ZONE_DEVICE/devmem pages are never online
>> (SECTION_IS_ONLINE). The terminology "online" only applies to pages that
>> were given to the buddy. And as we support sup-section hotadd for
>> devmem, we cannot easily make use of the section flag it. I already
>> proposed somewhere to convert SECTION_IS_ONLINE to a subsection bitmap
>> and call it something like pfn_active().
>>
>> pfn_online() would then be "pfn_active() && zone != ZONE_DEVICE". And we
>> could use pfn_active() everywhere to test for initialized memmaps (well,
>> besides some special cases like device reserved memory that does not
>> span full sub-sections). Until now, nobody volunteered and I have other
>> things to do.
>
> Is it worth a code comment or two to make this clearer?

You mean something like

/* Only pages managed by the buddy are online (not ZONE_DEVICE). */

?

--

Thanks,

David / dhildenb