2019-10-10 14:14:21

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1] drivers/base/memory.c: Don't access uninitialized memmaps in soft_offline_page_store()

Uninitialized memmaps contain garbage and in the worst case trigger kernel
BUGs, especially with CONFIG_PAGE_POISONING. They should not get
touched.

Right now, when trying to soft-offline a PFN that resides on a memory
block that was never onlined, one gets a misleading error with
CONFIG_PAGE_POISONING:
:/# echo 5637144576 > /sys/devices/system/memory/soft_offline_page
[ 23.097167] soft offline: 0x150000 page already poisoned

But the actual result depends on the garbage in the memmap.

soft_offline_page() can only work with online pages, it returns -EIO in
case of ZONE_DEVICE. Make sure to only forward pages that are online
(iow, managed by the buddy) and, therefore, have an initialized memmap.

Add a check against pfn_to_online_page() and similarly return -EIO.

Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e86b319
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/base/memory.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 6bea4f3f8040..55907c27075b 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -540,6 +540,9 @@ static ssize_t soft_offline_page_store(struct device *dev,
pfn >>= PAGE_SHIFT;
if (!pfn_valid(pfn))
return -ENXIO;
+ /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
+ if (!pfn_to_online_page(pfn))
+ return -EIO;
ret = soft_offline_page(pfn_to_page(pfn), 0);
return ret == 0 ? count : ret;
}
--
2.21.0


2019-10-11 06:17:59

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v1] drivers/base/memory.c: Don't access uninitialized memmaps in soft_offline_page_store()

On Thu, Oct 10, 2019 at 04:12:00PM +0200, David Hildenbrand wrote:
> Uninitialized memmaps contain garbage and in the worst case trigger kernel
> BUGs, especially with CONFIG_PAGE_POISONING. They should not get
> touched.
>
> Right now, when trying to soft-offline a PFN that resides on a memory
> block that was never onlined, one gets a misleading error with
> CONFIG_PAGE_POISONING:
> :/# echo 5637144576 > /sys/devices/system/memory/soft_offline_page
> [ 23.097167] soft offline: 0x150000 page already poisoned
>
> But the actual result depends on the garbage in the memmap.
>
> soft_offline_page() can only work with online pages, it returns -EIO in
> case of ZONE_DEVICE. Make sure to only forward pages that are online
> (iow, managed by the buddy) and, therefore, have an initialized memmap.
>
> Add a check against pfn_to_online_page() and similarly return -EIO.
>
> Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e86b319
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> drivers/base/memory.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 6bea4f3f8040..55907c27075b 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -540,6 +540,9 @@ static ssize_t soft_offline_page_store(struct device *dev,
> pfn >>= PAGE_SHIFT;
> if (!pfn_valid(pfn))
> return -ENXIO;
> + /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
> + if (!pfn_to_online_page(pfn))
> + return -EIO;

Acked-by: Naoya Horiguchi <[email protected]>

I think this check could be placed in soft_offline_page(), but that requires
a few more unrelated lines of changes due to the mismatch on type of parameter
between memory_failure() and soft_offline_page(), This is not your problem,
and I plan to do some cleanup on related interfaces, so this patch is fine.

- Naoya Horiguchi

2019-10-11 09:52:22

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1] drivers/base/memory.c: Don't access uninitialized memmaps in soft_offline_page_store()

On 11.10.19 08:13, Naoya Horiguchi wrote:
> On Thu, Oct 10, 2019 at 04:12:00PM +0200, David Hildenbrand wrote:
>> Uninitialized memmaps contain garbage and in the worst case trigger kernel
>> BUGs, especially with CONFIG_PAGE_POISONING. They should not get
>> touched.
>>
>> Right now, when trying to soft-offline a PFN that resides on a memory
>> block that was never onlined, one gets a misleading error with
>> CONFIG_PAGE_POISONING:
>> :/# echo 5637144576 > /sys/devices/system/memory/soft_offline_page
>> [ 23.097167] soft offline: 0x150000 page already poisoned
>>
>> But the actual result depends on the garbage in the memmap.
>>
>> soft_offline_page() can only work with online pages, it returns -EIO in
>> case of ZONE_DEVICE. Make sure to only forward pages that are online
>> (iow, managed by the buddy) and, therefore, have an initialized memmap.
>>
>> Add a check against pfn_to_online_page() and similarly return -EIO.
>>
>> Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e86b319
>> Cc: Greg Kroah-Hartman <[email protected]>
>> Cc: "Rafael J. Wysocki" <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> drivers/base/memory.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>> index 6bea4f3f8040..55907c27075b 100644
>> --- a/drivers/base/memory.c
>> +++ b/drivers/base/memory.c
>> @@ -540,6 +540,9 @@ static ssize_t soft_offline_page_store(struct device *dev,
>> pfn >>= PAGE_SHIFT;
>> if (!pfn_valid(pfn))
>> return -ENXIO;
>> + /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
>> + if (!pfn_to_online_page(pfn))
>> + return -EIO;
>
> Acked-by: Naoya Horiguchi <[email protected]>
>
> I think this check could be placed in soft_offline_page(), but that requires
> a few more unrelated lines of changes due to the mismatch on type of parameter
> between memory_failure() and soft_offline_page(), This is not your problem,
> and I plan to do some cleanup on related interfaces, so this patch is fine.
>

Thanks,

well I think when you come via madvise(), you are always guaranteed to
hold a reasonable page in your hands. Only when converting from
arbitrary pfns, we have to watch out. But yeah, feel free to cc me on
cleanups :)

> - Naoya Horiguchi
>


--

Thanks,

David / dhildenb

2019-10-11 22:40:15

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v1] drivers/base/memory.c: Don't access uninitialized memmaps in soft_offline_page_store()

On Thu, 10 Oct 2019 16:12:00 +0200 David Hildenbrand <[email protected]> wrote:

> Uninitialized memmaps contain garbage and in the worst case trigger kernel
> BUGs, especially with CONFIG_PAGE_POISONING. They should not get
> touched.
>
> Right now, when trying to soft-offline a PFN that resides on a memory
> block that was never onlined, one gets a misleading error with
> CONFIG_PAGE_POISONING:
> :/# echo 5637144576 > /sys/devices/system/memory/soft_offline_page
> [ 23.097167] soft offline: 0x150000 page already poisoned
>
> But the actual result depends on the garbage in the memmap.
>
> soft_offline_page() can only work with online pages, it returns -EIO in
> case of ZONE_DEVICE. Make sure to only forward pages that are online
> (iow, managed by the buddy) and, therefore, have an initialized memmap.
>
> Add a check against pfn_to_online_page() and similarly return -EIO.
>
> Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e86b319

Should this be cc:stable?

What is the relationship between this and some similar fixes in the
series "mm/memory_hotplug: Shrink zones before removing memory", v6?

Should any of the patches in "mm/memory_hotplug: Shrink zones before
removing memory", v6 be cc:stable?

2019-10-14 08:32:19

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1] drivers/base/memory.c: Don't access uninitialized memmaps in soft_offline_page_store()

On 12.10.19 00:16, Andrew Morton wrote:
> On Thu, 10 Oct 2019 16:12:00 +0200 David Hildenbrand <[email protected]> wrote:
>
>> Uninitialized memmaps contain garbage and in the worst case trigger kernel
>> BUGs, especially with CONFIG_PAGE_POISONING. They should not get
>> touched.
>>
>> Right now, when trying to soft-offline a PFN that resides on a memory
>> block that was never onlined, one gets a misleading error with
>> CONFIG_PAGE_POISONING:
>> :/# echo 5637144576 > /sys/devices/system/memory/soft_offline_page
>> [ 23.097167] soft offline: 0x150000 page already poisoned
>>
>> But the actual result depends on the garbage in the memmap.
>>
>> soft_offline_page() can only work with online pages, it returns -EIO in
>> case of ZONE_DEVICE. Make sure to only forward pages that are online
>> (iow, managed by the buddy) and, therefore, have an initialized memmap.
>>
>> Add a check against pfn_to_online_page() and similarly return -EIO.
>>
>> Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e86b319
>
> Should this be cc:stable?

I think yes, more on that below.

>
> What is the relationship between this and some similar fixes in the
> series "mm/memory_hotplug: Shrink zones before removing memory", v6?

In general, they all have the same root cause. With f1dd2cd13c4b, we
started to initialize the memmap when onlining memory, however, we at
least zeroed it out when adding the memory. With d0dc12e86b319 we
removed the zeroing, and added conditional poisoning instead.

All these BUGs can be reproduced by adding memory and keeping some
memory blocks offline. Most distributions either online memory directly
in the kernel when added or userspace onlines it via udev rules. s390x
is special, because there we don't online memory blocks as default in
user space. So on !s390x systems, these BUGs are quite hard to reproduce.

With "mm/memory_hotplug: Shrink zones before removing memory" these BUGs
get easier to reproduce, because it is now sufficient to offline a
memory block that was already onlined.

Also, devmem with "driver reserved memory" (for which part we don't
initialize the memmap) is able to trigger these BUGs, but that feature
is more recent AFAIK.

So, cc:stable, I am not sure if it applies to all patches. Some really
only trigger when page poisoning is active, but don't result in any
damage (as so far observed). We really produce damage in case we
de-reference the NID/zone via the garbage memmap (and probably when
doing a page_to_pfn(pfn_to_page(gargabe_page))).

But here, it is quite hard to tell what could happen, so I guess, if in
doubt, better add cc:stable?

>
> Should any of the patches in "mm/memory_hotplug: Shrink zones before
> removing memory", v6 be cc:stable?
>

I'll go over all patches and reply to the relevant ones.



So for this patch, please add:

Cc: [email protected] # v4.13+

--

Thanks,

David / dhildenb

2019-10-14 13:30:18

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v1] drivers/base/memory.c: Don't access uninitialized memmaps in soft_offline_page_store()

On Thu 10-10-19 16:12:00, David Hildenbrand wrote:
> Uninitialized memmaps contain garbage and in the worst case trigger kernel
> BUGs, especially with CONFIG_PAGE_POISONING. They should not get
> touched.
>
> Right now, when trying to soft-offline a PFN that resides on a memory
> block that was never onlined, one gets a misleading error with
> CONFIG_PAGE_POISONING:
> :/# echo 5637144576 > /sys/devices/system/memory/soft_offline_page
> [ 23.097167] soft offline: 0x150000 page already poisoned
>
> But the actual result depends on the garbage in the memmap.
>
> soft_offline_page() can only work with online pages, it returns -EIO in
> case of ZONE_DEVICE. Make sure to only forward pages that are online
> (iow, managed by the buddy) and, therefore, have an initialized memmap.
>
> Add a check against pfn_to_online_page() and similarly return -EIO.
>
> Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e86b319
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Acked-by: Michal Hocko <[email protected]>

> ---
> drivers/base/memory.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 6bea4f3f8040..55907c27075b 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -540,6 +540,9 @@ static ssize_t soft_offline_page_store(struct device *dev,
> pfn >>= PAGE_SHIFT;
> if (!pfn_valid(pfn))
> return -ENXIO;
> + /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
> + if (!pfn_to_online_page(pfn))
> + return -EIO;
> ret = soft_offline_page(pfn_to_page(pfn), 0);
> return ret == 0 ? count : ret;
> }
> --
> 2.21.0

--
Michal Hocko
SUSE Labs