2021-04-29 12:26:38

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 0/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages

This is (obviously) for v5.13++; no need to rush with review, but I decided
to send it around right away.

Looking for places where the kernel might unconditionally read
PageOffline() pages, I stumbled over /proc/kcore; turns out /proc/kcore
needs some more love to not touch some other pages we really don't want to
read -- i.e., hwpoisoned.

Examples for PageOffline() pages are pages inflated in a balloon,
memory unplugged via virtio-mem, and partially-present sections in
memory added by the Hyper-V balloon.

When reading pages inflated in a balloon, we essentially produce
unnecessary load in the hypervisor; holes in partially present sections in
case of Hyper-V are not accessible and already were a problem for
/proc/vmcore, fixed in makedumpfile by detecting PageOffline() pages. In
the future, virtio-mem might disallow reading unplugged memory -- marked
as PageOffline() -- in some environments, resulting in undefined behavior
when accessed; therefore, I'm trying to identify and rework all these
(corner) cases.

With this series, there is really only access via /dev/mem, /proc/vmcore
and kdb left after I ripped out /dev/kmem. kdb is an advanced corner-case
use case -- we won't care for now if someone explicitly tries to do nasty
things by reading from/writing to physical addresses we better not touch.
/dev/mem is a use case we won't support for virtio-mem, at least for now,
so we'll simply disallow mapping any virtio-mem memory via /dev/mem next.
/proc/vmcore is really only a problem when dumping the old kernel via
something that's not makedumpfile (read: basically never), however, we'll
try sanitizing that as well in the second kernel in the future.

Tested via kcore_dump:
https://github.com/schlafwandler/kcore_dump

Cc: Andrew Morton <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Aili Yao <[email protected]>
Cc: Jiri Bohac <[email protected]>
Cc: "K. Y. Srinivasan" <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Cc: Wei Liu <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

David Hildenbrand (7):
fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER
fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM
mm: rename and move page_is_poisoned()
fs/proc/kcore: don't read offline sections, logically offline pages
and hwpoisoned pages
mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize
setting PageOffline()
virtio-mem: use page_offline_(start|end) when setting PageOffline()
fs/proc/kcore: use page_offline_(freeze|unfreeze)

drivers/virtio/virtio_mem.c | 2 ++
fs/proc/kcore.c | 69 ++++++++++++++++++++++++++++++-------
include/linux/kcore.h | 3 --
include/linux/page-flags.h | 12 +++++++
mm/gup.c | 6 +++-
mm/internal.h | 20 -----------
mm/util.c | 40 +++++++++++++++++++++
7 files changed, 115 insertions(+), 37 deletions(-)


base-commit: 9f4ad9e425a1d3b6a34617b8ea226d56a119a717
--
2.30.2


2021-04-29 12:26:55

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 1/7] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER

Commit db779ef67ffe ("proc/kcore: Remove unused kclist_add_remap()")
removed the last user of KCORE_REMAP.

Commit 595dd46ebfc1 ("vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when
dumping vsyscall user page") removed the last user of KCORE_OTHER.

Let's drop both types. While at it, also drop vaddr in "struct
kcore_list", used by KCORE_REMAP only.

Signed-off-by: David Hildenbrand <[email protected]>
---
fs/proc/kcore.c | 7 ++-----
include/linux/kcore.h | 3 ---
2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 4d2e64e9016c..09f77d3c6e15 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -380,11 +380,8 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
phdr->p_type = PT_LOAD;
phdr->p_flags = PF_R | PF_W | PF_X;
phdr->p_offset = kc_vaddr_to_offset(m->addr) + data_offset;
- if (m->type == KCORE_REMAP)
- phdr->p_vaddr = (size_t)m->vaddr;
- else
- phdr->p_vaddr = (size_t)m->addr;
- if (m->type == KCORE_RAM || m->type == KCORE_REMAP)
+ phdr->p_vaddr = (size_t)m->addr;
+ if (m->type == KCORE_RAM)
phdr->p_paddr = __pa(m->addr);
else if (m->type == KCORE_TEXT)
phdr->p_paddr = __pa_symbol(m->addr);
diff --git a/include/linux/kcore.h b/include/linux/kcore.h
index da676cdbd727..86c0f1d18998 100644
--- a/include/linux/kcore.h
+++ b/include/linux/kcore.h
@@ -11,14 +11,11 @@ enum kcore_type {
KCORE_RAM,
KCORE_VMEMMAP,
KCORE_USER,
- KCORE_OTHER,
- KCORE_REMAP,
};

struct kcore_list {
struct list_head list;
unsigned long addr;
- unsigned long vaddr;
size_t size;
int type;
};
--
2.30.2

2021-04-29 12:27:09

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 2/7] fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM

Let's resturcture the code, using switch-case, and checking pfn_is_ram()
only when we are dealing with KCORE_RAM.

Signed-off-by: David Hildenbrand <[email protected]>
---
fs/proc/kcore.c | 35 +++++++++++++++++++++++++++--------
1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 09f77d3c6e15..ed6fbb3bd50c 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -483,25 +483,36 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
goto out;
}
m = NULL; /* skip the list anchor */
- } else if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
- if (clear_user(buffer, tsz)) {
- ret = -EFAULT;
- goto out;
- }
- } else if (m->type == KCORE_VMALLOC) {
+ goto skip;
+ }
+
+ switch (m->type) {
+ case KCORE_VMALLOC:
vread(buf, (char *)start, tsz);
/* we have to zero-fill user buffer even if no read */
if (copy_to_user(buffer, buf, tsz)) {
ret = -EFAULT;
goto out;
}
- } else if (m->type == KCORE_USER) {
+ break;
+ case KCORE_USER:
/* User page is handled prior to normal kernel page: */
if (copy_to_user(buffer, (char *)start, tsz)) {
ret = -EFAULT;
goto out;
}
- } else {
+ break;
+ case KCORE_RAM:
+ if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
+ if (clear_user(buffer, tsz)) {
+ ret = -EFAULT;
+ goto out;
+ }
+ break;
+ }
+ fallthrough;
+ case KCORE_VMEMMAP:
+ case KCORE_TEXT:
if (kern_addr_valid(start)) {
/*
* Using bounce buffer to bypass the
@@ -525,7 +536,15 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
goto out;
}
}
+ break;
+ default:
+ pr_warn_once("Unhandled KCORE type: %d\n", m->type);
+ if (clear_user(buffer, tsz)) {
+ ret = -EFAULT;
+ goto out;
+ }
}
+skip:
buflen -= tsz;
*fpos += tsz;
buffer += tsz;
--
2.30.2

2021-04-29 12:27:35

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 3/7] mm: rename and move page_is_poisoned()

Commit d3378e86d182 ("mm/gup: check page posion status for coredump.")
introduced page_is_poisoned(), however, v5 [1] of the patch used
"page_is_hwpoison()" and something went wrong while upstreaming. Rename the
function and move it to page-flags.h, from where it can be used in other
-- kcore -- context.

Move the comment to the place where it belongs and simplify.

[1] https://lkml.kernel.org/r/20210322193318.377c9ce9@alex-virtual-machine

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/page-flags.h | 7 +++++++
mm/gup.c | 6 +++++-
mm/internal.h | 20 --------------------
3 files changed, 12 insertions(+), 21 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 04a34c08e0a6..b8c56672a588 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -694,6 +694,13 @@ PAGEFLAG_FALSE(DoubleMap)
TESTSCFLAG_FALSE(DoubleMap)
#endif

+static inline bool is_page_hwpoison(struct page *page)
+{
+ if (PageHWPoison(page))
+ return true;
+ return PageHuge(page) && PageHWPoison(compound_head(page));
+}
+
/*
* For pages that are never mapped to userspace (and aren't PageSlab),
* page_type may be used. Because it is initialised to -1, we invert the
diff --git a/mm/gup.c b/mm/gup.c
index ef7d2da9f03f..000f3303e7f2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1536,7 +1536,11 @@ struct page *get_dump_page(unsigned long addr)
if (locked)
mmap_read_unlock(mm);

- if (ret == 1 && is_page_poisoned(page))
+ /*
+ * We might have hwpoisoned pages still mapped into user space. Don't
+ * read these pages when creating a coredump, access could be fatal.
+ */
+ if (ret == 1 && is_page_hwpoison(page))
return NULL;

return (ret == 1) ? page : NULL;
diff --git a/mm/internal.h b/mm/internal.h
index cb3c5e0a7799..1432feec62df 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -97,26 +97,6 @@ static inline void set_page_refcounted(struct page *page)
set_page_count(page, 1);
}

-/*
- * When kernel touch the user page, the user page may be have been marked
- * poison but still mapped in user space, if without this page, the kernel
- * can guarantee the data integrity and operation success, the kernel is
- * better to check the posion status and avoid touching it, be good not to
- * panic, coredump for process fatal signal is a sample case matching this
- * scenario. Or if kernel can't guarantee the data integrity, it's better
- * not to call this function, let kernel touch the poison page and get to
- * panic.
- */
-static inline bool is_page_poisoned(struct page *page)
-{
- if (PageHWPoison(page))
- return true;
- else if (PageHuge(page) && PageHWPoison(compound_head(page)))
- return true;
-
- return false;
-}
-
extern unsigned long highest_memmap_pfn;

/*
--
2.30.2

2021-04-29 12:28:05

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 5/7] mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize setting PageOffline()

A driver might set a page logically offline -- PageOffline() -- and
turn the page inaccessible in the hypervisor; after that, access to page
content can be fatal. One example is virtio-mem; while unplugged memory
-- marked as PageOffline() can currently be read in the hypervisor, this
will no longer be the case in the future; for example, when having
a virtio-mem device backed by huge pages in the hypervisor.

Some special PFN walkers -- i.e., /proc/kcore -- read content of random
pages after checking PageOffline(); however, these PFN walkers can race
with drivers that set PageOffline().

Let's introduce page_offline_(begin|end|freeze|unfreeze) for
synchronizing.

page_offline_freeze()/page_offline_unfreeze() allows for a subsystem to
synchronize with such drivers, achieving that a page cannot be set
PageOffline() while frozen.

page_offline_begin()/page_offline_end() is used by drivers that care about
such races when setting a page PageOffline().

For simplicity, use a rwsem for now; neither drivers nor users are
performance sensitive.

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/page-flags.h | 5 +++++
mm/util.c | 38 ++++++++++++++++++++++++++++++++++++++
2 files changed, 43 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b8c56672a588..e3d00c72f459 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -767,6 +767,11 @@ PAGE_TYPE_OPS(Buddy, buddy)
*/
PAGE_TYPE_OPS(Offline, offline)

+extern void page_offline_freeze(void);
+extern void page_offline_unfreeze(void);
+extern void page_offline_begin(void);
+extern void page_offline_end(void);
+
/*
* Marks pages in use as page tables.
*/
diff --git a/mm/util.c b/mm/util.c
index 54870226cea6..95395d4e4209 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1013,3 +1013,41 @@ void mem_dump_obj(void *object)
}
pr_cont(" non-slab/vmalloc memory.\n");
}
+
+/*
+ * A driver might set a page logically offline -- PageOffline() -- and
+ * turn the page inaccessible in the hypervisor; after that, access to page
+ * content can be fatal.
+ *
+ * Some special PFN walkers -- i.e., /proc/kcore -- read content of random
+ * pages after checking PageOffline(); however, these PFN walkers can race
+ * with drivers that set PageOffline().
+ *
+ * page_offline_freeze()/page_offline_unfreeze() allows for a subsystem to
+ * synchronize with such drivers, achieving that a page cannot be set
+ * PageOffline() while frozen.
+ *
+ * page_offline_begin()/page_offline_end() is used by drivers that care about
+ * such races when setting a page PageOffline().
+ */
+static DECLARE_RWSEM(page_offline_rwsem);
+
+void page_offline_freeze(void)
+{
+ down_read(&page_offline_rwsem);
+}
+
+void page_offline_unfreeze(void)
+{
+ up_read(&page_offline_rwsem);
+}
+
+void page_offline_begin(void)
+{
+ down_write(&page_offline_rwsem);
+}
+
+void page_offline_end(void)
+{
+ up_write(&page_offline_rwsem);
+}
--
2.30.2

2021-04-29 12:28:49

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 4/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages

Let's avoid reading:

1) Offline memory sections: the content of offline memory sections is stale
as the memory is effectively unused by the kernel. On s390x with standby
memory, offline memory sections (belonging to offline storage
increments) are not accessible. With virtio-mem and the hyper-v balloon,
we can have unavailable memory chunks that should not be accessed inside
offline memory sections. Last but not least, offline memory sections
might contain hwpoisoned pages which we can no longer identify
because the memmap is stale.

2) PG_offline pages: logically offline pages that are documented as
"The content of these pages is effectively stale. Such pages should not
be touched (read/write/dump/save) except by their owner.".
Examples include pages inflated in a balloon or unavailble memory
ranges inside hotplugged memory sections with virtio-mem or the hyper-v
balloon.

3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal.
As documented: "Accessing is not safe since it may cause another machine
check. Don't touch!"

Reading /proc/kcore now performs similar checks as when reading
/proc/vmcore for kdump via makedumpfile: problematic pages are exclude.
It's also similar to hibernation code, however, we don't skip hwpoisoned
pages when processing pages in kernel/power/snapshot.c:saveable_page() yet.

Note 1: we can race against memory offlining code, especially
memory going offline and getting unplugged: however, we will properly tear
down the identity mapping and handle faults gracefully when accessing
this memory from kcore code.

Note 2: we can race against drivers setting PageOffline() and turning
memory inaccessible in the hypervisor. We'll handle this in a follow-up
patch.

Signed-off-by: David Hildenbrand <[email protected]>
---
fs/proc/kcore.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index ed6fbb3bd50c..92ff1e4436cb 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -465,6 +465,9 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)

m = NULL;
while (buflen) {
+ struct page *page;
+ unsigned long pfn;
+
/*
* If this is the first iteration or the address is not within
* the previous entry, search for a matching entry.
@@ -503,7 +506,16 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
}
break;
case KCORE_RAM:
- if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
+ pfn = __pa(start) >> PAGE_SHIFT;
+ page = pfn_to_online_page(pfn);
+
+ /*
+ * Don't read offline sections, logically offline pages
+ * (e.g., inflated in a balloon), hwpoisoned pages,
+ * and explicitly excluded physical ranges.
+ */
+ if (!page || PageOffline(page) ||
+ is_page_hwpoison(page) || !pfn_is_ram(pfn)) {
if (clear_user(buffer, tsz)) {
ret = -EFAULT;
goto out;
--
2.30.2

2021-04-29 12:29:24

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 6/7] virtio-mem: use page_offline_(start|end) when setting PageOffline()

Let's properly use page_offline_(start|end) to synchronize setting
PageOffline(), so we won't have valid page access to unplugged memory
regions from /proc/kcore.

Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/virtio/virtio_mem.c | 2 ++
mm/util.c | 2 ++
2 files changed, 4 insertions(+)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 10ec60d81e84..dc2a2e2b2ff8 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1065,6 +1065,7 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
static void virtio_mem_set_fake_offline(unsigned long pfn,
unsigned long nr_pages, bool onlined)
{
+ page_offline_begin();
for (; nr_pages--; pfn++) {
struct page *page = pfn_to_page(pfn);

@@ -1075,6 +1076,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
ClearPageReserved(page);
}
}
+ page_offline_end();
}

/*
diff --git a/mm/util.c b/mm/util.c
index 95395d4e4209..d0e357bd65e6 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1046,8 +1046,10 @@ void page_offline_begin(void)
{
down_write(&page_offline_rwsem);
}
+EXPORT_SYMBOL(page_offline_begin);

void page_offline_end(void)
{
up_write(&page_offline_rwsem);
}
+EXPORT_SYMBOL(page_offline_end);
--
2.30.2

2021-04-29 12:29:37

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze)

Let's properly synchronize with drivers that set PageOffline(). Unfreeze
every now and then, so drivers that want to set PageOffline() can make
progress.

Signed-off-by: David Hildenbrand <[email protected]>
---
fs/proc/kcore.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 92ff1e4436cb..3d7531f47389 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -311,6 +311,7 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
static ssize_t
read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
{
+ size_t page_offline_frozen = 0;
char *buf = file->private_data;
size_t phdrs_offset, notes_offset, data_offset;
size_t phdrs_len, notes_len;
@@ -509,6 +510,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
pfn = __pa(start) >> PAGE_SHIFT;
page = pfn_to_online_page(pfn);

+ /*
+ * Don't race against drivers that set PageOffline()
+ * and expect no further page access.
+ */
+ if (page_offline_frozen == MAX_ORDER_NR_PAGES) {
+ page_offline_unfreeze();
+ page_offline_frozen = 0;
+ cond_resched();
+ }
+ if (!page_offline_frozen++)
+ page_offline_freeze();
+
/*
* Don't read offline sections, logically offline pages
* (e.g., inflated in a balloon), hwpoisoned pages,
@@ -565,6 +578,8 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
}

out:
+ if (page_offline_frozen)
+ page_offline_unfreeze();
up_read(&kclist_lock);
if (ret)
return ret;
--
2.30.2

2021-05-02 06:34:36

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v1 2/7] fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM

On Thu, Apr 29, 2021 at 02:25:14PM +0200, David Hildenbrand wrote:
> Let's resturcture the code, using switch-case, and checking pfn_is_ram()
> only when we are dealing with KCORE_RAM.
>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport <[email protected]>

> ---
> fs/proc/kcore.c | 35 +++++++++++++++++++++++++++--------
> 1 file changed, 27 insertions(+), 8 deletions(-)
>
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index 09f77d3c6e15..ed6fbb3bd50c 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -483,25 +483,36 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> goto out;
> }
> m = NULL; /* skip the list anchor */
> - } else if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
> - if (clear_user(buffer, tsz)) {
> - ret = -EFAULT;
> - goto out;
> - }
> - } else if (m->type == KCORE_VMALLOC) {
> + goto skip;
> + }
> +
> + switch (m->type) {
> + case KCORE_VMALLOC:
> vread(buf, (char *)start, tsz);
> /* we have to zero-fill user buffer even if no read */
> if (copy_to_user(buffer, buf, tsz)) {
> ret = -EFAULT;
> goto out;
> }
> - } else if (m->type == KCORE_USER) {
> + break;
> + case KCORE_USER:
> /* User page is handled prior to normal kernel page: */
> if (copy_to_user(buffer, (char *)start, tsz)) {
> ret = -EFAULT;
> goto out;
> }
> - } else {
> + break;
> + case KCORE_RAM:
> + if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
> + if (clear_user(buffer, tsz)) {
> + ret = -EFAULT;
> + goto out;
> + }
> + break;
> + }
> + fallthrough;
> + case KCORE_VMEMMAP:
> + case KCORE_TEXT:
> if (kern_addr_valid(start)) {
> /*
> * Using bounce buffer to bypass the
> @@ -525,7 +536,15 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> goto out;
> }
> }
> + break;
> + default:
> + pr_warn_once("Unhandled KCORE type: %d\n", m->type);
> + if (clear_user(buffer, tsz)) {
> + ret = -EFAULT;
> + goto out;
> + }
> }
> +skip:
> buflen -= tsz;
> *fpos += tsz;
> buffer += tsz;
> --
> 2.30.2
>

--
Sincerely yours,
Mike.

2021-05-02 06:34:42

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()

On Thu, Apr 29, 2021 at 02:25:15PM +0200, David Hildenbrand wrote:
> Commit d3378e86d182 ("mm/gup: check page posion status for coredump.")
> introduced page_is_poisoned(), however, v5 [1] of the patch used
> "page_is_hwpoison()" and something went wrong while upstreaming. Rename the
> function and move it to page-flags.h, from where it can be used in other
> -- kcore -- context.
>
> Move the comment to the place where it belongs and simplify.
>
> [1] https://lkml.kernel.org/r/20210322193318.377c9ce9@alex-virtual-machine
>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport <[email protected]>

> ---
> include/linux/page-flags.h | 7 +++++++
> mm/gup.c | 6 +++++-
> mm/internal.h | 20 --------------------
> 3 files changed, 12 insertions(+), 21 deletions(-)

Nice :)

>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 04a34c08e0a6..b8c56672a588 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -694,6 +694,13 @@ PAGEFLAG_FALSE(DoubleMap)
> TESTSCFLAG_FALSE(DoubleMap)
> #endif
>
> +static inline bool is_page_hwpoison(struct page *page)
> +{
> + if (PageHWPoison(page))
> + return true;
> + return PageHuge(page) && PageHWPoison(compound_head(page));
> +}
> +
> /*
> * For pages that are never mapped to userspace (and aren't PageSlab),
> * page_type may be used. Because it is initialised to -1, we invert the
> diff --git a/mm/gup.c b/mm/gup.c
> index ef7d2da9f03f..000f3303e7f2 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1536,7 +1536,11 @@ struct page *get_dump_page(unsigned long addr)
> if (locked)
> mmap_read_unlock(mm);
>
> - if (ret == 1 && is_page_poisoned(page))
> + /*
> + * We might have hwpoisoned pages still mapped into user space. Don't
> + * read these pages when creating a coredump, access could be fatal.
> + */
> + if (ret == 1 && is_page_hwpoison(page))
> return NULL;
>
> return (ret == 1) ? page : NULL;
> diff --git a/mm/internal.h b/mm/internal.h
> index cb3c5e0a7799..1432feec62df 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -97,26 +97,6 @@ static inline void set_page_refcounted(struct page *page)
> set_page_count(page, 1);
> }
>
> -/*
> - * When kernel touch the user page, the user page may be have been marked
> - * poison but still mapped in user space, if without this page, the kernel
> - * can guarantee the data integrity and operation success, the kernel is
> - * better to check the posion status and avoid touching it, be good not to
> - * panic, coredump for process fatal signal is a sample case matching this
> - * scenario. Or if kernel can't guarantee the data integrity, it's better
> - * not to call this function, let kernel touch the poison page and get to
> - * panic.
> - */
> -static inline bool is_page_poisoned(struct page *page)
> -{
> - if (PageHWPoison(page))
> - return true;
> - else if (PageHuge(page) && PageHWPoison(compound_head(page)))
> - return true;
> -
> - return false;
> -}
> -
> extern unsigned long highest_memmap_pfn;
>
> /*
> --
> 2.30.2
>

--
Sincerely yours,
Mike.

2021-05-02 06:34:48

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v1 4/7] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages

On Thu, Apr 29, 2021 at 02:25:16PM +0200, David Hildenbrand wrote:
> Let's avoid reading:
>
> 1) Offline memory sections: the content of offline memory sections is stale
> as the memory is effectively unused by the kernel. On s390x with standby
> memory, offline memory sections (belonging to offline storage
> increments) are not accessible. With virtio-mem and the hyper-v balloon,
> we can have unavailable memory chunks that should not be accessed inside
> offline memory sections. Last but not least, offline memory sections
> might contain hwpoisoned pages which we can no longer identify
> because the memmap is stale.
>
> 2) PG_offline pages: logically offline pages that are documented as
> "The content of these pages is effectively stale. Such pages should not
> be touched (read/write/dump/save) except by their owner.".
> Examples include pages inflated in a balloon or unavailble memory
> ranges inside hotplugged memory sections with virtio-mem or the hyper-v
> balloon.
>
> 3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal.
> As documented: "Accessing is not safe since it may cause another machine
> check. Don't touch!"
>
> Reading /proc/kcore now performs similar checks as when reading
> /proc/vmcore for kdump via makedumpfile: problematic pages are exclude.
> It's also similar to hibernation code, however, we don't skip hwpoisoned
> pages when processing pages in kernel/power/snapshot.c:saveable_page() yet.
>
> Note 1: we can race against memory offlining code, especially
> memory going offline and getting unplugged: however, we will properly tear
> down the identity mapping and handle faults gracefully when accessing
> this memory from kcore code.
>
> Note 2: we can race against drivers setting PageOffline() and turning
> memory inaccessible in the hypervisor. We'll handle this in a follow-up
> patch.
>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport <[email protected]>

> ---
> fs/proc/kcore.c | 14 +++++++++++++-
> 1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index ed6fbb3bd50c..92ff1e4436cb 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -465,6 +465,9 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>
> m = NULL;
> while (buflen) {
> + struct page *page;
> + unsigned long pfn;
> +
> /*
> * If this is the first iteration or the address is not within
> * the previous entry, search for a matching entry.
> @@ -503,7 +506,16 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> }
> break;
> case KCORE_RAM:
> - if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
> + pfn = __pa(start) >> PAGE_SHIFT;
> + page = pfn_to_online_page(pfn);
> +
> + /*
> + * Don't read offline sections, logically offline pages
> + * (e.g., inflated in a balloon), hwpoisoned pages,
> + * and explicitly excluded physical ranges.
> + */
> + if (!page || PageOffline(page) ||
> + is_page_hwpoison(page) || !pfn_is_ram(pfn)) {
> if (clear_user(buffer, tsz)) {
> ret = -EFAULT;
> goto out;
> --
> 2.30.2
>

--
Sincerely yours,
Mike.

2021-05-02 06:36:39

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze)

On Thu, Apr 29, 2021 at 02:25:19PM +0200, David Hildenbrand wrote:
> Let's properly synchronize with drivers that set PageOffline(). Unfreeze
> every now and then, so drivers that want to set PageOffline() can make
> progress.
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> fs/proc/kcore.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index 92ff1e4436cb..3d7531f47389 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -311,6 +311,7 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
> static ssize_t
> read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> {
> + size_t page_offline_frozen = 0;
> char *buf = file->private_data;
> size_t phdrs_offset, notes_offset, data_offset;
> size_t phdrs_len, notes_len;
> @@ -509,6 +510,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> pfn = __pa(start) >> PAGE_SHIFT;
> page = pfn_to_online_page(pfn);

Can't this race with page offlining for the first time we get here?

> + /*
> + * Don't race against drivers that set PageOffline()
> + * and expect no further page access.
> + */
> + if (page_offline_frozen == MAX_ORDER_NR_PAGES) {
> + page_offline_unfreeze();
> + page_offline_frozen = 0;
> + cond_resched();
> + }
> + if (!page_offline_frozen++)
> + page_offline_freeze();
> +

Don't we need to freeze before doing pfn_to_online_page()?

> /*
> * Don't read offline sections, logically offline pages
> * (e.g., inflated in a balloon), hwpoisoned pages,
> @@ -565,6 +578,8 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> }
>
> out:
> + if (page_offline_frozen)
> + page_offline_unfreeze();
> up_read(&kclist_lock);
> if (ret)
> return ret;
> --
> 2.30.2
>

--
Sincerely yours,
Mike.

2021-05-02 06:37:18

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v1 1/7] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER

On Thu, Apr 29, 2021 at 02:25:13PM +0200, David Hildenbrand wrote:
> Commit db779ef67ffe ("proc/kcore: Remove unused kclist_add_remap()")
> removed the last user of KCORE_REMAP.
>
> Commit 595dd46ebfc1 ("vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when
> dumping vsyscall user page") removed the last user of KCORE_OTHER.
>
> Let's drop both types. While at it, also drop vaddr in "struct
> kcore_list", used by KCORE_REMAP only.
>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport <[email protected]>

> ---
> fs/proc/kcore.c | 7 ++-----
> include/linux/kcore.h | 3 ---
> 2 files changed, 2 insertions(+), 8 deletions(-)
>
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index 4d2e64e9016c..09f77d3c6e15 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -380,11 +380,8 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> phdr->p_type = PT_LOAD;
> phdr->p_flags = PF_R | PF_W | PF_X;
> phdr->p_offset = kc_vaddr_to_offset(m->addr) + data_offset;
> - if (m->type == KCORE_REMAP)
> - phdr->p_vaddr = (size_t)m->vaddr;
> - else
> - phdr->p_vaddr = (size_t)m->addr;
> - if (m->type == KCORE_RAM || m->type == KCORE_REMAP)
> + phdr->p_vaddr = (size_t)m->addr;
> + if (m->type == KCORE_RAM)
> phdr->p_paddr = __pa(m->addr);
> else if (m->type == KCORE_TEXT)
> phdr->p_paddr = __pa_symbol(m->addr);
> diff --git a/include/linux/kcore.h b/include/linux/kcore.h
> index da676cdbd727..86c0f1d18998 100644
> --- a/include/linux/kcore.h
> +++ b/include/linux/kcore.h
> @@ -11,14 +11,11 @@ enum kcore_type {
> KCORE_RAM,
> KCORE_VMEMMAP,
> KCORE_USER,
> - KCORE_OTHER,
> - KCORE_REMAP,
> };
>
> struct kcore_list {
> struct list_head list;
> unsigned long addr;
> - unsigned long vaddr;
> size_t size;
> int type;
> };
> --
> 2.30.2
>

--
Sincerely yours,
Mike.

2021-05-02 06:37:18

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v1 5/7] mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize setting PageOffline()

On Thu, Apr 29, 2021 at 02:25:17PM +0200, David Hildenbrand wrote:
> A driver might set a page logically offline -- PageOffline() -- and
> turn the page inaccessible in the hypervisor; after that, access to page
> content can be fatal. One example is virtio-mem; while unplugged memory
> -- marked as PageOffline() can currently be read in the hypervisor, this
> will no longer be the case in the future; for example, when having
> a virtio-mem device backed by huge pages in the hypervisor.
>
> Some special PFN walkers -- i.e., /proc/kcore -- read content of random
> pages after checking PageOffline(); however, these PFN walkers can race
> with drivers that set PageOffline().
>
> Let's introduce page_offline_(begin|end|freeze|unfreeze) for

Bikeshed: freeze|thaw?

> synchronizing.
>
> page_offline_freeze()/page_offline_unfreeze() allows for a subsystem to
> synchronize with such drivers, achieving that a page cannot be set
> PageOffline() while frozen.
>
> page_offline_begin()/page_offline_end() is used by drivers that care about
> such races when setting a page PageOffline().
>
> For simplicity, use a rwsem for now; neither drivers nor users are
> performance sensitive.
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> include/linux/page-flags.h | 5 +++++
> mm/util.c | 38 ++++++++++++++++++++++++++++++++++++++
> 2 files changed, 43 insertions(+)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index b8c56672a588..e3d00c72f459 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -767,6 +767,11 @@ PAGE_TYPE_OPS(Buddy, buddy)
> */
> PAGE_TYPE_OPS(Offline, offline)
>
> +extern void page_offline_freeze(void);
> +extern void page_offline_unfreeze(void);
> +extern void page_offline_begin(void);
> +extern void page_offline_end(void);
> +
> /*
> * Marks pages in use as page tables.
> */
> diff --git a/mm/util.c b/mm/util.c
> index 54870226cea6..95395d4e4209 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1013,3 +1013,41 @@ void mem_dump_obj(void *object)
> }
> pr_cont(" non-slab/vmalloc memory.\n");
> }
> +
> +/*
> + * A driver might set a page logically offline -- PageOffline() -- and
> + * turn the page inaccessible in the hypervisor; after that, access to page
> + * content can be fatal.
> + *
> + * Some special PFN walkers -- i.e., /proc/kcore -- read content of random
> + * pages after checking PageOffline(); however, these PFN walkers can race
> + * with drivers that set PageOffline().
> + *
> + * page_offline_freeze()/page_offline_unfreeze() allows for a subsystem to
> + * synchronize with such drivers, achieving that a page cannot be set
> + * PageOffline() while frozen.
> + *
> + * page_offline_begin()/page_offline_end() is used by drivers that care about
> + * such races when setting a page PageOffline().
> + */
> +static DECLARE_RWSEM(page_offline_rwsem);
> +
> +void page_offline_freeze(void)
> +{
> + down_read(&page_offline_rwsem);
> +}
> +
> +void page_offline_unfreeze(void)
> +{
> + up_read(&page_offline_rwsem);
> +}
> +
> +void page_offline_begin(void)
> +{
> + down_write(&page_offline_rwsem);
> +}
> +
> +void page_offline_end(void)
> +{
> + up_write(&page_offline_rwsem);
> +}
> --
> 2.30.2
>

--
Sincerely yours,
Mike.

2021-05-02 06:37:24

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v1 6/7] virtio-mem: use page_offline_(start|end) when setting PageOffline()

On Thu, Apr 29, 2021 at 02:25:18PM +0200, David Hildenbrand wrote:
> Let's properly use page_offline_(start|end) to synchronize setting
> PageOffline(), so we won't have valid page access to unplugged memory
> regions from /proc/kcore.
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> drivers/virtio/virtio_mem.c | 2 ++
> mm/util.c | 2 ++
> 2 files changed, 4 insertions(+)
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 10ec60d81e84..dc2a2e2b2ff8 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -1065,6 +1065,7 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
> static void virtio_mem_set_fake_offline(unsigned long pfn,
> unsigned long nr_pages, bool onlined)
> {
> + page_offline_begin();
> for (; nr_pages--; pfn++) {
> struct page *page = pfn_to_page(pfn);
>
> @@ -1075,6 +1076,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
> ClearPageReserved(page);
> }
> }
> + page_offline_end();

I'm not really familiar with ballooning and memory hotplug, but is it the
only place that needs page_offline_{begin,end} ?

> }
>
> /*
> diff --git a/mm/util.c b/mm/util.c
> index 95395d4e4209..d0e357bd65e6 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1046,8 +1046,10 @@ void page_offline_begin(void)
> {
> down_write(&page_offline_rwsem);
> }
> +EXPORT_SYMBOL(page_offline_begin);

Should have been a part of the previous patch.

> void page_offline_end(void)
> {
> up_write(&page_offline_rwsem);
> }
> +EXPORT_SYMBOL(page_offline_end);

Ditto

> --
> 2.30.2
>

--
Sincerely yours,
Mike.

2021-05-03 08:14:54

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 5/7] mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize setting PageOffline()

On 02.05.21 08:33, Mike Rapoport wrote:
> On Thu, Apr 29, 2021 at 02:25:17PM +0200, David Hildenbrand wrote:
>> A driver might set a page logically offline -- PageOffline() -- and
>> turn the page inaccessible in the hypervisor; after that, access to page
>> content can be fatal. One example is virtio-mem; while unplugged memory
>> -- marked as PageOffline() can currently be read in the hypervisor, this
>> will no longer be the case in the future; for example, when having
>> a virtio-mem device backed by huge pages in the hypervisor.
>>
>> Some special PFN walkers -- i.e., /proc/kcore -- read content of random
>> pages after checking PageOffline(); however, these PFN walkers can race
>> with drivers that set PageOffline().
>>
>> Let's introduce page_offline_(begin|end|freeze|unfreeze) for
>
> Bikeshed: freeze|thaw?
>

Sure :)


--
Thanks,

David / dhildenb

2021-05-03 08:17:40

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 6/7] virtio-mem: use page_offline_(start|end) when setting PageOffline()

On 02.05.21 08:33, Mike Rapoport wrote:
> On Thu, Apr 29, 2021 at 02:25:18PM +0200, David Hildenbrand wrote:
>> Let's properly use page_offline_(start|end) to synchronize setting
>> PageOffline(), so we won't have valid page access to unplugged memory
>> regions from /proc/kcore.
>>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> drivers/virtio/virtio_mem.c | 2 ++
>> mm/util.c | 2 ++
>> 2 files changed, 4 insertions(+)
>>
>> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
>> index 10ec60d81e84..dc2a2e2b2ff8 100644
>> --- a/drivers/virtio/virtio_mem.c
>> +++ b/drivers/virtio/virtio_mem.c
>> @@ -1065,6 +1065,7 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
>> static void virtio_mem_set_fake_offline(unsigned long pfn,
>> unsigned long nr_pages, bool onlined)
>> {
>> + page_offline_begin();
>> for (; nr_pages--; pfn++) {
>> struct page *page = pfn_to_page(pfn);
>>
>> @@ -1075,6 +1076,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
>> ClearPageReserved(page);
>> }
>> }
>> + page_offline_end();
>
> I'm not really familiar with ballooning and memory hotplug, but is it the
> only place that needs page_offline_{begin,end} ?

Existing balloon implementations that I am aware of (Hyper-V, XEN,
virtio-balloon, vmware-balloon) usually allow reading inflated memory;
doing so might result in unnecessary overhead in the hypervisor, so we
really want to avoid it -- but it's strictly not forbidden and has been
working forever. So we barely care about races: if there would be a rare
race, we'd still be able to read that memory.

For virtio-mem, it'll be different in the future when using shmem, huge
pages, !anonymous private mappings, ... as backing storage for a VM;
there will be a virtio spec extension to document that virtio-mem
changes that indicate the new behavior won't allow reading unplugged
memory and doing so will result in undefined behavior.

--
Thanks,

David / dhildenb

2021-05-03 08:24:37

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v1 6/7] virtio-mem: use page_offline_(start|end) when setting PageOffline()

On Thu, Apr 29, 2021 at 02:25:18PM +0200, David Hildenbrand wrote:
> Let's properly use page_offline_(start|end) to synchronize setting
> PageOffline(), so we won't have valid page access to unplugged memory
> regions from /proc/kcore.
>
> Signed-off-by: David Hildenbrand <[email protected]>


the patch looks good to me as such

Acked-by: Michael S. Tsirkin <[email protected]>

Feel free to merge with rest of patcgset - it seems to mostly
live in the fs/mm space.

IF you respin, maybe add the explanation you sent in response to Mike's comments
in the commit log.


> ---
> drivers/virtio/virtio_mem.c | 2 ++
> mm/util.c | 2 ++
> 2 files changed, 4 insertions(+)
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 10ec60d81e84..dc2a2e2b2ff8 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -1065,6 +1065,7 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
> static void virtio_mem_set_fake_offline(unsigned long pfn,
> unsigned long nr_pages, bool onlined)
> {
> + page_offline_begin();
> for (; nr_pages--; pfn++) {
> struct page *page = pfn_to_page(pfn);
>
> @@ -1075,6 +1076,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
> ClearPageReserved(page);
> }
> }
> + page_offline_end();
> }
>
> /*
> diff --git a/mm/util.c b/mm/util.c
> index 95395d4e4209..d0e357bd65e6 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1046,8 +1046,10 @@ void page_offline_begin(void)
> {
> down_write(&page_offline_rwsem);
> }
> +EXPORT_SYMBOL(page_offline_begin);
>
> void page_offline_end(void)
> {
> up_write(&page_offline_rwsem);
> }
> +EXPORT_SYMBOL(page_offline_end);
> --
> 2.30.2

2021-05-03 08:29:24

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze)

On 02.05.21 08:34, Mike Rapoport wrote:
> On Thu, Apr 29, 2021 at 02:25:19PM +0200, David Hildenbrand wrote:
>> Let's properly synchronize with drivers that set PageOffline(). Unfreeze
>> every now and then, so drivers that want to set PageOffline() can make
>> progress.
>>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> fs/proc/kcore.c | 15 +++++++++++++++
>> 1 file changed, 15 insertions(+)
>>
>> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
>> index 92ff1e4436cb..3d7531f47389 100644
>> --- a/fs/proc/kcore.c
>> +++ b/fs/proc/kcore.c
>> @@ -311,6 +311,7 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
>> static ssize_t
>> read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>> {
>> + size_t page_offline_frozen = 0;
>> char *buf = file->private_data;
>> size_t phdrs_offset, notes_offset, data_offset;
>> size_t phdrs_len, notes_len;
>> @@ -509,6 +510,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>> pfn = __pa(start) >> PAGE_SHIFT;
>> page = pfn_to_online_page(pfn);
>
> Can't this race with page offlining for the first time we get here?


To clarify, we have three types of offline pages in the kernel ...

a) Pages part of an offline memory section; the memap is stale and not
trustworthy. pfn_to_online_page() checks that. We *can* protect against
memory offlining using get_online_mems()/put_online_mems(), but usually
avoid doing so as the race window is very small (and a problem all over
the kernel we basically never hit) and locking is rather expensive. In
the future, we might switch to rcu to handle that more efficiently and
avoiding these possible races.

b) PageOffline(): logically offline pages contained in an online memory
section with a sane memmap. virtio-mem calls these pages "fake offline";
something like a "temporary" memory hole. The new mechanism I propose
will be used to handle synchronization as races can be more severe,
e.g., when reading actual page content here.

c) Soft offline pages: hwpoisoned pages that are not actually harmful
yet, but could become harmful in the future. So we better try to remove
the page from the page allcoator and try to migrate away existing users.


So page_offline_* handle "b) PageOffline()" only. There is a tiny race
between pfn_to_online_page(pfn) and looking at the memmap as we have in
many cases already throughout the kernel, to be tackled in the future.


(A better name for PageOffline() might make sense; PageSoftOffline()
would be catchy but interferes with c). PageLogicallyOffline() is ugly;
PageFakeOffline() might do)

>
>> + /*
>> + * Don't race against drivers that set PageOffline()
>> + * and expect no further page access.
>> + */
>> + if (page_offline_frozen == MAX_ORDER_NR_PAGES) {
>> + page_offline_unfreeze();
>> + page_offline_frozen = 0;
>> + cond_resched();
>> + }
>> + if (!page_offline_frozen++)
>> + page_offline_freeze();
>> +
>
> Don't we need to freeze before doing pfn_to_online_page()?

See my explanation above. Thanks!

--
Thanks,

David / dhildenb

2021-05-03 09:53:11

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze)

On Mon, May 03, 2021 at 10:28:36AM +0200, David Hildenbrand wrote:
> On 02.05.21 08:34, Mike Rapoport wrote:
> > On Thu, Apr 29, 2021 at 02:25:19PM +0200, David Hildenbrand wrote:
> > > Let's properly synchronize with drivers that set PageOffline(). Unfreeze
> > > every now and then, so drivers that want to set PageOffline() can make
> > > progress.
> > >
> > > Signed-off-by: David Hildenbrand <[email protected]>
> > > ---
> > > fs/proc/kcore.c | 15 +++++++++++++++
> > > 1 file changed, 15 insertions(+)
> > >
> > > diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> > > index 92ff1e4436cb..3d7531f47389 100644
> > > --- a/fs/proc/kcore.c
> > > +++ b/fs/proc/kcore.c
> > > @@ -311,6 +311,7 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
> > > static ssize_t
> > > read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> > > {
> > > + size_t page_offline_frozen = 0;
> > > char *buf = file->private_data;
> > > size_t phdrs_offset, notes_offset, data_offset;
> > > size_t phdrs_len, notes_len;
> > > @@ -509,6 +510,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> > > pfn = __pa(start) >> PAGE_SHIFT;
> > > page = pfn_to_online_page(pfn);
> >
> > Can't this race with page offlining for the first time we get here?
>
>
> To clarify, we have three types of offline pages in the kernel ...
>
> a) Pages part of an offline memory section; the memap is stale and not
> trustworthy. pfn_to_online_page() checks that. We *can* protect against
> memory offlining using get_online_mems()/put_online_mems(), but usually
> avoid doing so as the race window is very small (and a problem all over the
> kernel we basically never hit) and locking is rather expensive. In the
> future, we might switch to rcu to handle that more efficiently and avoiding
> these possible races.
>
> b) PageOffline(): logically offline pages contained in an online memory
> section with a sane memmap. virtio-mem calls these pages "fake offline";
> something like a "temporary" memory hole. The new mechanism I propose will
> be used to handle synchronization as races can be more severe, e.g., when
> reading actual page content here.
>
> c) Soft offline pages: hwpoisoned pages that are not actually harmful yet,
> but could become harmful in the future. So we better try to remove the page
> from the page allcoator and try to migrate away existing users.
>
>
> So page_offline_* handle "b) PageOffline()" only. There is a tiny race
> between pfn_to_online_page(pfn) and looking at the memmap as we have in many
> cases already throughout the kernel, to be tackled in the future.

Right, but here you anyway add locking, so why exclude the first iteration?

> (A better name for PageOffline() might make sense; PageSoftOffline() would
> be catchy but interferes with c). PageLogicallyOffline() is ugly;
> PageFakeOffline() might do)
>
> > > + /*
> > > + * Don't race against drivers that set PageOffline()
> > > + * and expect no further page access.
> > > + */
> > > + if (page_offline_frozen == MAX_ORDER_NR_PAGES) {
> > > + page_offline_unfreeze();
> > > + page_offline_frozen = 0;
> > > + cond_resched();
> > > + }
> > > + if (!page_offline_frozen++)
> > > + page_offline_freeze();
> > > +

BTW, did you consider something like

if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
page_offline_unfreeze();
cond_resched();
page_offline_freeze();
}

We don't seem to care about page_offline_frozen overflows here, do we?

--
Sincerely yours,
Mike.

2021-05-03 10:50:55

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze)

On 03.05.21 11:28, Mike Rapoport wrote:
> On Mon, May 03, 2021 at 10:28:36AM +0200, David Hildenbrand wrote:
>> On 02.05.21 08:34, Mike Rapoport wrote:
>>> On Thu, Apr 29, 2021 at 02:25:19PM +0200, David Hildenbrand wrote:
>>>> Let's properly synchronize with drivers that set PageOffline(). Unfreeze
>>>> every now and then, so drivers that want to set PageOffline() can make
>>>> progress.
>>>>
>>>> Signed-off-by: David Hildenbrand <[email protected]>
>>>> ---
>>>> fs/proc/kcore.c | 15 +++++++++++++++
>>>> 1 file changed, 15 insertions(+)
>>>>
>>>> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
>>>> index 92ff1e4436cb..3d7531f47389 100644
>>>> --- a/fs/proc/kcore.c
>>>> +++ b/fs/proc/kcore.c
>>>> @@ -311,6 +311,7 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
>>>> static ssize_t
>>>> read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>>>> {
>>>> + size_t page_offline_frozen = 0;
>>>> char *buf = file->private_data;
>>>> size_t phdrs_offset, notes_offset, data_offset;
>>>> size_t phdrs_len, notes_len;
>>>> @@ -509,6 +510,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>>>> pfn = __pa(start) >> PAGE_SHIFT;
>>>> page = pfn_to_online_page(pfn);
>>>
>>> Can't this race with page offlining for the first time we get here?
>>
>>
>> To clarify, we have three types of offline pages in the kernel ...
>>
>> a) Pages part of an offline memory section; the memap is stale and not
>> trustworthy. pfn_to_online_page() checks that. We *can* protect against
>> memory offlining using get_online_mems()/put_online_mems(), but usually
>> avoid doing so as the race window is very small (and a problem all over the
>> kernel we basically never hit) and locking is rather expensive. In the
>> future, we might switch to rcu to handle that more efficiently and avoiding
>> these possible races.
>>
>> b) PageOffline(): logically offline pages contained in an online memory
>> section with a sane memmap. virtio-mem calls these pages "fake offline";
>> something like a "temporary" memory hole. The new mechanism I propose will
>> be used to handle synchronization as races can be more severe, e.g., when
>> reading actual page content here.
>>
>> c) Soft offline pages: hwpoisoned pages that are not actually harmful yet,
>> but could become harmful in the future. So we better try to remove the page
>> from the page allcoator and try to migrate away existing users.
>>
>>
>> So page_offline_* handle "b) PageOffline()" only. There is a tiny race
>> between pfn_to_online_page(pfn) and looking at the memmap as we have in many
>> cases already throughout the kernel, to be tackled in the future.
>
> Right, but here you anyway add locking, so why exclude the first iteration?

What we're protecting is PageOffline() below. If I didn't mess up, we
should always be calling page_offline_freeze() before calling
PageOffline(). Or am I missing something?

>
> BTW, did you consider something like

Yes, I played with something like that. We'd have to handle the first
page_offline_freeze() freeze differently, though, and that's where
things got a bit ugly in my attempts.

>
> if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
> page_offline_unfreeze();
> cond_resched();
> page_offline_freeze();
> }
>
> We don't seem to care about page_offline_frozen overflows here, do we?

No, the buffer size is also size_t and gets incremented on a per-byte
basis. The variant I have right now looked the cleanest to me. Happy to
hear simpler alternatives.


--
Thanks,

David / dhildenb

2021-05-03 12:53:18

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze)

On Mon, May 03, 2021 at 12:13:45PM +0200, David Hildenbrand wrote:
> On 03.05.21 11:28, Mike Rapoport wrote:
> > On Mon, May 03, 2021 at 10:28:36AM +0200, David Hildenbrand wrote:
> > > On 02.05.21 08:34, Mike Rapoport wrote:
> > > > On Thu, Apr 29, 2021 at 02:25:19PM +0200, David Hildenbrand wrote:
> > > > > Let's properly synchronize with drivers that set PageOffline(). Unfreeze
> > > > > every now and then, so drivers that want to set PageOffline() can make
> > > > > progress.
> > > > >
> > > > > Signed-off-by: David Hildenbrand <[email protected]>
> > > > > ---
> > > > > fs/proc/kcore.c | 15 +++++++++++++++
> > > > > 1 file changed, 15 insertions(+)
> > > > >
> > > > > diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> > > > > index 92ff1e4436cb..3d7531f47389 100644
> > > > > --- a/fs/proc/kcore.c
> > > > > +++ b/fs/proc/kcore.c
> > > > > @@ -311,6 +311,7 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
> > > > > static ssize_t
> > > > > read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> > > > > {
> > > > > + size_t page_offline_frozen = 0;
> > > > > char *buf = file->private_data;
> > > > > size_t phdrs_offset, notes_offset, data_offset;
> > > > > size_t phdrs_len, notes_len;
> > > > > @@ -509,6 +510,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> > > > > pfn = __pa(start) >> PAGE_SHIFT;
> > > > > page = pfn_to_online_page(pfn);
> > > >
> > > > Can't this race with page offlining for the first time we get here?
> > >
> > >
> > > To clarify, we have three types of offline pages in the kernel ...
> > >
> > > a) Pages part of an offline memory section; the memap is stale and not
> > > trustworthy. pfn_to_online_page() checks that. We *can* protect against
> > > memory offlining using get_online_mems()/put_online_mems(), but usually
> > > avoid doing so as the race window is very small (and a problem all over the
> > > kernel we basically never hit) and locking is rather expensive. In the
> > > future, we might switch to rcu to handle that more efficiently and avoiding
> > > these possible races.
> > >
> > > b) PageOffline(): logically offline pages contained in an online memory
> > > section with a sane memmap. virtio-mem calls these pages "fake offline";
> > > something like a "temporary" memory hole. The new mechanism I propose will
> > > be used to handle synchronization as races can be more severe, e.g., when
> > > reading actual page content here.
> > >
> > > c) Soft offline pages: hwpoisoned pages that are not actually harmful yet,
> > > but could become harmful in the future. So we better try to remove the page
> > > from the page allcoator and try to migrate away existing users.
> > >
> > >
> > > So page_offline_* handle "b) PageOffline()" only. There is a tiny race
> > > between pfn_to_online_page(pfn) and looking at the memmap as we have in many
> > > cases already throughout the kernel, to be tackled in the future.
> >
> > Right, but here you anyway add locking, so why exclude the first iteration?
>
> What we're protecting is PageOffline() below. If I didn't mess up, we should
> always be calling page_offline_freeze() before calling PageOffline(). Or am
> I missing something?

Somehow I was under impression we are protecting both pfn_to_online_page()
and PageOffline().

> > BTW, did you consider something like
>
> Yes, I played with something like that. We'd have to handle the first
> page_offline_freeze() freeze differently, though, and that's where things
> got a bit ugly in my attempts.
>
> >
> > if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
> > page_offline_unfreeze();
> > cond_resched();
> > page_offline_freeze();
> > }
> >
> > We don't seem to care about page_offline_frozen overflows here, do we?
>
> No, the buffer size is also size_t and gets incremented on a per-byte basis.
> The variant I have right now looked the cleanest to me. Happy to hear
> simpler alternatives.

Well, locking for the first time before the while() loop and doing
resched-relock outside switch() would be definitely nicer, and it makes the
last unlock unconditional.

The cost of prevention of memory offline during reads of !KCORE_RAM parts
does not seem that significant to me, but I may be missing something.

--
Sincerely yours,
Mike.

2021-05-03 12:53:18

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze)

On 03.05.21 13:33, Mike Rapoport wrote:
> On Mon, May 03, 2021 at 12:13:45PM +0200, David Hildenbrand wrote:
>> On 03.05.21 11:28, Mike Rapoport wrote:
>>> On Mon, May 03, 2021 at 10:28:36AM +0200, David Hildenbrand wrote:
>>>> On 02.05.21 08:34, Mike Rapoport wrote:
>>>>> On Thu, Apr 29, 2021 at 02:25:19PM +0200, David Hildenbrand wrote:
>>>>>> Let's properly synchronize with drivers that set PageOffline(). Unfreeze
>>>>>> every now and then, so drivers that want to set PageOffline() can make
>>>>>> progress.
>>>>>>
>>>>>> Signed-off-by: David Hildenbrand <[email protected]>
>>>>>> ---
>>>>>> fs/proc/kcore.c | 15 +++++++++++++++
>>>>>> 1 file changed, 15 insertions(+)
>>>>>>
>>>>>> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
>>>>>> index 92ff1e4436cb..3d7531f47389 100644
>>>>>> --- a/fs/proc/kcore.c
>>>>>> +++ b/fs/proc/kcore.c
>>>>>> @@ -311,6 +311,7 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
>>>>>> static ssize_t
>>>>>> read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>>>>>> {
>>>>>> + size_t page_offline_frozen = 0;
>>>>>> char *buf = file->private_data;
>>>>>> size_t phdrs_offset, notes_offset, data_offset;
>>>>>> size_t phdrs_len, notes_len;
>>>>>> @@ -509,6 +510,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>>>>>> pfn = __pa(start) >> PAGE_SHIFT;
>>>>>> page = pfn_to_online_page(pfn);
>>>>>
>>>>> Can't this race with page offlining for the first time we get here?
>>>>
>>>>
>>>> To clarify, we have three types of offline pages in the kernel ...
>>>>
>>>> a) Pages part of an offline memory section; the memap is stale and not
>>>> trustworthy. pfn_to_online_page() checks that. We *can* protect against
>>>> memory offlining using get_online_mems()/put_online_mems(), but usually
>>>> avoid doing so as the race window is very small (and a problem all over the
>>>> kernel we basically never hit) and locking is rather expensive. In the
>>>> future, we might switch to rcu to handle that more efficiently and avoiding
>>>> these possible races.
>>>>
>>>> b) PageOffline(): logically offline pages contained in an online memory
>>>> section with a sane memmap. virtio-mem calls these pages "fake offline";
>>>> something like a "temporary" memory hole. The new mechanism I propose will
>>>> be used to handle synchronization as races can be more severe, e.g., when
>>>> reading actual page content here.
>>>>
>>>> c) Soft offline pages: hwpoisoned pages that are not actually harmful yet,
>>>> but could become harmful in the future. So we better try to remove the page
>>>> from the page allcoator and try to migrate away existing users.
>>>>
>>>>
>>>> So page_offline_* handle "b) PageOffline()" only. There is a tiny race
>>>> between pfn_to_online_page(pfn) and looking at the memmap as we have in many
>>>> cases already throughout the kernel, to be tackled in the future.
>>>
>>> Right, but here you anyway add locking, so why exclude the first iteration?
>>
>> What we're protecting is PageOffline() below. If I didn't mess up, we should
>> always be calling page_offline_freeze() before calling PageOffline(). Or am
>> I missing something?
>
> Somehow I was under impression we are protecting both pfn_to_online_page()
> and PageOffline().
>
>>> BTW, did you consider something like
>>
>> Yes, I played with something like that. We'd have to handle the first
>> page_offline_freeze() freeze differently, though, and that's where things
>> got a bit ugly in my attempts.
>>
>>>
>>> if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
>>> page_offline_unfreeze();
>>> cond_resched();
>>> page_offline_freeze();
>>> }
>>>
>>> We don't seem to care about page_offline_frozen overflows here, do we?
>>
>> No, the buffer size is also size_t and gets incremented on a per-byte basis.
>> The variant I have right now looked the cleanest to me. Happy to hear
>> simpler alternatives.
>
> Well, locking for the first time before the while() loop and doing
> resched-relock outside switch() would be definitely nicer, and it makes the
> last unlock unconditional.
>
> The cost of prevention of memory offline during reads of !KCORE_RAM parts
> does not seem that significant to me, but I may be missing something.

Also true, I'll have a look if I can just simplify that.

--
Thanks,

David / dhildenb

2021-05-05 13:41:30

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()

On 05.05.21 15:13, Michal Hocko wrote:
> On Thu 29-04-21 14:25:15, David Hildenbrand wrote:
>> Commit d3378e86d182 ("mm/gup: check page posion status for coredump.")
>> introduced page_is_poisoned(), however, v5 [1] of the patch used
>> "page_is_hwpoison()" and something went wrong while upstreaming. Rename the
>> function and move it to page-flags.h, from where it can be used in other
>> -- kcore -- context.
>>
>> Move the comment to the place where it belongs and simplify.
>>
>> [1] https://lkml.kernel.org/r/20210322193318.377c9ce9@alex-virtual-machine
>>
>> Signed-off-by: David Hildenbrand <[email protected]>
>
> I do agree that being explicit about hwpoison is much better. Poisoned
> page can be also an unitialized one and I believe this is the reason why
> you are bringing that up.

I'm bringing it up because I want to reuse that function as state above :)

>
> But you've made me look at d3378e86d182 and I am wondering whether this
> is really a valid patch. First of all it can leak a reference count
> AFAICS. Moreover it doesn't really fix anything because the page can be
> marked hwpoison right after the check is done. I do not think the race
> is feasible to be closed. So shouldn't we rather revert it?

I am not sure if we really care about races here that much here? I mean,
essentially we are racing with HW breaking asynchronously. Just because
we would be synchronizing with SetPageHWPoison() wouldn't mean we can
stop HW from breaking.

Long story short, this should be good enough for the cases we actually
can handle? What am I missing?

--
Thanks,

David / dhildenb

2021-05-05 13:42:40

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v1 5/7] mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize setting PageOffline()

On Thu 29-04-21 14:25:17, David Hildenbrand wrote:
> A driver might set a page logically offline -- PageOffline() -- and
> turn the page inaccessible in the hypervisor; after that, access to page
> content can be fatal. One example is virtio-mem; while unplugged memory
> -- marked as PageOffline() can currently be read in the hypervisor, this
> will no longer be the case in the future; for example, when having
> a virtio-mem device backed by huge pages in the hypervisor.
>
> Some special PFN walkers -- i.e., /proc/kcore -- read content of random
> pages after checking PageOffline(); however, these PFN walkers can race
> with drivers that set PageOffline().
>
> Let's introduce page_offline_(begin|end|freeze|unfreeze) for
> synchronizing.
>
> page_offline_freeze()/page_offline_unfreeze() allows for a subsystem to
> synchronize with such drivers, achieving that a page cannot be set
> PageOffline() while frozen.
>
> page_offline_begin()/page_offline_end() is used by drivers that care about
> such races when setting a page PageOffline().
>
> For simplicity, use a rwsem for now; neither drivers nor users are
> performance sensitive.

Please add a note to the PageOffline documentation as well. While are
adding the api close enough an explicit note there wouldn't hurt.

> Signed-off-by: David Hildenbrand <[email protected]>

As to the patch itself, I am slightly worried that other pfn walkers
might be less tolerant to the locking than the proc ones. On the other
hand most users shouldn't really care as they do not tend to touch the
memory content and PageOffline check without any synchronization should
be sufficient for those. Let's try this out and see where we get...

Acked-by: Michal Hocko <[email protected]>

> ---
> include/linux/page-flags.h | 5 +++++
> mm/util.c | 38 ++++++++++++++++++++++++++++++++++++++
> 2 files changed, 43 insertions(+)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index b8c56672a588..e3d00c72f459 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -767,6 +767,11 @@ PAGE_TYPE_OPS(Buddy, buddy)
> */
> PAGE_TYPE_OPS(Offline, offline)
>
> +extern void page_offline_freeze(void);
> +extern void page_offline_unfreeze(void);
> +extern void page_offline_begin(void);
> +extern void page_offline_end(void);
> +
> /*
> * Marks pages in use as page tables.
> */
> diff --git a/mm/util.c b/mm/util.c
> index 54870226cea6..95395d4e4209 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1013,3 +1013,41 @@ void mem_dump_obj(void *object)
> }
> pr_cont(" non-slab/vmalloc memory.\n");
> }
> +
> +/*
> + * A driver might set a page logically offline -- PageOffline() -- and
> + * turn the page inaccessible in the hypervisor; after that, access to page
> + * content can be fatal.
> + *
> + * Some special PFN walkers -- i.e., /proc/kcore -- read content of random
> + * pages after checking PageOffline(); however, these PFN walkers can race
> + * with drivers that set PageOffline().
> + *
> + * page_offline_freeze()/page_offline_unfreeze() allows for a subsystem to
> + * synchronize with such drivers, achieving that a page cannot be set
> + * PageOffline() while frozen.
> + *
> + * page_offline_begin()/page_offline_end() is used by drivers that care about
> + * such races when setting a page PageOffline().
> + */
> +static DECLARE_RWSEM(page_offline_rwsem);
> +
> +void page_offline_freeze(void)
> +{
> + down_read(&page_offline_rwsem);
> +}
> +
> +void page_offline_unfreeze(void)
> +{
> + up_read(&page_offline_rwsem);
> +}
> +
> +void page_offline_begin(void)
> +{
> + down_write(&page_offline_rwsem);
> +}
> +
> +void page_offline_end(void)
> +{
> + up_write(&page_offline_rwsem);
> +}
> --
> 2.30.2
>

--
Michal Hocko
SUSE Labs

2021-05-05 13:42:56

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()

On Wed 05-05-21 15:17:53, David Hildenbrand wrote:
> On 05.05.21 15:13, Michal Hocko wrote:
> > On Thu 29-04-21 14:25:15, David Hildenbrand wrote:
> > > Commit d3378e86d182 ("mm/gup: check page posion status for coredump.")
> > > introduced page_is_poisoned(), however, v5 [1] of the patch used
> > > "page_is_hwpoison()" and something went wrong while upstreaming. Rename the
> > > function and move it to page-flags.h, from where it can be used in other
> > > -- kcore -- context.
> > >
> > > Move the comment to the place where it belongs and simplify.
> > >
> > > [1] https://lkml.kernel.org/r/20210322193318.377c9ce9@alex-virtual-machine
> > >
> > > Signed-off-by: David Hildenbrand <[email protected]>
> >
> > I do agree that being explicit about hwpoison is much better. Poisoned
> > page can be also an unitialized one and I believe this is the reason why
> > you are bringing that up.
>
> I'm bringing it up because I want to reuse that function as state above :)
>
> >
> > But you've made me look at d3378e86d182 and I am wondering whether this
> > is really a valid patch. First of all it can leak a reference count
> > AFAICS. Moreover it doesn't really fix anything because the page can be
> > marked hwpoison right after the check is done. I do not think the race
> > is feasible to be closed. So shouldn't we rather revert it?
>
> I am not sure if we really care about races here that much here? I mean,
> essentially we are racing with HW breaking asynchronously. Just because we
> would be synchronizing with SetPageHWPoison() wouldn't mean we can stop HW
> from breaking.

Right

> Long story short, this should be good enough for the cases we actually can
> handle? What am I missing?

I am not sure I follow. My point is that I fail to see any added value
of the check as it doesn't prevent the race (it fundamentally cannot as
the page can be poisoned at any time) but the failure path doesn't
put_page which is incorrect even for hwpoison pages.
--
Michal Hocko
SUSE Labs

2021-05-05 13:44:37

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()

>> Long story short, this should be good enough for the cases we actually can
>> handle? What am I missing?
>
> I am not sure I follow. My point is that I fail to see any added value
> of the check as it doesn't prevent the race (it fundamentally cannot as
> the page can be poisoned at any time) but the failure path doesn't
> put_page which is incorrect even for hwpoison pages.

Oh, I think you are right. If we have a page and return NULL we would
leak a reference.

Actually, we discussed in that thread handling this entirely
differently, which resulted in a v7 [1]; however Andrew moved forward
with this (outdated?) patch, maybe that was just a mistake?

Yes, I agree we should revert that patch for now.

Regarding the race comment: AFAIU e.g., [2], it's not really a problem
with a race, but rather some corner case issue that can happen if we
fail in memory_failure().


[1] https://lkml.kernel.org/r/20210406104123.451ee3c3@alex-virtual-machine
[2]
https://lkml.kernel.org/r/[email protected]

--
Thanks,

David / dhildenb

2021-05-05 13:47:47

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()

On Wed 05-05-21 15:39:08, David Hildenbrand wrote:
> > > Long story short, this should be good enough for the cases we actually can
> > > handle? What am I missing?
> >
> > I am not sure I follow. My point is that I fail to see any added value
> > of the check as it doesn't prevent the race (it fundamentally cannot as
> > the page can be poisoned at any time) but the failure path doesn't
> > put_page which is incorrect even for hwpoison pages.
>
> Oh, I think you are right. If we have a page and return NULL we would leak a
> reference.
>
> Actually, we discussed in that thread handling this entirely differently,
> which resulted in a v7 [1]; however Andrew moved forward with this
> (outdated?) patch, maybe that was just a mistake?
>
> Yes, I agree we should revert that patch for now.

OK, Let me send the revert to Andrew.

--
Michal Hocko
SUSE Labs

2021-05-05 14:15:27

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()

On Thu 29-04-21 14:25:15, David Hildenbrand wrote:
> Commit d3378e86d182 ("mm/gup: check page posion status for coredump.")
> introduced page_is_poisoned(), however, v5 [1] of the patch used
> "page_is_hwpoison()" and something went wrong while upstreaming. Rename the
> function and move it to page-flags.h, from where it can be used in other
> -- kcore -- context.
>
> Move the comment to the place where it belongs and simplify.
>
> [1] https://lkml.kernel.org/r/20210322193318.377c9ce9@alex-virtual-machine
>
> Signed-off-by: David Hildenbrand <[email protected]>

I do agree that being explicit about hwpoison is much better. Poisoned
page can be also an unitialized one and I believe this is the reason why
you are bringing that up.

But you've made me look at d3378e86d182 and I am wondering whether this
is really a valid patch. First of all it can leak a reference count
AFAICS. Moreover it doesn't really fix anything because the page can be
marked hwpoison right after the check is done. I do not think the race
is feasible to be closed. So shouldn't we rather revert it?

--
Michal Hocko
SUSE Labs

2021-05-05 17:45:39

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v1 5/7] mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize setting PageOffline()

On Wed, May 05, 2021 at 05:10:33PM +0200, David Hildenbrand wrote:
> On 05.05.21 15:24, Michal Hocko wrote:
> > On Thu 29-04-21 14:25:17, David Hildenbrand wrote:
> > > A driver might set a page logically offline -- PageOffline() -- and
> > > turn the page inaccessible in the hypervisor; after that, access to page
> > > content can be fatal. One example is virtio-mem; while unplugged memory
> > > -- marked as PageOffline() can currently be read in the hypervisor, this
> > > will no longer be the case in the future; for example, when having
> > > a virtio-mem device backed by huge pages in the hypervisor.
> > >
> > > Some special PFN walkers -- i.e., /proc/kcore -- read content of random
> > > pages after checking PageOffline(); however, these PFN walkers can race
> > > with drivers that set PageOffline().
> > >
> > > Let's introduce page_offline_(begin|end|freeze|unfreeze) for
> > > synchronizing.
> > >
> > > page_offline_freeze()/page_offline_unfreeze() allows for a subsystem to
> > > synchronize with such drivers, achieving that a page cannot be set
> > > PageOffline() while frozen.
> > >
> > > page_offline_begin()/page_offline_end() is used by drivers that care about
> > > such races when setting a page PageOffline().
> > >
> > > For simplicity, use a rwsem for now; neither drivers nor users are
> > > performance sensitive.
> >
> > Please add a note to the PageOffline documentation as well. While are
> > adding the api close enough an explicit note there wouldn't hurt.
>
> Will do.
>
> >
> > > Signed-off-by: David Hildenbrand <[email protected]>
> >
> > As to the patch itself, I am slightly worried that other pfn walkers
> > might be less tolerant to the locking than the proc ones. On the other
> > hand most users shouldn't really care as they do not tend to touch the
> > memory content and PageOffline check without any synchronization should
> > be sufficient for those. Let's try this out and see where we get...
>
> My thinking. Users that actually read random page content (as discussed in
> the cover letter) are
>
> 1. Hibernation
> 2. Dumping (/proc/kcore, /proc/vmcore)
> 3. Physical memory access bypassing the kernel via /dev/mem
> 4. Live debug tools (kgdb)

I think you can add

5. Very old drivers

> Other PFN walkers really shouldn't (and don't) access random page content.
>
> Thanks!
>
> --
> Thanks,
>
> David / dhildenb
>
>

--
Sincerely yours,
Mike.

2021-05-05 20:49:40

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 5/7] mm: introduce page_offline_(begin|end|freeze|unfreeze) to synchronize setting PageOffline()

On 05.05.21 15:24, Michal Hocko wrote:
> On Thu 29-04-21 14:25:17, David Hildenbrand wrote:
>> A driver might set a page logically offline -- PageOffline() -- and
>> turn the page inaccessible in the hypervisor; after that, access to page
>> content can be fatal. One example is virtio-mem; while unplugged memory
>> -- marked as PageOffline() can currently be read in the hypervisor, this
>> will no longer be the case in the future; for example, when having
>> a virtio-mem device backed by huge pages in the hypervisor.
>>
>> Some special PFN walkers -- i.e., /proc/kcore -- read content of random
>> pages after checking PageOffline(); however, these PFN walkers can race
>> with drivers that set PageOffline().
>>
>> Let's introduce page_offline_(begin|end|freeze|unfreeze) for
>> synchronizing.
>>
>> page_offline_freeze()/page_offline_unfreeze() allows for a subsystem to
>> synchronize with such drivers, achieving that a page cannot be set
>> PageOffline() while frozen.
>>
>> page_offline_begin()/page_offline_end() is used by drivers that care about
>> such races when setting a page PageOffline().
>>
>> For simplicity, use a rwsem for now; neither drivers nor users are
>> performance sensitive.
>
> Please add a note to the PageOffline documentation as well. While are
> adding the api close enough an explicit note there wouldn't hurt.

Will do.

>
>> Signed-off-by: David Hildenbrand <[email protected]>
>
> As to the patch itself, I am slightly worried that other pfn walkers
> might be less tolerant to the locking than the proc ones. On the other
> hand most users shouldn't really care as they do not tend to touch the
> memory content and PageOffline check without any synchronization should
> be sufficient for those. Let's try this out and see where we get...

My thinking. Users that actually read random page content (as discussed
in the cover letter) are

1. Hibernation
2. Dumping (/proc/kcore, /proc/vmcore)
3. Physical memory access bypassing the kernel via /dev/mem
4. Live debug tools (kgdb)

Other PFN walkers really shouldn't (and don't) access random page content.

Thanks!

--
Thanks,

David / dhildenb

2021-05-06 00:58:40

by yaoaili [么爱利]

[permalink] [raw]
Subject: Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()

On Wed, 5 May 2021 15:27:39 +0200
Michal Hocko <[email protected]> wrote:

> On Wed 05-05-21 15:17:53, David Hildenbrand wrote:
> > On 05.05.21 15:13, Michal Hocko wrote:
> > > On Thu 29-04-21 14:25:15, David Hildenbrand wrote:
> > > > Commit d3378e86d182 ("mm/gup: check page posion status for coredump.")
> > > > introduced page_is_poisoned(), however, v5 [1] of the patch used
> > > > "page_is_hwpoison()" and something went wrong while upstreaming. Rename the
> > > > function and move it to page-flags.h, from where it can be used in other
> > > > -- kcore -- context.
> > > >
> > > > Move the comment to the place where it belongs and simplify.
> > > >
> > > > [1] https://lkml.kernel.org/r/20210322193318.377c9ce9@alex-virtual-machine
> > > >
> > > > Signed-off-by: David Hildenbrand <[email protected]>
> > >
> > > I do agree that being explicit about hwpoison is much better. Poisoned
> > > page can be also an unitialized one and I believe this is the reason why
> > > you are bringing that up.
> >
> > I'm bringing it up because I want to reuse that function as state above :)
> >
> > >
> > > But you've made me look at d3378e86d182 and I am wondering whether this
> > > is really a valid patch. First of all it can leak a reference count
> > > AFAICS. Moreover it doesn't really fix anything because the page can be
> > > marked hwpoison right after the check is done. I do not think the race
> > > is feasible to be closed. So shouldn't we rather revert it?
> >
> > I am not sure if we really care about races here that much here? I mean,
> > essentially we are racing with HW breaking asynchronously. Just because we
> > would be synchronizing with SetPageHWPoison() wouldn't mean we can stop HW
> > from breaking.
>
> Right
>
> > Long story short, this should be good enough for the cases we actually can
> > handle? What am I missing?
>
> I am not sure I follow. My point is that I fail to see any added value
> of the check as it doesn't prevent the race (it fundamentally cannot as
> the page can be poisoned at any time) but the failure path doesn't
> put_page which is incorrect even for hwpoison pages.

Sorry, I have something to say:

I have noticed the ref count leak in the previous topic ,but I don't think
it's a really matter. For memory recovery case for user pages, we will keep one
reference to the poison page so the error page will not be freed to buddy allocator.
which can be checked in memory_faulure() function.

For the case here, the reference count for error page may be greater than one as it's not
unmmapped successfully and may shared. we take a reference for that page will not result some
really mattering issue.

The only break I can think for this leak is that we will fail to unpoison the error page for test purpose.

Thanks!
Aili Yao

2021-05-06 01:09:35

by yaoaili [么爱利]

[permalink] [raw]
Subject: Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()

On Wed, 5 May 2021 15:45:47 +0200
Michal Hocko <[email protected]> wrote:

> On Wed 05-05-21 15:39:08, David Hildenbrand wrote:
> > > > Long story short, this should be good enough for the cases we actually can
> > > > handle? What am I missing?
> > >
> > > I am not sure I follow. My point is that I fail to see any added value
> > > of the check as it doesn't prevent the race (it fundamentally cannot as
> > > the page can be poisoned at any time) but the failure path doesn't
> > > put_page which is incorrect even for hwpoison pages.
> >
> > Oh, I think you are right. If we have a page and return NULL we would leak a
> > reference.
> >
> > Actually, we discussed in that thread handling this entirely differently,
> > which resulted in a v7 [1]; however Andrew moved forward with this
> > (outdated?) patch, maybe that was just a mistake?
> >
> > Yes, I agree we should revert that patch for now.
>
> OK, Let me send the revert to Andrew.
>

Got this!
Anyway, I will try to post a new patch for this issue based on the previous patch v7.

Thanks!
Aili Yao

2021-05-06 07:08:44

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()

On Thu 06-05-21 08:56:11, Aili Yao wrote:
> On Wed, 5 May 2021 15:27:39 +0200
> Michal Hocko <[email protected]> wrote:
>
> > On Wed 05-05-21 15:17:53, David Hildenbrand wrote:
> > > On 05.05.21 15:13, Michal Hocko wrote:
> > > > On Thu 29-04-21 14:25:15, David Hildenbrand wrote:
> > > > > Commit d3378e86d182 ("mm/gup: check page posion status for coredump.")
> > > > > introduced page_is_poisoned(), however, v5 [1] of the patch used
> > > > > "page_is_hwpoison()" and something went wrong while upstreaming. Rename the
> > > > > function and move it to page-flags.h, from where it can be used in other
> > > > > -- kcore -- context.
> > > > >
> > > > > Move the comment to the place where it belongs and simplify.
> > > > >
> > > > > [1] https://lkml.kernel.org/r/20210322193318.377c9ce9@alex-virtual-machine
> > > > >
> > > > > Signed-off-by: David Hildenbrand <[email protected]>
> > > >
> > > > I do agree that being explicit about hwpoison is much better. Poisoned
> > > > page can be also an unitialized one and I believe this is the reason why
> > > > you are bringing that up.
> > >
> > > I'm bringing it up because I want to reuse that function as state above :)
> > >
> > > >
> > > > But you've made me look at d3378e86d182 and I am wondering whether this
> > > > is really a valid patch. First of all it can leak a reference count
> > > > AFAICS. Moreover it doesn't really fix anything because the page can be
> > > > marked hwpoison right after the check is done. I do not think the race
> > > > is feasible to be closed. So shouldn't we rather revert it?
> > >
> > > I am not sure if we really care about races here that much here? I mean,
> > > essentially we are racing with HW breaking asynchronously. Just because we
> > > would be synchronizing with SetPageHWPoison() wouldn't mean we can stop HW
> > > from breaking.
> >
> > Right
> >
> > > Long story short, this should be good enough for the cases we actually can
> > > handle? What am I missing?
> >
> > I am not sure I follow. My point is that I fail to see any added value
> > of the check as it doesn't prevent the race (it fundamentally cannot as
> > the page can be poisoned at any time) but the failure path doesn't
> > put_page which is incorrect even for hwpoison pages.
>
> Sorry, I have something to say:
>
> I have noticed the ref count leak in the previous topic ,but I don't think
> it's a really matter. For memory recovery case for user pages, we will keep one
> reference to the poison page so the error page will not be freed to buddy allocator.
> which can be checked in memory_faulure() function.

So what would happen if those pages are hwpoisoned from userspace rather
than by HW. And repeatedly so?
--
Michal Hocko
SUSE Labs

2021-05-06 07:32:05

by yaoaili [么爱利]

[permalink] [raw]
Subject: Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()

On Thu, 6 May 2021 09:06:14 +0200
Michal Hocko <[email protected]> wrote:

> On Thu 06-05-21 08:56:11, Aili Yao wrote:
> > On Wed, 5 May 2021 15:27:39 +0200
> > Michal Hocko <[email protected]> wrote:
> >
> > > On Wed 05-05-21 15:17:53, David Hildenbrand wrote:
> > > > On 05.05.21 15:13, Michal Hocko wrote:
> > > > > On Thu 29-04-21 14:25:15, David Hildenbrand wrote:
> > > > > > Commit d3378e86d182 ("mm/gup: check page posion status for coredump.")
> > > > > > introduced page_is_poisoned(), however, v5 [1] of the patch used
> > > > > > "page_is_hwpoison()" and something went wrong while upstreaming. Rename the
> > > > > > function and move it to page-flags.h, from where it can be used in other
> > > > > > -- kcore -- context.
> > > > > >
> > > > > > Move the comment to the place where it belongs and simplify.
> > > > > >
> > > > > > [1] https://lkml.kernel.org/r/20210322193318.377c9ce9@alex-virtual-machine
> > > > > >
> > > > > > Signed-off-by: David Hildenbrand <[email protected]>
> > > > >
> > > > > I do agree that being explicit about hwpoison is much better. Poisoned
> > > > > page can be also an unitialized one and I believe this is the reason why
> > > > > you are bringing that up.
> > > >
> > > > I'm bringing it up because I want to reuse that function as state above :)
> > > >
> > > > >
> > > > > But you've made me look at d3378e86d182 and I am wondering whether this
> > > > > is really a valid patch. First of all it can leak a reference count
> > > > > AFAICS. Moreover it doesn't really fix anything because the page can be
> > > > > marked hwpoison right after the check is done. I do not think the race
> > > > > is feasible to be closed. So shouldn't we rather revert it?
> > > >
> > > > I am not sure if we really care about races here that much here? I mean,
> > > > essentially we are racing with HW breaking asynchronously. Just because we
> > > > would be synchronizing with SetPageHWPoison() wouldn't mean we can stop HW
> > > > from breaking.
> > >
> > > Right
> > >
> > > > Long story short, this should be good enough for the cases we actually can
> > > > handle? What am I missing?
> > >
> > > I am not sure I follow. My point is that I fail to see any added value
> > > of the check as it doesn't prevent the race (it fundamentally cannot as
> > > the page can be poisoned at any time) but the failure path doesn't
> > > put_page which is incorrect even for hwpoison pages.
> >
> > Sorry, I have something to say:
> >
> > I have noticed the ref count leak in the previous topic ,but I don't think
> > it's a really matter. For memory recovery case for user pages, we will keep one
> > reference to the poison page so the error page will not be freed to buddy allocator.
> > which can be checked in memory_faulure() function.
>
> So what would happen if those pages are hwpoisoned from userspace rather
> than by HW. And repeatedly so?

Sorry, I may be not totally understand what you mean.

Do you mean hard page offline from mcelog?
If yes, I think it's not for one real UC error but for CE storms.
when we access this page in kernel, the access may success even it was marked hwpoison.

Thanks!
Aili Yao

2021-05-06 08:28:04

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()

On Thu 06-05-21 15:28:05, Aili Yao wrote:
> On Thu, 6 May 2021 09:06:14 +0200
> Michal Hocko <[email protected]> wrote:
>
> > On Thu 06-05-21 08:56:11, Aili Yao wrote:
> > > On Wed, 5 May 2021 15:27:39 +0200
> > > Michal Hocko <[email protected]> wrote:
[...]
> > > > I am not sure I follow. My point is that I fail to see any added value
> > > > of the check as it doesn't prevent the race (it fundamentally cannot as
> > > > the page can be poisoned at any time) but the failure path doesn't
> > > > put_page which is incorrect even for hwpoison pages.
> > >
> > > Sorry, I have something to say:
> > >
> > > I have noticed the ref count leak in the previous topic ,but I don't think
> > > it's a really matter. For memory recovery case for user pages, we will keep one
> > > reference to the poison page so the error page will not be freed to buddy allocator.
> > > which can be checked in memory_faulure() function.
> >
> > So what would happen if those pages are hwpoisoned from userspace rather
> > than by HW. And repeatedly so?
>
> Sorry, I may be not totally understand what you mean.
>
> Do you mean hard page offline from mcelog?

No I mean soft hwpoison from userspace (e.g. by MADV_HWPOISON but there
are other interfaces AFAIK).

And just to be explicit. All those interfaces are root only
(CAP_SYS_ADMIN) so I am not really worried about any malitious abuse of
the reference leak. I am mostly concerned that this is obviously broken
without a good reason. The most trivial fix would have been to put_page
in the return path but as I've mentioned in other email thread the fix
really needs a deeper thought and consider other things.

Hope that clarifies this some more.
--
Michal Hocko
SUSE Labs

2021-05-06 09:27:51

by yaoaili [么爱利]

[permalink] [raw]
Subject: Re: [PATCH v1 3/7] mm: rename and move page_is_poisoned()

On Thu, 6 May 2021 09:55:51 +0200
Michal Hocko <[email protected]> wrote:

> On Thu 06-05-21 15:28:05, Aili Yao wrote:
> > On Thu, 6 May 2021 09:06:14 +0200
> > Michal Hocko <[email protected]> wrote:
> >
> > > On Thu 06-05-21 08:56:11, Aili Yao wrote:
> > > > On Wed, 5 May 2021 15:27:39 +0200
> > > > Michal Hocko <[email protected]> wrote:
> [...]
> > > > > I am not sure I follow. My point is that I fail to see any added value
> > > > > of the check as it doesn't prevent the race (it fundamentally cannot as
> > > > > the page can be poisoned at any time) but the failure path doesn't
> > > > > put_page which is incorrect even for hwpoison pages.
> > > >
> > > > Sorry, I have something to say:
> > > >
> > > > I have noticed the ref count leak in the previous topic ,but I don't think
> > > > it's a really matter. For memory recovery case for user pages, we will keep one
> > > > reference to the poison page so the error page will not be freed to buddy allocator.
> > > > which can be checked in memory_faulure() function.
> > >
> > > So what would happen if those pages are hwpoisoned from userspace rather
> > > than by HW. And repeatedly so?
> >
> > Sorry, I may be not totally understand what you mean.
> >
> > Do you mean hard page offline from mcelog?
>
> No I mean soft hwpoison from userspace (e.g. by MADV_HWPOISON but there
> are other interfaces AFAIK).
>
> And just to be explicit. All those interfaces are root only
> (CAP_SYS_ADMIN) so I am not really worried about any malitious abuse of
> the reference leak. I am mostly concerned that this is obviously broken
> without a good reason. The most trivial fix would have been to put_page
> in the return path but as I've mentioned in other email thread the fix
> really needs a deeper thought and consider other things.
>
> Hope that clarifies this some more.

Thanks, got it!
Yes, there are some test scenarios that should be covered.

But for test, the default SIGBUS handlers is usually replaced, and the test process
may not hit the coredump code.

Anyway, there is a ref leak in the normal enviorments and better to be fixed.

Thanks!
Aili Yao