Looking for places where the kernel might unconditionally read
PageOffline() pages, I stumbled over /proc/kcore; turns out /proc/kcore
needs some more love to not touch some other pages we really don't want to
read -- i.e., hwpoisoned ones.
Examples for PageOffline() pages are pages inflated in a balloon,
memory unplugged via virtio-mem, and partially-present sections in
memory added by the Hyper-V balloon.
When reading pages inflated in a balloon, we essentially produce
unnecessary load in the hypervisor; holes in partially present sections in
case of Hyper-V are not accessible and already were a problem for
/proc/vmcore, fixed in makedumpfile by detecting PageOffline() pages. In
the future, virtio-mem might disallow reading unplugged memory -- marked
as PageOffline() -- in some environments, resulting in undefined behavior
when accessed; therefore, I'm trying to identify and rework all these
(corner) cases.
With this series, there is really only access via /dev/mem, /proc/vmcore
and kdb left after I ripped out /dev/kmem. kdb is an advanced corner-case
use case -- we won't care for now if someone explicitly tries to do nasty
things by reading from/writing to physical addresses we better not touch.
/dev/mem is a use case we won't support for virtio-mem, at least for now,
so we'll simply disallow mapping any virtio-mem memory via /dev/mem next.
/proc/vmcore is really only a problem when dumping the old kernel via
something that's not makedumpfile (read: basically never), however, we'll
try sanitizing that as well in the second kernel in the future.
Tested via kcore_dump:
https://github.com/schlafwandler/kcore_dump
v1 -> v2:
- Dropped "mm: rename and move page_is_poisoned()"
- "fs/proc/kcore: don't read offline sections, logically offline pages ..."
-- Add is_page_hwpoison() in page-flags.h along with a comment
- "mm: introduce page_offline_(begin|end|freeze|thaw) to ..."
-- s/unfreeze/thaw/
-- Add a comment to PageOffline documentation in page-flags.h
- "virtio-mem: use page_offline_(start|end) when setting PageOffline()"
-- Extend patch description
- "fs/proc/kcore: use page_offline_(freeze|thaw)"
-- Simplify freeze/thaw logic
- Collected acks/rbs
Cc: Andrew Morton <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Aili Yao <[email protected]>
Cc: Jiri Bohac <[email protected]>
Cc: "K. Y. Srinivasan" <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Cc: Wei Liu <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
David Hildenbrand (6):
fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER
fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM
fs/proc/kcore: don't read offline sections, logically offline pages
and hwpoisoned pages
mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize
setting PageOffline()
virtio-mem: use page_offline_(start|end) when setting PageOffline()
fs/proc/kcore: use page_offline_(freeze|thaw)
drivers/virtio/virtio_mem.c | 2 ++
fs/proc/kcore.c | 67 ++++++++++++++++++++++++++++++-------
include/linux/kcore.h | 3 --
include/linux/page-flags.h | 22 ++++++++++++
mm/util.c | 40 ++++++++++++++++++++++
5 files changed, 118 insertions(+), 16 deletions(-)
base-commit: 6efb943b8616ec53a5e444193dccf1af9ad627b5
--
2.31.1
Let's avoid reading:
1) Offline memory sections: the content of offline memory sections is stale
as the memory is effectively unused by the kernel. On s390x with standby
memory, offline memory sections (belonging to offline storage
increments) are not accessible. With virtio-mem and the hyper-v balloon,
we can have unavailable memory chunks that should not be accessed inside
offline memory sections. Last but not least, offline memory sections
might contain hwpoisoned pages which we can no longer identify
because the memmap is stale.
2) PG_offline pages: logically offline pages that are documented as
"The content of these pages is effectively stale. Such pages should not
be touched (read/write/dump/save) except by their owner.".
Examples include pages inflated in a balloon or unavailble memory
ranges inside hotplugged memory sections with virtio-mem or the hyper-v
balloon.
3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal.
As documented: "Accessing is not safe since it may cause another machine
check. Don't touch!"
Introduce is_page_hwpoison(), adding a comment that it is inherently
racy but best we can really do.
Reading /proc/kcore now performs similar checks as when reading
/proc/vmcore for kdump via makedumpfile: problematic pages are exclude.
It's also similar to hibernation code, however, we don't skip hwpoisoned
pages when processing pages in kernel/power/snapshot.c:saveable_page() yet.
Note 1: we can race against memory offlining code, especially
memory going offline and getting unplugged: however, we will properly tear
down the identity mapping and handle faults gracefully when accessing
this memory from kcore code.
Note 2: we can race against drivers setting PageOffline() and turning
memory inaccessible in the hypervisor. We'll handle this in a follow-up
patch.
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
fs/proc/kcore.c | 14 +++++++++++++-
include/linux/page-flags.h | 12 ++++++++++++
2 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index ed6fbb3bd50c..92ff1e4436cb 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -465,6 +465,9 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
m = NULL;
while (buflen) {
+ struct page *page;
+ unsigned long pfn;
+
/*
* If this is the first iteration or the address is not within
* the previous entry, search for a matching entry.
@@ -503,7 +506,16 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
}
break;
case KCORE_RAM:
- if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
+ pfn = __pa(start) >> PAGE_SHIFT;
+ page = pfn_to_online_page(pfn);
+
+ /*
+ * Don't read offline sections, logically offline pages
+ * (e.g., inflated in a balloon), hwpoisoned pages,
+ * and explicitly excluded physical ranges.
+ */
+ if (!page || PageOffline(page) ||
+ is_page_hwpoison(page) || !pfn_is_ram(pfn)) {
if (clear_user(buffer, tsz)) {
ret = -EFAULT;
goto out;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 04a34c08e0a6..daed82744f4b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -694,6 +694,18 @@ PAGEFLAG_FALSE(DoubleMap)
TESTSCFLAG_FALSE(DoubleMap)
#endif
+/*
+ * Check if a page is currently marked HWPoisoned. Note that this check is
+ * best effort only and inherently racy: there is no way to synchronize with
+ * failing hardware.
+ */
+static inline bool is_page_hwpoison(struct page *page)
+{
+ if (PageHWPoison(page))
+ return true;
+ return PageHuge(page) && PageHWPoison(compound_head(page));
+}
+
/*
* For pages that are never mapped to userspace (and aren't PageSlab),
* page_type may be used. Because it is initialised to -1, we invert the
--
2.31.1
Let's properly use page_offline_(start|end) to synchronize setting
PageOffline(), so we won't have valid page access to unplugged memory
regions from /proc/kcore.
Existing balloon implementations usually allow reading inflated memory;
doing so might result in unnecessary overhead in the hypervisor, which
is currently the case with virtio-mem.
For future virtio-mem use cases, it will be different when using shmem,
huge pages, !anonymous private mappings, ... as backing storage for a VM.
virtio-mem unplugged memory must no longer be accessed and access might
result in undefined behavior. There will be a virtio spec extension to
document this change, including a new feature flag indicating the
changed behavior. We really don't want to race against PFN walkers
reading random page content.
Acked-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/virtio/virtio_mem.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 10ec60d81e84..dc2a2e2b2ff8 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1065,6 +1065,7 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
static void virtio_mem_set_fake_offline(unsigned long pfn,
unsigned long nr_pages, bool onlined)
{
+ page_offline_begin();
for (; nr_pages--; pfn++) {
struct page *page = pfn_to_page(pfn);
@@ -1075,6 +1076,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
ClearPageReserved(page);
}
}
+ page_offline_end();
}
/*
--
2.31.1
Let's properly synchronize with drivers that set PageOffline().
Unfreeze/thaw every now and then, so drivers that want to set PageOffline()
can make progress.
Signed-off-by: David Hildenbrand <[email protected]>
---
fs/proc/kcore.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 92ff1e4436cb..982e694aae77 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -313,6 +313,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
{
char *buf = file->private_data;
size_t phdrs_offset, notes_offset, data_offset;
+ size_t page_offline_frozen = 1;
size_t phdrs_len, notes_len;
struct kcore_list *m;
size_t tsz;
@@ -322,6 +323,11 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
int ret = 0;
down_read(&kclist_lock);
+ /*
+ * Don't race against drivers that set PageOffline() and expect no
+ * further page access.
+ */
+ page_offline_freeze();
get_kcore_size(&nphdr, &phdrs_len, ¬es_len, &data_offset);
phdrs_offset = sizeof(struct elfhdr);
@@ -480,6 +486,12 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
}
}
+ if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
+ page_offline_thaw();
+ cond_resched();
+ page_offline_freeze();
+ }
+
if (&m->list == &kclist_head) {
if (clear_user(buffer, tsz)) {
ret = -EFAULT;
@@ -565,6 +577,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
}
out:
+ page_offline_thaw();
up_read(&kclist_lock);
if (ret)
return ret;
--
2.31.1
Let's resturcture the code, using switch-case, and checking pfn_is_ram()
only when we are dealing with KCORE_RAM.
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
fs/proc/kcore.c | 35 +++++++++++++++++++++++++++--------
1 file changed, 27 insertions(+), 8 deletions(-)
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 09f77d3c6e15..ed6fbb3bd50c 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -483,25 +483,36 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
goto out;
}
m = NULL; /* skip the list anchor */
- } else if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
- if (clear_user(buffer, tsz)) {
- ret = -EFAULT;
- goto out;
- }
- } else if (m->type == KCORE_VMALLOC) {
+ goto skip;
+ }
+
+ switch (m->type) {
+ case KCORE_VMALLOC:
vread(buf, (char *)start, tsz);
/* we have to zero-fill user buffer even if no read */
if (copy_to_user(buffer, buf, tsz)) {
ret = -EFAULT;
goto out;
}
- } else if (m->type == KCORE_USER) {
+ break;
+ case KCORE_USER:
/* User page is handled prior to normal kernel page: */
if (copy_to_user(buffer, (char *)start, tsz)) {
ret = -EFAULT;
goto out;
}
- } else {
+ break;
+ case KCORE_RAM:
+ if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
+ if (clear_user(buffer, tsz)) {
+ ret = -EFAULT;
+ goto out;
+ }
+ break;
+ }
+ fallthrough;
+ case KCORE_VMEMMAP:
+ case KCORE_TEXT:
if (kern_addr_valid(start)) {
/*
* Using bounce buffer to bypass the
@@ -525,7 +536,15 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
goto out;
}
}
+ break;
+ default:
+ pr_warn_once("Unhandled KCORE type: %d\n", m->type);
+ if (clear_user(buffer, tsz)) {
+ ret = -EFAULT;
+ goto out;
+ }
}
+skip:
buflen -= tsz;
*fpos += tsz;
buffer += tsz;
--
2.31.1
Commit db779ef67ffe ("proc/kcore: Remove unused kclist_add_remap()")
removed the last user of KCORE_REMAP.
Commit 595dd46ebfc1 ("vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when
dumping vsyscall user page") removed the last user of KCORE_OTHER.
Let's drop both types. While at it, also drop vaddr in "struct
kcore_list", used by KCORE_REMAP only.
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
fs/proc/kcore.c | 7 ++-----
include/linux/kcore.h | 3 ---
2 files changed, 2 insertions(+), 8 deletions(-)
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 4d2e64e9016c..09f77d3c6e15 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -380,11 +380,8 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
phdr->p_type = PT_LOAD;
phdr->p_flags = PF_R | PF_W | PF_X;
phdr->p_offset = kc_vaddr_to_offset(m->addr) + data_offset;
- if (m->type == KCORE_REMAP)
- phdr->p_vaddr = (size_t)m->vaddr;
- else
- phdr->p_vaddr = (size_t)m->addr;
- if (m->type == KCORE_RAM || m->type == KCORE_REMAP)
+ phdr->p_vaddr = (size_t)m->addr;
+ if (m->type == KCORE_RAM)
phdr->p_paddr = __pa(m->addr);
else if (m->type == KCORE_TEXT)
phdr->p_paddr = __pa_symbol(m->addr);
diff --git a/include/linux/kcore.h b/include/linux/kcore.h
index da676cdbd727..86c0f1d18998 100644
--- a/include/linux/kcore.h
+++ b/include/linux/kcore.h
@@ -11,14 +11,11 @@ enum kcore_type {
KCORE_RAM,
KCORE_VMEMMAP,
KCORE_USER,
- KCORE_OTHER,
- KCORE_REMAP,
};
struct kcore_list {
struct list_head list;
unsigned long addr;
- unsigned long vaddr;
size_t size;
int type;
};
--
2.31.1
A driver might set a page logically offline -- PageOffline() -- and
turn the page inaccessible in the hypervisor; after that, access to page
content can be fatal. One example is virtio-mem; while unplugged memory
-- marked as PageOffline() can currently be read in the hypervisor, this
will no longer be the case in the future; for example, when having
a virtio-mem device backed by huge pages in the hypervisor.
Some special PFN walkers -- i.e., /proc/kcore -- read content of random
pages after checking PageOffline(); however, these PFN walkers can race
with drivers that set PageOffline().
Let's introduce page_offline_(begin|end|freeze|thaw) for
synchronizing.
page_offline_freeze()/page_offline_thaw() allows for a subsystem to
synchronize with such drivers, achieving that a page cannot be set
PageOffline() while frozen.
page_offline_begin()/page_offline_end() is used by drivers that care about
such races when setting a page PageOffline().
For simplicity, use a rwsem for now; neither drivers nor users are
performance sensitive.
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/page-flags.h | 10 ++++++++++
mm/util.c | 40 ++++++++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index daed82744f4b..ea2df9a247b3 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -769,9 +769,19 @@ PAGE_TYPE_OPS(Buddy, buddy)
* relies on this feature is aware that re-onlining the memory block will
* require to re-set the pages PageOffline() and not giving them to the
* buddy via online_page_callback_t.
+ *
+ * There are drivers that mark a page PageOffline() and do not expect any
+ * further access to page content. PFN walkers that read content of random
+ * pages should check PageOffline() and synchronize with such drivers using
+ * page_offline_freeze()/page_offline_thaw().
*/
PAGE_TYPE_OPS(Offline, offline)
+extern void page_offline_freeze(void);
+extern void page_offline_thaw(void);
+extern void page_offline_begin(void);
+extern void page_offline_end(void);
+
/*
* Marks pages in use as page tables.
*/
diff --git a/mm/util.c b/mm/util.c
index a8bf17f18a81..a034525e7ba2 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1010,3 +1010,43 @@ void mem_dump_obj(void *object)
}
EXPORT_SYMBOL_GPL(mem_dump_obj);
#endif
+
+/*
+ * A driver might set a page logically offline -- PageOffline() -- and
+ * turn the page inaccessible in the hypervisor; after that, access to page
+ * content can be fatal.
+ *
+ * Some special PFN walkers -- i.e., /proc/kcore -- read content of random
+ * pages after checking PageOffline(); however, these PFN walkers can race
+ * with drivers that set PageOffline().
+ *
+ * page_offline_freeze()/page_offline_thaw() allows for a subsystem to
+ * synchronize with such drivers, achieving that a page cannot be set
+ * PageOffline() while frozen.
+ *
+ * page_offline_begin()/page_offline_end() is used by drivers that care about
+ * such races when setting a page PageOffline().
+ */
+static DECLARE_RWSEM(page_offline_rwsem);
+
+void page_offline_freeze(void)
+{
+ down_read(&page_offline_rwsem);
+}
+
+void page_offline_thaw(void)
+{
+ up_read(&page_offline_rwsem);
+}
+
+void page_offline_begin(void)
+{
+ down_write(&page_offline_rwsem);
+}
+EXPORT_SYMBOL(page_offline_begin);
+
+void page_offline_end(void)
+{
+ up_write(&page_offline_rwsem);
+}
+EXPORT_SYMBOL(page_offline_end);
--
2.31.1
On Fri, May 14, 2021 at 07:22:47PM +0200, David Hildenbrand wrote:
> Let's properly synchronize with drivers that set PageOffline().
> Unfreeze/thaw every now and then, so drivers that want to set PageOffline()
> can make progress.
>
> Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
> ---
> fs/proc/kcore.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index 92ff1e4436cb..982e694aae77 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -313,6 +313,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> {
> char *buf = file->private_data;
> size_t phdrs_offset, notes_offset, data_offset;
> + size_t page_offline_frozen = 1;
> size_t phdrs_len, notes_len;
> struct kcore_list *m;
> size_t tsz;
> @@ -322,6 +323,11 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> int ret = 0;
>
> down_read(&kclist_lock);
> + /*
> + * Don't race against drivers that set PageOffline() and expect no
> + * further page access.
> + */
> + page_offline_freeze();
>
> get_kcore_size(&nphdr, &phdrs_len, ¬es_len, &data_offset);
> phdrs_offset = sizeof(struct elfhdr);
> @@ -480,6 +486,12 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> }
> }
>
> + if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
> + page_offline_thaw();
> + cond_resched();
> + page_offline_freeze();
> + }
> +
> if (&m->list == &kclist_head) {
> if (clear_user(buffer, tsz)) {
> ret = -EFAULT;
> @@ -565,6 +577,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> }
>
> out:
> + page_offline_thaw();
> up_read(&kclist_lock);
> if (ret)
> return ret;
> --
> 2.31.1
>
--
Sincerely yours,
Mike.
On Fri, May 14, 2021 at 07:22:45PM +0200, David Hildenbrand wrote:
> A driver might set a page logically offline -- PageOffline() -- and
> turn the page inaccessible in the hypervisor; after that, access to page
> content can be fatal. One example is virtio-mem; while unplugged memory
> -- marked as PageOffline() can currently be read in the hypervisor, this
> will no longer be the case in the future; for example, when having
> a virtio-mem device backed by huge pages in the hypervisor.
>
> Some special PFN walkers -- i.e., /proc/kcore -- read content of random
> pages after checking PageOffline(); however, these PFN walkers can race
> with drivers that set PageOffline().
>
> Let's introduce page_offline_(begin|end|freeze|thaw) for
> synchronizing.
>
> page_offline_freeze()/page_offline_thaw() allows for a subsystem to
> synchronize with such drivers, achieving that a page cannot be set
> PageOffline() while frozen.
>
> page_offline_begin()/page_offline_end() is used by drivers that care about
> such races when setting a page PageOffline().
>
> For simplicity, use a rwsem for now; neither drivers nor users are
> performance sensitive.
>
> Acked-by: Michal Hocko <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
One nit below, otherwise
Reviewed-by: Mike Rapoport <[email protected]>
> ---
> include/linux/page-flags.h | 10 ++++++++++
> mm/util.c | 40 ++++++++++++++++++++++++++++++++++++++
> 2 files changed, 50 insertions(+)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index daed82744f4b..ea2df9a247b3 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -769,9 +769,19 @@ PAGE_TYPE_OPS(Buddy, buddy)
> * relies on this feature is aware that re-onlining the memory block will
> * require to re-set the pages PageOffline() and not giving them to the
> * buddy via online_page_callback_t.
> + *
> + * There are drivers that mark a page PageOffline() and do not expect any
Maybe "and expect there won't be any further access"...
> + * further access to page content. PFN walkers that read content of random
> + * pages should check PageOffline() and synchronize with such drivers using
> + * page_offline_freeze()/page_offline_thaw().
> */
> PAGE_TYPE_OPS(Offline, offline)
>
> +extern void page_offline_freeze(void);
> +extern void page_offline_thaw(void);
> +extern void page_offline_begin(void);
> +extern void page_offline_end(void);
> +
> /*
> * Marks pages in use as page tables.
> */
> diff --git a/mm/util.c b/mm/util.c
> index a8bf17f18a81..a034525e7ba2 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1010,3 +1010,43 @@ void mem_dump_obj(void *object)
> }
> EXPORT_SYMBOL_GPL(mem_dump_obj);
> #endif
> +
> +/*
> + * A driver might set a page logically offline -- PageOffline() -- and
> + * turn the page inaccessible in the hypervisor; after that, access to page
> + * content can be fatal.
> + *
> + * Some special PFN walkers -- i.e., /proc/kcore -- read content of random
> + * pages after checking PageOffline(); however, these PFN walkers can race
> + * with drivers that set PageOffline().
> + *
> + * page_offline_freeze()/page_offline_thaw() allows for a subsystem to
> + * synchronize with such drivers, achieving that a page cannot be set
> + * PageOffline() while frozen.
> + *
> + * page_offline_begin()/page_offline_end() is used by drivers that care about
> + * such races when setting a page PageOffline().
> + */
> +static DECLARE_RWSEM(page_offline_rwsem);
> +
> +void page_offline_freeze(void)
> +{
> + down_read(&page_offline_rwsem);
> +}
> +
> +void page_offline_thaw(void)
> +{
> + up_read(&page_offline_rwsem);
> +}
> +
> +void page_offline_begin(void)
> +{
> + down_write(&page_offline_rwsem);
> +}
> +EXPORT_SYMBOL(page_offline_begin);
> +
> +void page_offline_end(void)
> +{
> + up_write(&page_offline_rwsem);
> +}
> +EXPORT_SYMBOL(page_offline_end);
> --
> 2.31.1
>
--
Sincerely yours,
Mike.
On Fri, May 14, 2021 at 07:22:46PM +0200, David Hildenbrand wrote:
> Let's properly use page_offline_(start|end) to synchronize setting
> PageOffline(), so we won't have valid page access to unplugged memory
> regions from /proc/kcore.
>
> Existing balloon implementations usually allow reading inflated memory;
> doing so might result in unnecessary overhead in the hypervisor, which
> is currently the case with virtio-mem.
>
> For future virtio-mem use cases, it will be different when using shmem,
> huge pages, !anonymous private mappings, ... as backing storage for a VM.
> virtio-mem unplugged memory must no longer be accessed and access might
> result in undefined behavior. There will be a virtio spec extension to
> document this change, including a new feature flag indicating the
> changed behavior. We really don't want to race against PFN walkers
> reading random page content.
>
> Acked-by: Michael S. Tsirkin <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> drivers/virtio/virtio_mem.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 10ec60d81e84..dc2a2e2b2ff8 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -1065,6 +1065,7 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
> static void virtio_mem_set_fake_offline(unsigned long pfn,
> unsigned long nr_pages, bool onlined)
> {
> + page_offline_begin();
> for (; nr_pages--; pfn++) {
> struct page *page = pfn_to_page(pfn);
>
> @@ -1075,6 +1076,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
> ClearPageReserved(page);
> }
> }
> + page_offline_end();
> }
>
> /*
> --
> 2.31.1
>
>
--
Sincerely yours,
Mike.
On 17.05.21 08:43, Mike Rapoport wrote:
> On Fri, May 14, 2021 at 07:22:45PM +0200, David Hildenbrand wrote:
>> A driver might set a page logically offline -- PageOffline() -- and
>> turn the page inaccessible in the hypervisor; after that, access to page
>> content can be fatal. One example is virtio-mem; while unplugged memory
>> -- marked as PageOffline() can currently be read in the hypervisor, this
>> will no longer be the case in the future; for example, when having
>> a virtio-mem device backed by huge pages in the hypervisor.
>>
>> Some special PFN walkers -- i.e., /proc/kcore -- read content of random
>> pages after checking PageOffline(); however, these PFN walkers can race
>> with drivers that set PageOffline().
>>
>> Let's introduce page_offline_(begin|end|freeze|thaw) for
>> synchronizing.
>>
>> page_offline_freeze()/page_offline_thaw() allows for a subsystem to
>> synchronize with such drivers, achieving that a page cannot be set
>> PageOffline() while frozen.
>>
>> page_offline_begin()/page_offline_end() is used by drivers that care about
>> such races when setting a page PageOffline().
>>
>> For simplicity, use a rwsem for now; neither drivers nor users are
>> performance sensitive.
>>
>> Acked-by: Michal Hocko <[email protected]>
>> Signed-off-by: David Hildenbrand <[email protected]>
>
> One nit below, otherwise
>
> Reviewed-by: Mike Rapoport <[email protected]>
>
>> ---
>> include/linux/page-flags.h | 10 ++++++++++
>> mm/util.c | 40 ++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 50 insertions(+)
>>
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index daed82744f4b..ea2df9a247b3 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -769,9 +769,19 @@ PAGE_TYPE_OPS(Buddy, buddy)
>> * relies on this feature is aware that re-onlining the memory block will
>> * require to re-set the pages PageOffline() and not giving them to the
>> * buddy via online_page_callback_t.
>> + *
>> + * There are drivers that mark a page PageOffline() and do not expect any
>
> Maybe "and expect there won't be any further access"...
>
Thanks, makes sense.
I'll wait a bit before I resend.
--
Thanks,
David / dhildenb
On Fri, May 14, 2021 at 07:22:44PM +0200, David Hildenbrand wrote:
> Let's avoid reading:
>
> 1) Offline memory sections: the content of offline memory sections is stale
> as the memory is effectively unused by the kernel. On s390x with standby
> memory, offline memory sections (belonging to offline storage
> increments) are not accessible. With virtio-mem and the hyper-v balloon,
> we can have unavailable memory chunks that should not be accessed inside
> offline memory sections. Last but not least, offline memory sections
> might contain hwpoisoned pages which we can no longer identify
> because the memmap is stale.
>
> 2) PG_offline pages: logically offline pages that are documented as
> "The content of these pages is effectively stale. Such pages should not
> be touched (read/write/dump/save) except by their owner.".
> Examples include pages inflated in a balloon or unavailble memory
> ranges inside hotplugged memory sections with virtio-mem or the hyper-v
> balloon.
>
> 3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal.
> As documented: "Accessing is not safe since it may cause another machine
> check. Don't touch!"
>
> Introduce is_page_hwpoison(), adding a comment that it is inherently
> racy but best we can really do.
>
> Reading /proc/kcore now performs similar checks as when reading
> /proc/vmcore for kdump via makedumpfile: problematic pages are exclude.
> It's also similar to hibernation code, however, we don't skip hwpoisoned
> pages when processing pages in kernel/power/snapshot.c:saveable_page() yet.
>
> Note 1: we can race against memory offlining code, especially
> memory going offline and getting unplugged: however, we will properly tear
> down the identity mapping and handle faults gracefully when accessing
> this memory from kcore code.
>
> Note 2: we can race against drivers setting PageOffline() and turning
> memory inaccessible in the hypervisor. We'll handle this in a follow-up
> patch.
>
> Reviewed-by: Mike Rapoport <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
--
Oscar Salvador
SUSE L3
On Fri, May 14, 2021 at 07:22:46PM +0200, David Hildenbrand wrote:
> Let's properly use page_offline_(start|end) to synchronize setting
> PageOffline(), so we won't have valid page access to unplugged memory
> regions from /proc/kcore.
>
> Existing balloon implementations usually allow reading inflated memory;
> doing so might result in unnecessary overhead in the hypervisor, which
> is currently the case with virtio-mem.
>
> For future virtio-mem use cases, it will be different when using shmem,
> huge pages, !anonymous private mappings, ... as backing storage for a VM.
> virtio-mem unplugged memory must no longer be accessed and access might
> result in undefined behavior. There will be a virtio spec extension to
> document this change, including a new feature flag indicating the
> changed behavior. We really don't want to race against PFN walkers
> reading random page content.
>
> Acked-by: Michael S. Tsirkin <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
--
Oscar Salvador
SUSE L3
On Fri, May 14, 2021 at 07:22:47PM +0200, David Hildenbrand wrote:
> Let's properly synchronize with drivers that set PageOffline().
> Unfreeze/thaw every now and then, so drivers that want to set PageOffline()
> can make progress.
>
> Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
--
Oscar Salvador
SUSE L3
On Fri, May 14, 2021 at 07:22:45PM +0200, David Hildenbrand wrote:
> A driver might set a page logically offline -- PageOffline() -- and
> turn the page inaccessible in the hypervisor; after that, access to page
> content can be fatal. One example is virtio-mem; while unplugged memory
> -- marked as PageOffline() can currently be read in the hypervisor, this
> will no longer be the case in the future; for example, when having
> a virtio-mem device backed by huge pages in the hypervisor.
>
> Some special PFN walkers -- i.e., /proc/kcore -- read content of random
> pages after checking PageOffline(); however, these PFN walkers can race
> with drivers that set PageOffline().
>
> Let's introduce page_offline_(begin|end|freeze|thaw) for
> synchronizing.
>
> page_offline_freeze()/page_offline_thaw() allows for a subsystem to
> synchronize with such drivers, achieving that a page cannot be set
> PageOffline() while frozen.
>
> page_offline_begin()/page_offline_end() is used by drivers that care about
> such races when setting a page PageOffline().
>
> For simplicity, use a rwsem for now; neither drivers nor users are
> performance sensitive.
>
> Acked-by: Michal Hocko <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
--
Oscar Salvador
SUSE L3