Right now, pages inflated as part of a balloon driver will be dumped
by dump tools like makedumpfile. While XEN is able to check in the
crash kernel whether a certain pfn is actuall backed by memory in the
hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
virtio-balloon and hv-balloon inflated memory will essentially result in
zero pages getting allocated by the hypervisor and the dump getting
filled with this data.
The allocation and reading of zero pages can directly be avoided if a
dumping tool could know which pages only contain stale information not to
be dumped.
Also for XEN, calling into the kernel and asking the hypervisor if a
pfn is backed can be avoided if the duming tool would skip such pages
right from the beginning.
Dumping tools have no idea whether a given page is part of a balloon driver
and shall not be dumped. Esp. PG_reserved cannot be used for that purpose
as all memory allocated during early boot is also PG_reserved, see
discussion at [1]. So some other way of indication is required and a new
page flag is frowned upon.
We have PG_balloon (MAPCOUNT value), which is essentially unused now. I
suggest renaming it to something more generic (PG_offline) to mark pages as
logically offline. This flag can than e.g. also be used by virtio-mem in
the future to mark subsections as offline. Or by other code that wants to
put pages logically offline (e.g. later maybe poisoned pages that shall
no longer be used).
This series converts PG_balloon to PG_offline, allows dumping tools to
query the value to detect such pages and marks pages in the hv-balloon
and XEN balloon properly as PG_offline. Note that virtio-balloon already
set pages to PG_balloon (and now PG_offline).
Please note that this is also helpful for a problem we were seeing under
Hyper-V: Dumping logically offline memory (pages kept fake offline while
onlining a section via online_page_callback) would under some condicions
result in a kernel panic when dumping them.
As I don't have access to neither XEN nor Hyper-V installation, this was
not tested yet (and a makedumpfile change will be required to skip
dumping these pages).
[1] https://lkml.org/lkml/2018/7/20/566
David Hildenbrand (6):
mm: balloon: update comment about isolation/migration/compaction
mm: convert PG_balloon to PG_offline
kexec: export PG_offline to VMCOREINFO
xen/balloon: mark inflated pages PG_offline
hv_balloon: mark inflated pages PG_offline
PM / Hibernate: exclude all PageOffline() pages
Documentation/admin-guide/mm/pagemap.rst | 6 +++++
drivers/hv/hv_balloon.c | 14 ++++++++--
drivers/xen/balloon.c | 3 +++
fs/proc/page.c | 4 +--
include/linux/balloon_compaction.h | 34 +++++++++---------------
include/linux/page-flags.h | 11 +++++---
include/uapi/linux/kernel-page-flags.h | 1 +
kernel/crash_core.c | 2 ++
kernel/power/snapshot.c | 5 +++-
tools/vm/page-types.c | 1 +
10 files changed, 51 insertions(+), 30 deletions(-)
--
2.17.2
Commit b1123ea6d3b3 ("mm: balloon: use general non-lru movable page
feature") reworked balloon handling to make use of the general
non-lru movable page feature. The big comment block in
balloon_compaction.h contains quite some outdated information. Let's fix
this.
Cc: Andrew Morton <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/balloon_compaction.h | 26 +++++++++-----------------
1 file changed, 9 insertions(+), 17 deletions(-)
diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index 53051f3d8f25..cbe50da5a59d 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -4,15 +4,18 @@
*
* Common interface definitions for making balloon pages movable by compaction.
*
- * Despite being perfectly possible to perform ballooned pages migration, they
- * make a special corner case to compaction scans because balloon pages are not
- * enlisted at any LRU list like the other pages we do compact / migrate.
+ * Balloon page migration makes use of the general non-lru movable page
+ * feature.
+ *
+ * page->private is used to reference the responsible balloon device.
+ * page->mapping is used in context of non-lru page migration to reference
+ * the address space operations for page isolation/migration/compaction.
*
* As the page isolation scanning step a compaction thread does is a lockless
* procedure (from a page standpoint), it might bring some racy situations while
* performing balloon page compaction. In order to sort out these racy scenarios
* and safely perform balloon's page compaction and migration we must, always,
- * ensure following these three simple rules:
+ * ensure following these simple rules:
*
* i. when updating a balloon's page ->mapping element, strictly do it under
* the following lock order, independently of the far superior
@@ -21,19 +24,8 @@
* +--spin_lock_irq(&b_dev_info->pages_lock);
* ... page->mapping updates here ...
*
- * ii. before isolating or dequeueing a balloon page from the balloon device
- * pages list, the page reference counter must be raised by one and the
- * extra refcount must be dropped when the page is enqueued back into
- * the balloon device page list, thus a balloon page keeps its reference
- * counter raised only while it is under our special handling;
- *
- * iii. after the lockless scan step have selected a potential balloon page for
- * isolation, re-test the PageBalloon mark and the PagePrivate flag
- * under the proper page lock, to ensure isolating a valid balloon page
- * (not yet isolated, nor under release procedure)
- *
- * iv. isolation or dequeueing procedure must clear PagePrivate flag under
- * page lock together with removing page from balloon device page list.
+ * ii. isolation or dequeueing procedure must remove the page from balloon
+ * device page list under b_dev_info->pages_lock.
*
* The functions provided by this interface are placed to help on coping with
* the aforementioned balloon page corner case, as well as to ensure the simple
--
2.17.2
PG_balloon was introduced to implement page migration/compaction for pages
inflated in virtio-balloon. Nowadays, it is only a marker that a page is
part of virtio-balloon and therefore logically offline.
We also want to make use of this flag in other balloon drivers - for
inflated pages or when onlining a section but keeping some pages offline
(e.g. used right now by XEN and Hyper-V via set_online_page_callback()).
We are going to expose this flag to dump tools like makedumpfile. But
instead of exposing PG_balloon, let's generalize the concept of marking
pages as logically offline, so it can be reused for other purposes
later on.
Rename PG_balloon to PG_offline. This is an indicator that the page is
logically offline, the content stale and that it should not be touched
(e.g. a hypervisor would have to allocate backing storage in order for the
guest to dump an unused page). We can then e.g. exclude such pages from
dumps.
In following patches, we will make use of this bit also in other balloon
drivers. While at it, document PGTABLE.
Cc: Jonathan Corbet <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Christian Hansen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Miles Chen <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
Documentation/admin-guide/mm/pagemap.rst | 6 ++++++
fs/proc/page.c | 4 ++--
include/linux/balloon_compaction.h | 8 ++++----
include/linux/page-flags.h | 11 +++++++----
include/uapi/linux/kernel-page-flags.h | 1 +
tools/vm/page-types.c | 1 +
6 files changed, 21 insertions(+), 10 deletions(-)
diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index 3f7bade2c231..9afd6bdc424b 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -78,6 +78,8 @@ number of times a page is mapped.
23. BALLOON
24. ZERO_PAGE
25. IDLE
+ 26. PGTABLE
+ 27. OFFLINE
* ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
@@ -128,6 +130,10 @@ Short descriptions to the page flags
Note that this flag may be stale in case the page was accessed via
a PTE. To make sure the flag is up-to-date one has to read
``/sys/kernel/mm/page_idle/bitmap`` first.
+26 - PGTABLE
+ page is in use as a page table
+27 - OFFLINE
+ page is logically offline
IO related page flags
---------------------
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 6c517b11acf8..378401af4d9d 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -152,8 +152,8 @@ u64 stable_page_flags(struct page *page)
else if (page_count(page) == 0 && is_free_buddy_page(page))
u |= 1 << KPF_BUDDY;
- if (PageBalloon(page))
- u |= 1 << KPF_BALLOON;
+ if (PageOffline(page))
+ u |= 1 << KPF_OFFLINE;
if (PageTable(page))
u |= 1 << KPF_PGTABLE;
diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index cbe50da5a59d..f111c780ef1d 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -95,7 +95,7 @@ extern int balloon_page_migrate(struct address_space *mapping,
static inline void balloon_page_insert(struct balloon_dev_info *balloon,
struct page *page)
{
- __SetPageBalloon(page);
+ __SetPageOffline(page);
__SetPageMovable(page, balloon->inode->i_mapping);
set_page_private(page, (unsigned long)balloon);
list_add(&page->lru, &balloon->pages);
@@ -111,7 +111,7 @@ static inline void balloon_page_insert(struct balloon_dev_info *balloon,
*/
static inline void balloon_page_delete(struct page *page)
{
- __ClearPageBalloon(page);
+ __ClearPageOffline(page);
__ClearPageMovable(page);
set_page_private(page, 0);
/*
@@ -141,13 +141,13 @@ static inline gfp_t balloon_mapping_gfp_mask(void)
static inline void balloon_page_insert(struct balloon_dev_info *balloon,
struct page *page)
{
- __SetPageBalloon(page);
+ __SetPageOffline(page);
list_add(&page->lru, &balloon->pages);
}
static inline void balloon_page_delete(struct page *page)
{
- __ClearPageBalloon(page);
+ __ClearPageOffline(page);
list_del(&page->lru);
}
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 50ce1bddaf56..f91da3d0a67e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -670,7 +670,7 @@ PAGEFLAG_FALSE(DoubleMap)
#define PAGE_TYPE_BASE 0xf0000000
/* Reserve 0x0000007f to catch underflows of page_mapcount */
#define PG_buddy 0x00000080
-#define PG_balloon 0x00000100
+#define PG_offline 0x00000100
#define PG_kmemcg 0x00000200
#define PG_table 0x00000400
@@ -700,10 +700,13 @@ static __always_inline void __ClearPage##uname(struct page *page) \
PAGE_TYPE_OPS(Buddy, buddy)
/*
- * PageBalloon() is true for pages that are on the balloon page list
- * (see mm/balloon_compaction.c).
+ * PageOffline() indicates that the pages is logically offline although the
+ * containing section is online. (e.g. inflated in a balloon driver or
+ * not onlined when onlining the section).
+ * The content of these pages is effectively stale. Such pages should not
+ * be touched (read/write/dump/save) except by their owner.
*/
-PAGE_TYPE_OPS(Balloon, balloon)
+PAGE_TYPE_OPS(Offline, offline)
/*
* If kmemcg is enabled, the buddy allocator will set PageKmemcg() on
diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
index 21b9113c69da..6c662eb0dab8 100644
--- a/include/uapi/linux/kernel-page-flags.h
+++ b/include/uapi/linux/kernel-page-flags.h
@@ -36,5 +36,6 @@
#define KPF_ZERO_PAGE 24
#define KPF_IDLE 25
#define KPF_PGTABLE 26
+#define KPF_OFFLINE 27
#endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
index 37908a83ddc2..b219c2eafd6a 100644
--- a/tools/vm/page-types.c
+++ b/tools/vm/page-types.c
@@ -137,6 +137,7 @@ static const char * const page_flag_names[] = {
[KPF_PGTABLE] = "g:pgtable",
[KPF_ZERO_PAGE] = "z:zero_page",
[KPF_IDLE] = "i:idle_page",
+ [KPF_OFFLINE] = "o:offline",
[KPF_RESERVED] = "r:reserved",
[KPF_MLOCKED] = "m:mlocked",
--
2.17.2
Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so
makedumpfile can directly skip pages that are logically offline and the
content therefore stale.
Cc: Andrew Morton <[email protected]>
Cc: Dave Young <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Omar Sandoval <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
kernel/crash_core.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 933cb3e45b98..093c9f917ed0 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -464,6 +464,8 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
#ifdef CONFIG_HUGETLB_PAGE
VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
+#define PAGE_OFFLINE_MAPCOUNT_VALUE (~PG_offline)
+ VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
#endif
arch_crash_save_vmcoreinfo();
--
2.17.2
Mark inflated and never onlined pages PG_offline, to tell the world that
the content is stale and should not be dumped.
Cc: Boris Ostrovsky <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Stefano Stabellini <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/xen/balloon.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 12148289debd..14dd6b814db3 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -425,6 +425,7 @@ static int xen_bring_pgs_online(struct page *pg, unsigned int order)
for (i = 0; i < size; i++) {
p = pfn_to_page(start_pfn + i);
__online_page_set_limits(p);
+ __SetPageOffline(p);
__balloon_append(p);
}
mutex_unlock(&balloon_mutex);
@@ -493,6 +494,7 @@ static enum bp_state increase_reservation(unsigned long nr_pages)
xenmem_reservation_va_mapping_update(1, &page, &frame_list[i]);
/* Relinquish the page back to the allocator. */
+ __ClearPageOffline(page);
free_reserved_page(page);
}
@@ -519,6 +521,7 @@ static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp)
state = BP_EAGAIN;
break;
}
+ __SetPageOffline(page);
adjust_managed_page_count(page, -1);
xenmem_reservation_scrub_page(page);
list_add(&page->lru, &pages);
--
2.17.2
Mark inflated and never onlined pages PG_offline, to tell the world that
the content is stale and should not be dumped.
Cc: "K. Y. Srinivasan" <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/hv/hv_balloon.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 5728dc470eeb..778b6f879d1c 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -681,8 +681,13 @@ static struct notifier_block hv_memory_nb = {
/* Check if the particular page is backed and can be onlined and online it. */
static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg)
{
- if (!has_pfn_is_backed(has, page_to_pfn(pg)))
+ if (!has_pfn_is_backed(has, page_to_pfn(pg))) {
+ if (!PageOffline(pg))
+ __SetPageOffline(pg);
return;
+ }
+ if (PageOffline(pg))
+ __ClearPageOffline(pg);
/* This frame is currently backed; online the page. */
__online_page_set_limits(pg);
@@ -1200,6 +1205,7 @@ static void free_balloon_pages(struct hv_dynmem_device *dm,
for (i = 0; i < num_pages; i++) {
pg = pfn_to_page(i + start_frame);
+ __ClearPageOffline(pg);
__free_page(pg);
dm->num_pages_ballooned--;
}
@@ -1212,7 +1218,7 @@ static unsigned int alloc_balloon_pages(struct hv_dynmem_device *dm,
struct dm_balloon_response *bl_resp,
int alloc_unit)
{
- unsigned int i = 0;
+ unsigned int i, j;
struct page *pg;
if (num_pages < alloc_unit)
@@ -1244,6 +1250,10 @@ static unsigned int alloc_balloon_pages(struct hv_dynmem_device *dm,
if (alloc_unit != 1)
split_page(pg, get_order(alloc_unit << PAGE_SHIFT));
+ /* mark all pages offline */
+ for (j = 0; j < (1 << get_order(alloc_unit << PAGE_SHIFT)); j++)
+ __SetPageOffline(pg + j);
+
bl_resp->range_count++;
bl_resp->range_array[i].finfo.start_page =
page_to_pfn(pg);
--
2.17.2
The content of pages that are marked PG_offline is not of interest
(e.g. inflated by a balloon driver), let's skip these pages.
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
kernel/power/snapshot.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index b0308a2c6000..01db1d13481a 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1222,7 +1222,7 @@ static struct page *saveable_highmem_page(struct zone *zone, unsigned long pfn)
BUG_ON(!PageHighMem(page));
if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page) ||
- PageReserved(page))
+ PageReserved(page) || PageOffline(page))
return NULL;
if (page_is_guard(page))
@@ -1286,6 +1286,9 @@ static struct page *saveable_page(struct zone *zone, unsigned long pfn)
if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
return NULL;
+ if (PageOffline(page))
+ return NULL;
+
if (PageReserved(page)
&& (!kernel_page_present(page) || pfn_is_nosave(pfn)))
return NULL;
--
2.17.2
On Wed, Nov 14, 2018 at 10:17:00PM +0100, David Hildenbrand wrote:
> Rename PG_balloon to PG_offline. This is an indicator that the page is
> logically offline, the content stale and that it should not be touched
> (e.g. a hypervisor would have to allocate backing storage in order for the
> guest to dump an unused page). We can then e.g. exclude such pages from
> dumps.
>
> In following patches, we will make use of this bit also in other balloon
> drivers. While at it, document PGTABLE.
Thank you for documenting PGTABLE. I didn't realise I also had this
document to update when I added PGTABLE.
> +++ b/Documentation/admin-guide/mm/pagemap.rst
> @@ -78,6 +78,8 @@ number of times a page is mapped.
> 23. BALLOON
> 24. ZERO_PAGE
> 25. IDLE
> + 26. PGTABLE
> + 27. OFFLINE
So the offline *user* bit is new ... even though the *kernel* bit
just renames the balloon bit. I'm not sure how I feel about this.
I'm going to think about it some more. Could you share your decision
process with us?
I have no objection to renaming the balloon bit inside the kernel; I
think that's a wise idea. I'm just not sure whether we should rename
the user balloon bit rather than adding a new bit.
On 14.11.18 23:23, Matthew Wilcox wrote:
> On Wed, Nov 14, 2018 at 10:17:00PM +0100, David Hildenbrand wrote:
>> Rename PG_balloon to PG_offline. This is an indicator that the page is
>> logically offline, the content stale and that it should not be touched
>> (e.g. a hypervisor would have to allocate backing storage in order for the
>> guest to dump an unused page). We can then e.g. exclude such pages from
>> dumps.
>>
>> In following patches, we will make use of this bit also in other balloon
>> drivers. While at it, document PGTABLE.
>
> Thank you for documenting PGTABLE. I didn't realise I also had this
> document to update when I added PGTABLE.
Thank you for looking into this :)
>
>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>> @@ -78,6 +78,8 @@ number of times a page is mapped.
>> 23. BALLOON
>> 24. ZERO_PAGE
>> 25. IDLE
>> + 26. PGTABLE
>> + 27. OFFLINE
>
> So the offline *user* bit is new ... even though the *kernel* bit
> just renames the balloon bit. I'm not sure how I feel about this.
> I'm going to think about it some more. Could you share your decision
> process with us?
BALLOON was/is documented as
"23 - BALLOON
balloon compaction page
"
and only includes all virtio-ballon pages after the non-lru migration
feature has been implemented for ballooned pages. Since then, this flag
does basically no longer stands for what it actually was supposed to do.
To not break uapi I decided to not rename it but instead to add a new flag.
>
> I have no objection to renaming the balloon bit inside the kernel; I
> think that's a wise idea. I'm just not sure whether we should rename
> the user balloon bit rather than adding a new bit.
>
Can we rename without breaking uapi?
--
Thanks,
David / dhildenb
From: David Hildenbrand
Sent: November 14, 2018 at 9:16:58 PM GMT
> Subject: [PATCH RFC 0/6] mm/kdump: allow to exclude pages that are logically offline
>
>
> Right now, pages inflated as part of a balloon driver will be dumped
> by dump tools like makedumpfile. While XEN is able to check in the
> crash kernel whether a certain pfn is actuall backed by memory in the
> hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
> virtio-balloon and hv-balloon inflated memory will essentially result in
> zero pages getting allocated by the hypervisor and the dump getting
> filled with this data.
Is there any reason that VMware balloon driver is not mentioned?
On 14.11.18 23:57, Nadav Amit wrote:
> From: David Hildenbrand
> Sent: November 14, 2018 at 9:16:58 PM GMT
>> Subject: [PATCH RFC 0/6] mm/kdump: allow to exclude pages that are logically offline
>>
>>
>> Right now, pages inflated as part of a balloon driver will be dumped
>> by dump tools like makedumpfile. While XEN is able to check in the
>> crash kernel whether a certain pfn is actuall backed by memory in the
>> hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
>> virtio-balloon and hv-balloon inflated memory will essentially result in
>> zero pages getting allocated by the hypervisor and the dump getting
>> filled with this data.
>
> Is there any reason that VMware balloon driver is not mentioned?
Definitely ...
... not ;) . I haven't looked at vmware's balloon driver yet (I only saw
that there was quite some activity recently). I guess it should have
similar problems. (I mean reading and dumping data nobody cares about is
certainly not desired)
Can you share if something like this is also desired for vmware's
implementation? (I tagged this as RFC to get some more feedback)
It should in theory be as simple as adding a handful of
_SetPageOffline()/_ClearPageOffline() at the right spots.
--
Thanks,
David / dhildenb
From: David Hildenbrand
Sent: November 14, 2018 at 11:05:38 PM GMT
> Subject: Re: [PATCH RFC 0/6] mm/kdump: allow to exclude pages that are logically offline
>
>
> On 14.11.18 23:57, Nadav Amit wrote:
>> From: David Hildenbrand
>> Sent: November 14, 2018 at 9:16:58 PM GMT
>>> Subject: [PATCH RFC 0/6] mm/kdump: allow to exclude pages that are logically offline
>>>
>>>
>>> Right now, pages inflated as part of a balloon driver will be dumped
>>> by dump tools like makedumpfile. While XEN is able to check in the
>>> crash kernel whether a certain pfn is actuall backed by memory in the
>>> hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
>>> virtio-balloon and hv-balloon inflated memory will essentially result in
>>> zero pages getting allocated by the hypervisor and the dump getting
>>> filled with this data.
>>
>> Is there any reason that VMware balloon driver is not mentioned?
>
> Definitely ...
>
> ... not ;) . I haven't looked at vmware's balloon driver yet (I only saw
> that there was quite some activity recently). I guess it should have
> similar problems. (I mean reading and dumping data nobody cares about is
> certainly not desired)
>
> Can you share if something like this is also desired for vmware's
> implementation? (I tagged this as RFC to get some more feedback)
>
> It should in theory be as simple as adding a handful of
> _SetPageOffline()/_ClearPageOffline() at the right spots.
Thanks, I was just suspecting it is personal ;-)
Actually, some patches that I sent for 4.20 to use the balloon-compaction
infrastructure by the VMware balloon fell between the cracks, and I need
to resend them.
I would obviously prefer that your changes would be done on top of those
that were skipped. This patch-set sounds very reasonable to me, but I prefer
that Julien (cc’d) would also give his opinion.
Regards,
Nadav
>On 11/14/18, 3:41 PM, "Nadav Amit" <[email protected]> wrote:
>>From: David Hildenbrand
>>Sent: November 14, 2018 at 11:05:38 PM GMT
>> Subject: Re: [PATCH RFC 0/6] mm/kdump: allow to exclude pages that are logically offline
>>
>>
>> Can you share if something like this is also desired for vmware's
>> implementation? (I tagged this as RFC to get some more feedback)
>>
>> It should in theory be as simple as adding a handful of
>> _SetPageOffline()/_ClearPageOffline() at the right spots.
>
> Thanks, I was just suspecting it is personal ;-)
>
> I would obviously prefer that your changes would be done on top of those
> that were skipped. This patch-set sounds very reasonable to me, but I prefer
> that Julien (cc’d) would also give his opinion.
I think this is desirable for VMware's implementation also. You are right,
dumping data that is not relevant is a waste :-)
I haven't heard of any panic/issue due to this but that's still a good optimization.
Nadav or I could help to test that on ESX if required.
Regards,
--
Julien Freche
On Wed, Nov 14, 2018 at 11:49:15PM +0100, David Hildenbrand wrote:
> On 14.11.18 23:23, Matthew Wilcox wrote:
> > On Wed, Nov 14, 2018 at 10:17:00PM +0100, David Hildenbrand wrote:
> >> Rename PG_balloon to PG_offline. This is an indicator that the page is
> >> logically offline, the content stale and that it should not be touched
> >> (e.g. a hypervisor would have to allocate backing storage in order for the
> >> guest to dump an unused page). We can then e.g. exclude such pages from
> >> dumps.
> >>
> >> In following patches, we will make use of this bit also in other balloon
> >> drivers. While at it, document PGTABLE.
> >
> > Thank you for documenting PGTABLE. I didn't realise I also had this
> > document to update when I added PGTABLE.
>
> Thank you for looking into this :)
>
> >
> >> +++ b/Documentation/admin-guide/mm/pagemap.rst
> >> @@ -78,6 +78,8 @@ number of times a page is mapped.
> >> 23. BALLOON
> >> 24. ZERO_PAGE
> >> 25. IDLE
> >> + 26. PGTABLE
> >> + 27. OFFLINE
> >
> > So the offline *user* bit is new ... even though the *kernel* bit
> > just renames the balloon bit. I'm not sure how I feel about this.
> > I'm going to think about it some more. Could you share your decision
> > process with us?
>
> BALLOON was/is documented as
>
> "23 - BALLOON
> balloon compaction page
> "
>
> and only includes all virtio-ballon pages after the non-lru migration
> feature has been implemented for ballooned pages. Since then, this flag
> does basically no longer stands for what it actually was supposed to do.
Perhaps I missing something, but how the user should interpret "23" when he
reads /proc/kpageflags?
> To not break uapi I decided to not rename it but instead to add a new flag.
I've got a feeling that uapi was anyway changed for the BALLON flag
meaning.
> >
> > I have no objection to renaming the balloon bit inside the kernel; I
> > think that's a wise idea. I'm just not sure whether we should rename
> > the user balloon bit rather than adding a new bit.
> >
>
> Can we rename without breaking uapi?
>
> --
>
> Thanks,
>
> David / dhildenb
>
--
Sincerely yours,
Mike.
Hi David,
On 11/14/18 at 10:17pm, David Hildenbrand wrote:
> Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so
> makedumpfile can directly skip pages that are logically offline and the
> content therefore stale.
It would be good to copy some background info from cover letter to the
patch description so that we can get better understanding why this is
needed now.
BTW, Lianbo is working on a documentation of the vmcoreinfo exported
fields. Ccing him so that he is aware of this.
Also cc Boris, although I do not want the doc changes blocks this
he might have different opinion :)
>
> Cc: Andrew Morton <[email protected]>
> Cc: Dave Young <[email protected]>
> Cc: "Kirill A. Shutemov" <[email protected]>
> Cc: Baoquan He <[email protected]>
> Cc: Omar Sandoval <[email protected]>
> Cc: Arnd Bergmann <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: "Michael S. Tsirkin" <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> kernel/crash_core.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 933cb3e45b98..093c9f917ed0 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -464,6 +464,8 @@ static int __init crash_save_vmcoreinfo_init(void)
> VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
> #ifdef CONFIG_HUGETLB_PAGE
> VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
> +#define PAGE_OFFLINE_MAPCOUNT_VALUE (~PG_offline)
> + VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
> #endif
>
> arch_crash_save_vmcoreinfo();
> --
> 2.17.2
>
Thanks
Dave
On Wed 2018-11-14 22:17:04, David Hildenbrand wrote:
> The content of pages that are marked PG_offline is not of interest
> (e.g. inflated by a balloon driver), let's skip these pages.
>
> Cc: "Rafael J. Wysocki" <[email protected]>
Acked-by: Pavel Machek <[email protected]>
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index b0308a2c6000..01db1d13481a 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -1222,7 +1222,7 @@ static struct page *saveable_highmem_page(struct zone *zone, unsigned long pfn)
> BUG_ON(!PageHighMem(page));
>
> if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page) ||
> - PageReserved(page))
> + PageReserved(page) || PageOffline(page))
> return NULL;
>
> if (page_is_guard(page))
> @@ -1286,6 +1286,9 @@ static struct page *saveable_page(struct zone *zone, unsigned long pfn)
> if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
> return NULL;
>
> + if (PageOffline(page))
> + return NULL;
> +
> if (PageReserved(page)
> && (!kernel_page_present(page) || pfn_is_nosave(pfn)))
> return NULL;
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On 15.11.18 03:07, Mike Rapoport wrote:
> On Wed, Nov 14, 2018 at 11:49:15PM +0100, David Hildenbrand wrote:
>> On 14.11.18 23:23, Matthew Wilcox wrote:
>>> On Wed, Nov 14, 2018 at 10:17:00PM +0100, David Hildenbrand wrote:
>>>> Rename PG_balloon to PG_offline. This is an indicator that the page is
>>>> logically offline, the content stale and that it should not be touched
>>>> (e.g. a hypervisor would have to allocate backing storage in order for the
>>>> guest to dump an unused page). We can then e.g. exclude such pages from
>>>> dumps.
>>>>
>>>> In following patches, we will make use of this bit also in other balloon
>>>> drivers. While at it, document PGTABLE.
>>>
>>> Thank you for documenting PGTABLE. I didn't realise I also had this
>>> document to update when I added PGTABLE.
>>
>> Thank you for looking into this :)
>>
>>>
>>>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>>>> @@ -78,6 +78,8 @@ number of times a page is mapped.
>>>> 23. BALLOON
>>>> 24. ZERO_PAGE
>>>> 25. IDLE
>>>> + 26. PGTABLE
>>>> + 27. OFFLINE
>>>
>>> So the offline *user* bit is new ... even though the *kernel* bit
>>> just renames the balloon bit. I'm not sure how I feel about this.
>>> I'm going to think about it some more. Could you share your decision
>>> process with us?
>>
>> BALLOON was/is documented as
>>
>> "23 - BALLOON
>> balloon compaction page
>> "
>>
>> and only includes all virtio-ballon pages after the non-lru migration
>> feature has been implemented for ballooned pages. Since then, this flag
>> does basically no longer stands for what it actually was supposed to do.
>
> Perhaps I missing something, but how the user should interpret "23" when he
> reads /proc/kpageflags?
Looking at the history in more detail:
commit 09316c09dde33aae14f34489d9e3d243ec0d5938
Author: Konstantin Khlebnikov <[email protected]>
Date: Thu Oct 9 15:29:32 2014 -0700
mm/balloon_compaction: add vmstat counters and kpageflags bit
Always mark pages with PageBalloon even if balloon compaction is
disabled
and expose this mark in /proc/kpageflags as KPF_BALLOON.
So KPF_BALLOON was exposed when virtio-balloon pages were always marked
with PG_balloon. So the documentation is actually wrong ("balloon page"
vs. "balloon compaction page").
I have no idea who actually used that information. I suspect this was
just some debugging aid.
>
>> To not break uapi I decided to not rename it but instead to add a new flag.
>
> I've got a feeling that uapi was anyway changed for the BALLON flag
> meaning.
Yes. If we *replace* KPF_BALLOON by KPF_OFFLINE
a) Some applications might no longer compile (I guess that's ok)
b) Some old applications will treat KPF_OFFLINE like KPF_BALLOON (which
should at least for virtio-balloon usage until now be fine - it is just
more generic)
So I guess it's up to Maintainers/Matthew to decide :)
--
Thanks,
David / dhildenb
On 15.11.18 07:19, Dave Young wrote:
> Hi David,
>
> On 11/14/18 at 10:17pm, David Hildenbrand wrote:
>> Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so
>> makedumpfile can directly skip pages that are logically offline and the
>> content therefore stale.
>
> It would be good to copy some background info from cover letter to the
> patch description so that we can get better understanding why this is
> needed now.
Yes, will add more detail!
>
> BTW, Lianbo is working on a documentation of the vmcoreinfo exported
> fields. Ccing him so that he is aware of this.
>
> Also cc Boris, although I do not want the doc changes blocks this
> he might have different opinion :)
I'll be happy to help updating documentation (or updating it myself if
the doc updates go in first).
--
Thanks,
David / dhildenb
On Thu, Nov 15, 2018 at 02:19:23PM +0800, Dave Young wrote:
> It would be good to copy some background info from cover letter to the
> patch description so that we can get better understanding why this is
> needed now.
>
> BTW, Lianbo is working on a documentation of the vmcoreinfo exported
> fields. Ccing him so that he is aware of this.
>
> Also cc Boris, although I do not want the doc changes blocks this
> he might have different opinion :)
Yeah, my initial reaction is that exporting an mm-internal flag to
userspace is a no-no.
What would be better, IMHO, is having a general method of telling the
kdump kernel - maybe ranges of physical addresses - which to skip.
Because the moment there's a set of pages which do not have PG_offline
set but kdump would still like to skip, this breaks.
But that's mm guys' call.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On 15.11.18 12:10, Borislav Petkov wrote:
> On Thu, Nov 15, 2018 at 02:19:23PM +0800, Dave Young wrote:
>> It would be good to copy some background info from cover letter to the
>> patch description so that we can get better understanding why this is
>> needed now.
>>
>> BTW, Lianbo is working on a documentation of the vmcoreinfo exported
>> fields. Ccing him so that he is aware of this.
>>
>> Also cc Boris, although I do not want the doc changes blocks this
>> he might have different opinion :)
>
> Yeah, my initial reaction is that exporting an mm-internal flag to
> userspace is a no-no.
Sorry to say, but that is the current practice without which
makedumpfile would not be able to work at all. (exclude user pages,
exclude page cache, exclude buddy pages). Let's not reinvent the wheel
here. This is how dumping works forever.
Also see how hwpoisoned pages are handled.
(please note that most distributions only support dumping via
makedumpfile, for said reasons)
>
> What would be better, IMHO, is having a general method of telling the
> kdump kernel - maybe ranges of physical addresses - which to skip.
And that has to be updated whenever we change a page flag. But then the
dump kernel would have to be aware about "struct page" location and
format of some old kernel to be dump. Let's just not even discuss going
down that path.
>
> Because the moment there's a set of pages which do not have PG_offline
> set but kdump would still like to skip, this breaks.
I don't understand your concern. PG_offline is only an optimization for
sections that are online. Offline sections can already be completely
ignored when dumping.
I don't see how there should be "set of pages which do not have
PG_offline". I mean if they don't have PG_offline they will simply be
dumped like before. Which worked forever. Sorry, I don't get your point.
--
Thanks,
David / dhildenb
On Thu, Nov 15, 2018 at 12:20:40PM +0100, David Hildenbrand wrote:
> Sorry to say, but that is the current practice without which
> makedumpfile would not be able to work at all. (exclude user pages,
> exclude page cache, exclude buddy pages). Let's not reinvent the wheel
> here. This is how dumping works forever.
Sorry, but "we've always done this in the past" doesn't make it better.
> I don't see how there should be "set of pages which do not have
> PG_offline".
It doesn't have to be a set of pages. Think a (mmconfig perhaps) region
which the kdump kernel should completely skip because poking in it in
the kdump kernel, causes all kinds of havoc like machine checks. etc.
We've had and still have one issue like that.
But let me clarify my note: I don't want to be discussing with you the
design of makedumpfile and how it should or should not work - that ship
has already sailed. Apparently there are valid reasons to do it this
way.
I was *simply* stating that it feels wrong to export mm flags like that.
But as I said already, that is mm guys' call and looking at how we're
already exporting a bunch of stuff in the vmcoreinfo - including other
mm flags - I guess one more flag doesn't matter anymore.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On 15.11.18 12:52, Borislav Petkov wrote:
> On Thu, Nov 15, 2018 at 12:20:40PM +0100, David Hildenbrand wrote:
>> Sorry to say, but that is the current practice without which
>> makedumpfile would not be able to work at all. (exclude user pages,
>> exclude page cache, exclude buddy pages). Let's not reinvent the wheel
>> here. This is how dumping works forever.
>
> Sorry, but "we've always done this in the past" doesn't make it better.
Just saying that "I'm not the first to do it, don't hit me with a stick" :)
>
>> I don't see how there should be "set of pages which do not have
>> PG_offline".
>
> It doesn't have to be a set of pages. Think a (mmconfig perhaps) region
> which the kdump kernel should completely skip because poking in it in
> the kdump kernel, causes all kinds of havoc like machine checks. etc.
> We've had and still have one issue like that.
Indeed. And we still have without makedumpfile. I think you are aware of
this, but I'll explain it just for consistency: PG_hwpoison
At some point we detect a HW error and mask a page as PG_hwpoison.
makedumpfile knows how to treat that flag and can exclude it from the
dump (== not access it). No crash.
kdump itself has no clue about old "struct pages". Especially:
a) Where they are located in memory (e.g. SPARSE)
b) What their format is ("where are the flags")
c) What the meaning of flags is ("what does bit X mean")
In order to know such information, we would have to do parsing of quite
some information inside the kernel in kdump. Basically what makedumpfile
does just now. Is this feasible? I don't think so.
So we would need another approach to communicate such information as you
said. I can't think of any, but if anybody reading this has an idea,
please speak up. I am interested.
The *only* way right now we would have to handle such scenarios:
1. While dumping memory and we get a machine check, fake reading a zero
page instead of crashing.
2. While dumping memory and we get a fault, fake reading a zero page
instead of crashing.
>
> But let me clarify my note: I don't want to be discussing with you the
> design of makedumpfile and how it should or should not work - that ship
> has already sailed. Apparently there are valid reasons to do it this
> way.
Indeed, and the basic design is to export these flags. (let's say
"unfortunately", being able to handle such stuff in kdump directly would
be the dream).
> I was *simply* stating that it feels wrong to export mm flags like that.
>
> But as I said already, that is mm guys' call and looking at how we're
> already exporting a bunch of stuff in the vmcoreinfo - including other
> mm flags - I guess one more flag doesn't matter anymore.
Fair enough, noted. If you have an idea how to handle this in kdump,
please let me know.
--
Thanks,
David / dhildenb
On Thu 15-11-18 12:52:13, Borislav Petkov wrote:
> On Thu, Nov 15, 2018 at 12:20:40PM +0100, David Hildenbrand wrote:
> > Sorry to say, but that is the current practice without which
> > makedumpfile would not be able to work at all. (exclude user pages,
> > exclude page cache, exclude buddy pages). Let's not reinvent the wheel
> > here. This is how dumping works forever.
>
> Sorry, but "we've always done this in the past" doesn't make it better.
>
> > I don't see how there should be "set of pages which do not have
> > PG_offline".
>
> It doesn't have to be a set of pages. Think a (mmconfig perhaps) region
> which the kdump kernel should completely skip because poking in it in
> the kdump kernel, causes all kinds of havoc like machine checks. etc.
> We've had and still have one issue like that.
>
> But let me clarify my note: I don't want to be discussing with you the
> design of makedumpfile and how it should or should not work - that ship
> has already sailed. Apparently there are valid reasons to do it this
> way.
>
> I was *simply* stating that it feels wrong to export mm flags like that.
>
> But as I said already, that is mm guys' call and looking at how we're
> already exporting a bunch of stuff in the vmcoreinfo - including other
> mm flags - I guess one more flag doesn't matter anymore.
I am not familiar with kexec to judge this particular patch but we
cannot simply define any range for these pages (same as for hwpoison
ones) because they can be almost anywhere in the available memory range.
Then there can be countless of them. There is no other way to rule them
out but to check the page state.
I am not really sure what is the concern here exactly. Kdump is so
closly tight to the specific kernel version that the api exported
specifically for its purpose cannot be seriously considered an ABI.
Kdump has to adopt all the time.
--
Michal Hocko
SUSE Labs
[Cc Konstantin - the patch is http://lkml.kernel.org/r/[email protected]]
On Thu 15-11-18 10:21:13, David Hildenbrand wrote:
> On 15.11.18 03:07, Mike Rapoport wrote:
> > On Wed, Nov 14, 2018 at 11:49:15PM +0100, David Hildenbrand wrote:
> >> On 14.11.18 23:23, Matthew Wilcox wrote:
> >>> On Wed, Nov 14, 2018 at 10:17:00PM +0100, David Hildenbrand wrote:
> >>>> Rename PG_balloon to PG_offline. This is an indicator that the page is
> >>>> logically offline, the content stale and that it should not be touched
> >>>> (e.g. a hypervisor would have to allocate backing storage in order for the
> >>>> guest to dump an unused page). We can then e.g. exclude such pages from
> >>>> dumps.
> >>>>
> >>>> In following patches, we will make use of this bit also in other balloon
> >>>> drivers. While at it, document PGTABLE.
> >>>
> >>> Thank you for documenting PGTABLE. I didn't realise I also had this
> >>> document to update when I added PGTABLE.
> >>
> >> Thank you for looking into this :)
> >>
> >>>
> >>>> +++ b/Documentation/admin-guide/mm/pagemap.rst
> >>>> @@ -78,6 +78,8 @@ number of times a page is mapped.
> >>>> 23. BALLOON
> >>>> 24. ZERO_PAGE
> >>>> 25. IDLE
> >>>> + 26. PGTABLE
> >>>> + 27. OFFLINE
> >>>
> >>> So the offline *user* bit is new ... even though the *kernel* bit
> >>> just renames the balloon bit. I'm not sure how I feel about this.
> >>> I'm going to think about it some more. Could you share your decision
> >>> process with us?
> >>
> >> BALLOON was/is documented as
> >>
> >> "23 - BALLOON
> >> balloon compaction page
> >> "
> >>
> >> and only includes all virtio-ballon pages after the non-lru migration
> >> feature has been implemented for ballooned pages. Since then, this flag
> >> does basically no longer stands for what it actually was supposed to do.
> >
> > Perhaps I missing something, but how the user should interpret "23" when he
> > reads /proc/kpageflags?
>
> Looking at the history in more detail:
>
> commit 09316c09dde33aae14f34489d9e3d243ec0d5938
> Author: Konstantin Khlebnikov <[email protected]>
> Date: Thu Oct 9 15:29:32 2014 -0700
>
> mm/balloon_compaction: add vmstat counters and kpageflags bit
>
> Always mark pages with PageBalloon even if balloon compaction is
> disabled
> and expose this mark in /proc/kpageflags as KPF_BALLOON.
>
>
> So KPF_BALLOON was exposed when virtio-balloon pages were always marked
> with PG_balloon. So the documentation is actually wrong ("balloon page"
> vs. "balloon compaction page").
>
> I have no idea who actually used that information. I suspect this was
> just some debugging aid.
>
> >
> >> To not break uapi I decided to not rename it but instead to add a new flag.
> >
> > I've got a feeling that uapi was anyway changed for the BALLON flag
> > meaning.
>
> Yes. If we *replace* KPF_BALLOON by KPF_OFFLINE
>
> a) Some applications might no longer compile (I guess that's ok)
> b) Some old applications will treat KPF_OFFLINE like KPF_BALLOON (which
> should at least for virtio-balloon usage until now be fine - it is just
> more generic)
I do not think any compilation could break. If the semantic of the flag
is preserved then everything should work as expected.
--
Michal Hocko
SUSE Labs
On Wed 14-11-18 22:17:04, David Hildenbrand wrote:
[...]
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index b0308a2c6000..01db1d13481a 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -1222,7 +1222,7 @@ static struct page *saveable_highmem_page(struct zone *zone, unsigned long pfn)
> BUG_ON(!PageHighMem(page));
>
> if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page) ||
> - PageReserved(page))
> + PageReserved(page) || PageOffline(page))
> return NULL;
>
> if (page_is_guard(page))
> @@ -1286,6 +1286,9 @@ static struct page *saveable_page(struct zone *zone, unsigned long pfn)
> if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
> return NULL;
>
> + if (PageOffline(page))
> + return NULL;
> +
> if (PageReserved(page)
> && (!kernel_page_present(page) || pfn_is_nosave(pfn)))
> return NULL;
Btw. now that you are touching this file could you also make
s@pfn_to_page@pfn_to_online_page@ please? We really do not want to touch
offline pfn ranges in general. A separate patch for that of course.
Thanks!
--
Michal Hocko
SUSE Labs
On 15.11.18 13:23, Michal Hocko wrote:
> On Wed 14-11-18 22:17:04, David Hildenbrand wrote:
> [...]
>> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
>> index b0308a2c6000..01db1d13481a 100644
>> --- a/kernel/power/snapshot.c
>> +++ b/kernel/power/snapshot.c
>> @@ -1222,7 +1222,7 @@ static struct page *saveable_highmem_page(struct zone *zone, unsigned long pfn)
>> BUG_ON(!PageHighMem(page));
>>
>> if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page) ||
>> - PageReserved(page))
>> + PageReserved(page) || PageOffline(page))
>> return NULL;
>>
>> if (page_is_guard(page))
>> @@ -1286,6 +1286,9 @@ static struct page *saveable_page(struct zone *zone, unsigned long pfn)
>> if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
>> return NULL;
>>
>> + if (PageOffline(page))
>> + return NULL;
>> +
>> if (PageReserved(page)
>> && (!kernel_page_present(page) || pfn_is_nosave(pfn)))
>> return NULL;
>
> Btw. now that you are touching this file could you also make
> s@pfn_to_page@pfn_to_online_page@ please? We really do not want to touch
> offline pfn ranges in general. A separate patch for that of course.
>
> Thanks!
>
Sure thing, will look into that!
Thanks!
--
Thanks,
David / dhildenb
On Thu, Nov 15, 2018 at 01:11:02PM +0100, Michal Hocko wrote:
> I am not familiar with kexec to judge this particular patch but we
> cannot simply define any range for these pages (same as for hwpoison
> ones) because they can be almost anywhere in the available memory range.
> Then there can be countless of them. There is no other way to rule them
> out but to check the page state.
I guess, especially if it is a monster box with a lot of memory in it.
> I am not really sure what is the concern here exactly. Kdump is so
> closly tight to the specific kernel version that the api exported
> specifically for its purpose cannot be seriously considered an ABI.
> Kdump has to adopt all the time.
Right...
Except, when people start ogling vmcoreinfo for other things and start
exporting all kinds of kernel internals in there, my alarm bells start
ringing.
But ok, kdump *is* special and I guess that's fine.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Thu, Nov 15, 2018 at 01:01:17PM +0100, David Hildenbrand wrote:
> Just saying that "I'm not the first to do it, don't hit me with a stick" :)
:-)
> Indeed. And we still have without makedumpfile. I think you are aware of
> this, but I'll explain it just for consistency: PG_hwpoison
No, I appreciate an explanation very much! So thanks for that. :)
> At some point we detect a HW error and mask a page as PG_hwpoison.
>
> makedumpfile knows how to treat that flag and can exclude it from the
> dump (== not access it). No crash.
>
> kdump itself has no clue about old "struct pages". Especially:
> a) Where they are located in memory (e.g. SPARSE)
> b) What their format is ("where are the flags")
> c) What the meaning of flags is ("what does bit X mean")
>
> In order to know such information, we would have to do parsing of quite
> some information inside the kernel in kdump. Basically what makedumpfile
> does just now. Is this feasible? I don't think so.
>
> So we would need another approach to communicate such information as you
> said. I can't think of any, but if anybody reading this has an idea,
> please speak up. I am interested.
Yeah but that ship has sailed. And even if we had a great idea, we'd
have to support kdump before and after the idea. And that would be a
serious mess.
And if you have a huge box with gazillion piles of memory and an alpha
particle passes through a bunch of them on its way down to the earth's
core, and while doing so, flips a bunch of bits, you need to go and
collect all those regions and update some list which you then need to
shove into the second kernel.
And you probably need to do all that through perhaps a piece of memory
which is used for communication between first and second kernel and that
list better fit in there, or you need to realloc. And that piece of
memory's layout needs to be properly defined so that the second kernel
can parse it correctly.
And so on...
> The *only* way right now we would have to handle such scenarios:
>
> 1. While dumping memory and we get a machine check, fake reading a zero
> page instead of crashing.
> 2. While dumping memory and we get a fault, fake reading a zero page
> instead of crashing.
Yap.
> Indeed, and the basic design is to export these flags. (let's say
> "unfortunately", being able to handle such stuff in kdump directly would
> be the dream).
Well, AFAICT, the minimum work you need to always do before starting the
dumping is somehow generate that list of pages or ranges to not dump.
And that work needs to be done by the first or the second kernel, I'd
say.
If the first kernel would do it, then you'd have to probably have
callbacks to certain operations which go and add ranges or pages to
exclude, to a list which is then readily accessible to the second
kernel. Which means, when you reserve memory for the second kernel,
you'd have to reserve memory also for such a list.
But then what do you do when that memory gets filled up...?
So I guess exporting those things in vmcoreinfo is probably the only
thing we *can* do in the end.
Oh well, enough rambling... :)
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On 14.11.18 22:16, David Hildenbrand wrote:
> Right now, pages inflated as part of a balloon driver will be dumped
> by dump tools like makedumpfile. While XEN is able to check in the
> crash kernel whether a certain pfn is actuall backed by memory in the
> hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
> virtio-balloon and hv-balloon inflated memory will essentially result in
> zero pages getting allocated by the hypervisor and the dump getting
> filled with this data.
>
> The allocation and reading of zero pages can directly be avoided if a
> dumping tool could know which pages only contain stale information not to
> be dumped.
>
> Also for XEN, calling into the kernel and asking the hypervisor if a
> pfn is backed can be avoided if the duming tool would skip such pages
> right from the beginning.
>
> Dumping tools have no idea whether a given page is part of a balloon driver
> and shall not be dumped. Esp. PG_reserved cannot be used for that purpose
> as all memory allocated during early boot is also PG_reserved, see
> discussion at [1]. So some other way of indication is required and a new
> page flag is frowned upon.
>
> We have PG_balloon (MAPCOUNT value), which is essentially unused now. I
> suggest renaming it to something more generic (PG_offline) to mark pages as
> logically offline. This flag can than e.g. also be used by virtio-mem in
> the future to mark subsections as offline. Or by other code that wants to
> put pages logically offline (e.g. later maybe poisoned pages that shall
> no longer be used).
>
> This series converts PG_balloon to PG_offline, allows dumping tools to
> query the value to detect such pages and marks pages in the hv-balloon
> and XEN balloon properly as PG_offline. Note that virtio-balloon already
> set pages to PG_balloon (and now PG_offline).
>
> Please note that this is also helpful for a problem we were seeing under
> Hyper-V: Dumping logically offline memory (pages kept fake offline while
> onlining a section via online_page_callback) would under some condicions
> result in a kernel panic when dumping them.
>
> As I don't have access to neither XEN nor Hyper-V installation, this was
> not tested yet (and a makedumpfile change will be required to skip
> dumping these pages).
>
> [1] https://lkml.org/lkml/2018/7/20/566
>
> David Hildenbrand (6):
> mm: balloon: update comment about isolation/migration/compaction
> mm: convert PG_balloon to PG_offline
> kexec: export PG_offline to VMCOREINFO
> xen/balloon: mark inflated pages PG_offline
> hv_balloon: mark inflated pages PG_offline
> PM / Hibernate: exclude all PageOffline() pages
>
> Documentation/admin-guide/mm/pagemap.rst | 6 +++++
> drivers/hv/hv_balloon.c | 14 ++++++++--
> drivers/xen/balloon.c | 3 +++
> fs/proc/page.c | 4 +--
> include/linux/balloon_compaction.h | 34 +++++++++---------------
> include/linux/page-flags.h | 11 +++++---
> include/uapi/linux/kernel-page-flags.h | 1 +
> kernel/crash_core.c | 2 ++
> kernel/power/snapshot.c | 5 +++-
> tools/vm/page-types.c | 1 +
> 10 files changed, 51 insertions(+), 30 deletions(-)
>
I just did a test with virtio-balloon (and a very simple makedumpfile
patch which I can supply on demand).
1. Guest with 8GB. Inflate balloon to 4GB via
sudo virsh setmem f29 --size 4096M --live
2. Trigger a kernel panic in the guest
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger
Original pages : 0x00000000001e1da8
Excluded pages : 0x00000000001c9221
Pages filled with zero : 0x00000000000050b0
Non-private cache pages : 0x0000000000046547
Private cache pages : 0x0000000000002165
User process data pages : 0x00000000000048cf
Free pages : 0x00000000000771f6
Hwpoison pages : 0x0000000000000000
Offline pages : 0x0000000000100000
Remaining pages : 0x0000000000018b87
(The number of pages is reduced to 5%.)
Memory Hole : 0x000000000009e258
--------------------------------------------------
Total pages : 0x0000000000280000
(Offline patches matches the 4GB)
--
Thanks,
David / dhildenb