2024-06-07 09:10:03

by David Hildenbrand

Subject: [PATCH v1 0/3] mm/memory_hotplug: use PageOffline() instead of PageReserved() for !ZONE_DEVICE

This can be considered a long-overdue follow-up to some parts of [1].
The patches are based on [2], but that series is not strictly required --
it just makes it clearer why we can use adjust_managed_page_count() for
memory hotplug without going into details about highmem.

We stop initializing pages with PageReserved() in memory hotplug code --
except when dealing with ZONE_DEVICE for now. Instead, we use
PageOffline(): all pages are initialized to PageOffline() when onlining a
memory section, and only the ones actually getting exposed to the
system/page allocator will get PageOffline() cleared.
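
To illustrate, a rough sketch of the resulting flow (simplified, error
handling and ZONE_DEVICE left out; the real code is in the patches below):

	/* Onlining a memory section: the memmap gets initialized. */
	memmap_init_range(..., MEMINIT_HOTPLUG)
		__init_single_page(page, pfn, zone, nid);  /* refcount == 1 */
		__SetPageOffline(page);     /* instead of __SetPageReserved() */

	/* A page actually gets exposed to the page allocator. */
	generic_online_page(page, order)
		__free_pages_core(page, order, MEMINIT_HOTPLUG)
			__ClearPageOffline(p);  /* cleared before handing to the buddy */
			set_page_count(p, 0);
			adjust_managed_page_count(page, nr_pages);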

This way, we enlighten memory hotplug more about PageOffline() pages and
can clean up some hacks we have in the virtio-mem code.

What about ZONE_DEVICE? PageOffline() would be wrong there, but we might
just stop using PageReserved() for these pages later by simply checking for
is_zone_device_page() at suitable places. That will be a separate patch
set / proposal.
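
As a rough illustration of that direction (not part of this series), a
PageReserved() check at such a place, like

	if (PageReserved(page))
		return;	/* don't touch */

would then become something like

	if (PageReserved(page) || is_zone_device_page(page))
		return;	/* don't touch */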

This primarily affects virtio-mem, HV-balloon and XEN balloon. I only
briefly tested with virtio-mem, which benefits most from these cleanups.

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lkml.kernel.org/r/[email protected]

Cc: Andrew Morton <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: "K. Y. Srinivasan" <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Wei Liu <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Xuan Zhuo <[email protected]>
Cc: "Eugenio PĂ©rez" <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Stefano Stabellini <[email protected]>
Cc: Oleksandr Tyshchenko <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Dmitry Vyukov <[email protected]>

David Hildenbrand (3):
mm: pass meminit_context to __free_pages_core()
mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with
PageOffline() instead of PageReserved()
mm/memory_hotplug: skip adjust_managed_page_count() for PageOffline()
pages when offlining

drivers/hv/hv_balloon.c | 5 ++--
drivers/virtio/virtio_mem.c | 29 +++++++++---------
drivers/xen/balloon.c | 9 ++++--
include/linux/memory_hotplug.h | 4 +--
include/linux/page-flags.h | 20 +++++++------
mm/internal.h | 3 +-
mm/kmsan/init.c | 2 +-
mm/memory_hotplug.c | 31 +++++++++----------
mm/mm_init.c | 14 ++++++---
mm/page_alloc.c | 55 +++++++++++++++++++++++++++-------
10 files changed, 108 insertions(+), 64 deletions(-)


base-commit: 19b8422c5bd56fb5e7085995801c6543a98bda1f
prerequisite-patch-id: ca280eafd2732d7912e0c5249dc0df9ecbef19ca
prerequisite-patch-id: 8f43ebc81fdf7b9b665b57614e9e569535094758
--
2.45.1



2024-06-07 09:15:55

by David Hildenbrand

Subject: [PATCH v1 3/3] mm/memory_hotplug: skip adjust_managed_page_count() for PageOffline() pages when offlining

We currently have a hack for virtio-mem in place to handle memory
offlining with PageOffline pages for which we already adjusted the
managed page count.

Let's enlighten memory offlining code so we can get rid of that hack,
and document the situation.
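
In other words, the result looks roughly like this (sketch; the numbers
are made up, see the diff below for the real code): if, say, 512 pages
of an offlined range are PageOffline(),

	managed_pages = __offline_isolated_pages(start_pfn, end_pfn);
	/* managed_pages == nr_pages - 512 */
	adjust_managed_page_count(pfn_to_page(start_pfn), -managed_pages);
	adjust_present_page_count(pfn_to_page(start_pfn), group, -nr_pages);

so we only adjust the managed page count for pages the buddy actually
managed, while the present page count still drops by the full range.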

Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/virtio/virtio_mem.c | 11 ++---------
include/linux/memory_hotplug.h | 4 ++--
include/linux/page-flags.h | 8 ++++++--
mm/memory_hotplug.c | 6 +++---
mm/page_alloc.c | 12 ++++++++++--
5 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index b90df29621c81..b0b8714415783 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1269,12 +1269,6 @@ static void virtio_mem_fake_offline_going_offline(unsigned long pfn,
struct page *page;
unsigned long i;

- /*
- * Drop our reference to the pages so the memory can get offlined
- * and add the unplugged pages to the managed page counters (so
- * offlining code can correctly subtract them again).
- */
- adjust_managed_page_count(pfn_to_page(pfn), nr_pages);
/* Drop our reference to the pages so the memory can get offlined. */
for (i = 0; i < nr_pages; i++) {
page = pfn_to_page(pfn + i);
@@ -1293,10 +1287,9 @@ static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn,
unsigned long i;

/*
- * Get the reference we dropped when going offline and subtract the
- * unplugged pages from the managed page counters.
+ * Get the reference again that we dropped via page_ref_dec_and_test()
+ * when going offline.
*/
- adjust_managed_page_count(pfn_to_page(pfn), -nr_pages);
for (i = 0; i < nr_pages; i++)
page_ref_inc(pfn_to_page(pfn + i));
}
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7a9ff464608d7..ebe876930e782 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -175,8 +175,8 @@ extern int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
extern void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages);
extern int online_pages(unsigned long pfn, unsigned long nr_pages,
struct zone *zone, struct memory_group *group);
-extern void __offline_isolated_pages(unsigned long start_pfn,
- unsigned long end_pfn);
+extern unsigned long __offline_isolated_pages(unsigned long start_pfn,
+ unsigned long end_pfn);

typedef void (*online_page_callback_t)(struct page *page, unsigned int order);

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e0362ce7fc109..0876aca0833e7 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -1024,11 +1024,15 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
* putting them back to the buddy, it can do so via the memory notifier by
* decrementing the reference count in MEM_GOING_OFFLINE and incrementing the
* reference count in MEM_CANCEL_OFFLINE. When offlining, the PageOffline()
- * pages (now with a reference count of zero) are treated like free pages,
- * allowing the containing memory block to get offlined. A driver that
+ * pages (now with a reference count of zero) are treated like free (unmanaged)
+ * pages, allowing the containing memory block to get offlined. A driver that
* relies on this feature is aware that re-onlining the memory block will
* require not giving them to the buddy via generic_online_page().
*
+ * Memory offlining code will not adjust the managed page count for any
+ * PageOffline() pages, treating them like they were never exposed to the
+ * buddy using generic_online_page().
+ *
* There are drivers that mark a page PageOffline() and expect there won't be
* any further access to page content. PFN walkers that read content of random
* pages should check PageOffline() and synchronize with such drivers using
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0254059efcbe1..965707a02556f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1941,7 +1941,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
struct zone *zone, struct memory_group *group)
{
const unsigned long end_pfn = start_pfn + nr_pages;
- unsigned long pfn, system_ram_pages = 0;
+ unsigned long pfn, managed_pages, system_ram_pages = 0;
const int node = zone_to_nid(zone);
unsigned long flags;
struct memory_notify arg;
@@ -2062,7 +2062,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
} while (ret);

/* Mark all sections offline and remove free pages from the buddy. */
- __offline_isolated_pages(start_pfn, end_pfn);
+ managed_pages = __offline_isolated_pages(start_pfn, end_pfn);
pr_debug("Offlined Pages %ld\n", nr_pages);

/*
@@ -2078,7 +2078,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
zone_pcp_enable(zone);

/* removal success */
- adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
+ adjust_managed_page_count(pfn_to_page(start_pfn), -managed_pages);
adjust_present_page_count(pfn_to_page(start_pfn), group, -nr_pages);

/* reinitialise watermarks and update pcp limits */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 039bc52cc9091..809bc4a816e85 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6745,14 +6745,19 @@ void zone_pcp_reset(struct zone *zone)
/*
* All pages in the range must be in a single zone, must not contain holes,
* must span full sections, and must be isolated before calling this function.
+ *
+ * Returns the number of managed (non-PageOffline()) pages in the range: the
+ * number of pages for which memory offlining code must adjust managed page
+ * counters using adjust_managed_page_count().
*/
-void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
+unsigned long __offline_isolated_pages(unsigned long start_pfn,
+ unsigned long end_pfn)
{
+ unsigned long already_offline = 0, flags;
unsigned long pfn = start_pfn;
struct page *page;
struct zone *zone;
unsigned int order;
- unsigned long flags;

offline_mem_sections(pfn, end_pfn);
zone = page_zone(pfn_to_page(pfn));
@@ -6774,6 +6779,7 @@ void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
if (PageOffline(page)) {
BUG_ON(page_count(page));
BUG_ON(PageBuddy(page));
+ already_offline++;
pfn++;
continue;
}
@@ -6786,6 +6792,8 @@ void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
pfn += (1 << order);
}
spin_unlock_irqrestore(&zone->lock, flags);
+
+ return end_pfn - start_pfn - already_offline;
}
#endif

--
2.45.1


2024-06-07 09:27:53

by David Hildenbrand

Subject: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()

In preparation for further changes, let's teach __free_pages_core()
about the differences of memory hotplug handling.

Move the memory hotplug specific handling from generic_online_page() to
__free_pages_core(), use adjust_managed_page_count() on the memory
hotplug path, and spell out why memory freed via memblock
cannot currently use adjust_managed_page_count().

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/internal.h | 3 ++-
mm/kmsan/init.c | 2 +-
mm/memory_hotplug.c | 9 +--------
mm/mm_init.c | 4 ++--
mm/page_alloc.c | 17 +++++++++++++++--
5 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 12e95fdf61e90..3fdee779205ab 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -604,7 +604,8 @@ extern void __putback_isolated_page(struct page *page, unsigned int order,
int mt);
extern void memblock_free_pages(struct page *page, unsigned long pfn,
unsigned int order);
-extern void __free_pages_core(struct page *page, unsigned int order);
+extern void __free_pages_core(struct page *page, unsigned int order,
+ enum meminit_context);

/*
* This will have no effect, other than possibly generating a warning, if the
diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c
index 3ac3b8921d36f..ca79636f858e5 100644
--- a/mm/kmsan/init.c
+++ b/mm/kmsan/init.c
@@ -172,7 +172,7 @@ static void do_collection(void)
shadow = smallstack_pop(&collect);
origin = smallstack_pop(&collect);
kmsan_setup_meta(page, shadow, origin, collect.order);
- __free_pages_core(page, collect.order);
+ __free_pages_core(page, collect.order, MEMINIT_EARLY);
}
}

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 171ad975c7cfd..27e3be75edcf7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -630,14 +630,7 @@ EXPORT_SYMBOL_GPL(restore_online_page_callback);

void generic_online_page(struct page *page, unsigned int order)
{
- /*
- * Freeing the page with debug_pagealloc enabled will try to unmap it,
- * so we should map it first. This is better than introducing a special
- * case in page freeing fast path.
- */
- debug_pagealloc_map_pages(page, 1 << order);
- __free_pages_core(page, order);
- totalram_pages_add(1UL << order);
+ __free_pages_core(page, order, MEMINIT_HOTPLUG);
}
EXPORT_SYMBOL_GPL(generic_online_page);

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 019193b0d8703..feb5b6e8c8875 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1938,7 +1938,7 @@ static void __init deferred_free_range(unsigned long pfn,
for (i = 0; i < nr_pages; i++, page++, pfn++) {
if (pageblock_aligned(pfn))
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
- __free_pages_core(page, 0);
+ __free_pages_core(page, 0, MEMINIT_EARLY);
}
}

@@ -2513,7 +2513,7 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn,
}
}

- __free_pages_core(page, order);
+ __free_pages_core(page, order, MEMINIT_EARLY);
}

DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2224965ada468..e0c8a8354be36 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1214,7 +1214,8 @@ static void __free_pages_ok(struct page *page, unsigned int order,
__count_vm_events(PGFREE, 1 << order);
}

-void __free_pages_core(struct page *page, unsigned int order)
+void __free_pages_core(struct page *page, unsigned int order,
+ enum meminit_context context)
{
unsigned int nr_pages = 1 << order;
struct page *p = page;
@@ -1234,7 +1235,19 @@ void __free_pages_core(struct page *page, unsigned int order)
__ClearPageReserved(p);
set_page_count(p, 0);

- atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
+ if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG) &&
+ unlikely(context == MEMINIT_HOTPLUG)) {
+ /*
+ * Freeing the page with debug_pagealloc enabled will try to
+ * unmap it; some archs don't like double-unmappings, so
+ * map it first.
+ */
+ debug_pagealloc_map_pages(page, nr_pages);
+ adjust_managed_page_count(page, nr_pages);
+ } else {
+ /* memblock adjusts totalram_pages() ahead of time. */
+ atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
+ }

if (page_contains_unaccepted(page, order)) {
if (order == MAX_PAGE_ORDER && __free_unaccepted(page))
--
2.45.1


2024-06-07 09:28:29

by David Hildenbrand

Subject: [PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()

We currently initialize the memmap such that PG_reserved is set and the
refcount of the page is 1. In virtio-mem code, we have to manually clear
that PG_reserved flag to make memory offlining with partially hotplugged
memory blocks possible: has_unmovable_pages() would otherwise bail out on
such pages.

We want to avoid PG_reserved where possible and move to typed pages
instead. Furthermore, we want to enlighten memory offlining code about
PG_offline: offline pages in an online memory section. One example is
handling managed page count adjustments in a cleaner way during memory
offlining.

So let's initialize the pages with PG_offline instead of PG_reserved.
generic_online_page()->__free_pages_core() will now clear that flag before
handing that memory to the buddy.

Note that the page refcount is still 1 and would forbid offlining of such
memory except when special care is taken during GOING_OFFLINE, as
currently only implemented by virtio-mem.
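
In memory-notifier terms, that special care looks roughly like this
(simplified sketch of what virtio-mem does; see
virtio_mem_fake_offline_going_offline() / _cancel_offline() in patch #3):

	case MEM_GOING_OFFLINE:
		/* Drop our reference so the PageOffline() pages (refcount
		 * now 0) are treated like free pages when offlining. */
		if (WARN_ON(!page_ref_dec_and_test(page)))
			dump_page(page, "fake-offline page referenced");
		break;
	case MEM_CANCEL_OFFLINE:
		/* Offlining was aborted: get our reference back. */
		page_ref_inc(page);
		break;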

With this change, we can now get non-PageReserved() pages in the XEN
balloon list. From what I can tell, that can already happen via
decrease_reservation(), so that should be fine.

HV-balloon should not really observe a change: partial online memory
blocks still cannot get surprise-offlined, because the refcount of these
PageOffline() pages is 1.

Update virtio-mem, HV-balloon and XEN-balloon code to be aware that
hotplugged pages are now PageOffline() instead of PageReserved() before
they are handed over to the buddy.

We'll leave the ZONE_DEVICE case alone for now.

Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/hv/hv_balloon.c | 5 ++---
drivers/virtio/virtio_mem.c | 18 ++++++++++++------
drivers/xen/balloon.c | 9 +++++++--
include/linux/page-flags.h | 12 +++++-------
mm/memory_hotplug.c | 16 ++++++++++------
mm/mm_init.c | 10 ++++++++--
mm/page_alloc.c | 32 +++++++++++++++++++++++---------
7 files changed, 67 insertions(+), 35 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index e000fa3b9f978..c1be38edd8361 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -693,9 +693,8 @@ static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg)
if (!PageOffline(pg))
__SetPageOffline(pg);
return;
- }
- if (PageOffline(pg))
- __ClearPageOffline(pg);
+ } else if (!PageOffline(pg))
+ return;

/* This frame is currently backed; online the page. */
generic_online_page(pg, 0);
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index a3857bacc8446..b90df29621c81 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1146,12 +1146,16 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
for (; nr_pages--; pfn++) {
struct page *page = pfn_to_page(pfn);

- __SetPageOffline(page);
- if (!onlined) {
+ if (!onlined)
+ /*
+ * Pages that have not been onlined yet were initialized
+ * to PageOffline(). Remember that we have to route them
+ * through generic_online_page().
+ */
SetPageDirty(page);
- /* FIXME: remove after cleanups */
- ClearPageReserved(page);
- }
+ else
+ __SetPageOffline(page);
+ VM_WARN_ON_ONCE(!PageOffline(page));
}
page_offline_end();
}
@@ -1166,9 +1170,11 @@ static void virtio_mem_clear_fake_offline(unsigned long pfn,
for (; nr_pages--; pfn++) {
struct page *page = pfn_to_page(pfn);

- __ClearPageOffline(page);
if (!onlined)
+ /* generic_online_page() will clear PageOffline(). */
ClearPageDirty(page);
+ else
+ __ClearPageOffline(page);
}
}

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index aaf2514fcfa46..528395133b4f8 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -146,7 +146,8 @@ static DECLARE_WAIT_QUEUE_HEAD(balloon_wq);
/* balloon_append: add the given page to the balloon. */
static void balloon_append(struct page *page)
{
- __SetPageOffline(page);
+ if (!PageOffline(page))
+ __SetPageOffline(page);

/* Lowmem is re-populated first, so highmem pages go at list tail. */
if (PageHighMem(page)) {
@@ -412,7 +413,11 @@ static enum bp_state increase_reservation(unsigned long nr_pages)

xenmem_reservation_va_mapping_update(1, &page, &frame_list[i]);

- /* Relinquish the page back to the allocator. */
+ /*
+ * Relinquish the page back to the allocator. Note that
+ * some pages, including ones added via xen_online_page(), might
+ * not be marked reserved; free_reserved_page() will handle that.
+ */
free_reserved_page(page);
}

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f04fea86324d9..e0362ce7fc109 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -30,16 +30,11 @@
* - Pages falling into physical memory gaps - not IORESOURCE_SYSRAM. Trying
* to read/write these pages might end badly. Don't touch!
* - The zero page(s)
- * - Pages not added to the page allocator when onlining a section because
- * they were excluded via the online_page_callback() or because they are
- * PG_hwpoison.
* - Pages allocated in the context of kexec/kdump (loaded kernel image,
* control pages, vmcoreinfo)
* - MMIO/DMA pages. Some architectures don't allow to ioremap pages that are
* not marked PG_reserved (as they might be in use by somebody else who does
* not respect the caching strategy).
- * - Pages part of an offline section (struct pages of offline sections should
- * not be trusted as they will be initialized when first onlined).
* - MCA pages on ia64
* - Pages holding CPU notes for POWER Firmware Assisted Dump
* - Device memory (e.g. PMEM, DAX, HMM)
@@ -1021,6 +1016,10 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
* The content of these pages is effectively stale. Such pages should not
* be touched (read/write/dump/save) except by their owner.
*
+ * When a memory block gets onlined, all pages are initialized with a
+ * refcount of 1 and PageOffline(). generic_online_page() will
+ * take care of clearing PageOffline().
+ *
* If a driver wants to allow to offline unmovable PageOffline() pages without
* putting them back to the buddy, it can do so via the memory notifier by
* decrementing the reference count in MEM_GOING_OFFLINE and incrementing the
@@ -1028,8 +1027,7 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
* pages (now with a reference count of zero) are treated like free pages,
* allowing the containing memory block to get offlined. A driver that
* relies on this feature is aware that re-onlining the memory block will
- * require to re-set the pages PageOffline() and not giving them to the
- * buddy via online_page_callback_t.
+ * require not giving them to the buddy via generic_online_page().
*
* There are drivers that mark a page PageOffline() and expect there won't be
* any further access to page content. PFN walkers that read content of random
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 27e3be75edcf7..0254059efcbe1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -734,7 +734,7 @@ static inline void section_taint_zone_device(unsigned long pfn)
/*
* Associate the pfn range with the given zone, initializing the memmaps
* and resizing the pgdat/zone data to span the added pages. After this
- * call, all affected pages are PG_reserved.
+ * call, all affected pages are PageOffline().
*
* All aligned pageblocks are initialized to the specified migratetype
* (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related
@@ -1100,8 +1100,12 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,

move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);

- for (i = 0; i < nr_pages; i++)
- SetPageVmemmapSelfHosted(pfn_to_page(pfn + i));
+ for (i = 0; i < nr_pages; i++) {
+ struct page *page = pfn_to_page(pfn + i);
+
+ __ClearPageOffline(page);
+ SetPageVmemmapSelfHosted(page);
+ }

/*
* It might be that the vmemmap_pages fully span sections. If that is
@@ -1959,9 +1963,9 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
* Don't allow to offline memory blocks that contain holes.
* Consequently, memory blocks with holes can never get onlined
* via the hotplug path - online_pages() - as hotplugged memory has
- * no holes. This way, we e.g., don't have to worry about marking
- * memory holes PG_reserved, don't need pfn_valid() checks, and can
- * avoid using walk_system_ram_range() later.
+ * no holes. This way, we don't have to worry about memory holes,
+ * don't need pfn_valid() checks, and can avoid using
+ * walk_system_ram_range() later.
*/
walk_system_ram_range(start_pfn, nr_pages, &system_ram_pages,
count_system_ram_pages_cb);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index feb5b6e8c8875..c066c1c474837 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -892,8 +892,14 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone

page = pfn_to_page(pfn);
__init_single_page(page, pfn, zone, nid);
- if (context == MEMINIT_HOTPLUG)
- __SetPageReserved(page);
+ if (context == MEMINIT_HOTPLUG) {
+#ifdef CONFIG_ZONE_DEVICE
+ if (zone == ZONE_DEVICE)
+ __SetPageReserved(page);
+ else
+#endif
+ __SetPageOffline(page);
+ }

/*
* Usually, we want to mark the pageblock MIGRATE_MOVABLE,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e0c8a8354be36..039bc52cc9091 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1225,18 +1225,23 @@ void __free_pages_core(struct page *page, unsigned int order,
* When initializing the memmap, __init_single_page() sets the refcount
* of all pages to 1 ("allocated"/"not free"). We have to set the
* refcount of all involved pages to 0.
+ *
+ * Note that hotplugged memory pages are initialized to PageOffline().
+ * Pages freed from memblock might be marked as reserved.
*/
- prefetchw(p);
- for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
- prefetchw(p + 1);
- __ClearPageReserved(p);
- set_page_count(p, 0);
- }
- __ClearPageReserved(p);
- set_page_count(p, 0);
-
if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG) &&
unlikely(context == MEMINIT_HOTPLUG)) {
+ prefetchw(p);
+ for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
+ prefetchw(p + 1);
+ VM_WARN_ON_ONCE(PageReserved(p));
+ __ClearPageOffline(p);
+ set_page_count(p, 0);
+ }
+ VM_WARN_ON_ONCE(PageReserved(p));
+ __ClearPageOffline(p);
+ set_page_count(p, 0);
+
/*
* Freeing the page with debug_pagealloc enabled will try to
* unmap it; some archs don't like double-unmappings, so
@@ -1245,6 +1250,15 @@ void __free_pages_core(struct page *page, unsigned int order,
debug_pagealloc_map_pages(page, nr_pages);
adjust_managed_page_count(page, nr_pages);
} else {
+ prefetchw(p);
+ for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
+ prefetchw(p + 1);
+ __ClearPageReserved(p);
+ set_page_count(p, 0);
+ }
+ __ClearPageReserved(p);
+ set_page_count(p, 0);
+
/* memblock adjusts totalram_pages() ahead of time. */
atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
}
--
2.45.1


2024-06-07 18:41:13

by David Hildenbrand

Subject: Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()

On 07.06.24 11:09, David Hildenbrand wrote:
> In preparation for further changes, let's teach __free_pages_core()
> about the differences of memory hotplug handling.
>
> Move the memory hotplug specific handling from generic_online_page() to
> __free_pages_core(), use adjust_managed_page_count() on the memory
> hotplug path, and spell out why memory freed via memblock
> cannot currently use adjust_managed_page_count().
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/internal.h | 3 ++-
> mm/kmsan/init.c | 2 +-
> mm/memory_hotplug.c | 9 +--------
> mm/mm_init.c | 4 ++--
> mm/page_alloc.c | 17 +++++++++++++++--
> 5 files changed, 21 insertions(+), 14 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 12e95fdf61e90..3fdee779205ab 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -604,7 +604,8 @@ extern void __putback_isolated_page(struct page *page, unsigned int order,
> int mt);
> extern void memblock_free_pages(struct page *page, unsigned long pfn,
> unsigned int order);
> -extern void __free_pages_core(struct page *page, unsigned int order);
> +extern void __free_pages_core(struct page *page, unsigned int order,
> + enum meminit_context);
>
> /*
> * This will have no effect, other than possibly generating a warning, if the
> diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c
> index 3ac3b8921d36f..ca79636f858e5 100644
> --- a/mm/kmsan/init.c
> +++ b/mm/kmsan/init.c
> @@ -172,7 +172,7 @@ static void do_collection(void)
> shadow = smallstack_pop(&collect);
> origin = smallstack_pop(&collect);
> kmsan_setup_meta(page, shadow, origin, collect.order);
> - __free_pages_core(page, collect.order);
> + __free_pages_core(page, collect.order, MEMINIT_EARLY);
> }
> }
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 171ad975c7cfd..27e3be75edcf7 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -630,14 +630,7 @@ EXPORT_SYMBOL_GPL(restore_online_page_callback);
>
> void generic_online_page(struct page *page, unsigned int order)
> {
> - /*
> - * Freeing the page with debug_pagealloc enabled will try to unmap it,
> - * so we should map it first. This is better than introducing a special
> - * case in page freeing fast path.
> - */
> - debug_pagealloc_map_pages(page, 1 << order);
> - __free_pages_core(page, order);
> - totalram_pages_add(1UL << order);
> + __free_pages_core(page, order, MEMINIT_HOTPLUG);
> }
> EXPORT_SYMBOL_GPL(generic_online_page);
>
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 019193b0d8703..feb5b6e8c8875 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1938,7 +1938,7 @@ static void __init deferred_free_range(unsigned long pfn,
> for (i = 0; i < nr_pages; i++, page++, pfn++) {
> if (pageblock_aligned(pfn))
> set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> - __free_pages_core(page, 0);
> + __free_pages_core(page, 0, MEMINIT_EARLY);
> }
> }

The build bot just reminded me that I missed another case in this function:
(CONFIG_DEFERRED_STRUCT_PAGE_INIT)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index feb5b6e8c8875..5a0752261a795 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1928,7 +1928,7 @@ static void __init deferred_free_range(unsigned long pfn,
if (nr_pages == MAX_ORDER_NR_PAGES && IS_MAX_ORDER_ALIGNED(pfn)) {
for (i = 0; i < nr_pages; i += pageblock_nr_pages)
set_pageblock_migratetype(page + i, MIGRATE_MOVABLE);
- __free_pages_core(page, MAX_PAGE_ORDER);
+ __free_pages_core(page, MAX_PAGE_ORDER, MEMINIT_EARLY);
return;
}


--
Cheers,

David / dhildenb


2024-06-10 04:03:55

by Oscar Salvador

Subject: Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()

On Fri, Jun 07, 2024 at 11:09:36AM +0200, David Hildenbrand wrote:
> In preparation for further changes, let's teach __free_pages_core()
> about the differences of memory hotplug handling.
>
> Move the memory hotplug specific handling from generic_online_page() to
> __free_pages_core(), use adjust_managed_page_count() on the memory
> hotplug path, and spell out why memory freed via memblock
> cannot currently use adjust_managed_page_count().
>
> Signed-off-by: David Hildenbrand <[email protected]>

All looks good, but I am puzzled by something.

> + } else {
> + /* memblock adjusts totalram_pages() ahead of time. */
> + atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
> + }

You say that memblock adjusts totalram_pages ahead of time, and I guess
you mean in memblock_free_all()

pages = free_low_memory_core_early()
totalram_pages_add(pages);

but that is not ahead of time; it looks like it is updating __after__
sending them to the buddy?


--
Oscar Salvador
SUSE Labs

2024-06-10 04:24:16

by Oscar Salvador

Subject: Re: [PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()

On Fri, Jun 07, 2024 at 11:09:37AM +0200, David Hildenbrand wrote:
> We currently initialize the memmap such that PG_reserved is set and the
> refcount of the page is 1. In virtio-mem code, we have to manually clear
> that PG_reserved flag to make memory offlining with partially hotplugged
> memory blocks possible: has_unmovable_pages() would otherwise bail out on
> such pages.
>
> We want to avoid PG_reserved where possible and move to typed pages
> instead. Furthermore, we want to enlighten memory offlining code about
> PG_offline: offline pages in an online memory section. One example is
> handling managed page count adjustments in a cleaner way during memory
> offlining.
>
> So let's initialize the pages with PG_offline instead of PG_reserved.
> generic_online_page()->__free_pages_core() will now clear that flag before
> handing that memory to the buddy.
>
> Note that the page refcount is still 1 and would forbid offlining of such
> memory except when special care is taken during GOING_OFFLINE, as
> currently only implemented by virtio-mem.
>
> With this change, we can now get non-PageReserved() pages in the XEN
> balloon list. From what I can tell, that can already happen via
> decrease_reservation(), so that should be fine.
>
> HV-balloon should not really observe a change: partial online memory
> blocks still cannot get surprise-offlined, because the refcount of these
> PageOffline() pages is 1.
>
> Update virtio-mem, HV-balloon and XEN-balloon code to be aware that
> hotplugged pages are now PageOffline() instead of PageReserved() before
> they are handed over to the buddy.
>
> We'll leave the ZONE_DEVICE case alone for now.
>
> Signed-off-by: David Hildenbrand <[email protected]>

> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 27e3be75edcf7..0254059efcbe1 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -734,7 +734,7 @@ static inline void section_taint_zone_device(unsigned long pfn)
> /*
> * Associate the pfn range with the given zone, initializing the memmaps
> * and resizing the pgdat/zone data to span the added pages. After this
> - * call, all affected pages are PG_reserved.
> + * call, all affected pages are PageOffline().
> *
> * All aligned pageblocks are initialized to the specified migratetype
> * (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related
> @@ -1100,8 +1100,12 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
>
> move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
>
> - for (i = 0; i < nr_pages; i++)
> - SetPageVmemmapSelfHosted(pfn_to_page(pfn + i));
> + for (i = 0; i < nr_pages; i++) {
> + struct page *page = pfn_to_page(pfn + i);
> +
> + __ClearPageOffline(page);
> + SetPageVmemmapSelfHosted(page);

So, refresh my memory here please.
AFAIR, those VmemmapSelfHosted pages were marked Reserved before, but now,
memmap_init_range() will not mark them reserved anymore.
I do not think that is ok. I am worried about walkers getting this wrong.

We usually skip PageReserved pages in walkers because they are pages we
cannot deal with for those purposes, but with this change, we will leak
PageVmemmapSelfHosted, and I am not sure whether we are ready for that.

Moreover, boot memmap pages are marked as PageReserved, which would now
be inconsistent with those added during hotplug operations.

All in all, I feel uneasy about this change.

--
Oscar Salvador
SUSE Labs

2024-06-10 04:29:27

by Oscar Salvador

Subject: Re: [PATCH v1 3/3] mm/memory_hotplug: skip adjust_managed_page_count() for PageOffline() pages when offlining

On Fri, Jun 07, 2024 at 11:09:38AM +0200, David Hildenbrand wrote:
> We currently have a hack for virtio-mem in place to handle memory
> offlining with PageOffline pages for which we already adjusted the
> managed page count.
>
> Let's enlighten memory offlining code so we can get rid of that hack,
> and document the situation.
>
> Signed-off-by: David Hildenbrand <[email protected]>

Acked-by: Oscar Salvador <[email protected]>

--
Oscar Salvador
SUSE Labs

2024-06-10 08:44:06

by David Hildenbrand

Subject: Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()

On 10.06.24 06:03, Oscar Salvador wrote:
> On Fri, Jun 07, 2024 at 11:09:36AM +0200, David Hildenbrand wrote:
>> In preparation for further changes, let's teach __free_pages_core()
>> about the differences of memory hotplug handling.
>>
>> Move the memory hotplug specific handling from generic_online_page() to
>> __free_pages_core(), use adjust_managed_page_count() on the memory
>> hotplug path, and spell out why memory freed via memblock
>> cannot currently use adjust_managed_page_count().
>>
>> Signed-off-by: David Hildenbrand <[email protected]>
>
> All looks good, but I am puzzled by something.
>
>> + } else {
>> + /* memblock adjusts totalram_pages() ahead of time. */
>> + atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
>> + }
>
> You say that memblock adjusts totalram_pages ahead of time, and I guess
> you mean in memblock_free_all()

And memblock_free_late(), which uses atomic_long_inc().
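
Roughly, in memblock_free_late() (quoting from memory, so modulo details):

	for (; cursor < end; cursor++) {
		memblock_free_pages(pfn_to_page(cursor), cursor, 0);
		totalram_pages_inc();	/* atomic_long_inc() */
	}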

>
> pages = free_low_memory_core_early()
> totalram_pages_add(pages);
>
> but that is not ahead of time; it looks like it is updating __after__
> sending them to the buddy?

Right (it's suboptimal, but not really problematic so far; hopefully Wei
can clean it up and move it in here as well).

For the time being

"/* memblock adjusts totalram_pages() manually. */"

?

Thanks!

--
Cheers,

David / dhildenb


2024-06-10 08:56:20

by David Hildenbrand

Subject: Re: [PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()

On 10.06.24 06:23, Oscar Salvador wrote:
> On Fri, Jun 07, 2024 at 11:09:37AM +0200, David Hildenbrand wrote:
>> We currently initialize the memmap such that PG_reserved is set and the
>> refcount of the page is 1. In virtio-mem code, we have to manually clear
>> that PG_reserved flag to make memory offlining with partially hotplugged
>> memory blocks possible: has_unmovable_pages() would otherwise bail out on
>> such pages.
>>
>> We want to avoid PG_reserved where possible and move to typed pages
>> instead. Furthermore, we want to enlighten memory offlining code about
>> PG_offline: offline pages in an online memory section. One example is
>> handling managed page count adjustments in a cleaner way during memory
>> offlining.
>>
>> So let's initialize the pages with PG_offline instead of PG_reserved.
>> generic_online_page()->__free_pages_core() will now clear that flag before
>> handing that memory to the buddy.
>>
>> Note that the page refcount is still 1 and would forbid offlining of such
>> memory except when special care is taken during GOING_OFFLINE, as
>> currently only implemented by virtio-mem.
>>
>> With this change, we can now get non-PageReserved() pages in the XEN
>> balloon list. From what I can tell, that can already happen via
>> decrease_reservation(), so that should be fine.
>>
>> HV-balloon should not really observe a change: partial online memory
>> blocks still cannot get surprise-offlined, because the refcount of these
>> PageOffline() pages is 1.
>>
>> Update virtio-mem, HV-balloon and XEN-balloon code to be aware that
>> hotplugged pages are now PageOffline() instead of PageReserved() before
>> they are handed over to the buddy.
>>
>> We'll leave the ZONE_DEVICE case alone for now.
>>
>> Signed-off-by: David Hildenbrand <[email protected]>
>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 27e3be75edcf7..0254059efcbe1 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -734,7 +734,7 @@ static inline void section_taint_zone_device(unsigned long pfn)
>> /*
>> * Associate the pfn range with the given zone, initializing the memmaps
>> * and resizing the pgdat/zone data to span the added pages. After this
>> - * call, all affected pages are PG_reserved.
>> + * call, all affected pages are PageOffline().
>> *
>> * All aligned pageblocks are initialized to the specified migratetype
>> * (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related
>> @@ -1100,8 +1100,12 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
>>
>> move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
>>
>> - for (i = 0; i < nr_pages; i++)
>> - SetPageVmemmapSelfHosted(pfn_to_page(pfn + i));
>> + for (i = 0; i < nr_pages; i++) {
>> + struct page *page = pfn_to_page(pfn + i);
>> +
>> + __ClearPageOffline(page);
>> + SetPageVmemmapSelfHosted(page);
>
> So, refresh my memory here please.
> AFAIR, those VmemmapSelfHosted pages were marked Reserved before, but now,
> memmap_init_range() will not mark them reserved anymore.

Correct.

> I do not think that is ok. I am worried about walkers getting this wrong.
>
> We usually skip PageReserved pages in walkers because they are pages we
> cannot deal with for those purposes, but with this change, we will leak
> PageVmemmapSelfHosted, and I am not sure whether we are ready for that.

There are fortunately not that many left.

I'd even say marking them (vmemmap) reserved is more wrong than right:
note that ordinary vmemmap pages after memory hotplug are not reserved!
Only bootmem should be reserved.

Let's take a look at the relevant core-mm ones (arch stuff is mostly just for
MMIO remapping)

fs/proc/task_mmu.c: if (PageReserved(page))
fs/proc/task_mmu.c: if (PageReserved(page))

-> If we find vmemmap pages mapped into user space we already messed up
seriously

kernel/power/snapshot.c: if (PageReserved(page) ||
kernel/power/snapshot.c: if (PageReserved(page)

-> There should be no change (saveable_page() would still allow saving
them, highmem does not apply)

mm/hugetlb_vmemmap.c: if (!PageReserved(head))
mm/hugetlb_vmemmap.c: if (PageReserved(page))

-> Wants to identify bootmem, but we already properly exclude these
PageVmemmapSelfHosted() pages on the splitting path


mm/page_alloc.c: VM_WARN_ON_ONCE(PageReserved(p));
mm/page_alloc.c: if (PageReserved(page))

-> pfn_range_valid_contig() would scan them, just like for ordinary
vmemmap pages during hotplug. We'll simply fail isolating/migrating
them similarly (like any unmovable allocations) later

mm/page_ext.c: BUG_ON(PageReserved(page));

-> free_page_ext handling, does not apply

mm/page_isolation.c: if (PageReserved(page))

-> has_unmovable_pages() should still detect them as unmovable (e.g.,
neither movable nor LRU).

mm/page_owner.c: if (PageReserved(page))
mm/page_owner.c: if (PageReserved(page))

-> page_ext_get() will simply return NULL instead and we'll similarly
skip them

mm/sparse.c: if (!PageReserved(virt_to_page(ms->usage))) {

-> Detecting boot memory for ms->usage allocation, does not apply to
vmemmap.

virt/kvm/kvm_main.c: if (!PageReserved(page))
virt/kvm/kvm_main.c: return !PageReserved(page);

-> For MMIO remapping purposes, does not apply to vmemmap


> Moreover, boot memmap pages are marked as PageReserved, which would now
> be inconsistent with those added during hotplug operations.

Just like vmemmap pages allocated dynamically during memory hotplug.
Now, really only bootmem-ones are PageReserved.

> All in all, I feel uneasy about this change.

I really don't want to mark these pages here PageReserved for the sake
of it.

Any PageReserved user that I am missing, or why we should handle these
vmemmap pages differently than the ones allocated during ordinary memory
hotplug?

In the future, we might want to consider using a dedicated page type for
them, so we can stop using a bit that doesn't allow us to reliably identify
them. (We should mark all vmemmap pages with that type then.)

Thanks!

--
Cheers,

David / dhildenb


2024-06-10 08:56:35

by David Hildenbrand

Subject: Re: [PATCH v1 3/3] mm/memory_hotplug: skip adjust_managed_page_count() for PageOffline() pages when offlining

On 10.06.24 06:29, Oscar Salvador wrote:
> On Fri, Jun 07, 2024 at 11:09:38AM +0200, David Hildenbrand wrote:
>> We currently have a hack for virtio-mem in place to handle memory
>> offlining with PageOffline pages for which we already adjusted the
>> managed page count.
>>
>> Let's enlighten memory offlining code so we can get rid of that hack,
>> and document the situation.
>>
>> Signed-off-by: David Hildenbrand <[email protected]>
>
> Acked-by: Oscar Salvador <[email protected]>
>

Thanks for the review!

--
Cheers,

David / dhildenb


2024-06-10 12:07:58

by Oscar Salvador

Subject: Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()

On Mon, Jun 10, 2024 at 10:38:05AM +0200, David Hildenbrand wrote:
> On 10.06.24 06:03, Oscar Salvador wrote:
> > On Fri, Jun 07, 2024 at 11:09:36AM +0200, David Hildenbrand wrote:
> > > In preparation for further changes, let's teach __free_pages_core()
> > > about the differences of memory hotplug handling.
> > >
> > > Move the memory hotplug specific handling from generic_online_page() to
> > > __free_pages_core(), use adjust_managed_page_count() on the memory
> > > hotplug path, and spell out why memory freed via memblock
> > > cannot currently use adjust_managed_page_count().
> > >
> > > Signed-off-by: David Hildenbrand <[email protected]>
> >
> > All looks good, but I am puzzled by something.
> >
> > > + } else {
> > > + /* memblock adjusts totalram_pages() ahead of time. */
> > > + atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
> > > + }
> >
> > You say that memblock adjusts totalram_pages ahead of time, and I guess
> > you mean in memblock_free_all()
>
> And memblock_free_late(), which uses atomic_long_inc().

Ah yes.


> Right (it's suboptimal, but not really problematic so far; hopefully Wei
> can clean it up and move it in here as well).

That would be great.

> For the time being
>
> "/* memblock adjusts totalram_pages() manually. */"

Yes, I think that is better ;-)

Thanks!


--
Oscar Salvador
SUSE Labs

2024-06-11 08:01:18

by Oscar Salvador

Subject: Re: [PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()

On Fri, Jun 07, 2024 at 11:09:37AM +0200, David Hildenbrand wrote:
> We currently initialize the memmap such that PG_reserved is set and the
> refcount of the page is 1. In virtio-mem code, we have to manually clear
> that PG_reserved flag to make memory offlining with partially hotplugged
> memory blocks possible: has_unmovable_pages() would otherwise bail out on
> such pages.
>
> We want to avoid PG_reserved where possible and move to typed pages
> instead. Furthermore, we want to enlighten memory offlining code about
> PG_offline: offline pages in an online memory section. One example is
> handling managed page count adjustments in a cleaner way during memory
> offlining.
>
> So let's initialize the pages with PG_offline instead of PG_reserved.
> generic_online_page()->__free_pages_core() will now clear that flag before
> handing that memory to the buddy.
>
> Note that the page refcount is still 1 and would forbid offlining of such
> memory except when special care is taken during GOING_OFFLINE, as
> currently only implemented by virtio-mem.
>
> With this change, we can now get non-PageReserved() pages in the XEN
> balloon list. From what I can tell, that can already happen via
> decrease_reservation(), so that should be fine.
>
> HV-balloon should not really observe a change: partial online memory
> blocks still cannot get surprise-offlined, because the refcount of these
> PageOffline() pages is 1.
>
> Update virtio-mem, HV-balloon and XEN-balloon code to be aware that
> hotplugged pages are now PageOffline() instead of PageReserved() before
> they are handed over to the buddy.
>
> We'll leave the ZONE_DEVICE case alone for now.
>
> Signed-off-by: David Hildenbrand <[email protected]>

Acked-by: Oscar Salvador <[email protected]> # for the generic
memory-hotplug bits


--
Oscar Salvador
SUSE Labs

2024-06-11 08:07:48

by Oscar Salvador

Subject: Re: [PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()

On Mon, Jun 10, 2024 at 10:56:02AM +0200, David Hildenbrand wrote:
> There are fortunately not that many left.
>
> I'd even say marking them (vmemmap) reserved is more wrong than right: note
> that ordinary vmemmap pages after memory hotplug are not reserved! Only
> bootmem should be reserved.

Ok, that is a very good point that I missed.
I thought that hotplugged vmemmap pages (not self-hosted) were marked as
Reserved, which is why I thought this would be inconsistent.
But then, if that is the case, I think we are safe, as the kernel can
already encounter vmemmap pages that are not reserved and deals with
them somehow.

>> Let's take a look at the relevant core-mm ones (arch stuff is mostly just for MMIO
> remapping)
>
...
> Any PageReserved user that I am missing, or why we should handle these
> vmemmap pages differently than the ones allocated during ordinary memory
> hotplug?

No, I cannot think of a reason why normal vmemmap pages should behave
differently than self-hosted ones.

I was also confused because I thought that after this change
pfn_to_online_page() would be different for self-hosted vmemmap pages,
because I thought that somehow we relied on PageOffline(), but it is not
the case.

>> In the future, we might want to consider using a dedicated page type for
>> them, so we can stop using a bit that doesn't allow us to reliably identify
>> them. (We should mark all vmemmap pages with that type then.)

Yes, an all-vmemmap page type would be a good thing, so we do not have
to special-case them.

Just one last thing.
Now self-hosted vmemmap pages will have PageOffline() cleared, and that
will remain the case even after the memory block they belong to has gone
offline, which is ok because those vmemmap pages stay around until the
chunk of memory gets removed.

Ok, just wanted to convince myself that there will be no surprises.

Thanks David for clarifying.


--
Oscar Salvador
SUSE Labs

2024-06-11 08:27:56

by David Hildenbrand

Subject: Re: [PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()

On 11.06.24 09:45, Oscar Salvador wrote:
> On Mon, Jun 10, 2024 at 10:56:02AM +0200, David Hildenbrand wrote:
>> There are fortunately not that many left.
>>
>> I'd even say marking them (vmemmap) reserved is more wrong than right: note
>> that ordinary vmemmap pages after memory hotplug are not reserved! Only
>> bootmem should be reserved.
>
> Ok, that is a very good point that I missed.
> I thought that hotplugged vmemmap pages (not self-hosted) were marked as
> Reserved, which is why I thought this would be inconsistent.
> But then, if that is the case, I think we are safe, as the kernel can
> already encounter vmemmap pages that are not reserved and deals with
> them somehow.
>
>> Let's take at the relevant core-mm ones (arch stuff is mostly just for MMIO
>> remapping)
>>
> ...
>> Any PageReserved user that I am missing, or why we should handle these
>> vmemmap pages differently than the ones allocated during ordinary memory
>> hotplug?
>
> No, I cannot think of a reason why normal vmemmap pages should behave
> differently than self-hosted ones.
>
> I was also confused because I thought that after this change
> pfn_to_online_page() would be different for self-hosted vmemmap pages,
> because I thought that somehow we relied on PageOffline(), but it is not
> the case.

Fortunately not :) PageFakeOffline() or PageLogicallyOffline() might be
clearer, but I don't quite like these names. If you have a good idea,
please let me know.

>
>> In the future, we might want to consider using a dedicated page type for
>> them, so we can stop using a bit that doesn't allow to reliably identify
>> them. (we should mark all vmemmap with that type then)
>
> Yes, an all-vmemmap page type would be a good thing, so we do not have
> to special-case them.
>
> Just one last thing.
> Now self-hosted vmemmap pages will have PageOffline() cleared, and that
> will remain the case even after the memory block they belong to has gone
> offline, which is ok because those vmemmap pages stay around until the
> chunk of memory gets removed.

Yes, and that memmap might even get poisoned in debug kernels to catch
any wrong access.

>
> Ok, just wanted to convince myself that there will be no surprises.
>
> Thanks David for clarifying.

Thanks for the review and raising that. I'll add more details to the
patch description!

--
Cheers,

David / dhildenb


2024-06-11 09:43:21

by David Hildenbrand

Subject: Re: [PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()

On 07.06.24 11:09, David Hildenbrand wrote:
> We currently initialize the memmap such that PG_reserved is set and the
> refcount of the page is 1. In virtio-mem code, we have to manually clear
> that PG_reserved flag to make memory offlining with partially hotplugged
> memory blocks possible: has_unmovable_pages() would otherwise bail out on
> such pages.
>
> We want to avoid PG_reserved where possible and move to typed pages
> instead. Furthermore, we want to enlighten memory offlining code about
> PG_offline: offline pages in an online memory section. One example is
> handling managed page count adjustments in a cleaner way during memory
> offlining.
>
> So let's initialize the pages with PG_offline instead of PG_reserved.
> generic_online_page()->__free_pages_core() will now clear that flag before
> handing that memory to the buddy.
>
> Note that the page refcount is still 1 and would forbid offlining of such
> memory except when special care is taken during GOING_OFFLINE, as
> currently only implemented by virtio-mem.
>
> With this change, we can now get non-PageReserved() pages in the XEN
> balloon list. From what I can tell, that can already happen via
> decrease_reservation(), so that should be fine.
>
> HV-balloon should not really observe a change: partial online memory
> blocks still cannot get surprise-offlined, because the refcount of these
> PageOffline() pages is 1.
>
> Update virtio-mem, HV-balloon and XEN-balloon code to be aware that
> hotplugged pages are now PageOffline() instead of PageReserved() before
> they are handed over to the buddy.
>
> We'll leave the ZONE_DEVICE case alone for now.
>

@Andrew, can we add here:

"Note that self-hosted vmemmap pages will no longer be marked as
reserved. This matches ordinary vmemmap pages allocated from the buddy
during memory hotplug. Now, really only vmemmap pages allocated from
memblock during early boot will be marked reserved. Existing
PageReserved() checks seem to be handling all relevant cases correctly
even after this change."

--
Cheers,

David / dhildenb


2024-06-11 10:08:48

by David Hildenbrand

Subject: Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()

On 07.06.24 11:09, David Hildenbrand wrote:
> In preparation for further changes, let's teach __free_pages_core()
> about the differences of memory hotplug handling.
>
> Move the memory hotplug specific handling from generic_online_page() to
> __free_pages_core(), use adjust_managed_page_count() on the memory
> hotplug path, and spell out why memory freed via memblock
> cannot currently use adjust_managed_page_count().
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---

@Andrew, can you squash the following?

From 0a7921cf21cacf178ca7485da0138fc38a97a28e Mon Sep 17 00:00:00 2001
From: David Hildenbrand <[email protected]>
Date: Tue, 11 Jun 2024 12:05:09 +0200
Subject: [PATCH] fixup: mm/highmem: make nr_free_highpages() return "unsigned
long"

Fixup the memblock comment.

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e0c8a8354be36..fc53f96db58a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1245,7 +1245,7 @@ void __free_pages_core(struct page *page, unsigned int order,
debug_pagealloc_map_pages(page, nr_pages);
adjust_managed_page_count(page, nr_pages);
} else {
- /* memblock adjusts totalram_pages() ahead of time. */
+ /* memblock adjusts totalram_pages() manually. */
atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
}

--
2.45.2



--
Cheers,

David / dhildenb


2024-06-11 19:20:00

by Andrew Morton

Subject: Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()

On Tue, 11 Jun 2024 12:06:56 +0200 David Hildenbrand <[email protected]> wrote:

> On 07.06.24 11:09, David Hildenbrand wrote:
> > In preparation for further changes, let's teach __free_pages_core()
> > about the differences of memory hotplug handling.
> >
> > Move the memory hotplug specific handling from generic_online_page() to
> > __free_pages_core(), use adjust_managed_page_count() on the memory
> > hotplug path, and spell out why memory freed via memblock
> > cannot currently use adjust_managed_page_count().
> >
> > Signed-off-by: David Hildenbrand <[email protected]>
> > ---
>
> @Andrew, can you squash the following?

Sure.

I queued it against "mm: pass meminit_context to __free_pages_core()",
not against

> Subject: [PATCH] fixup: mm/highmem: make nr_free_highpages() return "unsigned
> long"


2024-06-11 19:21:16

by David Hildenbrand

Subject: Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()

On 11.06.24 21:19, Andrew Morton wrote:
> On Tue, 11 Jun 2024 12:06:56 +0200 David Hildenbrand <[email protected]> wrote:
>
>> On 07.06.24 11:09, David Hildenbrand wrote:
>>> In preparation for further changes, let's teach __free_pages_core()
>>> about the differences of memory hotplug handling.
>>>
>>> Move the memory hotplug specific handling from generic_online_page() to
>>> __free_pages_core(), use adjust_managed_page_count() on the memory
>>> hotplug path, and spell out why memory freed via memblock
>>> cannot currently use adjust_managed_page_count().
>>>
>>> Signed-off-by: David Hildenbrand <[email protected]>
>>> ---
>>
>> @Andrew, can you squash the following?
>
> Sure.
>
> I queued it against "mm: pass meminit_context to __free_pages_core()",
> not against

Ah yes, sorry. Thanks!

--
Cheers,

David / dhildenb


2024-06-11 19:38:53

by Andrew Morton

Subject: Re: [PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()

On Tue, 11 Jun 2024 11:42:56 +0200 David Hildenbrand <[email protected]> wrote:

> > We'll leave the ZONE_DEVICE case alone for now.
> >
>
> @Andrew, can we add here:
>
> "Note that self-hosted vmemmap pages will no longer be marked as
> reserved. This matches ordinary vmemmap pages allocated from the buddy
> during memory hotplug. Now, really only vmemmap pages allocated from
> memblock during early boot will be marked reserved. Existing
> PageReserved() checks seem to be handling all relevant cases correctly
> even after this change."

Done, thanks.

2024-06-11 19:52:54

by David Hildenbrand

Subject: Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()

On 11.06.24 21:41, Tim Chen wrote:
> On Fri, 2024-06-07 at 11:09 +0200, David Hildenbrand wrote:
>> In preparation for further changes, let's teach __free_pages_core()
>> about the differences of memory hotplug handling.
>>
>> Move the memory hotplug specific handling from generic_online_page() to
>> __free_pages_core(), use adjust_managed_page_count() on the memory
>> hotplug path, and spell out why memory freed via memblock
>> cannot currently use adjust_managed_page_count().
>>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> mm/internal.h | 3 ++-
>> mm/kmsan/init.c | 2 +-
>> mm/memory_hotplug.c | 9 +--------
>> mm/mm_init.c | 4 ++--
>> mm/page_alloc.c | 17 +++++++++++++++--
>> 5 files changed, 21 insertions(+), 14 deletions(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 12e95fdf61e90..3fdee779205ab 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -604,7 +604,8 @@ extern void __putback_isolated_page(struct page *page, unsigned int order,
>> int mt);
>> extern void memblock_free_pages(struct page *page, unsigned long pfn,
>> unsigned int order);
>> -extern void __free_pages_core(struct page *page, unsigned int order);
>> +extern void __free_pages_core(struct page *page, unsigned int order,
>> + enum meminit_context);
>
> Shouldn't the above be
> enum meminit_context context);

Although C allows parameters without names in declarations, this was
unintended.

Thanks!

--
Cheers,

David / dhildenb


2024-06-11 22:42:11

by Tim Chen

Subject: Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()

On Fri, 2024-06-07 at 11:09 +0200, David Hildenbrand wrote:
> In preparation for further changes, let's teach __free_pages_core()
> about the differences of memory hotplug handling.
>
> Move the memory hotplug specific handling from generic_online_page() to
> __free_pages_core(), use adjust_managed_page_count() on the memory
> hotplug path, and spell out why memory freed via memblock
> cannot currently use adjust_managed_page_count().
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/internal.h | 3 ++-
> mm/kmsan/init.c | 2 +-
> mm/memory_hotplug.c | 9 +--------
> mm/mm_init.c | 4 ++--
> mm/page_alloc.c | 17 +++++++++++++++--
> 5 files changed, 21 insertions(+), 14 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 12e95fdf61e90..3fdee779205ab 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -604,7 +604,8 @@ extern void __putback_isolated_page(struct page *page, unsigned int order,
> int mt);
> extern void memblock_free_pages(struct page *page, unsigned long pfn,
> unsigned int order);
> -extern void __free_pages_core(struct page *page, unsigned int order);
> +extern void __free_pages_core(struct page *page, unsigned int order,
> + enum meminit_context);

Shouldn't the above be
enum meminit_context context);
>
> /*
> * This will have no effect, other than possibly generating a warning, if the

Thanks.

Tim

2024-06-12 18:38:44

by David Hildenbrand

Subject: Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()

On 11.06.24 21:19, Andrew Morton wrote:
> On Tue, 11 Jun 2024 12:06:56 +0200 David Hildenbrand <[email protected]> wrote:
>
>> On 07.06.24 11:09, David Hildenbrand wrote:
>>> In preparation for further changes, let's teach __free_pages_core()
>>> about the differences of memory hotplug handling.
>>>
>>> Move the memory hotplug specific handling from generic_online_page() to
>>> __free_pages_core(), use adjust_managed_page_count() on the memory
>>> hotplug path, and spell out why memory freed via memblock
>>> cannot currently use adjust_managed_page_count().
>>>
>>> Signed-off-by: David Hildenbrand <[email protected]>
>>> ---
>>
>> @Andrew, can you squash the following?
>
> Sure.
>
> I queued it against "mm: pass meminit_context to __free_pages_core()",
> not against
>
>> Subject: [PATCH] fixup: mm/highmem: make nr_free_highpages() return "unsigned
>> long"
>

Can you squash the following as well? (hopefully the last fixup, otherwise I
might just resend a v2)


From 53c8c5834e638b2ae5e2a34fa7d49ce0dcf25192 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <[email protected]>
Date: Wed, 12 Jun 2024 20:31:07 +0200
Subject: [PATCH] fixup: mm: pass meminit_context to __free_pages_core()

Let's add the parameter name also in the declaration.

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/internal.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index 14bab8a41baf6..254dd907bf9a2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -605,7 +605,7 @@ extern void __putback_isolated_page(struct page *page, unsigned int order,
extern void memblock_free_pages(struct page *page, unsigned long pfn,
unsigned int order);
extern void __free_pages_core(struct page *page, unsigned int order,
- enum meminit_context);
+ enum meminit_context context);

/*
* This will have no effect, other than possibly generating a warning, if the
--
2.45.2


--
Cheers,

David / dhildenb