2020-09-16 18:37:46

by David Hildenbrand

Subject: [PATCH RFC 0/4] mm: place pages to the freelist tail when onlining and undoing isolation

When adding separate memory blocks via add_memory*() and onlining them
immediately, the metadata (especially the memmap) of the next block will be
placed onto one of the just added+onlined blocks. This creates a chain
of unmovable allocations: if the last memory block cannot get
offlined+removed, neither can any of the dependent ones. We directly end up
with unmovable allocations all over the place.

This can be observed quite easily using virtio-mem; however, it can also
be observed when using DIMMs. The freshly onlined pages will usually be
placed to the head of the freelists, meaning they will be allocated next,
usually turning the just-added memory immediately unremovable. The
fresh pages are cold, so preferring to allocate other (possibly hot) pages
also feels like the natural thing to do.
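
(As a minimal illustration of the head-vs-tail distinction in the buddy
freelists - simplified from the list handling in mm/page_alloc.c, the
variable names are only for illustration:)

	/* head: handed out again by the very next allocation */
	list_add(&page->lru, &area->free_list[migratetype]);

	/* tail: only handed out once everything in front of it got used */
	list_add_tail(&page->lru, &area->free_list[migratetype]);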

It also applies to the Hyper-V balloon, the Xen balloon, and ppc64 dlpar: when
adding separate, successive memory blocks, each memory block will have
unmovable allocations on it - for example, gigantic pages will fail to
allocate.

While ZONE_NORMAL doesn't provide any guarantees that memory can get
offlined+removed again (any kind of fragmentation with unmovable
allocations is possible), there are many scenarios (hotplugging a lot of
memory, running a workload, hotunplugging some memory/as much as possible)
where we can offline+remove quite a lot with this patchset.

a) To visualize the problem, a very simple example:

Start a VM with 4GB of initial memory and 8GB of virtio-mem memory:

[root@localhost ~]# lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x00000000bfffffff 3G online yes 0-23
0x0000000100000000-0x000000033fffffff 9G online yes 32-103

Memory block size: 128M
Total online memory: 12G
Total offline memory: 0B

Then try to unplug as much as possible using virtio-mem. Observe which
memory blocks are still around. Without this patch set:

[root@localhost ~]# lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x00000000bfffffff 3G online yes 0-23
0x0000000100000000-0x000000013fffffff 1G online yes 32-39
0x0000000148000000-0x000000014fffffff 128M online yes 41
0x0000000158000000-0x000000015fffffff 128M online yes 43
0x0000000168000000-0x000000016fffffff 128M online yes 45
0x0000000178000000-0x000000017fffffff 128M online yes 47
0x0000000188000000-0x0000000197ffffff 256M online yes 49-50
0x00000001a0000000-0x00000001a7ffffff 128M online yes 52
0x00000001b0000000-0x00000001b7ffffff 128M online yes 54
0x00000001c0000000-0x00000001c7ffffff 128M online yes 56
0x00000001d0000000-0x00000001d7ffffff 128M online yes 58
0x00000001e0000000-0x00000001e7ffffff 128M online yes 60
0x00000001f0000000-0x00000001f7ffffff 128M online yes 62
0x0000000200000000-0x0000000207ffffff 128M online yes 64
0x0000000210000000-0x0000000217ffffff 128M online yes 66
0x0000000220000000-0x0000000227ffffff 128M online yes 68
0x0000000230000000-0x0000000237ffffff 128M online yes 70
0x0000000240000000-0x0000000247ffffff 128M online yes 72
0x0000000250000000-0x0000000257ffffff 128M online yes 74
0x0000000260000000-0x0000000267ffffff 128M online yes 76
0x0000000270000000-0x0000000277ffffff 128M online yes 78
0x0000000280000000-0x0000000287ffffff 128M online yes 80
0x0000000290000000-0x0000000297ffffff 128M online yes 82
0x00000002a0000000-0x00000002a7ffffff 128M online yes 84
0x00000002b0000000-0x00000002b7ffffff 128M online yes 86
0x00000002c0000000-0x00000002c7ffffff 128M online yes 88
0x00000002d0000000-0x00000002d7ffffff 128M online yes 90
0x00000002e0000000-0x00000002e7ffffff 128M online yes 92
0x00000002f0000000-0x00000002f7ffffff 128M online yes 94
0x0000000300000000-0x0000000307ffffff 128M online yes 96
0x0000000310000000-0x0000000317ffffff 128M online yes 98
0x0000000320000000-0x0000000327ffffff 128M online yes 100
0x0000000330000000-0x000000033fffffff 256M online yes 102-103

Memory block size: 128M
Total online memory: 8.1G
Total offline memory: 0B

With this patch set:

[root@localhost ~]# lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x00000000bfffffff 3G online yes 0-23
0x0000000100000000-0x000000013fffffff 1G online yes 32-39

Memory block size: 128M
Total online memory: 4G
Total offline memory: 0B

All memory can get unplugged and all memory blocks can get removed. Of course,
no workload ran and the system was basically idle, but it highlights the
issue - the fairly deterministic chain of unmovable allocations. When a
huge page for the 2MB memmap is needed, a just-onlined 4MB page will
be split. The remaining 2MB page will be used for the memmap of the next
memory block. So one memory block will hold the memmap of the two following
memory blocks. Finally, the pages of the last-onlined memory block will get
used for the next bigger allocations - if any such allocation is unmovable,
all dependent memory blocks cannot get unplugged and removed until that
allocation is gone.
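
As a back-of-the-envelope check of that memmap size - assuming 4 KiB base
pages and a 64-byte struct page, as on x86-64:

	128 MiB / 4 KiB  = 32768 pages per 128MB memory block
	32768 * 64 bytes = 2 MiB of memmap per memory block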

Note that with bigger memory blocks (e.g., 256MB), *all* memory
blocks are dependent and none can get unplugged again!

b) Experiment with a memory-intensive workload

I performed an experiment with an older version of this patch set
(before we used undo_isolate_page_range() in online_pages()):
Hotplug 56GB to a VM with an initial 4GB, onlining all memory to
ZONE_NORMAL right from the kernel when adding it. I then ran various
memory-intensive workloads that consumed most system memory for a total of
45 minutes. Once finished, I tried to unplug as much memory as possible.

With this change, I am able to remove 413 out of 448 added memory blocks via
virtio-mem (adding individual 128MB memory blocks), and 380 out of 448 added
memory blocks via individual (256MB) DIMMs. (I don't have any numbers
without this patchset, but looking at the above example, it's at most half
of the 448 memory blocks for virtio-mem, and most probably none for DIMMs.)

Again, there are workloads that might behave very differently due to the
nature of ZONE_NORMAL.

c) Future work:
- I'll be looking into avoiding reporting freshly onlined pages via the
free page reporting framework. They are unbacked in the hypervisor, so
reporting them isn't necessary (and might actually be bad for performance
in some future use cases in the hypervisor).
- I'll be looking into being able to tell the OS that some pages are fresh
(e.g., via alloc_contig_range() in virtio-mem, or when freeing
balloon-inflated memory in a ballooning driver), such that we will skip
reporting them via free page reporting (marking them reported) and place
them to the tail of the freelist.
- virtio-mem will soon also support ZONE_MOVABLE; however, especially
when hotplugging a lot of memory (as in the experiment), a considerable
amount of memory will have to remain in ZONE_NORMAL - so this change
is relevant in any case.

I'm sending this as an RFC because, in its current form and for simplicity,
it affects not only memory onlining but also:
- Other users of undo_isolate_page_range(): Pages are always placed to the
tail.
-- When memory offlining fails
-- When memory isolation fails after having isolated some pageblocks
-- When alloc_contig_range() either succeeds or fails
- Other users of __putback_isolated_page(): Pages are always placed to the
tail.
-- Free page reporting
- Other users of __free_pages_core()
-- AFAIK, any memory that is getting exposed to the buddy during boot.
IIUC, we will now usually allocate memory from lower addresses within
a zone first (especially during boot).
- Other users of generic_online_page()
-- Hyper-V balloon

Let's see if there are concerns for these users with this approach.

David Hildenbrand (4):
mm/page_alloc: convert "report" flag of __free_one_page() to a proper
flag
mm/page_alloc: place pages to tail in __putback_isolated_page()
mm/page_alloc: always move pages to the tail of the freelist in
unset_migratetype_isolate()
mm/page_alloc: place pages to tail in __free_pages_core()

include/linux/page-isolation.h | 2 +
mm/page_alloc.c | 102 +++++++++++++++++++++++++--------
mm/page_isolation.c | 8 ++-
3 files changed, 86 insertions(+), 26 deletions(-)

--
2.26.2


2020-09-16 18:37:56

by David Hildenbrand

Subject: [PATCH RFC 1/4] mm/page_alloc: convert "report" flag of __free_one_page() to a proper flag

Let's prepare for additional flags and avoid long parameter lists of bools.
Follow-up patches will also make use of the flags in __free_pages_ok();
however, I wasn't able to come up with a better name for the type - it
should be good enough for internal purposes.
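
For context, a minimal sketch of the sparse "__bitwise" pattern this follows -
the names below are made up for illustration and are not part of the patch:

	typedef int __bitwise example_fop_t;

	#define EXAMPLE_NONE	((__force example_fop_t)0)	/* no special request */
	#define EXAMPLE_FOO	((__force example_fop_t)BIT(0))

	/*
	 * sparse treats a __bitwise type as distinct from plain int, so mixing
	 * it with untyped integers without the __force cast triggers a warning,
	 * nudging callers towards the named flags.
	 */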

Cc: Andrew Morton <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/page_alloc.c | 28 ++++++++++++++++++++--------
1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6b699d273d6e..91cefb8157dd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -77,6 +77,18 @@
#include "shuffle.h"
#include "page_reporting.h"

+/* Free One Page flags: for internal, non-pcp variants of free_pages(). */
+typedef int __bitwise fop_t;
+
+/* No special request */
+#define FOP_NONE ((__force fop_t)0)
+
+/*
+ * Skip free page reporting notification after buddy merging (will *not* mark
+ * the page reported, only skip the notification).
+ */
+#define FOP_SKIP_REPORT_NOTIFY ((__force fop_t)BIT(0))
+
/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
static DEFINE_MUTEX(pcp_batch_high_lock);
#define MIN_PERCPU_PAGELIST_FRACTION (8)
@@ -948,10 +960,9 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
* -- nyc
*/

-static inline void __free_one_page(struct page *page,
- unsigned long pfn,
- struct zone *zone, unsigned int order,
- int migratetype, bool report)
+static inline void __free_one_page(struct page *page, unsigned long pfn,
+ struct zone *zone, unsigned int order,
+ int migratetype, fop_t fop_flags)
{
struct capture_control *capc = task_capc(zone);
unsigned long buddy_pfn;
@@ -1038,7 +1049,7 @@ static inline void __free_one_page(struct page *page,
add_to_free_list(page, zone, order, migratetype);

/* Notify page reporting subsystem of freed page */
- if (report)
+ if (!(fop_flags & FOP_SKIP_REPORT_NOTIFY))
page_reporting_notify_free(order);
}

@@ -1368,7 +1379,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
if (unlikely(isolated_pageblocks))
mt = get_pageblock_migratetype(page);

- __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
+ __free_one_page(page, page_to_pfn(page), zone, 0, mt, FOP_NONE);
trace_mm_page_pcpu_drain(page, 0, mt);
}
spin_unlock(&zone->lock);
@@ -1384,7 +1395,7 @@ static void free_one_page(struct zone *zone,
is_migrate_isolate(migratetype))) {
migratetype = get_pfnblock_migratetype(page, pfn);
}
- __free_one_page(page, pfn, zone, order, migratetype, true);
+ __free_one_page(page, pfn, zone, order, migratetype, FOP_NONE);
spin_unlock(&zone->lock);
}

@@ -3277,7 +3288,8 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt)
lockdep_assert_held(&zone->lock);

/* Return isolated page to tail of freelist. */
- __free_one_page(page, page_to_pfn(page), zone, order, mt, false);
+ __free_one_page(page, page_to_pfn(page), zone, order, mt,
+ FOP_SKIP_REPORT_NOTIFY);
}

/*
--
2.26.2

2020-09-16 18:38:17

by David Hildenbrand

Subject: [PATCH RFC 2/4] mm/page_alloc: place pages to tail in __putback_isolated_page()

__putback_isolated_page() already documents that pages will be placed to
the tail of the freelist - this is, however, not the case for
"order >= MAX_ORDER - 2" (see buddy_merge_likely()) - which should be
the case for all existing users.

This change affects two users:
- free page reporting
- page isolation, when undoing the isolation.

This behavior is desireable for pages that haven't really been touched
lately, so exactly the two users that don't actually read/write page
content, but rather move untouched pages.

The new behavior is especially desirable for memory onlining, where we
allow allocation of newly onlined pages via undo_isolate_page_range()
in online_pages(). Right now, we always place them to the head of the
free list, resulting in undesireable behavior: Assume we add
individual memory chunks via add_memory() and online them right away to
the NORMAL zone. We create a dependency chain of unmovable allocations
e.g., via the memmap. The memmap of the next chunk will be placed onto
previous chunks - if the last block cannot get offlined+removed, all
dependent ones cannot get offlined+removed. While this can already be
observed with individual DIMMs, it's more of an issue for virtio-mem
(and I suspect also ppc DLPAR).

Note: If we observe a degradation due to the changed page isolation
behavior (which I doubt), we can always make this configurable by the
instance triggering undo of isolation (e.g., alloc_contig_range(),
memory onlining, memory offlining).

Cc: Andrew Morton <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Scott Cheloha <[email protected]>
Cc: Michael Ellerman <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/page_alloc.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91cefb8157dd..bba9a0f60c70 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -89,6 +89,12 @@ typedef int __bitwise fop_t;
*/
#define FOP_SKIP_REPORT_NOTIFY ((__force fop_t)BIT(0))

+/*
+ * Place the freed page to the tail of the freelist after buddy merging. Will
+ * get ignored with page shuffling enabled.
+ */
+#define FOP_TO_TAIL ((__force fop_t)BIT(1))
+
/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
static DEFINE_MUTEX(pcp_batch_high_lock);
#define MIN_PERCPU_PAGELIST_FRACTION (8)
@@ -1040,6 +1046,8 @@ static inline void __free_one_page(struct page *page, unsigned long pfn,

if (is_shuffle_order(order))
to_tail = shuffle_pick_tail();
+ else if (fop_flags & FOP_TO_TAIL)
+ to_tail = true;
else
to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);

@@ -3289,7 +3297,7 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt)

/* Return isolated page to tail of freelist. */
__free_one_page(page, page_to_pfn(page), zone, order, mt,
- FOP_SKIP_REPORT_NOTIFY);
+ FOP_SKIP_REPORT_NOTIFY | FOP_TO_TAIL);
}

/*
--
2.26.2

2020-09-16 18:38:26

by David Hildenbrand

Subject: [PATCH RFC 3/4] mm/page_alloc: always move pages to the tail of the freelist in unset_migratetype_isolate()

Page isolation doesn't actually touch the pages, it simply isolates
pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.

We already place pages to the tail of the freelists when undoing
isolation via __putback_isolated_page(); let's do it in any case
(e.g., if order == pageblock_order) and document the behavior.

This change results in all pages that get onlined via online_pages()
being placed to the tail of the freelist.

Cc: Andrew Morton <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Scott Cheloha <[email protected]>
Cc: Michael Ellerman <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/page-isolation.h | 2 ++
mm/page_alloc.c | 36 +++++++++++++++++++++++++++++-----
mm/page_isolation.c | 8 ++++++--
3 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 572458016331..a36be2cf4dbb 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -38,6 +38,8 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page,
void set_pageblock_migratetype(struct page *page, int migratetype);
int move_freepages_block(struct zone *zone, struct page *page,
int migratetype, int *num_movable);
+int move_freepages_block_tail(struct zone *zone, struct page *page,
+ int migratetype);

/*
* Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bba9a0f60c70..75b0f49b4022 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -899,6 +899,15 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
list_move(&page->lru, &area->free_list[migratetype]);
}

+/* Used for pages which are on another list */
+static inline void move_to_free_list_tail(struct page *page, struct zone *zone,
+ unsigned int order, int migratetype)
+{
+ struct free_area *area = &zone->free_area[order];
+
+ list_move_tail(&page->lru, &area->free_list[migratetype]);
+}
+
static inline void del_page_from_free_list(struct page *page, struct zone *zone,
unsigned int order)
{
@@ -2323,7 +2332,7 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
*/
static int move_freepages(struct zone *zone,
struct page *start_page, struct page *end_page,
- int migratetype, int *num_movable)
+ int migratetype, int *num_movable, bool to_tail)
{
struct page *page;
unsigned int order;
@@ -2354,7 +2363,10 @@ static int move_freepages(struct zone *zone,
VM_BUG_ON_PAGE(page_zone(page) != zone, page);

order = page_order(page);
- move_to_free_list(page, zone, order, migratetype);
+ if (to_tail)
+ move_to_free_list_tail(page, zone, order, migratetype);
+ else
+ move_to_free_list(page, zone, order, migratetype);
page += 1 << order;
pages_moved += 1 << order;
}
@@ -2362,8 +2374,9 @@ static int move_freepages(struct zone *zone,
return pages_moved;
}

-int move_freepages_block(struct zone *zone, struct page *page,
- int migratetype, int *num_movable)
+static int __move_freepages_block(struct zone *zone, struct page *page,
+ int migratetype, int *num_movable,
+ bool to_tail)
{
unsigned long start_pfn, end_pfn;
struct page *start_page, *end_page;
@@ -2384,7 +2397,20 @@ int move_freepages_block(struct zone *zone, struct page *page,
return 0;

return move_freepages(zone, start_page, end_page, migratetype,
- num_movable);
+ num_movable, to_tail);
+}
+
+int move_freepages_block(struct zone *zone, struct page *page,
+ int migratetype, int *num_movable)
+{
+ return __move_freepages_block(zone, page, migratetype, num_movable,
+ false);
+}
+
+int move_freepages_block_tail(struct zone *zone, struct page *page,
+ int migratetype)
+{
+ return __move_freepages_block(zone, page, migratetype, NULL, true);
}

static void change_pageblock_range(struct page *pageblock_page,
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index abfe26ad59fd..84aa1d14751d 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -83,7 +83,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
* Because freepage with more than pageblock_order on isolated
* pageblock is restricted to merge due to freepage counting problem,
* it is possible that there is free buddy page.
- * move_freepages_block() doesn't care of merge so we need other
+ * move_freepages_block*() don't care about merging, so we need another
* approach in order to merge them. Isolation and free will make
* these pages to be merged.
*/
@@ -106,9 +106,13 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
* If we isolate freepage with more than pageblock_order, there
* should be no freepage in the range, so we could avoid costly
* pageblock scanning for freepage moving.
+ *
+ * We didn't actually touch any of the isolated pages, so place them
+ * to the tail of the freelists. This is especially relevant during
+ * memory onlining.
*/
if (!isolated_page) {
- nr_pages = move_freepages_block(zone, page, migratetype, NULL);
+ nr_pages = move_freepages_block_tail(zone, page, migratetype);
__mod_zone_freepage_state(zone, nr_pages, migratetype);
}
set_pageblock_migratetype(page, migratetype);
--
2.26.2

2020-09-16 18:38:39

by David Hildenbrand

Subject: [PATCH RFC 4/4] mm/page_alloc: place pages to tail in __free_pages_core()

__free_pages_core() is used when exposing fresh memory to the buddy
during system boot and when onlining memory in generic_online_page().

generic_online_page() is used in two cases:

1. Direct memory onlining in online_pages().
2. Deferred memory onlining in memory-ballooning-like mechanisms (HyperV
balloon and virtio-mem), when parts of a section are kept
fake-offline to be fake-onlined later on.

In 1, we already place pages to the tail of the freelist. Pages will be
freed to MIGRATE_ISOLATE lists first and moved to the tail of the freelists
via undo_isolate_page_range().

In 2, we currently don't implement a proper rule. In the case of virtio-mem,
where we currently always online MAX_ORDER - 1 pages, the pages will be
placed to the HEAD of the freelist - undesirable. While the Hyper-V
balloon calls generic_online_page() with single pages, it will usually
call it on successive single pages in a larger block.

The pages are fresh, so place them to the tail of the freelists and avoid
the PCP.

Note: If we detect that the new behavior is undesirable for
__free_pages_core() during boot, we can let the caller specify the
behavior.

Cc: Andrew Morton <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: "K. Y. Srinivasan" <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Cc: Wei Liu <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/page_alloc.c | 32 ++++++++++++++++++++------------
1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 75b0f49b4022..50746e6dc21b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -264,7 +264,8 @@ bool pm_suspended_storage(void)
unsigned int pageblock_order __read_mostly;
#endif

-static void __free_pages_ok(struct page *page, unsigned int order);
+static void __free_pages_ok(struct page *page, unsigned int order,
+ fop_t fop_flags);

/*
* results with 256, 32 in the lowmem_reserve sysctl:
@@ -676,7 +677,7 @@ static void bad_page(struct page *page, const char *reason)
void free_compound_page(struct page *page)
{
mem_cgroup_uncharge(page);
- __free_pages_ok(page, compound_order(page));
+ __free_pages_ok(page, compound_order(page), FOP_NONE);
}

void prep_compound_page(struct page *page, unsigned int order)
@@ -1402,17 +1403,15 @@ static void free_pcppages_bulk(struct zone *zone, int count,
spin_unlock(&zone->lock);
}

-static void free_one_page(struct zone *zone,
- struct page *page, unsigned long pfn,
- unsigned int order,
- int migratetype)
+static void free_one_page(struct zone *zone, struct page *page, unsigned long pfn,
+ unsigned int order, int migratetype, fop_t fop_flags)
{
spin_lock(&zone->lock);
if (unlikely(has_isolate_pageblock(zone) ||
is_migrate_isolate(migratetype))) {
migratetype = get_pfnblock_migratetype(page, pfn);
}
- __free_one_page(page, pfn, zone, order, migratetype, FOP_NONE);
+ __free_one_page(page, pfn, zone, order, migratetype, fop_flags);
spin_unlock(&zone->lock);
}

@@ -1490,7 +1489,8 @@ void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
}
}

-static void __free_pages_ok(struct page *page, unsigned int order)
+static void __free_pages_ok(struct page *page, unsigned int order,
+ fop_t fop_flags)
{
unsigned long flags;
int migratetype;
@@ -1502,7 +1502,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
migratetype = get_pfnblock_migratetype(page, pfn);
local_irq_save(flags);
__count_vm_events(PGFREE, 1 << order);
- free_one_page(page_zone(page), page, pfn, order, migratetype);
+ free_one_page(page_zone(page), page, pfn, order, migratetype,
+ fop_flags);
local_irq_restore(flags);
}

@@ -1523,7 +1524,13 @@ void __free_pages_core(struct page *page, unsigned int order)

atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
set_page_refcounted(page);
- __free_pages(page, order);
+
+ /*
+ * Bypass PCP and place fresh pages right to the tail, primarily
+ * relevant for memory onlining.
+ */
+ page_ref_dec(page);
+ __free_pages_ok(page, order, FOP_TO_TAIL);
}

#ifdef CONFIG_NEED_MULTIPLE_NODES
@@ -3167,7 +3174,8 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
*/
if (migratetype >= MIGRATE_PCPTYPES) {
if (unlikely(is_migrate_isolate(migratetype))) {
- free_one_page(zone, page, pfn, 0, migratetype);
+ free_one_page(zone, page, pfn, 0, migratetype,
+ FOP_NONE);
return;
}
migratetype = MIGRATE_MOVABLE;
@@ -4984,7 +4992,7 @@ static inline void free_the_page(struct page *page, unsigned int order)
if (order == 0) /* Via pcp? */
free_unref_page(page);
else
- __free_pages_ok(page, order);
+ __free_pages_ok(page, order, FOP_NONE);
}

void __free_pages(struct page *page, unsigned int order)
--
2.26.2

2020-09-16 18:53:57

by Oscar Salvador

Subject: Re: [PATCH RFC 0/4] mm: place pages to the freelist tail when onlining and undoing isolation

On 2020-09-16 20:34, David Hildenbrand wrote:
> When adding separate memory blocks via add_memory*() and onlining them
> immediately, the metadata (especially the memmap) of the next block
> will be
> placed onto one of the just added+onlined block. This creates a chain
> of unmovable allocations: If the last memory block cannot get
> offlined+removed() so will all dependant ones. We directly have
> unmovable
> allocations all over the place.
>
> This can be observed quite easily using virtio-mem, however, it can
> also
> be observed when using DIMMs. The freshly onlined pages will usually be
> placed to the head of the freelists, meaning they will be allocated
> next,
> turning the just-added memory usually immediately un-removable. The
> fresh pages are cold, prefering to allocate others (that might be hot)
> also feels to be the natural thing to do.
>
> It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar:
> when
> adding separate, successive memory blocks, each memory block will have
> unmovable allocations on them - for example gigantic pages will fail to
> allocate.
>
> While the ZONE_NORMAL doesn't provide any guarantees that memory can
> get
> offlined+removed again (any kind of fragmentation with unmovable
> allocations is possible), there are many scenarios (hotplugging a lot
> of
> memory, running workload, hotunplug some memory/as much as possible)
> where
> we can offline+remove quite a lot with this patchset.

Hi David,

I did not read through the patchset yet, so sorry if the question is
nonsense, but is this not trying to fix the same issue the vmemmap
patches did? [1]

I was about to give it a new respin now that the hwpoison stuff has been
settled.

[1] https://patchwork.kernel.org/cover/11059175/
>

2020-09-16 19:40:13

by David Hildenbrand

Subject: Re: [PATCH RFC 0/4] mm: place pages to the freelist tail when onlining and undoing isolation



> On 16.09.2020 at 20:50, [email protected] wrote:
>
> On 2020-09-16 20:34, David Hildenbrand wrote:
>> When adding separate memory blocks via add_memory*() and onlining them
>> immediately, the metadata (especially the memmap) of the next block will be
>> placed onto one of the just added+onlined block. This creates a chain
>> of unmovable allocations: If the last memory block cannot get
>> offlined+removed() so will all dependant ones. We directly have unmovable
>> allocations all over the place.
>> This can be observed quite easily using virtio-mem, however, it can also
>> be observed when using DIMMs. The freshly onlined pages will usually be
>> placed to the head of the freelists, meaning they will be allocated next,
>> turning the just-added memory usually immediately un-removable. The
>> fresh pages are cold, prefering to allocate others (that might be hot)
>> also feels to be the natural thing to do.
>> It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: when
>> adding separate, successive memory blocks, each memory block will have
>> unmovable allocations on them - for example gigantic pages will fail to
>> allocate.
>> While the ZONE_NORMAL doesn't provide any guarantees that memory can get
>> offlined+removed again (any kind of fragmentation with unmovable
>> allocations is possible), there are many scenarios (hotplugging a lot of
>> memory, running workload, hotunplug some memory/as much as possible) where
>> we can offline+remove quite a lot with this patchset.
>
> Hi David,
>

Hi Oscar.

> I did not read through the patchset yet, so sorry if the question is nonsense, but is this not trying to fix the same issue the vmemmap patches did? [1]

Not nonsense at all. It only helps to some degree, though. It solves the dependencies due to the memmap. However, it's not completely ideal, especially for single memory blocks.

With single memory blocks (virtio-mem, xen-balloon, hv balloon, ppc dlpar) you still have unmovable allocations (vmemmap chunks) all over the physical address space. Consider the gigantic page example after hotplug. You directly fragmented all hotplugged memory.

Of course, there might be (less extreme) dependencies due to page tables for the identity mapping, extended struct pages, and similar.

Having that said, there are other benefits when preferring other memory over just hotplugged memory. Think about adding+onlining memory during boot (DIMMs under QEMU, virtio-mem): once the system is up, you will have most (all) of that memory completely untouched.

So while vmemmap on hotplugged memory would tackle some part of the issue, there are cases where this approach is better, and there are even benefits when combining both.

Thanks!

David

>
> I was about to give it a new respin now that thw hwpoison stuff has been settled.
>
> [1] https://patchwork.kernel.org/cover/11059175/
>

2020-09-16 22:09:24

by Alexander Duyck

Subject: Re: [PATCH RFC 2/4] mm/page_alloc: place pages to tail in __putback_isolated_page()

On Wed, Sep 16, 2020 at 11:34 AM David Hildenbrand <[email protected]> wrote:
>
> __putback_isolated_page() already documents that pages will be placed to
> the tail of the freelist - this is, however, not the case for
> "order >= MAX_ORDER - 2" (see buddy_merge_likely()) - which should be
> the case for all existing users.
>
> This change affects two users:
> - free page reporting
> - page isolation, when undoing the isolation.
>
> This behavior is desireable for pages that haven't really been touched

I think "desirable" is misspelled here.

> lately, so exactly the two users that don't actually read/write page
> content, but rather move untouched pages.

So in reality we were already dealing with this for page reporting,
but not in the most direct way. If I recall correctly, we were adding the
pages to the head of the list, and then when we went back to pull more
pages we were doing list rotation in the report function - so they were
technically being added to the head, but would usually end up back on
the tail anyway. If anything, the benefit for page reporting is that it
should be more direct this way, as we will now only have to rescan the
pages when we have consumed all of the reported ones on the list.

> The new behavior is especially desirable for memory onlining, where we
> allow allocation of newly onlined pages via undo_isolate_page_range()
> in online_pages(). Right now, we always place them to the head of the
> free list, resulting in undesireable behavior: Assume we add
> individual memory chunks via add_memory() and online them right away to
> the NORMAL zone. We create a dependency chain of unmovable allocations
> e.g., via the memmap. The memmap of the next chunk will be placed onto
> previous chunks - if the last block cannot get offlined+removed, all
> dependent ones cannot get offlined+removed. While this can already be
> observed with individual DIMMs, it's more of an issue for virtio-mem
> (and I suspect also ppc DLPAR).
>
> Note: If we observe a degradation due to the changed page isolation
> behavior (which I doubt), we can always make this configurable by the
> instance triggering undo of isolation (e.g., alloc_contig_range(),
> memory onlining, memory offlining).
>
> Cc: Andrew Morton <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Wei Yang <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Scott Cheloha <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/page_alloc.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 91cefb8157dd..bba9a0f60c70 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -89,6 +89,12 @@ typedef int __bitwise fop_t;
> */
> #define FOP_SKIP_REPORT_NOTIFY ((__force fop_t)BIT(0))
>
> +/*
> + * Place the freed page to the tail of the freelist after buddy merging. Will
> + * get ignored with page shuffling enabled.
> + */
> +#define FOP_TO_TAIL ((__force fop_t)BIT(1))
> +
> /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
> static DEFINE_MUTEX(pcp_batch_high_lock);
> #define MIN_PERCPU_PAGELIST_FRACTION (8)
> @@ -1040,6 +1046,8 @@ static inline void __free_one_page(struct page *page, unsigned long pfn,
>
> if (is_shuffle_order(order))
> to_tail = shuffle_pick_tail();
> + else if (fop_flags & FOP_TO_TAIL)
> + to_tail = true;
> else
> to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
>
> @@ -3289,7 +3297,7 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt)
>
> /* Return isolated page to tail of freelist. */
> __free_one_page(page, page_to_pfn(page), zone, order, mt,
> - FOP_SKIP_REPORT_NOTIFY);
> + FOP_SKIP_REPORT_NOTIFY | FOP_TO_TAIL);
> }
>
> /*

The code looks good to me.

Reviewed-by: Alexander Duyck <[email protected]>

2020-09-16 22:24:37

by Alexander Duyck

Subject: Re: [PATCH RFC 1/4] mm/page_alloc: convert "report" flag of __free_one_page() to a proper flag

On Wed, Sep 16, 2020 at 11:34 AM David Hildenbrand <[email protected]> wrote:
>
> Let's prepare for additional flags and avoid long parameter lists of bools.
> Follow-up patches will also make use of the flags in __free_pages_ok(),
> however, I wasn't able to come up with a better name for the type - should
> be good enough for internal purposes.
>
> Cc: Andrew Morton <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Wei Yang <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/page_alloc.c | 28 ++++++++++++++++++++--------
> 1 file changed, 20 insertions(+), 8 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6b699d273d6e..91cefb8157dd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -77,6 +77,18 @@
> #include "shuffle.h"
> #include "page_reporting.h"
>
> +/* Free One Page flags: for internal, non-pcp variants of free_pages(). */
> +typedef int __bitwise fop_t;
> +
> +/* No special request */
> +#define FOP_NONE ((__force fop_t)0)
> +
> +/*
> + * Skip free page reporting notification after buddy merging (will *not* mark
> + * the page reported, only skip the notification).
> + */
> +#define FOP_SKIP_REPORT_NOTIFY ((__force fop_t)BIT(0))
> +
> /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
> static DEFINE_MUTEX(pcp_batch_high_lock);
> #define MIN_PERCPU_PAGELIST_FRACTION (8)
> @@ -948,10 +960,9 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
> * -- nyc
> */
>
> -static inline void __free_one_page(struct page *page,
> - unsigned long pfn,
> - struct zone *zone, unsigned int order,
> - int migratetype, bool report)
> +static inline void __free_one_page(struct page *page, unsigned long pfn,
> + struct zone *zone, unsigned int order,
> + int migratetype, fop_t fop_flags)
> {
> struct capture_control *capc = task_capc(zone);
> unsigned long buddy_pfn;
> @@ -1038,7 +1049,7 @@ static inline void __free_one_page(struct page *page,
> add_to_free_list(page, zone, order, migratetype);
>
> /* Notify page reporting subsystem of freed page */
> - if (report)
> + if (!(fop_flags & FOP_SKIP_REPORT_NOTIFY))
> page_reporting_notify_free(order);
> }
>
> @@ -1368,7 +1379,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> if (unlikely(isolated_pageblocks))
> mt = get_pageblock_migratetype(page);
>
> - __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
> + __free_one_page(page, page_to_pfn(page), zone, 0, mt, FOP_NONE);
> trace_mm_page_pcpu_drain(page, 0, mt);
> }
> spin_unlock(&zone->lock);
> @@ -1384,7 +1395,7 @@ static void free_one_page(struct zone *zone,
> is_migrate_isolate(migratetype))) {
> migratetype = get_pfnblock_migratetype(page, pfn);
> }
> - __free_one_page(page, pfn, zone, order, migratetype, true);
> + __free_one_page(page, pfn, zone, order, migratetype, FOP_NONE);
> spin_unlock(&zone->lock);
> }
>
> @@ -3277,7 +3288,8 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt)
> lockdep_assert_held(&zone->lock);
>
> /* Return isolated page to tail of freelist. */
> - __free_one_page(page, page_to_pfn(page), zone, order, mt, false);
> + __free_one_page(page, page_to_pfn(page), zone, order, mt,
> + FOP_SKIP_REPORT_NOTIFY);
> }
>
> /*

Seems pretty straightforward. So we are basically flipping the logic
and replacing !report with FOP_SKIP_REPORT_NOTIFY.

Reviewed-by: Alexander Duyck <[email protected]>

2020-09-18 01:54:51

by Wei Yang

Subject: Re: [PATCH RFC 1/4] mm/page_alloc: convert "report" flag of __free_one_page() to a proper flag

On Wed, Sep 16, 2020 at 08:34:08PM +0200, David Hildenbrand wrote:
>Let's prepare for additional flags and avoid long parameter lists of bools.
>Follow-up patches will also make use of the flags in __free_pages_ok(),
>however, I wasn't able to come up with a better name for the type - should
>be good enough for internal purposes.
>
>Cc: Andrew Morton <[email protected]>
>Cc: Alexander Duyck <[email protected]>
>Cc: Mel Gorman <[email protected]>
>Cc: Michal Hocko <[email protected]>
>Cc: Dave Hansen <[email protected]>
>Cc: Vlastimil Babka <[email protected]>
>Cc: Wei Yang <[email protected]>
>Cc: Oscar Salvador <[email protected]>
>Cc: Mike Rapoport <[email protected]>
>Signed-off-by: David Hildenbrand <[email protected]>
>---
> mm/page_alloc.c | 28 ++++++++++++++++++++--------
> 1 file changed, 20 insertions(+), 8 deletions(-)
>
>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>index 6b699d273d6e..91cefb8157dd 100644
>--- a/mm/page_alloc.c
>+++ b/mm/page_alloc.c
>@@ -77,6 +77,18 @@
> #include "shuffle.h"
> #include "page_reporting.h"
>
>+/* Free One Page flags: for internal, non-pcp variants of free_pages(). */
>+typedef int __bitwise fop_t;
>+
>+/* No special request */
>+#define FOP_NONE ((__force fop_t)0)
>+
>+/*
>+ * Skip free page reporting notification after buddy merging (will *not* mark

__free_one_page() may not merge with the buddy when the buddy is not available.

Would this comment be a little confusing?

>+ * the page reported, only skip the notification).
>+ */
>+#define FOP_SKIP_REPORT_NOTIFY ((__force fop_t)BIT(0))
>+
> /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
> static DEFINE_MUTEX(pcp_batch_high_lock);
> #define MIN_PERCPU_PAGELIST_FRACTION (8)
>@@ -948,10 +960,9 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
> * -- nyc
> */
>
>-static inline void __free_one_page(struct page *page,
>- unsigned long pfn,
>- struct zone *zone, unsigned int order,
>- int migratetype, bool report)
>+static inline void __free_one_page(struct page *page, unsigned long pfn,
>+ struct zone *zone, unsigned int order,
>+ int migratetype, fop_t fop_flags)
> {
> struct capture_control *capc = task_capc(zone);
> unsigned long buddy_pfn;
>@@ -1038,7 +1049,7 @@ static inline void __free_one_page(struct page *page,
> add_to_free_list(page, zone, order, migratetype);
>
> /* Notify page reporting subsystem of freed page */
>- if (report)
>+ if (!(fop_flags & FOP_SKIP_REPORT_NOTIFY))
> page_reporting_notify_free(order);
> }
>
>@@ -1368,7 +1379,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> if (unlikely(isolated_pageblocks))
> mt = get_pageblock_migratetype(page);
>
>- __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>+ __free_one_page(page, page_to_pfn(page), zone, 0, mt, FOP_NONE);
> trace_mm_page_pcpu_drain(page, 0, mt);
> }
> spin_unlock(&zone->lock);
>@@ -1384,7 +1395,7 @@ static void free_one_page(struct zone *zone,
> is_migrate_isolate(migratetype))) {
> migratetype = get_pfnblock_migratetype(page, pfn);
> }
>- __free_one_page(page, pfn, zone, order, migratetype, true);
>+ __free_one_page(page, pfn, zone, order, migratetype, FOP_NONE);
> spin_unlock(&zone->lock);
> }
>
>@@ -3277,7 +3288,8 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt)
> lockdep_assert_held(&zone->lock);
>
> /* Return isolated page to tail of freelist. */
>- __free_one_page(page, page_to_pfn(page), zone, order, mt, false);
>+ __free_one_page(page, page_to_pfn(page), zone, order, mt,
>+ FOP_SKIP_REPORT_NOTIFY);
> }
>
> /*
>--
>2.26.2

--
Wei Yang
Help you, Help me

2020-09-18 02:18:30

by Wei Yang

Subject: Re: [PATCH RFC 2/4] mm/page_alloc: place pages to tail in __putback_isolated_page()

On Wed, Sep 16, 2020 at 08:34:09PM +0200, David Hildenbrand wrote:
>__putback_isolated_page() already documents that pages will be placed to
>the tail of the freelist - this is, however, not the case for
>"order >= MAX_ORDER - 2" (see buddy_merge_likely()) - which should be
>the case for all existing users.
>
>This change affects two users:
>- free page reporting
>- page isolation, when undoing the isolation.
>
>This behavior is desireable for pages that haven't really been touched
>lately, so exactly the two users that don't actually read/write page
>content, but rather move untouched pages.
>
>The new behavior is especially desirable for memory onlining, where we
>allow allocation of newly onlined pages via undo_isolate_page_range()
>in online_pages(). Right now, we always place them to the head of the
>free list, resulting in undesireable behavior: Assume we add
>individual memory chunks via add_memory() and online them right away to
>the NORMAL zone. We create a dependency chain of unmovable allocations
>e.g., via the memmap. The memmap of the next chunk will be placed onto
>previous chunks - if the last block cannot get offlined+removed, all
>dependent ones cannot get offlined+removed. While this can already be
>observed with individual DIMMs, it's more of an issue for virtio-mem
>(and I suspect also ppc DLPAR).
>
>Note: If we observe a degradation due to the changed page isolation
>behavior (which I doubt), we can always make this configurable by the
>instance triggering undo of isolation (e.g., alloc_contig_range(),
>memory onlining, memory offlining).
>
>Cc: Andrew Morton <[email protected]>
>Cc: Alexander Duyck <[email protected]>
>Cc: Mel Gorman <[email protected]>
>Cc: Michal Hocko <[email protected]>
>Cc: Dave Hansen <[email protected]>
>Cc: Vlastimil Babka <[email protected]>
>Cc: Wei Yang <[email protected]>
>Cc: Oscar Salvador <[email protected]>
>Cc: Mike Rapoport <[email protected]>
>Cc: Scott Cheloha <[email protected]>
>Cc: Michael Ellerman <[email protected]>
>Signed-off-by: David Hildenbrand <[email protected]>
>---
> mm/page_alloc.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>index 91cefb8157dd..bba9a0f60c70 100644
>--- a/mm/page_alloc.c
>+++ b/mm/page_alloc.c
>@@ -89,6 +89,12 @@ typedef int __bitwise fop_t;
> */
> #define FOP_SKIP_REPORT_NOTIFY ((__force fop_t)BIT(0))
>
>+/*
>+ * Place the freed page to the tail of the freelist after buddy merging. Will
>+ * get ignored with page shuffling enabled.
>+ */
>+#define FOP_TO_TAIL ((__force fop_t)BIT(1))
>+
> /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
> static DEFINE_MUTEX(pcp_batch_high_lock);
> #define MIN_PERCPU_PAGELIST_FRACTION (8)
>@@ -1040,6 +1046,8 @@ static inline void __free_one_page(struct page *page, unsigned long pfn,
>
> if (is_shuffle_order(order))
> to_tail = shuffle_pick_tail();
>+ else if (fop_flags & FOP_TO_TAIL)
>+ to_tail = true;

Taking another look at this part: maybe we can move this check to the top?

For the online_page case, we currently have the following call flow:

online_page
online_pages_range
shuffle_zone

This means we would always shuffle the newly added pages. Maybe we don't need
to do the shuffle when adding them to the free_list?

> else
> to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
>
>@@ -3289,7 +3297,7 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt)
>
> /* Return isolated page to tail of freelist. */
> __free_one_page(page, page_to_pfn(page), zone, order, mt,
>- FOP_SKIP_REPORT_NOTIFY);
>+ FOP_SKIP_REPORT_NOTIFY | FOP_TO_TAIL);
> }
>
> /*
>--
>2.26.2

--
Wei Yang
Help you, Help me

2020-09-18 02:33:16

by Wei Yang

Subject: Re: [PATCH RFC 3/4] mm/page_alloc: always move pages to the tail of the freelist in unset_migratetype_isolate()

On Wed, Sep 16, 2020 at 08:34:10PM +0200, David Hildenbrand wrote:
>Page isolation doesn't actually touch the pages, it simply isolates
>pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
>
>We already place pages to the tail of the freelists when undoing
>isolation via __putback_isolated_page(), let's do it in any case
>(e.g., if order == pageblock_order) and document the behavior.
>
>This change results in all pages getting onlined via online_pages() to
>be placed to the tail of the freelist.

I am sorry, I don't follow again. unset_migratetype_isolate() is used in
__offline_pages if my understanding is correct. How does it contribute to
online_pages()?

>
>Cc: Andrew Morton <[email protected]>
>Cc: Alexander Duyck <[email protected]>
>Cc: Mel Gorman <[email protected]>
>Cc: Michal Hocko <[email protected]>
>Cc: Dave Hansen <[email protected]>
>Cc: Vlastimil Babka <[email protected]>
>Cc: Wei Yang <[email protected]>
>Cc: Oscar Salvador <[email protected]>
>Cc: Mike Rapoport <[email protected]>
>Cc: Scott Cheloha <[email protected]>
>Cc: Michael Ellerman <[email protected]>
>Signed-off-by: David Hildenbrand <[email protected]>
>---
> include/linux/page-isolation.h | 2 ++
> mm/page_alloc.c | 36 +++++++++++++++++++++++++++++-----
> mm/page_isolation.c | 8 ++++++--
> 3 files changed, 39 insertions(+), 7 deletions(-)
>
>diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
>index 572458016331..a36be2cf4dbb 100644
>--- a/include/linux/page-isolation.h
>+++ b/include/linux/page-isolation.h
>@@ -38,6 +38,8 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page,
> void set_pageblock_migratetype(struct page *page, int migratetype);
> int move_freepages_block(struct zone *zone, struct page *page,
> int migratetype, int *num_movable);
>+int move_freepages_block_tail(struct zone *zone, struct page *page,
>+ int migratetype);
>
> /*
> * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>index bba9a0f60c70..75b0f49b4022 100644
>--- a/mm/page_alloc.c
>+++ b/mm/page_alloc.c
>@@ -899,6 +899,15 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
> list_move(&page->lru, &area->free_list[migratetype]);
> }
>
>+/* Used for pages which are on another list */
>+static inline void move_to_free_list_tail(struct page *page, struct zone *zone,
>+ unsigned int order, int migratetype)
>+{
>+ struct free_area *area = &zone->free_area[order];
>+
>+ list_move_tail(&page->lru, &area->free_list[migratetype]);
>+}
>+
> static inline void del_page_from_free_list(struct page *page, struct zone *zone,
> unsigned int order)
> {
>@@ -2323,7 +2332,7 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
> */
> static int move_freepages(struct zone *zone,
> struct page *start_page, struct page *end_page,
>- int migratetype, int *num_movable)
>+ int migratetype, int *num_movable, bool to_tail)
> {
> struct page *page;
> unsigned int order;
>@@ -2354,7 +2363,10 @@ static int move_freepages(struct zone *zone,
> VM_BUG_ON_PAGE(page_zone(page) != zone, page);
>
> order = page_order(page);
>- move_to_free_list(page, zone, order, migratetype);
>+ if (to_tail)
>+ move_to_free_list_tail(page, zone, order, migratetype);
>+ else
>+ move_to_free_list(page, zone, order, migratetype);
> page += 1 << order;
> pages_moved += 1 << order;
> }
>@@ -2362,8 +2374,9 @@ static int move_freepages(struct zone *zone,
> return pages_moved;
> }
>
>-int move_freepages_block(struct zone *zone, struct page *page,
>- int migratetype, int *num_movable)
>+static int __move_freepages_block(struct zone *zone, struct page *page,
>+ int migratetype, int *num_movable,
>+ bool to_tail)
> {
> unsigned long start_pfn, end_pfn;
> struct page *start_page, *end_page;
>@@ -2384,7 +2397,20 @@ int move_freepages_block(struct zone *zone, struct page *page,
> return 0;
>
> return move_freepages(zone, start_page, end_page, migratetype,
>- num_movable);
>+ num_movable, to_tail);
>+}
>+
>+int move_freepages_block(struct zone *zone, struct page *page,
>+ int migratetype, int *num_movable)
>+{
>+ return __move_freepages_block(zone, page, migratetype, num_movable,
>+ false);
>+}
>+
>+int move_freepages_block_tail(struct zone *zone, struct page *page,
>+ int migratetype)
>+{
>+ return __move_freepages_block(zone, page, migratetype, NULL, true);
> }
>
> static void change_pageblock_range(struct page *pageblock_page,
>diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>index abfe26ad59fd..84aa1d14751d 100644
>--- a/mm/page_isolation.c
>+++ b/mm/page_isolation.c
>@@ -83,7 +83,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
> * Because freepage with more than pageblock_order on isolated
> * pageblock is restricted to merge due to freepage counting problem,
> * it is possible that there is free buddy page.
>- * move_freepages_block() doesn't care of merge so we need other
>+ * move_freepages_block*() don't care about merging, so we need another
> * approach in order to merge them. Isolation and free will make
> * these pages to be merged.
> */
>@@ -106,9 +106,13 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
> * If we isolate freepage with more than pageblock_order, there
> * should be no freepage in the range, so we could avoid costly
> * pageblock scanning for freepage moving.
>+ *
>+ * We didn't actually touch any of the isolated pages, so place them
>+ * to the tail of the freelists. This is especially relevant during
>+ * memory onlining.
> */
> if (!isolated_page) {
>- nr_pages = move_freepages_block(zone, page, migratetype, NULL);
>+ nr_pages = move_freepages_block_tail(zone, page, migratetype);
> __mod_zone_freepage_state(zone, nr_pages, migratetype);
> }
> set_pageblock_migratetype(page, migratetype);
>--
>2.26.2

--
Wei Yang
Help you, Help me

2020-09-18 02:35:49

by Wei Yang

Subject: Re: [PATCH RFC 0/4] mm: place pages to the freelist tail when onlining and undoing isolation

On Wed, Sep 16, 2020 at 09:31:21PM +0200, David Hildenbrand wrote:
>
>
>> On 16.09.2020 at 20:50, [email protected] wrote:
>>
>> On 2020-09-16 20:34, David Hildenbrand wrote:
>>> When adding separate memory blocks via add_memory*() and onlining them
>>> immediately, the metadata (especially the memmap) of the next block will be
>>> placed onto one of the just added+onlined block. This creates a chain
>>> of unmovable allocations: If the last memory block cannot get
>>> offlined+removed() so will all dependant ones. We directly have unmovable
>>> allocations all over the place.
>>> This can be observed quite easily using virtio-mem, however, it can also
>>> be observed when using DIMMs. The freshly onlined pages will usually be
>>> placed to the head of the freelists, meaning they will be allocated next,
>>> turning the just-added memory usually immediately un-removable. The
>>> fresh pages are cold, prefering to allocate others (that might be hot)
>>> also feels to be the natural thing to do.
>>> It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: when
>>> adding separate, successive memory blocks, each memory block will have
>>> unmovable allocations on them - for example gigantic pages will fail to
>>> allocate.
>>> While the ZONE_NORMAL doesn't provide any guarantees that memory can get
>>> offlined+removed again (any kind of fragmentation with unmovable
>>> allocations is possible), there are many scenarios (hotplugging a lot of
>>> memory, running workload, hotunplug some memory/as much as possible) where
>>> we can offline+remove quite a lot with this patchset.
>>
>> Hi David,
>>
>
>Hi Oscar.
>
>> I did not read through the patchset yet, so sorry if the question is nonsense, but is this not trying to fix the same issue the vmemmap patches did? [1]
>
>Not nonesense at all. It only helps to some degree, though. It solves the dependencies due to the memmap. However, it‘s not completely ideal, especially for single memory blocks.
>
>With single memory blocks (virtio-mem, xen-balloon, hv balloon, ppc dlpar) you still have unmovable (vmemmap chunks) all over the physical address space. Consider the gigantic page example after hotplug. You directly fragmented all hotplugged memory.
>
>Of course, there might be (less extreme) dependencies due page tables for the identity mapping, extended struct pages and similar.
>
>Having that said, there are other benefits when preferring other memory over just hotplugged memory. Think about adding+onlining memory during boot (dimms under QEMU, virtio-mem), once the system is up you will have most (all) of that memory completely untouched.
>
>So while vmemmap on hotplugged memory would tackle some part of the issue, there are cases where this approach is better, and there are even benefits when combining both.

Although everything changes when shuffling is enabled.

>
>Thanks!
>
>David
>
>>
>> I was about to give it a new respin now that thw hwpoison stuff has been settled.
>>
>> [1] https://patchwork.kernel.org/cover/11059175/
>>

--
Wei Yang
Help you, Help me

2020-09-18 02:56:10

by Wei Yang

Subject: Re: [PATCH RFC 2/4] mm/page_alloc: place pages to tail in __putback_isolated_page()

On Wed, Sep 16, 2020 at 08:34:09PM +0200, David Hildenbrand wrote:
>__putback_isolated_page() already documents that pages will be placed to
>the tail of the freelist - this is, however, not the case for
>"order >= MAX_ORDER - 2" (see buddy_merge_likely()) - which should be
>the case for all existing users.
>
>This change affects two users:
>- free page reporting
>- page isolation, when undoing the isolation.
>
>This behavior is desireable for pages that haven't really been touched
>lately, so exactly the two users that don't actually read/write page
>content, but rather move untouched pages.
>
>The new behavior is especially desirable for memory onlining, where we
>allow allocation of newly onlined pages via undo_isolate_page_range()
>in online_pages(). Right now, we always place them to the head of the

The code looks good, though I don't fully understand the changelog here.

undo_isolate_page_range() is used in __offline_pages and alloc_contig_range. I
don't see the connection with online_pages(). Am I missing something?

>free list, resulting in undesireable behavior: Assume we add
>individual memory chunks via add_memory() and online them right away to
>the NORMAL zone. We create a dependency chain of unmovable allocations
>e.g., via the memmap. The memmap of the next chunk will be placed onto
>previous chunks - if the last block cannot get offlined+removed, all
>dependent ones cannot get offlined+removed. While this can already be
>observed with individual DIMMs, it's more of an issue for virtio-mem
>(and I suspect also ppc DLPAR).
>
>Note: If we observe a degradation due to the changed page isolation
>behavior (which I doubt), we can always make this configurable by the
>instance triggering undo of isolation (e.g., alloc_contig_range(),
>memory onlining, memory offlining).
>
>Cc: Andrew Morton <[email protected]>
>Cc: Alexander Duyck <[email protected]>
>Cc: Mel Gorman <[email protected]>
>Cc: Michal Hocko <[email protected]>
>Cc: Dave Hansen <[email protected]>
>Cc: Vlastimil Babka <[email protected]>
>Cc: Wei Yang <[email protected]>
>Cc: Oscar Salvador <[email protected]>
>Cc: Mike Rapoport <[email protected]>
>Cc: Scott Cheloha <[email protected]>
>Cc: Michael Ellerman <[email protected]>
>Signed-off-by: David Hildenbrand <[email protected]>
>---
> mm/page_alloc.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>index 91cefb8157dd..bba9a0f60c70 100644
>--- a/mm/page_alloc.c
>+++ b/mm/page_alloc.c
>@@ -89,6 +89,12 @@ typedef int __bitwise fop_t;
> */
> #define FOP_SKIP_REPORT_NOTIFY ((__force fop_t)BIT(0))
>
>+/*
>+ * Place the freed page to the tail of the freelist after buddy merging. Will
>+ * get ignored with page shuffling enabled.
>+ */
>+#define FOP_TO_TAIL ((__force fop_t)BIT(1))
>+
> /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
> static DEFINE_MUTEX(pcp_batch_high_lock);
> #define MIN_PERCPU_PAGELIST_FRACTION (8)
>@@ -1040,6 +1046,8 @@ static inline void __free_one_page(struct page *page, unsigned long pfn,
>
> if (is_shuffle_order(order))
> to_tail = shuffle_pick_tail();
>+ else if (fop_flags & FOP_TO_TAIL)
>+ to_tail = true;
> else
> to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
>
>@@ -3289,7 +3297,7 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt)
>
> /* Return isolated page to tail of freelist. */
> __free_one_page(page, page_to_pfn(page), zone, order, mt,
>- FOP_SKIP_REPORT_NOTIFY);
>+ FOP_SKIP_REPORT_NOTIFY | FOP_TO_TAIL);
> }
>
> /*
>--
>2.26.2

--
Wei Yang
Help you, Help me

2020-09-18 07:27:09

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 1/4] mm/page_alloc: convert "report" flag of __free_one_page() to a proper flag

On 18.09.20 03:53, Wei Yang wrote:
> On Wed, Sep 16, 2020 at 08:34:08PM +0200, David Hildenbrand wrote:
>> Let's prepare for additional flags and avoid long parameter lists of bools.
>> Follow-up patches will also make use of the flags in __free_pages_ok(),
>> however, I wasn't able to come up with a better name for the type - should
>> be good enough for internal purposes.
>>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Alexander Duyck <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: Vlastimil Babka <[email protected]>
>> Cc: Wei Yang <[email protected]>
>> Cc: Oscar Salvador <[email protected]>
>> Cc: Mike Rapoport <[email protected]>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> mm/page_alloc.c | 28 ++++++++++++++++++++--------
>> 1 file changed, 20 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6b699d273d6e..91cefb8157dd 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -77,6 +77,18 @@
>> #include "shuffle.h"
>> #include "page_reporting.h"
>>
>> +/* Free One Page flags: for internal, non-pcp variants of free_pages(). */
>> +typedef int __bitwise fop_t;
>> +
>> +/* No special request */
>> +#define FOP_NONE ((__force fop_t)0)
>> +
>> +/*
>> + * Skip free page reporting notification after buddy merging (will *not* mark
>
> __free_one_page() may not merge buddy when its buddy is not available.
>
> Would this comment be a little confusing?
>

I rather meant the process than whether it's actually happening.

"Skip free page reporting notification for the (possibly merged) page."

Thanks!

--
Thanks,

David / dhildenb

2020-09-18 07:29:06

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] mm/page_alloc: place pages to tail in __putback_isolated_page()

On 18.09.20 04:07, Wei Yang wrote:
> On Wed, Sep 16, 2020 at 08:34:09PM +0200, David Hildenbrand wrote:
>> __putback_isolated_page() already documents that pages will be placed to
>> the tail of the freelist - this is, however, not the case for
>> "order >= MAX_ORDER - 2" (see buddy_merge_likely()) - which should be
>> the case for all existing users.
>>
>> This change affects two users:
>> - free page reporting
>> - page isolation, when undoing the isolation.
>>
>> This behavior is desireable for pages that haven't really been touched
>> lately, so exactly the two users that don't actually read/write page
>> content, but rather move untouched pages.
>>
>> The new behavior is especially desirable for memory onlining, where we
>> allow allocation of newly onlined pages via undo_isolate_page_range()
>> in online_pages(). Right now, we always place them to the head of the
>
> The code looks good, while I don't fully understand the log here.
>
> undo_isolate_page_range() is used in __offline_pages and alloc_contig_range. I
> don't connect them with online_pages(). Do I miss something?

Yeah, please look at -mm / -next instead. See

https://lkml.kernel.org/r/[email protected]
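(In -next, online_pages() onlines all pages as MIGRATE_ISOLATE first and
releases them via undo_isolate_page_range() - so, roughly,
online_pages() -> undo_isolate_page_range() -> unset_migratetype_isolate()
-> move_freepages_block(), which is where the tail placement from this
series kicks in.)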


--
Thanks,

David / dhildenb

2020-09-18 07:30:44

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] mm/page_alloc: place pages to tail in __putback_isolated_page()

On 18.09.20 04:16, Wei Yang wrote:
> On Wed, Sep 16, 2020 at 08:34:09PM +0200, David Hildenbrand wrote:
>> __putback_isolated_page() already documents that pages will be placed to
>> the tail of the freelist - this is, however, not the case for
>> "order >= MAX_ORDER - 2" (see buddy_merge_likely()) - which should be
>> the case for all existing users.
>>
>> This change affects two users:
>> - free page reporting
>> - page isolation, when undoing the isolation.
>>
>> This behavior is desireable for pages that haven't really been touched
>> lately, so exactly the two users that don't actually read/write page
>> content, but rather move untouched pages.
>>
>> The new behavior is especially desirable for memory onlining, where we
>> allow allocation of newly onlined pages via undo_isolate_page_range()
>> in online_pages(). Right now, we always place them to the head of the
>> free list, resulting in undesireable behavior: Assume we add
>> individual memory chunks via add_memory() and online them right away to
>> the NORMAL zone. We create a dependency chain of unmovable allocations
>> e.g., via the memmap. The memmap of the next chunk will be placed onto
>> previous chunks - if the last block cannot get offlined+removed, all
>> dependent ones cannot get offlined+removed. While this can already be
>> observed with individual DIMMs, it's more of an issue for virtio-mem
>> (and I suspect also ppc DLPAR).
>>
>> Note: If we observe a degradation due to the changed page isolation
>> behavior (which I doubt), we can always make this configurable by the
>> instance triggering undo of isolation (e.g., alloc_contig_range(),
>> memory onlining, memory offlining).
>>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Alexander Duyck <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: Vlastimil Babka <[email protected]>
>> Cc: Wei Yang <[email protected]>
>> Cc: Oscar Salvador <[email protected]>
>> Cc: Mike Rapoport <[email protected]>
>> Cc: Scott Cheloha <[email protected]>
>> Cc: Michael Ellerman <[email protected]>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> mm/page_alloc.c | 10 +++++++++-
>> 1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 91cefb8157dd..bba9a0f60c70 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -89,6 +89,12 @@ typedef int __bitwise fop_t;
>> */
>> #define FOP_SKIP_REPORT_NOTIFY ((__force fop_t)BIT(0))
>>
>> +/*
>> + * Place the freed page to the tail of the freelist after buddy merging. Will
>> + * get ignored with page shuffling enabled.
>> + */
>> +#define FOP_TO_TAIL ((__force fop_t)BIT(1))
>> +
>> /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>> static DEFINE_MUTEX(pcp_batch_high_lock);
>> #define MIN_PERCPU_PAGELIST_FRACTION (8)
>> @@ -1040,6 +1046,8 @@ static inline void __free_one_page(struct page *page, unsigned long pfn,
>>
>> if (is_shuffle_order(order))
>> to_tail = shuffle_pick_tail();
>> + else if (fop_flags & FOP_TO_TAIL)
>> + to_tail = true;
>
> Take another look into this part. Maybe we can move this check at top?
>
> For online_page case, currently we have following call flow:
>
> online_page
> online_pages_range
> shuffle_zone
>
> This means we would always shuffle the newly added pages. Maybe we don't need
> to do the shuffle when adding them to the free_list?

Yeah we don't, but it doesn't really buy us too much as the call paths I
am touching are used by other mechanisms as well that need it
(especially undoing page isolation).

--
Thanks,

David / dhildenb

2020-09-18 07:33:23

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] mm/page_alloc: always move pages to the tail of the freelist in unset_migratetype_isolate()

On 18.09.20 04:29, Wei Yang wrote:
> On Wed, Sep 16, 2020 at 08:34:10PM +0200, David Hildenbrand wrote:
>> Page isolation doesn't actually touch the pages, it simply isolates
>> pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
>>
>> We already place pages to the tail of the freelists when undoing
>> isolation via __putback_isolated_page(), let's do it in any case
>> (e.g., if order == pageblock_order) and document the behavior.
>>
>> This change results in all pages getting onlined via online_pages() to
>> be placed to the tail of the freelist.
>
> I am sorry to not follow again. unset_migratetype_isolate() is used in
> __offline_pages if my understanding is correct. How does it contribute on
> online_pages?

See -next / -mm, that should make it clearer.

--
Thanks,

David / dhildenb

2020-09-18 07:35:18

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] mm: place pages to the freelist tail when onling and undoing isolation

On 18.09.20 04:30, Wei Yang wrote:
> On Wed, Sep 16, 2020 at 09:31:21PM +0200, David Hildenbrand wrote:
>>
>>
>>> Am 16.09.2020 um 20:50 schrieb [email protected]:
>>>
>>> On 2020-09-16 20:34, David Hildenbrand wrote:
>>>> When adding separate memory blocks via add_memory*() and onlining them
>>>> immediately, the metadata (especially the memmap) of the next block will be
>>>> placed onto one of the just added+onlined block. This creates a chain
>>>> of unmovable allocations: If the last memory block cannot get
>>>> offlined+removed() so will all dependant ones. We directly have unmovable
>>>> allocations all over the place.
>>>> This can be observed quite easily using virtio-mem, however, it can also
>>>> be observed when using DIMMs. The freshly onlined pages will usually be
>>>> placed to the head of the freelists, meaning they will be allocated next,
>>>> turning the just-added memory usually immediately un-removable. The
>>>> fresh pages are cold, prefering to allocate others (that might be hot)
>>>> also feels to be the natural thing to do.
>>>> It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: when
>>>> adding separate, successive memory blocks, each memory block will have
>>>> unmovable allocations on them - for example gigantic pages will fail to
>>>> allocate.
>>>> While the ZONE_NORMAL doesn't provide any guarantees that memory can get
>>>> offlined+removed again (any kind of fragmentation with unmovable
>>>> allocations is possible), there are many scenarios (hotplugging a lot of
>>>> memory, running workload, hotunplug some memory/as much as possible) where
>>>> we can offline+remove quite a lot with this patchset.
>>>
>>> Hi David,
>>>
>>
>> Hi Oscar.
>>
>>> I did not read through the patchset yet, so sorry if the question is nonsense, but is this not trying to fix the same issue the vmemmap patches did? [1]
>>
>> Not nonesense at all. It only helps to some degree, though. It solves the dependencies due to the memmap. However, it‘s not completely ideal, especially for single memory blocks.
>>
>> With single memory blocks (virtio-mem, xen-balloon, hv balloon, ppc dlpar) you still have unmovable (vmemmap chunks) all over the physical address space. Consider the gigantic page example after hotplug. You directly fragmented all hotplugged memory.
>>
>> Of course, there might be (less extreme) dependencies due page tables for the identity mapping, extended struct pages and similar.
>>
>> Having that said, there are other benefits when preferring other memory over just hotplugged memory. Think about adding+onlining memory during boot (dimms under QEMU, virtio-mem), once the system is up you will have most (all) of that memory completely untouched.
>>
>> So while vmemmap on hotplugged memory would tackle some part of the issue, there are cases where this approach is better, and there are even benefits when combining both.
>
> While everything changes with shuffle.
>

Right. Shuffling would naturally try to break the dependencies.

Shuffling is quite rare though: it has to be enabled explicitly on the
command line and might not be of much help in virtualized environments.
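(That is, as far as I remember, CONFIG_SHUFFLE_PAGE_ALLOCATOR plus the
page_alloc.shuffle=1 parameter; it can also get auto-enabled when a
direct-mapped memory-side cache is detected.)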

--
Thanks,

David / dhildenb

2020-09-21 01:59:08

by Wei Yang

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] mm/page_alloc: place pages to tail in __putback_isolated_page()

On Fri, Sep 18, 2020 at 09:27:23AM +0200, David Hildenbrand wrote:
>On 18.09.20 04:07, Wei Yang wrote:
>> On Wed, Sep 16, 2020 at 08:34:09PM +0200, David Hildenbrand wrote:
>>> __putback_isolated_page() already documents that pages will be placed to
>>> the tail of the freelist - this is, however, not the case for
>>> "order >= MAX_ORDER - 2" (see buddy_merge_likely()) - which should be
>>> the case for all existing users.
>>>
>>> This change affects two users:
>>> - free page reporting
>>> - page isolation, when undoing the isolation.
>>>
>>> This behavior is desireable for pages that haven't really been touched
>>> lately, so exactly the two users that don't actually read/write page
>>> content, but rather move untouched pages.
>>>
>>> The new behavior is especially desirable for memory onlining, where we
>>> allow allocation of newly onlined pages via undo_isolate_page_range()
>>> in online_pages(). Right now, we always place them to the head of the
>>
>> The code looks good, while I don't fully understand the log here.
>>
>> undo_isolate_page_range() is used in __offline_pages and alloc_contig_range. I
>> don't connect them with online_pages(). Do I miss something?
>
>Yeah, please look at -mm / -next instead. See
>
>https://lkml.kernel.org/r/[email protected]
>

Thanks, I get the point.

>
>--
>Thanks,
>
>David / dhildenb

--
Wei Yang
Help you, Help me

2020-09-23 14:33:17

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] mm: place pages to the freelist tail when onling and undoing isolation

On 9/16/20 9:31 PM, David Hildenbrand wrote:
>
>
>> Am 16.09.2020 um 20:50 schrieb [email protected]:
>>
>> On 2020-09-16 20:34, David Hildenbrand wrote:
>>> When adding separate memory blocks via add_memory*() and onlining them
>>> immediately, the metadata (especially the memmap) of the next block will be
>>> placed onto one of the just added+onlined block. This creates a chain
>>> of unmovable allocations: If the last memory block cannot get
>>> offlined+removed() so will all dependant ones. We directly have unmovable
>>> allocations all over the place.
>>> This can be observed quite easily using virtio-mem, however, it can also
>>> be observed when using DIMMs. The freshly onlined pages will usually be
>>> placed to the head of the freelists, meaning they will be allocated next,
>>> turning the just-added memory usually immediately un-removable. The
>>> fresh pages are cold, prefering to allocate others (that might be hot)
>>> also feels to be the natural thing to do.
>>> It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: when
>>> adding separate, successive memory blocks, each memory block will have
>>> unmovable allocations on them - for example gigantic pages will fail to
>>> allocate.
>>> While the ZONE_NORMAL doesn't provide any guarantees that memory can get
>>> offlined+removed again (any kind of fragmentation with unmovable
>>> allocations is possible), there are many scenarios (hotplugging a lot of
>>> memory, running workload, hotunplug some memory/as much as possible) where
>>> we can offline+remove quite a lot with this patchset.
>>
>> Hi David,
>>
>
> Hi Oscar.
>
>> I did not read through the patchset yet, so sorry if the question is nonsense, but is this not trying to fix the same issue the vmemmap patches did? [1]
>
> Not nonesense at all. It only helps to some degree, though. It solves the dependencies due to the memmap. However, it‘s not completely ideal, especially for single memory blocks.
>
> With single memory blocks (virtio-mem, xen-balloon, hv balloon, ppc dlpar) you still have unmovable (vmemmap chunks) all over the physical address space. Consider the gigantic page example after hotplug. You directly fragmented all hotplugged memory.
>
> Of course, there might be (less extreme) dependencies due page tables for the identity mapping, extended struct pages and similar.
>
> Having that said, there are other benefits when preferring other memory over just hotplugged memory. Think about adding+onlining memory during boot (dimms under QEMU, virtio-mem), once the system is up you will have most (all) of that memory completely untouched.
>
> So while vmemmap on hotplugged memory would tackle some part of the issue, there are cases where this approach is better, and there are even benefits when combining both.

I see the point, but I don't think the head/tail mechanism is great for this. It
might sort of work, but with other interfering activity there are no guarantees
and it relies on a subtle implementation detail. There are better mechanisms
possible I think, such as preparing a larger MIGRATE_UNMOVABLE area in the
existing memory before we allocate those long-term management structures. Or
onlining a bunch of blocks as zone_movable first and only later converting to
zone_normal in a controlled way when the existing normal zone becomes depleted?

I guess it's an issue that the e.g. 128M block onlines are so disconnected from
each other that it's hard to employ a strategy that works best for e.g. a whole
bunch of GB onlined at once. But I noticed some effort towards a new API, so
maybe that will be solved there too?

> Thanks!
>
> David
>
>>
>> I was about to give it a new respin now that thw hwpoison stuff has been settled.
>>
>> [1] https://patchwork.kernel.org/cover/11059175/
>>
>

2020-09-23 15:27:47

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] mm: place pages to the freelist tail when onling and undoing isolation

On 23.09.20 16:31, Vlastimil Babka wrote:
> On 9/16/20 9:31 PM, David Hildenbrand wrote:
>>
>>
>>> Am 16.09.2020 um 20:50 schrieb [email protected]:
>>>
>>> On 2020-09-16 20:34, David Hildenbrand wrote:
>>>> When adding separate memory blocks via add_memory*() and onlining them
>>>> immediately, the metadata (especially the memmap) of the next block will be
>>>> placed onto one of the just added+onlined block. This creates a chain
>>>> of unmovable allocations: If the last memory block cannot get
>>>> offlined+removed() so will all dependant ones. We directly have unmovable
>>>> allocations all over the place.
>>>> This can be observed quite easily using virtio-mem, however, it can also
>>>> be observed when using DIMMs. The freshly onlined pages will usually be
>>>> placed to the head of the freelists, meaning they will be allocated next,
>>>> turning the just-added memory usually immediately un-removable. The
>>>> fresh pages are cold, prefering to allocate others (that might be hot)
>>>> also feels to be the natural thing to do.
>>>> It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: when
>>>> adding separate, successive memory blocks, each memory block will have
>>>> unmovable allocations on them - for example gigantic pages will fail to
>>>> allocate.
>>>> While the ZONE_NORMAL doesn't provide any guarantees that memory can get
>>>> offlined+removed again (any kind of fragmentation with unmovable
>>>> allocations is possible), there are many scenarios (hotplugging a lot of
>>>> memory, running workload, hotunplug some memory/as much as possible) where
>>>> we can offline+remove quite a lot with this patchset.
>>>
>>> Hi David,
>>>
>>
>> Hi Oscar.
>>
>>> I did not read through the patchset yet, so sorry if the question is nonsense, but is this not trying to fix the same issue the vmemmap patches did? [1]
>>
>> Not nonesense at all. It only helps to some degree, though. It solves the dependencies due to the memmap. However, it‘s not completely ideal, especially for single memory blocks.
>>
>> With single memory blocks (virtio-mem, xen-balloon, hv balloon, ppc dlpar) you still have unmovable (vmemmap chunks) all over the physical address space. Consider the gigantic page example after hotplug. You directly fragmented all hotplugged memory.
>>
>> Of course, there might be (less extreme) dependencies due page tables for the identity mapping, extended struct pages and similar.
>>
>> Having that said, there are other benefits when preferring other memory over just hotplugged memory. Think about adding+onlining memory during boot (dimms under QEMU, virtio-mem), once the system is up you will have most (all) of that memory completely untouched.
>>
>> So while vmemmap on hotplugged memory would tackle some part of the issue, there are cases where this approach is better, and there are even benefits when combining both.
>

Hi Vlastimil,

> I see the point, but I don't think the head/tail mechanism is great for this. It
> might sort of work, but with other interfering activity there are no guarantees
> and it relies on a subtle implementation detail. There are better mechanisms

For the specified use case of adding+onlining a whole bunch of memory
this works just fine. We don't care too much about "other interfering
activity" as you mention here, or about guarantees - this is a pure
optimization that seems to work just fine in practice.

I'm not sure about the "subtle implementation detail" - buddy merging,
and head/tail of buddy lists are a basic concept of our page allocator.
If that would ever change, the optimization here would be lost and we
would have to think of something else. Nothing would actually break -
and it's all kept directly in page_alloc.c

I'd like to stress that what I propose here is both simple and powerful.

> possible I think, such as preparing a larger MIGRATE_UNMOVABLE area in the
> existing memory before we allocate those long-term management structures. Or
> onlining a bunch of blocks as zone_movable first and only later convert to
> zone_normal in a controlled way when existing normal zone becomes depeted?

I see the following (more or less complicated) alternatives

1) Having a larger MIGRATE_UNMOVABLE area

a) Sizing it is difficult. I mean you would have to plan ahead for all
memory you might eventually hotplug later - and that could even be
impossible if you hotplug quite a lot of memory to a smaller machine.
(I've seen people in the vm/container world trying to hotplug 128GB
DIMMs to 2GB VMs ... and failing for obvious reasons)
b) not really desired. You usually want to have most memory movable, not
the opposite (just because you might hotplug memory in small chunks later).

2) smarter onlining

I have prototype patches for better auto-onlining (which I'll share at
some point), where I balance between ZONE_NORMAL and ZONE_MOVABLE in a
defined ratio. Assuming something very simple, adding separate memory
blocks and onlining them based on the current zone ratio (assuming a 1:4
normal:movable target ratio) would (without some other policies I have
in place) result in something like this for hotplugged memory (via
virtio-mem):

[N][M][M][M][M][N][M][M][M][M][N][M][M][M][M]...

(note: layout is suboptimal, just a simple example)
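
(Purely to illustrate the policy - a stand-alone toy sketch, not the
prototype code; the helper name and the hard-coded 1:4 ratio are made up
for the example:)

	/*
	 * Toy model: pick the zone for the next hotplugged block so that
	 * roughly 1 block ends up NORMAL for every 4 MOVABLE ones.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	static bool next_block_movable(unsigned long nr_normal,
				       unsigned long nr_movable)
	{
		return nr_movable < 4 * nr_normal;
	}

	int main(void)
	{
		unsigned long nr_normal = 0, nr_movable = 0;
		int i;

		for (i = 0; i < 15; i++) {
			if (next_block_movable(nr_normal, nr_movable)) {
				nr_movable++;
				printf("[M]");
			} else {
				nr_normal++;
				printf("[N]");
			}
		}
		/* prints [N][M][M][M][M][N][M][M][M][M][N][M][M][M][M] */
		printf("\n");
		return 0;
	}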

But even here, all [N] memory blocks would immediately be used for
allocations for the memmap of successive blocks. It doesn't solve the
dependency issues.

Now assume we would want to group [N] in a way to allow for gigantic
pages, like

[N][N][N][N][N][N][N][N][M][M][M][M] ....

we would, once again, never be able to allocate a gigantic page because
all [N] would contain a memmap.

3) conversion from MOVABLE -> NORMAL

While a conversion from MOVABLE to NORMAL would be interesting to see,
it's going to be a challenging task to actually implement (people expect
that page_zone() remains stable). Without any hacks, we'd have to

1. offline the selected (MOVABLE) memory block/chunk
2. online the selected memory block/chunk to the NORMAL zone
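
(From user space, that is roughly what the memory block sysfs interface
already allows - e.g. writing "offline" and then "online_kernel" to
/sys/devices/system/memory/memoryX/state - just not as an atomic
conversion.)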

This is not something we can do out of random context (for example, we
need both the device hotplug lock and the memory hotplug lock, as we
might race with user space) - so there might still be a chance of
corner-case OOMs.

(I assume there could also be quite a negative performance impact when
always relying on the conversion, and not properly planning ahead as in 2.)

>
> I guess it's an issue that the e.g. 128M block onlines are so disconnected from
> each other it's hard to employ a strategy that works best for e.g. a whole bunch
> of GB onlined at once. But I noticed some effort towards new API, so maybe that
> will be solved there too?

While new interfaces might make it easier to identify boundaries of
separate DIMMs (e.g., to online a single DIMM either movable or
unmovable - which can partially be done right now when going via memory
resource boundaries), it doesn't help for the use case of adding
separate memory blocks.

So while having an automatic conversion from MOVABLE -> NORMAL would be
interesting, I doubt we'll see it in the foreseeable future. Are there
any similarly simple alternatives to optimize this?

Thanks!

--
Thanks,

David / dhildenb

2020-09-24 02:00:29

by Wei Yang

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] mm: place pages to the freelist tail when onling and undoing isolation

On Wed, Sep 23, 2020 at 04:31:25PM +0200, Vlastimil Babka wrote:
>On 9/16/20 9:31 PM, David Hildenbrand wrote:
>>
>>
>>> Am 16.09.2020 um 20:50 schrieb [email protected]:
>>>
>>> On 2020-09-16 20:34, David Hildenbrand wrote:
>>>> When adding separate memory blocks via add_memory*() and onlining them
>>>> immediately, the metadata (especially the memmap) of the next block will be
>>>> placed onto one of the just added+onlined block. This creates a chain
>>>> of unmovable allocations: If the last memory block cannot get
>>>> offlined+removed() so will all dependant ones. We directly have unmovable
>>>> allocations all over the place.
>>>> This can be observed quite easily using virtio-mem, however, it can also
>>>> be observed when using DIMMs. The freshly onlined pages will usually be
>>>> placed to the head of the freelists, meaning they will be allocated next,
>>>> turning the just-added memory usually immediately un-removable. The
>>>> fresh pages are cold, prefering to allocate others (that might be hot)
>>>> also feels to be the natural thing to do.
>>>> It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: when
>>>> adding separate, successive memory blocks, each memory block will have
>>>> unmovable allocations on them - for example gigantic pages will fail to
>>>> allocate.
>>>> While the ZONE_NORMAL doesn't provide any guarantees that memory can get
>>>> offlined+removed again (any kind of fragmentation with unmovable
>>>> allocations is possible), there are many scenarios (hotplugging a lot of
>>>> memory, running workload, hotunplug some memory/as much as possible) where
>>>> we can offline+remove quite a lot with this patchset.
>>>
>>> Hi David,
>>>
>>
>> Hi Oscar.
>>
>>> I did not read through the patchset yet, so sorry if the question is nonsense, but is this not trying to fix the same issue the vmemmap patches did? [1]
>>
>> Not nonesense at all. It only helps to some degree, though. It solves the dependencies due to the memmap. However, it‘s not completely ideal, especially for single memory blocks.
>>
>> With single memory blocks (virtio-mem, xen-balloon, hv balloon, ppc dlpar) you still have unmovable (vmemmap chunks) all over the physical address space. Consider the gigantic page example after hotplug. You directly fragmented all hotplugged memory.
>>
>> Of course, there might be (less extreme) dependencies due page tables for the identity mapping, extended struct pages and similar.
>>
>> Having that said, there are other benefits when preferring other memory over just hotplugged memory. Think about adding+onlining memory during boot (dimms under QEMU, virtio-mem), once the system is up you will have most (all) of that memory completely untouched.
>>
>> So while vmemmap on hotplugged memory would tackle some part of the issue, there are cases where this approach is better, and there are even benefits when combining both.
>
>I see the point, but I don't think the head/tail mechanism is great for this. It
>might sort of work, but with other interfering activity there are no guarantees
>and it relies on a subtle implementation detail. There are better mechanisms
>possible I think, such as preparing a larger MIGRATE_UNMOVABLE area in the
>existing memory before we allocate those long-term management structures. Or
>onlining a bunch of blocks as zone_movable first and only later convert to
>zone_normal in a controlled way when existing normal zone becomes depeted?
>

To be honest, David's approach is easy for me to understand.

And I don't see any negative effect.

>I guess it's an issue that the e.g. 128M block onlines are so disconnected from
>each other it's hard to employ a strategy that works best for e.g. a whole bunch
>of GB onlined at once. But I noticed some effort towards new API, so maybe that
>will be solved there too?
>
>> Thanks!
>>
>> David
>>
>>>
>>> I was about to give it a new respin now that thw hwpoison stuff has been settled.
>>>
>>> [1] https://patchwork.kernel.org/cover/11059175/
>>>
>>

--
Wei Yang
Help you, Help me

2020-09-24 09:43:03

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] mm: place pages to the freelist tail when onling and undoing isolation

On Wed, Sep 23, 2020 at 05:26:06PM +0200, David Hildenbrand wrote:
> >>> On 2020-09-16 20:34, David Hildenbrand wrote:
> >>>> When adding separate memory blocks via add_memory*() and onlining them
> >>>> immediately, the metadata (especially the memmap) of the next block will be
> >>>> placed onto one of the just added+onlined block. This creates a chain
> >>>> of unmovable allocations: If the last memory block cannot get
> >>>> offlined+removed() so will all dependant ones. We directly have unmovable
> >>>> allocations all over the place.
> >>>> This can be observed quite easily using virtio-mem, however, it can also
> >>>> be observed when using DIMMs. The freshly onlined pages will usually be
> >>>> placed to the head of the freelists, meaning they will be allocated next,
> >>>> turning the just-added memory usually immediately un-removable. The
> >>>> fresh pages are cold, prefering to allocate others (that might be hot)
> >>>> also feels to be the natural thing to do.
> >>>> It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: when
> >>>> adding separate, successive memory blocks, each memory block will have
> >>>> unmovable allocations on them - for example gigantic pages will fail to
> >>>> allocate.
> >>>> While the ZONE_NORMAL doesn't provide any guarantees that memory can get
> >>>> offlined+removed again (any kind of fragmentation with unmovable
> >>>> allocations is possible), there are many scenarios (hotplugging a lot of
> >>>> memory, running workload, hotunplug some memory/as much as possible) where
> >>>> we can offline+remove quite a lot with this patchset.
> >>>
> >>> Hi David,
> >>>
> >>
> >> Hi Oscar.
> >>
> >>> I did not read through the patchset yet, so sorry if the question is nonsense, but is this not trying to fix the same issue the vmemmap patches did? [1]
> >>
> >> Not nonesense at all. It only helps to some degree, though. It solves the dependencies due to the memmap. However, it's not completely ideal, especially for single memory blocks.
> >>
> >> With single memory blocks (virtio-mem, xen-balloon, hv balloon, ppc dlpar) you still have unmovable (vmemmap chunks) all over the physical address space. Consider the gigantic page example after hotplug. You directly fragmented all hotplugged memory.
> >>
> >> Of course, there might be (less extreme) dependencies due page tables for the identity mapping, extended struct pages and similar.
> >>
> >> Having that said, there are other benefits when preferring other memory over just hotplugged memory. Think about adding+onlining memory during boot (dimms under QEMU, virtio-mem), once the system is up you will have most (all) of that memory completely untouched.
> >>
> >> So while vmemmap on hotplugged memory would tackle some part of the issue, there are cases where this approach is better, and there are even benefits when combining both.
> >
>
> Hi Vlastimil,
>
> > I see the point, but I don't think the head/tail mechanism is great for this. It
> > might sort of work, but with other interfering activity there are no guarantees
> > and it relies on a subtle implementation detail. There are better mechanisms
>
> For the specified use case of adding+onlining a whole bunch of memory
> this works just fine. We don't care too much about "other interfering
> activity" as you mention here, or about guarantees - this is a pure
> optimization that seems to work just fine in practice.
>
> I'm not sure about the "subtle implementation detail" - buddy merging,
> and head/tail of buddy lists are a basic concept of our page allocator.
> If that would ever change, the optimization here would be lost and we
> would have to think of something else. Nothing would actually break -
> and it's all kept directly in page_alloc.c
>

It's somewhat subtle because it's relying heavily on the exact ordering
of how pages are pulled from the free lists at the moment. Let's say for
example that someone was brave enough to tackle the problem of the giant
zone lock and split the zone into allocation arenas (like what glibc does
to split the lock). Depending on the exact ordering of how pages are
added and removed from the lists, that could break your approach. I'm wary
of anything that relies on the ordering of freelists for correctness because
it limits the ability to fix the zone lock (which has been overdue for
fixing for years now and getting worse as node sizes increase).

To be robust, you'd need to do something like early memory bring-up whereby
pages are directly allocated from one part of the DIMM (presumably the
start) and use that for the metadata -- potentially all the metadata that
would be necessary to plug/unplug the entire DIMM. This would effectively
be unmovable but if you want to guarantee that all the memory except the
metadata can be unplugged, you do not have many alternatives. Playing games
with the ordering of the freelists will simply end up as "sometimes works,
sometimes does not".

In terms of forcing ranges to be UNMOVABLE or MOVABLE (either via zones
or by implementing "sticky" pageblocks which hits complex reclaim-related
problems), you start running into problems similar to lowmem starvation
where a page cache allocation fails because unmovable metadata cannot
be allocated.

I suggest you keep it simple -- statically allocate the potential
metadata needed in the future even though it limits the maximum amount
of memory that can be unplugged. The alternative is unpredictable
plug/unplug success rates.
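
For a sense of scale, a rough stand-alone calculation (assuming 4 KiB base
pages and a 64-byte struct page, both just typical values for the example):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long block   = 128ULL << 20; /* one 128 MiB memory block */
		unsigned long long pgsize  = 4096;         /* base page size */
		unsigned long long structp = 64;           /* assumed sizeof(struct page) */
		unsigned long long memmap  = (block / pgsize) * structp;

		/* ~2 MiB of memmap per 128 MiB block, i.e. ~1.6% of the block */
		printf("memmap per 128 MiB block: %llu KiB\n", memmap >> 10);
		return 0;
	}

so reserving the metadata up front costs on the order of 1-2% of the
hotpluggable range.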

--
Mel Gorman
SUSE Labs

2020-09-24 09:58:41

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] mm: place pages to the freelist tail when onling and undoing isolation

On 24.09.20 11:40, Mel Gorman wrote:
> On Wed, Sep 23, 2020 at 05:26:06PM +0200, David Hildenbrand wrote:
>>>>> On 2020-09-16 20:34, David Hildenbrand wrote:
>>>>>> When adding separate memory blocks via add_memory*() and onlining them
>>>>>> immediately, the metadata (especially the memmap) of the next block will be
>>>>>> placed onto one of the just added+onlined block. This creates a chain
>>>>>> of unmovable allocations: If the last memory block cannot get
>>>>>> offlined+removed() so will all dependant ones. We directly have unmovable
>>>>>> allocations all over the place.
>>>>>> This can be observed quite easily using virtio-mem, however, it can also
>>>>>> be observed when using DIMMs. The freshly onlined pages will usually be
>>>>>> placed to the head of the freelists, meaning they will be allocated next,
>>>>>> turning the just-added memory usually immediately un-removable. The
>>>>>> fresh pages are cold, prefering to allocate others (that might be hot)
>>>>>> also feels to be the natural thing to do.
>>>>>> It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: when
>>>>>> adding separate, successive memory blocks, each memory block will have
>>>>>> unmovable allocations on them - for example gigantic pages will fail to
>>>>>> allocate.
>>>>>> While the ZONE_NORMAL doesn't provide any guarantees that memory can get
>>>>>> offlined+removed again (any kind of fragmentation with unmovable
>>>>>> allocations is possible), there are many scenarios (hotplugging a lot of
>>>>>> memory, running workload, hotunplug some memory/as much as possible) where
>>>>>> we can offline+remove quite a lot with this patchset.
>>>>>
>>>>> Hi David,
>>>>>
>>>>
>>>> Hi Oscar.
>>>>
>>>>> I did not read through the patchset yet, so sorry if the question is nonsense, but is this not trying to fix the same issue the vmemmap patches did? [1]
>>>>
>>>> Not nonesense at all. It only helps to some degree, though. It solves the dependencies due to the memmap. However, it's not completely ideal, especially for single memory blocks.
>>>>
>>>> With single memory blocks (virtio-mem, xen-balloon, hv balloon, ppc dlpar) you still have unmovable (vmemmap chunks) all over the physical address space. Consider the gigantic page example after hotplug. You directly fragmented all hotplugged memory.
>>>>
>>>> Of course, there might be (less extreme) dependencies due page tables for the identity mapping, extended struct pages and similar.
>>>>
>>>> Having that said, there are other benefits when preferring other memory over just hotplugged memory. Think about adding+onlining memory during boot (dimms under QEMU, virtio-mem), once the system is up you will have most (all) of that memory completely untouched.
>>>>
>>>> So while vmemmap on hotplugged memory would tackle some part of the issue, there are cases where this approach is better, and there are even benefits when combining both.
>>>
>>
>> Hi Vlastimil,
>>
>>> I see the point, but I don't think the head/tail mechanism is great for this. It
>>> might sort of work, but with other interfering activity there are no guarantees
>>> and it relies on a subtle implementation detail. There are better mechanisms
>>
>> For the specified use case of adding+onlining a whole bunch of memory
>> this works just fine. We don't care too much about "other interfering
>> activity" as you mention here, or about guarantees - this is a pure
>> optimization that seems to work just fine in practice.
>>
>> I'm not sure about the "subtle implementation detail" - buddy merging,
>> and head/tail of buddy lists are a basic concept of our page allocator.
>> If that would ever change, the optimization here would be lost and we
>> would have to think of something else. Nothing would actually break -
>> and it's all kept directly in page_alloc.c
>>

Hi Mel,

thanks for your reply.

>
> It's somewhat subtle because it's relying heavily on the exact ordering
> of how pages are pulled from the free lists at the moment. Lets say for
> example that someone was brave enough to tackle the problem of the giant
> zone lock and split the zone into allocation arenas (like what glibc does
> to split the lock). Depending on the exact ordering of how pages are
> added and removed from the list would break your approach. I'm wary of

First of all, it would not break it (as I already said). The
optimization would be lost. Totally acceptable.

However, I assume we would apply the same technique (optimized buddy
merging - placing to head/tail, page shuffling) to these allocation
arenas. So the optimization would still mostly apply, just at a different
granularity - which would be fine.

> anything that relies on the ordering of freelists for correctness becauuse
> it limits the ability to fix the zone lock (which has been overdue for
> fixing for years now and getting worse as node sizes increase).

"for correctness" - no, this is an optimization. As I said, there are no
guarantees. Please keep that in mind.

(also, page shuffling relies on the ordering of freelists right now ...
for correctness)

>
> To be robust, you'd need to do something like early memory bring-up whereby
> pages are directly allocated from one part of the DIMM (presumably the
> start) and use that for the metadata -- potentially all the metadata that
> would be necessary to plug/unplug the entire DIMM. This would effectively
> be unmovable but if you want to guarantee that all the memory except the
> metadata can be unplugged, you do not have much alteratives. Playing games
> with the ordering of the freelists will simply end up as "sometimes works,
> sometimes does not".

As answered to Oscar already, while something like that might be
feasible for DIMMs in the future (and there are still quite some issues
to be sorted out), it isn't always desired when adding separate (small -
e.g., 128MB) memory blocks. You - again - have unmovable allocations all
over the place that won't allow you to allocate any gigantic page.

>
> In terms of forcing ranges to be UNMOVABLE or MOVABLE (either via zones
> or by implementing "sticky" pageblocks which hits complex reclaim-related
> problems), you start running into problems similar to lowmem starvation
> where a page cache allocation fails because unmovable metadata cannot
> be allocated.

Exactly.

>
> I suggest you keep it simple -- statically allocate the potential
> metadata needed in the future even though it limits the maximum amount
> of memory that can be unplugged. The alternative is unpredictable
> plug/unplug success rates.
>

I'm sorry I can't follow. How is this "simple"? Or even "simpler" than
what I suggest?

And as I said, it doesn't always work. Assume I hotplug 128GB to a 2GB
machine via virtio-mem (which works just fine, as we add+online memory
in small chunks compared to a single, huge DIMM): I would have to
pre-allocate 2GB just for the memmap - which obviously doesn't work.
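(The 2GB figure is the usual back-of-the-envelope math: assuming 4KB pages
and 64 bytes of struct page per page, 128GB / 4KB * 64B = 2GB of memmap.)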

Again, I'd like to stress that this is a pure optimization that I am
proposing - nothing would "break" when ripping it out again, except that
we lose the optimizations I mentioned.

--
Thanks,

David / dhildenb

2020-09-24 10:23:12

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH RFC 1/4] mm/page_alloc: convert "report" flag of __free_one_page() to a proper flag

On 9/16/20 8:34 PM, David Hildenbrand wrote:
> Let's prepare for additional flags and avoid long parameter lists of bools.
> Follow-up patches will also make use of the flags in __free_pages_ok(),
> however, I wasn't able to come up with a better name for the type - should
> be good enough for internal purposes.
>
> Cc: Andrew Morton <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Wei Yang <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Vlastimil Babka <[email protected]>

> ---
> mm/page_alloc.c | 28 ++++++++++++++++++++--------
> 1 file changed, 20 insertions(+), 8 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6b699d273d6e..91cefb8157dd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -77,6 +77,18 @@
> #include "shuffle.h"
> #include "page_reporting.h"
>
> +/* Free One Page flags: for internal, non-pcp variants of free_pages(). */
> +typedef int __bitwise fop_t;
> +
> +/* No special request */
> +#define FOP_NONE ((__force fop_t)0)
> +
> +/*
> + * Skip free page reporting notification after buddy merging (will *not* mark
> + * the page reported, only skip the notification).
> + */
> +#define FOP_SKIP_REPORT_NOTIFY ((__force fop_t)BIT(0))
> +
> /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
> static DEFINE_MUTEX(pcp_batch_high_lock);
> #define MIN_PERCPU_PAGELIST_FRACTION (8)
> @@ -948,10 +960,9 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
> * -- nyc
> */
>
> -static inline void __free_one_page(struct page *page,
> - unsigned long pfn,
> - struct zone *zone, unsigned int order,
> - int migratetype, bool report)
> +static inline void __free_one_page(struct page *page, unsigned long pfn,
> + struct zone *zone, unsigned int order,
> + int migratetype, fop_t fop_flags)
> {
> struct capture_control *capc = task_capc(zone);
> unsigned long buddy_pfn;
> @@ -1038,7 +1049,7 @@ static inline void __free_one_page(struct page *page,
> add_to_free_list(page, zone, order, migratetype);
>
> /* Notify page reporting subsystem of freed page */
> - if (report)
> + if (!(fop_flags & FOP_SKIP_REPORT_NOTIFY))
> page_reporting_notify_free(order);
> }
>
> @@ -1368,7 +1379,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> if (unlikely(isolated_pageblocks))
> mt = get_pageblock_migratetype(page);
>
> - __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
> + __free_one_page(page, page_to_pfn(page), zone, 0, mt, FOP_NONE);
> trace_mm_page_pcpu_drain(page, 0, mt);
> }
> spin_unlock(&zone->lock);
> @@ -1384,7 +1395,7 @@ static void free_one_page(struct zone *zone,
> is_migrate_isolate(migratetype))) {
> migratetype = get_pfnblock_migratetype(page, pfn);
> }
> - __free_one_page(page, pfn, zone, order, migratetype, true);
> + __free_one_page(page, pfn, zone, order, migratetype, FOP_NONE);
> spin_unlock(&zone->lock);
> }
>
> @@ -3277,7 +3288,8 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt)
> lockdep_assert_held(&zone->lock);
>
> /* Return isolated page to tail of freelist. */
> - __free_one_page(page, page_to_pfn(page), zone, order, mt, false);
> + __free_one_page(page, page_to_pfn(page), zone, order, mt,
> + FOP_SKIP_REPORT_NOTIFY);
> }
>
> /*
>

2020-09-24 10:40:04

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] mm/page_alloc: place pages to tail in __putback_isolated_page()

On 9/16/20 8:34 PM, David Hildenbrand wrote:
> __putback_isolated_page() already documents that pages will be placed to
> the tail of the freelist - this is, however, not the case for
> "order >= MAX_ORDER - 2" (see buddy_merge_likely()) - which should be
> the case for all existing users.

I think there should be a sentence here saying something along the lines of
"Thus this patch introduces a FOP_TO_TAIL flag to really ensure moving pages
to tail."

> This change affects two users:
> - free page reporting
> - page isolation, when undoing the isolation.
>
> This behavior is desireable for pages that haven't really been touched
> lately, so exactly the two users that don't actually read/write page
> content, but rather move untouched pages.
>
> The new behavior is especially desirable for memory onlining, where we
> allow allocation of newly onlined pages via undo_isolate_page_range()
> in online_pages(). Right now, we always place them to the head of the
> free list, resulting in undesireable behavior: Assume we add
> individual memory chunks via add_memory() and online them right away to
> the NORMAL zone. We create a dependency chain of unmovable allocations
> e.g., via the memmap. The memmap of the next chunk will be placed onto
> previous chunks - if the last block cannot get offlined+removed, all
> dependent ones cannot get offlined+removed. While this can already be
> observed with individual DIMMs, it's more of an issue for virtio-mem
> (and I suspect also ppc DLPAR).
>
> Note: If we observe a degradation due to the changed page isolation
> behavior (which I doubt), we can always make this configurable by the
> instance triggering undo of isolation (e.g., alloc_contig_range(),
> memory onlining, memory offlining).
>
> Cc: Andrew Morton <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Wei Yang <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Scott Cheloha <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/page_alloc.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 91cefb8157dd..bba9a0f60c70 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -89,6 +89,12 @@ typedef int __bitwise fop_t;
> */
> #define FOP_SKIP_REPORT_NOTIFY ((__force fop_t)BIT(0))
>
> +/*
> + * Place the freed page to the tail of the freelist after buddy merging. Will
> + * get ignored with page shuffling enabled.
> + */
> +#define FOP_TO_TAIL ((__force fop_t)BIT(1))
> +
> /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
> static DEFINE_MUTEX(pcp_batch_high_lock);
> #define MIN_PERCPU_PAGELIST_FRACTION (8)
> @@ -1040,6 +1046,8 @@ static inline void __free_one_page(struct page *page, unsigned long pfn,
>
> if (is_shuffle_order(order))
> to_tail = shuffle_pick_tail();
> + else if (fop_flags & FOP_TO_TAIL)
> + to_tail = true;

Should we really let the random shuffling decision take priority over an
explicit FOP_TO_TAIL request? Wei Yang mentioned that there's a call to
shuffle_zone() anyway to process freshly added memory, so we don't need to
shuffle also during the process of addition itself? That might help with your
goal of reducing dependencies even on systems that do have shuffling enabled.

Thanks,
Vlastimil

> else
> to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
>
> @@ -3289,7 +3297,7 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt)
>
> /* Return isolated page to tail of freelist. */
> __free_one_page(page, page_to_pfn(page), zone, order, mt,
> - FOP_SKIP_REPORT_NOTIFY);
> + FOP_SKIP_REPORT_NOTIFY | FOP_TO_TAIL);
> }
>
> /*
>

2020-09-24 11:14:53

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] mm/page_alloc: always move pages to the tail of the freelist in unset_migratetype_isolate()

On 9/16/20 8:34 PM, David Hildenbrand wrote:
> Page isolation doesn't actually touch the pages, it simply isolates
> pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
>
> We already place pages to the tail of the freelists when undoing
> isolation via __putback_isolated_page(), let's do it in any case
> (e.g., if order == pageblock_order) and document the behavior.
>
> This change results in all pages getting onlined via online_pages() to
> be placed to the tail of the freelist.
>
> Cc: Andrew Morton <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Wei Yang <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Scott Cheloha <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> include/linux/page-isolation.h | 2 ++
> mm/page_alloc.c | 36 +++++++++++++++++++++++++++++-----
> mm/page_isolation.c | 8 ++++++--
> 3 files changed, 39 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index 572458016331..a36be2cf4dbb 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -38,6 +38,8 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page,
> void set_pageblock_migratetype(struct page *page, int migratetype);
> int move_freepages_block(struct zone *zone, struct page *page,
> int migratetype, int *num_movable);
> +int move_freepages_block_tail(struct zone *zone, struct page *page,
> + int migratetype);
>
> /*
> * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bba9a0f60c70..75b0f49b4022 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -899,6 +899,15 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
> list_move(&page->lru, &area->free_list[migratetype]);
> }
>
> +/* Used for pages which are on another list */
> +static inline void move_to_free_list_tail(struct page *page, struct zone *zone,
> + unsigned int order, int migratetype)
> +{
> + struct free_area *area = &zone->free_area[order];
> +
> + list_move_tail(&page->lru, &area->free_list[migratetype]);
> +}

There are just 3 callers of move_to_free_list() before this patch, so I would
just add the to_tail parameter there instead of a new wrapper. For callers with
a constant parameter, the inline will eliminate it anyway.
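
I.e. roughly (untested sketch):

	static inline void move_to_free_list(struct page *page, struct zone *zone,
					     unsigned int order, int migratetype,
					     bool to_tail)
	{
		struct free_area *area = &zone->free_area[order];

		if (to_tail)
			list_move_tail(&page->lru, &area->free_list[migratetype]);
		else
			list_move(&page->lru, &area->free_list[migratetype]);
	}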

> static inline void del_page_from_free_list(struct page *page, struct zone *zone,
> unsigned int order)
> {
> @@ -2323,7 +2332,7 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
> */
> static int move_freepages(struct zone *zone,
> struct page *start_page, struct page *end_page,
> - int migratetype, int *num_movable)
> + int migratetype, int *num_movable, bool to_tail)
> {
> struct page *page;
> unsigned int order;
> @@ -2354,7 +2363,10 @@ static int move_freepages(struct zone *zone,
> VM_BUG_ON_PAGE(page_zone(page) != zone, page);
>
> order = page_order(page);
> - move_to_free_list(page, zone, order, migratetype);
> + if (to_tail)
> + move_to_free_list_tail(page, zone, order, migratetype);
> + else
> + move_to_free_list(page, zone, order, migratetype);
> page += 1 << order;
> pages_moved += 1 << order;
> }
> @@ -2362,8 +2374,9 @@ static int move_freepages(struct zone *zone,
> return pages_moved;
> }
>
> -int move_freepages_block(struct zone *zone, struct page *page,
> - int migratetype, int *num_movable)
> +static int __move_freepages_block(struct zone *zone, struct page *page,
> + int migratetype, int *num_movable,
> + bool to_tail)
> {
> unsigned long start_pfn, end_pfn;
> struct page *start_page, *end_page;
> @@ -2384,7 +2397,20 @@ int move_freepages_block(struct zone *zone, struct page *page,
> return 0;
>
> return move_freepages(zone, start_page, end_page, migratetype,
> - num_movable);
> + num_movable, to_tail);
> +}
> +
> +int move_freepages_block(struct zone *zone, struct page *page,
> + int migratetype, int *num_movable)
> +{
> + return __move_freepages_block(zone, page, migratetype, num_movable,
> + false);
> +}
> +
> +int move_freepages_block_tail(struct zone *zone, struct page *page,
> + int migratetype)
> +{
> + return __move_freepages_block(zone, page, migratetype, NULL, true);
> }

Likewise, just 5 callers of move_freepages_block(), all in the files you're
already changing, so no need for these wrappers IMHO.

Thanks,
Vlastimil

> static void change_pageblock_range(struct page *pageblock_page,
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index abfe26ad59fd..84aa1d14751d 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -83,7 +83,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
> * Because freepage with more than pageblock_order on isolated
> * pageblock is restricted to merge due to freepage counting problem,
> * it is possible that there is free buddy page.
> - * move_freepages_block() doesn't care of merge so we need other
> + * move_freepages_block*() don't care about merging, so we need another
> * approach in order to merge them. Isolation and free will make
> * these pages to be merged.
> */
> @@ -106,9 +106,13 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
> * If we isolate freepage with more than pageblock_order, there
> * should be no freepage in the range, so we could avoid costly
> * pageblock scanning for freepage moving.
> + *
> + * We didn't actually touch any of the isolated pages, so place them
> + * to the tail of the freelists. This is especially relevant during
> + * memory onlining.
> */
> if (!isolated_page) {
> - nr_pages = move_freepages_block(zone, page, migratetype, NULL);
> + nr_pages = move_freepages_block_tail(zone, page, migratetype);
> __mod_zone_freepage_state(zone, nr_pages, migratetype);
> }
> set_pageblock_migratetype(page, migratetype);
>

2020-09-24 14:02:21

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] mm: place pages to the freelist tail when onling and undoing isolation

On 9/23/20 5:26 PM, David Hildenbrand wrote:
> On 23.09.20 16:31, Vlastimil Babka wrote:
>> On 9/16/20 9:31 PM, David Hildenbrand wrote:
>>
>
> Hi Vlastimil,
>
>> I see the point, but I don't think the head/tail mechanism is great for this. It
>> might sort of work, but with other interfering activity there are no guarantees
>> and it relies on a subtle implementation detail. There are better mechanisms
>
> For the specified use case of adding+onlining a whole bunch of memory
> this works just fine. We don't care too much about "other interfering
> activity" as you mention here, or about guarantees - this is a pure
> optimization that seems to work just fine in practice.
>
> I'm not sure about the "subtle implementation detail" - buddy merging,
> and head/tail of buddy lists are a basic concept of our page allocator.

Mel already explained that, so I won't repeat.

> If that would ever change, the optimization here would be lost and we
> would have to think of something else. Nothing would actually break -
> and it's all kept directly in page_alloc.c

Sure, but then it can become a pointless code churn.

> I'd like to stress that what I propose here is both simple and powerful.
>
>> possible I think, such as preparing a larger MIGRATE_UNMOVABLE area in the
>> existing memory before we allocate those long-term management structures. Or
>> onlining a bunch of blocks as zone_movable first and only later convert to
>> zone_normal in a controlled way when existing normal zone becomes depeted?
>
> I see the following (more or less complicated) alternatives
>
> 1) Having a larger MIGRATE_UNMOVABLE area
>
> a) Sizing it is difficult. I mean you would have to plan ahead for all
> memory you might eventually hotplug later - and that could even be

Yeah, hence my worry about existing interfaces that work on 128MB blocks
individually without a larger strategy.

> impossible if you hotplug quite a lot of memory to a smaller machine.
> (I've seen people in the vm/container world trying to hotplug 128GB
> DIMMs to 2GB VMs ... and failing for obvious reasons)

Some planning should still be possible to maximize the contiguous area without
unmovable allocations.

> b) not really desired. You usually want to have most memory movable, not
> the opposite (just because you might hotplug memory in small chunks later).
>
> 2) smarter onlining
>
> I have prototype patches for better auto-onlining (which I'll share at
> some point), where I balance between ZONE_NORMAL and ZONE_MOVABLE in a
> defined ratio. Assuming something very simple, adding separate memory
> blocks and onlining them based on the current zone ratio (assuming a 1:4
> normal:movable target ratio) would (without some other policies I have
> in place) result in something like this for hotplugged memory (via
> virtio-mem):
>
> [N][M][M][M][M][N][M][M][M][M][N][M][M][M][M]...
>
> (note: layout is suboptimal, just a simple example)
>
> But even here, all [N] memory blocks would immediately be used for
> allocations for the memmap of successive blocks. It doesn't solve the
> dependency issues.
>
> Now assume we would want to group [N] in a way to allow for gigantic
> pages, like
>
> [N][N][N][N][N][N][N][N][M][M][M][M] ....
>
> we would, once again, never be able to allocate a gigantic page because
> all [N] would contain a memmap.

The second approach should work if you know how much you are going to online
and plan the size of the N group accordingly. If the onlined amount is several
gigabytes, then only the first one (or the first X) will be unusable for a
gigantic page, but the rest would be usable? Can't get much better than that.

> 3) conversion from MOVABLE -> NORMAL
>
> While a conversion from MOVABLE to NORMAL would be interesting to see,
> it's going to be a challenging task to actually implement (people expect
> that page_zone() remains stable). Without any hacks, we'd have to
>
> 1. offline the selected (MOVABLE) memory block/chunk
> 2. online the selected memory block/chunk to the NORMAL zone
>
> This is not something we can do out of random context (for example, we
> need both, the device hotplug lock and the memory hotplug lock, as we
> might race with user space) - so there might still be a chance of
> corner-case OOMs.

Right, it's trickier than I thought.

> (I assume there could also be quite a negative performance impact when
> always relying on the conversion, and not properly planning ahead as in 2.)
>
>>
>> I guess it's an issue that the e.g. 128M block onlines are so disconnected from
>> each other it's hard to employ a strategy that works best for e.g. a whole bunch
>> of GB onlined at once. But I noticed some effort towards new API, so maybe that
>> will be solved there too?
>
> While new interfaces might make it easier to identify boundaries of
> separate DIMMs (e.g., to online a single DIMM either movable or
> unmovable - which can partially be done right now when going via memory
> resource boundaries), it doesn't help for the use case of adding
> separate memory blocks.
>
> So while having an automatic conversion from MOVABLE -> NORMAL would be
> interesting, I doubt we'll see it in the foreseeable future. Are there
> any similarly simple alternatives to optimize this?

I've reviewed the series and I won't block it - yes it's an optimistic approach
that can break and leave us with code churn. But at least it's not that much
code and the extra test in __free_one_page() shouldn't make this hotpath too much
worse. But I still hope we can achieve a more robust solution one day.

> Thanks!
>

2020-09-24 14:31:37

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] mm: place pages to the freelist tail when onling and undoing isolation

>> If that would ever change, the optimization here would be lost and we
>> would have to think of something else. Nothing would actually break -
>> and it's all kept directly in page_alloc.c
>
> Sure, but then it can become a pointless code churn.

Indeed, and if there are valid concerns that this will happen in the
near future (e.g., < 1 year), I agree that we should look into
alternatives right from the start. Otherwise it's good enough until some
of the other things I mentioned below become real (which could also take
a while ...).

>
>> I'd like to stress that what I propose here is both simple and powerful.
>>
>>> possible I think, such as preparing a larger MIGRATE_UNMOVABLE area in the
>>> existing memory before we allocate those long-term management structures. Or
>>> onlining a bunch of blocks as zone_movable first and only later convert to
>>> zone_normal in a controlled way when existing normal zone becomes depeted?
>>
>> I see the following (more or less complicated) alternatives
>>
>> 1) Having a larger MIGRATE_UNMOVABLE area
>>
>> a) Sizing it is difficult. I mean you would have to plan ahead for all
>> memory you might eventually hotplug later - and that could even be
>
> Yeah, hence my worry about existing interfaces that work on 128MB blocks
> individually without a larger strategy.

Yes, in the works :)

>
>> impossible if you hotplug quite a lot of memory to a smaller machine.
>> (I've seen people in the vm/container world trying to hotplug 128GB
>> DIMMs to 2GB VMs ... and failing for obvious reasons)
>
> Some planning should still be possible to maximize the contiguous area without
> unmovable allocations.

Indeed, optimizing that is very high on my list of things to look into ...

>>
>> we would, once again, never be able to allocate a gigantic page because
>> all [N] would contain a memmap.
>
> The second approach should work if you know how much you are going to online
> and plan the size of the N group accordingly. If the onlined amount is several
> gigabytes, then only the first one (or the first X) will be unusable for a
> gigantic page, but the rest would be usable? Can't get much better than that.

Indeed, it's the optimal case (assuming one can come up with a safe zone
balance - which is usually possible, but unfortunately, there are
exceptions one at least has to identify).

[...]

>
> I've reviewed the series and I won't block it - yes it's an optimistic approach
> that can break and leave us with code churn. But at least it's not that much

Thanks.

I'll try to document somewhere that the behavior of FOP_TO_TAIL is a
pure optimization and might change in the future - along with the case
it tried to optimize (so people know what the use case was).
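
Something along these lines above the flag, perhaps (just a sketch of the
wording):

/*
 * Place the freed page to the tail of the freelist after buddy merging.
 *
 * Note: this is a pure optimization, without any guarantees: it keeps
 * freshly onlined / un-isolated memory from getting allocated again
 * immediately, so that e.g. the memmap of the next hotplugged block
 * does not end up on the block we just added (virtio-mem, DIMMs, ...).
 * If the freelist handling ever changes, only this optimization is lost.
 */
#define FOP_TO_TAIL		((__force fop_t)BIT(1))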

> code and the extra test in __free_one_page() shouldn't make this hotpath too much

I assume the compiler is able to completely propagate constants and
optimize that out - I haven't checked, though.

> worse. But I still hope we can achieve a more robust solution one day.

I definitely agree. I'd also prefer some kind of guarantees, but I
learned that things always sound easier than they actually are when it
comes to memory management in Linux ... and they take a lot of time (for
example, Michal's/Oscar's attempts to implement vmemmap on hotadded memory).

--
Thanks,

David / dhildenb

2020-09-24 18:56:27

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH RFC 4/4] mm/page_alloc: place pages to tail in __free_pages_core()

On 9/16/20 8:34 PM, David Hildenbrand wrote:
> __free_pages_core() is used when exposing fresh memory to the buddy
> during system boot and when onlining memory in generic_online_page().
>
> generic_online_page() is used in two cases:
>
> 1. Direct memory onlining in online_pages().
> 2. Deferred memory onlining in memory-ballooning-like mechanisms (HyperV
> balloon and virtio-mem), when parts of a section are kept
> fake-offline to be fake-onlined later on.
>
> In 1, we already place pages to the tail of the freelist. Pages will be
> freed to MIGRATE_ISOLATE lists first and moved to the tail of the freelists
> via undo_isolate_page_range().
>
> In 2, we currently don't implement a proper rule. In case of virtio-mem,
> where we currently always online MAX_ORDER - 1 pages, the pages will be
> placed to the HEAD of the freelist - undesireable. While the hyper-v
> balloon calls generic_online_page() with single pages, usually it will
> call it on successive single pages in a larger block.
>
> The pages are fresh, so place them to the tail of the freelists and avoid
> the PCP.
>
> Note: If we detect that the new behavior is undesireable for
> __free_pages_core() during boot, we can let the caller specify the
> behavior.
>
> Cc: Andrew Morton <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Wei Yang <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: "K. Y. Srinivasan" <[email protected]>
> Cc: Haiyang Zhang <[email protected]>
> Cc: Stephen Hemminger <[email protected]>
> Cc: Wei Liu <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Vlastimil Babka <[email protected]>

> ---
> mm/page_alloc.c | 32 ++++++++++++++++++++------------
> 1 file changed, 20 insertions(+), 12 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 75b0f49b4022..50746e6dc21b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -264,7 +264,8 @@ bool pm_suspended_storage(void)
> unsigned int pageblock_order __read_mostly;
> #endif
>
> -static void __free_pages_ok(struct page *page, unsigned int order);
> +static void __free_pages_ok(struct page *page, unsigned int order,
> + fop_t fop_flags);
>
> /*
> * results with 256, 32 in the lowmem_reserve sysctl:
> @@ -676,7 +677,7 @@ static void bad_page(struct page *page, const char *reason)
> void free_compound_page(struct page *page)
> {
> mem_cgroup_uncharge(page);
> - __free_pages_ok(page, compound_order(page));
> + __free_pages_ok(page, compound_order(page), FOP_NONE);
> }
>
> void prep_compound_page(struct page *page, unsigned int order)
> @@ -1402,17 +1403,15 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> spin_unlock(&zone->lock);
> }
>
> -static void free_one_page(struct zone *zone,
> - struct page *page, unsigned long pfn,
> - unsigned int order,
> - int migratetype)
> +static void free_one_page(struct zone *zone, struct page *page, unsigned long pfn,
> + unsigned int order, int migratetype, fop_t fop_flags)
> {
> spin_lock(&zone->lock);
> if (unlikely(has_isolate_pageblock(zone) ||
> is_migrate_isolate(migratetype))) {
> migratetype = get_pfnblock_migratetype(page, pfn);
> }
> - __free_one_page(page, pfn, zone, order, migratetype, FOP_NONE);
> + __free_one_page(page, pfn, zone, order, migratetype, fop_flags);
> spin_unlock(&zone->lock);
> }
>
> @@ -1490,7 +1489,8 @@ void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
> }
> }
>
> -static void __free_pages_ok(struct page *page, unsigned int order)
> +static void __free_pages_ok(struct page *page, unsigned int order,
> + fop_t fop_flags)
> {
> unsigned long flags;
> int migratetype;
> @@ -1502,7 +1502,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> migratetype = get_pfnblock_migratetype(page, pfn);
> local_irq_save(flags);
> __count_vm_events(PGFREE, 1 << order);
> - free_one_page(page_zone(page), page, pfn, order, migratetype);
> + free_one_page(page_zone(page), page, pfn, order, migratetype,
> + fop_flags);
> local_irq_restore(flags);
> }
>
> @@ -1523,7 +1524,13 @@ void __free_pages_core(struct page *page, unsigned int order)
>
> atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
> set_page_refcounted(page);
> - __free_pages(page, order);
> +
> + /*
> + * Bypass PCP and place fresh pages right to the tail, primarily
> + * relevant for memory onlining.
> + */
> + page_ref_dec(page);
> + __free_pages_ok(page, order, FOP_TO_TAIL);
> }
>
> #ifdef CONFIG_NEED_MULTIPLE_NODES
> @@ -3167,7 +3174,8 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
> */
> if (migratetype >= MIGRATE_PCPTYPES) {
> if (unlikely(is_migrate_isolate(migratetype))) {
> - free_one_page(zone, page, pfn, 0, migratetype);
> + free_one_page(zone, page, pfn, 0, migratetype,
> + FOP_NONE);
> return;
> }
> migratetype = MIGRATE_MOVABLE;
> @@ -4984,7 +4992,7 @@ static inline void free_the_page(struct page *page, unsigned int order)
> if (order == 0) /* Via pcp? */
> free_unref_page(page);
> else
> - __free_pages_ok(page, order);
> + __free_pages_ok(page, order, FOP_NONE);
> }
>
> void __free_pages(struct page *page, unsigned int order)
>

2020-09-25 02:47:39

by Wei Yang

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] mm/page_alloc: always move pages to the tail of the freelist in unset_migratetype_isolate()

On Thu, Sep 24, 2020 at 01:13:29PM +0200, Vlastimil Babka wrote:
>On 9/16/20 8:34 PM, David Hildenbrand wrote:
>> Page isolation doesn't actually touch the pages, it simply isolates
>> pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
>>
>> We already place pages to the tail of the freelists when undoing
>> isolation via __putback_isolated_page(), let's do it in any case
>> (e.g., if order == pageblock_order) and document the behavior.
>>
>> This change results in all pages getting onlined via online_pages() to
>> be placed to the tail of the freelist.
>>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Alexander Duyck <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: Vlastimil Babka <[email protected]>
>> Cc: Wei Yang <[email protected]>
>> Cc: Oscar Salvador <[email protected]>
>> Cc: Mike Rapoport <[email protected]>
>> Cc: Scott Cheloha <[email protected]>
>> Cc: Michael Ellerman <[email protected]>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> include/linux/page-isolation.h | 2 ++
>> mm/page_alloc.c | 36 +++++++++++++++++++++++++++++-----
>> mm/page_isolation.c | 8 ++++++--
>> 3 files changed, 39 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
>> index 572458016331..a36be2cf4dbb 100644
>> --- a/include/linux/page-isolation.h
>> +++ b/include/linux/page-isolation.h
>> @@ -38,6 +38,8 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page,
>> void set_pageblock_migratetype(struct page *page, int migratetype);
>> int move_freepages_block(struct zone *zone, struct page *page,
>> int migratetype, int *num_movable);
>> +int move_freepages_block_tail(struct zone *zone, struct page *page,
>> + int migratetype);
>>
>> /*
>> * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index bba9a0f60c70..75b0f49b4022 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -899,6 +899,15 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
>> list_move(&page->lru, &area->free_list[migratetype]);
>> }
>>
>> +/* Used for pages which are on another list */
>> +static inline void move_to_free_list_tail(struct page *page, struct zone *zone,
>> + unsigned int order, int migratetype)
>> +{
>> + struct free_area *area = &zone->free_area[order];
>> +
>> + list_move_tail(&page->lru, &area->free_list[migratetype]);
>> +}
>
>There are just 3 callers of move_to_free_list() before this patch, so I would
>just add the to_tail parameter there instead of a new wrapper. For callers with
>a constant parameter, the inline will eliminate it anyway.

Got the same feeling :-)

>
>> static inline void del_page_from_free_list(struct page *page, struct zone *zone,
>> unsigned int order)
>> {
>> @@ -2323,7 +2332,7 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
>> */
>> static int move_freepages(struct zone *zone,
>> struct page *start_page, struct page *end_page,
>> - int migratetype, int *num_movable)
>> + int migratetype, int *num_movable, bool to_tail)
>> {
>> struct page *page;
>> unsigned int order;
>> @@ -2354,7 +2363,10 @@ static int move_freepages(struct zone *zone,
>> VM_BUG_ON_PAGE(page_zone(page) != zone, page);
>>
>> order = page_order(page);
>> - move_to_free_list(page, zone, order, migratetype);
>> + if (to_tail)
>> + move_to_free_list_tail(page, zone, order, migratetype);
>> + else
>> + move_to_free_list(page, zone, order, migratetype);
>> page += 1 << order;
>> pages_moved += 1 << order;
>> }
>> @@ -2362,8 +2374,9 @@ static int move_freepages(struct zone *zone,
>> return pages_moved;
>> }
>>
>> -int move_freepages_block(struct zone *zone, struct page *page,
>> - int migratetype, int *num_movable)
>> +static int __move_freepages_block(struct zone *zone, struct page *page,
>> + int migratetype, int *num_movable,
>> + bool to_tail)
>> {
>> unsigned long start_pfn, end_pfn;
>> struct page *start_page, *end_page;
>> @@ -2384,7 +2397,20 @@ int move_freepages_block(struct zone *zone, struct page *page,
>> return 0;
>>
>> return move_freepages(zone, start_page, end_page, migratetype,
>> - num_movable);
>> + num_movable, to_tail);
>> +}
>> +
>> +int move_freepages_block(struct zone *zone, struct page *page,
>> + int migratetype, int *num_movable)
>> +{
>> + return __move_freepages_block(zone, page, migratetype, num_movable,
>> + false);
>> +}
>> +
>> +int move_freepages_block_tail(struct zone *zone, struct page *page,
>> + int migratetype)
>> +{
>> + return __move_freepages_block(zone, page, migratetype, NULL, true);
>> }
>
>Likewise, just 5 callers of move_freepages_block(), all in the files you're
>already changing, so no need for these wrappers IMHO.
>
>Thanks,
>Vlastimil
>
>> static void change_pageblock_range(struct page *pageblock_page,
>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>> index abfe26ad59fd..84aa1d14751d 100644
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -83,7 +83,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
>> * Because freepage with more than pageblock_order on isolated
>> * pageblock is restricted to merge due to freepage counting problem,
>> * it is possible that there is free buddy page.
>> - * move_freepages_block() doesn't care of merge so we need other
>> + * move_freepages_block*() don't care about merging, so we need another
>> * approach in order to merge them. Isolation and free will make
>> * these pages to be merged.
>> */
>> @@ -106,9 +106,13 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
>> * If we isolate freepage with more than pageblock_order, there
>> * should be no freepage in the range, so we could avoid costly
>> * pageblock scanning for freepage moving.
>> + *
>> + * We didn't actually touch any of the isolated pages, so place them
>> + * to the tail of the freelists. This is especially relevant during
>> + * memory onlining.
>> */
>> if (!isolated_page) {
>> - nr_pages = move_freepages_block(zone, page, migratetype, NULL);
>> + nr_pages = move_freepages_block_tail(zone, page, migratetype);
>> __mod_zone_freepage_state(zone, nr_pages, migratetype);
>> }
>> set_pageblock_migratetype(page, migratetype);
>>

--
Wei Yang
Help you, Help me

2020-09-25 08:08:20

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] mm/page_alloc: always move pages to the tail of the freelist in unset_migratetype_isolate()

On 25.09.20 04:45, Wei Yang wrote:
> On Thu, Sep 24, 2020 at 01:13:29PM +0200, Vlastimil Babka wrote:
>> On 9/16/20 8:34 PM, David Hildenbrand wrote:
>>> Page isolation doesn't actually touch the pages, it simply isolates
>>> pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
>>>
>>> We already place pages to the tail of the freelists when undoing
>>> isolation via __putback_isolated_page(), let's do it in any case
>>> (e.g., if order == pageblock_order) and document the behavior.
>>>
>>> This change results in all pages getting onlined via online_pages() to
>>> be placed to the tail of the freelist.
>>>
>>> Cc: Andrew Morton <[email protected]>
>>> Cc: Alexander Duyck <[email protected]>
>>> Cc: Mel Gorman <[email protected]>
>>> Cc: Michal Hocko <[email protected]>
>>> Cc: Dave Hansen <[email protected]>
>>> Cc: Vlastimil Babka <[email protected]>
>>> Cc: Wei Yang <[email protected]>
>>> Cc: Oscar Salvador <[email protected]>
>>> Cc: Mike Rapoport <[email protected]>
>>> Cc: Scott Cheloha <[email protected]>
>>> Cc: Michael Ellerman <[email protected]>
>>> Signed-off-by: David Hildenbrand <[email protected]>
>>> ---
>>> include/linux/page-isolation.h | 2 ++
>>> mm/page_alloc.c | 36 +++++++++++++++++++++++++++++-----
>>> mm/page_isolation.c | 8 ++++++--
>>> 3 files changed, 39 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
>>> index 572458016331..a36be2cf4dbb 100644
>>> --- a/include/linux/page-isolation.h
>>> +++ b/include/linux/page-isolation.h
>>> @@ -38,6 +38,8 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page,
>>> void set_pageblock_migratetype(struct page *page, int migratetype);
>>> int move_freepages_block(struct zone *zone, struct page *page,
>>> int migratetype, int *num_movable);
>>> +int move_freepages_block_tail(struct zone *zone, struct page *page,
>>> + int migratetype);
>>>
>>> /*
>>> * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index bba9a0f60c70..75b0f49b4022 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -899,6 +899,15 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
>>> list_move(&page->lru, &area->free_list[migratetype]);
>>> }
>>>
>>> +/* Used for pages which are on another list */
>>> +static inline void move_to_free_list_tail(struct page *page, struct zone *zone,
>>> + unsigned int order, int migratetype)
>>> +{
>>> + struct free_area *area = &zone->free_area[order];
>>> +
>>> + list_move_tail(&page->lru, &area->free_list[migratetype]);
>>> +}
>>
>> There are just 3 callers of move_to_free_list() before this patch, so I would
>> just add the to_tail parameter there instead of a new wrapper. For callers with
>> a constant parameter, the inline will eliminate it anyway.
>
> Got the same feeling :-)

I once was told boolean parameters are the root of all evil, so I tried
to keep them file-local :)

One thing to be aware of is that inline optimizations won't help as
long as this function is in mm/page_alloc.c - see below.

>
>>
>>> static inline void del_page_from_free_list(struct page *page, struct zone *zone,
>>> unsigned int order)
>>> {
>>> @@ -2323,7 +2332,7 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
>>> */
>>> static int move_freepages(struct zone *zone,
>>> struct page *start_page, struct page *end_page,
>>> - int migratetype, int *num_movable)
>>> + int migratetype, int *num_movable, bool to_tail)
>>> {
>>> struct page *page;
>>> unsigned int order;
>>> @@ -2354,7 +2363,10 @@ static int move_freepages(struct zone *zone,
>>> VM_BUG_ON_PAGE(page_zone(page) != zone, page);
>>>
>>> order = page_order(page);
>>> - move_to_free_list(page, zone, order, migratetype);
>>> + if (to_tail)
>>> + move_to_free_list_tail(page, zone, order, migratetype);
>>> + else
>>> + move_to_free_list(page, zone, order, migratetype);
>>> page += 1 << order;
>>> pages_moved += 1 << order;
>>> }
>>> @@ -2362,8 +2374,9 @@ static int move_freepages(struct zone *zone,
>>> return pages_moved;
>>> }
>>>
>>> -int move_freepages_block(struct zone *zone, struct page *page,
>>> - int migratetype, int *num_movable)
>>> +static int __move_freepages_block(struct zone *zone, struct page *page,
>>> + int migratetype, int *num_movable,
>>> + bool to_tail)
>>> {
>>> unsigned long start_pfn, end_pfn;
>>> struct page *start_page, *end_page;
>>> @@ -2384,7 +2397,20 @@ int move_freepages_block(struct zone *zone, struct page *page,
>>> return 0;
>>>
>>> return move_freepages(zone, start_page, end_page, migratetype,
>>> - num_movable);
>>> + num_movable, to_tail);
>>> +}
>>> +
>>> +int move_freepages_block(struct zone *zone, struct page *page,
>>> + int migratetype, int *num_movable)
>>> +{
>>> + return __move_freepages_block(zone, page, migratetype, num_movable,
>>> + false);
>>> +}
>>> +
>>> +int move_freepages_block_tail(struct zone *zone, struct page *page,
>>> + int migratetype)
>>> +{
>>> + return __move_freepages_block(zone, page, migratetype, NULL, true);
>>> }
>>
>> Likewise, just 5 callers of move_freepages_block(), all in the files you're
>> already changing, so no need for these wrappers IMHO.

As long as we don't want to move the implementation to the header, we'll
need it for the constant propagation to work at compile time (we don't
really have link-time optimizations). Or am I missing something?
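
To illustrate (a sketch, assuming no LTO):

	/* mm/page_isolation.c only sees the prototype from the header: */
	int move_freepages_block(struct zone *zone, struct page *page,
				 int migratetype, int *num_movable,
				 bool to_tail);

	/*
	 * So for a call like this, 'true' is passed as a runtime argument
	 * and the to_tail branch inside move_freepages() in
	 * mm/page_alloc.c cannot be folded away:
	 */
	nr_pages = move_freepages_block(zone, page, migratetype, NULL, true);

With the wrappers (or a definition visible in the header), the constant is
known where the branch sits, so the compiler can eliminate it.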

Thanks!

--
Thanks,

David / dhildenb

2020-09-25 08:12:30

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] mm/page_alloc: place pages to tail in __putback_isolated_page()

On 24.09.20 12:37, Vlastimil Babka wrote:
> On 9/16/20 8:34 PM, David Hildenbrand wrote:
>> __putback_isolated_page() already documents that pages will be placed to
>> the tail of the freelist - this is, however, not the case for
>> "order >= MAX_ORDER - 2" (see buddy_merge_likely()) - which should be
>> the case for all existing users.
>
> I think here should be a sentence saying something along "Thus this patch
> introduces a FOP_TO_TAIL flag to really ensure moving pages to tail."

Agreed, thanks!

>
>> This change affects two users:
>> - free page reporting
>> - page isolation, when undoing the isolation.
>>
>> This behavior is desireable for pages that haven't really been touched
>> lately, so exactly the two users that don't actually read/write page
>> content, but rather move untouched pages.
>>
>> The new behavior is especially desirable for memory onlining, where we
>> allow allocation of newly onlined pages via undo_isolate_page_range()
>> in online_pages(). Right now, we always place them to the head of the
>> free list, resulting in undesireable behavior: Assume we add
>> individual memory chunks via add_memory() and online them right away to
>> the NORMAL zone. We create a dependency chain of unmovable allocations
>> e.g., via the memmap. The memmap of the next chunk will be placed onto
>> previous chunks - if the last block cannot get offlined+removed, all
>> dependent ones cannot get offlined+removed. While this can already be
>> observed with individual DIMMs, it's more of an issue for virtio-mem
>> (and I suspect also ppc DLPAR).
>>
>> Note: If we observe a degradation due to the changed page isolation
>> behavior (which I doubt), we can always make this configurable by the
>> instance triggering undo of isolation (e.g., alloc_contig_range(),
>> memory onlining, memory offlining).
>>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Alexander Duyck <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: Vlastimil Babka <[email protected]>
>> Cc: Wei Yang <[email protected]>
>> Cc: Oscar Salvador <[email protected]>
>> Cc: Mike Rapoport <[email protected]>
>> Cc: Scott Cheloha <[email protected]>
>> Cc: Michael Ellerman <[email protected]>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> mm/page_alloc.c | 10 +++++++++-
>> 1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 91cefb8157dd..bba9a0f60c70 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -89,6 +89,12 @@ typedef int __bitwise fop_t;
>> */
>> #define FOP_SKIP_REPORT_NOTIFY ((__force fop_t)BIT(0))
>>
>> +/*
>> + * Place the freed page to the tail of the freelist after buddy merging. Will
>> + * get ignored with page shuffling enabled.
>> + */
>> +#define FOP_TO_TAIL ((__force fop_t)BIT(1))
>> +
>> /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>> static DEFINE_MUTEX(pcp_batch_high_lock);
>> #define MIN_PERCPU_PAGELIST_FRACTION (8)
>> @@ -1040,6 +1046,8 @@ static inline void __free_one_page(struct page *page, unsigned long pfn,
>>
>> if (is_shuffle_order(order))
>> to_tail = shuffle_pick_tail();
>> + else if (fop_flags & FOP_TO_TAIL)
>> + to_tail = true;
>
> Should we really let random shuffling decision have a larger priority than
> explicit FOP_TO_TAIL request? Wei Yang mentioned that there's a call to
> shuffle_zone() anyway to process a freshly added memory, so we don't need to do
> that also during the process of addition itself? Might help with your goal of
> reducing dependencies even on systems that do have shuffling enabled?

So, we do have cases where generic_online_page() -> __free_pages_core()
isn't called (see patch #4):

generic_online_page() is used in two cases:

1. Direct memory onlining in online_pages(). Here, we call
shuffle_zone().
2. Deferred memory onlining in memory-ballooning-like mechanisms (HyperV
balloon and virtio-mem), when parts of a section are kept
fake-offline to be fake-onlined later on.

While we shuffle the whole zone in the first case, we wouldn't
shuffle in the second case.

But maybe this should be tackled (just like when a large contiguous range
gets freed again, when memory offlining fails, or when alloc_contig_range()
fails) by manually shuffling the zone again. That would be cleaner,
and the right thing to do when exposing large, contiguous ranges again
to the buddy.

Thanks!


--
Thanks,

David / dhildenb

2020-09-25 08:43:25

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] mm/page_alloc: always move pages to the tail of the freelist in unset_migratetype_isolate()

On 9/25/20 10:05 AM, David Hildenbrand wrote:
>>>> static inline void del_page_from_free_list(struct page *page, struct zone *zone,
>>>> unsigned int order)
>>>> {
>>>> @@ -2323,7 +2332,7 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
>>>> */
>>>> static int move_freepages(struct zone *zone,
>>>> struct page *start_page, struct page *end_page,
>>>> - int migratetype, int *num_movable)
>>>> + int migratetype, int *num_movable, bool to_tail)
>>>> {
>>>> struct page *page;
>>>> unsigned int order;
>>>> @@ -2354,7 +2363,10 @@ static int move_freepages(struct zone *zone,
>>>> VM_BUG_ON_PAGE(page_zone(page) != zone, page);
>>>>
>>>> order = page_order(page);
>>>> - move_to_free_list(page, zone, order, migratetype);
>>>> + if (to_tail)
>>>> + move_to_free_list_tail(page, zone, order, migratetype);
>>>> + else
>>>> + move_to_free_list(page, zone, order, migratetype);
>>>> page += 1 << order;
>>>> pages_moved += 1 << order;
>>>> }
>>>> @@ -2362,8 +2374,9 @@ static int move_freepages(struct zone *zone,
>>>> return pages_moved;
>>>> }
>>>>
>>>> -int move_freepages_block(struct zone *zone, struct page *page,
>>>> - int migratetype, int *num_movable)
>>>> +static int __move_freepages_block(struct zone *zone, struct page *page,
>>>> + int migratetype, int *num_movable,
>>>> + bool to_tail)
>>>> {
>>>> unsigned long start_pfn, end_pfn;
>>>> struct page *start_page, *end_page;
>>>> @@ -2384,7 +2397,20 @@ int move_freepages_block(struct zone *zone, struct page *page,
>>>> return 0;
>>>>
>>>> return move_freepages(zone, start_page, end_page, migratetype,
>>>> - num_movable);
>>>> + num_movable, to_tail);
>>>> +}
>>>> +
>>>> +int move_freepages_block(struct zone *zone, struct page *page,
>>>> + int migratetype, int *num_movable)
>>>> +{
>>>> + return __move_freepages_block(zone, page, migratetype, num_movable,
>>>> + false);
>>>> +}
>>>> +
>>>> +int move_freepages_block_tail(struct zone *zone, struct page *page,
>>>> + int migratetype)
>>>> +{
>>>> + return __move_freepages_block(zone, page, migratetype, NULL, true);
>>>> }
>>>
>>> Likewise, just 5 callers of move_freepages_block(), all in the files you're
>>> already changing, so no need for these wrappers IMHO.
>
> As long as we don't want to move the implementation to the header, we'll
> need it for the constant propagation to work at compile time (we don't
> really have link-time optimizations). Or am I missing something?

I guess move_freepages_block() is not exactly a fast path, so we could do without it.

> Thanks!
>

2020-09-25 10:38:15

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH RFC 1/4] mm/page_alloc: convert "report" flag of __free_one_page() to a proper flag

On Wed, Sep 16, 2020 at 08:34:08PM +0200, David Hildenbrand wrote:
> Let's prepare for additional flags and avoid long parameter lists of bools.
> Follow-up patches will also make use of the flags in __free_pages_ok(),
> however, I wasn't able to come up with a better name for the type - should
> be good enough for internal purposes.
>
> Cc: Andrew Morton <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Wei Yang <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Oscar Salvador <[email protected]>

--
Oscar Salvador
SUSE L3

2020-09-25 13:15:44

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] mm/page_alloc: always move pages to the tail of the freelist in unset_migratetype_isolate()

On 24.09.20 13:13, Vlastimil Babka wrote:
> On 9/16/20 8:34 PM, David Hildenbrand wrote:
>> Page isolation doesn't actually touch the pages, it simply isolates
>> pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
>>
>> We already place pages to the tail of the freelists when undoing
>> isolation via __putback_isolated_page(), let's do it in any case
>> (e.g., if order == pageblock_order) and document the behavior.
>>
>> This change results in all pages getting onlined via online_pages() to
>> be placed to the tail of the freelist.
>>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Alexander Duyck <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: Vlastimil Babka <[email protected]>
>> Cc: Wei Yang <[email protected]>
>> Cc: Oscar Salvador <[email protected]>
>> Cc: Mike Rapoport <[email protected]>
>> Cc: Scott Cheloha <[email protected]>
>> Cc: Michael Ellerman <[email protected]>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> include/linux/page-isolation.h | 2 ++
>> mm/page_alloc.c | 36 +++++++++++++++++++++++++++++-----
>> mm/page_isolation.c | 8 ++++++--
>> 3 files changed, 39 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
>> index 572458016331..a36be2cf4dbb 100644
>> --- a/include/linux/page-isolation.h
>> +++ b/include/linux/page-isolation.h
>> @@ -38,6 +38,8 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page,
>> void set_pageblock_migratetype(struct page *page, int migratetype);
>> int move_freepages_block(struct zone *zone, struct page *page,
>> int migratetype, int *num_movable);
>> +int move_freepages_block_tail(struct zone *zone, struct page *page,
>> + int migratetype);
>>
>> /*
>> * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index bba9a0f60c70..75b0f49b4022 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -899,6 +899,15 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
>> list_move(&page->lru, &area->free_list[migratetype]);
>> }
>>
>> +/* Used for pages which are on another list */
>> +static inline void move_to_free_list_tail(struct page *page, struct zone *zone,
>> + unsigned int order, int migratetype)
>> +{
>> + struct free_area *area = &zone->free_area[order];
>> +
>> + list_move_tail(&page->lru, &area->free_list[migratetype]);
>> +}
>
> There are just 3 callers of move_to_free_list() before this patch, so I would
> just add the to_tail parameter there instead of a new wrapper. For callers with
> a constant parameter, the inline will eliminate it anyway.
>

So, I'll leave this as is for now; it nicely pairs with
add_to_free_list()/add_to_free_list_tail() and ...

[...]

>> +int move_freepages_block_tail(struct zone *zone, struct page *page,
>> + int migratetype)
>> +{
>> + return __move_freepages_block(zone, page, migratetype, NULL, true);
>> }
>

I'll drop these wrappers and convert the callers of move_freepages_block() instead.

Thanks!


--
Thanks,

David / dhildenb

2020-09-25 13:33:41

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] mm/page_alloc: place pages to tail in __putback_isolated_page()

On 25.09.20 15:19, Oscar Salvador wrote:
> On Wed, Sep 16, 2020 at 08:34:09PM +0200, David Hildenbrand wrote:
>> __putback_isolated_page() already documents that pages will be placed to
>> the tail of the freelist - this is, however, not the case for
>> "order >= MAX_ORDER - 2" (see buddy_merge_likely()) - which should be
>> the case for all existing users.
>>
>> This change affects two users:
>> - free page reporting
>> - page isolation, when undoing the isolation.
>>
>> This behavior is desireable for pages that haven't really been touched
>> lately, so exactly the two users that don't actually read/write page
>> content, but rather move untouched pages.
>>
>> The new behavior is especially desirable for memory onlining, where we
>> allow allocation of newly onlined pages via undo_isolate_page_range()
>> in online_pages(). Right now, we always place them to the head of the
>> free list, resulting in undesireable behavior: Assume we add
>> individual memory chunks via add_memory() and online them right away to
>> the NORMAL zone. We create a dependency chain of unmovable allocations
>> e.g., via the memmap. The memmap of the next chunk will be placed onto
>> previous chunks - if the last block cannot get offlined+removed, all
>> dependent ones cannot get offlined+removed. While this can already be
>> observed with individual DIMMs, it's more of an issue for virtio-mem
>> (and I suspect also ppc DLPAR).
>>
>> Note: If we observe a degradation due to the changed page isolation
>> behavior (which I doubt), we can always make this configurable by the
>> instance triggering undo of isolation (e.g., alloc_contig_range(),
>> memory onlining, memory offlining).
>>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Alexander Duyck <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: Vlastimil Babka <[email protected]>
>> Cc: Wei Yang <[email protected]>
>> Cc: Oscar Salvador <[email protected]>
>> Cc: Mike Rapoport <[email protected]>
>> Cc: Scott Cheloha <[email protected]>
>> Cc: Michael Ellerman <[email protected]>
>> Signed-off-by: David Hildenbrand <[email protected]>
>
> LGTM, the only thing is the shuffle_zone topic that Wei and Vlastimil raised.
> It feels a bit odd that shuffling takes precedence over something we explicitly demanded.
>

Thanks, yeah I'll be changing that.
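
Presumably just by checking the flag first - something like this (a sketch,
with the buddy_merge_likely() fallback taken from the existing code):

	if (fop_flags & FOP_TO_TAIL)
		to_tail = true;
	else if (is_shuffle_order(order))
		to_tail = shuffle_pick_tail();
	else
		to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);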

> With the comment Vlastimil suggested:
>
> Reviewed-by: Oscar Salvador <[email protected]>
>

--
Thanks,

David / dhildenb

2020-09-25 13:35:03

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH RFC 2/4] mm/page_alloc: place pages to tail in __putback_isolated_page()

On Wed, Sep 16, 2020 at 08:34:09PM +0200, David Hildenbrand wrote:
> __putback_isolated_page() already documents that pages will be placed to
> the tail of the freelist - this is, however, not the case for
> "order >= MAX_ORDER - 2" (see buddy_merge_likely()) - which should be
> the case for all existing users.
>
> This change affects two users:
> - free page reporting
> - page isolation, when undoing the isolation.
>
> This behavior is desireable for pages that haven't really been touched
> lately, so exactly the two users that don't actually read/write page
> content, but rather move untouched pages.
>
> The new behavior is especially desirable for memory onlining, where we
> allow allocation of newly onlined pages via undo_isolate_page_range()
> in online_pages(). Right now, we always place them to the head of the
> free list, resulting in undesireable behavior: Assume we add
> individual memory chunks via add_memory() and online them right away to
> the NORMAL zone. We create a dependency chain of unmovable allocations
> e.g., via the memmap. The memmap of the next chunk will be placed onto
> previous chunks - if the last block cannot get offlined+removed, all
> dependent ones cannot get offlined+removed. While this can already be
> observed with individual DIMMs, it's more of an issue for virtio-mem
> (and I suspect also ppc DLPAR).
>
> Note: If we observe a degradation due to the changed page isolation
> behavior (which I doubt), we can always make this configurable by the
> instance triggering undo of isolation (e.g., alloc_contig_range(),
> memory onlining, memory offlining).
>
> Cc: Andrew Morton <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Wei Yang <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Scott Cheloha <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

LGTM, the only thing is the shuffle_zone topic that Wei and Vlastimil raised.
It feels a bit odd that shuffling takes precedence over something we explicitly demanded.

With the comment Vlastimil suggested:

Reviewed-by: Oscar Salvador <[email protected]>

--
Oscar Salvador
SUSE L3

2020-09-25 14:00:38

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] mm/page_alloc: always move pages to the tail of the freelist in unset_migratetype_isolate()

On Wed, Sep 16, 2020 at 08:34:10PM +0200, David Hildenbrand wrote:
> Page isolation doesn't actually touch the pages, it simply isolates
> pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
>
> We already place pages to the tail of the freelists when undoing
> isolation via __putback_isolated_page(), let's do it in any case
> (e.g., if order == pageblock_order) and document the behavior.
>
> This change results in all pages getting onlined via online_pages() to
> be placed to the tail of the freelist.
>
> Cc: Andrew Morton <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Wei Yang <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Scott Cheloha <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

LGTM.
I feel the same way about the move_freepages_block()/move_freepages_block_tail()
wrappers; I think we are better off without them.

Reviewed-by: Oscar Salvador <[email protected]>

Thanks

--
Oscar Salvador
SUSE L3

2020-09-28 08:01:48

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH RFC 4/4] mm/page_alloc: place pages to tail in __free_pages_core()

On Wed, Sep 16, 2020 at 08:34:11PM +0200, David Hildenbrand wrote:
> @@ -1523,7 +1524,13 @@ void __free_pages_core(struct page *page, unsigned int order)
>
> atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
> set_page_refcounted(page);
> - __free_pages(page, order);
> +
> + /*
> + * Bypass PCP and place fresh pages right to the tail, primarily
> + * relevant for memory onlining.
> + */
> + page_ref_dec(page);
> + __free_pages_ok(page, order, FOP_TO_TAIL);

Sorry, I must be missing something obvious, but I am a bit confused here.
I get the part of placing them at the tail so rmqueue_bulk() won't
find them, but I do not get why we decrement page's refcount.
IIUC, its refcount will be 0, but why do we want to do that?

Another thing, a bit unrelated... we mess with the page's refcount three times
(twice before this patch).
Why do we have this dance in place?

Thanks

--
Oscar Salvador
SUSE L3

2020-09-28 08:37:53

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 4/4] mm/page_alloc: place pages to tail in __free_pages_core()

On 28.09.20 09:58, Oscar Salvador wrote:
> On Wed, Sep 16, 2020 at 08:34:11PM +0200, David Hildenbrand wrote:
>> @@ -1523,7 +1524,13 @@ void __free_pages_core(struct page *page, unsigned int order)
>>
>> atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
>> set_page_refcounted(page);
>> - __free_pages(page, order);
>> +
>> + /*
>> + * Bypass PCP and place fresh pages right to the tail, primarily
>> + * relevant for memory onlining.
>> + */
>> + page_ref_dec(page);
>> + __free_pages_ok(page, order, FOP_TO_TAIL);
>
> Sorry, I must be missing something obvious, but I am a bit confused here.
> I get the part of placing them at the tail so rmqueue_bulk() won't
> find them, but I do not get why we decrement page's refcount.
> IIUC, its refcount will be 0, but why do we want to do that?
>
> Another thing, a bit unrelated... we mess with the page's refcount three times
> (twice before this patch).
> Why do we have this dance in place?

Hi Oscar!

Old code:

set_page_refcounted(): sets the refcount to 1.
__free_pages()
-> put_page_testzero(): sets it to 0
-> free_the_page()->__free_pages_ok()

New code:

set_page_refcounted(): sets the refcount to 1.
page_ref_dec(page): sets it to 0
__free_pages_ok():


We could skip the set_page_refcounted() + page_ref_dec(page) and lose a
couple of sanity checks but we could simply use a
VM_BUG_ON_PAGE(page_ref_count(page), page), which is what we really care
about when onlining memory.
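
I.e., the end of __free_pages_core() could then simply be (a sketch of that
variant, not what the patch currently does):

	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);

	/*
	 * Fresh pages arrive here with a refcount of 0; assert that
	 * instead of bumping it to 1 and dropping it again right away.
	 */
	VM_BUG_ON_PAGE(page_ref_count(page), page);

	/*
	 * Bypass the PCP and place fresh pages right to the tail,
	 * primarily relevant for memory onlining.
	 */
	__free_pages_ok(page, order, FOP_TO_TAIL);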

--
Thanks,

David / dhildenb

2020-09-28 12:58:06

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH RFC 4/4] mm/page_alloc: place pages to tail in __free_pages_core()

On Mon, Sep 28, 2020 at 10:36:00AM +0200, David Hildenbrand wrote:
> Hi Oscar!

Hi David :-)

>
> Old code:
>
> set_page_refcounted(): sets the refcount to 1.
> __free_pages()
> -> put_page_testzero(): sets it to 0
> -> free_the_page()->__free_pages_ok()
>
> New code:
>
> set_page_refcounted(): sets the refcount to 1.
> page_ref_dec(page): sets it to 0
> __free_pages_ok():

bleh, I misread the patch, somehow I managed to not see that you replaced
__free_pages with __free_pages_ok.

To be honest, now that we do not need the page's refcount to be 1 for
put_page_testzero() to trigger (and since you are decrementing it anyway),
I think it would be much clearer for those two to be gone.

But I don't feel strongly about it, so:

Reviewed-by: Oscar Salvador <[email protected]>

--
Oscar Salvador
SUSE L3