2017-03-15 09:13:57

by Michal Hocko

Subject: [RFC PATCH] rework memory hotplug onlining

Hi,
this is a follow up to [1]. In short, the current semantics of memory
hotplug are awkward and hard/impossible to use from udev to online
memory as movable. The main problem is that only the last memblock, or
one adjacent to the highest movable memblock, can be onlined as movable:
: Let's simulate memory hot online manually
: # echo 0x100000000 > /sys/devices/system/memory/probe
: # grep . /sys/devices/system/memory/memory32/valid_zones
: Normal Movable
:
: which looks reasonably right? Both Normal and Movable zones are allowed
:
: # echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
: # grep . /sys/devices/system/memory/memory3?/valid_zones
: /sys/devices/system/memory/memory32/valid_zones:Normal
: /sys/devices/system/memory/memory33/valid_zones:Normal Movable
:
: Huh, so our valid_zones have changed under our feet...
:
: # echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
: # grep . /sys/devices/system/memory/memory3?/valid_zones
: /sys/devices/system/memory/memory32/valid_zones:Normal
: /sys/devices/system/memory/memory33/valid_zones:Normal
: /sys/devices/system/memory/memory34/valid_zones:Normal Movable
:
: and again. So only the last memblock is considered movable. Let's try to
: online them now.
:
: # echo online_movable > /sys/devices/system/memory/memory34/state
: # grep . /sys/devices/system/memory/memory3?/valid_zones
: /sys/devices/system/memory/memory32/valid_zones:Normal
: /sys/devices/system/memory/memory33/valid_zones:Normal Movable
: /sys/devices/system/memory/memory34/valid_zones:Movable Normal

Now consider that userspace gets the notification when a memblock
is added. If the udev context tries to online it, it will a) race with
new memblocks showing up, which leads to nondeterministic behavior, and
b) it will see memblocks ordered by increasing physical address, while
the only reliable way to online blocks as movable is exactly the
opposite direction. This is just plain wrong!
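
For illustration, the kind of udev rule in question looks roughly like
the following (a sketch only; the exact rule file name and match keys
differ between distributions):

  SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online_movable"

The rule fires once per memblock, in the order the blocks show up, which
is exactly the ordering that cannot work reliably for online_movable as
described above.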

It seems that all this started with the semantics introduced by
9d99aaa31f59 ("[PATCH] x86_64: Support memory hotadd without sparsemem")
quite some time ago. When movable onlining was introduced it just built
on top of this. It seems that the requirement to have freshly probed
memory associated with zone Normal is no longer necessary. Memory
hotplug depends on CONFIG_SPARSEMEM these days.

The following blob [2] simply removes all the zone specific operations
from the __add_pages (aka arch_add_memory) path. Instead we do the
page->zone association in move_pfn_range, which is called from
online_pages. The criterion for the movable/normal zone association is
really simple now: we just have to guarantee that zone Normal always
stays below zone Movable. It would actually be sufficient to guarantee
that they do not overlap, and that is indeed trivial to implement now.
I didn't do that yet to keep this change simple, though.
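
In code the new rule boils down to roughly the following (a condensed
sketch of allow_online_pfn_range() from the blob below, taking the
per-node Normal and Movable zones as inputs):

	/* Normal may only grow below Movable, Movable only above Normal */
	if (online_type == MMOP_ONLINE_KERNEL)
		allowed = !populated_zone(movable_zone) ||
			  movable_zone->zone_start_pfn >= pfn + nr_pages;
	else if (online_type == MMOP_ONLINE_MOVABLE)
		allowed = zone_end_pfn(normal_zone) <= pfn;
	else /* MMOP_ONLINE_KEEP inherits the current zone */
		allowed = true;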

I have lightly tested the patch and nothing really jumped out at me. I
assume there will be some rough edges but it should be sufficient to
start the discussion at least. Please note the diffstat. We have added
a lot of code to tweak on top of the previous semantics, which is just
sad. Instead of developing a robust solution, memory hotplug is full of
tweaks to satisfy particular use cases without any longer term plan.

Please note that this is just for x86 now but I will address other
arches once there is agreement that this is the right approach.

Thoughts, objections?

[1] http://lkml.kernel.org/r/[email protected]
[2] I didn't get to split it out into proper patches yet because I would
like to hear back about potential issues which would basically block the
whole thing before I spend more time on this
---
arch/x86/mm/init_32.c | 4 +-
arch/x86/mm/init_64.c | 9 +-
drivers/base/memory.c | 52 +++---
include/linux/memory_hotplug.h | 11 +-
include/linux/mmzone.h | 13 +-
mm/memory_hotplug.c | 360 +++++++++++------------------------------
mm/page_alloc.c | 11 +-
mm/sparse.c | 3 +-
8 files changed, 137 insertions(+), 326 deletions(-)

diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 928cfde76232..9deded3728bf 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -819,12 +819,10 @@ void __init mem_init(void)
int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
{
struct pglist_data *pgdata = NODE_DATA(nid);
- struct zone *zone = pgdata->node_zones +
- zone_for_memory(nid, start, size, ZONE_HIGHMEM, for_device);
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;

- return __add_pages(nid, zone, start_pfn, nr_pages);
+ return __add_pages(nid, start_pfn, nr_pages);
}

#ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 7eef17239378..bc53f24e6703 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -637,22 +637,15 @@ static void update_end_of_memory_vars(u64 start, u64 size)
}
}

-/*
- * Memory is added always to NORMAL zone. This means you will never get
- * additional DMA/DMA32 memory.
- */
int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
{
- struct pglist_data *pgdat = NODE_DATA(nid);
- struct zone *zone = pgdat->node_zones +
- zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;

init_memory_mapping(start, start + size);

- ret = __add_pages(nid, zone, start_pfn, nr_pages);
+ ret = __add_pages(nid, start_pfn, nr_pages);
WARN_ON_ONCE(ret);

/* update max_pfn, max_low_pfn and high_memory */
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index cc4f1d0cbffe..3aea6a3b53e5 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -281,6 +281,7 @@ static int memory_subsys_online(struct device *dev)
* If we are called from store_mem_state(), online_type will be
* set >= 0 Otherwise we were called from the device online
* attribute and need to set the online_type.
+ * TODO WTF is this?
*/
if (mem->online_type < 0)
mem->online_type = MMOP_ONLINE_KEEP;
@@ -388,39 +389,44 @@ static ssize_t show_valid_zones(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct memory_block *mem = to_memory_block(dev);
- unsigned long start_pfn, end_pfn;
- unsigned long valid_start, valid_end, valid_pages;
- unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+ unsigned long start_pfn, nr_pages;
+ bool append = false;
struct zone *zone;
- int zone_shift = 0;
+ int nid;

start_pfn = section_nr_to_pfn(mem->start_section_nr);
- end_pfn = start_pfn + nr_pages;
+ zone = page_zone(pfn_to_page(start_pfn));
+ nr_pages = PAGES_PER_SECTION * sections_per_block;

- /* The block contains more than one zone can not be offlined. */
- if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start, &valid_end))
+ /*
+ * The block contains more than one zone can not be offlined.
+ * This can happen e.g. for ZONE_DMA and ZONE_DMA32
+ */
+ if (!test_pages_in_a_zone(start_pfn, start_pfn + nr_pages, NULL, NULL))
return sprintf(buf, "none\n");

- zone = page_zone(pfn_to_page(valid_start));
- valid_pages = valid_end - valid_start;
-
- /* MMOP_ONLINE_KEEP */
- sprintf(buf, "%s", zone->name);
-
- /* MMOP_ONLINE_KERNEL */
- zone_can_shift(valid_start, valid_pages, ZONE_NORMAL, &zone_shift);
- if (zone_shift) {
- strcat(buf, " ");
- strcat(buf, (zone + zone_shift)->name);
+ /*
+ * Check the existing zone. Page which is not online will point to zone 0
+ * so double check it belongs to that zone
+ * TODO should we add ZONE_UNINITIALIZED?
+ */
+ if (zone_spans_pfn(zone, start_pfn)) {
+ strcat(buf, zone->name);
+ goto out;
}

- /* MMOP_ONLINE_MOVABLE */
- zone_can_shift(valid_start, valid_pages, ZONE_MOVABLE, &zone_shift);
- if (zone_shift) {
- strcat(buf, " ");
- strcat(buf, (zone + zone_shift)->name);
+ nid = pfn_to_nid(start_pfn);
+ if (allow_online_pfn_range(nid, start_pfn, nr_pages, MMOP_ONLINE_KERNEL)) {
+ strcat(buf, NODE_DATA(nid)->node_zones[ZONE_NORMAL].name);
+ append = true;
}

+ if (allow_online_pfn_range(nid, start_pfn, nr_pages, MMOP_ONLINE_MOVABLE)) {
+ if (append)
+ strcat(buf, " ");
+ strcat(buf, NODE_DATA(nid)->node_zones[ZONE_MOVABLE].name);
+ }
+out:
strcat(buf, "\n");

return strlen(buf);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 134a2f69c21a..375f6e152649 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -109,8 +109,8 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages);
#endif /* CONFIG_MEMORY_HOTREMOVE */

-/* reasonably generic interface to expand the physical pages in a zone */
-extern int __add_pages(int nid, struct zone *zone, unsigned long start_pfn,
+/* reasonably generic interface to expand the physical pages */
+extern int __add_pages(int nid, unsigned long start_pfn,
unsigned long nr_pages);

#ifdef CONFIG_NUMA
@@ -280,12 +280,11 @@ extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
extern bool is_memblock_offlined(struct memory_block *mem);
extern void remove_memory(int nid, u64 start, u64 size);
-extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn);
+extern int sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn);
extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
unsigned long map_offset);
extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
unsigned long pnum);
-extern bool zone_can_shift(unsigned long pfn, unsigned long nr_pages,
- enum zone_type target, int *zone_shift);
-
+extern bool allow_online_pfn_range(int nid, unsigned long pfn, unsigned long nr_pages,
+ int online_type);
#endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 618499159a7c..c86c78617d17 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -442,8 +442,6 @@ struct zone {
seqlock_t span_seqlock;
#endif

- int initialized;
-
/* Write-intensive fields used from the page allocator */
ZONE_PADDING(_pad1_)

@@ -520,14 +518,15 @@ static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
}

-static inline bool zone_is_initialized(struct zone *zone)
+static inline bool zone_is_empty(struct zone *zone)
{
- return zone->initialized;
+ return zone->spanned_pages == 0;
}

-static inline bool zone_is_empty(struct zone *zone)
+static inline bool zone_spans_range(const struct zone *zone, unsigned long start_pfn,
+ unsigned long nr_pages)
{
- return zone->spanned_pages == 0;
+ return zone->zone_start_pfn <= start_pfn && start_pfn + nr_pages < zone_end_pfn(zone);
}

/*
@@ -769,7 +768,7 @@ enum memmap_context {
MEMMAP_EARLY,
MEMMAP_HOTPLUG,
};
-extern int init_currently_empty_zone(struct zone *zone, unsigned long start_pfn,
+extern void init_currently_empty_zone(struct zone *zone, unsigned long start_pfn,
unsigned long size);

extern void lruvec_init(struct lruvec *lruvec);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6fb6bd2df787..4dbed7dc66a1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -295,229 +295,98 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
}
#endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */

-static void __meminit grow_zone_span(struct zone *zone, unsigned long start_pfn,
- unsigned long end_pfn)
+static void __meminit resize_zone(struct zone *zone, unsigned long start_pfn,
+ unsigned long nr_pages)
{
- unsigned long old_zone_end_pfn;
-
- zone_span_writelock(zone);
+ unsigned long old_end_pfn = zone_end_pfn(zone);

- old_zone_end_pfn = zone_end_pfn(zone);
- if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn)
+ if (start_pfn < zone->zone_start_pfn)
zone->zone_start_pfn = start_pfn;

- zone->spanned_pages = max(old_zone_end_pfn, end_pfn) -
- zone->zone_start_pfn;
-
- zone_span_writeunlock(zone);
-}
-
-static void resize_zone(struct zone *zone, unsigned long start_pfn,
- unsigned long end_pfn)
-{
- zone_span_writelock(zone);
-
- if (end_pfn - start_pfn) {
- zone->zone_start_pfn = start_pfn;
- zone->spanned_pages = end_pfn - start_pfn;
- } else {
- /*
- * make it consist as free_area_init_core(),
- * if spanned_pages = 0, then keep start_pfn = 0
- */
- zone->zone_start_pfn = 0;
- zone->spanned_pages = 0;
- }
-
- zone_span_writeunlock(zone);
-}
-
-static void fix_zone_id(struct zone *zone, unsigned long start_pfn,
- unsigned long end_pfn)
-{
- enum zone_type zid = zone_idx(zone);
- int nid = zone->zone_pgdat->node_id;
- unsigned long pfn;
-
- for (pfn = start_pfn; pfn < end_pfn; pfn++)
- set_page_links(pfn_to_page(pfn), zid, nid, pfn);
+ zone->spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - zone->zone_start_pfn;
}

-/* Can fail with -ENOMEM from allocating a wait table with vmalloc() or
- * alloc_bootmem_node_nopanic()/memblock_virt_alloc_node_nopanic() */
-static int __ref ensure_zone_is_initialized(struct zone *zone,
- unsigned long start_pfn, unsigned long num_pages)
+static void __meminit resize_pgdat(struct pglist_data *pgdat, unsigned long start_pfn,
+ unsigned long nr_pages)
{
- if (!zone_is_initialized(zone))
- return init_currently_empty_zone(zone, start_pfn, num_pages);
-
- return 0;
-}
-
-static int __meminit move_pfn_range_left(struct zone *z1, struct zone *z2,
- unsigned long start_pfn, unsigned long end_pfn)
-{
- int ret;
- unsigned long flags;
- unsigned long z1_start_pfn;
-
- ret = ensure_zone_is_initialized(z1, start_pfn, end_pfn - start_pfn);
- if (ret)
- return ret;
-
- pgdat_resize_lock(z1->zone_pgdat, &flags);
-
- /* can't move pfns which are higher than @z2 */
- if (end_pfn > zone_end_pfn(z2))
- goto out_fail;
- /* the move out part must be at the left most of @z2 */
- if (start_pfn > z2->zone_start_pfn)
- goto out_fail;
- /* must included/overlap */
- if (end_pfn <= z2->zone_start_pfn)
- goto out_fail;
-
- /* use start_pfn for z1's start_pfn if z1 is empty */
- if (!zone_is_empty(z1))
- z1_start_pfn = z1->zone_start_pfn;
- else
- z1_start_pfn = start_pfn;
-
- resize_zone(z1, z1_start_pfn, end_pfn);
- resize_zone(z2, end_pfn, zone_end_pfn(z2));
-
- pgdat_resize_unlock(z1->zone_pgdat, &flags);
-
- fix_zone_id(z1, start_pfn, end_pfn);
+ unsigned long old_end_pfn = pgdat_end_pfn(pgdat);
+ if (start_pfn < pgdat->node_start_pfn)
+ pgdat->node_start_pfn = start_pfn;

- return 0;
-out_fail:
- pgdat_resize_unlock(z1->zone_pgdat, &flags);
- return -1;
+ pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn;
}

-static int __meminit move_pfn_range_right(struct zone *z1, struct zone *z2,
- unsigned long start_pfn, unsigned long end_pfn)
+/*
+ * Associates the given pfn range with the given node and the zone appropriate
+ * for the given online type.
+ */
+static struct zone * __meminit move_pfn_range(int online_type, int nid,
+ unsigned long start_pfn, unsigned long nr_pages)
{
- int ret;
+ struct pglist_data *pgdat = NODE_DATA(nid);
+ struct zone *zone = &pgdat->node_zones[ZONE_NORMAL];
unsigned long flags;
- unsigned long z2_end_pfn;
-
- ret = ensure_zone_is_initialized(z2, start_pfn, end_pfn - start_pfn);
- if (ret)
- return ret;
-
- pgdat_resize_lock(z1->zone_pgdat, &flags);
-
- /* can't move pfns which are lower than @z1 */
- if (z1->zone_start_pfn > start_pfn)
- goto out_fail;
- /* the move out part mast at the right most of @z1 */
- if (zone_end_pfn(z1) > end_pfn)
- goto out_fail;
- /* must included/overlap */
- if (start_pfn >= zone_end_pfn(z1))
- goto out_fail;
-
- /* use end_pfn for z2's end_pfn if z2 is empty */
- if (!zone_is_empty(z2))
- z2_end_pfn = zone_end_pfn(z2);
- else
- z2_end_pfn = end_pfn;
-
- resize_zone(z1, z1->zone_start_pfn, start_pfn);
- resize_zone(z2, start_pfn, z2_end_pfn);
-
- pgdat_resize_unlock(z1->zone_pgdat, &flags);
-
- fix_zone_id(z2, start_pfn, end_pfn);
-
- return 0;
-out_fail:
- pgdat_resize_unlock(z1->zone_pgdat, &flags);
- return -1;
-}
-
-static struct zone * __meminit move_pfn_range(int zone_shift,
- unsigned long start_pfn, unsigned long end_pfn)
-{
- struct zone *zone = page_zone(pfn_to_page(start_pfn));
- int ret = 0;
-
- if (zone_shift < 0)
- ret = move_pfn_range_left(zone + zone_shift, zone,
- start_pfn, end_pfn);
- else if (zone_shift)
- ret = move_pfn_range_right(zone, zone + zone_shift,
- start_pfn, end_pfn);
-
- if (ret)
- return NULL;
-
- return zone + zone_shift;
-}
-
-static void __meminit grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
- unsigned long end_pfn)
-{
- unsigned long old_pgdat_end_pfn = pgdat_end_pfn(pgdat);
-
- if (!pgdat->node_spanned_pages || start_pfn < pgdat->node_start_pfn)
- pgdat->node_start_pfn = start_pfn;
-
- pgdat->node_spanned_pages = max(old_pgdat_end_pfn, end_pfn) -
- pgdat->node_start_pfn;
-}
+ unsigned long i;

-static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn)
-{
- struct pglist_data *pgdat = zone->zone_pgdat;
- int nr_pages = PAGES_PER_SECTION;
- int nid = pgdat->node_id;
- int zone_type;
- unsigned long flags, pfn;
- int ret;
+ if (online_type == MMOP_ONLINE_KEEP) {
+ /*
+ * MMOP_ONLINE_KEEP inherits the current zone which is
+ * ZONE_NORMAL by default but we might be within ZONE_MOVABLE
+ * already.
+ */
+ if (zone_spans_range(&pgdat->node_zones[ZONE_MOVABLE], start_pfn, nr_pages))
+ zone = &pgdat->node_zones[ZONE_MOVABLE];
+ } else if (online_type == MMOP_ONLINE_MOVABLE) {
+ zone = &pgdat->node_zones[ZONE_MOVABLE];
+ }

- zone_type = zone - pgdat->node_zones;
- ret = ensure_zone_is_initialized(zone, phys_start_pfn, nr_pages);
- if (ret)
- return ret;
+ if (zone_is_empty(zone))
+ init_currently_empty_zone(zone, start_pfn, nr_pages);

- pgdat_resize_lock(zone->zone_pgdat, &flags);
- grow_zone_span(zone, phys_start_pfn, phys_start_pfn + nr_pages);
- grow_pgdat_span(zone->zone_pgdat, phys_start_pfn,
- phys_start_pfn + nr_pages);
- pgdat_resize_unlock(zone->zone_pgdat, &flags);
- memmap_init_zone(nr_pages, nid, zone_type,
- phys_start_pfn, MEMMAP_HOTPLUG);
+ clear_zone_contiguous(zone);

- /* online_page_range is called later and expects pages reserved */
- for (pfn = phys_start_pfn; pfn < phys_start_pfn + nr_pages; pfn++) {
- if (!pfn_valid(pfn))
- continue;
+ /* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */
+ pgdat_resize_lock(pgdat, &flags);
+ zone_span_writelock(zone);
+ resize_zone(zone, start_pfn, nr_pages);
+ zone_span_writeunlock(zone);
+ resize_pgdat(pgdat, start_pfn, nr_pages);
+ pgdat_resize_unlock(pgdat, &flags);

- SetPageReserved(pfn_to_page(pfn));
+ /*
+ * TODO now we have a visible range of pages which are not associated
+ * with their zone properly. Not nice but set_pfnblock_flags_mask
+ * expects the zone spans the pfn range. All the pages in the range
+ * are reserved so nobody should be touching them so we should be safe
+ */
+ memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, MEMMAP_HOTPLUG);
+ for (i = 0; i < nr_pages; i++) {
+ unsigned long pfn = start_pfn + i;
+ set_page_links(pfn_to_page(pfn), zone_idx(zone), nid, pfn);
}
- return 0;
+
+ set_zone_contiguous(zone);
+ return zone;
}

-static int __meminit __add_section(int nid, struct zone *zone,
- unsigned long phys_start_pfn)
+static int __meminit __add_section(int nid, unsigned long phys_start_pfn)
{
int ret;
+ int i;

if (pfn_valid(phys_start_pfn))
return -EEXIST;

- ret = sparse_add_one_section(zone, phys_start_pfn);
-
+ ret = sparse_add_one_section(NODE_DATA(nid), phys_start_pfn);
if (ret < 0)
return ret;

- ret = __add_zone(zone, phys_start_pfn);
-
- if (ret < 0)
- return ret;
+ /*
+ * Make all the pages reserved so that nobody will stumble over half
+ * initialized state.
+ */
+ for (i = 0; i < PAGES_PER_SECTION; i++)
+ SetPageReserved(pfn_to_page(phys_start_pfn + i));

return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
}
@@ -528,7 +397,7 @@ static int __meminit __add_section(int nid, struct zone *zone,
* call this function after deciding the zone to which to
* add the new pages.
*/
-int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
+int __ref __add_pages(int nid, unsigned long phys_start_pfn,
unsigned long nr_pages)
{
unsigned long i;
@@ -536,8 +405,6 @@ int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
int start_sec, end_sec;
struct vmem_altmap *altmap;

- clear_zone_contiguous(zone);
-
/* during initialize mem_map, align hot-added range to section */
start_sec = pfn_to_section_nr(phys_start_pfn);
end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1);
@@ -557,7 +424,7 @@ int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
}

for (i = start_sec; i <= end_sec; i++) {
- err = __add_section(nid, zone, section_nr_to_pfn(i));
+ err = __add_section(nid, section_nr_to_pfn(i));

/*
* EEXIST is finally dealt with by ioresource collision
@@ -570,7 +437,6 @@ int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
}
vmemmap_populate_print_last();
out:
- set_zone_contiguous(zone);
return err;
}
EXPORT_SYMBOL_GPL(__add_pages);
@@ -944,23 +810,6 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
return 0;
}

-#ifdef CONFIG_MOVABLE_NODE
-/*
- * When CONFIG_MOVABLE_NODE, we permit onlining of a node which doesn't have
- * normal memory.
- */
-static bool can_online_high_movable(struct zone *zone)
-{
- return true;
-}
-#else /* CONFIG_MOVABLE_NODE */
-/* ensure every online node has NORMAL memory */
-static bool can_online_high_movable(struct zone *zone)
-{
- return node_state(zone_to_nid(zone), N_NORMAL_MEMORY);
-}
-#endif /* CONFIG_MOVABLE_NODE */
-
/* check which state of node_states will be changed when online memory */
static void node_states_check_changes_online(unsigned long nr_pages,
struct zone *zone, struct memory_notify *arg)
@@ -1035,39 +884,28 @@ static void node_states_set_node(int node, struct memory_notify *arg)
node_set_state(node, N_MEMORY);
}

-bool zone_can_shift(unsigned long pfn, unsigned long nr_pages,
- enum zone_type target, int *zone_shift)
+bool allow_online_pfn_range(int nid, unsigned long pfn, unsigned long nr_pages, int online_type)
{
- struct zone *zone = page_zone(pfn_to_page(pfn));
- enum zone_type idx = zone_idx(zone);
- int i;
-
- *zone_shift = 0;
-
- if (idx < target) {
- /* pages must be at end of current zone */
- if (pfn + nr_pages != zone_end_pfn(zone))
- return false;
+ struct pglist_data *pgdat = NODE_DATA(nid);
+ struct zone *movable_zone = &pgdat->node_zones[ZONE_MOVABLE];
+ struct zone *normal_zone = &pgdat->node_zones[ZONE_NORMAL];

- /* no zones in use between current zone and target */
- for (i = idx + 1; i < target; i++)
- if (zone_is_initialized(zone - idx + i))
- return false;
- }
-
- if (target < idx) {
- /* pages must be at beginning of current zone */
- if (pfn != zone->zone_start_pfn)
- return false;
-
- /* no zones in use between current zone and target */
- for (i = target + 1; i < idx; i++)
- if (zone_is_initialized(zone - idx + i))
- return false;
+ /*
+ * TODO there shouldn't be any inherent reason to have ZONE_NORMAL
+ * physically before ZONE_MOVABLE. All we need is they do not
+ * overlap. Historically we didn't allow ZONE_NORMAL after ZONE_MOVABLE
+ * though so let's stick with it for simplicity for now.
+ */
+ if (online_type == MMOP_ONLINE_KERNEL) {
+ if (!populated_zone(movable_zone))
+ return true;
+ return movable_zone->zone_start_pfn >= pfn + nr_pages;
+ } else if (online_type == MMOP_ONLINE_MOVABLE) {
+ return zone_end_pfn(normal_zone) <= pfn;
}

- *zone_shift = target - idx;
- return true;
+ /* MMOP_ONLINE_KEEP will always succeed and inherits the current zone */
+ return online_type == MMOP_ONLINE_KEEP;
}

/* Must be protected by mem_hotplug_begin() */
@@ -1080,29 +918,13 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
int nid;
int ret;
struct memory_notify arg;
- int zone_shift = 0;

- /*
- * This doesn't need a lock to do pfn_to_page().
- * The section can't be removed here because of the
- * memory_block->state_mutex.
- */
- zone = page_zone(pfn_to_page(pfn));
-
- if ((zone_idx(zone) > ZONE_NORMAL ||
- online_type == MMOP_ONLINE_MOVABLE) &&
- !can_online_high_movable(zone))
+ nid = pfn_to_nid(pfn);
+ if (!allow_online_pfn_range(nid, pfn, nr_pages, online_type))
return -EINVAL;

- if (online_type == MMOP_ONLINE_KERNEL) {
- if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))
- return -EINVAL;
- } else if (online_type == MMOP_ONLINE_MOVABLE) {
- if (!zone_can_shift(pfn, nr_pages, ZONE_MOVABLE, &zone_shift))
- return -EINVAL;
- }
-
- zone = move_pfn_range(zone_shift, pfn, pfn + nr_pages);
+ /* associate pfn range with the zone */
+ zone = move_pfn_range(online_type, nid, pfn, nr_pages);
if (!zone)
return -EINVAL;

@@ -1110,8 +932,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
arg.nr_pages = nr_pages;
node_states_check_changes_online(nr_pages, zone, &arg);

- nid = zone_to_nid(zone);
-
ret = memory_notify(MEM_GOING_ONLINE, &arg);
ret = notifier_to_errno(ret);
if (ret)
@@ -1522,8 +1342,10 @@ int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn,
}

if (zone) {
- *valid_start = start;
- *valid_end = min(end, end_pfn);
+ if (valid_start)
+ *valid_start = start;
+ if (valid_end)
+ *valid_end = min(end, end_pfn);
return 1;
} else {
return 0;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5ee8a26fa383..c6127f1a62e9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -795,7 +795,7 @@ static inline void __free_one_page(struct page *page,

max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);

- VM_BUG_ON(!zone_is_initialized(zone));
+ VM_BUG_ON(zone_is_empty(zone));
VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);

VM_BUG_ON(migratetype == -1);
@@ -5518,7 +5518,7 @@ static __meminit void zone_pcp_init(struct zone *zone)
zone_batchsize(zone));
}

-int __meminit init_currently_empty_zone(struct zone *zone,
+void __meminit init_currently_empty_zone(struct zone *zone,
unsigned long zone_start_pfn,
unsigned long size)
{
@@ -5535,9 +5535,6 @@ int __meminit init_currently_empty_zone(struct zone *zone,
zone_start_pfn, (zone_start_pfn + size));

zone_init_free_lists(zone);
- zone->initialized = 1;
-
- return 0;
}

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
@@ -6000,7 +5997,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
{
enum zone_type j;
int nid = pgdat->node_id;
- int ret;

pgdat_resize_init(pgdat);
#ifdef CONFIG_NUMA_BALANCING
@@ -6082,8 +6078,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)

set_pageblock_order();
setup_usemap(pgdat, zone, zone_start_pfn, size);
- ret = init_currently_empty_zone(zone, zone_start_pfn, size);
- BUG_ON(ret);
+ init_currently_empty_zone(zone, zone_start_pfn, size);
memmap_init(size, nid, j, zone_start_pfn);
}
}
diff --git a/mm/sparse.c b/mm/sparse.c
index db6bf3c97ea2..94072c0d8952 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -689,10 +689,9 @@ static void free_map_bootmem(struct page *memmap)
* set. If this is <=0, then that means that the passed-in
* map was not consumed and must be freed.
*/
-int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn)
+int __meminit sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn)
{
unsigned long section_nr = pfn_to_section_nr(start_pfn);
- struct pglist_data *pgdat = zone->zone_pgdat;
struct mem_section *ms;
struct page *memmap;
unsigned long *usemap;
--
Michal Hocko
SUSE Labs


2017-03-15 10:48:45

by Vitaly Kuznetsov

Subject: Re: [RFC PATCH] rework memory hotplug onlining

Michal Hocko <[email protected]> writes:

> Hi,
> this is a follow up for [1]. In short the current semantic of the memory
> hotplug is awkward and hard/impossible to use from the udev to online
> memory as movable. The main problem is that only the last memblock or
> the adjacent to highest movable memblock can be onlined as movable:
> : Let's simulate memory hot online manually
> : # echo 0x100000000 > /sys/devices/system/memory/probe
> : # grep . /sys/devices/system/memory/memory32/valid_zones
> : Normal Movable
> :
> : which looks reasonably right? Both Normal and Movable zones are allowed
> :
> : # echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
> : # grep . /sys/devices/system/memory/memory3?/valid_zones
> : /sys/devices/system/memory/memory32/valid_zones:Normal
> : /sys/devices/system/memory/memory33/valid_zones:Normal Movable
> :
> : Huh, so our valid_zones have changed under our feet...
> :
> : # echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
> : # grep . /sys/devices/system/memory/memory3?/valid_zones
> : /sys/devices/system/memory/memory32/valid_zones:Normal
> : /sys/devices/system/memory/memory33/valid_zones:Normal
> : /sys/devices/system/memory/memory34/valid_zones:Normal Movable
> :
> : and again. So only the last memblock is considered movable. Let's try to
> : online them now.
> :
> : # echo online_movable > /sys/devices/system/memory/memory34/state
> : # grep . /sys/devices/system/memory/memory3?/valid_zones
> : /sys/devices/system/memory/memory32/valid_zones:Normal
> : /sys/devices/system/memory/memory33/valid_zones:Normal Movable
> : /sys/devices/system/memory/memory34/valid_zones:Movable Normal
>
> Now consider that the userspace gets the notification when the memblock
> is added. If the udev context tries to online it it will a) race with
> new memblocks showing up which leads to undeterministic behavior and
> b) it will see memblocks ordered in growing physical addresses while
> the only reliable way to online blocks as movable is exactly from other
> directions. This is just plain wrong!
>
> It seems that all this is just started by the semantic introduced by
> 9d99aaa31f59 ("[PATCH] x86_64: Support memory hotadd without sparsemem")
> quite some time ago. When the movable onlinining has been introduced it
> just built on top of this. It seems that the requirement to have
> freshly probed memory associated with the zone normal is no longer
> necessary. HOTPLUG depends on CONFIG_SPARSEMEM these days.
>
> The following blob [2] simply removes all the zone specific operations
> from __add_pages (aka arch_add_memory) path. Instead we do page->zone
> association from move_pfn_range which is called from online_pages. The
> criterion for movable/normal zone association is really simple now. We
> just have to guarantee that zone Normal is always lower than zone
> Movable. It would be actually sufficient to guarantee they do not
> overlap and that is indeed trivial to implement now. I didn't do that
> yet for simplicity of this change though.
>
> I have lightly tested the patch and nothing really jumped at me. I
> assume there will be some rough edges but it should be sufficient to
> start the discussion at least. Please note the diffstat. We have added
> a lot of code to tweak on top of the previous semantic which is just
> sad. Instead of developing a robust solution the memory hotplug is full
> of tweaks to satisfy particular usecase without longer term plans.
>
> Please note that this is just for x86 now but I will address other
> arches once there is an agreement this is the right approach.
>
> Thoughts, objections?
>

Speaking about long term approach,

(I'm not really familiar with the history of memory zones code so please
bear with me if my questions are stupid)

Currently when we online memory blocks we need to know where to put the
boundary between NORMAL and MOVABLE, and this is a very hard decision to
make, no matter whether we do it from the kernel or from userspace. In
theory, we just want to avoid unnecessary limitations on future unplug,
but we don't really know how much memory we'll need for kernel
allocations in the future.

What actually stops us from having the following approach:
1) Everything is added to MOVABLE
2) When we're out of memory for kernel allocations in NORMAL we 'harvest'
the first MOVABLE block and 'convert' it to NORMAL. It may happen that
there are no free pages in this block, but it was MOVABLE, which means we
can move all its allocations somewhere else.
3) Freeing the whole 128MB memblock takes time but we don't need to wait
till it finishes; we just need to satisfy the currently pending
allocation and we can continue moving everything else in the background.

An alternative approach would be to have lists of memblocks which
constitute ZONE_NORMAL and ZONE_MOVABLE instead of the simple 'NORMAL
before MOVABLE' rule we have now, but I'm not sure this is a viable
approach with the current code base.

--
Vitaly

2017-03-15 12:29:38

by Michal Hocko

Subject: ZONE_NORMAL vs. ZONE_MOVABLE (was: Re: [RFC PATCH] rework memory hotplug onlining)

On Wed 15-03-17 11:48:37, Vitaly Kuznetsov wrote:
> Michal Hocko <[email protected]> writes:
[...]
> Speaking about long term approach,

Not really related to the patch but ok (I hope this will not distract
from the original intention here)...

> (I'm not really familiar with the history of memory zones code so please
> bear with me if my questions are stupid)
>
> Currently when we online memory blocks we need to know where to put the
> boundary between NORMAL and MOVABLE and this is a very hard decision to
> make, no matter if we do this from kernel or from userspace. In theory,
> we just want to avoid redundant limitations with future unplug but we
> don't really know how much memory we'll need for kernel allocations in
> future.

yes, and that is why I am not really all that happy about the whole
movable zones concept. It is basically reintroducing the highmem issues
from 32-bit times. But it is the only concept we currently have to
provide reliable memory hotremove right now.

> What actually stops us from having the following approach:
> 1) Everything is added to MOVABLE
> 2) When we're out of memory for kernel allocations in NORMAL we 'harvest'
> the first MOVABLE block and 'convert' it to NORMAL. It may happen that
> there is no free pages in this block but it was MOVABLE which means we
> can move all allocations somewhere else.
> 3) Freeing the whole 128mb memblock takes time but we don't need to wait
> till it finishes, we just need to satisfy the currently pending
> allocation and we can continue moving everything else in the background.

Although it sounds like a good idea at first sight, there are many tiny
details which make it much more complicated. First of all, how do we
know that lowmem (resp. all the normal zones) is under enough pressure
that we should reduce the movable zone? Getting an OOM for a
~__GFP_MOVABLE request? Isn't that too late already? Sync migration at
that point might be really non-trivial (pages might be dirty, pinned
etc...). What about the user expectation to hotremove that memory later,
should we just break it? How do we inflate the movable zone back?

> An alternative approach would be to have lists of memblocks which
> constitute ZONE_NORMAL and ZONE_MOVABLE instead of a simple 'NORMAL
> before MOVABLE' rule we have now but I'm not sure this is a viable
> approach with the current code base.

I am not sure I understand.

--
Michal Hocko
SUSE Labs

2017-03-15 12:53:42

by Vitaly Kuznetsov

Subject: Re: ZONE_NORMAL vs. ZONE_MOVABLE

Michal Hocko <[email protected]> writes:

> On Wed 15-03-17 11:48:37, Vitaly Kuznetsov wrote:
>> Michal Hocko <[email protected]> writes:
> [...]
>> Speaking about long term approach,
>
> Not really related to the patch but ok (I hope this will not distract
> from the original intention here)...
>

Yes, not directly related to your patch.

>> (I'm not really familiar with the history of memory zones code so please
>> bear with me if my questions are stupid)
>>
>> Currently when we online memory blocks we need to know where to put the
>> boundary between NORMAL and MOVABLE and this is a very hard decision to
>> make, no matter if we do this from kernel or from userspace. In theory,
>> we just want to avoid redundant limitations with future unplug but we
>> don't really know how much memory we'll need for kernel allocations in
>> future.
>
> yes, and that is why I am not really all that happy about the whole
> movable zones concept. It is basically reintroducing highmem issues from
> 32b times. But this is the only concept we currently have to provide a
> reliable memory hotremove right now.
>
>> What actually stops us from having the following approach:
>> 1) Everything is added to MOVABLE
>> 2) When we're out of memory for kernel allocations in NORMAL we 'harvest'
>> the first MOVABLE block and 'convert' it to NORMAL. It may happen that
>> there is no free pages in this block but it was MOVABLE which means we
>> can move all allocations somewhere else.
>> 3) Freeing the whole 128mb memblock takes time but we don't need to wait
>> till it finishes, we just need to satisfy the currently pending
>> allocation and we can continue moving everything else in the background.
>
> Although it sounds like a good idea at first sight there are many tiny
> details which will make it much more complicated. First of all, how
> do we know that the lowmem (resp. all zones normal zones) are under
> pressure to reduce the movable zone? Getting OOM for ~__GFP_MOVABLE
> request? Isn't that too late already?

Yes, I was basically thinking about OOM handling. It can also be a sort
of watermark-based decision.

> Sync migration at that state might
> be really non trivial (pages might be dirty, pinned etc...).

Non-trivial, yes, but we already have the code to move all allocations
away from a MOVABLE block when we try to offline it; we can probably
leverage it.

> What about
> user expectation to hotremove that memory later, should we just break
> it? How do we inflate movable zone back?

I think that it's OK to leave this block non-offlineable for the future.
As Andrea already pointed out, it is not practical to try to guarantee
we can unplug everything we plugged in; we're talking about a 'best
effort' service here anyway.

>
>> An alternative approach would be to have lists of memblocks which
>> constitute ZONE_NORMAL and ZONE_MOVABLE instead of a simple 'NORMAL
>> before MOVABLE' rule we have now but I'm not sure this is a viable
>> approach with the current code base.
>
> I am not sure I understand.

Now we have

[Normal][Normal][Normal][Movable][Movable][Movable]

we could have
[Normal][Normal][Movable][Normal][Movable][Normal]

so when a new block comes in we decide which zone to online it into
(based on memory usage in these zones), and a zone becomes a list of
memblocks which constitute it, not a simple [from..to] range.
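
Purely as a hypothetical sketch of what I mean (nothing like this exists
today and the names are made up), the zone span would become a list of
ranges instead of a single [start, end):

	/* hypothetical: one physically contiguous piece of a zone */
	struct zone_memblock_range {
		unsigned long start_pfn;
		unsigned long nr_pages;
		struct list_head list;	/* linked into a per-zone range list */
	};

so zone_spans_pfn() and friends would walk such a list instead of
checking a single zone_start_pfn/spanned_pages pair.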

--
Vitaly

2017-03-15 13:11:52

by Michal Hocko

Subject: Re: ZONE_NORMAL vs. ZONE_MOVABLE

On Wed 15-03-17 13:53:09, Vitaly Kuznetsov wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Wed 15-03-17 11:48:37, Vitaly Kuznetsov wrote:
[...]
> >> What actually stops us from having the following approach:
> >> 1) Everything is added to MOVABLE
> >> 2) When we're out of memory for kernel allocations in NORMAL we 'harvest'
> >> the first MOVABLE block and 'convert' it to NORMAL. It may happen that
> >> there is no free pages in this block but it was MOVABLE which means we
> >> can move all allocations somewhere else.
> >> 3) Freeing the whole 128mb memblock takes time but we don't need to wait
> >> till it finishes, we just need to satisfy the currently pending
> >> allocation and we can continue moving everything else in the background.
> >
> > Although it sounds like a good idea at first sight there are many tiny
> > details which will make it much more complicated. First of all, how
> > do we know that the lowmem (resp. all zones normal zones) are under
> > pressure to reduce the movable zone? Getting OOM for ~__GFP_MOVABLE
> > request? Isn't that too late already?
>
> Yes, I was basically thinking about OOM handling. It can also be a sort
> of watermark-based decision.
>
> > Sync migration at that state might
> > be really non trivial (pages might be dirty, pinned etc...).
>
> Non-trivial, yes, but we already have the code to move all allocations
> away from MOVABLE block when we try to offline it, we can probably
> leverage it.

Sure, I am not saying this is impossible. I am just saying there are
many subtle details to be solved.

>
> > What about
> > user expectation to hotremove that memory later, should we just break
> > it? How do we inflate movable zone back?
>
> I think that it's OK to leave this block non-offlineable for future. As
> Andrea already pointed out it is not practical to try to guarantee we
> can unplug everything we plugged in, we're talking about 'best effort'
> service here anyway.

Well, my understanding of the movable zone is closer to a requirement
than a best effort thing. You have to sacrifice a lot - higher memory
pressure on the other zones with the resulting performance consequences,
potential latencies to access remote memory when the data (locks etc.)
are on a remote non-movable node. It would be really bad to find out
that all of that was in vain just because the lowmem pressure has stolen
your movable memory.

> >> An alternative approach would be to have lists of memblocks which
> >> constitute ZONE_NORMAL and ZONE_MOVABLE instead of a simple 'NORMAL
> >> before MOVABLE' rule we have now but I'm not sure this is a viable
> >> approach with the current code base.
> >
> > I am not sure I understand.
>
> Now we have
>
> [Normal][Normal][Normal][Movable][Movable][Movable]
>
> we could have
> [Normal][Normal][Movable][Normal][Movable][Normal]
>
> so when new block comes in we make a decision to which zone we want to
> online it (based on memory usage in these zones) and zone becomes a list
> of memblocks which constitute it, not a simple [from..to] range.

OK, I see now. I am afraid there is quite a lot of code which expects
that zones do not overlap. We can have holes in zones but not different
zones interleaving. Probably something which could be addressed but far
from trivial IMHO.

All that being said, I do not want to discourage you from experiments in
those areas. Just be prepared that all of this is far from trivial and
something for a long-term project ;)
--
Michal Hocko
SUSE Labs

2017-03-15 16:38:36

by Andrea Arcangeli

Subject: Re: ZONE_NORMAL vs. ZONE_MOVABLE

On Wed, Mar 15, 2017 at 02:11:40PM +0100, Michal Hocko wrote:
> OK, I see now. I am afraid there is quite a lot of code which expects
> that zones do not overlap. We can have holes in zones but not different
> zones interleaving. Probably something which could be addressed but far
> from trivial IMHO.
>
> All that being said, I do not want to discourage you from experiments in
> those areas. Just be prepared all those are far from trivial and
> something for a long project ;)

This constraint was known for quite some time, so when I talked about
this very constraint with Mel at last year's LSF/MM he suggested sticky
pageblocks would be superior to the current movable zone.

So instead of having a Movable zone, we could keep using the pageblocks
but make them sticky-movable so they only accept __GFP_MOVABLE
allocations. It would still be quite a large change indeed, but it looks
simpler and with fewer drawbacks than trying to make the zones overlap.

Currently when you online memory as movable you're pushing the movable
zone boundary down, not just onlining the memory, and that complexity,
which you have to deal with, would go away with sticky movable
pageblocks.

One other option could be to boot with _DEFAULT_ONLINE=n and of course
without a udev rule. Then, after booting with the base memory, run one
of the two echo commands below:

$ cat /sys/devices/system/memory/removable_hotplug_default
[disabled] online online_movable
$ echo online > /sys/devices/system/memory/removable_hotplug_default
$ echo online_movable > /sys/devices/system/memory/removable_hotplug_default

Then the "echo online/online_movable" would activate the in-kernel
hotplug mechanism that is faster and more reliable than udev and it
won't risk to run into the movable zone shift "constraint". After the
"echo" the kernel would behave like if it booted with _DEFAULT_ONLINE=y.

If you still want to do it by hand and leave it disabled or even
trying to fix udev movable shift constraints, sticky pageblocks and
lack of synchronicity (and deal with the resulting slower
performance compared to in-kernel onlining), you could.

The in-kernel onlining would use the exact same code as
_DEFAULT_ONLINE=y, but it would be configured with a file like
/etc/sysctl.conf. And then to switch it to the _movable model you would
just need to edit that file, like you have to edit the udev rule today
(the one that currently breaks if you edit it to use online_movable).

From a usability perspective it would be like udev, but without all the
drawbacks of doing the onlining in userland.

Checking whether the memory should become movable depending on
acpi_has_method(handle, "_EJ0") isn't flexible enough I think; on bare
metal especially we couldn't change the ACPI tables like we can do with
the hypervisor, but the admin still has to decide freely if he wants to
risk early OOM and movable zone imbalance or if he prefers not being
able to hotunplug the memory ever again. So it would need to become a
grub boot option, which is probably less friendly than editing
sysctl.conf or something like that (especially given the grub-mkconfig
output..).
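
To illustrate the boot-option variant (a sketch only; memhp_default_state=
is the existing boot parameter behind _DEFAULT_ONLINE and at this point
it only understands online/offline, not a movable policy), the admin
would end up editing something like:

  # /etc/default/grub
  GRUB_CMDLINE_LINUX="... memhp_default_state=online"
  # grub-mkconfig -o /boot/grub/grub.cfg

which is clearly clumsier than flipping a line in a sysctl-style config
file.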

Thanks,
Andrea

2017-03-15 23:08:51

by Kani, Toshimitsu

Subject: Re: [RFC PATCH] rework memory hotplug onlining

On Wed, 2017-03-15 at 10:13 +0100, Michal Hocko wrote:
:
> @@ -388,39 +389,44 @@ static ssize_t show_valid_zones(struct device *dev,
> struct device_attribute *attr, char *buf)
> {
> struct memory_block *mem = to_memory_block(dev);
> - unsigned long start_pfn, end_pfn;
> - unsigned long valid_start, valid_end, valid_pages;
> - unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> + unsigned long start_pfn, nr_pages;
> + bool append = false;
> struct zone *zone;
> - int zone_shift = 0;
> + int nid;
>
> start_pfn = section_nr_to_pfn(mem->start_section_nr);
> - end_pfn = start_pfn + nr_pages;
> + zone = page_zone(pfn_to_page(start_pfn));
> + nr_pages = PAGES_PER_SECTION * sections_per_block;
>
> - /* The block contains more than one zone can not be offlined. */
> - if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start, &valid_end))
> + /*
> + * The block contains more than one zone can not be offlined.
> + * This can happen e.g. for ZONE_DMA and ZONE_DMA32
> + */
> + if (!test_pages_in_a_zone(start_pfn, start_pfn + nr_pages, NULL, NULL))
> return sprintf(buf, "none\n");
>
> - zone = page_zone(pfn_to_page(valid_start));

Please do not remove the fix made in a96dfddbcc043. zone needs to be
set from valid_start, not from start_pfn.

Thanks,
-Toshi

2017-03-16 05:33:06

by Joonsoo Kim

Subject: Re: ZONE_NORMAL vs. ZONE_MOVABLE

On Wed, Mar 15, 2017 at 05:37:29PM +0100, Andrea Arcangeli wrote:
> On Wed, Mar 15, 2017 at 02:11:40PM +0100, Michal Hocko wrote:
> > OK, I see now. I am afraid there is quite a lot of code which expects
> > that zones do not overlap. We can have holes in zones but not different
> > zones interleaving. Probably something which could be addressed but far
> > from trivial IMHO.
> >
> > All that being said, I do not want to discourage you from experiments in
> > those areas. Just be prepared all those are far from trivial and
> > something for a long project ;)
>
> This constraint was known for quite some time, so when I talked about
> this very constraint with Mel at least year LSF/MM he suggested sticky
> pageblocks would be superior to the current movable zone.
>
> So instead of having a Movable zone, we could use the pageblocks but
> make it sticky-movable so they're only going to accept __GFP_MOVABLE
> allocations into them. It would be still a quite large change indeed
> but it looks simpler and with fewer drawbacks than trying to make the
> zone overlap.

Hello,

I haven't followed the previous discussion, so please let me know if I
miss something. I'd just like to comment on sticky pageblocks.

Before that, I'd like to say that a lot of code already deals with zone
overlap. Zone overlap has existed for a long time, although I don't know
the exact history. IIRC, Mel fixed such a case before and the compaction
code has a check for it. And I added the overlap check to some pfn
iterators which didn't have one, in preparation for introducing a new
zone, ZONE_CMA, whose range overlaps other zones. See the following
commits.

'ba6b097', '9d43f5a', 'a91c43c'.

Coming to my main topic, I disagree that sticky pageblocks would be
superior to the current separate zone approach. There are some reasons
for the objection to sticky movable pageblocks in the following link.

A sticky movable pageblock is conceptually the same as MIGRATE_CMA, and
it will cause many subtle issues just like MIGRATE_CMA did for CMA
users. MIGRATE_CMA introduces many hooks in various code paths and, to
fix the remaining issues, it needs even more hooks. I don't think that
is a maintainable approach. If you look at the following link, which
implements the ZONE approach, you can see that many hooks are removed in
the end.

lkml.kernel.org/r/[email protected]

I don't know the exact requirements of memory hotplug, so it is possible
that the ZONE approach is not suitable for it. But, anyway, sticky
pageblocks do not seem to be a good solution to me.

Thanks.

2017-03-16 08:55:41

by Michal Hocko

Subject: Re: [RFC PATCH] rework memory hotplug onlining

On Wed 15-03-17 23:08:14, Kani, Toshimitsu wrote:
> On Wed, 2017-03-15 at 10:13 +0100, Michal Hocko wrote:
> :
> > @@ -388,39 +389,44 @@ static ssize_t show_valid_zones(struct device *dev,
> > struct device_attribute *attr, char *buf)
> > {
> > struct memory_block *mem = to_memory_block(dev);
> > - unsigned long start_pfn, end_pfn;
> > - unsigned long valid_start, valid_end, valid_pages;
> > - unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> > + unsigned long start_pfn, nr_pages;
> > + bool append = false;
> > struct zone *zone;
> > - int zone_shift = 0;
> > + int nid;
> >
> > start_pfn = section_nr_to_pfn(mem->start_section_nr);
> > - end_pfn = start_pfn + nr_pages;
> > + zone = page_zone(pfn_to_page(start_pfn));
> > + nr_pages = PAGES_PER_SECTION * sections_per_block;
> >
> > - /* The block contains more than one zone can not be offlined. */
> > - if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start, &valid_end))
> > + /*
> > + * The block contains more than one zone can not be offlined.
> > + * This can happen e.g. for ZONE_DMA and ZONE_DMA32
> > + */
> > + if (!test_pages_in_a_zone(start_pfn, start_pfn + nr_pages, NULL, NULL))
> > return sprintf(buf, "none\n");
> >
> > - zone = page_zone(pfn_to_page(valid_start));
>
> Please do not remove the fix made in a96dfddbcc043. zone needs to be
> set from valid_start, not from start_pfn.

Thanks for pointing this out. I was scratching my head about this part
but was too tired from the previous git archaeology so I didn't check
the history of this particular part.

I will restore the original behavior, but before I do that I am really
curious whether partial memblocks are even supported for onlining. Maybe
I am missing something, but I do not see any explicit checks for a NULL
struct page when we set zone boundaries or online a memblock. Is it
possible those memblocks are just never hotpluggable?
--
Michal Hocko
SUSE Labs

2017-03-16 17:35:24

by Kani, Toshimitsu

Subject: Re: [RFC PATCH] rework memory hotplug onlining

On Thu, 2017-03-16 at 09:54 +0100, Michal Hocko wrote:
> On Wed 15-03-17 23:08:14, Kani, Toshimitsu wrote:
> > On Wed, 2017-03-15 at 10:13 +0100, Michal Hocko wrote:
:
> > > - zone = page_zone(pfn_to_page(valid_start));
> >
> > Please do not remove the fix made in a96dfddbcc043. zone needs to
> > be set from valid_start, not from start_pfn.
>
> Thanks for pointing this out. I was scratching my head about this
> part but was too tired from previous git archeology so I didn't check
> the history of this particular part.
>
> I will restore the original behavior but before I do that I am really
> curious whether partial memblocks are even supported for onlining.
> Maybe I am missing something but I do not see any explicit checks for
> NULL struct page when we set zone boundaries or online a memblock. Is
> it possible those memblocks are just never hotplugable?

check_hotplug_memory_range() checks whether a given range is aligned to
the section size.

This memory device represents a memory_block, which may contain multiple
sections, per 'sections_per_block'. This value is 2GB/128MB (i.e. 16)
for a 2GB memory_block. So I'd expect that hot-add works as long as the
address is aligned to 128MB, but I have not tested it myself.
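
(For reference, the block size can be read from sysfs; on a system with
2GB blocks you would see something like the following, which gives
sections_per_block = 0x80000000 / 0x8000000 = 16. Exact values differ
per machine.)

  # cat /sys/devices/system/memory/block_size_bytes
  80000000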

Thanks,
-Toshi

2017-03-16 17:42:00

by Michal Hocko

Subject: Re: [RFC PATCH] rework memory hotplug onlining

On Thu 16-03-17 17:19:34, Kani, Toshimitsu wrote:
> On Thu, 2017-03-16 at 09:54 +0100, Michal Hocko wrote:
> > On Wed 15-03-17 23:08:14, Kani, Toshimitsu wrote:
> > > On Wed, 2017-03-15 at 10:13 +0100, Michal Hocko wrote:
> :
> > > > - zone = page_zone(pfn_to_page(valid_start));
> > >
> > > Please do not remove the fix made in a96dfddbcc043. zone needs to
> > > be set from valid_start, not from start_pfn.
> >
> > Thanks for pointing this out. I was scratching my head about this
> > part but was too tired from previous git archeology so I didn't check
> > the history of this particular part.
> >
> > I will restore the original behavior but before I do that I am really
> > curious whether partial memblocks are even supported for onlining.
> > Maybe I am missing something but I do not see any explicit checks for
> > NULL struct page when we set zone boundaries or online a memblock. Is
> > it possible those memblocks are just never hotplugable?
>
> check_hotplug_memory_range() checks if a given range is aligned by the
> section size.

Ohh, right you are! I had completely missed check_hotplug_memory_range.
Thanks for pointing it out.

--
Michal Hocko
SUSE Labs

2017-03-16 19:11:05

by Andrea Arcangeli

Subject: Re: ZONE_NORMAL vs. ZONE_MOVABLE

Hello Joonsoo,

On Thu, Mar 16, 2017 at 02:31:22PM +0900, Joonsoo Kim wrote:
> I don't follow up previous discussion so please let me know if I miss
> something. I'd just like to mention about sticky pageblocks.

The interesting part of the previous discussion, relevant to sticky
movable pageblocks, is this part from Vitaly:

=== quote ===
Now we have

[Normal][Normal][Normal][Movable][Movable][Movable]

we could have

[Normal][Normal][Movable][Normal][Movable][Normal]
=== quote ===

Suppose you're an admin; here is what you can try, starting from all
hotplug memory offlined:

kvm ~ # cat /sys/devices/system/memory/memory3[6-9]/online
0
0
0
0
kvm ~ # python ~andrea/zoneinfo.py
Zone: DMA Present: 15M Managed: 15M Start: 0M End: 16M
Zone: DMA32 Present: 2031M Managed: 1892M Start: 16M End: 2047M

All hotplug memory is offline, no Movable zone.

Then you online interleaved:

kvm ~ # echo online_movable > /sys/devices/system/memory/memory39/online
kvm ~ # python ~andrea/zoneinfo.py
Zone: DMA Present: 15M Managed: 15M Start: 0M End: 16M
Zone: DMA32 Present: 2031M Managed: 1892M Start: 16M End: 2047M
Zone: Movable Present: 128M Managed: 128M Start: 4.9G End: 5.0G
kvm ~ # echo online > /sys/devices/system/memory/memory38/online
kvm ~ # python ~andrea/zoneinfo.py
Zone: DMA Present: 15M Managed: 15M Start: 0M End: 16M
Zone: DMA32 Present: 2031M Managed: 1892M Start: 16M End: 2047M
Zone: Normal Present: 128M Managed: 128M Start: 4.0G End: 4.9G
Zone: Movable Present: 128M Managed: 128M Start: 4.9G End: 5.0G

So far so good.

kvm ~ # echo online_movable > /sys/devices/system/memory/memory37/online
kvm ~ # python ~andrea/zoneinfo.py
Zone: DMA Present: 15M Managed: 15M Start: 0M End: 16M
Zone: DMA32 Present: 2031M Managed: 1892M Start: 16M End: 2047M
Zone: Normal Present: 256M Managed: 256M Start: 4.0G End: 4.9G
Zone: Movable Present: 128M Managed: 128M Start: 4.9G End: 5.0G

Oops, you thought you onlined memory37 as movable, but instead it
silently went into the Normal zone (without even erroring out), and it's
definitely not going to be unpluggable and it's definitely not
movable.... it all falls apart here. The admin won't run my zoneinfo.py
script, which I had to write specifically to understand what a mess was
happening with interleaved online_movable.

The admin is much better off never touching
/sys/devices/system/memory/memory37 and just using the in-kernel
onlining, at the very least until udev and the /sys interface are fixed
for both movable and non-movable hotplug onlining.

> Before that, I'd like to say that a lot of code already deals with zone
> overlap. Zone overlap exists for a long time although I don't know exact
> history. IIRC, Mel fixed such a case before and compaction code has a
> check for it. And, I added the overlap check to some pfn iterators which
> doesn't have such a check for preparation of introducing a new zone,
> ZONE_CMA, which has zone range overlap property. See following commits.
>
> 'ba6b097', '9d43f5a', 'a91c43c'.
>

So you suggest to create a full overlap like:

--------------- Movable --------------
--------------- Normal --------------

Then search for pages in the Movable zone buddy which will only
contain those that are onlined with echo online_movable?

> Come to my main topic, I disagree that sticky pageblock would be
> superior to the current separate zone approach. There is some reasons
> about the objection to sticky movable pageblock in following link.
>
> Sticky movable pageblock is conceptually same with MIGRATE_CMA and it
> will cause many subtle issues like as MIGRATE_CMA did for CMA users.
> MIGRATE_CMA introduces many hooks in various code path, and, to fix the
> remaining issues, it needs more hooks. I don't think it is

I'm not saying the sticky movable pageblocks are the way to go; on the
contrary, we're saying the Movable zone constraints can better be
satisfied by the in-kernel onlining mechanism, and it's overall much
simpler for the user to use the in-kernel onlining than to try to fix
udev to be synchronous and implement sticky movable pageblocks to make
the /sys interface usable without unexpected side effects. And I would
suggest looking into dropping the MOVABLE_NODE config option first (and
turning it into a kernel parameter if anything).

I agree sticky movable pageblocks may slow things down and increase
complexity, so it'd be better not to have to implement them.

> maintainable approach. If you see following link which implements ZONE
> approach, you can see that many hooks are removed in the end.
>
> lkml.kernel.org/r/[email protected]
>
> I don't know exact requirement on memory hotplug so it would be
> possible that ZONE approach is not suitable for it. But, anyway, sticky
> pageblock seems not to be a good solution to me.

The fact sticky movable pageblocks aren't ideal for CMA doesn't mean
they're not ideal for memory hotunplug though.

With CMA there's no point in having the sticky movable pageblocks
scattered around and it's purely a misfeature to use sticky movable
pageblocks because you need the whole CMA area contiguous hence a
ZONE_CMA is ideal.

As opposed to CMA, with memory hotplug the sticky movable pageblocks
would allow the kernel to satisfy the current /sys API, and they would
provide no downside, unlike the CMA case where the size of the
allocation is unknown.

If we can make zone overlap work with a 100% overlap across the whole
node that would be a fine alternative, the zoneinfo.py output will
look weird, but if that's the only downside it's no big deal. With
sticky movable pageblocks it'll all be ZONE_NORMAL, with overlap it'll
all be both ZONE_NORMAL and ZONE_MOVABLE at the same time.

Again with the in-kernel onlining none of the above is necessary as
nobody should then need to echo online/online_movable >memory*/enabled
ever again and it can all be obsoleted. So before dropping the only
option we have that works flawlessly, we should fix all the above in
udev, /sys and provide full zone overlap or sticky movable pageblocks.

Thanks,
Andrea

2017-03-17 10:25:55

by Igor Mammedov

[permalink] [raw]
Subject: Re: ZONE_NORMAL vs. ZONE_MOVABLE

On Thu, 16 Mar 2017 20:01:25 +0100
Andrea Arcangeli <[email protected]> wrote:

[...]
> If we can make zone overlap work with a 100% overlap across the whole
> node that would be a fine alternative, the zoneinfo.py output will
> look weird, but if that's the only downside it's no big deal. With
> sticky movable pageblocks it'll all be ZONE_NORMAL, with overlap it'll
> all be both ZONE_NORMAL and ZONE_MOVABLE at the same time.
Looks like I'm not getting the idea of zone overlap, so

We potentially have a flag saying that a hotplugged block is removable,
so on hotplug we could register such blocks with zone MOVABLE by
default; however, here comes the zone balance issue, so we can't do it
until that's solved.

As Vitaly suggested, we could steal (convert) existing blocks from the
border of the MOVABLE zone into zone NORMAL when there isn't enough
memory in zone NORMAL to accommodate the page table extension for a
newly arrived memory block. That would make the memory module containing
the stolen block non-removable, but that may be an acceptable sacrifice
to keep the system alive. On an attempt to remove it, the kernel could
even inform the hardware (hypervisor) that the memory module has become
non-removable, using the _OST ACPI method.


> Thanks,
> Andrea

2017-03-17 13:33:58

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH] rework memory hotplug onlining

On Wed 15-03-17 10:13:47, Michal Hocko wrote:
[...]
> It seems that all this is just started by the semantic introduced by
> 9d99aaa31f59 ("[PATCH] x86_64: Support memory hotadd without sparsemem")
> quite some time ago. When the movable onlinining has been introduced it
> just built on top of this. It seems that the requirement to have
> freshly probed memory associated with the zone normal is no longer
> necessary. HOTPLUG depends on CONFIG_SPARSEMEM these days.
>
> The following blob [2] simply removes all the zone specific operations
> from __add_pages (aka arch_add_memory) path. Instead we do page->zone
> association from move_pfn_range which is called from online_pages. The
> criterion for movable/normal zone association is really simple now. We
> just have to guarantee that zone Normal is always lower than zone
> Movable. It would be actually sufficient to guarantee they do not
> overlap and that is indeed trivial to implement now. I didn't do that
> yet for simplicity of this change though.

Does anybody have any comments on this? Any issues I've overlooked
(except for the one pointed out by Toshi Kani, which is already fixed in
my local branch)?
--
Michal Hocko
SUSE Labs

2017-03-20 06:33:06

by Joonsoo Kim

[permalink] [raw]
Subject: Re: ZONE_NORMAL vs. ZONE_MOVABLE

2017-03-17 4:01 GMT+09:00 Andrea Arcangeli <[email protected]>:
> Hello Joonsoo,

Hello, Andrea.

> On Thu, Mar 16, 2017 at 02:31:22PM +0900, Joonsoo Kim wrote:
>> I don't follow up previous discussion so please let me know if I miss
>> something. I'd just like to mention about sticky pageblocks.
>
> The interesting part of the previous discussion relevant for the
> sticky movable pageblock is this part from Vitaly:
>
> === quote ===
> Now we have
>
> [Normal][Normal][Normal][Movable][Movable][Movable]
>
> we could have
>
> [Normal][Normal][Movable][Normal][Movable][Normal]
> === quote ===
>
> Suppose you're an admin; here is what you can try to do, starting from
> all-offlined hotplug memory:
>
> kvm ~ # cat /sys/devices/system/memory/memory3[6-9]/online
> 0
> 0
> 0
> 0
> kvm ~ # python ~andrea/zoneinfo.py
> Zone: DMA Present: 15M Managed: 15M Start: 0M End: 16M
> Zone: DMA32 Present: 2031M Managed: 1892M Start: 16M End: 2047M
>
> All hotplug memory is offline, no Movable zone.
>
> Then you online interleaved:
>
> kvm ~ # echo online_movable > /sys/devices/system/memory/memory39/online
> kvm ~ # python ~andrea/zoneinfo.py
> Zone: DMA Present: 15M Managed: 15M Start: 0M End: 16M
> Zone: DMA32 Present: 2031M Managed: 1892M Start: 16M End: 2047M
> Zone: Movable Present: 128M Managed: 128M Start: 4.9G End: 5.0G
> kvm ~ # echo online > /sys/devices/system/memory/memory38/online
> kvm ~ # python ~andrea/zoneinfo.py
> Zone: DMA Present: 15M Managed: 15M Start: 0M End: 16M
> Zone: DMA32 Present: 2031M Managed: 1892M Start: 16M End: 2047M
> Zone: Normal Present: 128M Managed: 128M Start: 4.0G End: 4.9G
> Zone: Movable Present: 128M Managed: 128M Start: 4.9G End: 5.0G
>
> So far so good.
>
> kvm ~ # echo online_movable > /sys/devices/system/memory/memory37/online
> kvm ~ # python ~andrea/zoneinfo.py
> Zone: DMA Present: 15M Managed: 15M Start: 0M End: 16M
> Zone: DMA32 Present: 2031M Managed: 1892M Start: 16M End: 2047M
> Zone: Normal Present: 256M Managed: 256M Start: 4.0G End: 4.9G
> Zone: Movable Present: 128M Managed: 128M Start: 4.9G End: 5.0G
>
> Oops: you thought you onlined memory37 as movable, but instead it
> silently went into the normal zone (without even erroring out), so it is
> definitely not going to be unpluggable and it is definitely not
> movable... it all falls apart here. An admin won't run my zoneinfo.py
> script, which I had to write specifically to understand what a mess was
> happening with interleaved online_movable.
>
> The admin is much better off never touching
> /sys/devices/system/memory/memory37 and just using the in-kernel
> onlining, at the very least until udev and the /sys interface are fixed
> for both movable and non-movable hotplug onlining.

Thanks for the explanation. Now I understand the issue correctly.

>> Before that, I'd like to say that a lot of code already deals with zone
>> overlap. Zone overlap exists for a long time although I don't know exact
>> history. IIRC, Mel fixed such a case before and compaction code has a
>> check for it. And, I added the overlap check to some pfn iterators which
>> doesn't have such a check for preparation of introducing a new zone,
>> ZONE_CMA, which has zone range overlap property. See following commits.
>>
>> 'ba6b097', '9d43f5a', 'a91c43c'.
>>
>
> So you suggest to create a full overlap like:
>
> --------------- Movable --------------
> --------------- Normal --------------
>
> Then search for pages in the Movable zone buddy which will only
> contain those that are onlined with echo online_movable?

Yes. Full overlap would be the worst case, but it's possible and it
would work well(?) even in the current kernel.
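
To sketch what "work well even with overlap" relies on, here is a toy
standalone model of a pfn walker that tolerates overlapping zone ranges.
All the names below (toy_page, count_zone_pages, the layout array) are
invented for illustration; the point is only the "skip pages whose zone
link differs" pattern that the overlap checks mentioned above add to the
real pfn iterators.

#include <stdio.h>

enum zone_id { ZONE_NORMAL, ZONE_MOVABLE };

struct toy_page {
        enum zone_id zone;    /* the zone this page was onlined into */
};

/* Both zones span pfns 0..7, but each page belongs to exactly one zone. */
static struct toy_page pages[8] = {
        { ZONE_NORMAL }, { ZONE_NORMAL }, { ZONE_MOVABLE }, { ZONE_NORMAL },
        { ZONE_MOVABLE }, { ZONE_NORMAL }, { ZONE_MOVABLE }, { ZONE_MOVABLE },
};

/* Count pages of @zone inside its (overlapping) pfn range. */
static int count_zone_pages(enum zone_id zone, int start_pfn, int end_pfn)
{
        int pfn, count = 0;

        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                /* The overlap check: the range is shared, the page is not. */
                if (pages[pfn].zone != zone)
                        continue;
                count++;
        }
        return count;
}

int main(void)
{
        printf("normal:  %d\n", count_zone_pages(ZONE_NORMAL, 0, 8));
        printf("movable: %d\n", count_zone_pages(ZONE_MOVABLE, 0, 8));
        return 0;
}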

>> Come to my main topic, I disagree that sticky pageblock would be
>> superior to the current separate zone approach. There is some reasons
>> about the objection to sticky movable pageblock in following link.
>>
>> Sticky movable pageblock is conceptually same with MIGRATE_CMA and it
>> will cause many subtle issues like as MIGRATE_CMA did for CMA users.
>> MIGRATE_CMA introduces many hooks in various code path, and, to fix the
>> remaining issues, it needs more hooks. I don't think it is
>
> I'm not saying the sticky movable pageblocks are the way to go; on the
> contrary, we're saying the Movable zone constraints can better be
> satisfied by the in-kernel onlining mechanism, and it's overall much
> simpler for the user to use the in-kernel onlining than to try to fix
> udev to be synchronous and implement sticky movable pageblocks to make
> the /sys interface usable without unexpected side effects. And I would
> suggest looking into dropping the MOVABLE_NODE config option first (and
> turning it into a kernel parameter if anything).

Okay.

> I agree sticky movable pageblocks may slow things down and increase
> complexity, so it'd be better not to have to implement them.
>
>> maintainable approach. If you see following link which implements ZONE
>> approach, you can see that many hooks are removed in the end.
>>
>> lkml.kernel.org/r/[email protected]
>>
>> I don't know exact requirement on memory hotplug so it would be
>> possible that ZONE approach is not suitable for it. But, anyway, sticky
>> pageblock seems not to be a good solution to me.
>
> The fact sticky movable pageblocks aren't ideal for CMA doesn't mean
> they're not ideal for memory hotunplug though.
>
> With CMA there's no point in having the sticky movable pageblocks
> scattered around and it's purely a misfeature to use sticky movable
> pageblocks because you need the whole CMA area contiguous hence a
> ZONE_CMA is ideal.

No. CMA ranges can be registered multiple times, one per device, and
they can be scattered due to devices' H/W limitations. So in the current
in-kernel implementation the MIGRATE_CMA pageblocks are sometimes
scattered.

> As opposed to CMA, with memory hotplug the sticky movable pageblocks
> would allow the kernel to satisfy the current /sys API, and they would
> provide no downside, unlike the CMA case where the size of the
> allocation is unknown.

No, the same downside exists in this case as well. The downside is not
related to a device using that range; it is related to how the VM
manages that range, and the problems are the same. For example, with
sticky movable pageblocks we need to subtract the number of free pages
in sticky movable pageblocks when the watermark is checked for a
non-movable allocation, and that causes some problems.
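
As a toy illustration of that watermark problem (the structure, numbers
and helper below are invented; in the kernel the analogous subtraction
is the NR_FREE_CMA_PAGES adjustment in the watermark check for
allocations that may not use CMA pageblocks):

#include <stdio.h>
#include <stdbool.h>

struct toy_zone {
        long nr_free;           /* all free pages in the zone */
        long nr_free_sticky;    /* free pages inside sticky-movable pageblocks */
        long watermark;
};

/* May an allocation that cannot use sticky pageblocks proceed? */
static bool unmovable_alloc_ok(const struct toy_zone *z, long request)
{
        /* Ignore free pages the allocation is not allowed to take. */
        long usable = z->nr_free - z->nr_free_sticky;

        return usable - request >= z->watermark;
}

int main(void)
{
        struct toy_zone z = {
                .nr_free = 1000, .nr_free_sticky = 900, .watermark = 200,
        };

        /* A naive check (1000 - 1 >= 200) would pass, but only 100 pages
         * are actually usable, so the allocation must not be allowed. */
        printf("unmovable allocation allowed: %d\n",
               unmovable_alloc_ok(&z, 1));
        return 0;
}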

> If we can make zone overlap work with a 100% overlap across the whole
> node that would be a fine alternative, the zoneinfo.py output will
> look weird, but if that's the only downside it's no big deal. With
> sticky movable pageblocks it'll all be ZONE_NORMAL, with overlap it'll
> all be both ZONE_NORMAL and ZONE_MOVABLE at the same time.

Okay.

> Again with the in-kernel onlining none of the above is necessary as
> nobody should then need to echo online/online_movable >memory*/enabled
> ever again and it can all be obsoleted. So before dropping the only
> option we have that works flawlessly, we should fix all the above in
> udev, /sys and provide full zone overlap or sticky movable pageblocks.

Okay.

Thanks.

2017-03-21 14:10:24

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC PATCH] rework memory hotplug onlining

On Fri, Mar 17, 2017 at 6:20 AM, Michal Hocko <[email protected]> wrote:
> On Wed 15-03-17 10:13:47, Michal Hocko wrote:
> [...]
>> It seems that all this is just started by the semantic introduced by
>> 9d99aaa31f59 ("[PATCH] x86_64: Support memory hotadd without sparsemem")
>> quite some time ago. When the movable onlinining has been introduced it
>> just built on top of this. It seems that the requirement to have
>> freshly probed memory associated with the zone normal is no longer
>> necessary. HOTPLUG depends on CONFIG_SPARSEMEM these days.
>>
>> The following blob [2] simply removes all the zone specific operations
>> from __add_pages (aka arch_add_memory) path. Instead we do page->zone
>> association from move_pfn_range which is called from online_pages. The
>> criterion for movable/normal zone association is really simple now. We
>> just have to guarantee that zone Normal is always lower than zone
>> Movable. It would be actually sufficient to guarantee they do not
>> overlap and that is indeed trivial to implement now. I didn't do that
>> yet for simplicity of this change though.
>
> Does anybody have any comments on this? Any issues I've overlooked
> (except for the one pointed out by Toshi Kani, which is already fixed in
> my local branch)?

It disables the ZONE_DEVICE use case, but like we chatted about at LSF
I'll take a look at having devm_memremap_pages() call
move_pfn_range().

2017-03-30 07:55:37

by Vlastimil Babka

[permalink] [raw]
Subject: Re: ZONE_NORMAL vs. ZONE_MOVABLE

On 03/20/2017 07:33 AM, Joonsoo Kim wrote:
>> The fact sticky movable pageblocks aren't ideal for CMA doesn't mean
>> they're not ideal for memory hotunplug though.
>>
>> With CMA there's no point in having the sticky movable pageblocks
>> scattered around and it's purely a misfeature to use sticky movable
>> pageblocks because you need the whole CMA area contiguous hence a
>> ZONE_CMA is ideal.
> No. CMA ranges can be registered multiple times, one per device, and
> they can be scattered due to devices' H/W limitations. So in the current
> in-kernel implementation the MIGRATE_CMA pageblocks are sometimes
> scattered.
>
>> As opposed to CMA, with memory hotplug the sticky movable pageblocks
>> would allow the kernel to satisfy the current /sys API, and they would
>> provide no downside, unlike the CMA case where the size of the
>> allocation is unknown.
> No, the same downside exists in this case as well. The downside is not
> related to a device using that range; it is related to how the VM
> manages that range, and the problems are the same. For example, with
> sticky movable pageblocks we need to subtract the number of free pages
> in sticky movable pageblocks when the watermark is checked for a
> non-movable allocation, and that causes some problems.

Agree. Right now for CMA we have to account NR_FREE_CMA_PAGES (the number
of free pages within MIGRATE_CMA pageblocks), which brings all those hooks
and other troubles to keep the accounting precise (there used to be
various races in there). This goes against the rest of the page grouping
by mobility design, which wasn't meant to be precise, for performance
reasons (e.g. when you change a pageblock's type and move its pages
between freelists, any pcpu-cached pages are left on their previous
type's list).

We also can't ignore this accounting, as the watermark check could then
pass for e.g. an UNMOVABLE allocation, which would proceed only to find
that the only free pages available are within the MIGRATE_CMA (or
sticky-movable) pageblocks, which it's not allowed to fall back to. If
we then went reclaiming, the zone balance checks would also consider the
zone balanced, even though unmovable allocations would still not be
possible.

Even with this extra accounting, things are not perfect, because reclaim
doesn't guarantee freeing the pages in the right pageblocks, so we can
easily overreclaim. That's mainly why I agreed that ZONE_CMA should be
better than the current implementation, and I'm skeptical about the
sticky-movable pageblock idea. Note the conversion to node-lru reclaim
has changed things somewhat, as we can't reclaim a single zone anymore,
but the accounting troubles remain.