Hi,
I have posted a crude RFC for this rework [1] and there didn't seem to be
any objections, so I have split the patch up into smaller chunks which will
hopefully make the review easier.
Motivation:
Movable onlining is a real hack with many downsides - mainly the
reintroduction of the lowmem/highmem issues we used to have on 32b systems -
but it is the only way to make memory hotremove more reliable, which is
something that people are asking for.
The current semantic of movable memory onlining is really cumbersome,
however. The main reason for this is that the udev driven approach is
basically unusable because udev races with the memory probing, while only
the last memory block or the one adjacent to the existing zone_movable
is allowed to be onlined movable. In short, the criterion for a
successful online_movable changes under udev's feet. A reliable udev
approach would require a two-phase scheme where the first successful
movable online would have to check all the previous blocks and online
them in descending order. This can hardly be considered sane.
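For illustration only, a hypothetical handler trying to be reliable under
the current semantic would have to do something along the lines of the
following sketch (the script name and ordering trick are purely
illustrative, this is not part of the series):
# cat online_movable_all.sh
#!/bin/sh
# re-walk all known memory blocks and online them movable in descending
# order, because only the last block (or the one adjacent to zone_movable)
# can be onlined movable at any given time
for mem in $(ls -d /sys/devices/system/memory/memory* | sort -rV); do
	grep -q offline $mem/state && echo online_movable > $mem/state
done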
This patchset aims at making the onlining semantic more usable. First of
all, it allows onlining memory as movable as long as it doesn't clash with
the existing ZONE_NORMAL. That means that ZONE_NORMAL and ZONE_MOVABLE
cannot overlap. Currently I preserve the original ordering semantic so
the kernel zone always precedes the movable zone, but I plan to remove this
restriction in the future because it is not really necessary.
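With this in place a simple-minded policy like "online all hotplugged
memory as movable" becomes feasible without any ordering tricks. A
hypothetical udev rule (just a sketch, not part of this series) could look
like:
SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online_movable"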
The series consists of 4 cleanup patches which should be ready to be
merged right away (unless I have missed something subtle, of course).
Patch 5 is the core of the change. In order to make it easier to review
I have tried to keep it as minimalistic as possible; the large code
removal is moved to patch 6.
I would appreciate it if the s390 folks could take a look at patch 4 and
the arch_add_memory change because I am not sure I have completely grokked
what they wanted to achieve there.
I have tested the patches in kvm:
# qemu-system-x86_64 -enable-kvm -monitor pty -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G ...
and then probed the additional memory by
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
Then I have used this simple script to probe memory blocks by hand:
# cat probe_memblock.sh
#!/bin/sh
BLOCK_NR=$1
echo $((0x100000000+$BLOCK_NR*(128<<20))) > /sys/devices/system/memory/probe
# for i in $(seq 10); do sh probe_memblock.sh $i; done
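(The 128<<20 offset in the script assumes the default 128MB memory block
size on x86_64; if in doubt, the block size can be double checked via
sysfs - the value below is what I would expect on this configuration:)
# cat /sys/devices/system/memory/block_size_bytes
8000000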
# grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Normal Movable
/sys/devices/system/memory/memory35/valid_zones:Normal Movable
/sys/devices/system/memory/memory36/valid_zones:Normal Movable
/sys/devices/system/memory/memory37/valid_zones:Normal Movable
/sys/devices/system/memory/memory38/valid_zones:Normal Movable
/sys/devices/system/memory/memory39/valid_zones:Normal Movable
The main difference from the original implementation is that all new
memblocks can initially be onlined both as kernel and as movable because
there is obviously no clash. For comparison, the original implementation
would have
/sys/devices/system/memory/memory33/valid_zones:Normal
/sys/devices/system/memory/memory34/valid_zones:Normal
/sys/devices/system/memory/memory35/valid_zones:Normal
/sys/devices/system/memory/memory36/valid_zones:Normal
/sys/devices/system/memory/memory37/valid_zones:Normal
/sys/devices/system/memory/memory38/valid_zones:Normal
/sys/devices/system/memory/memory39/valid_zones:Normal Movable
Now
# echo online_movable > /sys/devices/system/memory/memory34/state
# grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable
/sys/devices/system/memory/memory35/valid_zones:Movable
/sys/devices/system/memory/memory36/valid_zones:Movable
/sys/devices/system/memory/memory37/valid_zones:Movable
/sys/devices/system/memory/memory38/valid_zones:Movable
/sys/devices/system/memory/memory39/valid_zones:Movable
Block 33 can still be onlined both as kernel and as movable while all
the remaining blocks can only be onlined movable.
/proc/zoneinfo says
Node 0, zone Normal
pages free 0
min 0
low 0
high 0
spanned 0
present 0
--
Node 0, zone Movable
pages free 32753
min 85
low 117
high 149
spanned 32768
present 32768
Probing a memblock at a lower address will result in a new memory block (32)
which will still allow both Normal and Movable.
# sh probe_memblock.sh 0
# grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
/sys/devices/system/memory/memory32/valid_zones:Normal Movable
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable
/sys/devices/system/memory/memory35/valid_zones:Movable
and online_kernel will properly convert it to ZONE_NORMAL
while 33 can still be onlined both ways.
# echo online_kernel > /sys/devices/system/memory/memory32/state
# grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable
/sys/devices/system/memory/memory35/valid_zones:Movable
/proc/zoneinfo will now show
Node 0, zone Normal
pages free 65441
min 165
low 230
high 295
spanned 65536
present 65536
--
Node 0, zone Movable
pages free 32740
min 82
low 114
high 146
spanned 32768
present 32768
so both zones have one memblock spanned and present.
Onlining 39 should associate this block with the movable zone.
# echo online > /sys/devices/system/memory/memory39/state
/proc/zoneinfo will now show
Node 0, zone Normal
pages free 32765
min 80
low 112
high 144
spanned 32768
present 32768
--
Node 0, zone Movable
pages free 65501
min 160
low 225
high 290
spanned 196608
present 65536
so we will have a movable zone which spans 6 memblocks, 2 present and 4
representing a hole.
Offlining both movable blocks will lead to a zone with no present
pages, which is the expected behavior, I believe.
# echo offline > /sys/devices/system/memory/memory39/state
# echo offline > /sys/devices/system/memory/memory34/state
# grep -A6 "Movable\|Normal" /proc/zoneinfo
Node 0, zone Normal
pages free 32735
min 90
low 122
high 154
spanned 32768
present 32768
--
Node 0, zone Movable
pages free 0
min 0
low 0
high 0
spanned 196608
present 0
Any thoughts, complaints, suggestions?
As a bonus we will get a nice cleanup in the memory hotplug codebase:
arch/ia64/mm/init.c | 10 +-
arch/powerpc/mm/mem.c | 12 +-
arch/s390/mm/init.c | 32 +---
arch/sh/mm/init.c | 10 +-
arch/tile/mm/init.c | 30 ---
arch/x86/mm/init_32.c | 7 +-
arch/x86/mm/init_64.c | 11 +-
drivers/base/memory.c | 52 ++---
include/linux/memory_hotplug.h | 17 +-
include/linux/mmzone.h | 13 +-
kernel/memremap.c | 5 +-
mm/memory_hotplug.c | 423 ++++++++++++-----------------------------
mm/page_alloc.c | 11 +-
mm/sparse.c | 3 +-
14 files changed, 186 insertions(+), 450 deletions(-)
Shortlog says:
Michal Hocko (6):
mm: get rid of zone_is_initialized
mm, tile: drop arch_{add,remove}_memory
mm: remove return value from init_currently_empty_zone
mm, memory_hotplug: use node instead of zone in can_online_high_movable
mm, memory_hotplug: do not associate hotadded memory to zones until online
mm, memory_hotplug: remove unused cruft after memory hotplug rework
[1] http://lkml.kernel.org/r/[email protected]
From: Michal Hocko <[email protected]>
The primary purpose of this helper is to query the node state, so use
the node id directly. This is a preparatory patch for later changes.
Signed-off-by: Michal Hocko <[email protected]>
---
mm/memory_hotplug.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 056dbbe6d20e..221f622bcc88 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -936,15 +936,15 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
* When CONFIG_MOVABLE_NODE, we permit onlining of a node which doesn't have
* normal memory.
*/
-static bool can_online_high_movable(struct zone *zone)
+static bool can_online_high_movable(int nid)
{
return true;
}
#else /* CONFIG_MOVABLE_NODE */
/* ensure every online node has NORMAL memory */
-static bool can_online_high_movable(struct zone *zone)
+static bool can_online_high_movable(int nid)
{
- return node_state(zone_to_nid(zone), N_NORMAL_MEMORY);
+ return node_state(nid, N_NORMAL_MEMORY);
}
#endif /* CONFIG_MOVABLE_NODE */
@@ -1078,7 +1078,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
if ((zone_idx(zone) > ZONE_NORMAL ||
online_type == MMOP_ONLINE_MOVABLE) &&
- !can_online_high_movable(zone))
+ !can_online_high_movable(pfn_to_nid(pfn)))
return -EINVAL;
if (online_type == MMOP_ONLINE_KERNEL) {
--
2.11.0
From: Michal Hocko <[email protected]>
arch_add_memory doesn't need the for_device parameter anymore because
devm_memremap_pages already does everything it needs to.
zone_for_memory doesn't have any users anymore, nor does the whole zone
shifting infrastructure, so drop them as well.
Signed-off-by: Michal Hocko <[email protected]>
---
arch/ia64/mm/init.c | 2 +-
arch/powerpc/mm/mem.c | 2 +-
arch/s390/mm/init.c | 2 +-
arch/sh/mm/init.c | 3 +-
arch/x86/mm/init_32.c | 2 +-
arch/x86/mm/init_64.c | 2 +-
include/linux/memory_hotplug.h | 4 +-
kernel/memremap.c | 2 +-
mm/memory_hotplug.c | 209 +----------------------------------------
9 files changed, 9 insertions(+), 219 deletions(-)
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 0fb7f3946785..6ebb570f1eae 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -645,7 +645,7 @@ mem_init (void)
}
#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index db1369a7f69f..707a4146938b 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -126,7 +126,7 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
return -ENODEV;
}
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 9e8c515ee29f..7d7591b63b57 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -153,7 +153,7 @@ void __init free_initrd_mem(unsigned long start, unsigned long end)
#endif
#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size)
{
unsigned long start_pfn = PFN_DOWN(start);
unsigned long size_pages = PFN_DOWN(size);
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 95261b66bcf3..4e4afa0ab3c3 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -485,13 +485,12 @@ void free_initrd_mem(unsigned long start, unsigned long end)
#endif
#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size)
{
unsigned long start_pfn = PFN_DOWN(start);
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
-
/* We only have ZONE_NORMAL, so this is easy.. */
ret = __add_pages(nid, start_pfn, nr_pages);
if (unlikely(ret))
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 5de79aa7d6ce..b389f2ec75f9 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -816,7 +816,7 @@ void __init mem_init(void)
}
#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index bc53f24e6703..20f575333b08 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -637,7 +637,7 @@ static void update_end_of_memory_vars(u64 start, u64 size)
}
}
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 63577ce57028..d5adb7c468d1 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -274,9 +274,7 @@ extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
void *arg, int (*func)(struct memory_block *, void *));
extern int add_memory(int nid, u64 start, u64 size);
extern int add_memory_resource(int nid, struct resource *resource, bool online);
-extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
- bool for_device);
-extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
+extern int arch_add_memory(int nid, u64 start, u64 size);
extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages);
extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 1b137649cb82..808f7974918a 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -366,7 +366,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
lock_device_hotplug();
mem_hotplug_begin();
- error = arch_add_memory(nid, align_start, align_size, true);
+ error = arch_add_memory(nid, align_start, align_size);
if (!error)
move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
align_start, align_size);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d7efa0191bb7..b9dc1c4e26c3 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -295,180 +295,6 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
}
#endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
-static void __meminit grow_zone_span(struct zone *zone, unsigned long start_pfn,
- unsigned long end_pfn)
-{
- unsigned long old_zone_end_pfn;
-
- zone_span_writelock(zone);
-
- old_zone_end_pfn = zone_end_pfn(zone);
- if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn)
- zone->zone_start_pfn = start_pfn;
-
- zone->spanned_pages = max(old_zone_end_pfn, end_pfn) -
- zone->zone_start_pfn;
-
- zone_span_writeunlock(zone);
-}
-
-static void resize_zone(struct zone *zone, unsigned long start_pfn,
- unsigned long end_pfn)
-{
- zone_span_writelock(zone);
-
- if (end_pfn - start_pfn) {
- zone->zone_start_pfn = start_pfn;
- zone->spanned_pages = end_pfn - start_pfn;
- } else {
- /*
- * make it consist as free_area_init_core(),
- * if spanned_pages = 0, then keep start_pfn = 0
- */
- zone->zone_start_pfn = 0;
- zone->spanned_pages = 0;
- }
-
- zone_span_writeunlock(zone);
-}
-
-static void fix_zone_id(struct zone *zone, unsigned long start_pfn,
- unsigned long end_pfn)
-{
- enum zone_type zid = zone_idx(zone);
- int nid = zone->zone_pgdat->node_id;
- unsigned long pfn;
-
- for (pfn = start_pfn; pfn < end_pfn; pfn++)
- set_page_links(pfn_to_page(pfn), zid, nid, pfn);
-}
-
-static void __ref ensure_zone_is_initialized(struct zone *zone,
- unsigned long start_pfn, unsigned long num_pages)
-{
- if (!zone_is_empty(zone))
- init_currently_empty_zone(zone, start_pfn, num_pages);
-}
-
-static int __meminit move_pfn_range_left(struct zone *z1, struct zone *z2,
- unsigned long start_pfn, unsigned long end_pfn)
-{
- unsigned long flags;
- unsigned long z1_start_pfn;
-
- ensure_zone_is_initialized(z1, start_pfn, end_pfn - start_pfn);
-
- pgdat_resize_lock(z1->zone_pgdat, &flags);
-
- /* can't move pfns which are higher than @z2 */
- if (end_pfn > zone_end_pfn(z2))
- goto out_fail;
- /* the move out part must be at the left most of @z2 */
- if (start_pfn > z2->zone_start_pfn)
- goto out_fail;
- /* must included/overlap */
- if (end_pfn <= z2->zone_start_pfn)
- goto out_fail;
-
- /* use start_pfn for z1's start_pfn if z1 is empty */
- if (!zone_is_empty(z1))
- z1_start_pfn = z1->zone_start_pfn;
- else
- z1_start_pfn = start_pfn;
-
- resize_zone(z1, z1_start_pfn, end_pfn);
- resize_zone(z2, end_pfn, zone_end_pfn(z2));
-
- pgdat_resize_unlock(z1->zone_pgdat, &flags);
-
- fix_zone_id(z1, start_pfn, end_pfn);
-
- return 0;
-out_fail:
- pgdat_resize_unlock(z1->zone_pgdat, &flags);
- return -1;
-}
-
-static int __meminit move_pfn_range_right(struct zone *z1, struct zone *z2,
- unsigned long start_pfn, unsigned long end_pfn)
-{
- unsigned long flags;
- unsigned long z2_end_pfn;
-
- ensure_zone_is_initialized(z2, start_pfn, end_pfn - start_pfn);
-
- pgdat_resize_lock(z1->zone_pgdat, &flags);
-
- /* can't move pfns which are lower than @z1 */
- if (z1->zone_start_pfn > start_pfn)
- goto out_fail;
- /* the move out part mast at the right most of @z1 */
- if (zone_end_pfn(z1) > end_pfn)
- goto out_fail;
- /* must included/overlap */
- if (start_pfn >= zone_end_pfn(z1))
- goto out_fail;
-
- /* use end_pfn for z2's end_pfn if z2 is empty */
- if (!zone_is_empty(z2))
- z2_end_pfn = zone_end_pfn(z2);
- else
- z2_end_pfn = end_pfn;
-
- resize_zone(z1, z1->zone_start_pfn, start_pfn);
- resize_zone(z2, start_pfn, z2_end_pfn);
-
- pgdat_resize_unlock(z1->zone_pgdat, &flags);
-
- fix_zone_id(z2, start_pfn, end_pfn);
-
- return 0;
-out_fail:
- pgdat_resize_unlock(z1->zone_pgdat, &flags);
- return -1;
-}
-
-static void __meminit grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
- unsigned long end_pfn)
-{
- unsigned long old_pgdat_end_pfn = pgdat_end_pfn(pgdat);
-
- if (!pgdat->node_spanned_pages || start_pfn < pgdat->node_start_pfn)
- pgdat->node_start_pfn = start_pfn;
-
- pgdat->node_spanned_pages = max(old_pgdat_end_pfn, end_pfn) -
- pgdat->node_start_pfn;
-}
-
-static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn)
-{
- struct pglist_data *pgdat = zone->zone_pgdat;
- int nr_pages = PAGES_PER_SECTION;
- int nid = pgdat->node_id;
- int zone_type;
- unsigned long flags, pfn;
-
- zone_type = zone - pgdat->node_zones;
- ensure_zone_is_initialized(zone, phys_start_pfn, nr_pages);
-
- pgdat_resize_lock(zone->zone_pgdat, &flags);
- grow_zone_span(zone, phys_start_pfn, phys_start_pfn + nr_pages);
- grow_pgdat_span(zone->zone_pgdat, phys_start_pfn,
- phys_start_pfn + nr_pages);
- pgdat_resize_unlock(zone->zone_pgdat, &flags);
- memmap_init_zone(nr_pages, nid, zone_type,
- phys_start_pfn, MEMMAP_HOTPLUG);
-
- /* online_page_range is called later and expects pages reserved */
- for (pfn = phys_start_pfn; pfn < phys_start_pfn + nr_pages; pfn++) {
- if (!pfn_valid(pfn))
- continue;
-
- SetPageReserved(pfn_to_page(pfn));
- }
- return 0;
-}
-
static int __meminit __add_section(int nid, unsigned long phys_start_pfn)
{
int ret;
@@ -1332,39 +1158,6 @@ static int check_hotplug_memory_range(u64 start, u64 size)
return 0;
}
-/*
- * If movable zone has already been setup, newly added memory should be check.
- * If its address is higher than movable zone, it should be added as movable.
- * Without this check, movable zone may overlap with other zone.
- */
-static int should_add_memory_movable(int nid, u64 start, u64 size)
-{
- unsigned long start_pfn = start >> PAGE_SHIFT;
- pg_data_t *pgdat = NODE_DATA(nid);
- struct zone *movable_zone = pgdat->node_zones + ZONE_MOVABLE;
-
- if (zone_is_empty(movable_zone))
- return 0;
-
- if (movable_zone->zone_start_pfn <= start_pfn)
- return 1;
-
- return 0;
-}
-
-int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
- bool for_device)
-{
-#ifdef CONFIG_ZONE_DEVICE
- if (for_device)
- return ZONE_DEVICE;
-#endif
- if (should_add_memory_movable(nid, start, size))
- return ZONE_MOVABLE;
-
- return zone_default;
-}
-
static int online_memory_block(struct memory_block *mem, void *arg)
{
return device_online(&mem->dev);
@@ -1410,7 +1203,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
}
/* call arch's memory hotadd */
- ret = arch_add_memory(nid, start, size, false);
+ ret = arch_add_memory(nid, start, size);
if (ret < 0)
goto error;
--
2.11.0
From: Michal Hocko <[email protected]>
There shouldn't be any reason to keep a separate initialized flag when we
can tell the same thing by checking whether there are any pages spanned by
the zone. Remove zone_is_initialized() and replace it with zone_is_empty(),
which can be used for the same set of tests.
This shouldn't have any visible effect.
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/mmzone.h | 11 +++++------
mm/memory_hotplug.c | 6 +++---
mm/page_alloc.c | 5 +----
3 files changed, 9 insertions(+), 13 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 618499159a7c..dbe3b32fe85d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -442,8 +442,6 @@ struct zone {
seqlock_t span_seqlock;
#endif
- int initialized;
-
/* Write-intensive fields used from the page allocator */
ZONE_PADDING(_pad1_)
@@ -520,14 +518,15 @@ static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
}
-static inline bool zone_is_initialized(struct zone *zone)
+static inline bool zone_is_empty(struct zone *zone)
{
- return zone->initialized;
+ return zone->spanned_pages == 0;
}
-static inline bool zone_is_empty(struct zone *zone)
+static inline bool zone_spans_range(const struct zone *zone, unsigned long start_pfn,
+ unsigned long nr_pages)
{
- return zone->spanned_pages == 0;
+ return zone->zone_start_pfn <= start_pfn && start_pfn + nr_pages < zone_end_pfn(zone);
}
/*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6fb6bd2df787..699f5a2a8efd 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -348,7 +348,7 @@ static void fix_zone_id(struct zone *zone, unsigned long start_pfn,
static int __ref ensure_zone_is_initialized(struct zone *zone,
unsigned long start_pfn, unsigned long num_pages)
{
- if (!zone_is_initialized(zone))
+ if (zone_is_empty(zone))
return init_currently_empty_zone(zone, start_pfn, num_pages);
return 0;
@@ -1051,7 +1051,7 @@ bool zone_can_shift(unsigned long pfn, unsigned long nr_pages,
/* no zones in use between current zone and target */
for (i = idx + 1; i < target; i++)
- if (zone_is_initialized(zone - idx + i))
+ if (!zone_is_empty(zone - idx + i))
return false;
}
@@ -1062,7 +1062,7 @@ bool zone_can_shift(unsigned long pfn, unsigned long nr_pages,
/* no zones in use between current zone and target */
for (i = target + 1; i < idx; i++)
- if (zone_is_initialized(zone - idx + i))
+ if (!zone_is_empty(zone - idx + i))
return false;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5ee8a26fa383..af58b51c5897 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -795,7 +795,7 @@ static inline void __free_one_page(struct page *page,
max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
- VM_BUG_ON(!zone_is_initialized(zone));
+ VM_BUG_ON(zone_is_empty(zone));
VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
VM_BUG_ON(migratetype == -1);
@@ -5535,9 +5535,6 @@ int __meminit init_currently_empty_zone(struct zone *zone,
zone_start_pfn, (zone_start_pfn + size));
zone_init_free_lists(zone);
- zone->initialized = 1;
-
- return 0;
}
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
--
2.11.0
From: Michal Hocko <[email protected]>
init_currently_empty_zone doesn't have any error to return, yet it is
still declared as returning int and callers defensively try to handle a
potential error. Remove this nonsense and simplify all the callers.
This patch shouldn't have any visible effect.
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/mmzone.h | 2 +-
mm/memory_hotplug.c | 25 ++++++-------------------
mm/page_alloc.c | 6 ++----
3 files changed, 9 insertions(+), 24 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index dbe3b32fe85d..c86c78617d17 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -768,7 +768,7 @@ enum memmap_context {
MEMMAP_EARLY,
MEMMAP_HOTPLUG,
};
-extern int init_currently_empty_zone(struct zone *zone, unsigned long start_pfn,
+extern void init_currently_empty_zone(struct zone *zone, unsigned long start_pfn,
unsigned long size);
extern void lruvec_init(struct lruvec *lruvec);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 699f5a2a8efd..056dbbe6d20e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -343,27 +343,20 @@ static void fix_zone_id(struct zone *zone, unsigned long start_pfn,
set_page_links(pfn_to_page(pfn), zid, nid, pfn);
}
-/* Can fail with -ENOMEM from allocating a wait table with vmalloc() or
- * alloc_bootmem_node_nopanic()/memblock_virt_alloc_node_nopanic() */
-static int __ref ensure_zone_is_initialized(struct zone *zone,
+static void __ref ensure_zone_is_initialized(struct zone *zone,
unsigned long start_pfn, unsigned long num_pages)
{
- if (zone_is_empty(zone))
- return init_currently_empty_zone(zone, start_pfn, num_pages);
-
- return 0;
+ if (!zone_is_empty(zone))
+ init_currently_empty_zone(zone, start_pfn, num_pages);
}
static int __meminit move_pfn_range_left(struct zone *z1, struct zone *z2,
unsigned long start_pfn, unsigned long end_pfn)
{
- int ret;
unsigned long flags;
unsigned long z1_start_pfn;
- ret = ensure_zone_is_initialized(z1, start_pfn, end_pfn - start_pfn);
- if (ret)
- return ret;
+ ensure_zone_is_initialized(z1, start_pfn, end_pfn - start_pfn);
pgdat_resize_lock(z1->zone_pgdat, &flags);
@@ -399,13 +392,10 @@ static int __meminit move_pfn_range_left(struct zone *z1, struct zone *z2,
static int __meminit move_pfn_range_right(struct zone *z1, struct zone *z2,
unsigned long start_pfn, unsigned long end_pfn)
{
- int ret;
unsigned long flags;
unsigned long z2_end_pfn;
- ret = ensure_zone_is_initialized(z2, start_pfn, end_pfn - start_pfn);
- if (ret)
- return ret;
+ ensure_zone_is_initialized(z2, start_pfn, end_pfn - start_pfn);
pgdat_resize_lock(z1->zone_pgdat, &flags);
@@ -476,12 +466,9 @@ static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn)
int nid = pgdat->node_id;
int zone_type;
unsigned long flags, pfn;
- int ret;
zone_type = zone - pgdat->node_zones;
- ret = ensure_zone_is_initialized(zone, phys_start_pfn, nr_pages);
- if (ret)
- return ret;
+ ensure_zone_is_initialized(zone, phys_start_pfn, nr_pages);
pgdat_resize_lock(zone->zone_pgdat, &flags);
grow_zone_span(zone, phys_start_pfn, phys_start_pfn + nr_pages);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index af58b51c5897..c6127f1a62e9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5518,7 +5518,7 @@ static __meminit void zone_pcp_init(struct zone *zone)
zone_batchsize(zone));
}
-int __meminit init_currently_empty_zone(struct zone *zone,
+void __meminit init_currently_empty_zone(struct zone *zone,
unsigned long zone_start_pfn,
unsigned long size)
{
@@ -5997,7 +5997,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
{
enum zone_type j;
int nid = pgdat->node_id;
- int ret;
pgdat_resize_init(pgdat);
#ifdef CONFIG_NUMA_BALANCING
@@ -6079,8 +6078,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
set_pageblock_order();
setup_usemap(pgdat, zone, zone_start_pfn, size);
- ret = init_currently_empty_zone(zone, zone_start_pfn, size);
- BUG_ON(ret);
+ init_currently_empty_zone(zone, zone_start_pfn, size);
memmap_init(size, nid, j, zone_start_pfn);
}
}
--
2.11.0
From: Michal Hocko <[email protected]>
The current memory hotplug implementation relies on having all the
struct pages associated with a zone during the physical hotplug phase
(arch_add_memory->__add_pages->__add_section->__add_zone). In the vast
majority of cases this means that they are added to ZONE_NORMAL. This
has been the case since 9d99aaa31f59 ("[PATCH] x86_64: Support memory hotadd
without sparsemem") and it wasn't a big deal back then.
Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
onlining in 511c2aba8f07 ("mm, memory-hotplug: dynamic configure movable
memory and portion memory") and then things got more complicated. Rather
than reconsidering the zone association, which was no longer needed
(because memory hotplug already depended on SPARSEMEM), a convoluted
semantic of zone shifting has been developed. Only the currently last
memblock or the one adjacent to zone_movable can be onlined movable.
This essentially means that the onlining semantics change as new memblocks
are added.
Let's simulate memory hot online manually - probing three memblocks one
after another and then onlining the last one movable:

Normal Movable

/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable

/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal
/sys/devices/system/memory/memory34/valid_zones:Normal Movable

/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable Normal
This is an awkward semantic because a udev event is sent as soon as the
block is onlined and a udev handler might want to online it based on
some policy (e.g. association with a node) but it will inherently race
with new blocks showing up.
This patch changes the physical online phase to not associate pages
with any zone at all. All the pages are just marked reserved and wait
for the onlining phase to be associated with the zone as per the online
request. There are only two requirements:
	- existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap
	- ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses
The latter is not inherent and can be changed in the future. It
preserves the current behavior and makes the code slightly simpler. This
is subject to change in the future.
This means that the same physical online steps as above will lead to the
following state:

Normal Movable

/sys/devices/system/memory/memory32/valid_zones:Normal Movable
/sys/devices/system/memory/memory33/valid_zones:Normal Movable

/sys/devices/system/memory/memory32/valid_zones:Normal Movable
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Normal Movable

/sys/devices/system/memory/memory32/valid_zones:Normal Movable
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable
Implementation:
The current move_pfn_range is reimplemented to check the above
requirements (allow_online_pfn_range) and then update the respective
zone (move_pfn_range_to_zone), update the pgdat and link all the pages in
the pfn range with the zone/node. __add_pages is updated to not require
the zone and only initializes sections in the range. This allows
simplifying the arch_add_memory code (s390 could get rid of quite some
code).
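For clarity, the two onlining constraints described above boil down to
roughly the following checks - a simplified sketch of what
allow_online_pfn_range below implements, with purely illustrative helper
names:

/* sketch only: may the range be onlined to ZONE_NORMAL? */
static bool range_can_online_kernel(struct zone *movable_zone,
		unsigned long pfn, unsigned long nr_pages)
{
	/* ZONE_NORMAL may only grow below the start of ZONE_MOVABLE */
	return !populated_zone(movable_zone) ||
		movable_zone->zone_start_pfn >= pfn + nr_pages;
}

/* sketch only: may the range be onlined to ZONE_MOVABLE? */
static bool range_can_online_movable(struct zone *normal_zone,
		unsigned long pfn, unsigned long nr_pages)
{
	/* ZONE_MOVABLE may only grow above the end of ZONE_NORMAL */
	return zone_end_pfn(normal_zone) <= pfn;
}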
devm_memremap_pages is the only user of arch_add_memory which relies
on the zone association because it only hooks into memory hotplug
half way. It uses it to associate the new memory with ZONE_DEVICE
but doesn't allow it to be {on,off}lined via sysfs. This means that this
particular code path has to call move_pfn_range_to_zone explicitly.
The original zone shifting code is kept in place and will be removed in
a follow up patch for an easier review.
Cc: Lai Jiangshan <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: [email protected]
Signed-off-by: Michal Hocko <[email protected]>
---
arch/ia64/mm/init.c | 8 +-
arch/powerpc/mm/mem.c | 10 +--
arch/s390/mm/init.c | 30 +------
arch/sh/mm/init.c | 7 +-
arch/x86/mm/init_32.c | 5 +-
arch/x86/mm/init_64.c | 9 +-
drivers/base/memory.c | 52 ++++++-----
include/linux/memory_hotplug.h | 13 +--
kernel/memremap.c | 3 +
mm/memory_hotplug.c | 195 +++++++++++++++++++++++++----------------
mm/sparse.c | 3 +-
11 files changed, 165 insertions(+), 170 deletions(-)
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 06cdaef54b2e..0fb7f3946785 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -647,17 +647,11 @@ mem_init (void)
#ifdef CONFIG_MEMORY_HOTPLUG
int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
{
- pg_data_t *pgdat;
- struct zone *zone;
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
- pgdat = NODE_DATA(nid);
-
- zone = pgdat->node_zones +
- zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
- ret = __add_pages(nid, zone, start_pfn, nr_pages);
+ ret = __add_pages(nid, start_pfn, nr_pages);
if (ret)
printk("%s: Problem encountered in __add_pages() as ret=%d\n",
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 5f844337de21..db1369a7f69f 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -128,14 +128,10 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
{
- struct pglist_data *pgdata;
- struct zone *zone;
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
int rc;
- pgdata = NODE_DATA(nid);
-
start = (unsigned long)__va(start);
rc = create_section_mapping(start, start + size);
if (rc) {
@@ -145,11 +141,7 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
return -EFAULT;
}
- /* this should work for most non-highmem platforms */
- zone = pgdata->node_zones +
- zone_for_memory(nid, start, size, 0, for_device);
-
- return __add_pages(nid, zone, start_pfn, nr_pages);
+ return __add_pages(nid, start_pfn, nr_pages);
}
#ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index bf5b8a0c4ff7..9e8c515ee29f 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -155,41 +155,15 @@ void __init free_initrd_mem(unsigned long start, unsigned long end)
#ifdef CONFIG_MEMORY_HOTPLUG
int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
{
- unsigned long zone_start_pfn, zone_end_pfn, nr_pages;
unsigned long start_pfn = PFN_DOWN(start);
unsigned long size_pages = PFN_DOWN(size);
- pg_data_t *pgdat = NODE_DATA(nid);
- struct zone *zone;
- int rc, i;
+ int rc;
rc = vmem_add_mapping(start, size);
if (rc)
return rc;
- for (i = 0; i < MAX_NR_ZONES; i++) {
- zone = pgdat->node_zones + i;
- if (zone_idx(zone) != ZONE_MOVABLE) {
- /* Add range within existing zone limits, if possible */
- zone_start_pfn = zone->zone_start_pfn;
- zone_end_pfn = zone->zone_start_pfn +
- zone->spanned_pages;
- } else {
- /* Add remaining range to ZONE_MOVABLE */
- zone_start_pfn = start_pfn;
- zone_end_pfn = start_pfn + size_pages;
- }
- if (start_pfn < zone_start_pfn || start_pfn >= zone_end_pfn)
- continue;
- nr_pages = (start_pfn + size_pages > zone_end_pfn) ?
- zone_end_pfn - start_pfn : size_pages;
- rc = __add_pages(nid, zone, start_pfn, nr_pages);
- if (rc)
- break;
- start_pfn += nr_pages;
- size_pages -= nr_pages;
- if (!size_pages)
- break;
- }
+ rc = __add_pages(nid, start_pfn, size_pages);
if (rc)
vmem_remove_mapping(start, size);
return rc;
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 75491862d900..95261b66bcf3 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -487,18 +487,13 @@ void free_initrd_mem(unsigned long start, unsigned long end)
#ifdef CONFIG_MEMORY_HOTPLUG
int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
{
- pg_data_t *pgdat;
unsigned long start_pfn = PFN_DOWN(start);
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
- pgdat = NODE_DATA(nid);
/* We only have ZONE_NORMAL, so this is easy.. */
- ret = __add_pages(nid, pgdat->node_zones +
- zone_for_memory(nid, start, size, ZONE_NORMAL,
- for_device),
- start_pfn, nr_pages);
+ ret = __add_pages(nid, start_pfn, nr_pages);
if (unlikely(ret))
printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 928cfde76232..5de79aa7d6ce 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -818,13 +818,10 @@ void __init mem_init(void)
#ifdef CONFIG_MEMORY_HOTPLUG
int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
{
- struct pglist_data *pgdata = NODE_DATA(nid);
- struct zone *zone = pgdata->node_zones +
- zone_for_memory(nid, start, size, ZONE_HIGHMEM, for_device);
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
- return __add_pages(nid, zone, start_pfn, nr_pages);
+ return __add_pages(nid, start_pfn, nr_pages);
}
#ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 7eef17239378..bc53f24e6703 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -637,22 +637,15 @@ static void update_end_of_memory_vars(u64 start, u64 size)
}
}
-/*
- * Memory is added always to NORMAL zone. This means you will never get
- * additional DMA/DMA32 memory.
- */
int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
{
- struct pglist_data *pgdat = NODE_DATA(nid);
- struct zone *zone = pgdat->node_zones +
- zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
init_memory_mapping(start, start + size);
- ret = __add_pages(nid, zone, start_pfn, nr_pages);
+ ret = __add_pages(nid, start_pfn, nr_pages);
WARN_ON_ONCE(ret);
/* update max_pfn, max_low_pfn and high_memory */
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index cc4f1d0cbffe..f516cf597258 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -388,39 +388,43 @@ static ssize_t show_valid_zones(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct memory_block *mem = to_memory_block(dev);
- unsigned long start_pfn, end_pfn;
- unsigned long valid_start, valid_end, valid_pages;
+ unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
- struct zone *zone;
- int zone_shift = 0;
+ unsigned long valid_start_pfn, valid_end_pfn;
+ bool append = false;
+ int nid;
- start_pfn = section_nr_to_pfn(mem->start_section_nr);
- end_pfn = start_pfn + nr_pages;
-
- /* The block contains more than one zone can not be offlined. */
- if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start, &valid_end))
+ /*
+ * The block contains more than one zone can not be offlined.
+ * This can happen e.g. for ZONE_DMA and ZONE_DMA32
+ */
+ if (!test_pages_in_a_zone(start_pfn, start_pfn + nr_pages, &valid_start_pfn, &valid_end_pfn))
return sprintf(buf, "none\n");
- zone = page_zone(pfn_to_page(valid_start));
- valid_pages = valid_end - valid_start;
-
- /* MMOP_ONLINE_KEEP */
- sprintf(buf, "%s", zone->name);
+ start_pfn = valid_start_pfn;
+ nr_pages = valid_end_pfn - valid_end_pfn;
- /* MMOP_ONLINE_KERNEL */
- zone_can_shift(valid_start, valid_pages, ZONE_NORMAL, &zone_shift);
- if (zone_shift) {
- strcat(buf, " ");
- strcat(buf, (zone + zone_shift)->name);
+ /*
+ * Check the existing zone. Make sure that we do that only on the
+ * online nodes otherwise the page_zone is not reliable
+ */
+ if (mem->state == MEM_ONLINE) {
+ strcat(buf, page_zone(pfn_to_page(start_pfn))->name);
+ goto out;
}
- /* MMOP_ONLINE_MOVABLE */
- zone_can_shift(valid_start, valid_pages, ZONE_MOVABLE, &zone_shift);
- if (zone_shift) {
- strcat(buf, " ");
- strcat(buf, (zone + zone_shift)->name);
+ nid = pfn_to_nid(start_pfn);
+ if (allow_online_pfn_range(nid, start_pfn, nr_pages, MMOP_ONLINE_KERNEL)) {
+ strcat(buf, NODE_DATA(nid)->node_zones[ZONE_NORMAL].name);
+ append = true;
}
+ if (allow_online_pfn_range(nid, start_pfn, nr_pages, MMOP_ONLINE_MOVABLE)) {
+ if (append)
+ strcat(buf, " ");
+ strcat(buf, NODE_DATA(nid)->node_zones[ZONE_MOVABLE].name);
+ }
+out:
strcat(buf, "\n");
return strlen(buf);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 134a2f69c21a..63577ce57028 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -109,8 +109,8 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages);
#endif /* CONFIG_MEMORY_HOTREMOVE */
-/* reasonably generic interface to expand the physical pages in a zone */
-extern int __add_pages(int nid, struct zone *zone, unsigned long start_pfn,
+/* reasonably generic interface to expand the physical pages */
+extern int __add_pages(int nid, unsigned long start_pfn,
unsigned long nr_pages);
#ifdef CONFIG_NUMA
@@ -277,15 +277,16 @@ extern int add_memory_resource(int nid, struct resource *resource, bool online);
extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
bool for_device);
extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
+extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
+ unsigned long nr_pages);
extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
extern bool is_memblock_offlined(struct memory_block *mem);
extern void remove_memory(int nid, u64 start, u64 size);
-extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn);
+extern int sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn);
extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
unsigned long map_offset);
extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
unsigned long pnum);
-extern bool zone_can_shift(unsigned long pfn, unsigned long nr_pages,
- enum zone_type target, int *zone_shift);
-
+extern bool allow_online_pfn_range(int nid, unsigned long pfn, unsigned long nr_pages,
+ int online_type);
#endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 06123234f118..1b137649cb82 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -367,6 +367,9 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
lock_device_hotplug();
mem_hotplug_begin();
error = arch_add_memory(nid, align_start, align_size, true);
+ if (!error)
+ move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
+ align_start, align_size);
mem_hotplug_done();
unlock_device_hotplug();
if (error)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 221f622bcc88..d7efa0191bb7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -428,25 +428,6 @@ static int __meminit move_pfn_range_right(struct zone *z1, struct zone *z2,
return -1;
}
-static struct zone * __meminit move_pfn_range(int zone_shift,
- unsigned long start_pfn, unsigned long end_pfn)
-{
- struct zone *zone = page_zone(pfn_to_page(start_pfn));
- int ret = 0;
-
- if (zone_shift < 0)
- ret = move_pfn_range_left(zone + zone_shift, zone,
- start_pfn, end_pfn);
- else if (zone_shift)
- ret = move_pfn_range_right(zone, zone + zone_shift,
- start_pfn, end_pfn);
-
- if (ret)
- return NULL;
-
- return zone + zone_shift;
-}
-
static void __meminit grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
unsigned long end_pfn)
{
@@ -488,23 +469,29 @@ static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn)
return 0;
}
-static int __meminit __add_section(int nid, struct zone *zone,
- unsigned long phys_start_pfn)
+static int __meminit __add_section(int nid, unsigned long phys_start_pfn)
{
int ret;
+ int i;
if (pfn_valid(phys_start_pfn))
return -EEXIST;
- ret = sparse_add_one_section(zone, phys_start_pfn);
-
+ ret = sparse_add_one_section(NODE_DATA(nid), phys_start_pfn);
if (ret < 0)
return ret;
- ret = __add_zone(zone, phys_start_pfn);
+ /*
+ * Make all the pages reserved so that nobody will stumble over half
+ * initialized state.
+ */
+ for (i = 0; i < PAGES_PER_SECTION; i++) {
+ unsigned long pfn = phys_start_pfn + i;
+ if (!pfn_valid(pfn))
+ continue;
- if (ret < 0)
- return ret;
+ SetPageReserved(pfn_to_page(phys_start_pfn + i));
+ }
return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
}
@@ -515,7 +502,7 @@ static int __meminit __add_section(int nid, struct zone *zone,
* call this function after deciding the zone to which to
* add the new pages.
*/
-int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
+int __ref __add_pages(int nid, unsigned long phys_start_pfn,
unsigned long nr_pages)
{
unsigned long i;
@@ -523,8 +510,6 @@ int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
int start_sec, end_sec;
struct vmem_altmap *altmap;
- clear_zone_contiguous(zone);
-
/* during initialize mem_map, align hot-added range to section */
start_sec = pfn_to_section_nr(phys_start_pfn);
end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1);
@@ -544,7 +529,7 @@ int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
}
for (i = start_sec; i <= end_sec; i++) {
- err = __add_section(nid, zone, section_nr_to_pfn(i));
+ err = __add_section(nid, section_nr_to_pfn(i));
/*
* EEXIST is finally dealt with by ioresource collision
@@ -557,7 +542,6 @@ int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
}
vmemmap_populate_print_last();
out:
- set_zone_contiguous(zone);
return err;
}
EXPORT_SYMBOL_GPL(__add_pages);
@@ -1022,39 +1006,113 @@ static void node_states_set_node(int node, struct memory_notify *arg)
node_set_state(node, N_MEMORY);
}
-bool zone_can_shift(unsigned long pfn, unsigned long nr_pages,
- enum zone_type target, int *zone_shift)
+bool allow_online_pfn_range(int nid, unsigned long pfn, unsigned long nr_pages, int online_type)
{
- struct zone *zone = page_zone(pfn_to_page(pfn));
- enum zone_type idx = zone_idx(zone);
- int i;
+ struct pglist_data *pgdat = NODE_DATA(nid);
+ struct zone *movable_zone = &pgdat->node_zones[ZONE_MOVABLE];
+ struct zone *normal_zone = &pgdat->node_zones[ZONE_NORMAL];
- *zone_shift = 0;
+ /*
+ * TODO there shouldn't be any inherent reason to have ZONE_NORMAL
+ * physically before ZONE_MOVABLE. All we need is they do not
+ * overlap. Historically we didn't allow ZONE_NORMAL after ZONE_MOVABLE
+ * though so let's stick with it for simplicity for now.
+ * TODO make sure we do not overlap with ZONE_DEVICE
+ */
+ if (online_type == MMOP_ONLINE_KERNEL) {
+ if (!populated_zone(movable_zone))
+ return true;
+ return movable_zone->zone_start_pfn >= pfn + nr_pages;
+ } else if (online_type == MMOP_ONLINE_MOVABLE) {
+ return zone_end_pfn(normal_zone) <= pfn;
+ }
- if (idx < target) {
- /* pages must be at end of current zone */
- if (pfn + nr_pages != zone_end_pfn(zone))
- return false;
+ /* MMOP_ONLINE_KEEP will always succeed and inherits the current zone */
+ return online_type == MMOP_ONLINE_KEEP;
+}
+
+static void __meminit resize_zone_range(struct zone *zone, unsigned long start_pfn,
+ unsigned long nr_pages)
+{
+ unsigned long old_end_pfn = zone_end_pfn(zone);
+
+ if (start_pfn < zone->zone_start_pfn)
+ zone->zone_start_pfn = start_pfn;
+
+ zone->spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - zone->zone_start_pfn;
+}
+
+static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned long start_pfn,
+ unsigned long nr_pages)
+{
+ unsigned long old_end_pfn = pgdat_end_pfn(pgdat);
- /* no zones in use between current zone and target */
- for (i = idx + 1; i < target; i++)
- if (!zone_is_empty(zone - idx + i))
- return false;
+ if (start_pfn < pgdat->node_start_pfn)
+ pgdat->node_start_pfn = start_pfn;
+
+ pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn;
+}
+
+void move_pfn_range_to_zone(struct zone *zone,
+ unsigned long start_pfn, unsigned long nr_pages)
+{
+ struct pglist_data *pgdat = zone->zone_pgdat;
+ int nid = pgdat->node_id;
+ unsigned long flags;
+ unsigned long i;
+
+ if (zone_is_empty(zone))
+ init_currently_empty_zone(zone, start_pfn, nr_pages);
+
+ clear_zone_contiguous(zone);
+
+ /* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */
+ pgdat_resize_lock(pgdat, &flags);
+ zone_span_writelock(zone);
+ resize_zone_range(zone, start_pfn, nr_pages);
+ zone_span_writeunlock(zone);
+ resize_pgdat_range(pgdat, start_pfn, nr_pages);
+ pgdat_resize_unlock(pgdat, &flags);
+
+ /*
+ * TODO now we have a visible range of pages which are not associated
+ * with their zone properly. Not nice but set_pfnblock_flags_mask
+ * expects the zone spans the pfn range. All the pages in the range
+ * are reserved so nobody should be touching them so we should be safe
+ */
+ memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, MEMMAP_HOTPLUG);
+ for (i = 0; i < nr_pages; i++) {
+ unsigned long pfn = start_pfn + i;
+ set_page_links(pfn_to_page(pfn), zone_idx(zone), nid, pfn);
}
- if (target < idx) {
- /* pages must be at beginning of current zone */
- if (pfn != zone->zone_start_pfn)
- return false;
+ set_zone_contiguous(zone);
+}
- /* no zones in use between current zone and target */
- for (i = target + 1; i < idx; i++)
- if (!zone_is_empty(zone - idx + i))
- return false;
+/*
+ * Associates the given pfn range with the given node and the zone appropriate
+ * for the given online type.
+ */
+static struct zone * __meminit move_pfn_range(int online_type, int nid,
+ unsigned long start_pfn, unsigned long nr_pages)
+{
+ struct pglist_data *pgdat = NODE_DATA(nid);
+ struct zone *zone = &pgdat->node_zones[ZONE_NORMAL];
+
+ if (online_type == MMOP_ONLINE_KEEP) {
+ /*
+ * MMOP_ONLINE_KEEP inherits the current zone which is
+ * ZONE_NORMAL by default but we might be within ZONE_MOVABLE
+ * already.
+ */
+ if (allow_online_pfn_range(nid, start_pfn, nr_pages, MMOP_ONLINE_MOVABLE))
+ zone = &pgdat->node_zones[ZONE_MOVABLE];
+ } else if (online_type == MMOP_ONLINE_MOVABLE) {
+ zone = &pgdat->node_zones[ZONE_MOVABLE];
}
- *zone_shift = target - idx;
- return true;
+ move_pfn_range_to_zone(zone, start_pfn, nr_pages);
+ return zone;
}
/* Must be protected by mem_hotplug_begin() */
@@ -1067,29 +1125,16 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
int nid;
int ret;
struct memory_notify arg;
- int zone_shift = 0;
- /*
- * This doesn't need a lock to do pfn_to_page().
- * The section can't be removed here because of the
- * memory_block->state_mutex.
- */
- zone = page_zone(pfn_to_page(pfn));
-
- if ((zone_idx(zone) > ZONE_NORMAL ||
- online_type == MMOP_ONLINE_MOVABLE) &&
- !can_online_high_movable(pfn_to_nid(pfn)))
+ nid = pfn_to_nid(pfn);
+ if (!allow_online_pfn_range(nid, pfn, nr_pages, online_type))
return -EINVAL;
- if (online_type == MMOP_ONLINE_KERNEL) {
- if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))
- return -EINVAL;
- } else if (online_type == MMOP_ONLINE_MOVABLE) {
- if (!zone_can_shift(pfn, nr_pages, ZONE_MOVABLE, &zone_shift))
- return -EINVAL;
- }
+ if (online_type == MMOP_ONLINE_MOVABLE && !can_online_high_movable(nid))
+ return -EINVAL;
- zone = move_pfn_range(zone_shift, pfn, pfn + nr_pages);
+ /* associate pfn range with the zone */
+ zone = move_pfn_range(online_type, nid, pfn, nr_pages);
if (!zone)
return -EINVAL;
@@ -1097,8 +1142,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
arg.nr_pages = nr_pages;
node_states_check_changes_online(nr_pages, zone, &arg);
- nid = zone_to_nid(zone);
-
ret = memory_notify(MEM_GOING_ONLINE, &arg);
ret = notifier_to_errno(ret);
if (ret)
diff --git a/mm/sparse.c b/mm/sparse.c
index db6bf3c97ea2..94072c0d8952 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -689,10 +689,9 @@ static void free_map_bootmem(struct page *memmap)
* set. If this is <=0, then that means that the passed-in
* map was not consumed and must be freed.
*/
-int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn)
+int __meminit sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn)
{
unsigned long section_nr = pfn_to_section_nr(start_pfn);
- struct pglist_data *pgdat = zone->zone_pgdat;
struct mem_section *ms;
struct page *memmap;
unsigned long *usemap;
--
2.11.0
From: Michal Hocko <[email protected]>
These functions are unreachable because tile doesn't support memory
hotplug: it neither selects ARCH_ENABLE_MEMORY_HOTPLUG nor supports
SPARSEMEM.
This code obviously hasn't been compiled for a while, because nobody has
noticed that __add_pages has had a different signature since 2009.
Cc: Chris Metcalf <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
arch/tile/mm/init.c | 30 ------------------------------
1 file changed, 30 deletions(-)
diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c
index 3a97e4d7205c..5f757e04bcd2 100644
--- a/arch/tile/mm/init.c
+++ b/arch/tile/mm/init.c
@@ -857,36 +857,6 @@ void __init mem_init(void)
#endif
}
-/*
- * this is for the non-NUMA, single node SMP system case.
- * Specifically, in the case of x86, we will always add
- * memory to the highmem for now.
- */
-#ifndef CONFIG_NEED_MULTIPLE_NODES
-int arch_add_memory(u64 start, u64 size, bool for_device)
-{
- struct pglist_data *pgdata = &contig_page_data;
- struct zone *zone = pgdata->node_zones + MAX_NR_ZONES-1;
- unsigned long start_pfn = start >> PAGE_SHIFT;
- unsigned long nr_pages = size >> PAGE_SHIFT;
-
- return __add_pages(zone, start_pfn, nr_pages);
-}
-
-int remove_memory(u64 start, u64 size)
-{
- return -EINVAL;
-}
-
-#ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
-{
- /* TODO */
- return -EBUSY;
-}
-#endif
-#endif
-
struct kmem_cache *pgd_cache;
void __init pgtable_cache_init(void)
--
2.11.0
On 3/30/2017 7:54 AM, Michal Hocko wrote:
> From: Michal Hocko<[email protected]>
>
> these functions are unreachable because tile doesn't support memory
> hotplug becasuse it doesn't select ARCH_ENABLE_MEMORY_HOTPLUG nor
> it supports SPARSEMEM.
>
> This code hasn't been compiled for a while obviously because nobody has
> noticed that __add_pages has a different signature since 2009.
>
> Cc: Chris Metcalf<[email protected]>
> Signed-off-by: Michal Hocko<[email protected]>
> ---
> arch/tile/mm/init.c | 30 ------------------------------
> 1 file changed, 30 deletions(-)
Thanks - taken into the tile tree.
--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
On March 30, 2017 7:55 PM Michal Hocko wrote:
>
> @@ -5535,9 +5535,6 @@ int __meminit init_currently_empty_zone(struct zone *zone,
> zone_start_pfn, (zone_start_pfn + size));
>
> zone_init_free_lists(zone);
> - zone->initialized = 1;
> -
> - return 0;
> }
Nit: Add changes more than correct.
Hillf
On March 30, 2017 7:55 PM Michal Hocko wrote:
>
> From: Michal Hocko <[email protected]>
>
> init_currently_empty_zone doesn't have any error to return yet it is
> still an int and callers try to be defensive and try to handle potential
> error. Remove this nonsense and simplify all callers.
>
It is already cut off in 1/6 in this series?
<snip>
> -/* Can fail with -ENOMEM from allocating a wait table with vmalloc() or
> - * alloc_bootmem_node_nopanic()/memblock_virt_alloc_node_nopanic() */
> -static int __ref ensure_zone_is_initialized(struct zone *zone,
> +static void __ref ensure_zone_is_initialized(struct zone *zone,
> unsigned long start_pfn, unsigned long num_pages)
> {
> - if (zone_is_empty(zone))
> - return init_currently_empty_zone(zone, start_pfn, num_pages);
> -
> - return 0;
> + if (!zone_is_empty(zone))
> + init_currently_empty_zone(zone, start_pfn, num_pages);
> }
Semantic change added?
Hillf
On March 30, 2017 7:55 PM Michal Hocko wrote:
>
> +static void __meminit resize_zone_range(struct zone *zone, unsigned long start_pfn,
> + unsigned long nr_pages)
> +{
> + unsigned long old_end_pfn = zone_end_pfn(zone);
> +
> + if (start_pfn < zone->zone_start_pfn)
> + zone->zone_start_pfn = start_pfn;
> +
> + zone->spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - zone->zone_start_pfn;
> +}
The implementation above implies zone can only go bigger.
Can we resize zone with the given data?
btw, this mail address, Zhang Zhen <[email protected]> , is not reachable.
Hillf
On Fri 31-03-17 11:39:07, Hillf Danton wrote:
>
> On March 30, 2017 7:55 PM Michal Hocko wrote:
> >
> > @@ -5535,9 +5535,6 @@ int __meminit init_currently_empty_zone(struct zone *zone,
> > zone_start_pfn, (zone_start_pfn + size));
> >
> > zone_init_free_lists(zone);
> > - zone->initialized = 1;
> > -
> > - return 0;
> > }
> Nit: Add changes more than correct.
I am sorry, I do not follow?
--
Michal Hocko
SUSE Labs
On Fri 31-03-17 08:43:01, Michal Hocko wrote:
> On Fri 31-03-17 11:39:07, Hillf Danton wrote:
> >
> > On March 30, 2017 7:55 PM Michal Hocko wrote:
> > >
> > > @@ -5535,9 +5535,6 @@ int __meminit init_currently_empty_zone(struct zone *zone,
> > > zone_start_pfn, (zone_start_pfn + size));
> > >
> > > zone_init_free_lists(zone);
> > > - zone->initialized = 1;
> > > -
> > > - return 0;
> > > }
> > Nit: Add changes more than correct.
>
> I am sorry, I do not follow?
OK, got it. init_currently_empty_zone should be changed to void; that's a
screw up during the split of the initial patch. Will fix that up. Also,
zone_spans_range should go to patch 5 because it doesn't have any users
until then.
--
Michal Hocko
SUSE Labs
On Fri 31-03-17 11:49:49, Hillf Danton wrote:
[...]
> > -/* Can fail with -ENOMEM from allocating a wait table with vmalloc() or
> > - * alloc_bootmem_node_nopanic()/memblock_virt_alloc_node_nopanic() */
> > -static int __ref ensure_zone_is_initialized(struct zone *zone,
> > +static void __ref ensure_zone_is_initialized(struct zone *zone,
> > unsigned long start_pfn, unsigned long num_pages)
> > {
> > - if (zone_is_empty(zone))
> > - return init_currently_empty_zone(zone, start_pfn, num_pages);
> > -
> > - return 0;
> > + if (!zone_is_empty(zone))
> > + init_currently_empty_zone(zone, start_pfn, num_pages);
> > }
> Semantic change added?
could you be more specific?
--
Michal Hocko
SUSE Labs
On Fri 31-03-17 14:18:08, Hillf Danton wrote:
>
> On March 30, 2017 7:55 PM Michal Hocko wrote:
> >
> > +static void __meminit resize_zone_range(struct zone *zone, unsigned long start_pfn,
> > + unsigned long nr_pages)
> > +{
> > + unsigned long old_end_pfn = zone_end_pfn(zone);
> > +
> > + if (start_pfn < zone->zone_start_pfn)
> > + zone->zone_start_pfn = start_pfn;
> > +
> > + zone->spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - zone->zone_start_pfn;
> > +}
> The implementation above implies zone can only go bigger.
yes, we do not shrink zones currently and I see no point in doing that
right now.
> Can we resize zone with the given data?
Why couldn't we?
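To make the grow-only semantic concrete, here is a minimal userspace sketch
of the same arithmetic (the toy_zone type, the helpers and the pfn values
are made up for illustration; this is not the kernel code):

#include <stdio.h>

/* toy stand-ins for the struct zone fields used by resize_zone_range() */
struct toy_zone {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
};

static unsigned long toy_zone_end_pfn(const struct toy_zone *z)
{
	return z->zone_start_pfn + z->spanned_pages;
}

/* same arithmetic as the proposed resize_zone_range() */
static void toy_resize_zone_range(struct toy_zone *z, unsigned long start_pfn,
				  unsigned long nr_pages)
{
	unsigned long old_end_pfn = toy_zone_end_pfn(z);

	if (start_pfn < z->zone_start_pfn)
		z->zone_start_pfn = start_pfn;

	z->spanned_pages = (start_pfn + nr_pages > old_end_pfn ?
			start_pfn + nr_pages : old_end_pfn) - z->zone_start_pfn;
}

int main(void)
{
	struct toy_zone z = { .zone_start_pfn = 0x100, .spanned_pages = 0x100 };

	/* a range below the current span: the start moves down, the end stays */
	toy_resize_zone_range(&z, 0x80, 0x40);
	printf("[%#lx, %#lx)\n", z.zone_start_pfn, toy_zone_end_pfn(&z));

	/* a range inside the current span: nothing shrinks */
	toy_resize_zone_range(&z, 0x100, 0x10);
	printf("[%#lx, %#lx)\n", z.zone_start_pfn, toy_zone_end_pfn(&z));

	return 0;
}

The end pfn never moves down and the start pfn never moves up, so resizing
with any given range can only extend the span.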
--
Michal Hocko
SUSE Labs
On March 31, 2017 2:49 PM Michal Hocko wrote:
> On Fri 31-03-17 11:49:49, Hillf Danton wrote:
> [...]
> > > -/* Can fail with -ENOMEM from allocating a wait table with vmalloc() or
> > > - * alloc_bootmem_node_nopanic()/memblock_virt_alloc_node_nopanic() */
> > > -static int __ref ensure_zone_is_initialized(struct zone *zone,
> > > +static void __ref ensure_zone_is_initialized(struct zone *zone,
> > > unsigned long start_pfn, unsigned long num_pages)
> > > {
> > > - if (zone_is_empty(zone))
> > > - return init_currently_empty_zone(zone, start_pfn, num_pages);
> > > -
> > > - return 0;
> > > + if (!zone_is_empty(zone))
> > > + init_currently_empty_zone(zone, start_pfn, num_pages);
> > > }
> > Semantic change added?
>
> could you be more specific?
Well, I'm wondering why you are trying to initialize a nonempty zone.
Hillf
On Fri 31-03-17 15:06:41, Hillf Danton wrote:
> On March 31, 2017 2:49 PM Michal Hocko wrote:
> > On Fri 31-03-17 11:49:49, Hillf Danton wrote:
> > [...]
> > > > -/* Can fail with -ENOMEM from allocating a wait table with vmalloc() or
> > > > - * alloc_bootmem_node_nopanic()/memblock_virt_alloc_node_nopanic() */
> > > > -static int __ref ensure_zone_is_initialized(struct zone *zone,
> > > > +static void __ref ensure_zone_is_initialized(struct zone *zone,
> > > > unsigned long start_pfn, unsigned long num_pages)
> > > > {
> > > > - if (zone_is_empty(zone))
> > > > - return init_currently_empty_zone(zone, start_pfn, num_pages);
> > > > -
> > > > - return 0;
> > > > + if (!zone_is_empty(zone))
> > > > + init_currently_empty_zone(zone, start_pfn, num_pages);
> > > > }
> > > Semantic change added?
> >
> > could you be more specific?
>
> Well, I'm wondering why you are trying to initialize a nonempty zone.
Oops, another screw up during the initial patch split up. Thanks for
catching that, but it would be more helpful to be more specific in the
feedback. I am getting blind to the code I have been staring at for quite
some time, so it was not obvious what you meant here.
Thanks
--
Michal Hocko
SUSE Labs
Fixed screw ups during the initial patch split up, as per Hillf's feedback
---
>From 8be6c5e47de66210e47710c80e72e8abd899017b Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Wed, 29 Mar 2017 15:11:30 +0200
Subject: [PATCH] mm: get rid of zone_is_initialized
There shouldn't be any reason to keep a separate initialized flag when we
can tell the same thing by checking whether there are any pages spanned by
the zone. Remove zone_is_initialized() and replace it with zone_is_empty()
which can be used for the same set of tests.
This shouldn't have any visible effect.
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/mmzone.h | 7 -------
mm/memory_hotplug.c | 6 +++---
mm/page_alloc.c | 3 +--
3 files changed, 4 insertions(+), 12 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 618499159a7c..3bac3ed71c7a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -442,8 +442,6 @@ struct zone {
seqlock_t span_seqlock;
#endif
- int initialized;
-
/* Write-intensive fields used from the page allocator */
ZONE_PADDING(_pad1_)
@@ -520,11 +518,6 @@ static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
}
-static inline bool zone_is_initialized(struct zone *zone)
-{
- return zone->initialized;
-}
-
static inline bool zone_is_empty(struct zone *zone)
{
return zone->spanned_pages == 0;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6fb6bd2df787..699f5a2a8efd 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -348,7 +348,7 @@ static void fix_zone_id(struct zone *zone, unsigned long start_pfn,
static int __ref ensure_zone_is_initialized(struct zone *zone,
unsigned long start_pfn, unsigned long num_pages)
{
- if (!zone_is_initialized(zone))
+ if (zone_is_empty(zone))
return init_currently_empty_zone(zone, start_pfn, num_pages);
return 0;
@@ -1051,7 +1051,7 @@ bool zone_can_shift(unsigned long pfn, unsigned long nr_pages,
/* no zones in use between current zone and target */
for (i = idx + 1; i < target; i++)
- if (zone_is_initialized(zone - idx + i))
+ if (!zone_is_empty(zone - idx + i))
return false;
}
@@ -1062,7 +1062,7 @@ bool zone_can_shift(unsigned long pfn, unsigned long nr_pages,
/* no zones in use between current zone and target */
for (i = target + 1; i < idx; i++)
- if (zone_is_initialized(zone - idx + i))
+ if (!zone_is_empty(zone - idx + i))
return false;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5ee8a26fa383..756353d1e293 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -795,7 +795,7 @@ static inline void __free_one_page(struct page *page,
max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
- VM_BUG_ON(!zone_is_initialized(zone));
+ VM_BUG_ON(zone_is_empty(zone));
VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
VM_BUG_ON(migratetype == -1);
@@ -5535,7 +5535,6 @@ int __meminit init_currently_empty_zone(struct zone *zone,
zone_start_pfn, (zone_start_pfn + size));
zone_init_free_lists(zone);
- zone->initialized = 1;
return 0;
}
--
2.11.0
--
Michal Hocko
SUSE Labs
Fixed screw ups during the initial patch split up, as per Hillf's feedback
---
>From a53c465b8f8046c6e6886dc3c17945238e63ee7c Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Wed, 29 Mar 2017 15:17:48 +0200
Subject: [PATCH] mm: remove return value from init_currently_empty_zone
init_currently_empty_zone doesn't have any error to return, yet it is
still an int and callers try to be defensive and handle a potential
error. Remove this nonsense and simplify all callers.
This patch shouldn't have any visible effect.
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/mmzone.h | 2 +-
mm/memory_hotplug.c | 23 +++++------------------
mm/page_alloc.c | 8 ++------
3 files changed, 8 insertions(+), 25 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3bac3ed71c7a..6973a3e6ad73 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -762,7 +762,7 @@ enum memmap_context {
MEMMAP_EARLY,
MEMMAP_HOTPLUG,
};
-extern int init_currently_empty_zone(struct zone *zone, unsigned long start_pfn,
+extern void init_currently_empty_zone(struct zone *zone, unsigned long start_pfn,
unsigned long size);
extern void lruvec_init(struct lruvec *lruvec);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 699f5a2a8efd..1d55e5ec943f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -343,27 +343,20 @@ static void fix_zone_id(struct zone *zone, unsigned long start_pfn,
set_page_links(pfn_to_page(pfn), zid, nid, pfn);
}
-/* Can fail with -ENOMEM from allocating a wait table with vmalloc() or
- * alloc_bootmem_node_nopanic()/memblock_virt_alloc_node_nopanic() */
-static int __ref ensure_zone_is_initialized(struct zone *zone,
+static void __ref ensure_zone_is_initialized(struct zone *zone,
unsigned long start_pfn, unsigned long num_pages)
{
if (zone_is_empty(zone))
- return init_currently_empty_zone(zone, start_pfn, num_pages);
-
- return 0;
+ init_currently_empty_zone(zone, start_pfn, num_pages);
}
static int __meminit move_pfn_range_left(struct zone *z1, struct zone *z2,
unsigned long start_pfn, unsigned long end_pfn)
{
- int ret;
unsigned long flags;
unsigned long z1_start_pfn;
- ret = ensure_zone_is_initialized(z1, start_pfn, end_pfn - start_pfn);
- if (ret)
- return ret;
+ ensure_zone_is_initialized(z1, start_pfn, end_pfn - start_pfn);
pgdat_resize_lock(z1->zone_pgdat, &flags);
@@ -399,13 +392,10 @@ static int __meminit move_pfn_range_left(struct zone *z1, struct zone *z2,
static int __meminit move_pfn_range_right(struct zone *z1, struct zone *z2,
unsigned long start_pfn, unsigned long end_pfn)
{
- int ret;
unsigned long flags;
unsigned long z2_end_pfn;
- ret = ensure_zone_is_initialized(z2, start_pfn, end_pfn - start_pfn);
- if (ret)
- return ret;
+ ensure_zone_is_initialized(z2, start_pfn, end_pfn - start_pfn);
pgdat_resize_lock(z1->zone_pgdat, &flags);
@@ -476,12 +466,9 @@ static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn)
int nid = pgdat->node_id;
int zone_type;
unsigned long flags, pfn;
- int ret;
zone_type = zone - pgdat->node_zones;
- ret = ensure_zone_is_initialized(zone, phys_start_pfn, nr_pages);
- if (ret)
- return ret;
+ ensure_zone_is_initialized(zone, phys_start_pfn, nr_pages);
pgdat_resize_lock(zone->zone_pgdat, &flags);
grow_zone_span(zone, phys_start_pfn, phys_start_pfn + nr_pages);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 756353d1e293..c6127f1a62e9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5518,7 +5518,7 @@ static __meminit void zone_pcp_init(struct zone *zone)
zone_batchsize(zone));
}
-int __meminit init_currently_empty_zone(struct zone *zone,
+void __meminit init_currently_empty_zone(struct zone *zone,
unsigned long zone_start_pfn,
unsigned long size)
{
@@ -5535,8 +5535,6 @@ int __meminit init_currently_empty_zone(struct zone *zone,
zone_start_pfn, (zone_start_pfn + size));
zone_init_free_lists(zone);
-
- return 0;
}
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
@@ -5999,7 +5997,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
{
enum zone_type j;
int nid = pgdat->node_id;
- int ret;
pgdat_resize_init(pgdat);
#ifdef CONFIG_NUMA_BALANCING
@@ -6081,8 +6078,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
set_pageblock_order();
setup_usemap(pgdat, zone, zone_start_pfn, size);
- ret = init_currently_empty_zone(zone, zone_start_pfn, size);
- BUG_ON(ret);
+ init_currently_empty_zone(zone, zone_start_pfn, size);
memmap_init(size, nid, j, zone_start_pfn);
}
}
--
2.11.0
--
Michal Hocko
SUSE Labs
rebase on top of previous changes
---
>From dfb7cab2783d2175ad8355ce65943a28e43d8700 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Wed, 29 Mar 2017 18:08:51 +0200
Subject: [PATCH] mm, memory_hotplug: remove unused cruft after memory hotplug
rework
arch_add_memory doesn't need the for_device parameter anymore because
devm_memremap_pages already does everything it needs to.
zone_for_memory doesn't have any users anymore, and neither does the whole
zone shifting infrastructure, so drop them as well.
Signed-off-by: Michal Hocko <[email protected]>
---
arch/ia64/mm/init.c | 2 +-
arch/powerpc/mm/mem.c | 2 +-
arch/s390/mm/init.c | 2 +-
arch/sh/mm/init.c | 3 +-
arch/x86/mm/init_32.c | 2 +-
arch/x86/mm/init_64.c | 2 +-
include/linux/memory_hotplug.h | 4 +-
kernel/memremap.c | 2 +-
mm/memory_hotplug.c | 209 +----------------------------------------
9 files changed, 9 insertions(+), 219 deletions(-)
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 0fb7f3946785..6ebb570f1eae 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -645,7 +645,7 @@ mem_init (void)
}
#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index db1369a7f69f..707a4146938b 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -126,7 +126,7 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
return -ENODEV;
}
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 9e8c515ee29f..7d7591b63b57 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -153,7 +153,7 @@ void __init free_initrd_mem(unsigned long start, unsigned long end)
#endif
#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size)
{
unsigned long start_pfn = PFN_DOWN(start);
unsigned long size_pages = PFN_DOWN(size);
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 95261b66bcf3..4e4afa0ab3c3 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -485,13 +485,12 @@ void free_initrd_mem(unsigned long start, unsigned long end)
#endif
#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size)
{
unsigned long start_pfn = PFN_DOWN(start);
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
-
/* We only have ZONE_NORMAL, so this is easy.. */
ret = __add_pages(nid, start_pfn, nr_pages);
if (unlikely(ret))
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 5de79aa7d6ce..b389f2ec75f9 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -816,7 +816,7 @@ void __init mem_init(void)
}
#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index bc53f24e6703..20f575333b08 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -637,7 +637,7 @@ static void update_end_of_memory_vars(u64 start, u64 size)
}
}
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 63577ce57028..d5adb7c468d1 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -274,9 +274,7 @@ extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
void *arg, int (*func)(struct memory_block *, void *));
extern int add_memory(int nid, u64 start, u64 size);
extern int add_memory_resource(int nid, struct resource *resource, bool online);
-extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
- bool for_device);
-extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
+extern int arch_add_memory(int nid, u64 start, u64 size);
extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages);
extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 1b137649cb82..808f7974918a 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -366,7 +366,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
lock_device_hotplug();
mem_hotplug_begin();
- error = arch_add_memory(nid, align_start, align_size, true);
+ error = arch_add_memory(nid, align_start, align_size);
if (!error)
move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
align_start, align_size);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 3a8b24e47a2b..b9dc1c4e26c3 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -295,180 +295,6 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
}
#endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
-static void __meminit grow_zone_span(struct zone *zone, unsigned long start_pfn,
- unsigned long end_pfn)
-{
- unsigned long old_zone_end_pfn;
-
- zone_span_writelock(zone);
-
- old_zone_end_pfn = zone_end_pfn(zone);
- if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn)
- zone->zone_start_pfn = start_pfn;
-
- zone->spanned_pages = max(old_zone_end_pfn, end_pfn) -
- zone->zone_start_pfn;
-
- zone_span_writeunlock(zone);
-}
-
-static void resize_zone(struct zone *zone, unsigned long start_pfn,
- unsigned long end_pfn)
-{
- zone_span_writelock(zone);
-
- if (end_pfn - start_pfn) {
- zone->zone_start_pfn = start_pfn;
- zone->spanned_pages = end_pfn - start_pfn;
- } else {
- /*
- * make it consist as free_area_init_core(),
- * if spanned_pages = 0, then keep start_pfn = 0
- */
- zone->zone_start_pfn = 0;
- zone->spanned_pages = 0;
- }
-
- zone_span_writeunlock(zone);
-}
-
-static void fix_zone_id(struct zone *zone, unsigned long start_pfn,
- unsigned long end_pfn)
-{
- enum zone_type zid = zone_idx(zone);
- int nid = zone->zone_pgdat->node_id;
- unsigned long pfn;
-
- for (pfn = start_pfn; pfn < end_pfn; pfn++)
- set_page_links(pfn_to_page(pfn), zid, nid, pfn);
-}
-
-static void __ref ensure_zone_is_initialized(struct zone *zone,
- unsigned long start_pfn, unsigned long num_pages)
-{
- if (zone_is_empty(zone))
- init_currently_empty_zone(zone, start_pfn, num_pages);
-}
-
-static int __meminit move_pfn_range_left(struct zone *z1, struct zone *z2,
- unsigned long start_pfn, unsigned long end_pfn)
-{
- unsigned long flags;
- unsigned long z1_start_pfn;
-
- ensure_zone_is_initialized(z1, start_pfn, end_pfn - start_pfn);
-
- pgdat_resize_lock(z1->zone_pgdat, &flags);
-
- /* can't move pfns which are higher than @z2 */
- if (end_pfn > zone_end_pfn(z2))
- goto out_fail;
- /* the move out part must be at the left most of @z2 */
- if (start_pfn > z2->zone_start_pfn)
- goto out_fail;
- /* must included/overlap */
- if (end_pfn <= z2->zone_start_pfn)
- goto out_fail;
-
- /* use start_pfn for z1's start_pfn if z1 is empty */
- if (!zone_is_empty(z1))
- z1_start_pfn = z1->zone_start_pfn;
- else
- z1_start_pfn = start_pfn;
-
- resize_zone(z1, z1_start_pfn, end_pfn);
- resize_zone(z2, end_pfn, zone_end_pfn(z2));
-
- pgdat_resize_unlock(z1->zone_pgdat, &flags);
-
- fix_zone_id(z1, start_pfn, end_pfn);
-
- return 0;
-out_fail:
- pgdat_resize_unlock(z1->zone_pgdat, &flags);
- return -1;
-}
-
-static int __meminit move_pfn_range_right(struct zone *z1, struct zone *z2,
- unsigned long start_pfn, unsigned long end_pfn)
-{
- unsigned long flags;
- unsigned long z2_end_pfn;
-
- ensure_zone_is_initialized(z2, start_pfn, end_pfn - start_pfn);
-
- pgdat_resize_lock(z1->zone_pgdat, &flags);
-
- /* can't move pfns which are lower than @z1 */
- if (z1->zone_start_pfn > start_pfn)
- goto out_fail;
- /* the move out part mast at the right most of @z1 */
- if (zone_end_pfn(z1) > end_pfn)
- goto out_fail;
- /* must included/overlap */
- if (start_pfn >= zone_end_pfn(z1))
- goto out_fail;
-
- /* use end_pfn for z2's end_pfn if z2 is empty */
- if (!zone_is_empty(z2))
- z2_end_pfn = zone_end_pfn(z2);
- else
- z2_end_pfn = end_pfn;
-
- resize_zone(z1, z1->zone_start_pfn, start_pfn);
- resize_zone(z2, start_pfn, z2_end_pfn);
-
- pgdat_resize_unlock(z1->zone_pgdat, &flags);
-
- fix_zone_id(z2, start_pfn, end_pfn);
-
- return 0;
-out_fail:
- pgdat_resize_unlock(z1->zone_pgdat, &flags);
- return -1;
-}
-
-static void __meminit grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
- unsigned long end_pfn)
-{
- unsigned long old_pgdat_end_pfn = pgdat_end_pfn(pgdat);
-
- if (!pgdat->node_spanned_pages || start_pfn < pgdat->node_start_pfn)
- pgdat->node_start_pfn = start_pfn;
-
- pgdat->node_spanned_pages = max(old_pgdat_end_pfn, end_pfn) -
- pgdat->node_start_pfn;
-}
-
-static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn)
-{
- struct pglist_data *pgdat = zone->zone_pgdat;
- int nr_pages = PAGES_PER_SECTION;
- int nid = pgdat->node_id;
- int zone_type;
- unsigned long flags, pfn;
-
- zone_type = zone - pgdat->node_zones;
- ensure_zone_is_initialized(zone, phys_start_pfn, nr_pages);
-
- pgdat_resize_lock(zone->zone_pgdat, &flags);
- grow_zone_span(zone, phys_start_pfn, phys_start_pfn + nr_pages);
- grow_pgdat_span(zone->zone_pgdat, phys_start_pfn,
- phys_start_pfn + nr_pages);
- pgdat_resize_unlock(zone->zone_pgdat, &flags);
- memmap_init_zone(nr_pages, nid, zone_type,
- phys_start_pfn, MEMMAP_HOTPLUG);
-
- /* online_page_range is called later and expects pages reserved */
- for (pfn = phys_start_pfn; pfn < phys_start_pfn + nr_pages; pfn++) {
- if (!pfn_valid(pfn))
- continue;
-
- SetPageReserved(pfn_to_page(pfn));
- }
- return 0;
-}
-
static int __meminit __add_section(int nid, unsigned long phys_start_pfn)
{
int ret;
@@ -1332,39 +1158,6 @@ static int check_hotplug_memory_range(u64 start, u64 size)
return 0;
}
-/*
- * If movable zone has already been setup, newly added memory should be check.
- * If its address is higher than movable zone, it should be added as movable.
- * Without this check, movable zone may overlap with other zone.
- */
-static int should_add_memory_movable(int nid, u64 start, u64 size)
-{
- unsigned long start_pfn = start >> PAGE_SHIFT;
- pg_data_t *pgdat = NODE_DATA(nid);
- struct zone *movable_zone = pgdat->node_zones + ZONE_MOVABLE;
-
- if (zone_is_empty(movable_zone))
- return 0;
-
- if (movable_zone->zone_start_pfn <= start_pfn)
- return 1;
-
- return 0;
-}
-
-int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
- bool for_device)
-{
-#ifdef CONFIG_ZONE_DEVICE
- if (for_device)
- return ZONE_DEVICE;
-#endif
- if (should_add_memory_movable(nid, start, size))
- return ZONE_MOVABLE;
-
- return zone_default;
-}
-
static int online_memory_block(struct memory_block *mem, void *arg)
{
return device_online(&mem->dev);
@@ -1410,7 +1203,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
}
/* call arch's memory hotadd */
- ret = arch_add_memory(nid, start, size, false);
+ ret = arch_add_memory(nid, start, size);
if (ret < 0)
goto error;
--
2.11.0
--
Michal Hocko
SUSE Labs
On Thu, Mar 30, 2017 at 01:54:48PM +0200, Michal Hocko wrote:
> Patch 5 is the core of the change. In order to make it easier to review
> I have tried it to be as minimalistic as possible and the large code
> removal is moved to patch 6.
>
> I would appreciate if s390 folks could take a look at patch 4 and the
> arch_add_memory because I am not sure I've grokked what they wanted to
> achieve there completely.
[adding Gerald Schaefer]
This seems to work fine on s390. So for the s390 bits:
Acked-by: Heiko Carstens <[email protected]>
On Fri 31-03-17 21:19:24, Heiko Carstens wrote:
> On Thu, Mar 30, 2017 at 01:54:48PM +0200, Michal Hocko wrote:
> > Patch 5 is the core of the change. In order to make it easier to review
> > I have tried it to be as minimalistic as possible and the large code
> > removal is moved to patch 6.
> >
> > I would appreciate if s390 folks could take a look at patch 4 and the
> > arch_add_memory because I am not sure I've grokked what they wanted to
> > achieve there completely.
>
> [adding Gerald Schaefer]
>
> This seems to work fine on s390. So for the s390 bits:
> Acked-by: Heiko Carstens <[email protected]>
Thanks a lot!
--
Michal Hocko
SUSE Labs
On Thu 30-03-17 13:54:48, Michal Hocko wrote:
[...]
> Any thoughts, complains, suggestions?
Anything? I would really appreciate feedback from the IBM and Fujitsu guys
who have shaped this code over the last few years. Also Igor and Vitaly seem
to be using memory hotplug in virtualized environments. I do not expect
they would see a huge advantage from the rework but I would appreciate it
if they could give it some testing to catch any potential regressions.
I plan to repost the series and would like to avoid a pointless
submission if there are any obvious issues.
Thanks!
--
Michal Hocko
SUSE Labs
On Mon, 3 Apr 2017 13:55:46 +0200
Michal Hocko <[email protected]> wrote:
> On Thu 30-03-17 13:54:48, Michal Hocko wrote:
> [...]
> > Any thoughts, complains, suggestions?
>
> Anything? I would really appreciate feedback from the IBM and Fujitsu guys
> who have shaped this code over the last few years. Also Igor and Vitaly seem
> to be using memory hotplug in virtualized environments. I do not expect
> they would see a huge advantage from the rework but I would appreciate it
> if they could give it some testing to catch any potential regressions.
I really appreciate this rework as it simplifies the code a bit and would
potentially allow me/Vitaly to make auto-online work with the movable zone
as well. I'll try to test the series within this week.
>
> I plan to repost the series and would like to prevent from pointless
> submission if there are any obvious issues.
>
> Thanks!
On Mon, Apr 03, 2017 at 01:55:46PM +0200, Michal Hocko wrote:
>Anything? I would really appreciate feedback from the IBM and Fujitsu
>guys who have shaped this code over the last few years. Also Igor and
>Vitaly seem to be using memory hotplug in virtualized environments. I do
>not expect they would see a huge advantage from the rework but I would
>appreciate it if they could give it some testing to catch any potential
>regressions.
Sorry for the delayed reply.
With this set, I'm able to "online_movable" blocks in ascending order.
However, I am seeing a regression. When adding memory to a memoryless
node, it shows up in node 0 instead. I'm digging to see if I can help
narrow down where things go wrong.
--
Reza Arbab
On Mon 03-04-17 14:58:30, Reza Arbab wrote:
> On Mon, Apr 03, 2017 at 01:55:46PM +0200, Michal Hocko wrote:
> >Anything? I would really appreciate feedback from the IBM and Fujitsu guys
> >who have shaped this code over the last few years. Also Igor and Vitaly seem
> >to be using memory hotplug in virtualized environments. I do not expect they
> >would see a huge advantage from the rework but I would appreciate it if they
> >could give it some testing to catch any potential regressions.
>
> Sorry for the delayed reply.
>
> With this set, I'm able to "online_movable" blocks in ascending order.
>
> However, I am seeing a regression. When adding memory to a memoryless node,
> it shows up in node 0 instead. I'm digging to see if I can help narrow down
> where things go wrong.
OK, I guess I know what is going on here. online_pages relies on
pfn_to_nid(pfn) to return a proper node. But we are doing
page_to_nid(pfn_to_page(__pfn_to_nid_pfn)) so we rely on the page being
properly initialized. Damn, I should have noticed that. There are two
ways around that. Either the __add_section stores the nid into the
struct page and make page_to_nid reliable or store it somewhere else
(ideally into the memblock). The first is easier (let's do it for now)
but longterm we do not want to rely on the struct page at all I believe.
Does the following help?
---
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b9dc1c4e26c3..0e21b9f67c9d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -309,14 +309,19 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn)
/*
* Make all the pages reserved so that nobody will stumble over half
- * initialized state.
+ * initialized state.
+ * FIXME: We also have to associate it with a node because pfn_to_node
+ * relies on having page with the proper node.
*/
for (i = 0; i < PAGES_PER_SECTION; i++) {
unsigned long pfn = phys_start_pfn + i;
+ struct page *page;
if (!pfn_valid(pfn))
continue;
- SetPageReserved(pfn_to_page(phys_start_pfn + i));
+ page = pfn_to_page(pfn);
+ set_page_node(page, nid);
+ SetPageReserved(page);
}
return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
--
Michal Hocko
SUSE Labs
On Mon, Apr 03, 2017 at 10:23:38PM +0200, Michal Hocko wrote:
>On Mon 03-04-17 14:58:30, Reza Arbab wrote:
>> However, I am seeing a regression. When adding memory to a memoryless
>> node, it shows up in node 0 instead. I'm digging to see if I can help
>> narrow down where things go wrong.
>
>OK, I guess I know what is going on here. online_pages relies on
>pfn_to_nid(pfn) to return a proper node. But we are doing
>page_to_nid(pfn_to_page(__pfn_to_nid_pfn)) so we rely on the page being
>properly initialized. Damn, I should have noticed that. There are two
>ways around that. Either the __add_section stores the nid into the
>struct page and make page_to_nid reliable or store it somewhere else
>(ideally into the memblock). The first is easier (let's do it for now)
>but longterm we do not want to rely on the struct page at all I believe.
>
>Does the following help?
>---
>diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>index b9dc1c4e26c3..0e21b9f67c9d 100644
>--- a/mm/memory_hotplug.c
>+++ b/mm/memory_hotplug.c
>@@ -309,14 +309,19 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn)
>
> /*
> * Make all the pages reserved so that nobody will stumble over half
>- * initialized state.
>+ * initialized state.
>+ * FIXME: We also have to associate it with a node because pfn_to_node
>+ * relies on having page with the proper node.
> */
> for (i = 0; i < PAGES_PER_SECTION; i++) {
> unsigned long pfn = phys_start_pfn + i;
>+ struct page *page;
> if (!pfn_valid(pfn))
> continue;
>
>- SetPageReserved(pfn_to_page(phys_start_pfn + i));
>+ page = pfn_to_page(pfn);
>+ set_page_node(page, nid);
>+ SetPageReserved(page);
> }
>
> return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
Almost there. I'm seeing the memory in the correct node now, but the
/sys/devices/system/node/nodeX/memoryY links are not being created.
I think it's tripping up here, in register_mem_sect_under_node():
page_nid = get_nid_for_pfn(pfn);
if (page_nid < 0)
continue;
--
Reza Arbab
On Thu, Mar 30, 2017 at 01:54:51PM +0200, Michal Hocko wrote:
>init_currently_empty_zone doesn't have any error to return yet it is
>still an int and callers try to be defensive and try to handle potential
>error. Remove this nonsense and simplify all callers.
Semi-related; arch_remove_memory() returns int, but callers ignore it.
Is that worth cleaning up? If so, should the implementations be
simplified, or should we maybe do a pr_error() or something with it?
--
Reza Arbab
[Let's add Gary who introduced this code in c04fc586c1a48]
On Mon 03-04-17 15:42:13, Reza Arbab wrote:
> On Mon, Apr 03, 2017 at 10:23:38PM +0200, Michal Hocko wrote:
> >On Mon 03-04-17 14:58:30, Reza Arbab wrote:
> >>However, I am seeing a regression. When adding memory to a memoryless
> >>node, it shows up in node 0 instead. I'm digging to see if I can help
> >>narrow down where things go wrong.
> >
> >OK, I guess I know what is going on here. online_pages relies on
> >pfn_to_nid(pfn) to return a proper node. But we are doing
> >page_to_nid(pfn_to_page(__pfn_to_nid_pfn)) so we rely on the page being
> >properly initialized. Damn, I should have noticed that. There are two
> >ways around that. Either the __add_section stores the nid into the
> >struct page and make page_to_nid reliable or store it somewhere else
> >(ideally into the memblock). The first is easier (let's do it for now)
> >but longterm we do not want to rely on the struct page at all I believe.
> >
> >Does the following help?
> >---
> >diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> >index b9dc1c4e26c3..0e21b9f67c9d 100644
> >--- a/mm/memory_hotplug.c
> >+++ b/mm/memory_hotplug.c
> >@@ -309,14 +309,19 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn)
> >
> > /*
> > * Make all the pages reserved so that nobody will stumble over half
> >- * initialized state.
> >+ * initialized state.
> >+ * FIXME: We also have to associate it with a node because pfn_to_node
> >+ * relies on having page with the proper node.
> > */
> > for (i = 0; i < PAGES_PER_SECTION; i++) {
> > unsigned long pfn = phys_start_pfn + i;
> >+ struct page *page;
> > if (!pfn_valid(pfn))
> > continue;
> >
> >- SetPageReserved(pfn_to_page(phys_start_pfn + i));
> >+ page = pfn_to_page(pfn);
> >+ set_page_node(page, nid);
> >+ SetPageReserved(page);
> > }
> >
> > return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
>
> Almost there. I'm seeing the memory in the correct node now, but the
> /sys/devices/system/node/nodeX/memoryY links are not being created.
>
> I think it's tripping up here, in register_mem_sect_under_node():
>
> page_nid = get_nid_for_pfn(pfn);
> if (page_nid < 0)
> continue;
Huh, this code is confusing. How can we have a memblock spanning more
than one node? If not, then the loop over all sections in the memblock
seems pointless as well. Also, why do we require page_initialized() in
get_nid_for_pfn? The changelog doesn't explain that and there are no
comments that would help either.
Gary, could you clarify this please?
--
Michal Hocko
SUSE Labs
On Mon 03-04-17 16:22:32, Reza Arbab wrote:
> On Thu, Mar 30, 2017 at 01:54:51PM +0200, Michal Hocko wrote:
> >init_currently_empty_zone doesn't have any error to return yet it is
> >still an int and callers try to be defensive and try to handle potential
> >error. Remove this nonsense and simplify all callers.
>
> Semi-related; arch_remove_memory() returns int, but callers ignore it.
>
> Is that worth cleaning up? If so, should the implementations be simplified,
> or should we maybe do a pr_error() or something with it?
No, pr_error is not really helpful. Either that path can fail and we
should handle it properly - which will be hard because remove_memory
cannot handle that - or we should just make arch_remove_memory
non-failing. I have a suspicion that this path doesn't really fail
in practice. This requires a deeper inspection though. I've put that on
my todo list.
Thanks!
--
Michal Hocko
SUSE Labs
On Tue 04-04-17 09:23:29, Michal Hocko wrote:
> [Let's add Gary who introduced this code in c04fc586c1a48]
OK, so Gary's email doesn't exist anymore. Can anybody comment on
this? I suspect this code is just-in-case... Mel?
> On Mon 03-04-17 15:42:13, Reza Arbab wrote:
[...]
> > Almost there. I'm seeing the memory in the correct node now, but the
> > /sys/devices/system/node/nodeX/memoryY links are not being created.
> >
> > I think it's tripping up here, in register_mem_sect_under_node():
> >
> > page_nid = get_nid_for_pfn(pfn);
> > if (page_nid < 0)
> > continue;
>
> Huh, this code is confusing. How can we have a memblock spanning more
> than one node? If not, then the loop over all sections in the memblock
> seems pointless as well. Also, why do we require page_initialized() in
> get_nid_for_pfn? The changelog doesn't explain that and there are no
> comments that would help either.
>
> Gary, could you clarify this please?
> --
> Michal Hocko
> SUSE Labs
--
Michal Hocko
SUSE Labs
On Tue 04-04-17 09:34:12, Michal Hocko wrote:
> On Tue 04-04-17 09:23:29, Michal Hocko wrote:
> > [Let's add Gary who introduced this code in c04fc586c1a48]
>
> OK, so Gary's email doesn't exist anymore. Can anybody comment on
> this? I suspect this code is just-in-case... Mel?
>
> > On Mon 03-04-17 15:42:13, Reza Arbab wrote:
> [...]
> > > Almost there. I'm seeing the memory in the correct node now, but the
> > > /sys/devices/system/node/nodeX/memoryY links are not being created.
> > >
> > > I think it's tripping up here, in register_mem_sect_under_node():
> > >
> > > page_nid = get_nid_for_pfn(pfn);
> > > if (page_nid < 0)
> > > continue;
> >
> > Huh, this code is confusing. How can we have a memblock spanning more
> > than one node? If not, then the loop over all sections in the memblock
> > seems pointless as well. Also, why do we require page_initialized() in
> > get_nid_for_pfn? The changelog doesn't explain that and there are no
> > comments that would help either.
OK, so I've been thinking about that and I believe that the page_initialized
check in get_nid_for_pfn is just bogus. There is nothing that relies on
page::lru being already initialized. So I will go with the following as
a separate preparatory patch.
I believe the whole code should be revisited and I have put that on my
ever growing todo list because I suspect that it is more complex than
necessary. I suspect that memblocks do not span more than one node and all
this is just-in-case code (e.g. the onlining code assumes a single zone aka
node). But let's do that later.
---
>From fd2e3b6eca1cf7766527203d23db6aca5957a3f1 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Tue, 4 Apr 2017 10:05:06 +0200
Subject: [PATCH] mm: drop page_initialized check from get_nid_for_pfn
c04fc586c1a4 ("mm: show node to memory section relationship with
symlinks in sysfs") has added means to export memblock<->node
association into the sysfs. It has also introduced get_nid_for_pfn
which is a rather confusing counterpart of pfn_to_nid which checks also
whether the pfn page is already initialized (page_initialized). This
is done by checking page::lru != NULL which doesn't make any sense at
all. Nothing in this path really relies on the lru list being used or
initialized. Just remove it
Signed-off-by: Michal Hocko <[email protected]>
---
drivers/base/node.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5548f9686016..ee080a35e869 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -368,8 +368,6 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
}
#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
-#define page_initialized(page) (page->lru.next)
-
static int __ref get_nid_for_pfn(unsigned long pfn)
{
struct page *page;
@@ -380,9 +378,6 @@ static int __ref get_nid_for_pfn(unsigned long pfn)
if (system_state == SYSTEM_BOOTING)
return early_pfn_to_nid(pfn);
#endif
- page = pfn_to_page(pfn);
- if (!page_initialized(page))
- return -1;
return pfn_to_nid(pfn);
}
--
2.11.0
--
Michal Hocko
SUSE Labs
On 30.03.17, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> The current memory hotplug implementation relies on having all the
> struct pages associate with a zone during the physical hotplug phase
> (arch_add_memory->__add_pages->__add_section->__add_zone). In the vast
> majority of cases this means that they are added to ZONE_NORMAL. This
> has been so since 9d99aaa31f59 ("[PATCH] x86_64: Support memory hotadd
> without sparsemem") and it wasn't a big deal back then.
>
> Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
> onlining 511c2aba8f07 ("mm, memory-hotplug: dynamic configure movable
> memory and portion memory") and then things got more complicated. Rather
> than reconsidering the zone association which was no longer needed
> (because the memory hotplug already depended on SPARSEMEM) a convoluted
> semantic of zone shifting has been developed. Only the currently last
> memblock or the one adjacent to the zone_movable can be onlined movable.
> This essentially means that the online time changes as the new memblocks
> are added.
>
> Let's simulate memory hot online manually
> Normal Movable
>
> /sys/devices/system/memory/memory32/valid_zones:Normal
> /sys/devices/system/memory/memory33/valid_zones:Normal Movable
>
> /sys/devices/system/memory/memory32/valid_zones:Normal
> /sys/devices/system/memory/memory33/valid_zones:Normal
> /sys/devices/system/memory/memory34/valid_zones:Normal Movable
>
> /sys/devices/system/memory/memory32/valid_zones:Normal
> /sys/devices/system/memory/memory33/valid_zones:Normal Movable
> /sys/devices/system/memory/memory34/valid_zones:Movable Normal
>
> This is an awkward semantic because an udev event is sent as soon as the
> block is onlined and an udev handler might want to online it based on
> some policy (e.g. association with a node) but it will inherently race
> with new blocks showing up.
>
> This patch changes the physical online phase to not associate pages
> with any zone at all. All the pages are just marked reserved and wait
> for the onlining phase to be associated with the zone as per the online
> request. There are only two requirements
> - existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap
> - ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses
> the latter is not inherent and can be changed in the future. It
> preserves the current behavior and makes the code slightly simpler. This
> is subject to change in the future.
>
> This means that the same physical online steps as above will lead to the
> following state:
> Normal Movable
>
> /sys/devices/system/memory/memory32/valid_zones:Normal Movable
> /sys/devices/system/memory/memory33/valid_zones:Normal Movable
>
> /sys/devices/system/memory/memory32/valid_zones:Normal Movable
> /sys/devices/system/memory/memory33/valid_zones:Normal Movable
> /sys/devices/system/memory/memory34/valid_zones:Normal Movable
>
> /sys/devices/system/memory/memory32/valid_zones:Normal Movable
> /sys/devices/system/memory/memory33/valid_zones:Normal Movable
> /sys/devices/system/memory/memory34/valid_zones:Movable
>
> Implementation:
> The current move_pfn_range is reimplemented to check the above
> requirements (allow_online_pfn_range) and then updates the respective
> zone (move_pfn_range_to_zone), the pgdat and links all the pages in the
> pfn range with the zone/node. __add_pages is updated to not require the
> zone and only initializes sections in the range. This allowed us to
> simplify the arch_add_memory code (s390 could get rid of quite some
> code).
>
> devm_memremap_pages is the only user of arch_add_memory which relies
> on the zone association because it hooks into the memory hotplug
> only half way. It uses it to associate the new memory with ZONE_DEVICE
> but doesn't allow it to be {on,off}lined via sysfs. This means that this
> particular code path has to call move_pfn_range_to_zone explicitly.
>
> The original zone shifting code is kept in place and will be removed in
> the follow up patch for an easier review.
>
> Cc: Lai Jiangshan <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Martin Schwidefsky <[email protected]>
> Cc: Heiko Carstens <[email protected]>
> Cc: [email protected]
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> arch/ia64/mm/init.c | 8 +-
> arch/powerpc/mm/mem.c | 10 +--
> arch/s390/mm/init.c | 30 +------
> arch/sh/mm/init.c | 7 +-
> arch/x86/mm/init_32.c | 5 +-
> arch/x86/mm/init_64.c | 9 +-
> drivers/base/memory.c | 52 ++++++-----
> include/linux/memory_hotplug.h | 13 +--
> kernel/memremap.c | 3 +
> mm/memory_hotplug.c | 195 +++++++++++++++++++++++++----------------
> mm/sparse.c | 3 +-
> 11 files changed, 165 insertions(+), 170 deletions(-)
>
Hi Michal,
building an x86 allmodconfig with next-20170404 results in the following
section mismatch warnings probably caused by this patch:
WARNING: mm/built-in.o(.text+0x5a1c2): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:memmap_init_zone()
The function move_pfn_range_to_zone() references
the function __meminit memmap_init_zone().
This is often because move_pfn_range_to_zone lacks a __meminit
annotation or the annotation of memmap_init_zone is wrong.
WARNING: mm/built-in.o(.text+0x5a25b): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:init_currently_empty_zone()
The function move_pfn_range_to_zone() references
the function __meminit init_currently_empty_zone().
This is often because move_pfn_range_to_zone lacks a __meminit
annotation or the annotation of init_currently_empty_zone is wrong.
WARNING: vmlinux.o(.text+0x188aa2): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:memmap_init_zone()
The function move_pfn_range_to_zone() references
the function __meminit memmap_init_zone().
This is often because move_pfn_range_to_zone lacks a __meminit
annotation or the annotation of memmap_init_zone is wrong.
WARNING: vmlinux.o(.text+0x188b3b): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:init_currently_empty_zone()
The function move_pfn_range_to_zone() references
the function __meminit init_currently_empty_zone().
This is often because move_pfn_range_to_zone lacks a __meminit
annotation or the annotation of init_currently_empty_zone is wrong.
--
Tobias
On Tue 04-04-17 14:21:19, Tobias Regnery wrote:
[...]
> Hi Michal,
Hi
> building an x86 allmodconfig with next-20170404 results in the following
> section mismatch warnings probably caused by this patch:
>
> WARNING: mm/built-in.o(.text+0x5a1c2): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:memmap_init_zone()
> The function move_pfn_range_to_zone() references
> the function __meminit memmap_init_zone().
> This is often because move_pfn_range_to_zone lacks a __meminit
> annotation or the annotation of memmap_init_zone is wrong.
Right. __add_pages, which used to call memmap_init_zone before, is __ref
(to hide it from the checker), which is not the case for
move_pfn_range_to_zone. I cannot say I see the point of separating out
all the meminit functions because they are not going away, but using
__ref for move_pfn_range_to_zone should be as safe as __add_pages is.
> WARNING: mm/built-in.o(.text+0x5a25b): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:init_currently_empty_zone()
> The function move_pfn_range_to_zone() references
> the function __meminit init_currently_empty_zone().
> This is often because move_pfn_range_to_zone lacks a __meminit
> annotation or the annotation of init_currently_empty_zone is wrong.
And this is the same thing. Thanks a lot. The following patch should fix
it. I will keep it separate so there is a reference for why this has been
done...
---
>From 1ff2996125393601a28065c44d790145078483db Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Tue, 4 Apr 2017 14:36:25 +0200
Subject: [PATCH] mm, hotplug: fix the section mismatch warning
Tobias has reported the following section mismatches introduced by "mm,
memory_hotplug: do not associate hotadded memory to zones until online".
WARNING: mm/built-in.o(.text+0x5a1c2): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:memmap_init_zone()
The function move_pfn_range_to_zone() references
the function __meminit memmap_init_zone().
This is often because move_pfn_range_to_zone lacks a __meminit
annotation or the annotation of memmap_init_zone is wrong.
WARNING: mm/built-in.o(.text+0x5a25b): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:init_currently_empty_zone()
The function move_pfn_range_to_zone() references
the function __meminit init_currently_empty_zone().
This is often because move_pfn_range_to_zone lacks a __meminit
annotation or the annotation of init_currently_empty_zone is wrong.
WARNING: vmlinux.o(.text+0x188aa2): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:memmap_init_zone()
The function move_pfn_range_to_zone() references
the function __meminit memmap_init_zone().
This is often because move_pfn_range_to_zone lacks a __meminit
annotation or the annotation of memmap_init_zone is wrong.
WARNING: vmlinux.o(.text+0x188b3b): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:init_currently_empty_zone()
The function move_pfn_range_to_zone() references
the function __meminit init_currently_empty_zone().
This is often because move_pfn_range_to_zone lacks a __meminit
annotation or the annotation of init_currently_empty_zone is wrong.
Both memmap_init_zone and init_currently_empty_zone are marked __meminit
but move_pfn_range_to_zone is used outside of __meminit sections (e.g.
devm_memremap_pages) so we have to hide it from the checker with a __ref
annotation.
Reported-by: Tobias Regnery <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/memory_hotplug.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0e21b9f67c9d..a358d7a67651 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -884,7 +884,7 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn;
}
-void move_pfn_range_to_zone(struct zone *zone,
+void __ref move_pfn_range_to_zone(struct zone *zone,
unsigned long start_pfn, unsigned long nr_pages)
{
struct pglist_data *pgdat = zone->zone_pgdat;
--
2.11.0
--
Michal Hocko
SUSE Labs
On Tue, Apr 04, 2017 at 10:23:02AM +0200, Michal Hocko wrote:
>OK, so I've been thinking about that and I believe that the page_initialized
>check in get_nid_for_pfn is just bogus. There is nothing that relies on
>page::lru being already initialized. So I will go with the following as
>a separate preparatory patch.
>
>I believe the whole code should be revisited and I have put that on my
>ever growing todo list because I suspect that it is more complex than
>necessary. I suspect that memblocks do not span more than one node and all
>this is just-in-case code (e.g. the onlining code assumes a single zone aka
>node). But let's do that later.
>
>---
>From fd2e3b6eca1cf7766527203d23db6aca5957a3f1 Mon Sep 17 00:00:00 2001
>From: Michal Hocko <[email protected]>
>Date: Tue, 4 Apr 2017 10:05:06 +0200
>Subject: [PATCH] mm: drop page_initialized check from get_nid_for_pfn
>
>c04fc586c1a4 ("mm: show node to memory section relationship with
>symlinks in sysfs") has added means to export memblock<->node
>association into the sysfs. It has also introduced get_nid_for_pfn
>which is a rather confusing counterpart of pfn_to_nid which checks also
>whether the pfn page is already initialized (page_initialized). This
>is done by checking page::lru != NULL which doesn't make any sense at
>all. Nothing in this path really relies on the lru list being used or
>initialized. Just remove it
>
>Signed-off-by: Michal Hocko <[email protected]>
>---
> drivers/base/node.c | 5 -----
> 1 file changed, 5 deletions(-)
>
>diff --git a/drivers/base/node.c b/drivers/base/node.c
>index 5548f9686016..ee080a35e869 100644
>--- a/drivers/base/node.c
>+++ b/drivers/base/node.c
>@@ -368,8 +368,6 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
> }
>
> #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
>-#define page_initialized(page) (page->lru.next)
>-
> static int __ref get_nid_for_pfn(unsigned long pfn)
> {
> struct page *page;
>@@ -380,9 +378,6 @@ static int __ref get_nid_for_pfn(unsigned long pfn)
> if (system_state == SYSTEM_BOOTING)
> return early_pfn_to_nid(pfn);
> #endif
>- page = pfn_to_page(pfn);
>- if (!page_initialized(page))
>- return -1;
> return pfn_to_nid(pfn);
> }
>
Verified that /sys/devices/system/node/nodeX/memoryY links are there now.
--
Reza Arbab
On Tue, Apr 04, 2017 at 10:23:02AM +0200, Michal Hocko wrote:
>diff --git a/drivers/base/node.c b/drivers/base/node.c
>index 5548f9686016..ee080a35e869 100644
>--- a/drivers/base/node.c
>+++ b/drivers/base/node.c
>@@ -368,8 +368,6 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
> }
>
> #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
>-#define page_initialized(page) (page->lru.next)
>-
> static int __ref get_nid_for_pfn(unsigned long pfn)
> {
> struct page *page;
>@@ -380,9 +378,6 @@ static int __ref get_nid_for_pfn(unsigned long pfn)
> if (system_state == SYSTEM_BOOTING)
> return early_pfn_to_nid(pfn);
> #endif
>- page = pfn_to_page(pfn);
>- if (!page_initialized(page))
>- return -1;
> return pfn_to_nid(pfn);
> }
>
You can get rid of 'page' altogether.
drivers/base/node.c: In function ‘get_nid_for_pfn’:
drivers/base/node.c:373:15: warning: unused variable ‘page’ [-Wunused-variable]
--
Reza Arbab
On Tue 04-04-17 11:02:39, Reza Arbab wrote:
> On Tue, Apr 04, 2017 at 10:23:02AM +0200, Michal Hocko wrote:
> >diff --git a/drivers/base/node.c b/drivers/base/node.c
> >index 5548f9686016..ee080a35e869 100644
> >--- a/drivers/base/node.c
> >+++ b/drivers/base/node.c
> >@@ -368,8 +368,6 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
> >}
> >
> >#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
> >-#define page_initialized(page) (page->lru.next)
> >-
> >static int __ref get_nid_for_pfn(unsigned long pfn)
> >{
> > struct page *page;
> >@@ -380,9 +378,6 @@ static int __ref get_nid_for_pfn(unsigned long pfn)
> > if (system_state == SYSTEM_BOOTING)
> > return early_pfn_to_nid(pfn);
> >#endif
> >- page = pfn_to_page(pfn);
> >- if (!page_initialized(page))
> >- return -1;
> > return pfn_to_nid(pfn);
> >}
> >
>
> You can get rid of 'page' altogether.
>
> drivers/base/node.c: In function ‘get_nid_for_pfn’:
> drivers/base/node.c:373:15: warning: unused variable ‘page’ [-Wunused-variable]
Right, updated.
Thanks for your testing! This is highly appreciated.
Can I assume your Tested-by?
--
Michal Hocko
SUSE Labs
On Tue, Apr 04, 2017 at 06:44:53PM +0200, Michal Hocko wrote:
>Thanks for your testing! This is highly appreciated.
>Can I assume your Tested-by?
Of course! Not quite done, though. I think I found another edge case.
You get an oops when removing all of a node's memory:
__nr_to_section
__pfn_to_section
find_biggest_section_pfn
shrink_pgdat_span
__remove_zone
__remove_section
__remove_pages
arch_remove_memory
remove_memory
I stuck some debugging prints in, for context:
shrink_pgdat_span: start_pfn=0x10000, end_pfn=0x10100, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
shrink_pgdat_span: start_pfn=0x10100, end_pfn=0x10200, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
...%<...
shrink_pgdat_span: start_pfn=0x1fe00, end_pfn=0x1ff00, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
shrink_pgdat_span: start_pfn=0x1ff00, end_pfn=0x20000, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
find_biggest_section_pfn: start_pfn=0x0, end_pfn=0x1ff00
find_biggest_section_pfn loop: pfn=0x1feff, sec_nr = 0x1fe
find_biggest_section_pfn loop: pfn=0x1fdff, sec_nr = 0x1fd
...%<...
find_biggest_section_pfn loop: pfn=0x1ff, sec_nr = 0x1
find_biggest_section_pfn loop: pfn=0xff, sec_nr = 0x0
find_biggest_section_pfn loop: pfn=0xffffffffffffffff, sec_nr = 0xffffffffffffff
Unable to handle kernel paging request for data at address 0xc000800000f19e78
--
Reza Arbab
On Tue 04-04-17 13:30:13, Reza Arbab wrote:
> On Tue, Apr 04, 2017 at 06:44:53PM +0200, Michal Hocko wrote:
> >Thanks for your testing! This is highly appreciated.
> >Can I assume your Tested-by?
>
> Of course! Not quite done, though.
Ohh, I didn't mean to rush you to that!
> I think I found another edge case. You
> get an oops when removing all of a node's memory:
>
> __nr_to_section
> __pfn_to_section
> find_biggest_section_pfn
> shrink_pgdat_span
> __remove_zone
> __remove_section
> __remove_pages
> arch_remove_memory
> remove_memory
Is this something new or an old issue? I believe the state after the
online should be the same as before. So if you onlined the full node
then there shouldn't be any difference. Let me have a look...
> I stuck some debugging prints in, for context:
>
> shrink_pgdat_span: start_pfn=0x10000, end_pfn=0x10100, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
> shrink_pgdat_span: start_pfn=0x10100, end_pfn=0x10200, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
> ...%<...
> shrink_pgdat_span: start_pfn=0x1fe00, end_pfn=0x1ff00, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
> shrink_pgdat_span: start_pfn=0x1ff00, end_pfn=0x20000, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
> find_biggest_section_pfn: start_pfn=0x0, end_pfn=0x1ff00
> find_biggest_section_pfn loop: pfn=0x1feff, sec_nr = 0x1fe
> find_biggest_section_pfn loop: pfn=0x1fdff, sec_nr = 0x1fd
> ...%<...
> find_biggest_section_pfn loop: pfn=0x1ff, sec_nr = 0x1
> find_biggest_section_pfn loop: pfn=0xff, sec_nr = 0x0
> find_biggest_section_pfn loop: pfn=0xffffffffffffffff, sec_nr = 0xffffffffffffff
> Unable to handle kernel paging request for data at address 0xc000800000f19e78
...this looks like a straight underflow and it is clear that the code
is just broken. Have a look at the loop
pfn = end_pfn - 1;
for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
and assume that end_pfn is properly PAGES_PER_SECTION aligned (start_pfn
would be 0, obviously). This is unsigned arithmetic, so it cannot work
for the first section. So the code is broken and has been broken since
it was introduced. Nobody has noticed because the low pfns are usually
reserved and out of the hotplug reach.
am not even sure we really want/need this behavior. It complicates the
code and am not really sure we need to support
online_movable(range)
offline_movable(range)
online_kernel(range)
While the flexibility is attractive I do not think it is worth the
additional complexity without any proof of the usecase. Especially when
we consider that this only works when we offline from the start or end of
the zone, or the whole zone. I guess it would be best to simply revert
this whole thing. It is quite a lot of code with a dubious use. What
do the Fujitsu guys think about it?
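To make the wrap-around obvious, here is a minimal userspace sketch of that
loop (not kernel code, PAGES_PER_SECTION hardcoded to 0x100 just for the
demo); it reproduces the 0xff -> 0xffffffffffffffff jump from the debug
output above:

#include <stdio.h>

#define PAGES_PER_SECTION 0x100UL

int main(void)
{
	unsigned long start_pfn = 0;	/* first section of the node/zone */
	unsigned long end_pfn = PAGES_PER_SECTION;
	unsigned long pfn = end_pfn - 1;
	int iterations = 0;

	/* same loop shape as find_biggest_section_pfn() */
	for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
		printf("visiting pfn=0x%lx\n", pfn);
		/* bail out so the demo terminates; the kernel loop would not */
		if (++iterations >= 2)
			break;
	}
	/*
	 * pfn is unsigned, so "pfn >= start_pfn" is always true for
	 * start_pfn == 0: after the first iteration pfn wraps around to
	 * 0xffffffffffffffff and the loop just keeps going.
	 */
	return 0;
}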
---
>From 1b08ecef3e8ebcef585fe8f2b23155be54cce335 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Tue, 4 Apr 2017 21:09:00 +0200
Subject: [PATCH] mm, hotplug: get rid of zone/node shrinking
this is basically a revert of 815121d2b5cd ("memory_hotplug: clear zone
when removing the memory"). The node/zone shrinking sounds
attractive at first sight because it allows
online_movable(range)
[...]
offline_movable(range)
[...]
online_kernel(range)
but this requires that the range is at the beginning or the end of a
zone or operates on a whole-zone basis. This code is even broken as
noticed by Reza Arbab. He has triggered an oops when removing all of a
node's memory:
Unable to handle kernel paging request for data at address 0xc000800000f19e78
__nr_to_section
__pfn_to_section
find_biggest_section_pfn
shrink_pgdat_span
__remove_zone
__remove_section
__remove_pages
arch_remove_memory
remove_memory
which is caused by an underflow in find_biggest_section_pfn. This code
simply cannot work on the first section [0, PAGES_PER_SECTION) because
pfn = end_pfn - 1;
for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
pfn would underflow in the unsigned arithmetic. This usually doesn't
happen because the lowest pfns are typically reserved and out of the
hotplug reach.
The changelog of the above commit doesn't mention any such usecase and
sounds more like a nice-to-have and an inverse to __add_zone which we are
trying to get rid of in this series. So let's simplify the code and
remove the complication. We can reintroduce it later along with a valid
usecase description.
Signed-off-by: Michal Hocko <[email protected]>
---
mm/memory_hotplug.c | 207 ----------------------------------------------------
1 file changed, 207 deletions(-)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a358d7a67651..d48a4456b20d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -379,212 +379,9 @@ EXPORT_SYMBOL_GPL(__add_pages);
#ifdef CONFIG_MEMORY_HOTREMOVE
/* find the smallest valid pfn in the range [start_pfn, end_pfn) */
-static int find_smallest_section_pfn(int nid, struct zone *zone,
- unsigned long start_pfn,
- unsigned long end_pfn)
-{
- struct mem_section *ms;
-
- for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
- ms = __pfn_to_section(start_pfn);
-
- if (unlikely(!valid_section(ms)))
- continue;
-
- if (unlikely(pfn_to_nid(start_pfn) != nid))
- continue;
-
- if (zone && zone != page_zone(pfn_to_page(start_pfn)))
- continue;
-
- return start_pfn;
- }
-
- return 0;
-}
-
-/* find the biggest valid pfn in the range [start_pfn, end_pfn). */
-static int find_biggest_section_pfn(int nid, struct zone *zone,
- unsigned long start_pfn,
- unsigned long end_pfn)
-{
- struct mem_section *ms;
- unsigned long pfn;
-
- /* pfn is the end pfn of a memory section. */
- pfn = end_pfn - 1;
- for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
- ms = __pfn_to_section(pfn);
-
- if (unlikely(!valid_section(ms)))
- continue;
-
- if (unlikely(pfn_to_nid(pfn) != nid))
- continue;
-
- if (zone && zone != page_zone(pfn_to_page(pfn)))
- continue;
-
- return pfn;
- }
-
- return 0;
-}
-
-static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
- unsigned long end_pfn)
-{
- unsigned long zone_start_pfn = zone->zone_start_pfn;
- unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
- unsigned long zone_end_pfn = z;
- unsigned long pfn;
- struct mem_section *ms;
- int nid = zone_to_nid(zone);
-
- zone_span_writelock(zone);
- if (zone_start_pfn == start_pfn) {
- /*
- * If the section is smallest section in the zone, it need
- * shrink zone->zone_start_pfn and zone->zone_spanned_pages.
- * In this case, we find second smallest valid mem_section
- * for shrinking zone.
- */
- pfn = find_smallest_section_pfn(nid, zone, end_pfn,
- zone_end_pfn);
- if (pfn) {
- zone->zone_start_pfn = pfn;
- zone->spanned_pages = zone_end_pfn - pfn;
- }
- } else if (zone_end_pfn == end_pfn) {
- /*
- * If the section is biggest section in the zone, it need
- * shrink zone->spanned_pages.
- * In this case, we find second biggest valid mem_section for
- * shrinking zone.
- */
- pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
- start_pfn);
- if (pfn)
- zone->spanned_pages = pfn - zone_start_pfn + 1;
- }
-
- /*
- * The section is not biggest or smallest mem_section in the zone, it
- * only creates a hole in the zone. So in this case, we need not
- * change the zone. But perhaps, the zone has only hole data. Thus
- * it check the zone has only hole or not.
- */
- pfn = zone_start_pfn;
- for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
- ms = __pfn_to_section(pfn);
-
- if (unlikely(!valid_section(ms)))
- continue;
-
- if (page_zone(pfn_to_page(pfn)) != zone)
- continue;
-
- /* If the section is current section, it continues the loop */
- if (start_pfn == pfn)
- continue;
-
- /* If we find valid section, we have nothing to do */
- zone_span_writeunlock(zone);
- return;
- }
-
- /* The zone has no valid section */
- zone->zone_start_pfn = 0;
- zone->spanned_pages = 0;
- zone_span_writeunlock(zone);
-}
-
-static void shrink_pgdat_span(struct pglist_data *pgdat,
- unsigned long start_pfn, unsigned long end_pfn)
-{
- unsigned long pgdat_start_pfn = pgdat->node_start_pfn;
- unsigned long p = pgdat_end_pfn(pgdat); /* pgdat_end_pfn namespace clash */
- unsigned long pgdat_end_pfn = p;
- unsigned long pfn;
- struct mem_section *ms;
- int nid = pgdat->node_id;
-
- if (pgdat_start_pfn == start_pfn) {
- /*
- * If the section is smallest section in the pgdat, it need
- * shrink pgdat->node_start_pfn and pgdat->node_spanned_pages.
- * In this case, we find second smallest valid mem_section
- * for shrinking zone.
- */
- pfn = find_smallest_section_pfn(nid, NULL, end_pfn,
- pgdat_end_pfn);
- if (pfn) {
- pgdat->node_start_pfn = pfn;
- pgdat->node_spanned_pages = pgdat_end_pfn - pfn;
- }
- } else if (pgdat_end_pfn == end_pfn) {
- /*
- * If the section is biggest section in the pgdat, it need
- * shrink pgdat->node_spanned_pages.
- * In this case, we find second biggest valid mem_section for
- * shrinking zone.
- */
- pfn = find_biggest_section_pfn(nid, NULL, pgdat_start_pfn,
- start_pfn);
- if (pfn)
- pgdat->node_spanned_pages = pfn - pgdat_start_pfn + 1;
- }
-
- /*
- * If the section is not biggest or smallest mem_section in the pgdat,
- * it only creates a hole in the pgdat. So in this case, we need not
- * change the pgdat.
- * But perhaps, the pgdat has only hole data. Thus it check the pgdat
- * has only hole or not.
- */
- pfn = pgdat_start_pfn;
- for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SECTION) {
- ms = __pfn_to_section(pfn);
-
- if (unlikely(!valid_section(ms)))
- continue;
-
- if (pfn_to_nid(pfn) != nid)
- continue;
-
- /* If the section is current section, it continues the loop */
- if (start_pfn == pfn)
- continue;
-
- /* If we find valid section, we have nothing to do */
- return;
- }
-
- /* The pgdat has no valid section */
- pgdat->node_start_pfn = 0;
- pgdat->node_spanned_pages = 0;
-}
-
-static void __remove_zone(struct zone *zone, unsigned long start_pfn)
-{
- struct pglist_data *pgdat = zone->zone_pgdat;
- int nr_pages = PAGES_PER_SECTION;
- int zone_type;
- unsigned long flags;
-
- zone_type = zone - pgdat->node_zones;
-
- pgdat_resize_lock(zone->zone_pgdat, &flags);
- shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
- shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
- pgdat_resize_unlock(zone->zone_pgdat, &flags);
-}
-
static int __remove_section(struct zone *zone, struct mem_section *ms,
unsigned long map_offset)
{
- unsigned long start_pfn;
- int scn_nr;
int ret = -EINVAL;
if (!valid_section(ms))
@@ -594,10 +391,6 @@ static int __remove_section(struct zone *zone, struct mem_section *ms,
if (ret)
return ret;
- scn_nr = __section_nr(ms);
- start_pfn = section_nr_to_pfn(scn_nr);
- __remove_zone(zone, start_pfn);
-
sparse_remove_one_section(zone, ms, map_offset);
return 0;
}
--
2.11.0
--
Michal Hocko
SUSE Labs
On Tue, Apr 04, 2017 at 09:41:22PM +0200, Michal Hocko wrote:
>On Tue 04-04-17 13:30:13, Reza Arbab wrote:
>> I think I found another edge case. You
>> get an oops when removing all of a node's memory:
>>
>> __nr_to_section
>> __pfn_to_section
>> find_biggest_section_pfn
>> shrink_pgdat_span
>> __remove_zone
>> __remove_section
>> __remove_pages
>> arch_remove_memory
>> remove_memory
>
>Is this something new or an old issue? I believe the state after the
>online should be the same as before. So if you onlined the full node
>then there shouldn't be any difference. Let me have a look...
It's new. Without this patchset, I can repeatedly
add_memory()->online_movable->offline->remove_memory() all of a node's
memory.
>From 1b08ecef3e8ebcef585fe8f2b23155be54cce335 Mon Sep 17 00:00:00 2001
>From: Michal Hocko <[email protected]>
>Date: Tue, 4 Apr 2017 21:09:00 +0200
>Subject: [PATCH] mm, hotplug: get rid of zone/node shrinking
>
...%<...
>---
> mm/memory_hotplug.c | 207 ----------------------------------------------------
> 1 file changed, 207 deletions(-)
Okay, getting further. With this I can again repeatedly add and remove,
but now I'm seeing a weird variation of that earlier issue:
1. add_memory(), online_movable
/sys/devices/system/node/nodeX/memoryY symlinks are created.
2. offline, remove_memory()
The node is offlined, since all memory has been removed, so all of
/sys/devices/system/node/nodeX is gone. This is normal.
3. add_memory(), online_movable
The node is onlined, so /sys/devices/system/node/nodeX is recreated,
and the memory is added, but just like earlier in this email thread,
the memoryY links are not there.
--
Reza Arbab
On Tue 04-04-17 21:41:22, Michal Hocko wrote:
> On Tue 04-04-17 13:30:13, Reza Arbab wrote:
> > On Tue, Apr 04, 2017 at 06:44:53PM +0200, Michal Hocko wrote:
> > >Thanks for your testing! This is highly appreciated.
> > >Can I assume your Tested-by?
> >
> > Of course! Not quite done, though.
>
> Ohh, I didn't mean to rush you to that!
>
> > I think I found another edge case. You
> > get an oops when removing all of a node's memory:
> >
> > __nr_to_section
> > __pfn_to_section
> > find_biggest_section_pfn
> > shrink_pgdat_span
> > __remove_zone
> > __remove_section
> > __remove_pages
> > arch_remove_memory
> > remove_memory
>
> Is this something new or an old issue? I believe the state after the
> online should be the same as before. So if you onlined the full node
> then there shouldn't be any difference. Let me have a look...
>
> > I stuck some debugging prints in, for context:
> >
> > shrink_pgdat_span: start_pfn=0x10000, end_pfn=0x10100, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
> > shrink_pgdat_span: start_pfn=0x10100, end_pfn=0x10200, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
> > ...%<...
> > shrink_pgdat_span: start_pfn=0x1fe00, end_pfn=0x1ff00, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
> > shrink_pgdat_span: start_pfn=0x1ff00, end_pfn=0x20000, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
> > find_biggest_section_pfn: start_pfn=0x0, end_pfn=0x1ff00
> > find_biggest_section_pfn loop: pfn=0x1feff, sec_nr = 0x1fe
> > find_biggest_section_pfn loop: pfn=0x1fdff, sec_nr = 0x1fd
> > ...%<...
> > find_biggest_section_pfn loop: pfn=0x1ff, sec_nr = 0x1
> > find_biggest_section_pfn loop: pfn=0xff, sec_nr = 0x0
> > find_biggest_section_pfn loop: pfn=0xffffffffffffffff, sec_nr = 0xffffffffffffff
> > Unable to handle kernel paging request for data at address 0xc000800000f19e78
>
> ...this looks like a straight underflow and it is clear that the code
> is just broken. Have a look at the loop
> pfn = end_pfn - 1;
> for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
>
> assume that end_pfn is properly PAGES_PER_SECTION aligned (start_pfn
> would be 0 obviously). This is unsigned arithmetic and so it cannot work
> for the first section. So the code is broken and has been broken since
> it has been introduced. Nobody has noticed because the low pfns are
> usually reserved and out of the hotplug reach. We could tweak it but I
> am not even sure we really want/need this behavior. It complicates the
> code and am not really sure we need to support
> online_movable(range)
> offline_movable(range)
> online_kernel(range)
OK, so I managed to confuse myself. This is not about offlining. This is
about the arch_remove_memory path, which means this is about memory
hotremove. So we are talking about hotremove(N1, range1) and hotadd(N2,
range2) where range1 and range2 have a non-empty intersection. Do we
need to support this usecase?
--
Michal Hocko
SUSE Labs
On Tue 04-04-17 16:43:39, Reza Arbab wrote:
> On Tue, Apr 04, 2017 at 09:41:22PM +0200, Michal Hocko wrote:
> >On Tue 04-04-17 13:30:13, Reza Arbab wrote:
> >>I think I found another edge case. You
> >>get an oops when removing all of a node's memory:
> >>
> >>__nr_to_section
> >>__pfn_to_section
> >>find_biggest_section_pfn
> >>shrink_pgdat_span
> >>__remove_zone
> >>__remove_section
> >>__remove_pages
> >>arch_remove_memory
> >>remove_memory
> >
> >Is this something new or an old issue? I believe the state after the
> >online should be the same as before. So if you onlined the full node
> >then there shouldn't be any difference. Let me have a look...
>
> It's new. Without this patchset, I can repeatedly
> add_memory()->online_movable->offline->remove_memory() all of a node's
> memory.
This is quite unexpected because the code obviously cannot handle the
first memory section. Could you paste /proc/zoneinfo and
grep . -r /sys/devices/system/memory/auto_online_blocks/memory*, after
onlining for both patched and unpatched kernels?
> >From 1b08ecef3e8ebcef585fe8f2b23155be54cce335 Mon Sep 17 00:00:00 2001
> >From: Michal Hocko <[email protected]>
> >Date: Tue, 4 Apr 2017 21:09:00 +0200
> >Subject: [PATCH] mm, hotplug: get rid of zone/node shrinking
> >
> ...%<...
> >---
> >mm/memory_hotplug.c | 207 ----------------------------------------------------
> >1 file changed, 207 deletions(-)
>
> Okay, getting further. With this I can again repeatedly add and remove, but
> now I'm seeing a weird variation of that earlier issue:
>
> 1. add_memory(), online_movable
> /sys/devices/system/node/nodeX/memoryY symlinks are created.
>
> 2. offline, remove_memory()
> The node is offlined, since all memory has been removed, so all of
> /sys/devices/system/node/nodeX is gone. This is normal.
>
> 3. add_memory(), online_movable
> The node is onlined, so /sys/devices/system/node/nodeX is recreated,
> and the memory is added, but just like earlier in this email thread,
> the memoryY links are not there.
Could you add some printks to see why the sysfs creation failed please?
--
Michal Hocko
SUSE Labs
On Fri 31-03-17 09:39:54, Michal Hocko wrote:
> Fixed screw ups during the initial patch split up as per Hillf
> ---
> From 8be6c5e47de66210e47710c80e72e8abd899017b Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Wed, 29 Mar 2017 15:11:30 +0200
> Subject: [PATCH] mm: get rid of zone_is_initialized
>
> There shouldn't be any reason to add initialized when we can tell the
> same thing from checking whether there are any pages spanned to the
> zone. Remove zone_is_initialized() and replace it by zone_is_empty
> which can be used for the same set of tests.
>
> This shouldn't have any visible effect
I've decided to drop this patch. My main motivation was to simplify
the hotplug workflow. The situation is more hairy than I expected,
though. On one hand all zones should be initialized early during the
hotplug in add_memory_resource but direct users of arch_add_memory will
need this to be called, I suspect. Let's just keep the current status quo
and clean it up later. It is not really needed for this series.
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> include/linux/mmzone.h | 7 -------
> mm/memory_hotplug.c | 6 +++---
> mm/page_alloc.c | 3 +--
> 3 files changed, 4 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 618499159a7c..3bac3ed71c7a 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -442,8 +442,6 @@ struct zone {
> seqlock_t span_seqlock;
> #endif
>
> - int initialized;
> -
> /* Write-intensive fields used from the page allocator */
> ZONE_PADDING(_pad1_)
>
> @@ -520,11 +518,6 @@ static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
> return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
> }
>
> -static inline bool zone_is_initialized(struct zone *zone)
> -{
> - return zone->initialized;
> -}
> -
> static inline bool zone_is_empty(struct zone *zone)
> {
> return zone->spanned_pages == 0;
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 6fb6bd2df787..699f5a2a8efd 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -348,7 +348,7 @@ static void fix_zone_id(struct zone *zone, unsigned long start_pfn,
> static int __ref ensure_zone_is_initialized(struct zone *zone,
> unsigned long start_pfn, unsigned long num_pages)
> {
> - if (!zone_is_initialized(zone))
> + if (zone_is_empty(zone))
> return init_currently_empty_zone(zone, start_pfn, num_pages);
>
> return 0;
> @@ -1051,7 +1051,7 @@ bool zone_can_shift(unsigned long pfn, unsigned long nr_pages,
>
> /* no zones in use between current zone and target */
> for (i = idx + 1; i < target; i++)
> - if (zone_is_initialized(zone - idx + i))
> + if (!zone_is_empty(zone - idx + i))
> return false;
> }
>
> @@ -1062,7 +1062,7 @@ bool zone_can_shift(unsigned long pfn, unsigned long nr_pages,
>
> /* no zones in use between current zone and target */
> for (i = target + 1; i < idx; i++)
> - if (zone_is_initialized(zone - idx + i))
> + if (!zone_is_empty(zone - idx + i))
> return false;
> }
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5ee8a26fa383..756353d1e293 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -795,7 +795,7 @@ static inline void __free_one_page(struct page *page,
>
> max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
>
> - VM_BUG_ON(!zone_is_initialized(zone));
> + VM_BUG_ON(zone_is_empty(zone));
> VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
>
> VM_BUG_ON(migratetype == -1);
> @@ -5535,7 +5535,6 @@ int __meminit init_currently_empty_zone(struct zone *zone,
> zone_start_pfn, (zone_start_pfn + size));
>
> zone_init_free_lists(zone);
> - zone->initialized = 1;
>
> return 0;
> }
> --
> 2.11.0
>
> --
> Michal Hocko
> SUSE Labs
--
Michal Hocko
SUSE Labs
On Wed 05-04-17 11:06:54, Igor Mammedov wrote:
> On Wed, 5 Apr 2017 10:14:00 +0200
> Michal Hocko <[email protected]> wrote:
>
> > On Fri 31-03-17 09:39:54, Michal Hocko wrote:
> > > Fixed screw ups during the initial patch split up as per Hillf
> > > ---
> > > From 8be6c5e47de66210e47710c80e72e8abd899017b Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <[email protected]>
> > > Date: Wed, 29 Mar 2017 15:11:30 +0200
> > > Subject: [PATCH] mm: get rid of zone_is_initialized
> > >
> > > There shouldn't be any reason to add initialized when we can tell the
> > > same thing from checking whether there are any pages spanned to the
> > > zone. Remove zone_is_initialized() and replace it by zone_is_empty
> > > which can be used for the same set of tests.
> > >
> > > This shouldn't have any visible effect
> >
> > I've decided to drop this patch. My main motivation was to simplify
> > the hotplug workflow/ The situation is more hairy than I expected,
> > though. On one hand all zones should be initialized early during the
> > hotplug in add_memory_resource but direct users of arch_add_memory will
> > need this to be called I suspect. Let's just keep the current status quo
> > and clean up it later. It is not really needed for this series.
> Would you post v2 with all fixups you've done so far?
yes, but I would like to sort out what Reza Arbab is seeing first. I have
pushed my current changes to the branch at [1] for others to have something
to play with.
[1] git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git attempts/rewrite-mem_hotplug-WIP
Please note I might rebase it later.
--
Michal Hocko
SUSE Labs
On Wed 05-04-17 08:42:39, Michal Hocko wrote:
> On Tue 04-04-17 16:43:39, Reza Arbab wrote:
> > On Tue, Apr 04, 2017 at 09:41:22PM +0200, Michal Hocko wrote:
> > >On Tue 04-04-17 13:30:13, Reza Arbab wrote:
> > >>I think I found another edge case. You
> > >>get an oops when removing all of a node's memory:
> > >>
> > >>__nr_to_section
> > >>__pfn_to_section
> > >>find_biggest_section_pfn
> > >>shrink_pgdat_span
> > >>__remove_zone
> > >>__remove_section
> > >>__remove_pages
> > >>arch_remove_memory
> > >>remove_memory
> > >
> > >Is this something new or an old issue? I believe the state after the
> > >online should be the same as before. So if you onlined the full node
> > >then there shouldn't be any difference. Let me have a look...
> >
> > It's new. Without this patchset, I can repeatedly
> > add_memory()->online_movable->offline->remove_memory() all of a node's
> > memory.
>
> This is quite unexpected because the code obviously cannot handle the
> first memory section. Could you paste /proc/zoneinfo and
> grep . -r /sys/devices/system/memory/auto_online_blocks/memory*, after
> onlining for both patched and unpatched kernels?
Btw. how do you test this? I am really surprised you managed to
hotremove such a low pfn range.
--
Michal Hocko
SUSE Labs
On Wed, 5 Apr 2017 10:14:00 +0200
Michal Hocko <[email protected]> wrote:
> On Fri 31-03-17 09:39:54, Michal Hocko wrote:
> > Fixed screw ups during the initial patch split up as per Hillf
> > ---
> > From 8be6c5e47de66210e47710c80e72e8abd899017b Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <[email protected]>
> > Date: Wed, 29 Mar 2017 15:11:30 +0200
> > Subject: [PATCH] mm: get rid of zone_is_initialized
> >
> > There shouldn't be any reason to add initialized when we can tell the
> > same thing from checking whether there are any pages spanned to the
> > zone. Remove zone_is_initialized() and replace it by zone_is_empty
> > which can be used for the same set of tests.
> >
> > This shouldn't have any visible effect
>
> I've decided to drop this patch. My main motivation was to simplify
> the hotplug workflow/ The situation is more hairy than I expected,
> though. On one hand all zones should be initialized early during the
> hotplug in add_memory_resource but direct users of arch_add_memory will
> need this to be called I suspect. Let's just keep the current status quo
> and clean up it later. It is not really needed for this series.
Would you post v2 with all fixups you've done so far?
On Tue 04-04-17 16:43:39, Reza Arbab wrote:
> On Tue, Apr 04, 2017 at 09:41:22PM +0200, Michal Hocko wrote:
> >On Tue 04-04-17 13:30:13, Reza Arbab wrote:
> >>I think I found another edge case. You
> >>get an oops when removing all of a node's memory:
> >>
> >>__nr_to_section
> >>__pfn_to_section
> >>find_biggest_section_pfn
> >>shrink_pgdat_span
> >>__remove_zone
> >>__remove_section
> >>__remove_pages
> >>arch_remove_memory
> >>remove_memory
> >
> >Is this something new or an old issue? I believe the state after the
> >online should be the same as before. So if you onlined the full node
> >then there shouldn't be any difference. Let me have a look...
>
> It's new. Without this patchset, I can repeatedly
> add_memory()->online_movable->offline->remove_memory() all of a node's
> memory.
OK, I know what is going on here.
shrink_pgdat_span: start_pfn=0x1ff00, end_pfn=0x20000, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
[...]
find_biggest_section_pfn loop: pfn=0xff, sec_nr = 0x0
so the node starts at pfn 0 while we are trying to remove a range starting
from pfn=255 (1MB). Rather than going with find_smallest_section_pfn we
go with the other branch and that underflows as already mentioned. I
seriously doubt that the node really starts at pfn 0. I am not sure
which arch you are testing on but I believe we reserve the lowest
address pfn range on all arches. The previous code presumably handled
that properly because the original node/zone started at the lowest
possible address and the zone shifting then preserves that.
My code doesn't do that though, so I guess I have to sanitize the given
range. Does this help? Please drop the "mm, memory_hotplug: get rid of
zone/node shrinking" patch.
---
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index acf2b5eb5ecb..2c5613d19eb6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -750,6 +750,15 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
int ret;
struct memory_notify arg;
+ do {
+ if (pfn_valid(pfn))
+ break;
+ pfn++;
+ } while (--nr_pages > 0);
+
+ if (!nr_pages)
+ return -EINVAL;
+
nid = pfn_to_nid(pfn);
if (!allow_online_pfn_range(nid, pfn, nr_pages, online_type))
return -EINVAL;
--
Michal Hocko
SUSE Labs
On Wed, Apr 05, 2017 at 11:24:27AM +0200, Michal Hocko wrote:
>On Wed 05-04-17 08:42:39, Michal Hocko wrote:
>> On Tue 04-04-17 16:43:39, Reza Arbab wrote:
>> > It's new. Without this patchset, I can repeatedly
>> > add_memory()->online_movable->offline->remove_memory() all of a node's
>> > memory.
>>
>> This is quite unexpected because the code obviously cannot handle the
>> first memory section. Could you paste /proc/zoneinfo and
>> grep . -r /sys/devices/system/memory/auto_online_blocks/memory*, after
>> onlining for both patched and unpatched kernels?
>
>Btw. how do you test this? I am really surprised you managed to
>hotremove such a low pfn range.
When I boot, I have node 0 (4GB) and node 1 (empty):
Early memory node ranges
node 0: [mem 0x0000000000000000-0x00000000ffffffff]
Initmem setup node 0 [mem 0x0000000000000000-0x00000000ffffffff]
On node 0 totalpages: 65536
DMA zone: 64 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 65536 pages, LIFO batch:1
Could not find start_pfn for node 1
Initmem setup node 1 [mem 0x0000000000000000-0x0000000000000000]
On node 1 totalpages: 0
My steps from there:
1. add_memory(1, 0x100000000, 0x100000000)
2. echo online_movable > /sys/devices/system/node/node1/memory[511..256]
3. echo offline > /sys/devices/system/node/node1/memory[256..511]
4. remove_memory(1, 0x100000000, 0x100000000)
After step 2, regardless of kernel:
$ cat /proc/zoneinfo
Node 0, zone DMA
per-node stats
nr_inactive_anon 418
nr_active_anon 2710
nr_inactive_file 4895
nr_active_file 1945
nr_unevictable 0
nr_isolated_anon 0
nr_isolated_file 0
nr_pages_scanned 0
workingset_refault 0
workingset_activate 0
workingset_nodereclaim 0
nr_anon_pages 2654
nr_mapped 739
nr_file_pages 7314
nr_dirty 1
nr_writeback 0
nr_writeback_temp 0
nr_shmem 474
nr_shmem_hugepages 0
nr_shmem_pmdmapped 0
nr_anon_transparent_hugepages 0
nr_unstable 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_dirtied 3259
nr_written 460
pages free 53520
min 63
low 128
high 193
node_scanned 0
spanned 65536
present 65536
managed 65218
nr_free_pages 53520
nr_zone_inactive_anon 418
nr_zone_active_anon 2710
nr_zone_inactive_file 4895
nr_zone_active_file 1945
nr_zone_unevictable 0
nr_zone_write_pending 1
nr_mlock 0
nr_slab_reclaimable 438
nr_slab_unreclaimable 808
nr_page_table_pages 32
nr_kernel_stack 2080
nr_bounce 0
numa_hit 313226
numa_miss 0
numa_foreign 0
numa_interleave 3071
numa_local 313226
numa_other 0
nr_free_cma 0
protection: (0, 0, 0, 0)
pagesets
cpu: 0
count: 2
high: 6
batch: 1
vm stats threshold: 12
node_unreclaimable: 0
start_pfn: 0
node_inactive_ratio: 0
Node 1, zone Movable
per-node stats
nr_inactive_anon 0
nr_active_anon 0
nr_inactive_file 0
nr_active_file 0
nr_unevictable 0
nr_isolated_anon 0
nr_isolated_file 0
nr_pages_scanned 0
workingset_refault 0
workingset_activate 0
workingset_nodereclaim 0
nr_anon_pages 0
nr_mapped 0
nr_file_pages 0
nr_dirty 0
nr_writeback 0
nr_writeback_temp 0
nr_shmem 0
nr_shmem_hugepages 0
nr_shmem_pmdmapped 0
nr_anon_transparent_hugepages 0
nr_unstable 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_dirtied 0
nr_written 0
pages free 65536
min 63
low 128
high 193
node_scanned 0
spanned 65536
present 65536
managed 65536
nr_free_pages 65536
nr_zone_inactive_anon 0
nr_zone_active_anon 0
nr_zone_inactive_file 0
nr_zone_active_file 0
nr_zone_unevictable 0
nr_zone_write_pending 0
nr_mlock 0
nr_slab_reclaimable 0
nr_slab_unreclaimable 0
nr_page_table_pages 0
nr_kernel_stack 0
nr_bounce 0
numa_hit 0
numa_miss 0
numa_foreign 0
numa_interleave 0
numa_local 0
numa_other 0
nr_free_cma 0
protection: (0, 0, 0, 0)
pagesets
cpu: 0
count: 0
high: 6
batch: 1
vm stats threshold: 14
node_unreclaimable: 1
start_pfn: 65536
node_inactive_ratio: 0
After step 2, on v4.11-rc5:
$ grep . /sys/devices/system/memory/memory*/valid_zones
/sys/devices/system/memory/memory[0..254]/valid_zones:DMA
/sys/devices/system/memory/memory255/valid_zones:DMA Normal Movable
/sys/devices/system/memory/memory256/valid_zones:Movable Normal
/sys/devices/system/memory/memory[257..511]/valid_zones:Movable
After step 2, on v4.11-rc5 + all the patches from this thread:
$ grep . /sys/devices/system/memory/memory*/valid_zones
/sys/devices/system/memory/memory[0..255]/valid_zones:DMA
/sys/devices/system/memory/memory[256..511]/valid_zones:Movable
On v4.11-rc5, I can do steps 1-4 ad nauseam.
On v4.11-rc5 + all the patches from this thread, I can do things
repeatedly, but starting on the second iteration, all the
/sys/devices/system/node/node1/memory*
symlinks are not created. I can still proceed using the actual files,
/sys/devices/system/memory/memory[256..511]
instead. I think it may be because step 4 does node_set_offline(1). That
is, the node is not only emptied of memory, it is offlined completely.
I hope this made sense. :/
--
Reza Arbab
On Wed, Apr 05, 2017 at 03:52:49PM +0200, Michal Hocko wrote:
>My code doesn't do that though. So I guess I have to sanitize. Does
>this help? Please drop the "mm, memory_hotplug: get rid of zone/node
>shrinking" patch.
>---
>diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>index acf2b5eb5ecb..2c5613d19eb6 100644
>--- a/mm/memory_hotplug.c
>+++ b/mm/memory_hotplug.c
>@@ -750,6 +750,15 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
> int ret;
> struct memory_notify arg;
>
>+ do {
>+ if (pfn_valid(pfn))
>+ break;
>+ pfn++;
>+ } while (--nr_pages > 0);
>+
>+ if (!nr_pages)
>+ return -EINVAL;
>+
> nid = pfn_to_nid(pfn);
> if (!allow_online_pfn_range(nid, pfn, nr_pages, online_type))
> return -EINVAL;
Sorry, no change. Back to the oops in find_biggest_section_pfn().
--
Reza Arbab
On Wed 05-04-17 09:53:05, Reza Arbab wrote:
[...]
> I hope this made sense. :/
yes, it certainly helped me get a picture of your setup. I will
keep thinking about that. But one thing that is really bugging me is
how you could see low pfns in the previous oops. Please drop the last
patch and sprinkle printks down the remove_memory path to see where this
all goes south. I believe there is something lurking in my initialization
code. Please also scratch the pfn_valid check in the
online_pages diff. It will not help here.
I suspect I broke some hidden expectation that made add_memory vs.
remove_memory more symmetric before. I will keep digging.
--
Michal Hocko
SUSE Labs
On Wed, Apr 05, 2017 at 08:42:39AM +0200, Michal Hocko wrote:
>On Tue 04-04-17 16:43:39, Reza Arbab wrote:
>> Okay, getting further. With this I can again repeatedly add and
>> remove, but now I'm seeing a weird variation of that earlier issue:
>>
>> 1. add_memory(), online_movable
>> /sys/devices/system/node/nodeX/memoryY symlinks are created.
>>
>> 2. offline, remove_memory()
>> The node is offlined, since all memory has been removed, so all of
>> /sys/devices/system/node/nodeX is gone. This is normal.
>>
>> 3. add_memory(), online_movable
>> The node is onlined, so /sys/devices/system/node/nodeX is recreated,
>> and the memory is added, but just like earlier in this email thread,
>> the memoryY links are not there.
>
>Could you add some printks to see why the sysfs creation failed please?
Ah, simple enough. It's this, right at the top of
register_mem_sect_under_node():
if (!node_online(nid))
return 0;
That being the case, I really don't understand why your patches make any
difference. Is node_set_online() being called later than before somehow?
--
Reza Arbab
On Wed 05-04-17 10:48:52, Reza Arbab wrote:
> On Wed, Apr 05, 2017 at 08:42:39AM +0200, Michal Hocko wrote:
> >On Tue 04-04-17 16:43:39, Reza Arbab wrote:
> >>Okay, getting further. With this I can again repeatedly add and remove,
> >>but now I'm seeing a weird variation of that earlier issue:
> >>
> >>1. add_memory(), online_movable
> >> /sys/devices/system/node/nodeX/memoryY symlinks are created.
> >>
> >>2. offline, remove_memory()
> >> The node is offlined, since all memory has been removed, so all of
> >> /sys/devices/system/node/nodeX is gone. This is normal.
> >>
> >>3. add_memory(), online_movable
> >> The node is onlined, so /sys/devices/system/node/nodeX is recreated,
> >> and the memory is added, but just like earlier in this email thread,
> >> the memoryY links are not there.
> >
> >Could you add some printks to see why the sysfs creation failed please?
>
> Ah, simple enough. It's this, right at the top of
> register_mem_sect_under_node():
>
> if (!node_online(nid))
> return 0;
>
> That being the case, I really don't understand why your patches make any
> difference. Is node_set_online() being called later than before somehow?
This is really interesting. Because add_memory_resource does the
following
/* call arch's memory hotadd */
ret = arch_add_memory(nid, start, size);
if (ret < 0)
goto error;
/* we online node here. we can't roll back from here. */
node_set_online(nid);
so we are setting the node online _after_ arch_add_memory but the code
which adds those sysfs files is called from
arch_add_memory
__add_pages
__add_section
register_new_memory
register_mem_sect_under_node
node_online check
I haven't touched this part. What is the point of this check anyway? We
have already associated all the pages with a node (and with a zone prior
to my patches) so we _know_ how to create those links. The check goes
back to the initial submissions. Gary is not available anymore so we
cannot ask. But I completely fail to see how my changes could have made
any difference.
I assume that things start working after you remove that check? Btw. if
you put a printk into the original kernel, does it see the node online? I
would also be interested in whether you see try_offline_node setting the
node offline in the original code.
Thanks!
--
Michal Hocko
SUSE Labs
On Wed, Apr 05, 2017 at 05:42:59PM +0200, Michal Hocko wrote:
>But one thing that is really bugging me is how could you see low pfns
>in the previous oops. Please drop the last patch and sprinkle printks
>down the remove_memory path to see where this all go south. I believe
>that there is something in the initialization code lurking in my code.
>Please also scratch the pfn_valid check in online_pages diff. It will
>not help here.
Got it.
shrink_pgdat_span: start_pfn=0x10000, end_pfn=0x10100, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
The problem is that pgdat_start_pfn here should be 0x10000. As you
suspected, it never got set. This fixes things for me.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 623507f..37c1b63 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -884,7 +884,7 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
{
unsigned long old_end_pfn = pgdat_end_pfn(pgdat);
- if (start_pfn < pgdat->node_start_pfn)
+ if (!pgdat->node_spanned_pages || start_pfn < pgdat->node_start_pfn)
pgdat->node_start_pfn = start_pfn;
pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn;
---
Along these lines, maybe we should also do
- if (start_pfn < zone->zone_start_pfn)
+ if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn)
in resize_zone_range()?
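I.e. something along these lines (just a sketch - I am inferring the shape
of resize_zone_range() from the pgdat counterpart above, so the exact
signature in your tree may differ):

static void __meminit resize_zone_range(struct zone *zone, unsigned long start_pfn,
		unsigned long nr_pages)
{
	unsigned long old_end_pfn = zone_end_pfn(zone);

	/*
	 * An empty zone has no meaningful zone_start_pfn yet, so take the
	 * hot-added range as the new start, mirroring the pgdat fix above.
	 */
	if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn)
		zone->zone_start_pfn = start_pfn;

	zone->spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - zone->zone_start_pfn;
}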
--
Reza Arbab
On Wed 05-04-17 12:32:49, Reza Arbab wrote:
> On Wed, Apr 05, 2017 at 05:42:59PM +0200, Michal Hocko wrote:
> >But one thing that is really bugging me is how could you see low pfns in
> >the previous oops. Please drop the last patch and sprinkle printks down
> >the remove_memory path to see where this all go south. I believe that
> >there is something in the initialization code lurking in my code. Please
> >also scratch the pfn_valid check in online_pages diff. It will not help
> >here.
>
> Got it.
>
> shrink_pgdat_span: start_pfn=0x10000, end_pfn=0x10100, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
>
> The problem is that pgdat_start_pfn here should be 0x10000. As you
> suspected, it never got set. This fixes things for me.
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 623507f..37c1b63 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -884,7 +884,7 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
> {
> unsigned long old_end_pfn = pgdat_end_pfn(pgdat);
>
> - if (start_pfn < pgdat->node_start_pfn)
> + if (!pgdat->node_spanned_pages || start_pfn < pgdat->node_start_pfn)
> pgdat->node_start_pfn = start_pfn;
Dang! You are absolutely right. This explains the issue during
remove_memory. I still fail to see how this makes any difference for the
sysfs file registration though. If anything the pgdat will be larger and
so try_offline_node would also check the unrelated node0, but the code will
handle that and eventually offline node1 anyway. /me still confused.
> pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn;
> ---
>
> Along these lines, maybe we should also do
>
> - if (start_pfn < zone->zone_start_pfn)
> + if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn)
yes we should.
Thanks a lot!
--
Michal Hocko
SUSE Labs
On Wed 05-04-17 20:15:02, Michal Hocko wrote:
> On Wed 05-04-17 12:32:49, Reza Arbab wrote:
> > On Wed, Apr 05, 2017 at 05:42:59PM +0200, Michal Hocko wrote:
> > >But one thing that is really bugging me is how could you see low pfns in
> > >the previous oops. Please drop the last patch and sprinkle printks down
> > >the remove_memory path to see where this all go south. I believe that
> > >there is something in the initialization code lurking in my code. Please
> > >also scratch the pfn_valid check in online_pages diff. It will not help
> > >here.
> >
> > Got it.
> >
> > shrink_pgdat_span: start_pfn=0x10000, end_pfn=0x10100, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
> >
> > The problem is that pgdat_start_pfn here should be 0x10000. As you
> > suspected, it never got set. This fixes things for me.
> >
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 623507f..37c1b63 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -884,7 +884,7 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
> > {
> > unsigned long old_end_pfn = pgdat_end_pfn(pgdat);
> >
> > - if (start_pfn < pgdat->node_start_pfn)
> > + if (!pgdat->node_spanned_pages || start_pfn < pgdat->node_start_pfn)
> > pgdat->node_start_pfn = start_pfn;
>
> Dang! You are absolutely right. This explains the issue during the
> remove_memory. I still fail to see how this makes any difference for the
> sysfs file registration though. If anything the pgdat will be larger and
> so try_offline_node would check also unrelated node0 but the code will
> handle that and eventually offline the node1 anyway. /me still confused.
OK, so I've managed to convince my kvm setup to create a node without
any memory initially, but I cannot seem to reach node_set_offline in
try_offline_node: check_and_unmap_cpu_on_node fails for me even when I
offline the cpus bound to the node, because check_cpu_on_node uses
for_each_present_cpu. So I would have to start with a cpuless node or
find a way to hotremove those cpus.
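For reference, check_cpu_on_node() is roughly the following (quoting from
memory, so the details may differ slightly) - it walks present cpus, not
online ones, which is why merely offlining the cpus doesn't help:

static int check_cpu_on_node(pg_data_t *pgdat)
{
	int cpu;

	for_each_present_cpu(cpu) {
		if (cpu_to_node(cpu) == pgdat->node_id)
			/*
			 * The cpu on this node is only offlined, not
			 * removed, so we cannot offline the node.
			 */
			return -EBUSY;
	}

	return 0;
}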
--
Michal Hocko
SUSE Labs
On Wed, Apr 05, 2017 at 06:34:39PM +0200, Michal Hocko wrote:
>This is really interesting. Because add_memory_resource does the
>following
> /* call arch's memory hotadd */
> ret = arch_add_memory(nid, start, size);
>
> if (ret < 0)
> goto error;
>
> /* we online node here. we can't roll back from here. */
> node_set_online(nid);
>
>so we are setting the node online _after_ arch_add_memory but the code
>which adds those sysfs file is called from
>
>arch_add_memory
> __add_pages
> __add_section
> register_new_memory
> register_mem_sect_under_node
> node_online check
Okay, so it turns out the original code ends up creating the sysfs links
not here, but just a little bit afterwards.
add_memory
add_memory_resource
arch_add_memory
[your quoted stack trace above]
...
set_node_online
...
register_one_node
link_mem_sections
register_mem_sect_under_node
The reason they're not getting created now is because
NODE_DATA(nid)->node_spanned_pages = 0 at this point.
link_mem_sections: nid=1, start_pfn=0x10000, end_pfn=0x10000
This is another uninitialized situation, like the one with
node_start_pfn which caused my removal crash. Except here I'm not sure of
the correct place to splice in and set it.
--
Reza Arbab
On Wed 05-04-17 20:15:02, Michal Hocko wrote:
> On Wed 05-04-17 12:32:49, Reza Arbab wrote:
> > On Wed, Apr 05, 2017 at 05:42:59PM +0200, Michal Hocko wrote:
> > >But one thing that is really bugging me is how could you see low pfns in
> > >the previous oops. Please drop the last patch and sprinkle printks down
> > >the remove_memory path to see where this all go south. I believe that
> > >there is something in the initialization code lurking in my code. Please
> > >also scratch the pfn_valid check in online_pages diff. It will not help
> > >here.
> >
> > Got it.
> >
> > shrink_pgdat_span: start_pfn=0x10000, end_pfn=0x10100, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
> >
> > The problem is that pgdat_start_pfn here should be 0x10000. As you
> > suspected, it never got set. This fixes things for me.
> >
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 623507f..37c1b63 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -884,7 +884,7 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
> > {
> > unsigned long old_end_pfn = pgdat_end_pfn(pgdat);
> >
> > - if (start_pfn < pgdat->node_start_pfn)
> > + if (!pgdat->node_spanned_pages || start_pfn < pgdat->node_start_pfn)
> > pgdat->node_start_pfn = start_pfn;
>
> Dang! You are absolutely right. This explains the issue during the
> remove_memory. I still fail to see how this makes any difference for the
> sysfs file registration though. If anything the pgdat will be larger and
> so try_offline_node would check also unrelated node0 but the code will
> handle that and eventually offline the node1 anyway. /me still confused.
OK, I was staring into the code and I guess I finally understand what is
going on here. Looking at arch_add_memory->...->register_mem_sect_under_node
was just misleading. I am still not 100% sure why but we try to do the
same thing later from register_one_node->link_mem_sections for nodes
which were offline. I should have noticed this path before. And here
is the difference from the previous code. We are past arch_add_memory
and that path used to do __add_zone which among other things will also
resize node boundaries. I am not doing that anymore because I postpone
that to the onlining phase. Jeez this code is so convoluted my head
spins.
I am not really sure how to fix this. I suspect register_mem_sect_under_node
should just ignore the online state of the node. But I wouldn't
be all that surprised if this had some subtle reason as well. An
alternative would be to actually move register_mem_sect_under_node out
of register_new_memory and move it up the call stack, most probably to
add_memory_resource. We have the range and can map it to the memblock
and so will not rely on the node range. I will sleep over it and
hopefully come up with something tomorrow.
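Just to make the second option a bit more concrete, the tail of
add_memory_resource could look something like this (completely untested
sketch; a range-based link_mem_sections() doesn't exist in this form yet,
so treat that call as hypothetical):

	/* call arch's memory hotadd */
	ret = arch_add_memory(nid, start, size);
	if (ret < 0)
		goto error;

	/* we online node here. we can't roll back from here. */
	node_set_online(nid);
	[...]
	/*
	 * Create the node <-> memblock symlinks for exactly the hot-added
	 * range here, rather than from register_new_memory, so we no longer
	 * depend on the node span being resized inside arch_add_memory.
	 */
	link_mem_sections(nid, PFN_DOWN(start), size >> PAGE_SHIFT);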
--
Michal Hocko
SUSE Labs
On Thu 30-03-17 13:54:53, Michal Hocko wrote:
[...]
> -static int __meminit __add_section(int nid, struct zone *zone,
> - unsigned long phys_start_pfn)
> +static int __meminit __add_section(int nid, unsigned long phys_start_pfn)
> {
> int ret;
> + int i;
>
> if (pfn_valid(phys_start_pfn))
> return -EEXIST;
>
> - ret = sparse_add_one_section(zone, phys_start_pfn);
> -
> + ret = sparse_add_one_section(NODE_DATA(nid), phys_start_pfn);
> if (ret < 0)
> return ret;
>
> - ret = __add_zone(zone, phys_start_pfn);
> + /*
> + * Make all the pages reserved so that nobody will stumble over half
> + * initialized state.
> + */
> + for (i = 0; i < PAGES_PER_SECTION; i++) {
> + unsigned long pfn = phys_start_pfn + i;
> + if (!pfn_valid(pfn))
> + continue;
>
> - if (ret < 0)
> - return ret;
> + SetPageReserved(pfn_to_page(phys_start_pfn + i));
> + }
>
> return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
I have just realized one more dependency on the zone initialization.
register_new_memory relies on is_zone_device_section to rule out
memblock specific operations including sysfs infrastructure. I have come
up with the following to handle this.
---
>From 377eb5691706de179fbf570915c1cf365d3253c0 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Thu, 6 Apr 2017 09:50:35 +0200
Subject: [PATCH] mm, memory_hotplug: get rid of is_zone_device_section
device memory hotplug hooks into regular memory hotplug only half way.
It needs memory sections to track struct pages but there is no
need/desire to associate those sections with memory blocks and export
them to the userspace via sysfs because they cannot be onlined anyway.
This is currently expressed by for_device argument to arch_add_memory
which then makes sure to associate the given memory range with
ZONE_DEVICE. register_new_memory then relies on is_zone_device_section
to distinguish special memory hotplug from the regular one. While this
works now, later patches in this series want to move __add_zone outside
of arch_add_memory path so we have to come up with something else.
Add want_memblock down the __add_pages path and use it to control
whether the section->memblock association should be done. arch_add_memory
then just trivially wants memblock for everything but for_device hotplug.
remove_memory_section doesn't need is_zone_device_section either. We can
simply skip all the memblock specific cleanup if there is no memblock
for the given section.
Cc: Dan Williams <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
arch/ia64/mm/init.c | 2 +-
arch/powerpc/mm/mem.c | 2 +-
arch/s390/mm/init.c | 2 +-
arch/sh/mm/init.c | 2 +-
arch/x86/mm/init_32.c | 2 +-
arch/x86/mm/init_64.c | 2 +-
drivers/base/memory.c | 22 ++++++++--------------
include/linux/memory_hotplug.h | 2 +-
mm/memory_hotplug.c | 11 +++++++----
9 files changed, 22 insertions(+), 25 deletions(-)
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 06cdaef54b2e..62085fd902e6 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -657,7 +657,7 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
zone = pgdat->node_zones +
zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
- ret = __add_pages(nid, zone, start_pfn, nr_pages);
+ ret = __add_pages(nid, zone, start_pfn, nr_pages, !for_device);
if (ret)
printk("%s: Problem encountered in __add_pages() as ret=%d\n",
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 5f844337de21..ea3e09a62f38 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -149,7 +149,7 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
zone = pgdata->node_zones +
zone_for_memory(nid, start, size, 0, for_device);
- return __add_pages(nid, zone, start_pfn, nr_pages);
+ return __add_pages(nid, zone, start_pfn, nr_pages, !for_device);
}
#ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index bf5b8a0c4ff7..5c84346e5211 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -182,7 +182,7 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
continue;
nr_pages = (start_pfn + size_pages > zone_end_pfn) ?
zone_end_pfn - start_pfn : size_pages;
- rc = __add_pages(nid, zone, start_pfn, nr_pages);
+ rc = __add_pages(nid, zone, start_pfn, nr_pages, !for_device);
if (rc)
break;
start_pfn += nr_pages;
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 75491862d900..a9d57f75ae8c 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -498,7 +498,7 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
ret = __add_pages(nid, pgdat->node_zones +
zone_for_memory(nid, start, size, ZONE_NORMAL,
for_device),
- start_pfn, nr_pages);
+ start_pfn, nr_pages, !for_device);
if (unlikely(ret))
printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 928cfde76232..bc5530cc0746 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -824,7 +824,7 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
- return __add_pages(nid, zone, start_pfn, nr_pages);
+ return __add_pages(nid, zone, start_pfn, nr_pages, !for_device);
}
#ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 7eef17239378..39cfaee93975 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -652,7 +652,7 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
init_memory_mapping(start, start + size);
- ret = __add_pages(nid, zone, start_pfn, nr_pages);
+ ret = __add_pages(nid, zone, start_pfn, nr_pages, !for_device);
WARN_ON_ONCE(ret);
/* update max_pfn, max_low_pfn and high_memory */
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index cc4f1d0cbffe..89c15e942852 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -685,14 +685,6 @@ static int add_memory_block(int base_section_nr)
return 0;
}
-static bool is_zone_device_section(struct mem_section *ms)
-{
- struct page *page;
-
- page = sparse_decode_mem_map(ms->section_mem_map, __section_nr(ms));
- return is_zone_device_page(page);
-}
-
/*
* need an interface for the VM to add new memory regions,
* but without onlining it.
@@ -702,9 +694,6 @@ int register_new_memory(int nid, struct mem_section *section)
int ret = 0;
struct memory_block *mem;
- if (is_zone_device_section(section))
- return 0;
-
mutex_lock(&mem_sysfs_mutex);
mem = find_memory_block(section);
@@ -741,11 +730,16 @@ static int remove_memory_section(unsigned long node_id,
{
struct memory_block *mem;
- if (is_zone_device_section(section))
- return 0;
-
mutex_lock(&mem_sysfs_mutex);
+
+ /*
+ * Some users of the memory hotplug do not want/need memblock to
+ * track all sections. Skip over those.
+ */
mem = find_memory_block(section);
+ if (!mem)
+ return 0;
+
unregister_mem_sect_under_nodes(mem, __section_nr(section));
mem->section_count--;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 134a2f69c21a..3c8cf86201c3 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -111,7 +111,7 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
/* reasonably generic interface to expand the physical pages in a zone */
extern int __add_pages(int nid, struct zone *zone, unsigned long start_pfn,
- unsigned long nr_pages);
+ unsigned long nr_pages, bool want_memblock);
#ifdef CONFIG_NUMA
extern int memory_add_physaddr_to_nid(u64 start);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 28b16a831d40..5a8924a41a3b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -489,7 +489,7 @@ static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn)
}
static int __meminit __add_section(int nid, struct zone *zone,
- unsigned long phys_start_pfn)
+ unsigned long phys_start_pfn, bool want_memblock)
{
int ret;
@@ -506,7 +506,10 @@ static int __meminit __add_section(int nid, struct zone *zone,
if (ret < 0)
return ret;
- return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
+ if (want_memblock)
+ ret = register_new_memory(nid, __pfn_to_section(phys_start_pfn));
+
+ return ret;
}
/*
@@ -516,7 +519,7 @@ static int __meminit __add_section(int nid, struct zone *zone,
* add the new pages.
*/
int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
- unsigned long nr_pages)
+ unsigned long nr_pages, bool want_memblock)
{
unsigned long i;
int err = 0;
@@ -544,7 +547,7 @@ int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
}
for (i = start_sec; i <= end_sec; i++) {
- err = __add_section(nid, zone, section_nr_to_pfn(i));
+ err = __add_section(nid, zone, section_nr_to_pfn(i), want_memblock);
/*
* EEXIST is finally dealt with by ioresource collision
--
2.11.0
--
Michal Hocko
SUSE Labs
On Wed 05-04-17 18:34:39, Michal Hocko wrote:
> On Wed 05-04-17 10:48:52, Reza Arbab wrote:
> > On Wed, Apr 05, 2017 at 08:42:39AM +0200, Michal Hocko wrote:
> > >On Tue 04-04-17 16:43:39, Reza Arbab wrote:
> > >>Okay, getting further. With this I can again repeatedly add and remove,
> > >>but now I'm seeing a weird variation of that earlier issue:
> > >>
> > >>1. add_memory(), online_movable
> > >> /sys/devices/system/node/nodeX/memoryY symlinks are created.
> > >>
> > >>2. offline, remove_memory()
> > >> The node is offlined, since all memory has been removed, so all of
> > >> /sys/devices/system/node/nodeX is gone. This is normal.
> > >>
> > >>3. add_memory(), online_movable
> > >> The node is onlined, so /sys/devices/system/node/nodeX is recreated,
> > >> and the memory is added, but just like earlier in this email thread,
> > >> the memoryY links are not there.
> > >
> > >Could you add some printks to see why the sysfs creation failed please?
> >
> > Ah, simple enough. It's this, right at the top of
> > register_mem_sect_under_node():
> >
> > if (!node_online(nid))
> > return 0;
> >
> > That being the case, I really don't understand why your patches make any
> > difference. Is node_set_online() being called later than before somehow?
>
> This is really interesting. Because add_memory_resource does the
> following
> /* call arch's memory hotadd */
> ret = arch_add_memory(nid, start, size);
>
> if (ret < 0)
> goto error;
>
> /* we online node here. we can't roll back from here. */
> node_set_online(nid);
>
> so we are setting the node online _after_ arch_add_memory but the code
> which adds those sysfs file is called from
>
> arch_add_memory
> __add_pages
> __add_section
> register_new_memory
> register_mem_sect_under_node
> node_online check
>
> I haven't touched this part. What is the point of this check anyway? We
> have already associated all the pages with a node (and with a zone prior
> to my patches) so we _know_ how to create those links. The check goes
> back to the initial submissions. Gary is not available anymore so we
> cannot ask. But I completely fail to see how my changes could have made
> any difference.
I wasn't able to understand that from the code so I've just tried to
remove the check and it blew up:
BUG: unable to handle kernel NULL pointer dereference at
0000000000000040
IP: sysfs_create_link_nowarn+0x13/0x32
if (!kobj)
parent = sysfs_root_kn;
else
parent = kobj->sd;
^^^^^^^^^^^^^^^^^^
when creating the link
register_mem_sect_under_node:
ret = sysfs_create_link_nowarn(&node_devices[nid]->dev.kobj,
&mem_blk->dev.kobj,
kobject_name(&mem_blk->dev.kobj));
which means that node_devices[nid]->dev.kobj is NULL. This happens later
in register_one_node->register_node. This really _screams_ for a cleanup!
--
Michal Hocko
SUSE Labs
On Wed 05-04-17 23:02:14, Michal Hocko wrote:
[...]
> OK, I was staring into the code and I guess I finally understand what is
> going on here. Looking at arch_add_memory->...->register_mem_sect_under_node
> was just misleading. I am still not 100% sure why but we try to do the
> same thing later from register_one_node->link_mem_sections for nodes
> which were offline. I should have noticed this path before. And here
> is the difference from the previous code. We are past arch_add_memory
> and that path used to do __add_zone which among other things will also
> resize node boundaries. I am not doing that anymore because I postpone
> that to the onlining phase. Jeez this code is so convoluted my head
> spins.
>
> I am not really sure how to fix this. I suspect register_mem_sect_under_node
> should just ignore the online state of the node. But I wouldn't
> be all that surprised if this had some subtle reason as well. An
> alternative would be to actually move register_mem_sect_under_node out
> of register_new_memory and move it up the call stack, most probably to
> add_memory_resource. We have the range and can map it to the memblock
> and so will not rely on the node range. I will sleep over it and
> hopefully come up with something tomorrow.
OK, so this is the most sensible way I was able to come up with. I
didn't get to test it yet but from the above analysis it should work.
---
From 6c99a3284ea70262e3f25cbe71826a57aeaa7ffd Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Thu, 6 Apr 2017 11:59:37 +0200
Subject: [PATCH] mm, memory_hotplug: split up register_one_node
Memory hotplug (add_memory_resource) has to reinitialize node
infrastructure if the node is offline (one which went through the
complete add_memory(); remove_memory() cycle). That involves node
registration to the kobj infrastructure (register_node), the proper
association with cpus (register_cpu_under_node) and finally creation of
node<->memblock symlinks (link_mem_sections).
The last part requires knowing node_start_pfn and node_spanned_pages,
which we currently have, but a later patch will postpone this
initialization to the onlining phase which happens later. In fact
we do not need to rely on the early initialization even now because
we know which range is currently hot added.
Split register_one_node into a core part (__register_one_node) which does
all the work common to the boot time NUMA initialization and hotplug.
register_one_node keeps the full initialization while the hotplug path
calls __register_one_node and then link_mem_sections explicitly for the
proper range.
This shouldn't introduce any functional change.
Signed-off-by: Michal Hocko <[email protected]>
---
drivers/base/node.c | 51 ++++++++++++++++++++-------------------------------
include/linux/node.h | 35 ++++++++++++++++++++++++++++++++++-
mm/memory_hotplug.c | 17 ++++++++++++++++-
3 files changed, 70 insertions(+), 33 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 06294d69779b..dff5b53f7905 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -461,10 +461,9 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
return 0;
}
-static int link_mem_sections(int nid)
+int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages)
{
- unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
- unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
+ unsigned long end_pfn = start_pfn + nr_pages;
unsigned long pfn;
struct memory_block *mem_blk = NULL;
int err = 0;
@@ -552,10 +551,7 @@ static int node_memory_callback(struct notifier_block *self,
return NOTIFY_OK;
}
#endif /* CONFIG_HUGETLBFS */
-#else /* !CONFIG_MEMORY_HOTPLUG_SPARSE */
-
-static int link_mem_sections(int nid) { return 0; }
-#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
+#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
#if !defined(CONFIG_MEMORY_HOTPLUG_SPARSE) || \
!defined(CONFIG_HUGETLBFS)
@@ -569,39 +565,32 @@ static void init_node_hugetlb_work(int nid) { }
#endif
-int register_one_node(int nid)
+int __register_one_node(int nid)
{
- int error = 0;
+ int p_node = parent_node(nid);
+ struct node *parent = NULL;
+ int error;
int cpu;
- if (node_online(nid)) {
- int p_node = parent_node(nid);
- struct node *parent = NULL;
-
- if (p_node != nid)
- parent = node_devices[p_node];
-
- node_devices[nid] = kzalloc(sizeof(struct node), GFP_KERNEL);
- if (!node_devices[nid])
- return -ENOMEM;
-
- error = register_node(node_devices[nid], nid, parent);
+ if (p_node != nid)
+ parent = node_devices[p_node];
- /* link cpu under this node */
- for_each_present_cpu(cpu) {
- if (cpu_to_node(cpu) == nid)
- register_cpu_under_node(cpu, nid);
- }
+ node_devices[nid] = kzalloc(sizeof(struct node), GFP_KERNEL);
+ if (!node_devices[nid])
+ return -ENOMEM;
- /* link memory sections under this node */
- error = link_mem_sections(nid);
+ error = register_node(node_devices[nid], nid, parent);
- /* initialize work queue for memory hot plug */
- init_node_hugetlb_work(nid);
+ /* link cpu under this node */
+ for_each_present_cpu(cpu) {
+ if (cpu_to_node(cpu) == nid)
+ register_cpu_under_node(cpu, nid);
}
- return error;
+ /* initialize work queue for memory hot plug */
+ init_node_hugetlb_work(nid);
+ return error;
}
void unregister_one_node(int nid)
diff --git a/include/linux/node.h b/include/linux/node.h
index 2115ad5d6f19..2baa640d0b92 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -30,9 +30,38 @@ struct memory_block;
extern struct node *node_devices[];
typedef void (*node_registration_func_t)(struct node *);
+#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
+extern int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages);
+#else
+static int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages)
+{
+ return 0;
+}
+#endif
+
extern void unregister_node(struct node *node);
#ifdef CONFIG_NUMA
-extern int register_one_node(int nid);
+/* Core of the node registration - only memory hotplug should use this */
+extern int __register_one_node(int nid);
+
+/* Registers an online node */
+static inline int register_one_node(int nid)
+{
+ int error = 0;
+
+ if (node_online(nid)) {
+ struct pglist_data *pgdat = NODE_DATA(nid);
+
+ error = __register_one_node(nid);
+ if (error)
+ return error;
+ /* link memory sections under this node */
+ error = link_mem_sections(nid, pgdat->node_start_pfn, pgdat->node_spanned_pages);
+ }
+
+ return error;
+}
+
extern void unregister_one_node(int nid);
extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
@@ -46,6 +75,10 @@ extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
node_registration_func_t unregister);
#endif
#else
+static inline int __register_one_node(int nid)
+{
+ return 0;
+}
static inline int register_one_node(int nid)
{
return 0;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c2b018c808b7..2c731bdfa845 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1220,7 +1220,22 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
node_set_online(nid);
if (new_node) {
- ret = register_one_node(nid);
+ unsigned long start_pfn = start >> PAGE_SHIFT;
+ unsigned long nr_pages = size >> PAGE_SHIFT;
+
+ ret = __register_one_node(nid);
+ if (ret)
+ goto register_fail;
+
+ /*
+ * link memory sections under this node. This is already
+ * done when creating the memory section in register_new_memory
+ * but that depends on having the node registered so offline
+ * nodes have to go through register_node.
+ * TODO clean up this mess.
+ */
+ ret = link_mem_sections(nid, start_pfn, nr_pages);
+register_fail:
/*
* If sysfs file of new node can't create, cpu on the node
* can't be hot-added. There is no rollback way now.
--
2.11.0
--
Michal Hocko
SUSE Labs
On Thu 30-03-17 13:54:53, Michal Hocko wrote:
[...]
> +static struct zone * __meminit move_pfn_range(int online_type, int nid,
> + unsigned long start_pfn, unsigned long nr_pages)
> +{
> + struct pglist_data *pgdat = NODE_DATA(nid);
> + struct zone *zone = &pgdat->node_zones[ZONE_NORMAL];
> +
> + if (online_type == MMOP_ONLINE_KEEP) {
> + /*
> + * MMOP_ONLINE_KEEP inherits the current zone which is
> + * ZONE_NORMAL by default but we might be within ZONE_MOVABLE
> + * already.
> + */
> + if (allow_online_pfn_range(nid, start_pfn, nr_pages, MMOP_ONLINE_MOVABLE))
> + zone = &pgdat->node_zones[ZONE_MOVABLE];
> + } else if (online_type == MMOP_ONLINE_MOVABLE) {
> + zone = &pgdat->node_zones[ZONE_MOVABLE];
> }
>
> - *zone_shift = target - idx;
> - return true;
> + move_pfn_range_to_zone(zone, start_pfn, nr_pages);
> + return zone;
> }
I got the MMOP_ONLINE_KEEP wrong here. Relying on allow_online_pfn_range
is wrong because that would lead to MMOP_ONLINE_MOVABLE even when there is
no ZONE_NORMAL, while I believe we should online_kernel in that case.
Well, the semantic of MMOP_ONLINE_KEEP is rather fuzzy to me but I guess
it makes some sense to online movable only when explicitly stated or when
the range is _within_ an existing ZONE_MOVABLE. The following will fix
this. I will fold it into this patch.
---
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7c4fef1aba84..0f8816fd0b52 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -531,6 +531,20 @@ static inline bool zone_is_empty(struct zone *zone)
}
/*
+ * Return true if [start_pfn, start_pfn + nr_pages) range has a non-empty
+ * intersection with the given zone
+ */
+static inline bool zone_intersects(struct zone *zone,
+ unsigned long start_pfn, unsigned long nr_pages)
+{
+ if (zone->zone_start_pfn <= start_pfn && start_pfn < zone_end_pfn(zone))
+ return true;
+ if (start_pfn + nr_pages > start_pfn && !zone_is_empty(zone))
+ return true;
+ return false;
+}
+
+/*
* The "priority" of VM scanning is how much of the queues we will scan in one
* go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
* queues ("queue_length >> 12") during an aging round.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4f80abdc2047..2ff988f42377 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -934,13 +934,14 @@ static struct zone * __meminit move_pfn_range(int online_type, int nid,
struct zone *zone = &pgdat->node_zones[ZONE_NORMAL];
if (online_type == MMOP_ONLINE_KEEP) {
+ struct zone *movable_zone = &pgdat->node_zones[ZONE_MOVABLE];
/*
* MMOP_ONLINE_KEEP inherits the current zone which is
* ZONE_NORMAL by default but we might be within ZONE_MOVABLE
* already.
*/
- if (allow_online_pfn_range(nid, start_pfn, nr_pages, MMOP_ONLINE_MOVABLE))
- zone = &pgdat->node_zones[ZONE_MOVABLE];
+ if (zone_intersects(movable_zone, start_pfn, nr_pages))
+ zone = movable_zone;
} else if (online_type == MMOP_ONLINE_MOVABLE) {
zone = &pgdat->node_zones[ZONE_MOVABLE];
}
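To spell out the intended MMOP_ONLINE_KEEP semantic with this folded in (a
hypothetical example, not part of the patch):

	/*
	 * Say the node's ZONE_NORMAL spans pfns [0x100000, 0x140000) and its
	 * ZONE_MOVABLE is still empty. Onlining a new block at 0x140000 with
	 * MMOP_ONLINE_KEEP now stays in ZONE_NORMAL, because zone_intersects()
	 * on the empty movable zone returns false. Only an explicit
	 * MMOP_ONLINE_MOVABLE, or a range overlapping an already populated
	 * ZONE_MOVABLE, ends up in the movable zone.
	 */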
--
Michal Hocko
SUSE Labs
OK, so after recent change mostly driven by testing from Reza Arbab
(thanks again) I believe I am getting to a working state finally. All I
currently have is
in git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git tree
attempts/rewrite-mem_hotplug-WIP branch. I will highly appreciate more
testing of course and if there are no new issues found I will repost the
series for the review.
--
Michal Hocko
SUSE Labs
On Thu, Apr 06, 2017 at 03:08:46PM +0200, Michal Hocko wrote:
>OK, so after recent change mostly driven by testing from Reza Arbab
>(thanks again) I believe I am getting to a working state finally. All I
>currently have is
>in git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git tree
>attempts/rewrite-mem_hotplug-WIP branch. I will highly appreciate more
>testing of course and if there are no new issues found I will repost the
>series for the review.
Looking good! I can do my add/remove/repeat test and things seem fine.
One thing--starting on the second iteration, I am seeing the WARN in
free_area_init_node();
add_memory
add_memory_resource
hotadd_new_pgdat
free_area_init_node
WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
--
Reza Arbab
On Thu 06-04-17 10:24:49, Reza Arbab wrote:
> On Thu, Apr 06, 2017 at 03:08:46PM +0200, Michal Hocko wrote:
> >OK, so after recent change mostly driven by testing from Reza Arbab
> >(thanks again) I believe I am getting to a working state finally. All I
> >currently have is
> >in git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git tree
> >attempts/rewrite-mem_hotplug-WIP branch. I will highly appreciate more
> >testing of course and if there are no new issues found I will repost the
> >series for the review.
>
> Looking good! I can do my add/remove/repeat test and things seem fine.
>
> One thing--starting on the second iteration, I am seeing the WARN in
> free_area_init_node();
>
> add_memory
> add_memory_resource
> hotadd_new_pgdat
> free_area_init_node
> WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
Have you tested with my attempts/rewrite-mem_hotplug-WIP mentioned
elsewhere? Because I suspect that "mm: get rid of zone_is_initialized"
might cause this.
--
Michal Hocko
SUSE Labs
On Thu, Apr 06, 2017 at 05:41:28PM +0200, Michal Hocko wrote:
>On Thu 06-04-17 10:24:49, Reza Arbab wrote:
>> On Thu, Apr 06, 2017 at 03:08:46PM +0200, Michal Hocko wrote:
>> >OK, so after recent change mostly driven by testing from Reza Arbab
>> >(thanks again) I believe I am getting to a working state finally. All I
>> >currently have is
>> >in git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git tree
>> >attempts/rewrite-mem_hotplug-WIP branch. I will highly appreciate more
>> >testing of course and if there are no new issues found I will repost the
>> >series for the review.
>>
>> Looking good! I can do my add/remove/repeat test and things seem fine.
>>
>> One thing--starting on the second iteration, I am seeing the WARN in
>> free_area_init_node();
>>
>> add_memory
>> add_memory_resource
>> hotadd_new_pgdat
>> free_area_init_node
>> WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
>
>Have you tested with my attempts/rewrite-mem_hotplug-WIP mentioned
>elsewhere? Because I suspect that "mm: get rid of zone_is_initialized"
>might cause this.
This was my first time using your git branch instead of applying the
patches from this thread to v4.11-rc5 myself.
--
Reza Arbab
On Thu 06-04-17 10:46:53, Reza Arbab wrote:
> On Thu, Apr 06, 2017 at 05:41:28PM +0200, Michal Hocko wrote:
> >On Thu 06-04-17 10:24:49, Reza Arbab wrote:
> >>On Thu, Apr 06, 2017 at 03:08:46PM +0200, Michal Hocko wrote:
> >>>OK, so after recent change mostly driven by testing from Reza Arbab
> >>>(thanks again) I believe I am getting to a working state finally. All I
> >>>currently have is
> >>>in git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git tree
> >>>attempts/rewrite-mem_hotplug-WIP branch. I will highly appreciate more
> >>>testing of course and if there are no new issues found I will repost the
> >>>series for the review.
> >>
> >>Looking good! I can do my add/remove/repeat test and things seem fine.
> >>
> >>One thing--starting on the second iteration, I am seeing the WARN in
> >>free_area_init_node();
> >>
> >>add_memory
> >> add_memory_resource
> >> hotadd_new_pgdat
> >> free_area_init_node
> >> WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
> >
> >Have you tested with my attempts/rewrite-mem_hotplug-WIP mentioned
> >elsewhere? Because I suspect that "mm: get rid of zone_is_initialized"
> >might cause this.
>
> This was my first time using your git branch instead of applying the patches
> from this thread to v4.11-rc5 myself.
OK, so this looks like another thing to resolve. I have seen this
warning as well but I didn't consider it relevant because I had to tweak
the code to make the node go offline (removed check_and_unmap_cpu_on_node
from try_offline_node) so I thought it was a fallout from there.
But let's have a look. hotadd_new_pgdat does for an existing pgdat
/* Reset the nr_zones, order and classzone_idx before reuse */
pgdat->nr_zones = 0;
pgdat->kswapd_order = 0;
pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
so free_area_init_node absolutely has to hit this warning. This is not
in the Linus tree because it is still in Andrew's mmotm coming from
http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-vmscan-prevent-kswapd-sleeping-prematurely-due-to-mismatched-classzone_idx.patch
So yay, finally that doesn't come from me. Mel, I guess that either
hotadd_new_pgdat should keep its kswapd_classzone_idx = 0 or the warning
should be updated.
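Spelled out, the second and later hot-adds do (both snippets are quoted
verbatim earlier in the thread):

	/* hotadd_new_pgdat() reusing the previously offlined pgdat */
	pgdat->nr_zones = 0;
	pgdat->kswapd_order = 0;
	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;	/* non-zero */

	/* followed by free_area_init_node() on the same pgdat */
	WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);	/* trips */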
--
Michal Hocko
SUSE Labs
On Thu, Apr 06, 2017 at 06:21:55PM +0200, Michal Hocko wrote:
> On Thu 06-04-17 10:46:53, Reza Arbab wrote:
> > On Thu, Apr 06, 2017 at 05:41:28PM +0200, Michal Hocko wrote:
> > >On Thu 06-04-17 10:24:49, Reza Arbab wrote:
> > >>On Thu, Apr 06, 2017 at 03:08:46PM +0200, Michal Hocko wrote:
> > >>>OK, so after recent change mostly driven by testing from Reza Arbab
> > >>>(thanks again) I believe I am getting to a working state finally. All I
> > >>>currently have is
> > >>>in git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git tree
> > >>>attempts/rewrite-mem_hotplug-WIP branch. I will highly appreciate more
> > >>>testing of course and if there are no new issues found I will repost the
> > >>>series for the review.
> > >>
> > >>Looking good! I can do my add/remove/repeat test and things seem fine.
> > >>
> > >>One thing--starting on the second iteration, I am seeing the WARN in
> > >>free_area_init_node();
> > >>
> > >>add_memory
> > >> add_memory_resource
> > >> hotadd_new_pgdat
> > >> free_area_init_node
> > >> WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
> > >
> > >Have you tested with my attempts/rewrite-mem_hotplug-WIP mentioned
> > >elsewhere? Because I suspect that "mm: get rid of zone_is_initialized"
> > >might cause this.
> >
> > This was my first time using your git branch instead of applying the patches
> > from this thread to v4.11-rc5 myself.
>
> OK, so this looks like another thing to resolve. I have seen this
> warning as well but I didn't consider it relevant because I had to tweak
> the code make the node go offline (removed check_and_unmap_cpu_on_node
> from try_offline_node) so I thought it was a fallout from there.
>
> But let's have a look. hotadd_new_pgdat does for an existing pgdat
> /* Reset the nr_zones, order and classzone_idx before reuse */
> pgdat->nr_zones = 0;
> pgdat->kswapd_order = 0;
> pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
>
> so free_area_init_node absolutely has to hit this warning. This is not
> in the Linus tree because it is still in Andrew's mmotm coming from
> http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-vmscan-prevent-kswapd-sleeping-prematurely-due-to-mismatched-classzone_idx.patch
>
> So yay, finally that doesn't come from me. Mel, I guess that either
> hotadd_new_pgdat should keep its kswapd_classzone_idx = 0 or the warning
> should be updated.
>
Almost certainly the case that the warning should be updated.
--
Mel Gorman
SUSE Labs
On Thu, Apr 06, 2017 at 06:21:55PM +0200, Michal Hocko wrote:
> > This was my first time using your git branch instead of applying the patches
> > from this thread to v4.11-rc5 myself.
>
> OK, so this looks like another thing to resolve. I have seen this
> warning as well but I didn't consider it relevant because I had to tweak
> the code make the node go offline (removed check_and_unmap_cpu_on_node
> from try_offline_node) so I thought it was a fallout from there.
>
> But let's have a look. hotadd_new_pgdat does for an existing pgdat
> /* Reset the nr_zones, order and classzone_idx before reuse */
> pgdat->nr_zones = 0;
> pgdat->kswapd_order = 0;
> pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
>
> so free_area_init_node absolutely has to hit this warning. This is not
> in the Linus tree because it is still in Andrew's mmotm coming from
> http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-vmscan-prevent-kswapd-sleeping-prematurely-due-to-mismatched-classzone_idx.patch
>
> So yay, finally that doesn't come from me. Mel, I guess that either
> hotadd_new_pgdat should keep its kswapd_classzone_idx = 0 or the warning
> should be updated.
>
Actually, it became obvious very quickly when I started on the fix that
updating the warning would then make it trigger on normal boot. It's more
appropriate to let a hotadd of a new pgdat defer the initialisation of that
field until kswapd starts for the new node.
Can you try this? It's build/boot tested only, no hotplug testing.
---8<---
mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx -fix
The patch "mm, vmscan: prevent kswapd sleeping prematurely due to mismatched
classzone_idx" has different initial starting conditions when kswapd
is asleep. kswapd initialises it properly when it starts but the patch
initialises kswapd_classzone_idx early and trips on a warning in
free_area_init_node. This patch leaves the kswapd_classzone_idx as zero
and defers to kswapd to initialise it properly when it starts.
This is a fix to the mmotm patch
mm-vmscan-prevent-kswapd-sleeping-prematurely-due-to-mismatched-classzone_idx.patch
Signed-off-by: Mel Gorman <[email protected]>
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2309a7fbec93..76d4745513ee 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1214,10 +1214,14 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
arch_refresh_nodedata(nid, pgdat);
} else {
- /* Reset the nr_zones, order and classzone_idx before reuse */
+ /*
+ * Reset the nr_zones, order and classzone_idx before reuse.
+ * Note that kswapd will init kswapd_classzone_idx properly
+ * when it starts in the near future.
+ */
pgdat->nr_zones = 0;
pgdat->kswapd_order = 0;
- pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
+ pgdat->kswapd_classzone_idx = 0;
}
/* we can use NODE_DATA(nid) from here */
On Thu 06-04-17 17:55:20, Mel Gorman wrote:
> On Thu, Apr 06, 2017 at 06:21:55PM +0200, Michal Hocko wrote:
> > > This was my first time using your git branch instead of applying the patches
> > > from this thread to v4.11-rc5 myself.
> >
> > OK, so this looks like another thing to resolve. I have seen this
> > warning as well but I didn't consider it relevant because I had to tweak
> > the code make the node go offline (removed check_and_unmap_cpu_on_node
> > from try_offline_node) so I thought it was a fallout from there.
> >
> > But let's have a look. hotadd_new_pgdat does for an existing pgdat
> > /* Reset the nr_zones, order and classzone_idx before reuse */
> > pgdat->nr_zones = 0;
> > pgdat->kswapd_order = 0;
> > pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> >
> > so free_area_init_node absolutely has to hit this warning. This is not
> > in the Linus tree because it is still in Andrew's mmotm coming from
> > http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-vmscan-prevent-kswapd-sleeping-prematurely-due-to-mismatched-classzone_idx.patch
> >
> > So yay, finally that doesn't come from me. Mel, I guess that either
> > hotadd_new_pgdat should keep its kswapd_classzone_idx = 0 or the warning
> > should be updated.
> >
>
> Actually, it's obvious very quickly when I started the fix that updating
> the warning would then trigger on normal boot. It's more appropriate to
> let a hotadd of a new pgdat defer the initialisation of that field to
> kswapd starting for the new node.
>
> Can you try this? It's build/boot tested only, no hotplug testing.
>
> ---8<---
> mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx -fix
>
> The patch "mm, vmscan: prevent kswapd sleeping prematurely due to mismatched
> classzone_idx" has different initial starting conditions when kswapd
> is asleep. kswapd initialises it properly when it starts but the patch
> initialises kswapd_classzone_idx early and trips on a warning in
> free_area_init_node. This patch leaves the kswapd_classzone_idx as zero
> and defers to kswapd to initialise it properly when it starts.
It will start during the online phase which is later than this physical
memory hotadd.
> This is a fix to the mmotm patch
> mm-vmscan-prevent-kswapd-sleeping-prematurely-due-to-mismatched-classzone_idx.patch
>
> Signed-off-by: Mel Gorman <[email protected]>
Yes, that is what I would expect. Feel free to add
Acked-by: Michal Hocko <[email protected]>
if this is routed as a separate patch. Although I expect Andrew will
fold it into the original patch.
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 2309a7fbec93..76d4745513ee 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1214,10 +1214,14 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
>
> arch_refresh_nodedata(nid, pgdat);
> } else {
> - /* Reset the nr_zones, order and classzone_idx before reuse */
> + /*
> + * Reset the nr_zones, order and classzone_idx before reuse.
> + * Note that kswapd will init kswapd_classzone_idx properly
> + * when it starts in the near future.
> + */
> pgdat->nr_zones = 0;
> pgdat->kswapd_order = 0;
> - pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> + pgdat->kswapd_classzone_idx = 0;
> }
>
> /* we can use NODE_DATA(nid) from here */
>
--
Michal Hocko
SUSE Labs
On Thu, Apr 06, 2017 at 07:12:42PM +0200, Michal Hocko wrote:
> > ---8<---
> > mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx -fix
> >
> > The patch "mm, vmscan: prevent kswapd sleeping prematurely due to mismatched
> > classzone_idx" has different initial starting conditions when kswapd
> > is asleep. kswapd initialises it properly when it starts but the patch
> > initialises kswapd_classzone_idx early and trips on a warning in
> > free_area_init_node. This patch leaves the kswapd_classzone_idx as zero
> > and defers to kswapd to initialise it properly when it starts.
>
> It will start during the online phase which is later than this physical
> memory hotadd.
>
Good, that's what appeared to be happening at least. It would be
somewhat insane if kswapd was running before zones were initialised.
> > This is a fix to the mmotm patch
> > mm-vmscan-prevent-kswapd-sleeping-prematurely-due-to-mismatched-classzone_idx.patch
> >
> > Signed-off-by: Mel Gorman <[email protected]>
>
> Yes, that is what I would expect. Feel free to add
> Acked-by: Michal Hocko <[email protected]>
>
> if this is routed as a separate patch. Although I expect Andrew will
> fold it into the original patch.
>
I added the ack anyway and resent the patch so it doesn't get lost in
the middle of a thread. Thanks.
--
Mel Gorman
SUSE Labs