This is based on v5.17-rc1 and:
* [PATCH v1] drivers/base/memory: add memory block to memory group
after registration succeeded [1]
* [PATCH RFC v1] drivers/base/node: consolidate node device subsystem
initialization in node_dev_init() [2]
--
I remember talking to Michal in the past about removing
test_pages_in_a_zone(), which we use for:
* verifying that a memory block we intend to offline is really only managed
by a single zone. We don't support offlining of memory blocks that are
managed by multiple zones (e.g., multiple nodes, DMA and DMA32)
* exposing that zone to user space via
/sys/devices/system/memory/memory*/valid_zones
Now that I identified some more cases where test_pages_in_a_zone() might
go wrong, and we received an UBSAN report (see patch #3), let's get rid of
this PFN walker.
So instead of detecting the zone at runtime with test_pages_in_a_zone() by
scanning the memmap, let's determine and remember for each memory block
if it's managed by a single zone. The stored zone can then be used for
the above two cases, avoiding a manual lookup using test_pages_in_a_zone().
This avoids eventually stumbling over uninitialized memmaps in corner
cases, especially when ZONE_DEVICE ranges partly fall into memory block
(that are responsible for managing System RAM).
Handling memory onlining is easy, because we online to exactly one zone.
Handling boot memory is more tricky, because we want to avoid scanning
all zones of all nodes to detect possible zones that overlap with the
physical memory region of interest. Fortunately, we already have code that
determines the applicable nodes for a memory block, to create sysfs links
-- we'll hook into that.
Patch #1 is a simple cleanup I had laying around for a longer time.
Patch #2 contains the main logic to remove test_pages_in_a_zone() and
further details.
In theory, we could do without [2], however, with a consolidated node
device initialization logic it's easier to verify that we actually have the
information we need (NIDs for memory blocks) around at the right time and
before anything else might rely on it.
[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]
Cc: Andrew Morton <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
David Hildenbrand (2):
drivers/base/node: rename link_mem_sections() to
register_memory_block_under_node()
drivers/base/memory: determine and store zone for single-zone memory
blocks
drivers/base/memory.c | 92 ++++++++++++++++++++++++++++++++--
drivers/base/node.c | 18 +++----
include/linux/memory.h | 13 +++++
include/linux/memory_hotplug.h | 6 +--
include/linux/node.h | 16 +++---
mm/memory_hotplug.c | 58 ++++-----------------
6 files changed, 127 insertions(+), 76 deletions(-)
--
2.34.1
Let's adjust the stale terminology, making it match
unregister_memory_block_under_nodes() and
do_register_memory_block_under_node(). We're dealing with memory block
devices, which span 1..X memory sections.
Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/base/node.c | 5 +++--
include/linux/node.h | 16 ++++++++--------
mm/memory_hotplug.c | 6 +++---
3 files changed, 14 insertions(+), 13 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index a133981a12fc..5d75341413ce 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -892,8 +892,9 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
}
-void link_mem_sections(int nid, unsigned long start_pfn, unsigned long end_pfn,
- enum meminit_context context)
+void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
+ unsigned long end_pfn,
+ enum meminit_context context)
{
walk_memory_blocks_func_t func;
diff --git a/include/linux/node.h b/include/linux/node.h
index f3be6ccfebed..9ec680dd607f 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -99,13 +99,13 @@ extern struct node *node_devices[];
typedef void (*node_registration_func_t)(struct node *);
#if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_NUMA)
-void link_mem_sections(int nid, unsigned long start_pfn,
- unsigned long end_pfn,
- enum meminit_context context);
+void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
+ unsigned long end_pfn,
+ enum meminit_context context);
#else
-static inline void link_mem_sections(int nid, unsigned long start_pfn,
- unsigned long end_pfn,
- enum meminit_context context)
+static inline void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
+ unsigned long end_pfn,
+ enum meminit_context context)
{
}
#endif
@@ -129,8 +129,8 @@ static inline int register_one_node(int nid)
error = __register_one_node(nid);
if (error)
return error;
- /* link memory sections under this node */
- link_mem_sections(nid, start_pfn, end_pfn, MEMINIT_EARLY);
+ register_memory_blocks_under_node(nid, start_pfn, end_pfn,
+ MEMINIT_EARLY);
}
return error;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2a9627dc784c..69af90e9f507 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1421,9 +1421,9 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
BUG_ON(ret);
}
- /* link memory sections under this node.*/
- link_mem_sections(nid, PFN_DOWN(start), PFN_UP(start + size - 1),
- MEMINIT_HOTPLUG);
+ register_memory_blocks_under_node(nid, PFN_DOWN(start),
+ PFN_UP(start + size - 1),
+ MEMINIT_HOTPLUG);
/* create new memmap entry */
if (!strcmp(res->name, "System RAM"))
--
2.34.1
test_pages_in_a_zone() is just another nasty PFN walker that can easily
stumble over ZONE_DEVICE memory ranges falling into the same memory block
as ordinary system RAM: the memmap of parts of these ranges might possibly
be uninitialized. In fact, we observed (on an older kernel) with UBSAN:
[ 7691.855626] UBSAN: Undefined behaviour in ./include/linux/mm.h:1133:50
[ 7691.862155] index 7 is out of range for type 'zone [5]'
[ 7691.867393] CPU: 121 PID: 35603 Comm: read_all Kdump: loaded Tainted: [...]
[ 7691.879990] Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.12.2 11/15/2019
[ 7691.887643] Call Trace:
[ 7691.890107] dump_stack+0x9a/0xf0
[ 7691.893438] ubsan_epilogue+0x9/0x7a
[ 7691.897025] __ubsan_handle_out_of_bounds+0x13a/0x181
[ 7691.902086] ? __ubsan_handle_shift_out_of_bounds+0x289/0x289
[ 7691.907841] ? sched_clock_cpu+0x18/0x1e0
[ 7691.911867] ? __lock_acquire+0x610/0x38d0
[ 7691.915979] test_pages_in_a_zone+0x3c4/0x500
[ 7691.920357] show_valid_zones+0x1fa/0x380
[ 7691.924375] ? print_allowed_zone+0x80/0x80
[ 7691.928571] ? __lock_is_held+0xb4/0x140
[ 7691.932509] ? __lock_is_held+0xb4/0x140
[ 7691.936447] ? dev_attr_store+0x70/0x70
[ 7691.940296] dev_attr_show+0x43/0xb0
[ 7691.943884] ? memset+0x1f/0x40
[ 7691.947042] sysfs_kf_seq_show+0x1c5/0x440
[ 7691.951153] seq_read+0x49d/0x1190
[ 7691.954574] ? seq_escape+0x1f0/0x1f0
[ 7691.958249] ? fsnotify_first_mark+0x150/0x150
[ 7691.962713] vfs_read+0xff/0x300
[ 7691.965952] ksys_read+0xb8/0x170
[ 7691.969279] ? kernel_write+0x130/0x130
[ 7691.973126] ? entry_SYSCALL_64_after_hwframe+0x7a/0xdf
[ 7691.978365] ? do_syscall_64+0x22/0x4b0
[ 7691.982212] do_syscall_64+0xa5/0x4b0
[ 7691.985887] entry_SYSCALL_64_after_hwframe+0x6a/0xdf
[ 7691.990947] RIP: 0033:0x7f01f4439b52
We seem to stumble over a memmap that contains a garbage zone id. While
we could try inserting pfn_to_online_page() calles, it will just make
memory offlining slower, because we use test_pages_in_a_zone() to make
sure we're offlining pages that all belong to the same zone.
Let's just get rid of this PFN walker and determine the single zone
of a memory block -- if any -- for early memory blocks during boot. For
memory onlining, we know the single zone already. Let's avoid any
additional memmap scanning and just rely on the zone information
available during boot.
For memory hot(un)plug, we only really care about memory blocks that:
* span a single zone (and, thereby, a single node)
* are completely System RAM (IOW, no holes, no ZONE_DEVICE)
If one of these conditions is not met, we reject memory offlining.
Hotplugged memory blocks (starting out offline), always meet both
conditions.
There are three scenarios to handle:
(1) Memory hot(un)plug
A memory block with zone == NULL cannot be offlined, corresponding to
our previous test_pages_in_a_zone() check.
After successful memory onlining/offlining, we simply set the zone
accordingly.
* Memory onlining: set the zone we just used for onlining
* Memory offlining: set zone = NULL
So a hotplugged memory block starts with zone = NULL. Once memory
onlining is done, we set the proper zone.
(2) Boot memory with !CONFIG_NUMA
We know that there is just a single pgdat, so we simply scan all zones
of that pgdat for an intersection with our memory block PFN range when
adding the memory block. If more than one zone intersects (e.g., DMA and
DMA32 on x86 for the first memory block) we set zone = NULL and
consequently mimic what test_pages_in_a_zone() used to do.
(3) Boot memory with CONFIG_NUMA
At the point in time we create the memory block devices during boot, we
don't know yet which nodes *actually* span a memory block. While we could
scan all zones of all nodes for intersections, overlapping nodes complicate
the situation and scanning all nodes is possibly expensive. But that
problem has already been solved by the code that sets the node of a memory
block and creates the link in the sysfs --
do_register_memory_block_under_node().
So, we hook into the code that sets the node id for a memory block. If
we already have a different node id set for the memory block, we know
that multiple nodes *actually* have PFNs falling into our memory block:
we set zone = NILL and consequently mimic what test_pages_in_a_zone() used
to do. If there is no node id set, we do the same as (2) for the given
node.
Note that the call order in driver_init() is:
-> memory_dev_init(): create memory block devices
-> node_dev_init(): link memory block devices to the node and set the
node id
So in summary, we detect if there is a single zone responsible for this
memory block and we consequently store the zone in that case in the
memory block, updating it during memory onlining/offlining.
Reported-by: Rafael Parra <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/base/memory.c | 92 ++++++++++++++++++++++++++++++++--
drivers/base/node.c | 13 ++---
include/linux/memory.h | 13 +++++
include/linux/memory_hotplug.h | 6 +--
mm/memory_hotplug.c | 52 +++----------------
5 files changed, 113 insertions(+), 63 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 60c38f9cf1a7..8652368d179d 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -215,6 +215,7 @@ static int memory_block_online(struct memory_block *mem)
adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
nr_vmemmap_pages);
+ mem->zone = zone;
return ret;
}
@@ -225,6 +226,9 @@ static int memory_block_offline(struct memory_block *mem)
unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
int ret;
+ if (!mem->zone)
+ return -EBUSY;
+
/*
* Unaccount before offlining, such that unpopulated zone and kthreads
* can properly be torn down in offline_pages().
@@ -234,7 +238,7 @@ static int memory_block_offline(struct memory_block *mem)
-nr_vmemmap_pages);
ret = offline_pages(start_pfn + nr_vmemmap_pages,
- nr_pages - nr_vmemmap_pages, mem->group);
+ nr_pages - nr_vmemmap_pages, mem->zone, mem->group);
if (ret) {
/* offline_pages() failed. Account back. */
if (nr_vmemmap_pages)
@@ -246,6 +250,7 @@ static int memory_block_offline(struct memory_block *mem)
if (nr_vmemmap_pages)
mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
+ mem->zone = NULL;
return ret;
}
@@ -411,11 +416,10 @@ static ssize_t valid_zones_show(struct device *dev,
*/
if (mem->state == MEM_ONLINE) {
/*
- * The block contains more than one zone can not be offlined.
- * This can happen e.g. for ZONE_DMA and ZONE_DMA32
+ * If !mem->zone, the memory block spans multiple zones and
+ * cannot get offlined.
*/
- default_zone = test_pages_in_a_zone(start_pfn,
- start_pfn + nr_pages);
+ default_zone = mem->zone;
if (!default_zone)
return sysfs_emit(buf, "%s\n", "none");
len += sysfs_emit_at(buf, len, "%s", default_zone->name);
@@ -641,6 +645,73 @@ int register_memory(struct memory_block *memory)
return ret;
}
+static struct zone *early_node_zone_for_memory_block(struct memory_block *mem,
+ int nid)
+{
+ const unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
+ const unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+ struct zone *zone, *matching_zone = NULL;
+ pg_data_t *pgdat = NODE_DATA(nid);
+ int i;
+
+ /*
+ * This logic only works for early memory, when the applicable zones
+ * already span the memory block. We don't expect overlapping zones on
+ * a single node for early memory. So if we're told that some pfns
+ * of a node fall into this memory block, we can assume that all node
+ * zones that intersect with the memory block are actually applicable.
+ * No need to look at the memmap.
+ *
+ * Memory hot(un)plug updates the zone manually after memory
+ * onlinig/offlining succeeded.
+ */
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ zone = pgdat->node_zones + i;
+ if (!populated_zone(zone))
+ continue;
+ if (!zone_intersects(zone, start_pfn, nr_pages))
+ continue;
+ if (!matching_zone) {
+ matching_zone = zone;
+ continue;
+ }
+ /* Spans multiple zones ... */
+ matching_zone = NULL;
+ break;
+ }
+ return matching_zone;
+}
+
+#ifdef CONFIG_NUMA
+void memory_block_set_nid(struct memory_block *mem, int nid,
+ enum meminit_context context)
+{
+ if (context == MEMINIT_EARLY && mem->nid != nid) {
+ /*
+ * For early memory we have to determine the zone when setting
+ * the node id and handle multiple nodes spanning a single
+ * memory block by indicate via zone == NULL that we're not
+ * dealing with a single zone. So if we're setting the node id
+ * the first time, determine if there is a single zone. If we're
+ * setting the node id a second time to a differen node,
+ * invalidate the single detected zone.
+ */
+ if (mem->nid == NUMA_NO_NODE)
+ mem->zone = early_node_zone_for_memory_block(mem, nid);
+ else
+ mem->zone = NULL;
+ }
+
+ /*
+ * If this memory block spans multiple nodes, we only indicate
+ * the last processed node. If we span multiple nodes (not applicable
+ * to hotplugged memory), zone == NULL will prohibit memory offlining
+ * and consequently unplug.
+ */
+ mem->nid = nid;
+}
+#endif
+
static int init_memory_block(unsigned long block_id, unsigned long state,
unsigned long nr_vmemmap_pages,
struct memory_group *group)
@@ -663,6 +734,17 @@ static int init_memory_block(unsigned long block_id, unsigned long state,
mem->nr_vmemmap_pages = nr_vmemmap_pages;
INIT_LIST_HEAD(&mem->group_next);
+#ifndef CONFIG_NUMA
+ if (state == MEM_ONLINE)
+ /*
+ * MEM_ONLINE at this point imples early memory. With NUMA,
+ * we'll determine the zone when setting the node id via
+ * memory_block_set_nid(). Memory hotplug updated the zone
+ * manually when memory onlining/offlining succeeds.
+ */
+ mem->zone = early_node_zone_for_memory_block(mem, NUMA_NO_NODE);
+#endif /* CONFIG_NUMA */
+
ret = register_memory(mem);
if (ret)
return ret;
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5d75341413ce..4ba3990d8f7f 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -796,15 +796,12 @@ static int __ref get_nid_for_pfn(unsigned long pfn)
}
static void do_register_memory_block_under_node(int nid,
- struct memory_block *mem_blk)
+ struct memory_block *mem_blk,
+ enum meminit_context context)
{
int ret;
- /*
- * If this memory block spans multiple nodes, we only indicate
- * the last processed node.
- */
- mem_blk->nid = nid;
+ memory_block_set_nid(mem_blk, nid, context);
ret = sysfs_create_link_nowarn(&node_devices[nid]->dev.kobj,
&mem_blk->dev.kobj,
@@ -857,7 +854,7 @@ static int register_mem_block_under_node_early(struct memory_block *mem_blk,
if (page_nid != nid)
continue;
- do_register_memory_block_under_node(nid, mem_blk);
+ do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
return 0;
}
/* mem section does not span the specified node */
@@ -873,7 +870,7 @@ static int register_mem_block_under_node_hotplug(struct memory_block *mem_blk,
{
int nid = *(int *)arg;
- do_register_memory_block_under_node(nid, mem_blk);
+ do_register_memory_block_under_node(nid, mem_blk, MEMINIT_HOTPLUG);
return 0;
}
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 88eb587b5143..13b7d2582af0 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -70,6 +70,13 @@ struct memory_block {
unsigned long state; /* serialized by the dev->lock */
int online_type; /* for passing data to online routine */
int nid; /* NID for this memory block */
+ /*
+ * The single zone of this memory block if all PFNs of this memory block
+ * that are System RAM (not a memory hole, not ZONE_DEVICE ranges) are
+ * managed by a single zone. NULL if multiple zones (including nodes)
+ * apply.
+ */
+ struct zone *zone;
struct device dev;
/*
* Number of vmemmap pages. These pages
@@ -161,6 +168,12 @@ int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
})
#define register_hotmemory_notifier(nb) register_memory_notifier(nb)
#define unregister_hotmemory_notifier(nb) unregister_memory_notifier(nb)
+
+#ifdef CONFIG_NUMA
+/* Only to be used by node initialization code. */
+void memory_block_set_nid(struct memory_block *mem, int nid,
+ enum meminit_context context);
+#endif /* CONFIG_NUMA */
#endif /* CONFIG_MEMORY_HOTPLUG */
/*
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index be48e003a518..05b2cdf6c935 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -107,8 +107,6 @@ extern int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
extern void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages);
extern int online_pages(unsigned long pfn, unsigned long nr_pages,
struct zone *zone, struct memory_group *group);
-extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
- unsigned long end_pfn);
extern void __offline_isolated_pages(unsigned long start_pfn,
unsigned long end_pfn);
@@ -297,7 +295,7 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
extern void try_offline_node(int nid);
extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
- struct memory_group *group);
+ struct zone *zone, struct memory_group *group);
extern int remove_memory(u64 start, u64 size);
extern void __remove_memory(u64 start, u64 size);
extern int offline_and_remove_memory(u64 start, u64 size);
@@ -306,7 +304,7 @@ extern int offline_and_remove_memory(u64 start, u64 size);
static inline void try_offline_node(int nid) {}
static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
- struct memory_group *group)
+ struct zone *zone, struct memory_group *group)
{
return -EINVAL;
}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 69af90e9f507..5a185313dd0d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1589,38 +1589,6 @@ bool mhp_range_allowed(u64 start, u64 size, bool need_mapping)
}
#ifdef CONFIG_MEMORY_HOTREMOVE
-/*
- * Confirm all pages in a range [start, end) belong to the same zone (skipping
- * memory holes). When true, return the zone.
- */
-struct zone *test_pages_in_a_zone(unsigned long start_pfn,
- unsigned long end_pfn)
-{
- unsigned long pfn, sec_end_pfn;
- struct zone *zone = NULL;
- struct page *page;
-
- for (pfn = start_pfn, sec_end_pfn = SECTION_ALIGN_UP(start_pfn + 1);
- pfn < end_pfn;
- pfn = sec_end_pfn, sec_end_pfn += PAGES_PER_SECTION) {
- /* Make sure the memory section is present first */
- if (!present_section_nr(pfn_to_section_nr(pfn)))
- continue;
- for (; pfn < sec_end_pfn && pfn < end_pfn;
- pfn += MAX_ORDER_NR_PAGES) {
- /* Check if we got outside of the zone */
- if (zone && !zone_spans_pfn(zone, pfn))
- return NULL;
- page = pfn_to_page(pfn);
- if (zone && page_zone(page) != zone)
- return NULL;
- zone = page_zone(page);
- }
- }
-
- return zone;
-}
-
/*
* Scan pfn range [start,end) to find movable/migratable pages (LRU pages,
* non-lru movable pages and hugepages). Will skip over most unmovable
@@ -1844,15 +1812,15 @@ static int count_system_ram_pages_cb(unsigned long start_pfn,
}
int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
- struct memory_group *group)
+ struct zone *zone, struct memory_group *group)
{
const unsigned long end_pfn = start_pfn + nr_pages;
unsigned long pfn, system_ram_pages = 0;
+ const int node = zone_to_nid(zone);
unsigned long flags;
- struct zone *zone;
struct memory_notify arg;
- int ret, node;
char *reason;
+ int ret;
/*
* {on,off}lining is constrained to full memory sections (or more
@@ -1863,7 +1831,9 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
*/
if (WARN_ON_ONCE(!nr_pages ||
!IS_ALIGNED(start_pfn, pageblock_nr_pages) ||
- !IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION)))
+ !IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION) ||
+ page_zone(pfn_to_page(start_pfn)) != zone ||
+ page_zone(pfn_to_page(end_pfn - 1)) != zone))
return -EINVAL;
mem_hotplug_begin();
@@ -1884,16 +1854,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
goto failed_removal;
}
- /* This makes hotplug much easier...and readable.
- we assume this for now. .*/
- zone = test_pages_in_a_zone(start_pfn, end_pfn);
- if (!zone) {
- ret = -EINVAL;
- reason = "multizone range";
- goto failed_removal;
- }
- node = zone_to_nid(zone);
-
/*
* Disable pcplists so that page isolation cannot race with freeing
* in a way that pages from isolated pageblock are left on pcplists.
--
2.34.1
On Fri, Jan 28, 2022 at 04:26:19PM +0100, David Hildenbrand wrote:
> Let's adjust the stale terminology, making it match
> unregister_memory_block_under_nodes() and
> do_register_memory_block_under_node(). We're dealing with memory block
> devices, which span 1..X memory sections.
>
> Signed-off-by: David Hildenbrand <[email protected]>
Makes sense
Acked-by: Oscar Salvador <[email protected]>
--
Oscar Salvador
SUSE Labs
> !IS_ALIGNED(start_pfn, pageblock_nr_pages) ||
> - !IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION)))
> + !IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION) ||
> + page_zone(pfn_to_page(start_pfn)) != zone ||
> + page_zone(pfn_to_page(end_pfn - 1)) != zone))
I just realized that we have to do these sanity checks after our check
for memory holes.
--
Thanks,
David / dhildenb
On Fri, Jan 28, 2022 at 04:26:20PM +0100, David Hildenbrand wrote:
> For memory hot(un)plug, we only really care about memory blocks that:
> * span a single zone (and, thereby, a single node)
> * are completely System RAM (IOW, no holes, no ZONE_DEVICE)
> If one of these conditions is not met, we reject memory offlining.
> Hotplugged memory blocks (starting out offline), always meet both
> conditions.
This has been always hard for me to follow, so bear with me.
I remember we changed the memory-hotplug policy, not long ago, wrt.
what we can online/offline so we could get rid of certain assumptions like
"there are no holes in this memblock, so it can go" etc.
AFAIR, we can only offline if the memory
1) belongs to a single node ("which is always the case for
hotplugged-memory, boot memory is trickier")
2) does not have any holes
3) spans a single zone
These are the only requeriments we have atm, right?
By default, hotplugged memory already complies with they all three,
only in case of ZONE_DEVICE stuff we might violate 2) and 3).
> There are three scenarios to handle:
...
...
> @@ -225,6 +226,9 @@ static int memory_block_offline(struct memory_block *mem)
> unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
> int ret;
>
> + if (!mem->zone)
> + return -EBUSY;
Should not we return -EINVAL? I mean, -EBUSY reads like this might be a
temporary error which might get fixed later on, but that isn't the case.
> @@ -234,7 +238,7 @@ static int memory_block_offline(struct memory_block *mem)
> -nr_vmemmap_pages);
>
> ret = offline_pages(start_pfn + nr_vmemmap_pages,
> - nr_pages - nr_vmemmap_pages, mem->group);
> + nr_pages - nr_vmemmap_pages, mem->zone, mem->group);
Why not passing the node as well?
> +static struct zone *early_node_zone_for_memory_block(struct memory_block *mem,
> + int nid)
> +{
> + const unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
> + const unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> + struct zone *zone, *matching_zone = NULL;
> + pg_data_t *pgdat = NODE_DATA(nid);
I was about to complain because in init_memory_block() you call
early_node_zone_for_memory_block() with nid == NUMA_NODE_NODE, but then
I saw that NODE_DATA on !CONFIG_NUMA falls to contig_page_data.
So, I guess we cannot really reach this on CONFIG_NUMA machines with nid
being NUMA_NO_NODE, right? (do we want to add a check just in case?)
> +#ifdef CONFIG_NUMA
> +void memory_block_set_nid(struct memory_block *mem, int nid,
> + enum meminit_context context)
But we also set the zone? (Only for boot memory)
> @@ -663,6 +734,17 @@ static int init_memory_block(unsigned long block_id, unsigned long state,
> mem->nr_vmemmap_pages = nr_vmemmap_pages;
> INIT_LIST_HEAD(&mem->group_next);
>
> +#ifndef CONFIG_NUMA
> + if (state == MEM_ONLINE)
> + /*
> + * MEM_ONLINE at this point imples early memory. With NUMA,
implies
--
Oscar Salvador
SUSE Labs
On 31.01.22 12:29, Oscar Salvador wrote:
> On Fri, Jan 28, 2022 at 04:26:20PM +0100, David Hildenbrand wrote:
>> For memory hot(un)plug, we only really care about memory blocks that:
>> * span a single zone (and, thereby, a single node)
>> * are completely System RAM (IOW, no holes, no ZONE_DEVICE)
>> If one of these conditions is not met, we reject memory offlining.
>> Hotplugged memory blocks (starting out offline), always meet both
>> conditions.
Thanks for the review Oscar!
>
> This has been always hard for me to follow, so bear with me.
>
> I remember we changed the memory-hotplug policy, not long ago, wrt.
> what we can online/offline so we could get rid of certain assumptions like
> "there are no holes in this memblock, so it can go" etc.
Yes, end of 2019 via c5e79ef561b0 ("mm/memory_hotplug.c: don't allow to
online/offline memory blocks with holes").
>
> AFAIR, we can only offline if the memory
>
> 1) belongs to a single node ("which is always the case for
> hotplugged-memory, boot memory is trickier")
> 2) does not have any holes
> 3) spans a single zone
>
> These are the only requeriments we have atm, right?
The most prominent core requirements, yes, leaving memory notifiers out
of the picture.
3) implies 1) as zones are per-node.
>
> By default, hotplugged memory already complies with they all three,
> only in case of ZONE_DEVICE stuff we might violate 2) and 3).
>
>> There are three scenarios to handle:
> ...
> ...
>
>> @@ -225,6 +226,9 @@ static int memory_block_offline(struct memory_block *mem)
>> unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
>> int ret;
>>
>> + if (!mem->zone)
>> + return -EBUSY;
>
> Should not we return -EINVAL? I mean, -EBUSY reads like this might be a
> temporary error which might get fixed later on, but that isn't the case.
>
>> @@ -234,7 +238,7 @@ static int memory_block_offline(struct memory_block *mem)
>> -nr_vmemmap_pages);
>>
>> ret = offline_pages(start_pfn + nr_vmemmap_pages,
>> - nr_pages - nr_vmemmap_pages, mem->group);
>> + nr_pages - nr_vmemmap_pages, mem->zone, mem->group);
>
> Why not passing the node as well?
The zone implies the node, and the prototype now matches the one of
online_pages(). So if we'd ever want to change that we should do it for
both functions, but I don't necessarily see the need for it.
>
>> +static struct zone *early_node_zone_for_memory_block(struct memory_block *mem,
>> + int nid)
>> +{
>> + const unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
>> + const unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
>> + struct zone *zone, *matching_zone = NULL;
>> + pg_data_t *pgdat = NODE_DATA(nid);
>
> I was about to complain because in init_memory_block() you call
> early_node_zone_for_memory_block() with nid == NUMA_NODE_NODE, but then
> I saw that NODE_DATA on !CONFIG_NUMA falls to contig_page_data.
> So, I guess we cannot really reach this on CONFIG_NUMA machines with nid
> being NUMA_NO_NODE, right? (do we want to add a check just in case?)
>
Yes, on CONFIG_NUMA this is only called via memory_block_set_nid().
memory_block_set_nid() is only available with CONFIG_NUMA and calling
memory_block_set_nid() with NUMA_NO_NODE would be a BUG.
(before sending this out I even had a BUG_ON() in memory_block_set_nid()
to verify that, but I removed it because BUG_ON's are frowned-upon.)
--
Thanks,
David / dhildenb
[...] having accidentally skipped two comments.
>
>> There are three scenarios to handle:
> ...
> ...
>
>> @@ -225,6 +226,9 @@ static int memory_block_offline(struct memory_block *mem)
>> unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
>> int ret;
>>
>> + if (!mem->zone)
>> + return -EBUSY;
>
> Should not we return -EINVAL? I mean, -EBUSY reads like this might be a
> temporary error which might get fixed later on, but that isn't the case.
We should, and I could have sworn I fixed that up last-minute.
>> +static struct zone *early_node_zone_for_memory_block(struct memory_block *mem,
>> + int nid)
>> +{
>> + const unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
>> + const unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
>> + struct zone *zone, *matching_zone = NULL;
>> + pg_data_t *pgdat = NODE_DATA(nid);
>
> I was about to complain because in init_memory_block() you call
> early_node_zone_for_memory_block() with nid == NUMA_NODE_NODE, but then
> I saw that NODE_DATA on !CONFIG_NUMA falls to contig_page_data.
> So, I guess we cannot really reach this on CONFIG_NUMA machines with nid
> being NUMA_NO_NODE, right? (do we want to add a check just in case?)
>
>> +#ifdef CONFIG_NUMA
>> +void memory_block_set_nid(struct memory_block *mem, int nid,
>> + enum meminit_context context)
>
> But we also set the zone? (Only for boot memory)
Yes, it's derived from the node internally, though, and not supplied
explicitly. Renaming it could be misleading IMHO.
--
Thanks,
David / dhildenb