Changelog:
v5 - v4
- Addressed more comments from Ingo Molnar and Michal Hocko.
- In the patch "optimize memory hotplug" we are now using
struct memory_block to hold node id as suggested by Michal.
- In the patch "don't read nid from struct page during hotplug"
renamed register_new_memory() to hotplug_memory_register() as
suggested by Ingo. Also, in this patch replaced the
description with the one provided by Michal.
- Fixed other spelling issues found by Ingo.
v3 - v4
Addressed comments from Ingo Molnar and from Michal Hocko
Split 4th patch into three patches
Instead of using section table to save node ids, saving node id in
the first page of every section.
v2 - v3
Fixed two issues found during testing
Addressed Kbuild warning reports
v1 - v2
Added struct page poisoning checking in order to verify that
struct pages are never accessed until initialized during memory
hotplug
This patchset:
- Improves hotplug performance by eliminating a number of
struct page traverses during memory hotplug.
- Fixes some issues with hotplugging, where boundaries
were not properly checked. And on x86 block size was not properly aligned
with end of memory
- Also, potentially improves boot performance by eliminating condition from
__init_single_page().
- Adds robustness by verifying that that struct pages are correctly
poisoned when flags are accessed.
The following experiments were performed on Xeon(R)
CPU E7-8895 v3 @ 2.60GHz with 1T RAM:
booting in qemu with 960G of memory, time to initialize struct pages:
no-kvm:
TRY1 TRY2
BEFORE: 39.433668 39.39705
AFTER: 36.903781 36.989329
with-kvm:
BEFORE: 10.977447 11.103164
AFTER: 10.929072 10.751885
Hotplug 896G memory:
no-kvm:
TRY1 TRY2
BEFORE: 848.740000 846.910000
AFTER: 783.070000 786.560000
with-kvm:
TRY1 TRY2
BEFORE: 34.410000 33.57
AFTER: 29.810000 29.580000
Pavel Tatashin (6):
mm/memory_hotplug: enforce block size aligned range check
x86/mm/memory_hotplug: determine block size based on the end of boot
memory
mm: add uninitialized struct page poisoning sanity checking
mm/memory_hotplug: optimize probe routine
mm/memory_hotplug: don't read nid from struct page during hotplug
mm/memory_hotplug: optimize memory hotplug
arch/x86/mm/init_64.c | 33 +++++++++++++++++++++++++++++----
drivers/base/memory.c | 40 ++++++++++++++++++++++------------------
drivers/base/node.c | 24 +++++++++++++++++-------
include/linux/memory.h | 3 ++-
include/linux/mm.h | 4 +++-
include/linux/node.h | 4 ++--
include/linux/page-flags.h | 22 +++++++++++++++++-----
mm/memblock.c | 2 +-
mm/memory_hotplug.c | 44 +++++++++++++++++---------------------------
mm/page_alloc.c | 28 ++++++++++------------------
mm/sparse.c | 8 +++++++-
11 files changed, 127 insertions(+), 85 deletions(-)
--
2.16.2
During memory hotplugging we traverse struct pages three times:
1. memset(0) in sparse_add_one_section()
2. loop in __add_section() to set do: set_page_node(page, nid); and
SetPageReserved(page);
3. loop in memmap_init_zone() to call __init_single_pfn()
This patch removes the first two loops, and leaves only loop 3. All struct
pages are initialized in one place, the same as it is done during boot.
The benefits:
- We improve memory hotplug performance because we are not evicting the
cache several times and also reduce loop branching overhead.
- Remove condition from hotpath in __init_single_pfn(), that was added in
order to fix the problem that was reported by Bharata in the above email
thread, thus also improve performance during normal boot.
- Make memory hotplug more similar to the boot memory initialization path
because we zero and initialize struct pages only in one function.
- Simplifies memory hotplug struct page initialization code, and thus
enables future improvements, such as multi-threading the initialization
of struct pages in order to improve hotplug performance even further on
larger machines.
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Ingo Molnar <[email protected]>
---
drivers/base/node.c | 2 ++
include/linux/memory.h | 1 +
mm/memory_hotplug.c | 27 ++++++++-------------------
mm/page_alloc.c | 28 ++++++++++------------------
mm/sparse.c | 8 +++++++-
5 files changed, 28 insertions(+), 38 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index d7cfc8d8a5c5..51de4af290ac 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -405,6 +405,8 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid,
if (!mem_blk)
return -EFAULT;
+
+ mem_blk->nid = nid;
if (!node_online(nid))
return 0;
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 9f8cd856ca1e..31ca3e28b0eb 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -33,6 +33,7 @@ struct memory_block {
void *hw; /* optional pointer to fw/hw data */
int (*phys_callback)(struct memory_block *);
struct device dev;
+ int nid; /* NID for this memory block */
};
int arch_get_memory_phys_device(unsigned long start_pfn);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 477e183a4ac7..6a9ba14e18ed 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -250,7 +250,6 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
struct vmem_altmap *altmap, bool want_memblock)
{
int ret;
- int i;
if (pfn_valid(phys_start_pfn))
return -EEXIST;
@@ -259,23 +258,6 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
if (ret < 0)
return ret;
- /*
- * Make all the pages reserved so that nobody will stumble over half
- * initialized state.
- * FIXME: We also have to associate it with a node because page_to_nid
- * relies on having page with the proper node.
- */
- for (i = 0; i < PAGES_PER_SECTION; i++) {
- unsigned long pfn = phys_start_pfn + i;
- struct page *page;
- if (!pfn_valid(pfn))
- continue;
-
- page = pfn_to_page(pfn);
- set_page_node(page, nid);
- SetPageReserved(page);
- }
-
if (!want_memblock)
return 0;
@@ -908,8 +890,15 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
int nid;
int ret;
struct memory_notify arg;
+ struct memory_block *mem;
+
+ /*
+ * We can't use pfn_to_nid() because nid might be stored in struct page
+ * which is not yet initialized. Instead, we find nid from memory block.
+ */
+ mem = find_memory_block(__pfn_to_section(pfn));
+ nid = mem->nid;
- nid = pfn_to_nid(pfn);
/* associate pfn range with the zone */
zone = move_pfn_range(online_type, nid, pfn, nr_pages);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cb416723538f..8bf3b9c215c1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1181,10 +1181,9 @@ static void free_one_page(struct zone *zone,
}
static void __meminit __init_single_page(struct page *page, unsigned long pfn,
- unsigned long zone, int nid, bool zero)
+ unsigned long zone, int nid)
{
- if (zero)
- mm_zero_struct_page(page);
+ mm_zero_struct_page(page);
set_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
@@ -1198,12 +1197,6 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
#endif
}
-static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone,
- int nid, bool zero)
-{
- return __init_single_page(pfn_to_page(pfn), pfn, zone, nid, zero);
-}
-
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
static void __meminit init_reserved_page(unsigned long pfn)
{
@@ -1222,7 +1215,7 @@ static void __meminit init_reserved_page(unsigned long pfn)
if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone))
break;
}
- __init_single_pfn(pfn, zid, nid, true);
+ __init_single_page(pfn_to_page(pfn), pfn, zid, nid);
}
#else
static inline void init_reserved_page(unsigned long pfn)
@@ -1539,7 +1532,7 @@ static unsigned long __init deferred_init_pages(int nid, int zid,
} else {
page++;
}
- __init_single_page(page, pfn, zid, nid, true);
+ __init_single_page(page, pfn, zid, nid);
nr_pages++;
}
return (nr_pages);
@@ -5332,6 +5325,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
pg_data_t *pgdat = NODE_DATA(nid);
unsigned long pfn;
unsigned long nr_initialised = 0;
+ struct page *page;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
struct memblock_region *r = NULL, *tmp;
#endif
@@ -5393,6 +5387,11 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
#endif
not_early:
+ page = pfn_to_page(pfn);
+ __init_single_page(page, pfn, zone, nid);
+ if (context == MEMMAP_HOTPLUG)
+ SetPageReserved(page);
+
/*
* Mark the block movable so that blocks are reserved for
* movable at startup. This will force kernel allocations
@@ -5409,15 +5408,8 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
* because this is done early in sparse_add_one_section
*/
if (!(pfn & (pageblock_nr_pages - 1))) {
- struct page *page = pfn_to_page(pfn);
-
- __init_single_page(page, pfn, zone, nid,
- context != MEMMAP_HOTPLUG);
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
cond_resched();
- } else {
- __init_single_pfn(pfn, zone, nid,
- context != MEMMAP_HOTPLUG);
}
}
}
diff --git a/mm/sparse.c b/mm/sparse.c
index 7af5e7a92528..775e1a4fd95e 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -816,7 +816,13 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat,
goto out;
}
- memset(memmap, 0, sizeof(struct page) * PAGES_PER_SECTION);
+#ifdef CONFIG_DEBUG_VM
+ /*
+ * Poison uninitialized struct pages in order to catch invalid flags
+ * combinations.
+ */
+ memset(memmap, PAGE_POISON_PATTERN, sizeof(struct page) * PAGES_PER_SECTION);
+#endif
section_mark_present(ms);
--
2.16.2
register_mem_sect_under_node is careful to check the node id of each
pfn in the memblock range to handle configurations with interleaving
nodes. This is not really needed for the memory hotplug because hotadded
ranges are bound to a single NUMA node. We simply cannot handle
interleaving NUMA nodes in the same memblock currently and there are no
signs that anybody would want anything like that in future. That would
require much more refactoring.
This is a preparatory patch for later patches.
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Ingo Molnar <[email protected]>
---
drivers/base/memory.c | 4 ++--
drivers/base/node.c | 22 +++++++++++++++-------
include/linux/memory.h | 2 +-
include/linux/node.h | 4 ++--
mm/memory_hotplug.c | 2 +-
5 files changed, 21 insertions(+), 13 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index deb3f029b451..79fcd2bae96b 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -712,7 +712,7 @@ static int add_memory_block(int base_section_nr)
* need an interface for the VM to add new memory regions,
* but without onlining it.
*/
-int register_new_memory(int nid, struct mem_section *section)
+int hotplug_memory_register(int nid, struct mem_section *section)
{
int ret = 0;
struct memory_block *mem;
@@ -731,7 +731,7 @@ int register_new_memory(int nid, struct mem_section *section)
}
if (mem->section_count == sections_per_block)
- ret = register_mem_sect_under_node(mem, nid);
+ ret = register_mem_sect_under_node(mem, nid, false);
out:
mutex_unlock(&mem_sysfs_mutex);
return ret;
diff --git a/drivers/base/node.c b/drivers/base/node.c
index ee090ab9171c..d7cfc8d8a5c5 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -397,7 +397,8 @@ static int __ref get_nid_for_pfn(unsigned long pfn)
}
/* register memory section under specified node if it spans that node */
-int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
+int register_mem_sect_under_node(struct memory_block *mem_blk, int nid,
+ bool check_nid)
{
int ret;
unsigned long pfn, sect_start_pfn, sect_end_pfn;
@@ -423,11 +424,18 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
continue;
}
- page_nid = get_nid_for_pfn(pfn);
- if (page_nid < 0)
- continue;
- if (page_nid != nid)
- continue;
+ /*
+ * We need to check if page belongs to nid only for the boot
+ * case, during hotplug we know that all pages in the memory
+ * block belong to the same node.
+ */
+ if (check_nid) {
+ page_nid = get_nid_for_pfn(pfn);
+ if (page_nid < 0)
+ continue;
+ if (page_nid != nid)
+ continue;
+ }
ret = sysfs_create_link_nowarn(&node_devices[nid]->dev.kobj,
&mem_blk->dev.kobj,
kobject_name(&mem_blk->dev.kobj));
@@ -502,7 +510,7 @@ int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages)
mem_blk = find_memory_block_hinted(mem_sect, mem_blk);
- ret = register_mem_sect_under_node(mem_blk, nid);
+ ret = register_mem_sect_under_node(mem_blk, nid, true);
if (!err)
err = ret;
diff --git a/include/linux/memory.h b/include/linux/memory.h
index f71e732c77b2..9f8cd856ca1e 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -109,7 +109,7 @@ extern int register_memory_notifier(struct notifier_block *nb);
extern void unregister_memory_notifier(struct notifier_block *nb);
extern int register_memory_isolate_notifier(struct notifier_block *nb);
extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
-extern int register_new_memory(int, struct mem_section *);
+int hotplug_memory_register(int nid, struct mem_section *section);
#ifdef CONFIG_MEMORY_HOTREMOVE
extern int unregister_memory_section(struct mem_section *);
#endif
diff --git a/include/linux/node.h b/include/linux/node.h
index 4ece0fee0ffc..41f171861dcc 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -67,7 +67,7 @@ extern void unregister_one_node(int nid);
extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
extern int register_mem_sect_under_node(struct memory_block *mem_blk,
- int nid);
+ int nid, bool check_nid);
extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
unsigned long phys_index);
@@ -97,7 +97,7 @@ static inline int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
return 0;
}
static inline int register_mem_sect_under_node(struct memory_block *mem_blk,
- int nid)
+ int nid, bool check_nid)
{
return 0;
}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 565048f496f7..477e183a4ac7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -279,7 +279,7 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
if (!want_memblock)
return 0;
- return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
+ return hotplug_memory_register(nid, __pfn_to_section(phys_start_pfn));
}
/*
--
2.16.2
Start qemu with the following arguments:
-m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G
Which boots machine with 64G and adds a device mem1 with 2G that can be
hotplugged later.
Also make sure that .config has the following options turned on:
CONFIG_MEMORY_HOTPLUG
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
CONFIG_ACPI_HOTPLUG_MEMORY
Using the qemu monitor hotplug the memory:
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
The operation will fail with the following trace:
WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205
pages_correctly_reserved+0xe6/0x110
Modules linked in:
CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:pages_correctly_reserved+0xe6/0x110
RSP: 0018:ffffbe5086b53d98 EFLAGS: 00010246
RAX: ffff9acb3fff3180 RBX: ffff9acaf7646038 RCX: 0000000000000800
RDX: ffff9acb3fff3000 RSI: 0000000000000218 RDI: 00000000010c0000
RBP: 0000000001080000 R08: ffffe81f83000040 R09: 0000000001100000
R10: ffff9acb3fff6000 R11: 0000000000000246 R12: 0000000000080000
R13: 0000000000000000 R14: ffffbe5086b53f08 R15: ffff9acaf7506f20
FS: 00007fd7f20da8c0(0000) GS:ffff9acb3fc00000(0000) knlGS:000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd7f20f2000 CR3: 0000000ff7ac2001 CR4: 00000000001606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
memory_subsys_online+0x44/0xa0
device_online+0x51/0x80
store_mem_state+0x5e/0xe0
kernfs_fop_write+0xfa/0x170
__vfs_write+0x2e/0x150
? __inode_security_revalidate+0x47/0x60
? selinux_file_permission+0xd5/0x130
? _cond_resched+0x10/0x20
vfs_write+0xa8/0x1a0
? find_vma+0x54/0x60
SyS_write+0x4d/0xb0
do_syscall_64+0x5d/0x110
entry_SYSCALL_64_after_hwframe+0x21/0x86
RIP: 0033:0x7fd7f0d3a840
RSP: 002b:00007fff5db77c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fd7f0d3a840
RDX: 0000000000000006 RSI: 00007fd7f20f2000 RDI: 0000000000000007
RBP: 00007fd7f20f2000 R08: 000055db265c4ab0 R09: 00007fd7f20da8c0
R10: 0000000000000006 R11: 0000000000000246 R12: 000055db265c49d0
R13: 0000000000000006 R14: 000055db265c5510 R15: 000000000000000b
Code: fe ff ff 07 00 77 24 48 89 f8 48 c1 e8 17 49 8b 14 c2 48 85 d2 74 14
40 0f b6 c6 49 81 c0 00 00 20 00 48 c1 e0 04 48 01 d0 75 93 <0f> ff 31 c0
c3 b8 01 00 00 00 c3 31 d2 48 c7 c7 b0 32 67 a6 31
---[ end trace 6203bc4f1a5d30e8 ]---
The problem is detected in: drivers/base/memory.c
static bool pages_correctly_reserved(unsigned long start_pfn)
if (WARN_ON_ONCE(!pfn_valid(pfn)))
This function loops through every section in the newly added memory block
and verifies that the first pfn in each section is valid, meaning section
exists, has mapping (struct page array), and is online.
The block size on x86 is usually 128M, but when machine is booted with
more than 64G of memory the block size is changed to 2G:
$ cat /sys/devices/system/memory/block_size_bytes
80000000
or
$ dmesg | grep "block size"
[ 0.086469] x86/mm: Memory block size: 2048MB
During memory hotplug, and hotremove we verify that the range is section
size aligned, but we actually must verify that it is block size aligned,
because that is the proper unit for hotplug operations. See:
Documentation/memory-hotplug.txt
So, when the start_pfn of newly added memory is not block size aligned, we
can get a memory block with partially populated sections.
In our case the start_pfn starts from the last_pfn (end of physical
memory).
$ dmesg | grep last_pfn
[ 0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000
0x1040000 == 65G, and so is not 2G aligned!
The fix is to enforce that memory that is hotplugged and hotremoved is
block size aligned.
With this fix, running the above sequence yield to the following result:
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
size 0x80000000
acpi PNP0C80:00: add_memory failed
acpi PNP0C80:00: acpi_memory_enable_device() error
acpi PNP0C80:00: Enumeration failure
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Ingo Molnar <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/memory_hotplug.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b2bd52ff7605..565048f496f7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1083,15 +1083,16 @@ int try_online_node(int nid)
static int check_hotplug_memory_range(u64 start, u64 size)
{
- u64 start_pfn = PFN_DOWN(start);
+ unsigned long block_sz = memory_block_size_bytes();
+ u64 block_nr_pages = block_sz >> PAGE_SHIFT;
u64 nr_pages = size >> PAGE_SHIFT;
+ u64 start_pfn = PFN_DOWN(start);
- /* Memory range must be aligned with section */
- if ((start_pfn & ~PAGE_SECTION_MASK) ||
- (nr_pages % PAGES_PER_SECTION) || (!nr_pages)) {
- pr_err("Section-unaligned hotplug range: start 0x%llx, size 0x%llx\n",
- (unsigned long long)start,
- (unsigned long long)size);
+ /* memory range must be block size aligned */
+ if (!nr_pages || !IS_ALIGNED(start_pfn, block_nr_pages) ||
+ !IS_ALIGNED(nr_pages, block_nr_pages)) {
+ pr_err("Block size [%#lx] unaligned hotplug range: start %#llx, size %#llx",
+ block_sz, start, size);
return -EINVAL;
}
--
2.16.2
When memory is hotplugged pages_correctly_reserved() is called to verify
that the added memory is present, this routine traverses through every
struct page and verifies that PageReserved() is set. This is a slow
operation especially if a large amount of memory is added.
Instead of checking every page, it is enough to simply check that the
section is present, has mapping (struct page array is allocated), and the
mapping is online.
In addition, we should not excpect that probe routine sets flags in struct
page, as the struct pages have not yet been initialized. The initialization
should be done in __init_single_page(), the same as during boot.
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Ingo Molnar <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
drivers/base/memory.c | 36 ++++++++++++++++++++----------------
1 file changed, 20 insertions(+), 16 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index fe4b24f05f6a..deb3f029b451 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -187,13 +187,14 @@ int memory_isolate_notify(unsigned long val, void *v)
}
/*
- * The probe routines leave the pages reserved, just as the bootmem code does.
- * Make sure they're still that way.
+ * The probe routines leave the pages uninitialized, just as the bootmem code
+ * does. Make sure we do not access them, but instead use only information from
+ * within sections.
*/
-static bool pages_correctly_reserved(unsigned long start_pfn)
+static bool pages_correctly_probed(unsigned long start_pfn)
{
- int i, j;
- struct page *page;
+ unsigned long section_nr = pfn_to_section_nr(start_pfn);
+ unsigned long section_nr_end = section_nr + sections_per_block;
unsigned long pfn = start_pfn;
/*
@@ -201,21 +202,24 @@ static bool pages_correctly_reserved(unsigned long start_pfn)
* SPARSEMEM_VMEMMAP. We lookup the page once per section
* and assume memmap is contiguous within each section
*/
- for (i = 0; i < sections_per_block; i++, pfn += PAGES_PER_SECTION) {
+ for (; section_nr < section_nr_end; section_nr++) {
if (WARN_ON_ONCE(!pfn_valid(pfn)))
return false;
- page = pfn_to_page(pfn);
-
- for (j = 0; j < PAGES_PER_SECTION; j++) {
- if (PageReserved(page + j))
- continue;
-
- printk(KERN_WARNING "section number %ld page number %d "
- "not reserved, was it already online?\n",
- pfn_to_section_nr(pfn), j);
+ if (!present_section_nr(section_nr)) {
+ pr_warn("section %ld pfn[%lx, %lx) not present",
+ section_nr, pfn, pfn + PAGES_PER_SECTION);
+ return false;
+ } else if (!valid_section_nr(section_nr)) {
+ pr_warn("section %ld pfn[%lx, %lx) no valid memmap",
+ section_nr, pfn, pfn + PAGES_PER_SECTION);
+ return false;
+ } else if (online_section_nr(section_nr)) {
+ pr_warn("section %ld pfn[%lx, %lx) is already online",
+ section_nr, pfn, pfn + PAGES_PER_SECTION);
return false;
}
+ pfn += PAGES_PER_SECTION;
}
return true;
@@ -237,7 +241,7 @@ memory_block_action(unsigned long phys_index, unsigned long action, int online_t
switch (action) {
case MEM_ONLINE:
- if (!pages_correctly_reserved(start_pfn))
+ if (!pages_correctly_probed(start_pfn))
return -EBUSY;
ret = online_pages(start_pfn, nr_pages, online_type);
--
2.16.2
During boot we poison struct page memory in order to ensure that no one is
accessing this memory until the struct pages are initialized in
__init_single_page().
This patch adds more scrutiny to this checking by making sure that flags
do not equal the poison pattern when they are accessed. The pattern is all
ones.
Since node id is also stored in struct page, and may be accessed quite
early, we add this enforcement into page_to_nid() function as well.
Note, this is applicable only when NODE_NOT_IN_PAGE_FLAGS=n
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Ingo Molnar <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
include/linux/mm.h | 4 +++-
include/linux/page-flags.h | 22 +++++++++++++++++-----
mm/memblock.c | 2 +-
3 files changed, 21 insertions(+), 7 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42adb1a..ad71136a6494 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -896,7 +896,9 @@ extern int page_to_nid(const struct page *page);
#else
static inline int page_to_nid(const struct page *page)
{
- return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
+ struct page *p = (struct page *)page;
+
+ return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
}
#endif
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 50c2b8786831..e34a27727b9a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -156,9 +156,18 @@ static __always_inline int PageCompound(struct page *page)
return test_bit(PG_head, &page->flags) || PageTail(page);
}
+#define PAGE_POISON_PATTERN -1l
+static inline int PagePoisoned(const struct page *page)
+{
+ return page->flags == PAGE_POISON_PATTERN;
+}
+
/*
* Page flags policies wrt compound pages
*
+ * PF_POISONED_CHECK
+ * check if this struct page poisoned/uninitialized
+ *
* PF_ANY:
* the page flag is relevant for small, head and tail pages.
*
@@ -176,17 +185,20 @@ static __always_inline int PageCompound(struct page *page)
* PF_NO_COMPOUND:
* the page flag is not relevant for compound pages.
*/
-#define PF_ANY(page, enforce) page
-#define PF_HEAD(page, enforce) compound_head(page)
+#define PF_POISONED_CHECK(page) ({ \
+ VM_BUG_ON_PGFLAGS(PagePoisoned(page), page); \
+ page; })
+#define PF_ANY(page, enforce) PF_POISONED_CHECK(page)
+#define PF_HEAD(page, enforce) PF_POISONED_CHECK(compound_head(page))
#define PF_ONLY_HEAD(page, enforce) ({ \
VM_BUG_ON_PGFLAGS(PageTail(page), page); \
- page;})
+ PF_POISONED_CHECK(page); })
#define PF_NO_TAIL(page, enforce) ({ \
VM_BUG_ON_PGFLAGS(enforce && PageTail(page), page); \
- compound_head(page);})
+ PF_POISONED_CHECK(compound_head(page)); })
#define PF_NO_COMPOUND(page, enforce) ({ \
VM_BUG_ON_PGFLAGS(enforce && PageCompound(page), page); \
- page;})
+ PF_POISONED_CHECK(page); })
/*
* Macros to create function definitions for page flags
diff --git a/mm/memblock.c b/mm/memblock.c
index 5a9ca2a1751b..d85c8754e0ce 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1373,7 +1373,7 @@ void * __init memblock_virt_alloc_try_nid_raw(
min_addr, max_addr, nid);
#ifdef CONFIG_DEBUG_VM
if (ptr && size > 0)
- memset(ptr, 0xff, size);
+ memset(ptr, PAGE_POISON_PATTERN, size);
#endif
return ptr;
}
--
2.16.2
Memory sections are combined into "memory block" chunks. These chunks are
the units upon which memory can be added and removed.
On x86 the new memory may be added after the end of the boot memory,
therefore, if block size does not align with end of boot memory, memory
hotplugging/hotremoving can be broken.
Currently, whenever machine is booted with more than 64G the block size
is unconditionally increased to 2G from the base 128M. This is done in
order to reduce number of memory device files in sysfs:
/sys/devices/system/memory/memoryXXX
We must use the largest allowed block size that aligns to the next
address to be able to hotplug the next block of memory.
So, when memory is larger or equal to 64G, we check the end address and
find the largest block size that is still power of two but smaller or
equal to 2G.
Before, the fix:
Run qemu with:
-m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
size 0x80000000
acpi PNP0C80:00: add_memory failed
acpi PNP0C80:00: acpi_memory_enable_device() error
acpi PNP0C80:00: Enumeration failure
With the fix memory is added successfully as the block size is set to 1G,
and therefore aligns with start address 0x1040000000.
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/init_64.c | 33 +++++++++++++++++++++++++++++----
1 file changed, 29 insertions(+), 4 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8b72923f1d35..9ac776de2252 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1326,14 +1326,39 @@ int kern_addr_valid(unsigned long addr)
return pfn_valid(pte_pfn(*pte));
}
+/*
+ * Block size is the minimum amount of memory which can be hotplugged or
+ * hotremoved. It must be power of two and must be equal or larger than
+ * MIN_MEMORY_BLOCK_SIZE.
+ */
+#define MAX_BLOCK_SIZE (2UL << 30)
+
+/* Amount of ram needed to start using large blocks */
+#define MEM_SIZE_FOR_LARGE_BLOCK (64UL << 30)
+
static unsigned long probe_memory_block_size(void)
{
- unsigned long bz = MIN_MEMORY_BLOCK_SIZE;
+ unsigned long boot_mem_end = max_pfn << PAGE_SHIFT;
+ unsigned long bz;
- /* if system is UV or has 64GB of RAM or more, use large blocks */
- if (is_uv_system() || ((max_pfn << PAGE_SHIFT) >= (64UL << 30)))
- bz = 2UL << 30; /* 2GB */
+ /* If this is UV system, always set 2G block size */
+ if (is_uv_system()) {
+ bz = MAX_BLOCK_SIZE;
+ goto done;
+ }
+ /* Use regular block if RAM is smaller than MEM_SIZE_FOR_LARGE_BLOCK */
+ if (boot_mem_end < MEM_SIZE_FOR_LARGE_BLOCK) {
+ bz = MIN_MEMORY_BLOCK_SIZE;
+ goto done;
+ }
+
+ /* Find the largest allowed block size that aligns to memory end */
+ for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) {
+ if (IS_ALIGNED(boot_mem_end, bz))
+ break;
+ }
+done:
pr_info("x86/mm: Memory block size: %ldMB\n", bz >> 20);
return bz;
--
2.16.2
Hi Michal,
Thank you for letting me know, its OK, the patches are in mm-tree, so
they are getting tested, and there is no rush.
Pavel
Hi,
I will not get to review this version before Mar 12 because I am moving
and will be without access to my email and I am pretty sure the time
will not work well for me either.
Sorry about that.
On Tue 27-02-18 22:03:02, Pavel Tatashin wrote:
> Changelog:
> v5 - v4
> - Addressed more comments from Ingo Molnar and Michal Hocko.
> - In the patch "optimize memory hotplug" we are now using
> struct memory_block to hold node id as suggested by Michal.
> - In the patch "don't read nid from struct page during hotplug"
> renamed register_new_memory() to hotplug_memory_register() as
> suggested by Ingo. Also, in this patch replaced the
> description with the one provided by Michal.
> - Fixed other spelling issues found by Ingo.
>
> v3 - v4
> Addressed comments from Ingo Molnar and from Michal Hocko
> Split 4th patch into three patches
> Instead of using section table to save node ids, saving node id in
> the first page of every section.
>
> v2 - v3
> Fixed two issues found during testing
> Addressed Kbuild warning reports
>
> v1 - v2
> Added struct page poisoning checking in order to verify that
> struct pages are never accessed until initialized during memory
> hotplug
>
> This patchset:
> - Improves hotplug performance by eliminating a number of
> struct page traverses during memory hotplug.
>
> - Fixes some issues with hotplugging, where boundaries
> were not properly checked. And on x86 block size was not properly aligned
> with end of memory
>
> - Also, potentially improves boot performance by eliminating condition from
> __init_single_page().
>
> - Adds robustness by verifying that that struct pages are correctly
> poisoned when flags are accessed.
>
> The following experiments were performed on Xeon(R)
> CPU E7-8895 v3 @ 2.60GHz with 1T RAM:
>
> booting in qemu with 960G of memory, time to initialize struct pages:
>
> no-kvm:
> TRY1 TRY2
> BEFORE: 39.433668 39.39705
> AFTER: 36.903781 36.989329
>
> with-kvm:
> BEFORE: 10.977447 11.103164
> AFTER: 10.929072 10.751885
>
> Hotplug 896G memory:
> no-kvm:
> TRY1 TRY2
> BEFORE: 848.740000 846.910000
> AFTER: 783.070000 786.560000
>
> with-kvm:
> TRY1 TRY2
> BEFORE: 34.410000 33.57
> AFTER: 29.810000 29.580000
>
> Pavel Tatashin (6):
> mm/memory_hotplug: enforce block size aligned range check
> x86/mm/memory_hotplug: determine block size based on the end of boot
> memory
> mm: add uninitialized struct page poisoning sanity checking
> mm/memory_hotplug: optimize probe routine
> mm/memory_hotplug: don't read nid from struct page during hotplug
> mm/memory_hotplug: optimize memory hotplug
>
> arch/x86/mm/init_64.c | 33 +++++++++++++++++++++++++++++----
> drivers/base/memory.c | 40 ++++++++++++++++++++++------------------
> drivers/base/node.c | 24 +++++++++++++++++-------
> include/linux/memory.h | 3 ++-
> include/linux/mm.h | 4 +++-
> include/linux/node.h | 4 ++--
> include/linux/page-flags.h | 22 +++++++++++++++++-----
> mm/memblock.c | 2 +-
> mm/memory_hotplug.c | 44 +++++++++++++++++---------------------------
> mm/page_alloc.c | 28 ++++++++++------------------
> mm/sparse.c | 8 +++++++-
> 11 files changed, 127 insertions(+), 85 deletions(-)
>
> --
> 2.16.2
>
--
Michal Hocko
SUSE Labs