2020-12-23 08:10:31

by Baoquan He

[permalink] [raw]
Subject: [PATCH v3 0/1] mm: memmap defer init dosn't work as expected

Post the regression fix in a standalone patch as Andrew suggested for
-stable branch better back porting. This is rebased on the latest
master branch of mainline kenrel, surely there's almost no change
comparing with v2.
https://lore.kernel.org/linux-mm/[email protected]/

Tested on a system with 24G ram as below, adding 'memmap=128M!0x500000000'
to split the one ram region into two regions in numa node1 to simulate
the scenario of VMware.

[ +0.000000] BIOS-provided physical RAM map:
[ +0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009bfff] usable
[ +0.000000] BIOS-e820: [mem 0x000000000009c000-0x000000000009ffff] reserved
[ +0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[ +0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000006cdcefff] usable
[ +0.000000] BIOS-e820: [mem 0x000000006cdcf000-0x000000006efcefff] reserved
[ +0.000000] BIOS-e820: [mem 0x000000006efcf000-0x000000006fdfefff] ACPI NVS
[ +0.000000] BIOS-e820: [mem 0x000000006fdff000-0x000000006fffefff] ACPI data
[ +0.000000] BIOS-e820: [mem 0x000000006ffff000-0x000000006fffffff] usable
[ +0.000000] BIOS-e820: [mem 0x0000000070000000-0x000000008fffffff] reserved
[ +0.000000] BIOS-e820: [mem 0x00000000e0000000-0x00000000ffffffff] reserved
[ +0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000067f1fffff] usable
[ +0.000000] BIOS-e820: [mem 0x000000067f200000-0x000000067fffffff] reserved

Test passed as below. As you can see, with patch applied, memmap init
will cost much less time on numa node 1:

Without the patch:
[ 0.065029] Early memory node ranges
[ 0.065030] node 0: [mem 0x0000000000001000-0x000000000009bfff]
[ 0.065032] node 0: [mem 0x0000000000100000-0x000000006cdcefff]
[ 0.065034] node 0: [mem 0x000000006ffff000-0x000000006fffffff]
[ 0.065036] node 0: [mem 0x0000000100000000-0x000000027fffffff]
[ 0.065038] node 1: [mem 0x0000000280000000-0x00000004ffffffff]
[ 0.065040] node 1: [mem 0x0000000508000000-0x000000067f1fffff]
[ 0.065185] Zeroed struct page in unavailable ranges: 16533 pages
[ 0.065187] Initmem setup node 0 [mem 0x0000000000001000-0x000000027fffffff]
[ 0.069616] Initmem setup node 1 [mem 0x0000000280000000-0x000000067f1fffff]
[ 0.096298] ACPI: PM-Timer IO Port: 0x408

With the patch applied:
[ 0.065029] Early memory node ranges
[ 0.065030] node 0: [mem 0x0000000000001000-0x000000000009bfff]
[ 0.065032] node 0: [mem 0x0000000000100000-0x000000006cdcefff]
[ 0.065034] node 0: [mem 0x000000006ffff000-0x000000006fffffff]
[ 0.065036] node 0: [mem 0x0000000100000000-0x000000027fffffff]
[ 0.065038] node 1: [mem 0x0000000280000000-0x00000004ffffffff]
[ 0.065041] node 1: [mem 0x0000000508000000-0x000000067f1fffff]
[ 0.065187] Zeroed struct page in unavailable ranges: 16533 pages
[ 0.065189] Initmem setup node 0 [mem 0x0000000000001000-0x000000027fffffff]
[ 0.069572] Initmem setup node 1 [mem 0x0000000280000000-0x000000067f1fffff]
[ 0.070161] ACPI: PM-Timer IO Port: 0x408


Baoquan He (1):
mm: memmap defer init dosn't work as expected

arch/ia64/mm/init.c | 4 ++--
include/linux/mm.h | 5 +++--
mm/memory_hotplug.c | 2 +-
mm/page_alloc.c | 8 +++++---
4 files changed, 11 insertions(+), 8 deletions(-)

--
2.17.2


2020-12-23 08:11:47

by Baoquan He

[permalink] [raw]
Subject: [PATCH v3 1/1] mm: memmap defer init dosn't work as expected

VMware observed a performance regression during memmap init on their platform,
and bisected to commit 73a6e474cb376 ("mm: memmap_init: iterate over memblock
regions rather that check each PFN") causing it.

Before the commit:

[0.033176] Normal zone: 1445888 pages used for memmap
[0.033176] Normal zone: 89391104 pages, LIFO batch:63
[0.035851] ACPI: PM-Timer IO Port: 0x448

With commit

[0.026874] Normal zone: 1445888 pages used for memmap
[0.026875] Normal zone: 89391104 pages, LIFO batch:63
[2.028450] ACPI: PM-Timer IO Port: 0x448

The root cause is the current memmap defer init doesn't work as expected.
Before, memmap_init_zone() was used to do memmap init of one whole zone, to
initialize all low zones of one numa node, but defer memmap init of the
last zone in that numa node. However, since commit 73a6e474cb376, function
memmap_init() is adapted to iterater over memblock regions inside one zone,
then call memmap_init_zone() to do memmap init for each region.

E.g, on VMware's system, the memory layout is as below, there are two memory
regions in node 2. The current code will mistakenly initialize the whole 1st
region [mem 0xab00000000-0xfcffffffff], then do memmap defer to iniatialize
only one memmory section on the 2nd region [mem 0x10000000000-0x1033fffffff].
In fact, we only expect to see that there's only one memory section's memmap
initialized. That's why more time is costed at the time.

[ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
[ 0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x55ffffffff]
[ 0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x5600000000-0xaaffffffff]
[ 0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab00000000-0xfcffffffff]
[ 0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x10000000000-0x1033fffffff]

Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
down the real zone end pfn so that defer_init() can use it to judge whether
defer need be taken in zone wide.

Fixes: commit 73a6e474cb376 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Reported-by: Rahul Gopakumar <[email protected]>
Signed-off-by: Baoquan He <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Cc: [email protected]
---
arch/ia64/mm/init.c | 4 ++--
include/linux/mm.h | 5 +++--
mm/memory_hotplug.c | 2 +-
mm/page_alloc.c | 8 +++++---
4 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 9b5acf8fb092..e76386a3479e 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -536,7 +536,7 @@ virtual_memmap_init(u64 start, u64 end, void *arg)

if (map_start < map_end)
memmap_init_zone((unsigned long)(map_end - map_start),
- args->nid, args->zone, page_to_pfn(map_start),
+ args->nid, args->zone, page_to_pfn(map_start), page_to_pfn(map_end),
MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
return 0;
}
@@ -546,7 +546,7 @@ memmap_init (unsigned long size, int nid, unsigned long zone,
unsigned long start_pfn)
{
if (!vmem_map) {
- memmap_init_zone(size, nid, zone, start_pfn,
+ memmap_init_zone(size, nid, zone, start_pfn, start_pfn + size,
MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
} else {
struct page *start;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5299b90a6c40..af0d3a8d77f7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2432,8 +2432,9 @@ extern int __meminit early_pfn_to_nid(unsigned long pfn);
#endif

extern void set_dma_reserve(unsigned long new_dma_reserve);
-extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long,
- enum meminit_context, struct vmem_altmap *, int migratetype);
+extern void memmap_init_zone(unsigned long, int, unsigned long,
+ unsigned long, unsigned long, enum meminit_context,
+ struct vmem_altmap *, int migratetype);
extern void setup_per_zone_wmarks(void);
extern int __meminit init_per_zone_wmark_min(void);
extern void mem_init(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c01604224299..789fceb4f2d5 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -713,7 +713,7 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
* expects the zone spans the pfn range. All the pages in the range
* are reserved so nobody should be touching them so we should be safe
*/
- memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn,
+ memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, 0,
MEMINIT_HOTPLUG, altmap, migratetype);

set_zone_contiguous(zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7a2c89b21115..bdbec4c98173 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -423,6 +423,8 @@ defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
if (end_pfn < pgdat_end_pfn(NODE_DATA(nid)))
return false;

+ if (NODE_DATA(nid)->first_deferred_pfn != ULONG_MAX)
+ return true;
/*
* We start only with one section of pages, more pages are added as
* needed until the rest of deferred pages are initialized.
@@ -6116,7 +6118,7 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
* zone stats (e.g., nr_isolate_pageblock) are touched.
*/
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
- unsigned long start_pfn,
+ unsigned long start_pfn, unsigned long zone_end_pfn,
enum meminit_context context,
struct vmem_altmap *altmap, int migratetype)
{
@@ -6152,7 +6154,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
if (context == MEMINIT_EARLY) {
if (overlap_memmap_init(zone, &pfn))
continue;
- if (defer_init(nid, pfn, end_pfn))
+ if (defer_init(nid, pfn, zone_end_pfn))
break;
}

@@ -6266,7 +6268,7 @@ void __meminit __weak memmap_init(unsigned long size, int nid,

if (end_pfn > start_pfn) {
size = end_pfn - start_pfn;
- memmap_init_zone(size, nid, zone, start_pfn,
+ memmap_init_zone(size, nid, zone, start_pfn, range_end_pfn,
MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
}
}
--
2.17.2