2015-04-13 10:17:22

by Mel Gorman

Subject: [RFC PATCH 0/14] Parallel memory initialisation

Memory initialisation had been identified as one of the reasons why large
machines take a long time to boot. Patches were posted a long time ago
that attempted to move deferred initialisation into the page allocator
paths. This was rejected on the grounds it should not be necessary to hurt
the fast paths to parallelise initialisation. This series reuses much of
the work from that time but defers the initialisation of memory to kswapd
so that one thread per node initialises memory local to that node. The
issue is that on the machines I tested with, memory initialisation was not
a major contributor to boot times. I'm posting the RFC to both review the
series and see if it actually helps users of very large machines.

After applying the series and setting the appropriate Kconfig variable I
see this in the boot log on a 64G machine

[ 7.383764] kswapd 0 initialised deferred memory in 188ms
[ 7.404253] kswapd 1 initialised deferred memory in 208ms
[ 7.411044] kswapd 3 initialised deferred memory in 216ms
[ 7.411551] kswapd 2 initialised deferred memory in 216ms

On a 1TB machine, I see

[ 11.913324] kswapd 0 initialised deferred memory in 1168ms
[ 12.220011] kswapd 2 initialised deferred memory in 1476ms
[ 12.245369] kswapd 3 initialised deferred memory in 1500ms
[ 12.271680] kswapd 1 initialised deferred memory in 1528ms

Once booted the machine appears to work as normal. Boot times were measured
from the time shutdown was called until ssh was available again. In the
64G case, the boot time savings are negligible. On the 1TB machine, the
savings were 10 seconds (about 8% improvement on kernel times but 1-2%
overall as POST takes so long).

It would be nice if the people that have access to really large machines
would test this series and report back if the complexity is justified.

Patches are against 4.0-rc7.

Documentation/kernel-parameters.txt | 8 +
arch/ia64/mm/numa.c | 19 +-
arch/x86/Kconfig | 2 +
include/linux/memblock.h | 18 ++
include/linux/mm.h | 8 +-
include/linux/mmzone.h | 37 +++-
init/main.c | 1 +
mm/Kconfig | 29 +++
mm/bootmem.c | 6 +-
mm/internal.h | 23 ++-
mm/memblock.c | 34 ++-
mm/mm_init.c | 9 +-
mm/nobootmem.c | 7 +-
mm/page_alloc.c | 398 +++++++++++++++++++++++++++++++-----
mm/vmscan.c | 6 +-
15 files changed, 507 insertions(+), 98 deletions(-)

--
2.1.2


2015-04-13 10:17:19

by Mel Gorman

Subject: [PATCH 01/14] memblock: Introduce a for_each_reserved_mem_region iterator.

From: Robin Holt <[email protected]>

As part of initializing struct pages in 2MiB chunks, we noticed that
at the end of free_all_bootmem(), there was nothing which had forced
the reserved/allocated 4KiB pages to be initialized.

This helper function will be used for that expansion.
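
As an aside (illustrative only, not part of this patch), the iterator is
used later in the series from free_low_memory_core_early(), where
reserve_bootmem_region() is the helper added in patch 3:

	phys_addr_t start, end;
	u64 i;

	/* walk every reserved memblock region and mark its struct pages */
	for_each_reserved_mem_region(i, &start, &end)
		reserve_bootmem_region(start, end);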

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nate Zimmer <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/memblock.h | 18 ++++++++++++++++++
mm/memblock.c | 32 ++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index e8cc45307f8f..3075e7673c54 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -93,6 +93,9 @@ void __next_mem_range_rev(u64 *idx, int nid, struct memblock_type *type_a,
struct memblock_type *type_b, phys_addr_t *out_start,
phys_addr_t *out_end, int *out_nid);

+void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
+ phys_addr_t *out_end);
+
/**
* for_each_mem_range - iterate through memblock areas from type_a and not
* included in type_b. Or just type_a if type_b is NULL.
@@ -132,6 +135,21 @@ void __next_mem_range_rev(u64 *idx, int nid, struct memblock_type *type_a,
__next_mem_range_rev(&i, nid, type_a, type_b, \
p_start, p_end, p_nid))

+/**
+ * for_each_reserved_mem_region - iterate over all reserved memblock areas
+ * @i: u64 used as loop variable
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over reserved areas of memblock. Available as soon as memblock
+ * is initialized.
+ */
+#define for_each_reserved_mem_region(i, p_start, p_end) \
+ for (i = 0UL, \
+ __next_reserved_mem_region(&i, p_start, p_end); \
+ i != (u64)ULLONG_MAX; \
+ __next_reserved_mem_region(&i, p_start, p_end))
+
#ifdef CONFIG_MOVABLE_NODE
static inline bool memblock_is_hotpluggable(struct memblock_region *m)
{
diff --git a/mm/memblock.c b/mm/memblock.c
index 252b77bdf65e..e0cc2d174f74 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -765,6 +765,38 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
}

/**
+ * __next_reserved_mem_region - next function for for_each_reserved_region()
+ * @idx: pointer to u64 loop variable
+ * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL
+ * @out_end: ptr to phys_addr_t for end address of the region, can be %NULL
+ *
+ * Iterate over all reserved memory regions.
+ */
+void __init_memblock __next_reserved_mem_region(u64 *idx,
+ phys_addr_t *out_start,
+ phys_addr_t *out_end)
+{
+ struct memblock_type *rsv = &memblock.reserved;
+
+ if (*idx >= 0 && *idx < rsv->cnt) {
+ struct memblock_region *r = &rsv->regions[*idx];
+ phys_addr_t base = r->base;
+ phys_addr_t size = r->size;
+
+ if (out_start)
+ *out_start = base;
+ if (out_end)
+ *out_end = base + size - 1;
+
+ *idx += 1;
+ return;
+ }
+
+ /* signal end of iteration */
+ *idx = ULLONG_MAX;
+}
+
+/**
* __next__mem_range - next function for for_each_free_mem_range() etc.
* @idx: pointer to u64 loop variable
* @nid: node selector, %NUMA_NO_NODE for all nodes
--
2.1.2

2015-04-13 10:17:16

by Mel Gorman

Subject: [PATCH 02/14] mm: meminit: Move page initialization into a separate function.

From: Robin Holt <[email protected]>

Currently, memmap_init_zone() has all the smarts for initializing a single
page. A subset of this is required for parallel page initialisation and so
this patch breaks up the monolithic function in preparation.

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nathan Zimmer <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 79 +++++++++++++++++++++++++++++++++------------------------
1 file changed, 46 insertions(+), 33 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 40e29429e7b0..fd7a6d09062d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -778,6 +778,51 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
return 0;
}

+static void __meminit __init_single_page(struct page *page, unsigned long pfn,
+ unsigned long zone, int nid)
+{
+ struct zone *z = &NODE_DATA(nid)->node_zones[zone];
+
+ set_page_links(page, zone, nid, pfn);
+ mminit_verify_page_links(page, zone, nid, pfn);
+ init_page_count(page);
+ page_mapcount_reset(page);
+ page_cpupid_reset_last(page);
+ SetPageReserved(page);
+
+ /*
+ * Mark the block movable so that blocks are reserved for
+ * movable at startup. This will force kernel allocations
+ * to reserve their blocks rather than leaking throughout
+ * the address space during boot when many long-lived
+ * kernel allocations are made. Later some blocks near
+ * the start are marked MIGRATE_RESERVE by
+ * setup_zone_migrate_reserve()
+ *
+ * bitmap is created for zone's valid pfn range. but memmap
+ * can be created for invalid pages (for alignment)
+ * check here not to call set_pageblock_migratetype() against
+ * pfn out of zone.
+ */
+ if ((z->zone_start_pfn <= pfn)
+ && (pfn < zone_end_pfn(z))
+ && !(pfn & (pageblock_nr_pages - 1)))
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+
+ INIT_LIST_HEAD(&page->lru);
+#ifdef WANT_PAGE_VIRTUAL
+ /* The shift won't overflow because ZONE_NORMAL is below 4G. */
+ if (!is_highmem_idx(zone))
+ set_page_address(page, __va(pfn << PAGE_SHIFT));
+#endif
+}
+
+static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone,
+ int nid)
+{
+ return __init_single_page(pfn_to_page(pfn), pfn, zone, nid);
+}
+
static bool free_pages_prepare(struct page *page, unsigned int order)
{
bool compound = PageCompound(page);
@@ -4124,7 +4169,6 @@ static void setup_zone_migrate_reserve(struct zone *zone)
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
unsigned long start_pfn, enum memmap_context context)
{
- struct page *page;
unsigned long end_pfn = start_pfn + size;
unsigned long pfn;
struct zone *z;
@@ -4145,38 +4189,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
if (!early_pfn_in_nid(pfn, nid))
continue;
}
- page = pfn_to_page(pfn);
- set_page_links(page, zone, nid, pfn);
- mminit_verify_page_links(page, zone, nid, pfn);
- init_page_count(page);
- page_mapcount_reset(page);
- page_cpupid_reset_last(page);
- SetPageReserved(page);
- /*
- * Mark the block movable so that blocks are reserved for
- * movable at startup. This will force kernel allocations
- * to reserve their blocks rather than leaking throughout
- * the address space during boot when many long-lived
- * kernel allocations are made. Later some blocks near
- * the start are marked MIGRATE_RESERVE by
- * setup_zone_migrate_reserve()
- *
- * bitmap is created for zone's valid pfn range. but memmap
- * can be created for invalid pages (for alignment)
- * check here not to call set_pageblock_migratetype() against
- * pfn out of zone.
- */
- if ((z->zone_start_pfn <= pfn)
- && (pfn < zone_end_pfn(z))
- && !(pfn & (pageblock_nr_pages - 1)))
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-
- INIT_LIST_HEAD(&page->lru);
-#ifdef WANT_PAGE_VIRTUAL
- /* The shift won't overflow because ZONE_NORMAL is below 4G. */
- if (!is_highmem_idx(zone))
- set_page_address(page, __va(pfn << PAGE_SHIFT));
-#endif
+ __init_single_pfn(pfn, zone, nid);
}
}

--
2.1.2

2015-04-13 10:21:10

by Mel Gorman

Subject: [PATCH 03/14] mm: meminit: Only set page reserved in the memblock region

From: Nathan Zimmer <[email protected]>

Currently, when we initialize each page struct, it is set as reserved upon
initialization. This changes to starting with the reserved bit clear and
then only setting the bit within the reserved regions.

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nathan Zimmer <[email protected]>
---
include/linux/mm.h | 2 ++
mm/nobootmem.c | 3 +++
mm/page_alloc.c | 11 ++++++++++-
3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 47a93928b90f..b6f82a31028a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1711,6 +1711,8 @@ extern void free_highmem_page(struct page *page);
extern void adjust_managed_page_count(struct page *page, long count);
extern void mem_init_print_info(const char *str);

+extern void reserve_bootmem_region(unsigned long start, unsigned long end);
+
/* Free the reserved page into the buddy system, so it gets managed. */
static inline void __free_reserved_page(struct page *page)
{
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 90b50468333e..396f9e450dc1 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -121,6 +121,9 @@ static unsigned long __init free_low_memory_core_early(void)

memblock_clear_hotplug(0, -1);

+ for_each_reserved_mem_region(i, &start, &end)
+ reserve_bootmem_region(start, end);
+
for_each_free_mem_range(i, NUMA_NO_NODE, &start, &end, NULL)
count += __free_memory_core(start, end);

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd7a6d09062d..2abb3b861e70 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -788,7 +788,6 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);
- SetPageReserved(page);

/*
* Mark the block movable so that blocks are reserved for
@@ -823,6 +822,16 @@ static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone,
return __init_single_page(pfn_to_page(pfn), pfn, zone, nid);
}

+void reserve_bootmem_region(unsigned long start, unsigned long end)
+{
+ unsigned long start_pfn = PFN_DOWN(start);
+ unsigned long end_pfn = PFN_UP(end);
+
+ for (; start_pfn < end_pfn; start_pfn++)
+ if (pfn_valid(start_pfn))
+ SetPageReserved(pfn_to_page(start_pfn));
+}
+
static bool free_pages_prepare(struct page *page, unsigned int order)
{
bool compound = PageCompound(page);
--
2.1.2

2015-04-13 10:20:49

by Mel Gorman

Subject: [PATCH 04/14] mm: page_alloc: Pass PFN to __free_pages_bootmem

__free_pages_bootmem prepares a page for release to the buddy allocator and
assumes that the struct page is already properly initialised. Parallel
initialisation of pages will mean that __free_pages_bootmem will be
called for pages with uninitialised structs. This patch passes the PFN to
__free_pages_bootmem because, until the struct page is initialised, we cannot
use page_to_pfn() on all memory models. On its own this patch makes no
functional difference.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/bootmem.c | 6 +++---
mm/internal.h | 3 ++-
mm/memblock.c | 2 +-
mm/nobootmem.c | 4 ++--
mm/page_alloc.c | 3 ++-
5 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/mm/bootmem.c b/mm/bootmem.c
index 477be696511d..1d017ab3b0c8 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -210,7 +210,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
if (IS_ALIGNED(start, BITS_PER_LONG) && vec == ~0UL) {
int order = ilog2(BITS_PER_LONG);

- __free_pages_bootmem(pfn_to_page(start), order);
+ __free_pages_bootmem(pfn_to_page(start), cursor, order);
count += BITS_PER_LONG;
start += BITS_PER_LONG;
} else {
@@ -220,7 +220,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
while (vec && cur != start) {
if (vec & 1) {
page = pfn_to_page(cur);
- __free_pages_bootmem(page, 0);
+ __free_pages_bootmem(page, cur, 0);
count++;
}
vec >>= 1;
@@ -234,7 +234,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
pages = bootmem_bootmap_pages(pages);
count += pages;
while (pages--)
- __free_pages_bootmem(page++, 0);
+ __free_pages_bootmem(page++, cur++, 0);

bdebug("nid=%td released=%lx\n", bdata - bootmem_node_data, count);

diff --git a/mm/internal.h b/mm/internal.h
index a96da5b0029d..76b605139c7a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -155,7 +155,8 @@ __find_buddy_index(unsigned long page_idx, unsigned int order)
}

extern int __isolate_free_page(struct page *page, unsigned int order);
-extern void __free_pages_bootmem(struct page *page, unsigned int order);
+extern void __free_pages_bootmem(struct page *page, unsigned long pfn,
+ unsigned int order);
extern void prep_compound_page(struct page *page, unsigned long order);
#ifdef CONFIG_MEMORY_FAILURE
extern bool is_free_buddy_page(struct page *page);
diff --git a/mm/memblock.c b/mm/memblock.c
index e0cc2d174f74..f3e97d8eeb5c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1334,7 +1334,7 @@ void __init __memblock_free_late(phys_addr_t base, phys_addr_t size)
end = PFN_DOWN(base + size);

for (; cursor < end; cursor++) {
- __free_pages_bootmem(pfn_to_page(cursor), 0);
+ __free_pages_bootmem(pfn_to_page(cursor), cursor, 0);
totalram_pages++;
}
}
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 396f9e450dc1..bae652713ee5 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -77,7 +77,7 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size)
end = PFN_DOWN(addr + size);

for (; cursor < end; cursor++) {
- __free_pages_bootmem(pfn_to_page(cursor), 0);
+ __free_pages_bootmem(pfn_to_page(cursor), cursor, 0);
totalram_pages++;
}
}
@@ -92,7 +92,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end)
while (start + (1UL << order) > end)
order--;

- __free_pages_bootmem(pfn_to_page(start), order);
+ __free_pages_bootmem(pfn_to_page(start), start, order);

start += (1UL << order);
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2abb3b861e70..0a0e0f280d87 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -886,7 +886,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
local_irq_restore(flags);
}

-void __init __free_pages_bootmem(struct page *page, unsigned int order)
+void __init __free_pages_bootmem(struct page *page, unsigned long pfn,
+ unsigned int order)
{
unsigned int nr_pages = 1 << order;
struct page *p = page;
--
2.1.2

2015-04-13 10:20:07

by Mel Gorman

Subject: [PATCH 05/14] mm: meminit: Make __early_pfn_to_nid SMP-safe and introduce meminit_pfn_in_nid

The generic and arch-specific implementations of __early_pfn_to_nid() use
static variables to cache recent lookups. Without the cache, boot times are
much higher due to the excessive number of memblock lookups, but the cache
assumes that memory initialisation is single-threaded. Parallel memory
initialisation will break that assumption, so this patch makes
__early_pfn_to_nid() SMP-safe by requiring the caller to cache recent search
information. early_pfn_to_nid() keeps the same interface but is only safe to
use early in boot due to its use of a global static variable.
meminit_pfn_in_nid() is an SMP-safe version for which callers must maintain
their own state.
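
For illustration only (not part of the patch, and assuming start_pfn,
end_pfn and nid are in scope), a caller walking a large PFN range is
expected to keep its own cache on the stack and pass it in, which is
what the deferred initialisation added later in the series does:

	struct mminit_pfnnid_cache nid_state = { };
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		/* cached lookups are cheap and no global state is shared */
		if (!meminit_pfn_in_nid(pfn, nid, &nid_state))
			continue;
		/* ... initialise the struct page for this node ... */
	}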

Signed-off-by: Mel Gorman <[email protected]>
---
arch/ia64/mm/numa.c | 19 +++++++------------
include/linux/mm.h | 8 ++++++--
include/linux/mmzone.h | 16 +++++++++++++++-
mm/page_alloc.c | 40 +++++++++++++++++++++++++---------------
4 files changed, 53 insertions(+), 30 deletions(-)

diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c
index ea21d4cad540..aa19b7ac8222 100644
--- a/arch/ia64/mm/numa.c
+++ b/arch/ia64/mm/numa.c
@@ -58,27 +58,22 @@ paddr_to_nid(unsigned long paddr)
* SPARSEMEM to allocate the SPARSEMEM sectionmap on the NUMA node where
* the section resides.
*/
-int __meminit __early_pfn_to_nid(unsigned long pfn)
+int __meminit __early_pfn_to_nid(unsigned long pfn,
+ struct mminit_pfnnid_cache *state)
{
int i, section = pfn >> PFN_SECTION_SHIFT, ssec, esec;
- /*
- * NOTE: The following SMP-unsafe globals are only used early in boot
- * when the kernel is running single-threaded.
- */
- static int __meminitdata last_ssec, last_esec;
- static int __meminitdata last_nid;

- if (section >= last_ssec && section < last_esec)
- return last_nid;
+ if (section >= state->last_start && section < state->last_end)
+ return state->last_nid;

for (i = 0; i < num_node_memblks; i++) {
ssec = node_memblk[i].start_paddr >> PA_SECTION_SHIFT;
esec = (node_memblk[i].start_paddr + node_memblk[i].size +
((1L << PA_SECTION_SHIFT) - 1)) >> PA_SECTION_SHIFT;
if (section >= ssec && section < esec) {
- last_ssec = ssec;
- last_esec = esec;
- last_nid = node_memblk[i].nid;
+ state->last_start = ssec;
+ state->last_end = esec;
+ state->last_nid = node_memblk[i].nid;
return node_memblk[i].nid;
}
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b6f82a31028a..3a4c9f72c080 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1802,7 +1802,8 @@ extern void sparse_memory_present_with_active_regions(int nid);

#if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
!defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
-static inline int __early_pfn_to_nid(unsigned long pfn)
+static inline int __early_pfn_to_nid(unsigned long pfn,
+ struct mminit_pfnnid_cache *state)
{
return 0;
}
@@ -1810,7 +1811,10 @@ static inline int __early_pfn_to_nid(unsigned long pfn)
/* please see mm/page_alloc.c */
extern int __meminit early_pfn_to_nid(unsigned long pfn);
/* there is a per-arch backend function. */
-extern int __meminit __early_pfn_to_nid(unsigned long pfn);
+extern int __meminit __early_pfn_to_nid(unsigned long pfn,
+ struct mminit_pfnnid_cache *state);
+bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node,
+ struct mminit_pfnnid_cache *state);
#endif

extern void set_dma_reserve(unsigned long new_dma_reserve);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f279d9c158cd..4ac0037de2f1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1217,10 +1217,24 @@ void sparse_init(void);
#define sparse_index_init(_sec, _nid) do {} while (0)
#endif /* CONFIG_SPARSEMEM */

+/*
+ * During memory init memblocks map pfns to nids. The search is expensive and
+ * this caches recent lookups. The implementation of __early_pfn_to_nid
+ * may treat start/end as pfns or sections.
+ */
+struct mminit_pfnnid_cache {
+ unsigned long last_start;
+ unsigned long last_end;
+ int last_nid;
+};
+
#ifdef CONFIG_NODES_SPAN_OTHER_NODES
bool early_pfn_in_nid(unsigned long pfn, int nid);
+bool meminit_pfn_in_nid(unsigned long pfn, int node,
+ struct mminit_pfnnid_cache *state);
#else
-#define early_pfn_in_nid(pfn, nid) (1)
+#define early_pfn_in_nid(pfn, nid) (1)
+#define meminit_pfn_in_nid(pfn, nid, state) (1)
#endif

#ifndef early_pfn_valid
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0a0e0f280d87..f556ed63b964 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4457,39 +4457,41 @@ int __meminit init_currently_empty_zone(struct zone *zone,

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
+
/*
* Required by SPARSEMEM. Given a PFN, return what node the PFN is on.
*/
-int __meminit __early_pfn_to_nid(unsigned long pfn)
+int __meminit __early_pfn_to_nid(unsigned long pfn,
+ struct mminit_pfnnid_cache *state)
{
unsigned long start_pfn, end_pfn;
int nid;
- /*
- * NOTE: The following SMP-unsafe globals are only used early in boot
- * when the kernel is running single-threaded.
- */
- static unsigned long __meminitdata last_start_pfn, last_end_pfn;
- static int __meminitdata last_nid;

- if (last_start_pfn <= pfn && pfn < last_end_pfn)
- return last_nid;
+ if (state->last_start <= pfn && pfn < state->last_end)
+ return state->last_nid;

nid = memblock_search_pfn_nid(pfn, &start_pfn, &end_pfn);
if (nid != -1) {
- last_start_pfn = start_pfn;
- last_end_pfn = end_pfn;
- last_nid = nid;
+ state->last_start = start_pfn;
+ state->last_end = end_pfn;
+ state->last_nid = nid;
}

return nid;
}
#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */

+struct __meminitdata mminit_pfnnid_cache global_init_state;
+
+/* Only safe to use early in boot when initialisation is single-threaded */
int __meminit early_pfn_to_nid(unsigned long pfn)
{
int nid;

- nid = __early_pfn_to_nid(pfn);
+ /* The system will behave unpredictably otherwise */
+ BUG_ON(system_state != SYSTEM_BOOTING);
+
+ nid = __early_pfn_to_nid(pfn, &global_init_state);
if (nid >= 0)
return nid;
/* just returns 0 */
@@ -4497,15 +4499,23 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
}

#ifdef CONFIG_NODES_SPAN_OTHER_NODES
-bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
+bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node,
+ struct mminit_pfnnid_cache *state)
{
int nid;

- nid = __early_pfn_to_nid(pfn);
+ nid = __early_pfn_to_nid(pfn, state);
if (nid >= 0 && nid != node)
return false;
return true;
}
+
+/* Only safe to use early in boot when initialisation is single-threaded */
+bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
+{
+ return meminit_pfn_in_nid(pfn, node, &global_init_state);
+}
+
#endif

/**
--
2.1.2

2015-04-13 10:17:26

by Mel Gorman

Subject: [PATCH 06/14] mm: meminit: Inline some helper functions

early_pfn_in_nid() and meminit_pfn_in_nid() are small functions that are
unnecessarily visible outside memory initialisation. As well as the
unnecessary visibility, keeping them out of line adds function call overhead
when initialising pages.
This patch moves the helpers inline.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mm.h | 2 --
include/linux/mmzone.h | 9 -------
mm/page_alloc.c | 72 ++++++++++++++++++++++++--------------------------
3 files changed, 35 insertions(+), 48 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3a4c9f72c080..a8a8b161fd65 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1813,8 +1813,6 @@ extern int __meminit early_pfn_to_nid(unsigned long pfn);
/* there is a per-arch backend function. */
extern int __meminit __early_pfn_to_nid(unsigned long pfn,
struct mminit_pfnnid_cache *state);
-bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node,
- struct mminit_pfnnid_cache *state);
#endif

extern void set_dma_reserve(unsigned long new_dma_reserve);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4ac0037de2f1..c75df05dc8e8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1228,15 +1228,6 @@ struct mminit_pfnnid_cache {
int last_nid;
};

-#ifdef CONFIG_NODES_SPAN_OTHER_NODES
-bool early_pfn_in_nid(unsigned long pfn, int nid);
-bool meminit_pfn_in_nid(unsigned long pfn, int node,
- struct mminit_pfnnid_cache *state);
-#else
-#define early_pfn_in_nid(pfn, nid) (1)
-#define meminit_pfn_in_nid(pfn, nid, state) (1)
-#endif
-
#ifndef early_pfn_valid
#define early_pfn_valid(pfn) (1)
#endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f556ed63b964..cadf37c09e6b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -907,6 +907,41 @@ void __init __free_pages_bootmem(struct page *page, unsigned long pfn,
__free_pages(page, order);
}

+/* Only safe to use early in boot when initialisation is single-threaded */
+struct __meminitdata mminit_pfnnid_cache global_init_state;
+int __meminit early_pfn_to_nid(unsigned long pfn)
+{
+ int nid;
+
+ /* The system will behave unpredictably otherwise */
+ BUG_ON(system_state != SYSTEM_BOOTING);
+
+ nid = __early_pfn_to_nid(pfn, &global_init_state);
+ if (nid >= 0)
+ return nid;
+ /* just returns 0 */
+ return 0;
+}
+
+#ifdef CONFIG_NODES_SPAN_OTHER_NODES
+static inline bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node,
+ struct mminit_pfnnid_cache *state)
+{
+ int nid;
+
+ nid = __early_pfn_to_nid(pfn, state);
+ if (nid >= 0 && nid != node)
+ return false;
+ return true;
+}
+
+/* Only safe to use early in boot when initialisation is single-threaded */
+static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
+{
+ return meminit_pfn_in_nid(pfn, node, &global_init_state);
+}
+#endif
+
#ifdef CONFIG_CMA
/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
void __init init_cma_reserved_pageblock(struct page *page)
@@ -4481,43 +4516,6 @@ int __meminit __early_pfn_to_nid(unsigned long pfn,
}
#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */

-struct __meminitdata mminit_pfnnid_cache global_init_state;
-
-/* Only safe to use early in boot when initialisation is single-threaded */
-int __meminit early_pfn_to_nid(unsigned long pfn)
-{
- int nid;
-
- /* The system will behave unpredictably otherwise */
- BUG_ON(system_state != SYSTEM_BOOTING);
-
- nid = __early_pfn_to_nid(pfn, &global_init_state);
- if (nid >= 0)
- return nid;
- /* just returns 0 */
- return 0;
-}
-
-#ifdef CONFIG_NODES_SPAN_OTHER_NODES
-bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node,
- struct mminit_pfnnid_cache *state)
-{
- int nid;
-
- nid = __early_pfn_to_nid(pfn, state);
- if (nid >= 0 && nid != node)
- return false;
- return true;
-}
-
-/* Only safe to use early in boot when initialisation is single-threaded */
-bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
-{
- return meminit_pfn_in_nid(pfn, node, &global_init_state);
-}
-
-#endif
-
/**
* free_bootmem_with_active_regions - Call memblock_free_early_nid for each active range
* @nid: The node to free memory on. If MAX_NUMNODES, all nodes are freed.
--
2.1.2

2015-04-13 10:17:29

by Mel Gorman

Subject: [PATCH 07/14] mm: meminit: Partially initialise memory if CONFIG_DEFERRED_MEM_INIT is set

This patch initialises 1G per zone during early boot if
CONFIG_DEFERRED_MEM_INIT is set. That config option cannot be set yet;
it becomes available in a later patch once memory initialisation can be
parallelised. As the parallel initialisation depends on features that are
only available when memory hotplug is enabled, it is necessary to alter
some section annotations.
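
For a sense of scale (editorial note, assuming 4K pages), the threshold
used by update_defer_init() below works out as:

	/* 1G worth of struct pages per zone before deferring the rest */
	1UL << (30 - PAGE_SHIFT)	/* 30 - 12 = 18, i.e. 262144 pages */

and deferral only begins at the next section-aligned PFN, so
first_deferred_pfn always falls on a section boundary.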

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 8 ++++++
mm/internal.h | 8 ++++++
mm/page_alloc.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 89 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c75df05dc8e8..20c2da89a14d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -763,6 +763,14 @@ typedef struct pglist_data {
/* Number of pages migrated during the rate limiting time interval */
unsigned long numabalancing_migrate_nr_pages;
#endif
+
+#ifdef CONFIG_DEFERRED_MEM_INIT
+ /*
+ * If memory initialisation on large machines is deferred then this
+ * is the first PFN that needs to be initialised.
+ */
+ unsigned long first_deferred_pfn;
+#endif /* CONFIG_DEFERRED_MEM_INIT */
} pg_data_t;

#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
diff --git a/mm/internal.h b/mm/internal.h
index 76b605139c7a..c6718e2c051e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -385,6 +385,14 @@ static inline void mminit_verify_zonelist(void)
}
#endif /* CONFIG_DEBUG_MEMORY_INIT */

+#ifdef CONFIG_DEFERRED_MEM_INIT
+#define __defermem_init __meminit
+#define __defer_init __meminit
+#else
+#define __defermem_init
+#define __defer_init __init
+#endif
+
/* mminit_validate_memmodel_limits is independent of CONFIG_DEBUG_MEMORY_INIT */
#if defined(CONFIG_SPARSEMEM)
extern void mminit_validate_memmodel_limits(unsigned long *start_pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cadf37c09e6b..e15e74da31a5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -235,6 +235,63 @@ EXPORT_SYMBOL(nr_online_nodes);

int page_group_by_mobility_disabled __read_mostly;

+#ifdef CONFIG_DEFERRED_MEM_INIT
+static inline void reset_deferred_meminit(pg_data_t *pgdat)
+{
+ pgdat->first_deferred_pfn = ULONG_MAX;
+}
+
+/* Returns true if the struct page for the pfn is uninitialised */
+static inline bool __defermem_init early_page_uninitialised(unsigned long pfn)
+{
+ int nid = early_pfn_to_nid(pfn);
+
+ if (pfn >= NODE_DATA(nid)->first_deferred_pfn)
+ return true;
+
+ return false;
+}
+
+/*
+ * Returns false when the remaining initialisation should be deferred until
+ * later in the boot cycle when it can be parallelised.
+ */
+static inline bool update_defer_init(pg_data_t *pgdat,
+ unsigned long pfn,
+ unsigned long *nr_initialised)
+{
+ if (pgdat->first_deferred_pfn != ULONG_MAX)
+ return false;
+
+ /* Initialise at least 1G per zone */
+ (*nr_initialised)++;
+ if (*nr_initialised > (1UL << (30 - PAGE_SHIFT)) &&
+ (pfn & (PAGES_PER_SECTION - 1)) == 0) {
+ pgdat->first_deferred_pfn = pfn;
+ return false;
+ }
+
+ return true;
+}
+#else
+static inline void reset_deferred_meminit(pg_data_t *pgdat)
+{
+}
+
+static inline bool early_page_uninitialised(unsigned long pfn)
+{
+ return false;
+}
+
+static inline bool update_defer_init(pg_data_t *pgdat,
+ unsigned long pfn,
+ unsigned long *nr_initialised)
+{
+ return true;
+}
+#endif
+
+
void set_pageblock_migratetype(struct page *page, int migratetype)
{
if (unlikely(page_group_by_mobility_disabled &&
@@ -886,8 +943,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
local_irq_restore(flags);
}

-void __init __free_pages_bootmem(struct page *page, unsigned long pfn,
- unsigned int order)
+static void __defer_init __free_pages_boot_core(struct page *page,
+ unsigned long pfn, unsigned int order)
{
unsigned int nr_pages = 1 << order;
struct page *p = page;
@@ -942,6 +999,14 @@ static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
}
#endif

+void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn,
+ unsigned int order)
+{
+ if (early_page_uninitialised(pfn))
+ return;
+ return __free_pages_boot_core(page, pfn, order);
+}
+
#ifdef CONFIG_CMA
/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
void __init init_cma_reserved_pageblock(struct page *page)
@@ -4214,14 +4279,16 @@ static void setup_zone_migrate_reserve(struct zone *zone)
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
unsigned long start_pfn, enum memmap_context context)
{
+ pg_data_t *pgdat = NODE_DATA(nid);
unsigned long end_pfn = start_pfn + size;
unsigned long pfn;
struct zone *z;
+ unsigned long nr_initialised = 0;

if (highest_memmap_pfn < end_pfn - 1)
highest_memmap_pfn = end_pfn - 1;

- z = &NODE_DATA(nid)->node_zones[zone];
+ z = &pgdat->node_zones[zone];
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
/*
* There can be holes in boot-time mem_map[]s
@@ -4233,6 +4300,8 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
continue;
if (!early_pfn_in_nid(pfn, nid))
continue;
+ if (!update_defer_init(pgdat, pfn, &nr_initialised))
+ break;
}
__init_single_pfn(pfn, zone, nid);
}
@@ -5034,6 +5103,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
/* pg_data_t should be reset to zero when it's allocated */
WARN_ON(pgdat->nr_zones || pgdat->classzone_idx);

+ reset_deferred_meminit(pgdat);
pgdat->node_id = nid;
pgdat->node_start_pfn = node_start_pfn;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
--
2.1.2

2015-04-13 10:19:41

by Mel Gorman

Subject: [PATCH 08/14] mm: meminit: Initialise remaining memory in parallel with kswapd

Memory is only partially initialised at the moment. When this patch is
applied the first task kswapd does is initialise all remaining memmap in
parallel. This has two advantages. The first is that the work is parallelised,
so it should take less time and boot faster on large machines. The second is
that the initialisation takes place on a CPU local to the memory being
initialised. The user-visible effect on large machines is that free memory
will appear to increase rapidly early in the lifetime of the system, until
kswapd reports in the kernel log that all memory is initialised. It also
reports the time taken to initialise so that the speedup can be estimated. Once
initialised there should be no other user-visible effects. Potentially
the initialisation could be further parallelised using all CPUs local to
a node but it's expected that a single CPU will be able to saturate the
memory bus during initialisation and the additional complexity is not
justified.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/internal.h | 6 +++
mm/mm_init.c | 1 +
mm/page_alloc.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
mm/vmscan.c | 6 ++-
4 files changed, 123 insertions(+), 6 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index c6718e2c051e..2fad7c066e5c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -388,9 +388,15 @@ static inline void mminit_verify_zonelist(void)
#ifdef CONFIG_DEFERRED_MEM_INIT
#define __defermem_init __meminit
#define __defer_init __meminit
+
+void deferred_init_memmap(int nid);
#else
#define __defermem_init
#define __defer_init __init
+
+static inline void deferred_init_memmap(int nid)
+{
+}
#endif

/* mminit_validate_memmodel_limits is independent of CONFIG_DEBUG_MEMORY_INIT */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 5f420f7fafa1..28fbf87b20aa 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -11,6 +11,7 @@
#include <linux/export.h>
#include <linux/memory.h>
#include <linux/notifier.h>
+#include <linux/sched.h>
#include "internal.h"

#ifdef CONFIG_DEBUG_MEMORY_INIT
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e15e74da31a5..10ba841c7609 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -252,6 +252,14 @@ static inline bool __defermem_init early_page_uninitialised(unsigned long pfn)
return false;
}

+static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
+{
+ if (pfn >= NODE_DATA(nid)->first_deferred_pfn)
+ return true;
+
+ return false;
+}
+
/*
* Returns false when the remaining initialisation should be deferred until
* later in the boot cycle when it can be parallelised.
@@ -283,6 +291,11 @@ static inline bool early_page_uninitialised(unsigned long pfn)
return false;
}

+static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
+{
+ return false;
+}
+
static inline bool update_defer_init(pg_data_t *pgdat,
unsigned long pfn,
unsigned long *nr_initialised)
@@ -879,14 +892,45 @@ static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone,
return __init_single_page(pfn_to_page(pfn), pfn, zone, nid);
}

-void reserve_bootmem_region(unsigned long start, unsigned long end)
+#ifdef CONFIG_DEFERRED_MEM_INIT
+static void init_reserved_page(unsigned long pfn)
+{
+ pg_data_t *pgdat;
+ int nid, zid;
+
+ if (!early_page_uninitialised(pfn))
+ return;
+
+ nid = early_pfn_to_nid(pfn);
+ pgdat = NODE_DATA(nid);
+
+ for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ struct zone *zone = &pgdat->node_zones[zid];
+
+ if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone))
+ break;
+ }
+ __init_single_pfn(pfn, zid, nid);
+}
+#else
+static inline void init_reserved_page(unsigned long pfn)
+{
+}
+#endif /* CONFIG_DEFERRED_MEM_INIT */
+
+void __meminit reserve_bootmem_region(unsigned long start, unsigned long end)
{
unsigned long start_pfn = PFN_DOWN(start);
unsigned long end_pfn = PFN_UP(end);

- for (; start_pfn < end_pfn; start_pfn++)
- if (pfn_valid(start_pfn))
- SetPageReserved(pfn_to_page(start_pfn));
+ for (; start_pfn < end_pfn; start_pfn++) {
+ if (pfn_valid(start_pfn)) {
+ struct page *page = pfn_to_page(start_pfn);
+
+ init_reserved_page(start_pfn);
+ SetPageReserved(page);
+ }
+ }
}

static bool free_pages_prepare(struct page *page, unsigned int order)
@@ -1007,6 +1051,67 @@ void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn,
return __free_pages_boot_core(page, pfn, order);
}

+#ifdef CONFIG_DEFERRED_MEM_INIT
+/* Initialise remaining memory on a node */
+void __defermem_init deferred_init_memmap(int nid)
+{
+ unsigned long start = jiffies;
+ struct mminit_pfnnid_cache nid_init_state = { };
+
+ pg_data_t *pgdat = NODE_DATA(nid);
+ int zid;
+ unsigned long first_init_pfn = pgdat->first_deferred_pfn;
+
+ if (first_init_pfn == ULONG_MAX)
+ return;
+
+ /* Sanity check boundaries */
+ BUG_ON(pgdat->first_deferred_pfn < pgdat->node_start_pfn);
+ BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
+ pgdat->first_deferred_pfn = ULONG_MAX;
+
+ for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ struct zone *zone = pgdat->node_zones + zid;
+ unsigned long walk_start, walk_end;
+ int i;
+
+ for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) {
+ unsigned long pfn, end_pfn;
+
+ end_pfn = min(walk_end, zone_end_pfn(zone));
+ pfn = first_init_pfn;
+ if (pfn < walk_start)
+ pfn = walk_start;
+ if (pfn < zone->zone_start_pfn)
+ pfn = zone->zone_start_pfn;
+
+ for (; pfn < end_pfn; pfn++) {
+ struct page *page;
+
+ if (!pfn_valid(pfn))
+ continue;
+
+ if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state))
+ continue;
+
+ page = pfn_to_page(pfn);
+ if (page->flags) {
+ VM_BUG_ON(page_zone(page) != zone);
+ continue;
+ }
+
+ __init_single_page(page, pfn, zid, nid);
+ __free_pages_boot_core(page, pfn, 0);
+ cond_resched();
+ }
+ first_init_pfn = max(end_pfn, first_init_pfn);
+ }
+ }
+
+ pr_info("kswapd %d initialised deferred memory in %ums\n", nid,
+ jiffies_to_msecs(jiffies - start));
+}
+#endif /* CONFIG_DEFERRED_MEM_INIT */
+
#ifdef CONFIG_CMA
/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
void __init init_cma_reserved_pageblock(struct page *page)
@@ -4217,6 +4322,9 @@ static void setup_zone_migrate_reserve(struct zone *zone)
zone->nr_migrate_reserve_block = reserve;

for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+ if (!early_page_nid_uninitialised(pfn, zone_to_nid(zone)))
+ return;
+
if (!pfn_valid(pfn))
continue;
page = pfn_to_page(pfn);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd71bac..c4895d26d036 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3348,7 +3348,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
* If there are applications that are active memory-allocators
* (most normal use), this basically shouldn't matter.
*/
-static int kswapd(void *p)
+static int __defermem_init kswapd(void *p)
{
unsigned long order, new_order;
unsigned balanced_order;
@@ -3383,6 +3383,8 @@ static int kswapd(void *p)
tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
set_freezable();

+ deferred_init_memmap(pgdat->node_id);
+
order = new_order = 0;
balanced_order = 0;
classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
@@ -3538,7 +3540,7 @@ static int cpu_callback(struct notifier_block *nfb, unsigned long action,
* This kswapd start function will be called by init and node-hot-add.
* On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
*/
-int kswapd_run(int nid)
+int __defermem_init kswapd_run(int nid)
{
pg_data_t *pgdat = NODE_DATA(nid);
int ret = 0;
--
2.1.2

2015-04-13 10:17:33

by Mel Gorman

Subject: [PATCH 09/14] mm: meminit: Minimise number of pfn->page lookups during initialisation

Deferred memory initialisation is using pfn_to_page() on every PFN
unnecessarily. This patch minimises the number of lookups and scheduler
checks.
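
The core of the change, sketched here for clarity ahead of the diff: the
memmap is contiguous within a MAX_ORDER-aligned block, so the struct page
pointer can simply be advanced, and the pfn_to_page() lookup, pfn_valid()
check and cond_resched() only need to run once per MAX_ORDER_NR_PAGES:

	/* Minimise pfn page lookups and scheduler checks */
	if (page && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0) {
		page++;			/* contiguous within the block */
	} else {
		page = pfn_to_page(pfn);	/* full lookup at boundaries */
		cond_resched();
	}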

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 10ba841c7609..21bb818aa3c4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1077,6 +1077,7 @@ void __defermem_init deferred_init_memmap(int nid)

for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) {
unsigned long pfn, end_pfn;
+ struct page *page = NULL;

end_pfn = min(walk_end, zone_end_pfn(zone));
pfn = first_init_pfn;
@@ -1086,13 +1087,32 @@ void __defermem_init deferred_init_memmap(int nid)
pfn = zone->zone_start_pfn;

for (; pfn < end_pfn; pfn++) {
- struct page *page;
-
- if (!pfn_valid(pfn))
+ if (!pfn_valid_within(pfn))
continue;

- if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state))
+ /*
+ * Ensure pfn_valid is checked every
+ * MAX_ORDER_NR_PAGES for memory holes
+ */
+ if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
+ if (!pfn_valid(pfn)) {
+ page = NULL;
+ continue;
+ }
+ }
+
+ if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
+ page = NULL;
continue;
+ }
+
+ /* Minimise pfn page lookups and scheduler checks */
+ if (page && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0) {
+ page++;
+ } else {
+ page = pfn_to_page(pfn);
+ cond_resched();
+ }

if (page->flags) {
VM_BUG_ON(page_zone(page) != zone);
@@ -1101,7 +1121,6 @@ void __defermem_init deferred_init_memmap(int nid)

__init_single_page(page, pfn, zid, nid);
__free_pages_boot_core(page, pfn, 0);
- cond_resched();
}
first_init_pfn = max(end_pfn, first_init_pfn);
}
--
2.1.2

2015-04-13 10:19:17

by Mel Gorman

Subject: [PATCH 10/14] x86: mm: Enable deferred memory initialisation on x86-64

This patch adds the Kconfig logic to add deferred memory initialisation
to x86-64 if NUMA is enabled. Other architectures should enable on a
case-by-case basis once the users of early_pfn_to_nid are audited and it
is tested.

Signed-off-by: Mel Gorman <[email protected]>
---
arch/x86/Kconfig | 2 ++
mm/Kconfig | 19 +++++++++++++++++++
2 files changed, 21 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca55187..830ad8450bbd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -32,6 +32,8 @@ config X86
select HAVE_UNSTABLE_SCHED_CLOCK
select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
select ARCH_SUPPORTS_INT128 if X86_64
+ select ARCH_SUPPORTS_DEFERRED_MEM_INIT if X86_64 && NUMA
+ select ARCH_WANTS_PROT_NUMA_PROT_NONE
select HAVE_IDE
select HAVE_OPROFILE
select HAVE_PCSPKR_PLATFORM
diff --git a/mm/Kconfig b/mm/Kconfig
index a03131b6ba8e..463c7005c3d9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -629,3 +629,22 @@ config MAX_STACK_SIZE_MB
changed to a smaller value in which case that is used.

A sane initial value is 80 MB.
+
+# For architectures that was to support deferred memory initialisation
+config ARCH_SUPPORTS_DEFERRED_MEM_INIT
+ bool
+
+config DEFERRED_MEM_INIT
+ bool "Defer initialisation of memory to kswapd"
+ default n
+ depends on ARCH_SUPPORTS_DEFERRED_MEM_INIT
+ depends on MEMORY_HOTPLUG
+ help
+ Ordinarily all struct pages are initialised during early boot in a
+ single thread. On very large machines this can take a considerable
+ amount of time. If this option is set, large machines will bring up
+ a small amount of memory at boot and then initialise the rest when
+ kswapd starts. Boot times are reduced but very early in the lifetime
+ of the system it will still be busy initialising struct pages. This
+ has a potential performance impact on processes until kswapd finishes
+ the initialisation.
--
2.1.2

2015-04-13 10:18:46

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 11/14] mm: meminit: Control parallel memory initialisation from command line and config

This patch adds a defer_meminit=[enable|disable] kernel command line
option. The default is controlled by Kconfig.
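
For example (editorial note), booting with defer_meminit=disable forces
the old fully-serialised initialisation even when
CONFIG_DEFERRED_MEM_INIT_DEFAULT_ENABLED is set, which the Kconfig help
below suggests for users who allocate huge pages from the kernel command
line.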

Signed-off-by: Mel Gorman <[email protected]>
---
Documentation/kernel-parameters.txt | 8 ++++++++
include/linux/mmzone.h | 14 ++++++++++++++
init/main.c | 1 +
mm/Kconfig | 10 ++++++++++
mm/page_alloc.c | 24 ++++++++++++++++++++++++
5 files changed, 57 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index bfcb1a62a7b4..867338fc5941 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -807,6 +807,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.

debug_objects [KNL] Enable object debugging

+ defer_meminit= [KNL,X86] Enable or disable deferred memory init.
+ Large machine may take a long time to initialise
+ memory management structures. If this is enabled
+ then memory initialisation is deferred to kswapd
+ and each memory node is initialised in parallel.
+ In very early boot, there will be less memory that
+ will rapidly increase while it is initialised.
+
no_debug_objects
[KNL] Disable object debugging

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 20c2da89a14d..1275f9a8cb42 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -823,6 +823,20 @@ static inline struct zone *lruvec_zone(struct lruvec *lruvec)
#endif
}

+
+#ifdef CONFIG_DEFERRED_MEM_INIT
+extern bool deferred_mem_init_enabled;
+static inline void setup_deferred_meminit(void)
+{
+ if (IS_ENABLED(CONFIG_DEFERRED_MEM_INIT_DEFAULT_ENABLED))
+ deferred_mem_init_enabled = true;
+}
+#else
+static inline void setup_deferred_meminit(void)
+{
+}
+#endif /* CONFIG_DEFERRED_MEM_INIT */
+
#ifdef CONFIG_HAVE_MEMORY_PRESENT
void memory_present(int nid, unsigned long start, unsigned long end);
#else
diff --git a/init/main.c b/init/main.c
index 6f0f1c5ff8cc..f339d37a43e8 100644
--- a/init/main.c
+++ b/init/main.c
@@ -506,6 +506,7 @@ asmlinkage __visible void __init start_kernel(void)
boot_init_stack_canary();

cgroup_init_early();
+ setup_deferred_meminit();

local_irq_disable();
early_boot_irqs_disabled = true;
diff --git a/mm/Kconfig b/mm/Kconfig
index 463c7005c3d9..0eb9b1349cc2 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -648,3 +648,13 @@ config DEFERRED_MEM_INIT
of the system it will still be busy initialising struct pages. This
has a potential performance impact on processes until kswapd finishes
the initialisation.
+
+config DEFERRED_MEM_INIT_DEFAULT_ENABLED
+ bool "Automatically enable deferred memory initialisation"
+ default y
+ depends on DEFERRED_MEM_INIT
+ help
+ If set, memory initialisation will be deferred by default on large
+ memory configurations. If DEFERRED_MEM_INIT is set then it is a
+ reasonable default to enable this too. User will need to disable
+ this if allocate huge pages from the command line.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 21bb818aa3c4..cb38583063cb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -236,6 +236,8 @@ EXPORT_SYMBOL(nr_online_nodes);
int page_group_by_mobility_disabled __read_mostly;

#ifdef CONFIG_DEFERRED_MEM_INIT
+bool __meminitdata deferred_mem_init_enabled;
+
static inline void reset_deferred_meminit(pg_data_t *pgdat)
{
pgdat->first_deferred_pfn = ULONG_MAX;
@@ -268,6 +270,9 @@ static inline bool update_defer_init(pg_data_t *pgdat,
unsigned long pfn,
unsigned long *nr_initialised)
{
+ if (!deferred_mem_init_enabled)
+ return true;
+
if (pgdat->first_deferred_pfn != ULONG_MAX)
return false;

@@ -281,6 +286,25 @@ static inline bool update_defer_init(pg_data_t *pgdat,

return true;
}
+
+static int __init setup_deferred_mem_init(char *str)
+{
+ if (!str)
+ return -1;
+
+ if (!strcmp(str, "enable")) {
+ deferred_mem_init_enabled = true;
+ } else if (!strcmp(str, "disable")) {
+ deferred_mem_init_enabled = false;
+ } else {
+ pr_warn("Unable to parse deferred_mem_init=\n");
+ return -1;
+ }
+
+ return 0;
+}
+
+early_param("defer_meminit", setup_deferred_mem_init);
#else
static inline void reset_deferred_meminit(pg_data_t *pgdat)
{
--
2.1.2

2015-04-13 10:18:45

by Mel Gorman

Subject: [PATCH 12/14] mm: meminit: Free pages in large chunks where possible

Parallel initialisation currently frees pages one at a time. This patch
frees them as single large (MAX_ORDER-1) pages where possible.
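
For scale (editorial note, assuming the default MAX_ORDER of 11 and 4K
pages), the batched path in deferred_free_range() below releases

	/* MAX_ORDER_NR_PAGES = 1 << (MAX_ORDER - 1) = 1024 pages, i.e. 4M */
	__free_pages_boot_core(page, pfn, MAX_ORDER - 1);

in a single call instead of 1024 separate order-0 frees.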

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 41 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cb38583063cb..bacd97b0030e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1076,6 +1076,20 @@ void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn,
}

#ifdef CONFIG_DEFERRED_MEM_INIT
+
+void __defermem_init deferred_free_range(struct page *page, unsigned long pfn,
+ int nr_pages)
+{
+ int i;
+
+ if (nr_pages == MAX_ORDER_NR_PAGES && (pfn & (MAX_ORDER_NR_PAGES-1)) == 0) {
+ __free_pages_boot_core(page, pfn, MAX_ORDER-1);
+ return;
+ }
+
+ for (i = 0; i < nr_pages; i++, page++, pfn++)
+ __free_pages_boot_core(page, pfn, 0);
+}
/* Initialise remaining memory on a node */
void __defermem_init deferred_init_memmap(int nid)
{
@@ -1102,6 +1116,9 @@ void __defermem_init deferred_init_memmap(int nid)
for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) {
unsigned long pfn, end_pfn;
struct page *page = NULL;
+ struct page *free_base_page = NULL;
+ unsigned long free_base_pfn = 0;
+ int nr_to_free = 0;

end_pfn = min(walk_end, zone_end_pfn(zone));
pfn = first_init_pfn;
@@ -1112,7 +1129,7 @@ void __defermem_init deferred_init_memmap(int nid)

for (; pfn < end_pfn; pfn++) {
if (!pfn_valid_within(pfn))
- continue;
+ goto free_range;

/*
* Ensure pfn_valid is checked every
@@ -1121,30 +1138,49 @@ void __defermem_init deferred_init_memmap(int nid)
if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
if (!pfn_valid(pfn)) {
page = NULL;
- continue;
+ goto free_range;
}
}

if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
page = NULL;
- continue;
+ goto free_range;
}

/* Minimise pfn page lookups and scheduler checks */
if (page && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0) {
page++;
} else {
+ deferred_free_range(free_base_page,
+ free_base_pfn, nr_to_free);
+ free_base_page = NULL;
+ free_base_pfn = nr_to_free = 0;
+
page = pfn_to_page(pfn);
cond_resched();
}

if (page->flags) {
VM_BUG_ON(page_zone(page) != zone);
- continue;
+ goto free_range;
}

__init_single_page(page, pfn, zid, nid);
- __free_pages_boot_core(page, pfn, 0);
+ if (!free_base_page) {
+ free_base_page = page;
+ free_base_pfn = pfn;
+ nr_to_free = 0;
+ }
+ nr_to_free++;
+
+ /* Where possible, batch up pages for a single free */
+ continue;
+free_range:
+ /* Free the current block of pages to allocator */
+ if (free_base_page)
+ deferred_free_range(free_base_page, free_base_pfn, nr_to_free);
+ free_base_page = NULL;
+ free_base_pfn = nr_to_free = 0;
}
first_init_pfn = max(end_pfn, first_init_pfn);
}
--
2.1.2

2015-04-13 10:18:26

by Mel Gorman

Subject: [PATCH 13/14] mm: meminit: Reduce number of times pageblocks are set during initialisation

During parallel memory initialisation, pageblock ranges are checked for
every PFN unnecessarily, which increases boot times. This patch alters when
the ranges are checked to reduce the work done for each page frame.
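
Concretely (editorial note, assuming x86-64 where a pageblock is typically
512 pages, i.e. 2M), the migratetype is now set once per pageblock boundary
in memmap_init_zone() rather than range-checking every PFN:

	if (!(pfn & (pageblock_nr_pages - 1))) {
		struct page *page = pfn_to_page(pfn);
		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
		__init_single_page(page, pfn, zone, nid);
	} else {
		__init_single_pfn(pfn, zone, nid);
	}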

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 45 +++++++++++++++++++++++----------------------
1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bacd97b0030e..b4c320beebc7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -875,33 +875,12 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
{
- struct zone *z = &NODE_DATA(nid)->node_zones[zone];
-
set_page_links(page, zone, nid, pfn);
mminit_verify_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);

- /*
- * Mark the block movable so that blocks are reserved for
- * movable at startup. This will force kernel allocations
- * to reserve their blocks rather than leaking throughout
- * the address space during boot when many long-lived
- * kernel allocations are made. Later some blocks near
- * the start are marked MIGRATE_RESERVE by
- * setup_zone_migrate_reserve()
- *
- * bitmap is created for zone's valid pfn range. but memmap
- * can be created for invalid pages (for alignment)
- * check here not to call set_pageblock_migratetype() against
- * pfn out of zone.
- */
- if ((z->zone_start_pfn <= pfn)
- && (pfn < zone_end_pfn(z))
- && !(pfn & (pageblock_nr_pages - 1)))
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-
INIT_LIST_HEAD(&page->lru);
#ifdef WANT_PAGE_VIRTUAL
/* The shift won't overflow because ZONE_NORMAL is below 4G. */
@@ -1083,6 +1062,7 @@ void __defermem_init deferred_free_range(struct page *page, unsigned long pfn,
int i;

if (nr_pages == MAX_ORDER_NR_PAGES && (pfn & (MAX_ORDER_NR_PAGES-1)) == 0) {
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
__free_pages_boot_core(page, pfn, MAX_ORDER-1);
return;
}
@@ -4490,7 +4470,28 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
if (!update_defer_init(pgdat, pfn, &nr_initialised))
break;
}
- __init_single_pfn(pfn, zone, nid);
+
+ /*
+ * Mark the block movable so that blocks are reserved for
+ * movable at startup. This will force kernel allocations
+ * to reserve their blocks rather than leaking throughout
+ * the address space during boot when many long-lived
+ * kernel allocations are made. Later some blocks near
+ * the start are marked MIGRATE_RESERVE by
+ * setup_zone_migrate_reserve()
+ *
+ * bitmap is created for zone's valid pfn range. but memmap
+ * can be created for invalid pages (for alignment)
+ * check here not to call set_pageblock_migratetype() against
+ * pfn out of zone.
+ */
+ if (!(pfn & (pageblock_nr_pages - 1))) {
+ struct page *page = pfn_to_page(pfn);
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+ __init_single_page(page, pfn, zone, nid);
+ } else {
+ __init_single_pfn(pfn, zone, nid);
+ }
}
}

--
2.1.2

2015-04-13 10:18:03

by Mel Gorman

Subject: [PATCH 14/14] mm: meminit: Remove mminit_verify_page_links

mminit_verify_page_links() is an extremely paranoid check that was introduced
when memory initialisation was being heavily reworked. Profiles indicated
that up to 10% of parallel memory initialisation was spent on checking
this for every page. The cost could be reduced but in practice this check
only found problems very early during the initialisation rewrite and has
found nothing since. This patch removes an expensive unnecessary check.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/internal.h | 8 --------
mm/mm_init.c | 8 --------
mm/page_alloc.c | 1 -
3 files changed, 17 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 2fad7c066e5c..3c8ac7d21dbd 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -360,10 +360,7 @@ do { \
} while (0)

extern void mminit_verify_pageflags_layout(void);
-extern void mminit_verify_page_links(struct page *page,
- enum zone_type zone, unsigned long nid, unsigned long pfn);
extern void mminit_verify_zonelist(void);
-
#else

static inline void mminit_dprintk(enum mminit_level level,
@@ -375,11 +372,6 @@ static inline void mminit_verify_pageflags_layout(void)
{
}

-static inline void mminit_verify_page_links(struct page *page,
- enum zone_type zone, unsigned long nid, unsigned long pfn)
-{
-}
-
static inline void mminit_verify_zonelist(void)
{
}
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 28fbf87b20aa..fdadf918de76 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -131,14 +131,6 @@ void __init mminit_verify_pageflags_layout(void)
BUG_ON(or_mask != add_mask);
}

-void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone,
- unsigned long nid, unsigned long pfn)
-{
- BUG_ON(page_to_nid(page) != nid);
- BUG_ON(page_zonenum(page) != zone);
- BUG_ON(page_to_pfn(page) != pfn);
-}
-
static __init int set_mminit_loglevel(char *str)
{
get_option(&str, &mminit_loglevel);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b4c320beebc7..f2c96d02662f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -876,7 +876,6 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
{
set_page_links(page, zone, nid, pfn);
- mminit_verify_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);
--
2.1.2

2015-04-13 10:29:31

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/14] Parallel memory initialisation

On Mon, Apr 13, 2015 at 11:16:52AM +0100, Mel Gorman wrote:
> Memory initialisation had been identified as one of the reasons why large
> machines take a long time to boot. Patches were posted a long time ago
> that attempted to move deferred initialisation into the page allocator
> paths. This was rejected on the grounds it should not be necessary to hurt
> the fast paths to parallelise initialisation. This series reuses much of
> the work from that time but defers the initialisation of memory to kswapd
> so that one thread per node initialises memory local to that node. The
> issue is that on the machines I tested with, memory initialisation was not
> a major contributor to boot times. I'm posting the RFC to both review the
> series and see if it actually helps users of very large machines.
>

Robin Holt's address now bounces so remove the address from any replies.
If anyone has an updated address for him that he wants to use then let
me know. Otherwise, I'll leave his From: and Signed-off-by lines at the
old address as it was accurate at the time.

--
Mel Gorman
SUSE Labs

2015-04-13 18:21:55

by Paul Bolle

[permalink] [raw]
Subject: Re: [PATCH 10/14] x86: mm: Enable deferred memory initialisation on x86-64

On Mon, 2015-04-13 at 11:17 +0100, Mel Gorman wrote:
> --- a/mm/Kconfig
> +++ b/mm/Kconfig

> +# For architectures that was to support deferred memory initialisation

s/was/want/?

> +config ARCH_SUPPORTS_DEFERRED_MEM_INIT
> + bool

Thanks,


Paul Bolle

2015-04-15 13:16:18

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/14] Parallel memory initialisation

On 04/13/2015 06:16 AM, Mel Gorman wrote:
> Memory initialisation had been identified as one of the reasons why large
> machines take a long time to boot. Patches were posted a long time ago
> that attempted to move deferred initialisation into the page allocator
> paths. This was rejected on the grounds it should not be necessary to hurt
> the fast paths to parallelise initialisation. This series reuses much of
> the work from that time but defers the initialisation of memory to kswapd
> so that one thread per node initialises memory local to that node. The
> issue is that on the machines I tested with, memory initialisation was not
> a major contributor to boot times. I'm posting the RFC to both review the
> series and see if it actually helps users of very large machines.
>
> After applying the series and setting the appropriate Kconfig variable I
> see this in the boot log on a 64G machine
>
> [ 7.383764] kswapd 0 initialised deferred memory in 188ms
> [ 7.404253] kswapd 1 initialised deferred memory in 208ms
> [ 7.411044] kswapd 3 initialised deferred memory in 216ms
> [ 7.411551] kswapd 2 initialised deferred memory in 216ms
>
> On a 1TB machine, I see
>
> [ 11.913324] kswapd 0 initialised deferred memory in 1168ms
> [ 12.220011] kswapd 2 initialised deferred memory in 1476ms
> [ 12.245369] kswapd 3 initialised deferred memory in 1500ms
> [ 12.271680] kswapd 1 initialised deferred memory in 1528ms
>
> Once booted the machine appears to work as normal. Boot times were measured
> from the time shutdown was called until ssh was available again. In the
> 64G case, the boot time savings are negligible. On the 1TB machine, the
> savings were 10 seconds (about 8% improvement on kernel times but 1-2%
> overall as POST takes so long).
>
> It would be nice if the people that have access to really large machines
> would test this series and report back if the complexity is justified.
>
> Patches are against 4.0-rc7.
>
> Documentation/kernel-parameters.txt | 8 +
> arch/ia64/mm/numa.c | 19 +-
> arch/x86/Kconfig | 2 +
> include/linux/memblock.h | 18 ++
> include/linux/mm.h | 8 +-
> include/linux/mmzone.h | 37 +++-
> init/main.c | 1 +
> mm/Kconfig | 29 +++
> mm/bootmem.c | 6 +-
> mm/internal.h | 23 ++-
> mm/memblock.c | 34 ++-
> mm/mm_init.c | 9 +-
> mm/nobootmem.c | 7 +-
> mm/page_alloc.c | 398 +++++++++++++++++++++++++++++++-----
> mm/vmscan.c | 6 +-
> 15 files changed, 507 insertions(+), 98 deletions(-)
>

I had included your patch with the 4.0 kernel and booted up a 16-socket
12-TB machine. I measured the elapsed time from the elilo prompt to the
availability of ssh login. Without the patch, the bootup time was 404s.
It was reduced to 298s with the patch. So there was about 100s reduction
in bootup time (1/4 of the total).

However, there were 2 bootup problems in the dmesg log that needed to be
addressed.
1. There were 2 vmalloc allocation failures:
[ 2.284686] vmalloc: allocation failure, allocated 16578404352 of
17179873280 bytes
[ 10.399938] vmalloc: allocation failure, allocated 7970922496 of
8589938688 bytes

2. There were 2 soft lockup warnings:
[ 57.319453] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s!
[swapper/0:1]
[ 85.409263] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s!
[swapper/0:1]

Once those problems are fixed, the patch should be in a pretty good
shape. I have attached the dmesg log for your reference.

Cheers,
Longman


Attachments:
dmesg-4.0-Mel-mm-patch.txt (508.52 kB)

2015-04-15 13:38:40

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/14] Parallel memory initialisation

On Wed, Apr 15, 2015 at 09:15:50AM -0400, Waiman Long wrote:
> > <SNIP>
> >Patches are against 4.0-rc7.
> >
> > Documentation/kernel-parameters.txt | 8 +
> > arch/ia64/mm/numa.c | 19 +-
> > arch/x86/Kconfig | 2 +
> > include/linux/memblock.h | 18 ++
> > include/linux/mm.h | 8 +-
> > include/linux/mmzone.h | 37 +++-
> > init/main.c | 1 +
> > mm/Kconfig | 29 +++
> > mm/bootmem.c | 6 +-
> > mm/internal.h | 23 ++-
> > mm/memblock.c | 34 ++-
> > mm/mm_init.c | 9 +-
> > mm/nobootmem.c | 7 +-
> > mm/page_alloc.c | 398 +++++++++++++++++++++++++++++++-----
> > mm/vmscan.c | 6 +-
> > 15 files changed, 507 insertions(+), 98 deletions(-)
> >
>
> I had included your patch with the 4.0 kernel and booted up a
> 16-socket 12-TB machine. I measured the elapsed time from the elilo
> prompt to the availability of ssh login. Without the patch, the
> bootup time was 404s. It was reduced to 298s with the patch. So
> there was about 100s reduction in bootup time (1/4 of the total).
>

Cool, thanks for testing. Would you be able to state if this is really
important or not? Does booting 100 seconds faster on a 12TB machine really
matter? I can then add that justification to the changelog to avoid a
conversation with Andrew that goes something like

Andrew: Why are we doing this?
Mel: Because we can and apparently people might want it.
Andrew: What's the maintenance cost of this?
Mel: Magic beans

I prefer talking to Andrew when it's harder to predict what he'll say.

> However, there were 2 bootup problems in the dmesg log that needed
> to be addressed.
> 1. There were 2 vmalloc allocation failures:
> [ 2.284686] vmalloc: allocation failure, allocated 16578404352 of
> 17179873280 bytes
> [ 10.399938] vmalloc: allocation failure, allocated 7970922496 of
> 8589938688 bytes
>
> 2. There were 2 soft lockup warnings:
> [ 57.319453] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s!
> [swapper/0:1]
> [ 85.409263] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s!
> [swapper/0:1]
>
> Once those problems are fixed, the patch should be in a pretty good
> shape. I have attached the dmesg log for your reference.
>

The obvious conclusion is that initialising 1G per node is not enough for
really large machines. Can you try this on top? It's untested but should
work. The low value was chosen because it happened to work and I wanted
to get test coverage on common hardware but broke is broke.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f2c96d02662f..6b3bec304e35 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -276,9 +276,9 @@ static inline bool update_defer_init(pg_data_t *pgdat,
if (pgdat->first_deferred_pfn != ULONG_MAX)
return false;

- /* Initialise at least 1G per zone */
+ /* Initialise at least 32G per node */
(*nr_initialised)++;
- if (*nr_initialised > (1UL << (30 - PAGE_SHIFT)) &&
+ if (*nr_initialised > (32UL << (30 - PAGE_SHIFT)) &&
(pfn & (PAGES_PER_SECTION - 1)) == 0) {
pgdat->first_deferred_pfn = pfn;
return false;
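
For reference, with 4K pages 1UL << (30 - PAGE_SHIFT) works out to 262144
pages, i.e. 1G worth of struct pages initialised serially per node before
the rest is deferred, so the change above raises that to 8388608 pages,
or 32G per node.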

2015-04-15 14:27:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 0/14] Parallel memory initialisation

On Wed, Apr 15, 2015 at 09:15:50AM -0400, Waiman Long wrote:
> I had included your patch with the 4.0 kernel and booted up a 16-socket
> 12-TB machine. I measured the elapsed time from the elilo prompt to the
> availability of ssh login. Without the patch, the bootup time was 404s. It
> was reduced to 298s with the patch. So there was about 100s reduction in
> bootup time (1/4 of the total).

But you cheat! :-)

How long between power on and the elilo prompt? Do the 100 seconds
matter on that time scale?

2015-04-15 14:34:33

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/14] Parallel memory initialisation

On Wed, Apr 15, 2015 at 04:27:31PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 15, 2015 at 09:15:50AM -0400, Waiman Long wrote:
> > I had included your patch with the 4.0 kernel and booted up a 16-socket
> > 12-TB machine. I measured the elapsed time from the elilo prompt to the
> > availability of ssh login. Without the patch, the bootup time was 404s. It
> > was reduced to 298s with the patch. So there was about 100s reduction in
> > bootup time (1/4 of the total).
>
> But you cheat! :-)
>
> How long between power on and the elilo prompt? Do the 100 seconds
> matter on that time scale?

Calling it cheating is a *bit* harsh as the POST times vary considerably
between manufacturers. While I'm interested in Waiman's answer, I'm told
that those that really care about minimising reboot times will use kexec
to avoid POST. The 100 seconds is 100 seconds, whether that is 25% in
all cases is a different matter.

--
Mel Gorman
SUSE Labs

2015-04-15 14:48:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 0/14] Parallel memory initialisation

On Wed, Apr 15, 2015 at 03:34:20PM +0100, Mel Gorman wrote:
> On Wed, Apr 15, 2015 at 04:27:31PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 15, 2015 at 09:15:50AM -0400, Waiman Long wrote:
> > > I had included your patch with the 4.0 kernel and booted up a 16-socket
> > > 12-TB machine. I measured the elapsed time from the elilo prompt to the
> > > availability of ssh login. Without the patch, the bootup time was 404s. It
> > > was reduced to 298s with the patch. So there was about 100s reduction in
> > > bootup time (1/4 of the total).
> >
> > But you cheat! :-)
> >
> > How long between power on and the elilo prompt? Do the 100 seconds
> > matter on that time scale?
>
> Calling it cheating is a *bit* harsh as the POST times vary considerably
> between manufacturers. While I'm interested in Waiman's answer, I'm told
> that those that really care about minimising reboot times will use kexec
> to avoid POST. The 100 seconds is 100 seconds, whether that is 25% in
> all cases is a different matter.

Sure POST times vary, but it's consistently stupid long :-) I'm forever
thinking my EX machine died because it's not coming back from a power
cycle, and mine isn't really _that_ large.

2015-04-15 14:50:57

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/14] Parallel memory initialisation

On 04/15/2015 09:38 AM, Mel Gorman wrote:
> On Wed, Apr 15, 2015 at 09:15:50AM -0400, Waiman Long wrote:
>>> <SNIP>
>>> Patches are against 4.0-rc7.
>>>
>>> Documentation/kernel-parameters.txt | 8 +
>>> arch/ia64/mm/numa.c | 19 +-
>>> arch/x86/Kconfig | 2 +
>>> include/linux/memblock.h | 18 ++
>>> include/linux/mm.h | 8 +-
>>> include/linux/mmzone.h | 37 +++-
>>> init/main.c | 1 +
>>> mm/Kconfig | 29 +++
>>> mm/bootmem.c | 6 +-
>>> mm/internal.h | 23 ++-
>>> mm/memblock.c | 34 ++-
>>> mm/mm_init.c | 9 +-
>>> mm/nobootmem.c | 7 +-
>>> mm/page_alloc.c | 398 +++++++++++++++++++++++++++++++-----
>>> mm/vmscan.c | 6 +-
>>> 15 files changed, 507 insertions(+), 98 deletions(-)
>>>
>> I had included your patch with the 4.0 kernel and booted up a
>> 16-socket 12-TB machine. I measured the elapsed time from the elilo
>> prompt to the availability of ssh login. Without the patch, the
>> bootup time was 404s. It was reduced to 298s with the patch. So
>> there was about 100s reduction in bootup time (1/4 of the total).
>>
> Cool, thanks for testing. Would you be able to state if this is really
> important or not? Does booting 100 seconds faster on a 12TB machine really
> matter? I can then add that justification to the changelog to avoid a
> conversation with Andrew that goes something like
>
> Andrew: Why are we doing this?
> Mel: Because we can and apparently people might want it.
> Andrew: What's the maintenance cost of this?
> Mel: Magic beans
>
> I prefer talking to Andrew when it's harder to predict what he'll say.

Booting 100s faster is certainly something that is nice to have. Right
now, more time is spent in the firmware POST portion of the bootup
process than in the OS boot. So I would say this patch isn't really
critical right now as machines with that much memory are relatively
rare. However, if we look forward to the near future, some new memory
technology like persistent memory is coming and machines with large
amount of memory (whether persistent or not) will become more common.
This patch will certainly be useful if we look forward into the future.

>> However, there were 2 bootup problems in the dmesg log that needed
>> to be addressed.
>> 1. There were 2 vmalloc allocation failures:
>> [ 2.284686] vmalloc: allocation failure, allocated 16578404352 of
>> 17179873280 bytes
>> [ 10.399938] vmalloc: allocation failure, allocated 7970922496 of
>> 8589938688 bytes
>>
>> 2. There were 2 soft lockup warnings:
>> [ 57.319453] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s!
>> [swapper/0:1]
>> [ 85.409263] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s!
>> [swapper/0:1]
>>
>> Once those problems are fixed, the patch should be in a pretty good
>> shape. I have attached the dmesg log for your reference.
>>
> The obvious conclusion is that initialising 1G per node is not enough for
> really large machines. Can you try this on top? It's untested but should
> work. The low value was chosen because it happened to work and I wanted
> to get test coverage on common hardware but broke is broke.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f2c96d02662f..6b3bec304e35 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -276,9 +276,9 @@ static inline bool update_defer_init(pg_data_t *pgdat,
> if (pgdat->first_deferred_pfn != ULONG_MAX)
> return false;
>
> - /* Initialise at least 1G per zone */
> + /* Initialise at least 32G per node */
> (*nr_initialised)++;
> - if (*nr_initialised > (1UL << (30 - PAGE_SHIFT)) &&
> + if (*nr_initialised > (32UL << (30 - PAGE_SHIFT)) &&
> (pfn & (PAGES_PER_SECTION - 1)) == 0) {
> pgdat->first_deferred_pfn = pfn;
> return false;

I will try this out when I can get hold of the 12-TB machine again.

The vmalloc allocation failures were for the following hash tables:
- Dentry cache hash table entries
- Inode-cache hash table entries

Those hash tables scale linearly with the amount of memory available in
the system. So instead of hardcoding a certain value, why don't we make
it a certain % of the total memory but bottomed out to 1G at the low end?

Cheers,
Longman

2015-04-15 15:44:34

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/14] Parallel memory initialisation

On Wed, Apr 15, 2015 at 10:50:45AM -0400, Waiman Long wrote:
> On 04/15/2015 09:38 AM, Mel Gorman wrote:
> >On Wed, Apr 15, 2015 at 09:15:50AM -0400, Waiman Long wrote:
> >>><SNIP>
> >>>Patches are against 4.0-rc7.
> >>>
> >>> Documentation/kernel-parameters.txt | 8 +
> >>> arch/ia64/mm/numa.c | 19 +-
> >>> arch/x86/Kconfig | 2 +
> >>> include/linux/memblock.h | 18 ++
> >>> include/linux/mm.h | 8 +-
> >>> include/linux/mmzone.h | 37 +++-
> >>> init/main.c | 1 +
> >>> mm/Kconfig | 29 +++
> >>> mm/bootmem.c | 6 +-
> >>> mm/internal.h | 23 ++-
> >>> mm/memblock.c | 34 ++-
> >>> mm/mm_init.c | 9 +-
> >>> mm/nobootmem.c | 7 +-
> >>> mm/page_alloc.c | 398 +++++++++++++++++++++++++++++++-----
> >>> mm/vmscan.c | 6 +-
> >>> 15 files changed, 507 insertions(+), 98 deletions(-)
> >>>
> >>I had included your patch with the 4.0 kernel and booted up a
> >>16-socket 12-TB machine. I measured the elapsed time from the elilo
> >>prompt to the availability of ssh login. Without the patch, the
> >>bootup time was 404s. It was reduced to 298s with the patch. So
> >>there was about 100s reduction in bootup time (1/4 of the total).
> >>
> >Cool, thanks for testing. Would you be able to state if this is really
> >important or not? Does booting 100 seconds faster on a 12TB machine really
> >matter? I can then add that justification to the changelog to avoid a
> >conversation with Andrew that goes something like
> >
> >Andrew: Why are we doing this?
> >Mel: Because we can and apparently people might want it.
> >Andrew: What's the maintenance cost of this?
> >Mel: Magic beans
> >
> >I prefer talking to Andrew when it's harder to predict what he'll say.
>
> Booting 100s faster is certainly something that is nice to have.
> Right now, more time is spent in the firmware POST portion of the
> bootup process than in the OS boot.

I'm not surprised. On two different 1TB machines, I've seen a post time
of 2 minutes and one of 35. No idea what it's doing for 35 minutes....
plotting world domination probably.

> So I would say this patch isn't
> really critical right now as machines with that much memory are
> relatively rare. However, if we look forward to the near future,
> some new memory technology like persistent memory is coming and
> machines with large amounts of memory (whether persistent or not)
> will become more common. This patch will certainly be useful if we
> look forward into the future.
>

Whether persistent memory needs struct pages or not is up in the air and
I'm not getting stuck in that can of worms. 100 seconds off kernel init
time is a starting point. I can try pushing it on on that basis but I
really would like to see SGI and Intel people also chime in on how it
affects their really large machines.

> >>However, there were 2 bootup problems in the dmesg log that needed
> >>to be addressed.
> >>1. There were 2 vmalloc allocation failures:
> >>[ 2.284686] vmalloc: allocation failure, allocated 16578404352 of
> >>17179873280 bytes
> >>[ 10.399938] vmalloc: allocation failure, allocated 7970922496 of
> >>8589938688 bytes
> >>
> >>2. There were 2 soft lockup warnings:
> >>[ 57.319453] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s!
> >>[swapper/0:1]
> >>[ 85.409263] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s!
> >>[swapper/0:1]
> >>
> >>Once those problems are fixed, the patch should be in a pretty good
> >>shape. I have attached the dmesg log for your reference.
> >>
> >The obvious conclusion is that initialising 1G per node is not enough for
> >really large machines. Can you try this on top? It's untested but should
> >work. The low value was chosen because it happened to work and I wanted
> >to get test coverage on common hardware but broke is broke.
> >
> >diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >index f2c96d02662f..6b3bec304e35 100644
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -276,9 +276,9 @@ static inline bool update_defer_init(pg_data_t *pgdat,
> > if (pgdat->first_deferred_pfn != ULONG_MAX)
> > return false;
> >
> >- /* Initialise at least 1G per zone */
> >+ /* Initialise at least 32G per node */
> > (*nr_initialised)++;
> >- if (*nr_initialised > (1UL << (30 - PAGE_SHIFT)) &&
> >+ if (*nr_initialised > (32UL << (30 - PAGE_SHIFT)) &&
> > (pfn & (PAGES_PER_SECTION - 1)) == 0) {
> > pgdat->first_deferred_pfn = pfn;
> > return false;
>
> I will try this out when I can get hold of the 12-TB machine again.
>

Thanks.

> The vmalloc allocation failures were for the following hash tables:
> - Dentry cache hash table entries
> - Inode-cache hash table entries
>
> Those hash tables scale linearly with the amount of memory available
> in the system. So instead of hardcoding a certain value, why don't
> we make it a certain % of the total memory but bottomed out to 1G at
> the low end?
>

Because then it becomes what percentage is the right percentage and what
happens if it's a percentage of total memory but the NUMA nodes are not
all the same size? I want to start simple until there is more data on
what these really large machines look like. If it ever fails in the
field, there is the command-line switch until a patch is available.

--
Mel Gorman
SUSE Labs

2015-04-15 16:19:08

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/14] Parallel memory initialisation

On 04/15/2015 10:48 AM, Peter Zijlstra wrote:
> On Wed, Apr 15, 2015 at 03:34:20PM +0100, Mel Gorman wrote:
>> On Wed, Apr 15, 2015 at 04:27:31PM +0200, Peter Zijlstra wrote:
>>> On Wed, Apr 15, 2015 at 09:15:50AM -0400, Waiman Long wrote:
>>>> I had included your patch with the 4.0 kernel and booted up a 16-socket
>>>> 12-TB machine. I measured the elapsed time from the elilo prompt to the
>>>> availability of ssh login. Without the patch, the bootup time was 404s. It
>>>> was reduced to 298s with the patch. So there was about 100s reduction in
>>>> bootup time (1/4 of the total).
>>> But you cheat! :-)
>>>
>>> How long between power on and the elilo prompt? Do the 100 seconds
>>> matter on that time scale?
>> Calling it cheating is a *bit* harsh as the POST times vary considerably
>> between manufacturers. While I'm interested in Waiman's answer, I'm told
>> that those that really care about minimising reboot times will use kexec
>> to avoid POST. The 100 seconds is 100 seconds, whether that is 25% in
>> all cases is a different matter.
> Sure POST times vary, but it's consistently stupid long :-) I'm forever
> thinking my EX machine died because it's not coming back from a power
> cycle, and mine isn't really _that_ large.

I agree with that. I always complain about the long POST time of those
server machines.

As for Mel's patch, what I wanted to show is its impact on the OS bootup
part of the boot process. We have no control over how long the firmware
POST is, so there is no point in lumping it into the discussion.

Cheers,
Longman

2015-04-15 16:44:18

by Norton, Scott J

[permalink] [raw]
Subject: RE: [RFC PATCH 0/14] Parallel memory initialisation


On 04/15/2015 10:48 AM, Peter Zijlstra wrote:
> On Wed, Apr 15, 2015 at 03:34:20PM +0100, Mel Gorman wrote:
>> On Wed, Apr 15, 2015 at 04:27:31PM +0200, Peter Zijlstra wrote:
>>> On Wed, Apr 15, 2015 at 09:15:50AM -0400, Waiman Long wrote:
>>>> I had included your patch with the 4.0 kernel and booted up a
>>>> 16-socket 12-TB machine. I measured the elapsed time from the elilo
>>>> prompt to the availability of ssh login. Without the patch, the
>>>> bootup time was 404s. It was reduced to 298s with the patch. So
>>>> there was about 100s reduction in bootup time (1/4 of the total).
>>> But you cheat! :-)
>>>
>>> How long between power on and the elilo prompt? Do the 100 seconds
>>> matter on that time scale?
>>>
>> Calling it cheating is a *bit* harsh as the POST times vary
>> considerably between manufacturers. While I'm interested in Waiman's
>> answer, I'm told that those that really care about minimising reboot
>> times will use kexec to avoid POST. The 100 seconds is 100 seconds,
>> whether that is 25% in all cases is a different matter.
>>
> Sure POST times vary, but it's consistently stupid long :-) I'm forever
> thinking my EX machine died because it's not coming back from a power
> cycle, and mine isn't really _that_ large.

Yes, 100 seconds really does matter and is a big deal. When a business has one of
these large machines go down, their business is stopped (unless they have a
fast failover solution in place). Every minute and second the machine is down
is crucial to these businesses. The fact that POST times can be so long makes it
even more important that we make the kernel boot as fast as possible.

Scott

2015-04-15 21:46:13

by Nathan Zimmer

[permalink] [raw]
Subject: Re: [RFC PATCH 0/14] Parallel memory initialisation


On 04/15/2015 10:44 AM, Mel Gorman wrote:
> On Wed, Apr 15, 2015 at 10:50:45AM -0400, Waiman Long wrote:
>> On 04/15/2015 09:38 AM, Mel Gorman wrote:
>>> On Wed, Apr 15, 2015 at 09:15:50AM -0400, Waiman Long wrote:
>>>>> <SNIP>
>>>>> Patches are against 4.0-rc7.
>>>>>
>>>>> Documentation/kernel-parameters.txt | 8 +
>>>>> arch/ia64/mm/numa.c | 19 +-
>>>>> arch/x86/Kconfig | 2 +
>>>>> include/linux/memblock.h | 18 ++
>>>>> include/linux/mm.h | 8 +-
>>>>> include/linux/mmzone.h | 37 +++-
>>>>> init/main.c | 1 +
>>>>> mm/Kconfig | 29 +++
>>>>> mm/bootmem.c | 6 +-
>>>>> mm/internal.h | 23 ++-
>>>>> mm/memblock.c | 34 ++-
>>>>> mm/mm_init.c | 9 +-
>>>>> mm/nobootmem.c | 7 +-
>>>>> mm/page_alloc.c | 398 +++++++++++++++++++++++++++++++-----
>>>>> mm/vmscan.c | 6 +-
>>>>> 15 files changed, 507 insertions(+), 98 deletions(-)
>>>>>
>>>> I had included your patch with the 4.0 kernel and booted up a
>>>> 16-socket 12-TB machine. I measured the elapsed time from the elilo
>>>> prompt to the availability of ssh login. Without the patch, the
>>>> bootup time was 404s. It was reduced to 298s with the patch. So
>>>> there was about 100s reduction in bootup time (1/4 of the total).
>>>>
>>> Cool, thanks for testing. Would you be able to state if this is really
>>> important or not? Does booting 100 seconds faster on a 12TB machine really
>>> matter? I can then add that justification to the changelog to avoid a
>>> conversation with Andrew that goes something like
>>>
>>> Andrew: Why are we doing this?
>>> Mel: Because we can and apparently people might want it.
>>> Andrew: What's the maintenance cost of this?
>>> Mel: Magic beans
>>>
>>> I prefer talking to Andrew when it's harder to predict what he'll say.
>> Booting 100s faster is certainly something that is nice to have.
>> Right now, more time is spent in the firmware POST portion of the
>> bootup process than in the OS boot.
> I'm not surprised. On two different 1TB machines, I've seen a post time
> of 2 minutes and one of 35. No idea what it's doing for 35 minutes....
> plotting world domination probably.
>
>> So I would say this patch isn't
>> really critical right now as machines with that much memory are
>> relatively rare. However, if we look forward to the near future,
>> some new memory technology like persistent memory is coming and
>> machines with large amounts of memory (whether persistent or not)
>> will become more common. This patch will certainly be useful if we
>> look forward into the future.
>>
> Whether persistent memory needs struct pages or not is up in the air and
> I'm not getting stuck in that can of worms. 100 seconds off kernel init
> time is a starting point. I can try pushing it on on that basis but I
> really would like to see SGI and Intel people also chime in on how it
> affects their really large machines.
>
I will get some numbers from this patch set but I haven't had the
opportunity yet. I will grab them this weekend for sure if I can't get
machine time sooner.


>>>> However, there were 2 bootup problems in the dmesg log that needed
>>>> to be addressed.
>>>> 1. There were 2 vmalloc allocation failures:
>>>> [ 2.284686] vmalloc: allocation failure, allocated 16578404352 of
>>>> 17179873280 bytes
>>>> [ 10.399938] vmalloc: allocation failure, allocated 7970922496 of
>>>> 8589938688 bytes
>>>>
>>>> 2. There were 2 soft lockup warnings:
>>>> [ 57.319453] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s!
>>>> [swapper/0:1]
>>>> [ 85.409263] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s!
>>>> [swapper/0:1]
>>>>
>>>> Once those problems are fixed, the patch should be in a pretty good
>>>> shape. I have attached the dmesg log for your reference.
>>>>
>>> The obvious conclusion is that initialising 1G per node is not enough for
>>> really large machines. Can you try this on top? It's untested but should
>>> work. The low value was chosen because it happened to work and I wanted
>>> to get test coverage on common hardware but broke is broke.
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index f2c96d02662f..6b3bec304e35 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -276,9 +276,9 @@ static inline bool update_defer_init(pg_data_t *pgdat,
>>> if (pgdat->first_deferred_pfn != ULONG_MAX)
>>> return false;
>>>
>>> - /* Initialise at least 1G per zone */
>>> + /* Initialise at least 32G per node */
>>> (*nr_initialised)++;
>>> - if (*nr_initialised > (1UL << (30 - PAGE_SHIFT)) &&
>>> + if (*nr_initialised > (32UL << (30 - PAGE_SHIFT)) &&
>>> (pfn & (PAGES_PER_SECTION - 1)) == 0) {
>>> pgdat->first_deferred_pfn = pfn;
>>> return false;
>> I will try this out when I can get hold of the 12-TB machine again.
>>
> Thanks.
>
>> The vmalloc allocation failures were for the following hash tables:
>> - Dentry cache hash table entries
>> - Inode-cache hash table entries
>>
>> Those hash tables scale linearly with the amount of memory available
>> in the system. So instead of hardcoding a certain value, why don't
>> we make it a certain % of the total memory but bottomed out to 1G at
>> the low end?
>>
> Because then it becomes what percentage is the right percentage and what
> happens if it's a percentage of total memory but the NUMA nodes are not
> all the same size? I want to start simple until there is more data on
> what these really large machines look like. If it ever fails in the
> field, there is the command-line switch until a patch is available.
>

2015-04-16 07:19:45

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC PATCH 0/14] Parallel memory initialisation

On Mon, 13 Apr 2015 11:16:52 +0100 Mel Gorman <[email protected]> wrote:

> Memory initialisation

I wish we didn't call this "memory initialization". Because memory
initialization is memset(), and that isn't what we're doing here.

Installation? Bringup?

> had been identified as one of the reasons why large
> machines take a long time to boot. Patches were posted a long time ago
> that attempted to move deferred initialisation into the page allocator
> paths. This was rejected on the grounds it should not be necessary to hurt
> the fast paths to parallelise initialisation. This series reuses much of
> the work from that time but defers the initialisation of memory to kswapd
> so that one thread per node initialises memory local to that node. The
> issue is that on the machines I tested with, memory initialisation was not
> a major contributor to boot times. I'm posting the RFC to both review the
> series and see if it actually helps users of very large machines.
>
> ...
>
> 15 files changed, 507 insertions(+), 98 deletions(-)

Sadface at how large and complex this is. I'd hoped the way we were
going to do this was by bringing up a bit of memory to get booted up,
then later on we just fake a bunch of memory hot-add operations. So
the new code would be pretty small and quite high-level.

2015-04-16 08:46:43

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/14] Parallel memory initialisation

On Thu, Apr 16, 2015 at 12:25:01AM -0700, Andrew Morton wrote:
> On Mon, 13 Apr 2015 11:16:52 +0100 Mel Gorman <[email protected]> wrote:
>
> > Memory initialisation
>
> I wish we didn't call this "memory initialization". Because memory
> initialization is memset(), and that isn't what we're doing here.
>
> Installation? Bringup?
>

It's about linking the struct pages to their physical page frame so
"Parallel struct page initialisation"?

> > had been identified as one of the reasons why large
> > machines take a long time to boot. Patches were posted a long time ago
> > that attempted to move deferred initialisation into the page allocator
> > paths. This was rejected on the grounds it should not be necessary to hurt
> > the fast paths to parallelise initialisation. This series reuses much of
> > the work from that time but defers the initialisation of memory to kswapd
> > so that one thread per node initialises memory local to that node. The
> > issue is that on the machines I tested with, memory initialisation was not
> > a major contributor to boot times. I'm posting the RFC to both review the
> > series and see if it actually helps users of very large machines.
> >
> > ...
> >
> > 15 files changed, 507 insertions(+), 98 deletions(-)
>
> Sadface at how large and complex this is.

The vast bulk of the complexity is in one patch "mm: meminit: Initialise
remaining memory in parallel with kswapd" which is

mm/internal.h | 6 +++++
mm/mm_init.c | 1 +
mm/page_alloc.c | 116 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
mm/vmscan.c | 6 +++--
4 files changed, 123 insertions(+), 6 deletions(-)

Most of that is a fairly straight-forward walk through zones and pfns with
bounds checking. A lot of the remaining complexity is in helpers that are
very similar to existing ones (but not suitable for sharing code) and in
optimisations. The optimisations in later patches cut the parallel struct
page initialisation time by 80%.
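
For a rough idea of the shape of that walk, it boils down to something
like the following (heavily simplified and not the actual code -- the
wrapper name is made up, and the real patch batches the frees in
MAX_ORDER chunks and is more careful about bounds and holes):

static void init_deferred_node_memmap(pg_data_t *pgdat, int nid)
{
	unsigned long pfn;
	int zid;

	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
		struct zone *zone = &pgdat->node_zones[zid];

		for (pfn = max(zone->zone_start_pfn,
			       pgdat->first_deferred_pfn);
		     pfn < zone_end_pfn(zone); pfn++) {
			/* Skip holes and ranges belonging to other nodes */
			if (!pfn_valid(pfn))
				continue;
			if (early_pfn_to_nid(pfn) != nid)
				continue;

			/* Link the struct page to its pfn/zone/node... */
			__init_single_pfn(pfn, zid, nid);
			/* ...and hand it to the buddy allocator */
			__free_pages_boot_core(pfn_to_page(pfn), pfn, 0);
		}
	}
}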

> I'd hoped the way we were
> going to do this was by bringing up a bit of memory to get booted up,
> then later on we just fake a bunch of memory hot-add operations. So
> the new code would be pretty small and quite high-level.

That ends up being very complex but of a very different shape. We would
still have to prevent the sections being initialised similar to what this
series does already except the zone boundaries are lower. It's not as
simple as faking mem= because we want local memory on each node during
initialisation.

Later after device_init when sysfs is setup we would then have to walk all
possible sections to discover pluggable memory and hot-add them. However,
when doing it, we would want to first discover what node that section is
local to and ideally skip over the ones that are not local to the thread
doing the work. This means all threads have to scan all sections instead
of this approach which can walk within its own PFN. It then adds pages
one at a time which is slow although obviously that part could be addressed.

This would be harder to co-ordinate as kswapd is up and running before
the memory hot-add structures are finalised so it would need either a
semaphore or different threads to do the initialisation. The user-visible
impact is then that early in boot, the total amount of memory appears to
be rapidly increasing instead of this approach where the amount of free
memory is increasing.

Conceptually it's straight forward but the details end up being a lot
more complex than this approach.

--
Mel Gorman
SUSE Labs

2015-04-16 17:26:46

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC PATCH 0/14] Parallel memory initialisation

On Thu, 16 Apr 2015 09:46:09 +0100 Mel Gorman <[email protected]> wrote:

> On Thu, Apr 16, 2015 at 12:25:01AM -0700, Andrew Morton wrote:
> > On Mon, 13 Apr 2015 11:16:52 +0100 Mel Gorman <[email protected]> wrote:
> >
> > > Memory initialisation
> >
> > I wish we didn't call this "memory initialization". Because memory
> > initialization is memset(), and that isn't what we're doing here.
> >
> > Installation? Bringup?
> >
>
> It's about linking the struct pages to their physical page frame so
> "Parallel struct page initialisation"?

Works for me.

> > I'd hoped the way we were
> > going to do this was by bringing up a bit of memory to get booted up,
> > then later on we just fake a bunch of memory hot-add operations. So
> > the new code would be pretty small and quite high-level.
>
> That ends up being very complex but of a very different shape. We would
> still have to prevent the sections being initialised similar to what this
> series does already except the zone boundaries are lower. It's not as
> simple as faking mem= because we want local memory on each node during
> initialisation.

Why do "we want..."?

> Later after device_init when sysfs is setup we would then have to walk all
> possible sections to discover pluggable memory and hot-add them. However,
> when doing it, we would want to first discover what node that section is
> local to and ideally skip over the ones that are not local to the thread
> doing the work. This means all threads have to scan all sections instead
> of this approach which can walk within its own PFN. It then adds pages
> one at a time which is slow although obviously that part could be addressed.
>
> This would be harder to co-ordinate as kswapd is up and running before
> the memory hot-add structures are finalised so it would need either a
> semaphore or different threads to do the initialisation. The user-visible
> impact is then that early in boot, the total amount of memory appears to
> be rapidly increasing instead of this approach where the amount of free
> memory is increasing.
>
> Conceptually it's straight forward but the details end up being a lot
> more complex than this approach.

Could we do most of the think work in userspace, emit a bunch of
low-level hotplug operations to the kernel?

2015-04-16 17:38:12

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/14] Parallel memory initialisation

On Thu, Apr 16, 2015 at 10:26:35AM -0700, Andrew Morton wrote:
> On Thu, 16 Apr 2015 09:46:09 +0100 Mel Gorman <[email protected]> wrote:
>
> > On Thu, Apr 16, 2015 at 12:25:01AM -0700, Andrew Morton wrote:
> > > On Mon, 13 Apr 2015 11:16:52 +0100 Mel Gorman <[email protected]> wrote:
> > >
> > > > Memory initialisation
> > >
> > > I wish we didn't call this "memory initialization". Because memory
> > > initialization is memset(), and that isn't what we're doing here.
> > >
> > > Installation? Bringup?
> > >
> >
> > It's about linking the struct pages to their physical page frame so
> > "Parallel struct page initialisation"?
>
> Works for me.
>
> > > I'd hoped the way we were
> > > going to do this was by bringing up a bit of memory to get booted up,
> > > then later on we just fake a bunch of memory hot-add operations. So
> > > the new code would be pretty small and quite high-level.
> >
> > That ends up being very complex but of a very different shape. We would
> > still have to prevent the sections being initialised similar to what this
> > series does already except the zone boundaries are lower. It's not as
> > simple as faking mem= because we want local memory on each node during
> > initialisation.
>
> Why do "we want..."?
>

Speed mostly. The memmaps are local to a node so if this is going to be
parallelised then it makes sense to use local CPUs. It's why I used
kswapd to do the initialisation -- it's close to the struct pages being
initialised.

> > Later after device_init when sysfs is setup we would then have to walk all
> > possible sections to discover pluggable memory and hot-add them. However,
> > when doing it, we would want to first discover what node that section is
> > local to and ideally skip over the ones that are not local to the thread
> > doing the work. This means all threads have to scan all sections instead
> > of this approach which can walk within its own PFN. It then adds pages
> > one at a time which is slow although obviously that part could be addressed.
> >
> > This would be harder to co-ordinate as kswapd is up and running before
> > the memory hot-add structures are finalised so it would need either a
> > semaphore or different threads to do the initialisation. The user-visible
> > impact is then that early in boot, the total amount of memory appears to
> > be rapidly increasing instead of this approach where the amount of free
> > memory is increasing.
> >
> > Conceptually it's straight forward but the details end up being a lot
> > more complex than this approach.
>
> Could we do most of the think work in userspace, emit a bunch of
> low-level hotplug operations to the kernel?
>

That makes me wince a lot. The kernel would be depending on userspace
to correctly capture the event and write to the correct sysfs files. We'd
either have to declare a new event or fake an ACPI hotplug event and cross
our fingers that ACPI hotplug is set up correctly and that userspace does
the right thing. There is no guarantee userspace has any clue and it
certainly does not end up being simpler than this series.

--
Mel Gorman
SUSE Labs

2015-04-16 18:21:12

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/14] Parallel memory initialisation

On 04/15/2015 09:38 AM, Mel Gorman wrote:
>> However, there were 2 bootup problems in the dmesg log that needed
>> to be addressed.
>> 1. There were 2 vmalloc allocation failures:
>> [ 2.284686] vmalloc: allocation failure, allocated 16578404352 of
>> 17179873280 bytes
>> [ 10.399938] vmalloc: allocation failure, allocated 7970922496 of
>> 8589938688 bytes
>>
>> 2. There were 2 soft lockup warnings:
>> [ 57.319453] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s!
>> [swapper/0:1]
>> [ 85.409263] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s!
>> [swapper/0:1]
>>
>> Once those problems are fixed, the patch should be in a pretty good
>> shape. I have attached the dmesg log for your reference.
>>
> The obvious conclusion is that initialising 1G per node is not enough for
> really large machines. Can you try this on top? It's untested but should
> work. The low value was chosen because it happened to work and I wanted
> to get test coverage on common hardware but broke is broke.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f2c96d02662f..6b3bec304e35 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -276,9 +276,9 @@ static inline bool update_defer_init(pg_data_t *pgdat,
> if (pgdat->first_deferred_pfn != ULONG_MAX)
> return false;
>
> - /* Initialise at least 1G per zone */
> + /* Initialise at least 32G per node */
> (*nr_initialised)++;
> - if (*nr_initialised > (1UL << (30 - PAGE_SHIFT)) &&
> + if (*nr_initialised > (32UL << (30 - PAGE_SHIFT)) &&
> (pfn & (PAGES_PER_SECTION - 1)) == 0) {
> pgdat->first_deferred_pfn = pfn;
> return false;
>
>
I applied the patch and the boot time was 299s instead of 298s, so
practically the same. The two issues that I discussed previously
were both gone. Attached is the new dmesg log for your reference.

Cheers,
Longman


Attachments:
dmesg-4.0-Mel-mm-patch-2.txt (478.84 kB)