2015-04-28 14:37:17

by Mel Gorman

Subject: [PATCH 0/13] Parallel struct page initialisation v4

The bulk of the changes here are related to Andrew's feedback. Functionally
there is almost no difference.

Changelog since v3
o Fix section-related warning
o Comments, clarifications, checkpatch
o Report the number of pages initialised

Changelog since v2
o Reduce overhead of topology_init
o Remove boot-time kernel parameter to enable/disable
o Enable on UMA

Changelog since v1
o Always initialise low zones
o Typo corrections
o Rename parallel mem init to parallel struct page init
o Rebase to 4.0

Struct page initialisation had been identified as one of the reasons why
large machines take a long time to boot. Patches were posted a long time ago
to defer initialisation until they were first used. This was rejected on
the grounds it should not be necessary to hurt the fast paths. This series
reuses much of the work from that time but defers the initialisation of
memory to kswapd so that one thread per node initialises memory local to
that node.
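The deferral is gated behind a Kconfig option added later in the series. As
a sketch, per the dependencies declared in patch 7 and the arch hook added
in patch 10, enabling it on x86-64 amounts to:

	CONFIG_MEMORY_HOTPLUG=y
	CONFIG_DEFERRED_STRUCT_PAGE_INIT=y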

After applying the series and setting the appropriate Kconfig variable I
see this in the boot log on a 64G machine

[ 7.383764] kswapd 0 initialised deferred memory in 188ms
[ 7.404253] kswapd 1 initialised deferred memory in 208ms
[ 7.411044] kswapd 3 initialised deferred memory in 216ms
[ 7.411551] kswapd 2 initialised deferred memory in 216ms

On a 1TB machine, I see

[ 8.406511] kswapd 3 initialised deferred memory in 1116ms
[ 8.428518] kswapd 1 initialised deferred memory in 1140ms
[ 8.435977] kswapd 0 initialised deferred memory in 1148ms
[ 8.437416] kswapd 2 initialised deferred memory in 1148ms

Once booted the machine appears to work as normal. Boot times were measured
from the time shutdown was called until ssh was available again. In the
64G case, the boot time savings are negligible. On the 1TB machine, the
savings were 16 seconds.

It would be nice if people who have access to really large machines could
test this series and report how much boot time is reduced.

arch/ia64/mm/numa.c | 19 +--
arch/x86/Kconfig | 1 +
drivers/base/node.c | 6 +-
include/linux/memblock.h | 18 +++
include/linux/mm.h | 8 +-
include/linux/mmzone.h | 23 ++-
mm/Kconfig | 18 +++
mm/bootmem.c | 8 +-
mm/internal.h | 29 +++-
mm/memblock.c | 34 +++-
mm/mm_init.c | 9 +-
mm/nobootmem.c | 7 +-
mm/page_alloc.c | 401 ++++++++++++++++++++++++++++++++++++++++-------
mm/vmscan.c | 6 +-
14 files changed, 487 insertions(+), 100 deletions(-)

--
2.3.5


2015-04-28 14:37:20

by Mel Gorman

Subject: [PATCH 01/13] memblock: Introduce a for_each_reserved_mem_region iterator.

From: Robin Holt <[email protected]>

As part of initializing struct pages in 2MiB chunks, we noticed that
at the end of free_all_bootmem(), there was nothing which had forced
the reserved/allocated 4KiB pages to be initialized.

This helper function will be used for that expansion.
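As a sketch of the intended use (this is the shape the caller added in
patch 3 of this series takes, where reserve_bootmem_region() arrives):

	u64 i;
	phys_addr_t start, end;

	for_each_reserved_mem_region(i, &start, &end)
		reserve_bootmem_region(start, end);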

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nate Zimmer <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/memblock.h | 18 ++++++++++++++++++
mm/memblock.c | 32 ++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index e8cc45307f8f..3075e7673c54 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -93,6 +93,9 @@ void __next_mem_range_rev(u64 *idx, int nid, struct memblock_type *type_a,
struct memblock_type *type_b, phys_addr_t *out_start,
phys_addr_t *out_end, int *out_nid);

+void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
+ phys_addr_t *out_end);
+
/**
* for_each_mem_range - iterate through memblock areas from type_a and not
* included in type_b. Or just type_a if type_b is NULL.
@@ -132,6 +135,21 @@ void __next_mem_range_rev(u64 *idx, int nid, struct memblock_type *type_a,
__next_mem_range_rev(&i, nid, type_a, type_b, \
p_start, p_end, p_nid))

+/**
+ * for_each_reserved_mem_region - iterate over all reserved memblock areas
+ * @i: u64 used as loop variable
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over reserved areas of memblock. Available as soon as memblock
+ * is initialized.
+ */
+#define for_each_reserved_mem_region(i, p_start, p_end) \
+ for (i = 0UL, \
+ __next_reserved_mem_region(&i, p_start, p_end); \
+ i != (u64)ULLONG_MAX; \
+ __next_reserved_mem_region(&i, p_start, p_end))
+
#ifdef CONFIG_MOVABLE_NODE
static inline bool memblock_is_hotpluggable(struct memblock_region *m)
{
diff --git a/mm/memblock.c b/mm/memblock.c
index 252b77bdf65e..e0cc2d174f74 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -765,6 +765,38 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
}

/**
+ * __next_reserved_mem_region - next function for for_each_reserved_mem_region()
+ * @idx: pointer to u64 loop variable
+ * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL
+ * @out_end: ptr to phys_addr_t for end address of the region, can be %NULL
+ *
+ * Iterate over all reserved memory regions.
+ */
+void __init_memblock __next_reserved_mem_region(u64 *idx,
+ phys_addr_t *out_start,
+ phys_addr_t *out_end)
+{
+ struct memblock_type *rsv = &memblock.reserved;
+
+ if (*idx >= 0 && *idx < rsv->cnt) {
+ struct memblock_region *r = &rsv->regions[*idx];
+ phys_addr_t base = r->base;
+ phys_addr_t size = r->size;
+
+ if (out_start)
+ *out_start = base;
+ if (out_end)
+ *out_end = base + size - 1;
+
+ *idx += 1;
+ return;
+ }
+
+ /* signal end of iteration */
+ *idx = ULLONG_MAX;
+}
+
+/**
* __next__mem_range - next function for for_each_free_mem_range() etc.
* @idx: pointer to u64 loop variable
* @nid: node selector, %NUMA_NO_NODE for all nodes
--
2.3.5

2015-04-28 14:37:23

by Mel Gorman

Subject: [PATCH 02/13] mm: meminit: Move page initialization into a separate function.

From: Robin Holt <[email protected]>

Currently, memmap_init_zone() has all the smarts for initializing a single
page. A subset of this is required for parallel page initialisation and so
this patch breaks up the monolithic function in preparation.

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nathan Zimmer <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 79 +++++++++++++++++++++++++++++++++------------------------
1 file changed, 46 insertions(+), 33 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 40e29429e7b0..fd7a6d09062d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -778,6 +778,51 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
return 0;
}

+static void __meminit __init_single_page(struct page *page, unsigned long pfn,
+ unsigned long zone, int nid)
+{
+ struct zone *z = &NODE_DATA(nid)->node_zones[zone];
+
+ set_page_links(page, zone, nid, pfn);
+ mminit_verify_page_links(page, zone, nid, pfn);
+ init_page_count(page);
+ page_mapcount_reset(page);
+ page_cpupid_reset_last(page);
+ SetPageReserved(page);
+
+ /*
+ * Mark the block movable so that blocks are reserved for
+ * movable at startup. This will force kernel allocations
+ * to reserve their blocks rather than leaking throughout
+ * the address space during boot when many long-lived
+ * kernel allocations are made. Later some blocks near
+ * the start are marked MIGRATE_RESERVE by
+ * setup_zone_migrate_reserve()
+ *
+ * bitmap is created for zone's valid pfn range. but memmap
+ * can be created for invalid pages (for alignment)
+ * check here not to call set_pageblock_migratetype() against
+ * pfn out of zone.
+ */
+ if ((z->zone_start_pfn <= pfn)
+ && (pfn < zone_end_pfn(z))
+ && !(pfn & (pageblock_nr_pages - 1)))
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+
+ INIT_LIST_HEAD(&page->lru);
+#ifdef WANT_PAGE_VIRTUAL
+ /* The shift won't overflow because ZONE_NORMAL is below 4G. */
+ if (!is_highmem_idx(zone))
+ set_page_address(page, __va(pfn << PAGE_SHIFT));
+#endif
+}
+
+static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone,
+ int nid)
+{
+ return __init_single_page(pfn_to_page(pfn), pfn, zone, nid);
+}
+
static bool free_pages_prepare(struct page *page, unsigned int order)
{
bool compound = PageCompound(page);
@@ -4124,7 +4169,6 @@ static void setup_zone_migrate_reserve(struct zone *zone)
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
unsigned long start_pfn, enum memmap_context context)
{
- struct page *page;
unsigned long end_pfn = start_pfn + size;
unsigned long pfn;
struct zone *z;
@@ -4145,38 +4189,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
if (!early_pfn_in_nid(pfn, nid))
continue;
}
- page = pfn_to_page(pfn);
- set_page_links(page, zone, nid, pfn);
- mminit_verify_page_links(page, zone, nid, pfn);
- init_page_count(page);
- page_mapcount_reset(page);
- page_cpupid_reset_last(page);
- SetPageReserved(page);
- /*
- * Mark the block movable so that blocks are reserved for
- * movable at startup. This will force kernel allocations
- * to reserve their blocks rather than leaking throughout
- * the address space during boot when many long-lived
- * kernel allocations are made. Later some blocks near
- * the start are marked MIGRATE_RESERVE by
- * setup_zone_migrate_reserve()
- *
- * bitmap is created for zone's valid pfn range. but memmap
- * can be created for invalid pages (for alignment)
- * check here not to call set_pageblock_migratetype() against
- * pfn out of zone.
- */
- if ((z->zone_start_pfn <= pfn)
- && (pfn < zone_end_pfn(z))
- && !(pfn & (pageblock_nr_pages - 1)))
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-
- INIT_LIST_HEAD(&page->lru);
-#ifdef WANT_PAGE_VIRTUAL
- /* The shift won't overflow because ZONE_NORMAL is below 4G. */
- if (!is_highmem_idx(zone))
- set_page_address(page, __va(pfn << PAGE_SHIFT));
-#endif
+ __init_single_pfn(pfn, zone, nid);
}
}

--
2.3.5

2015-04-28 14:41:11

by Mel Gorman

Subject: [PATCH 03/13] mm: meminit: Only set page reserved in the memblock region

From: Nathan Zimmer <[email protected]>

Currently each page struct is set as reserved upon initialization.
This patch leaves the reserved bit clear and only sets the reserved bit
when it is known the memory was allocated by the bootmem allocator. This
makes it easier to distinguish between uninitialised struct pages and
reserved struct pages in later patches.
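As a worked example of the rounding, assuming 4KiB pages (PAGE_SHIFT == 12):

	/*
	 * reserve_bootmem_region(0x1000, 0x3500) computes
	 *	start_pfn = PFN_DOWN(0x1000) = 1
	 *	end_pfn   = PFN_UP(0x3500)   = 4
	 * so the struct pages for PFNs 1, 2 and 3 are marked PageReserved.
	 */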

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nathan Zimmer <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mm.h | 2 ++
mm/nobootmem.c | 3 +++
mm/page_alloc.c | 17 ++++++++++++++++-
3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 47a93928b90f..b6f82a31028a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1711,6 +1711,8 @@ extern void free_highmem_page(struct page *page);
extern void adjust_managed_page_count(struct page *page, long count);
extern void mem_init_print_info(const char *str);

+extern void reserve_bootmem_region(unsigned long start, unsigned long end);
+
/* Free the reserved page into the buddy system, so it gets managed. */
static inline void __free_reserved_page(struct page *page)
{
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 90b50468333e..396f9e450dc1 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -121,6 +121,9 @@ static unsigned long __init free_low_memory_core_early(void)

memblock_clear_hotplug(0, -1);

+ for_each_reserved_mem_region(i, &start, &end)
+ reserve_bootmem_region(start, end);
+
for_each_free_mem_range(i, NUMA_NO_NODE, &start, &end, NULL)
count += __free_memory_core(start, end);

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd7a6d09062d..13c88177d3c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -788,7 +788,6 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);
- SetPageReserved(page);

/*
* Mark the block movable so that blocks are reserved for
@@ -823,6 +822,22 @@ static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone,
return __init_single_page(pfn_to_page(pfn), pfn, zone, nid);
}

+/*
+ * Initialised pages do not have PageReserved set. This function is
+ * called for each range allocated by the bootmem allocator and
+ * marks the pages PageReserved. The remaining valid pages are later
+ * sent to the buddy page allocator.
+ */
+void reserve_bootmem_region(unsigned long start, unsigned long end)
+{
+ unsigned long start_pfn = PFN_DOWN(start);
+ unsigned long end_pfn = PFN_UP(end);
+
+ for (; start_pfn < end_pfn; start_pfn++)
+ if (pfn_valid(start_pfn))
+ SetPageReserved(pfn_to_page(start_pfn));
+}
+
static bool free_pages_prepare(struct page *page, unsigned int order)
{
bool compound = PageCompound(page);
--
2.3.5

2015-04-28 14:40:26

by Mel Gorman

Subject: [PATCH 04/13] mm: page_alloc: Pass PFN to __free_pages_bootmem

__free_pages_bootmem prepares a page for release to the buddy allocator and
assumes that the struct page is initialised. Parallel initialisation of struct
pages defers initialisation, and __free_pages_bootmem can then be called for
struct pages from which a PFN cannot yet be derived. This patch passes the
PFN to __free_pages_bootmem with no other functional change.
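A plausible reading of why the PFN cannot be derived from the struct page
itself, assuming classic CONFIG_SPARSEMEM without vmemmap: page_to_pfn()
decodes the memory section from page->flags, and page->flags is precisely
what deferred initialisation leaves unset until later, so the PFN has to
travel alongside the pointer.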

Signed-off-by: Mel Gorman <[email protected]>
---
mm/bootmem.c | 8 ++++----
mm/internal.h | 3 ++-
mm/memblock.c | 2 +-
mm/nobootmem.c | 4 ++--
mm/page_alloc.c | 3 ++-
5 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/bootmem.c b/mm/bootmem.c
index 477be696511d..daf956bb4782 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -164,7 +164,7 @@ void __init free_bootmem_late(unsigned long physaddr, unsigned long size)
end = PFN_DOWN(physaddr + size);

for (; cursor < end; cursor++) {
- __free_pages_bootmem(pfn_to_page(cursor), 0);
+ __free_pages_bootmem(pfn_to_page(cursor), cursor, 0);
totalram_pages++;
}
}
@@ -210,7 +210,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
if (IS_ALIGNED(start, BITS_PER_LONG) && vec == ~0UL) {
int order = ilog2(BITS_PER_LONG);

- __free_pages_bootmem(pfn_to_page(start), order);
+ __free_pages_bootmem(pfn_to_page(start), start, order);
count += BITS_PER_LONG;
start += BITS_PER_LONG;
} else {
@@ -220,7 +220,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
while (vec && cur != start) {
if (vec & 1) {
page = pfn_to_page(cur);
- __free_pages_bootmem(page, 0);
+ __free_pages_bootmem(page, cur, 0);
count++;
}
vec >>= 1;
@@ -234,7 +234,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
pages = bootmem_bootmap_pages(pages);
count += pages;
while (pages--)
- __free_pages_bootmem(page++, 0);
+ __free_pages_bootmem(page++, cur++, 0);

bdebug("nid=%td released=%lx\n", bdata - bootmem_node_data, count);

diff --git a/mm/internal.h b/mm/internal.h
index a96da5b0029d..76b605139c7a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -155,7 +155,8 @@ __find_buddy_index(unsigned long page_idx, unsigned int order)
}

extern int __isolate_free_page(struct page *page, unsigned int order);
-extern void __free_pages_bootmem(struct page *page, unsigned int order);
+extern void __free_pages_bootmem(struct page *page, unsigned long pfn,
+ unsigned int order);
extern void prep_compound_page(struct page *page, unsigned long order);
#ifdef CONFIG_MEMORY_FAILURE
extern bool is_free_buddy_page(struct page *page);
diff --git a/mm/memblock.c b/mm/memblock.c
index e0cc2d174f74..f3e97d8eeb5c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1334,7 +1334,7 @@ void __init __memblock_free_late(phys_addr_t base, phys_addr_t size)
end = PFN_DOWN(base + size);

for (; cursor < end; cursor++) {
- __free_pages_bootmem(pfn_to_page(cursor), 0);
+ __free_pages_bootmem(pfn_to_page(cursor), cursor, 0);
totalram_pages++;
}
}
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 396f9e450dc1..bae652713ee5 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -77,7 +77,7 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size)
end = PFN_DOWN(addr + size);

for (; cursor < end; cursor++) {
- __free_pages_bootmem(pfn_to_page(cursor), 0);
+ __free_pages_bootmem(pfn_to_page(cursor), cursor, 0);
totalram_pages++;
}
}
@@ -92,7 +92,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end)
while (start + (1UL << order) > end)
order--;

- __free_pages_bootmem(pfn_to_page(start), order);
+ __free_pages_bootmem(pfn_to_page(start), start, order);

start += (1UL << order);
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 13c88177d3c6..a59f75d02d11 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -892,7 +892,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
local_irq_restore(flags);
}

-void __init __free_pages_bootmem(struct page *page, unsigned int order)
+void __init __free_pages_bootmem(struct page *page, unsigned long pfn,
+ unsigned int order)
{
unsigned int nr_pages = 1 << order;
struct page *p = page;
--
2.3.5

2015-04-28 14:37:27

by Mel Gorman

Subject: [PATCH 05/13] mm: meminit: Make __early_pfn_to_nid SMP-safe and introduce meminit_pfn_in_nid

__early_pfn_to_nid() uses static variables to cache recent lookups because
memblock lookups are very expensive, but it assumes that memory initialisation
is single-threaded. Parallel initialisation of struct pages will break that
assumption, so this patch makes __early_pfn_to_nid() SMP-safe by requiring the
caller to cache recent search information. early_pfn_to_nid() keeps the same
interface but is only safe to use early in boot due to its use of a global
static variable. meminit_pfn_in_nid() is an SMP-safe version for which callers
must maintain their own state.
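A sketch of the resulting calling pattern (this is the shape the deferred
initialiser in patch 8 uses; each thread keeps private cache state, so no
globals are shared):

	struct mminit_pfnnid_cache nid_state = { };
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (!meminit_pfn_in_nid(pfn, nid, &nid_state))
			continue;
		/* pfn belongs to node nid; safe to initialise */
	}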

Signed-off-by: Mel Gorman <[email protected]>
---
arch/ia64/mm/numa.c | 19 +++++++------------
include/linux/mm.h | 6 ++++--
include/linux/mmzone.h | 16 +++++++++++++++-
mm/page_alloc.c | 40 +++++++++++++++++++++++++---------------
4 files changed, 51 insertions(+), 30 deletions(-)

diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c
index ea21d4cad540..aa19b7ac8222 100644
--- a/arch/ia64/mm/numa.c
+++ b/arch/ia64/mm/numa.c
@@ -58,27 +58,22 @@ paddr_to_nid(unsigned long paddr)
* SPARSEMEM to allocate the SPARSEMEM sectionmap on the NUMA node where
* the section resides.
*/
-int __meminit __early_pfn_to_nid(unsigned long pfn)
+int __meminit __early_pfn_to_nid(unsigned long pfn,
+ struct mminit_pfnnid_cache *state)
{
int i, section = pfn >> PFN_SECTION_SHIFT, ssec, esec;
- /*
- * NOTE: The following SMP-unsafe globals are only used early in boot
- * when the kernel is running single-threaded.
- */
- static int __meminitdata last_ssec, last_esec;
- static int __meminitdata last_nid;

- if (section >= last_ssec && section < last_esec)
- return last_nid;
+ if (section >= state->last_start && section < state->last_end)
+ return state->last_nid;

for (i = 0; i < num_node_memblks; i++) {
ssec = node_memblk[i].start_paddr >> PA_SECTION_SHIFT;
esec = (node_memblk[i].start_paddr + node_memblk[i].size +
((1L << PA_SECTION_SHIFT) - 1)) >> PA_SECTION_SHIFT;
if (section >= ssec && section < esec) {
- last_ssec = ssec;
- last_esec = esec;
- last_nid = node_memblk[i].nid;
+ state->last_start = ssec;
+ state->last_end = esec;
+ state->last_nid = node_memblk[i].nid;
return node_memblk[i].nid;
}
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b6f82a31028a..a8a8b161fd65 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1802,7 +1802,8 @@ extern void sparse_memory_present_with_active_regions(int nid);

#if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
!defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
-static inline int __early_pfn_to_nid(unsigned long pfn)
+static inline int __early_pfn_to_nid(unsigned long pfn,
+ struct mminit_pfnnid_cache *state)
{
return 0;
}
@@ -1810,7 +1811,8 @@ static inline int __early_pfn_to_nid(unsigned long pfn)
/* please see mm/page_alloc.c */
extern int __meminit early_pfn_to_nid(unsigned long pfn);
/* there is a per-arch backend function. */
-extern int __meminit __early_pfn_to_nid(unsigned long pfn);
+extern int __meminit __early_pfn_to_nid(unsigned long pfn,
+ struct mminit_pfnnid_cache *state);
#endif

extern void set_dma_reserve(unsigned long new_dma_reserve);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2782df47101e..a67b33e52dfe 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1216,10 +1216,24 @@ void sparse_init(void);
#define sparse_index_init(_sec, _nid) do {} while (0)
#endif /* CONFIG_SPARSEMEM */

+/*
+ * During memory init memblocks map pfns to nids. The search is expensive and
+ * this caches recent lookups. The implementation of __early_pfn_to_nid
+ * may treat start/end as pfns or sections.
+ */
+struct mminit_pfnnid_cache {
+ unsigned long last_start;
+ unsigned long last_end;
+ int last_nid;
+};
+
#ifdef CONFIG_NODES_SPAN_OTHER_NODES
bool early_pfn_in_nid(unsigned long pfn, int nid);
+bool meminit_pfn_in_nid(unsigned long pfn, int node,
+ struct mminit_pfnnid_cache *state);
#else
-#define early_pfn_in_nid(pfn, nid) (1)
+#define early_pfn_in_nid(pfn, nid) (1)
+#define meminit_pfn_in_nid(pfn, nid, state) (1)
#endif

#ifndef early_pfn_valid
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a59f75d02d11..6c5ed5804e82 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4463,39 +4463,41 @@ int __meminit init_currently_empty_zone(struct zone *zone,

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
+
/*
* Required by SPARSEMEM. Given a PFN, return what node the PFN is on.
*/
-int __meminit __early_pfn_to_nid(unsigned long pfn)
+int __meminit __early_pfn_to_nid(unsigned long pfn,
+ struct mminit_pfnnid_cache *state)
{
unsigned long start_pfn, end_pfn;
int nid;
- /*
- * NOTE: The following SMP-unsafe globals are only used early in boot
- * when the kernel is running single-threaded.
- */
- static unsigned long __meminitdata last_start_pfn, last_end_pfn;
- static int __meminitdata last_nid;

- if (last_start_pfn <= pfn && pfn < last_end_pfn)
- return last_nid;
+ if (state->last_start <= pfn && pfn < state->last_end)
+ return state->last_nid;

nid = memblock_search_pfn_nid(pfn, &start_pfn, &end_pfn);
if (nid != -1) {
- last_start_pfn = start_pfn;
- last_end_pfn = end_pfn;
- last_nid = nid;
+ state->last_start = start_pfn;
+ state->last_end = end_pfn;
+ state->last_nid = nid;
}

return nid;
}
#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */

+static struct mminit_pfnnid_cache early_pfnnid_cache __meminitdata;
+
+/* Only safe to use early in boot when initialisation is single-threaded */
int __meminit early_pfn_to_nid(unsigned long pfn)
{
int nid;

- nid = __early_pfn_to_nid(pfn);
+ /* The system will behave unpredictably otherwise */
+ BUG_ON(system_state != SYSTEM_BOOTING);
+
+ nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);
if (nid >= 0)
return nid;
/* just returns 0 */
@@ -4503,15 +4505,23 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
}

#ifdef CONFIG_NODES_SPAN_OTHER_NODES
-bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
+bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node,
+ struct mminit_pfnnid_cache *state)
{
int nid;

- nid = __early_pfn_to_nid(pfn);
+ nid = __early_pfn_to_nid(pfn, state);
if (nid >= 0 && nid != node)
return false;
return true;
}
+
+/* Only safe to use early in boot when initialisation is single-threaded */
+bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
+{
+ return meminit_pfn_in_nid(pfn, node, &early_pfnnid_cache);
+}
+
#endif

/**
--
2.3.5

2015-04-28 14:40:06

by Mel Gorman

Subject: [PATCH 06/13] mm: meminit: Inline some helper functions

early_pfn_in_nid() and meminit_pfn_in_nid() are small functions that are
unnecessarily visible outside memory initialisation. Beyond the needless
visibility, the function call overhead is unwelcome on a path executed for
every page. This patch moves the helpers inline.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 9 ------
mm/page_alloc.c | 76 ++++++++++++++++++++++++++------------------------
2 files changed, 39 insertions(+), 46 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a67b33e52dfe..e3d8a2bd8d78 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1227,15 +1227,6 @@ struct mminit_pfnnid_cache {
int last_nid;
};

-#ifdef CONFIG_NODES_SPAN_OTHER_NODES
-bool early_pfn_in_nid(unsigned long pfn, int nid);
-bool meminit_pfn_in_nid(unsigned long pfn, int node,
- struct mminit_pfnnid_cache *state);
-#else
-#define early_pfn_in_nid(pfn, nid) (1)
-#define meminit_pfn_in_nid(pfn, nid, state) (1)
-#endif
-
#ifndef early_pfn_valid
#define early_pfn_valid(pfn) (1)
#endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6c5ed5804e82..bb99c7e66da5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -913,6 +913,45 @@ void __init __free_pages_bootmem(struct page *page, unsigned long pfn,
__free_pages(page, order);
}

+#if defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) || \
+ defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
+/* Only safe to use early in boot when initialisation is single-threaded */
+static struct mminit_pfnnid_cache early_pfnnid_cache __meminitdata;
+
+int __meminit early_pfn_to_nid(unsigned long pfn)
+{
+ int nid;
+
+ /* The system will behave unpredictably otherwise */
+ BUG_ON(system_state != SYSTEM_BOOTING);
+
+ nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);
+ if (nid >= 0)
+ return nid;
+ /* just returns 0 */
+ return 0;
+}
+#endif
+
+#ifdef CONFIG_NODES_SPAN_OTHER_NODES
+static inline bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node,
+ struct mminit_pfnnid_cache *state)
+{
+ int nid;
+
+ nid = __early_pfn_to_nid(pfn, state);
+ if (nid >= 0 && nid != node)
+ return false;
+ return true;
+}
+
+/* Only safe to use early in boot when initialisation is single-threaded */
+static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
+{
+ return meminit_pfn_in_nid(pfn, node, &early_pfnnid_cache);
+}
+#endif
+
#ifdef CONFIG_CMA
/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
void __init init_cma_reserved_pageblock(struct page *page)
@@ -4487,43 +4526,6 @@ int __meminit __early_pfn_to_nid(unsigned long pfn,
}
#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */

-static struct mminit_pfnnid_cache early_pfnnid_cache __meminitdata;
-
-/* Only safe to use early in boot when initialisation is single-threaded */
-int __meminit early_pfn_to_nid(unsigned long pfn)
-{
- int nid;
-
- /* The system will behave unpredictably otherwise */
- BUG_ON(system_state != SYSTEM_BOOTING);
-
- nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);
- if (nid >= 0)
- return nid;
- /* just returns 0 */
- return 0;
-}
-
-#ifdef CONFIG_NODES_SPAN_OTHER_NODES
-bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node,
- struct mminit_pfnnid_cache *state)
-{
- int nid;
-
- nid = __early_pfn_to_nid(pfn, state);
- if (nid >= 0 && nid != node)
- return false;
- return true;
-}
-
-/* Only safe to use early in boot when initialisation is single-threaded */
-bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
-{
- return meminit_pfn_in_nid(pfn, node, &early_pfnnid_cache);
-}
-
-#endif
-
/**
* free_bootmem_with_active_regions - Call memblock_free_early_nid for each active range
* @nid: The node to free memory on. If MAX_NUMNODES, all nodes are freed.
--
2.3.5

2015-04-28 14:39:31

by Mel Gorman

Subject: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set

This patch initialises all low-memory struct pages and 2G of the highest zone
on each node during memory initialisation if CONFIG_DEFERRED_STRUCT_PAGE_INIT
is set. That config option cannot be selected yet but becomes available in a
later patch. Parallel initialisation of struct pages depends on some features
from memory hotplug, and it is necessary to alter section annotations.
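For scale, the 2G threshold used by update_defer_init() below works out,
assuming 4KiB pages (PAGE_SHIFT == 12), as:

	2UL << (30 - PAGE_SHIFT)	/* 2UL << 18 == 524288 pages == 2GiB */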

Signed-off-by: Mel Gorman <[email protected]>
---
drivers/base/node.c | 6 +++-
include/linux/mmzone.h | 8 ++++++
mm/Kconfig | 18 ++++++++++++
mm/internal.h | 14 +++++++++
mm/page_alloc.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++--
5 files changed, 120 insertions(+), 4 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 36fabe43cd44..97ab2c4dd39e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -361,12 +361,16 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
#define page_initialized(page) (page->lru.next)

-static int get_nid_for_pfn(unsigned long pfn)
+static int __init_refok get_nid_for_pfn(unsigned long pfn)
{
struct page *page;

if (!pfn_valid_within(pfn))
return -1;
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+ if (system_state == SYSTEM_BOOTING)
+ return early_pfn_to_nid(pfn);
+#endif
page = pfn_to_page(pfn);
if (!page_initialized(page))
return -1;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e3d8a2bd8d78..4882c53b70b5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -762,6 +762,14 @@ typedef struct pglist_data {
/* Number of pages migrated during the rate limiting time interval */
unsigned long numabalancing_migrate_nr_pages;
#endif
+
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+ /*
+ * If memory initialisation on large machines is deferred then this
+ * is the first PFN that needs to be initialised.
+ */
+ unsigned long first_deferred_pfn;
+#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
} pg_data_t;

#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
diff --git a/mm/Kconfig b/mm/Kconfig
index a03131b6ba8e..3e40cb64e226 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -629,3 +629,21 @@ config MAX_STACK_SIZE_MB
changed to a smaller value in which case that is used.

A sane initial value is 80 MB.
+
+# For architectures that support deferred memory initialisation
+config ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
+ bool
+
+config DEFERRED_STRUCT_PAGE_INIT
+ bool "Defer initialisation of struct pages to kswapd"
+ default n
+ depends on ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
+ depends on MEMORY_HOTPLUG
+ help
+ Ordinarily all struct pages are initialised during early boot in a
+ single thread. On very large machines this can take a considerable
+ amount of time. If this option is set, large machines will bring up
+ a subset of memmap at boot and then initialise the rest in parallel
+ when kswapd starts. This has a potential performance impact on
+ processes running early in the lifetime of the system until kswapd
+ finishes the initialisation.
diff --git a/mm/internal.h b/mm/internal.h
index 76b605139c7a..24314b671db1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -385,6 +385,20 @@ static inline void mminit_verify_zonelist(void)
}
#endif /* CONFIG_DEBUG_MEMORY_INIT */

+/*
+ * Deferred struct page initialisation requires some early init functions that
+ * are removed before kswapd is up and running. The feature depends on memory
+ * hotplug so put the data and code required by deferred initialisation into
+ * the __meminit section where they are preserved.
+ */
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+#define __defermem_init __meminit
+#define __defer_init __meminit
+#else
+#define __defermem_init
+#define __defer_init __init
+#endif
+
/* mminit_validate_memmodel_limits is independent of CONFIG_DEBUG_MEMORY_INIT */
#if defined(CONFIG_SPARSEMEM)
extern void mminit_validate_memmodel_limits(unsigned long *start_pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb99c7e66da5..8ec493a24b9c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -235,6 +235,64 @@ EXPORT_SYMBOL(nr_online_nodes);

int page_group_by_mobility_disabled __read_mostly;

+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+static inline void reset_deferred_meminit(pg_data_t *pgdat)
+{
+ pgdat->first_deferred_pfn = ULONG_MAX;
+}
+
+/* Returns true if the struct page for the pfn is uninitialised */
+static inline bool __defermem_init early_page_uninitialised(unsigned long pfn)
+{
+ int nid = early_pfn_to_nid(pfn);
+
+ if (pfn >= NODE_DATA(nid)->first_deferred_pfn)
+ return true;
+
+ return false;
+}
+
+/*
+ * Returns false when the remaining initialisation should be deferred until
+ * later in the boot cycle when it can be parallelised.
+ */
+static inline bool update_defer_init(pg_data_t *pgdat,
+ unsigned long pfn, unsigned long zone_end,
+ unsigned long *nr_initialised)
+{
+ /* Always populate low zones for address-constrained allocations */
+ if (zone_end < pgdat_end_pfn(pgdat))
+ return true;
+
+ /* Initialise at least 2G of the highest zone */
+ (*nr_initialised)++;
+ if (*nr_initialised > (2UL << (30 - PAGE_SHIFT)) &&
+ (pfn & (PAGES_PER_SECTION - 1)) == 0) {
+ pgdat->first_deferred_pfn = pfn;
+ return false;
+ }
+
+ return true;
+}
+#else
+static inline void reset_deferred_meminit(pg_data_t *pgdat)
+{
+}
+
+static inline bool early_page_uninitialised(unsigned long pfn)
+{
+ return false;
+}
+
+static inline bool update_defer_init(pg_data_t *pgdat,
+ unsigned long pfn, unsigned long zone_end,
+ unsigned long *nr_initialised)
+{
+ return true;
+}
+#endif
+
+
void set_pageblock_migratetype(struct page *page, int migratetype)
{
if (unlikely(page_group_by_mobility_disabled &&
@@ -892,8 +950,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
local_irq_restore(flags);
}

-void __init __free_pages_bootmem(struct page *page, unsigned long pfn,
- unsigned int order)
+static void __defer_init __free_pages_boot_core(struct page *page,
+ unsigned long pfn, unsigned int order)
{
unsigned int nr_pages = 1 << order;
struct page *p = page;
@@ -952,6 +1010,14 @@ static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
}
#endif

+void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn,
+ unsigned int order)
+{
+ if (early_page_uninitialised(pfn))
+ return;
+ return __free_pages_boot_core(page, pfn, order);
+}
+
#ifdef CONFIG_CMA
/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
void __init init_cma_reserved_pageblock(struct page *page)
@@ -4224,14 +4290,16 @@ static void setup_zone_migrate_reserve(struct zone *zone)
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
unsigned long start_pfn, enum memmap_context context)
{
+ pg_data_t *pgdat = NODE_DATA(nid);
unsigned long end_pfn = start_pfn + size;
unsigned long pfn;
struct zone *z;
+ unsigned long nr_initialised = 0;

if (highest_memmap_pfn < end_pfn - 1)
highest_memmap_pfn = end_pfn - 1;

- z = &NODE_DATA(nid)->node_zones[zone];
+ z = &pgdat->node_zones[zone];
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
/*
* There can be holes in boot-time mem_map[]s
@@ -4243,6 +4311,9 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
continue;
if (!early_pfn_in_nid(pfn, nid))
continue;
+ if (!update_defer_init(pgdat, pfn, end_pfn,
+ &nr_initialised))
+ break;
}
__init_single_pfn(pfn, zone, nid);
}
@@ -5044,6 +5115,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
/* pg_data_t should be reset to zero when it's allocated */
WARN_ON(pgdat->nr_zones || pgdat->classzone_idx);

+ reset_deferred_meminit(pgdat);
pgdat->node_id = nid;
pgdat->node_start_pfn = node_start_pfn;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
--
2.3.5

2015-04-28 14:39:28

by Mel Gorman

Subject: [PATCH 08/13] mm: meminit: Initialise remaining struct pages in parallel with kswapd

Only a subset of struct pages are initialised at the moment. When this patch
is applied, kswapd initialises the remaining struct pages in parallel. This
should make booting faster by spreading the work to multiple CPUs and
initialising data that is local to the CPU. The user-visible effect on large
machines is that free memory will appear to increase rapidly early in the
lifetime of the system until kswapd reports in the kernel log that all memory
is initialised. Once initialised, there should be no other user-visible
effects.
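The hook point is simple. Conceptually, kswapd now initialises its node's
deferred memory before entering the normal reclaim loop; a sketch, not the
full function in mm/vmscan.c:

	static int __defermem_init kswapd(void *p)
	{
		pg_data_t *pgdat = (pg_data_t *)p;

		/* ... existing setup: cpumask, freezer, PF_KSWAPD ... */

		deferred_init_memmap(pgdat->node_id);	/* one thread per node */

		/* ... the usual balance_pgdat() loop follows ... */
		return 0;
	}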

Signed-off-by: Mel Gorman <[email protected]>
---
mm/internal.h | 6 +++
mm/mm_init.c | 1 +
mm/page_alloc.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
mm/vmscan.c | 6 ++-
4 files changed, 130 insertions(+), 6 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 24314b671db1..bed751a7ac42 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -394,9 +394,15 @@ static inline void mminit_verify_zonelist(void)
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
#define __defermem_init __meminit
#define __defer_init __meminit
+
+void deferred_init_memmap(int nid);
#else
#define __defermem_init
#define __defer_init __init
+
+static inline void deferred_init_memmap(int nid)
+{
+}
#endif

/* mminit_validate_memmodel_limits is independent of CONFIG_DEBUG_MEMORY_INIT */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 5f420f7fafa1..28fbf87b20aa 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -11,6 +11,7 @@
#include <linux/export.h>
#include <linux/memory.h>
#include <linux/notifier.h>
+#include <linux/sched.h>
#include "internal.h"

#ifdef CONFIG_DEBUG_MEMORY_INIT
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8ec493a24b9c..96f2c2dc8ca6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -252,6 +252,14 @@ static inline bool __defermem_init early_page_uninitialised(unsigned long pfn)
return false;
}

+static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
+{
+ if (pfn >= NODE_DATA(nid)->first_deferred_pfn)
+ return true;
+
+ return false;
+}
+
/*
* Returns false when the remaining initialisation should be deferred until
* later in the boot cycle when it can be parallelised.
@@ -284,6 +292,11 @@ static inline bool early_page_uninitialised(unsigned long pfn)
return false;
}

+static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
+{
+ return false;
+}
+
static inline bool update_defer_init(pg_data_t *pgdat,
unsigned long pfn, unsigned long zone_end,
unsigned long *nr_initialised)
@@ -880,20 +893,51 @@ static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone,
return __init_single_page(pfn_to_page(pfn), pfn, zone, nid);
}

+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+static void init_reserved_page(unsigned long pfn)
+{
+ pg_data_t *pgdat;
+ int nid, zid;
+
+ if (!early_page_uninitialised(pfn))
+ return;
+
+ nid = early_pfn_to_nid(pfn);
+ pgdat = NODE_DATA(nid);
+
+ for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ struct zone *zone = &pgdat->node_zones[zid];
+
+ if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone))
+ break;
+ }
+ __init_single_pfn(pfn, zid, nid);
+}
+#else
+static inline void init_reserved_page(unsigned long pfn)
+{
+}
+#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+
/*
* Initialised pages do not have PageReserved set. This function is
* called for each range allocated by the bootmem allocator and
* marks the pages PageReserved. The remaining valid pages are later
* sent to the buddy page allocator.
*/
-void reserve_bootmem_region(unsigned long start, unsigned long end)
+void __meminit reserve_bootmem_region(unsigned long start, unsigned long end)
{
unsigned long start_pfn = PFN_DOWN(start);
unsigned long end_pfn = PFN_UP(end);

- for (; start_pfn < end_pfn; start_pfn++)
- if (pfn_valid(start_pfn))
- SetPageReserved(pfn_to_page(start_pfn));
+ for (; start_pfn < end_pfn; start_pfn++) {
+ if (pfn_valid(start_pfn)) {
+ struct page *page = pfn_to_page(start_pfn);
+
+ init_reserved_page(start_pfn);
+ SetPageReserved(page);
+ }
+ }
}

static bool free_pages_prepare(struct page *page, unsigned int order)
@@ -1018,6 +1062,74 @@ void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn,
return __free_pages_boot_core(page, pfn, order);
}

+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+/* Initialise remaining memory on a node */
+void __defermem_init deferred_init_memmap(int nid)
+{
+ struct mminit_pfnnid_cache nid_init_state = { };
+ unsigned long start = jiffies;
+ unsigned long nr_pages = 0;
+ unsigned long walk_start, walk_end;
+ int i, zid;
+ struct zone *zone;
+ pg_data_t *pgdat = NODE_DATA(nid);
+ unsigned long first_init_pfn = pgdat->first_deferred_pfn;
+
+ if (first_init_pfn == ULONG_MAX)
+ return;
+
+ /* Sanity check boundaries */
+ BUG_ON(pgdat->first_deferred_pfn < pgdat->node_start_pfn);
+ BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
+ pgdat->first_deferred_pfn = ULONG_MAX;
+
+ /* Only the highest zone is deferred so find it */
+ for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ zone = pgdat->node_zones + zid;
+ if (first_init_pfn < zone_end_pfn(zone))
+ break;
+ }
+
+ for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) {
+ unsigned long pfn, end_pfn;
+
+ end_pfn = min(walk_end, zone_end_pfn(zone));
+ pfn = first_init_pfn;
+ if (pfn < walk_start)
+ pfn = walk_start;
+ if (pfn < zone->zone_start_pfn)
+ pfn = zone->zone_start_pfn;
+
+ for (; pfn < end_pfn; pfn++) {
+ struct page *page;
+
+ if (!pfn_valid(pfn))
+ continue;
+
+ if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state))
+ continue;
+
+ page = pfn_to_page(pfn);
+ if (page->flags) {
+ VM_BUG_ON(page_zone(page) != zone);
+ continue;
+ }
+
+ __init_single_page(page, pfn, zid, nid);
+ __free_pages_boot_core(page, pfn, 0);
+ nr_pages++;
+ cond_resched();
+ }
+ first_init_pfn = max(end_pfn, first_init_pfn);
+ }
+
+ /* Sanity check that the next zone really is unpopulated */
+ WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
+
+ pr_info("kswapd %d initialised %lu pages in %ums\n", nid, nr_pages,
+ jiffies_to_msecs(jiffies - start));
+}
+#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+
#ifdef CONFIG_CMA
/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
void __init init_cma_reserved_pageblock(struct page *page)
@@ -4228,6 +4340,9 @@ static void setup_zone_migrate_reserve(struct zone *zone)
zone->nr_migrate_reserve_block = reserve;

for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+ if (!early_page_nid_uninitialised(pfn, zone_to_nid(zone)))
+ return;
+
if (!pfn_valid(pfn))
continue;
page = pfn_to_page(pfn);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd71bac..c4895d26d036 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3348,7 +3348,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
* If there are applications that are active memory-allocators
* (most normal use), this basically shouldn't matter.
*/
-static int kswapd(void *p)
+static int __defermem_init kswapd(void *p)
{
unsigned long order, new_order;
unsigned balanced_order;
@@ -3383,6 +3383,8 @@ static int kswapd(void *p)
tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
set_freezable();

+ deferred_init_memmap(pgdat->node_id);
+
order = new_order = 0;
balanced_order = 0;
classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
@@ -3538,7 +3540,7 @@ static int cpu_callback(struct notifier_block *nfb, unsigned long action,
* This kswapd start function will be called by init and node-hot-add.
* On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
*/
-int kswapd_run(int nid)
+int __defermem_init kswapd_run(int nid)
{
pg_data_t *pgdat = NODE_DATA(nid);
int ret = 0;
--
2.3.5

2015-04-28 14:39:10

by Mel Gorman

Subject: [PATCH 09/13] mm: meminit: Minimise number of pfn->page lookups during initialisation

Deferred struct page initialisation is using pfn_to_page() on every PFN
unnecessarily. This patch minimises the number of lookups and scheduler
checks.
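For scale (assuming the x86-64 defaults of MAX_ORDER == 11 and 4KiB pages):
MAX_ORDER_NR_PAGES is 1 << (MAX_ORDER - 1) == 1024, so pfn_valid(),
pfn_to_page() and cond_resched() now run once per naturally-aligned 4MiB
block of PFNs rather than once per page.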

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 96f2c2dc8ca6..6e366fd654e1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1092,6 +1092,7 @@ void __defermem_init deferred_init_memmap(int nid)

for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) {
unsigned long pfn, end_pfn;
+ struct page *page = NULL;

end_pfn = min(walk_end, zone_end_pfn(zone));
pfn = first_init_pfn;
@@ -1101,13 +1102,32 @@ void __defermem_init deferred_init_memmap(int nid)
pfn = zone->zone_start_pfn;

for (; pfn < end_pfn; pfn++) {
- struct page *page;
-
- if (!pfn_valid(pfn))
+ if (!pfn_valid_within(pfn))
continue;

- if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state))
+ /*
+ * Ensure pfn_valid is checked every
+ * MAX_ORDER_NR_PAGES for memory holes
+ */
+ if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
+ if (!pfn_valid(pfn)) {
+ page = NULL;
+ continue;
+ }
+ }
+
+ if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
+ page = NULL;
continue;
+ }
+
+ /* Minimise pfn page lookups and scheduler checks */
+ if (page && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0) {
+ page++;
+ } else {
+ page = pfn_to_page(pfn);
+ cond_resched();
+ }

- page = pfn_to_page(pfn);
if (page->flags) {
VM_BUG_ON(page_zone(page) != zone);
@@ -1117,7 +1137,6 @@ void __defermem_init deferred_init_memmap(int nid)
__init_single_page(page, pfn, zid, nid);
__free_pages_boot_core(page, pfn, 0);
nr_pages++;
- cond_resched();
}
first_init_pfn = max(end_pfn, first_init_pfn);
}
--
2.3.5

2015-04-28 14:38:51

by Mel Gorman

Subject: [PATCH 10/13] x86: mm: Enable deferred struct page initialisation on x86-64

Subject says it all. Other architectures may enable on a case-by-case
basis after auditing early_pfn_to_nid and testing.

Signed-off-by: Mel Gorman <[email protected]>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca55187..1beff8a8fbc9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -18,6 +18,7 @@ config X86_64
select X86_DEV_DMA_OPS
select ARCH_USE_CMPXCHG_LOCKREF
select HAVE_LIVEPATCH
+ select ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT

### Arch settings
config X86
--
2.3.5

2015-04-28 14:37:31

by Mel Gorman

Subject: [PATCH 11/13] mm: meminit: Free pages in large chunks where possible

Parallel struct page initialisation frees pages one at a time. Try to free
pages as single large pages where possible.
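For scale (assuming MAX_ORDER == 11 as on x86-64): a naturally-aligned batch
of MAX_ORDER_NR_PAGES == 1024 pages is handed to the allocator as a single
order-10 (4MiB) free instead of 1024 order-0 frees; anything unaligned or
shorter falls back to the page-at-a-time path.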

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 49 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e366fd654e1..2200b7473b5a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1063,6 +1063,25 @@ void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn,
}

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+static void __defermem_init deferred_free_range(struct page *page,
+ unsigned long pfn, int nr_pages)
+{
+ int i;
+
+ if (!page)
+ return;
+
+ /* Free a large naturally-aligned chunk if possible */
+ if (nr_pages == MAX_ORDER_NR_PAGES &&
+ (pfn & (MAX_ORDER_NR_PAGES-1)) == 0) {
+ __free_pages_boot_core(page, pfn, MAX_ORDER-1);
+ return;
+ }
+
+ for (i = 0; i < nr_pages; i++, page++, pfn++)
+ __free_pages_boot_core(page, pfn, 0);
+}
+
/* Initialise remaining memory on a node */
void __defermem_init deferred_init_memmap(int nid)
{
@@ -1093,6 +1112,9 @@ void __defermem_init deferred_init_memmap(int nid)
for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) {
unsigned long pfn, end_pfn;
struct page *page = NULL;
+ struct page *free_base_page = NULL;
+ unsigned long free_base_pfn = 0;
+ int nr_to_free = 0;

end_pfn = min(walk_end, zone_end_pfn(zone));
pfn = first_init_pfn;
@@ -1103,7 +1125,7 @@ void __defermem_init deferred_init_memmap(int nid)

for (; pfn < end_pfn; pfn++) {
if (!pfn_valid_within(pfn))
- continue;
+ goto free_range;

/*
* Ensure pfn_valid is checked every
@@ -1112,32 +1134,53 @@ void __defermem_init deferred_init_memmap(int nid)
if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
if (!pfn_valid(pfn)) {
page = NULL;
- continue;
+ goto free_range;
}
}

if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
page = NULL;
- continue;
+ goto free_range;
}

/* Minimise pfn page lookups and scheduler checks */
if (page && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0) {
page++;
} else {
+ nr_pages += nr_to_free;
+ deferred_free_range(free_base_page,
+ free_base_pfn, nr_to_free);
+ free_base_page = NULL;
+ free_base_pfn = nr_to_free = 0;
+
page = pfn_to_page(pfn);
cond_resched();
}

if (page->flags) {
VM_BUG_ON(page_zone(page) != zone);
- continue;
+ goto free_range;
}

__init_single_page(page, pfn, zid, nid);
- __free_pages_boot_core(page, pfn, 0);
- nr_pages++;
+ if (!free_base_page) {
+ free_base_page = page;
+ free_base_pfn = pfn;
+ nr_to_free = 0;
+ }
+ nr_to_free++;
+
+ /* Where possible, batch up pages for a single free */
+ continue;
+free_range:
+ /* Free the current block of pages to allocator */
+ nr_pages += nr_to_free;
+ deferred_free_range(free_base_page, free_base_pfn,
+ nr_to_free);
+ free_base_page = NULL;
+ free_base_pfn = nr_to_free = 0;
}
+
first_init_pfn = max(end_pfn, first_init_pfn);
}

--
2.3.5

2015-04-28 14:37:34

by Mel Gorman

Subject: [PATCH 12/13] mm: meminit: Reduce number of times pageblocks are set during struct page init

During parallel struct page initialisation, ranges are checked for every
PFN unnecessarily, which increases boot times. This patch alters when the
range checks are made.
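For scale: with 2MiB pageblocks (the x86-64 default when hugepages are
configured, pageblock_nr_pages == 512), the migratetype is now set once per
pageblock, and only at pageblock-aligned PFNs, instead of the zone-range
check running for every single PFN.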

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 46 ++++++++++++++++++++++++----------------------
1 file changed, 24 insertions(+), 22 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2200b7473b5a..313f4a5a3907 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -852,33 +852,12 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
{
- struct zone *z = &NODE_DATA(nid)->node_zones[zone];
-
set_page_links(page, zone, nid, pfn);
mminit_verify_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);

- /*
- * Mark the block movable so that blocks are reserved for
- * movable at startup. This will force kernel allocations
- * to reserve their blocks rather than leaking throughout
- * the address space during boot when many long-lived
- * kernel allocations are made. Later some blocks near
- * the start are marked MIGRATE_RESERVE by
- * setup_zone_migrate_reserve()
- *
- * bitmap is created for zone's valid pfn range. but memmap
- * can be created for invalid pages (for alignment)
- * check here not to call set_pageblock_migratetype() against
- * pfn out of zone.
- */
- if ((z->zone_start_pfn <= pfn)
- && (pfn < zone_end_pfn(z))
- && !(pfn & (pageblock_nr_pages - 1)))
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-
INIT_LIST_HEAD(&page->lru);
#ifdef WANT_PAGE_VIRTUAL
/* The shift won't overflow because ZONE_NORMAL is below 4G. */
@@ -1074,6 +1053,7 @@ static void __defermem_init deferred_free_range(struct page *page,
/* Free a large naturally-aligned chunk if possible */
if (nr_pages == MAX_ORDER_NR_PAGES &&
(pfn & (MAX_ORDER_NR_PAGES-1)) == 0) {
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
__free_pages_boot_core(page, pfn, MAX_ORDER-1);
return;
}
@@ -4492,7 +4472,29 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
&nr_initialised))
break;
}
- __init_single_pfn(pfn, zone, nid);
+
+ /*
+ * Mark the block movable so that blocks are reserved for
+ * movable at startup. This will force kernel allocations
+ * to reserve their blocks rather than leaking throughout
+ * the address space during boot when many long-lived
+ * kernel allocations are made. Later some blocks near
+ * the start are marked MIGRATE_RESERVE by
+ * setup_zone_migrate_reserve()
+ *
+ * bitmap is created for zone's valid pfn range. but memmap
+ * can be created for invalid pages (for alignment)
+ * check here not to call set_pageblock_migratetype() against
+ * pfn out of zone.
+ */
+ if (!(pfn & (pageblock_nr_pages - 1))) {
+ struct page *page = pfn_to_page(pfn);
+
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+ __init_single_page(page, pfn, zone, nid);
+ } else {
+ __init_single_pfn(pfn, zone, nid);
+ }
}
}

--
2.3.5

2015-04-28 14:38:32

by Mel Gorman

Subject: [PATCH 13/13] mm: meminit: Remove mminit_verify_page_links

mminit_verify_page_links() is an extremely paranoid check that was introduced
when memory initialisation was being heavily reworked. Profiles indicated
that up to 10% of parallel memory initialisation was spent on checking
this for every page. The cost could be reduced but in practice this check
only found problems very early during the initialisation rewrite and has
found nothing since. This patch removes an expensive unnecessary check.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/internal.h | 8 --------
mm/mm_init.c | 8 --------
mm/page_alloc.c | 1 -
3 files changed, 17 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index bed751a7ac42..467a93e6a7b1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -360,10 +360,7 @@ do { \
} while (0)

extern void mminit_verify_pageflags_layout(void);
-extern void mminit_verify_page_links(struct page *page,
- enum zone_type zone, unsigned long nid, unsigned long pfn);
extern void mminit_verify_zonelist(void);
-
#else

static inline void mminit_dprintk(enum mminit_level level,
@@ -375,11 +372,6 @@ static inline void mminit_verify_pageflags_layout(void)
{
}

-static inline void mminit_verify_page_links(struct page *page,
- enum zone_type zone, unsigned long nid, unsigned long pfn)
-{
-}
-
static inline void mminit_verify_zonelist(void)
{
}
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 28fbf87b20aa..fdadf918de76 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -131,14 +131,6 @@ void __init mminit_verify_pageflags_layout(void)
BUG_ON(or_mask != add_mask);
}

-void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone,
- unsigned long nid, unsigned long pfn)
-{
- BUG_ON(page_to_nid(page) != nid);
- BUG_ON(page_zonenum(page) != zone);
- BUG_ON(page_to_pfn(page) != pfn);
-}
-
static __init int set_mminit_loglevel(char *str)
{
get_option(&str, &mminit_loglevel);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 313f4a5a3907..9c8f2a72263d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -853,7 +853,6 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
{
set_page_links(page, zone, nid, pfn);
- mminit_verify_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);
--
2.3.5

2015-04-28 16:06:08

by Pekka Enberg

Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On Tue, Apr 28, 2015 at 5:36 PM, Mel Gorman <[email protected]> wrote:
> Struct page initialisation had been identified as one of the reasons why
> large machines take a long time to boot. Patches were posted a long time ago
> to defer initialisation until they were first used. This was rejected on
> the grounds it should not be necessary to hurt the fast paths. This series
> reuses much of the work from that time but defers the initialisation of
> memory to kswapd so that one thread per node initialises memory local to
> that node.
>
> After applying the series and setting the appropriate Kconfig variable I
> see this in the boot log on a 64G machine
>
> [ 7.383764] kswapd 0 initialised deferred memory in 188ms
> [ 7.404253] kswapd 1 initialised deferred memory in 208ms
> [ 7.411044] kswapd 3 initialised deferred memory in 216ms
> [ 7.411551] kswapd 2 initialised deferred memory in 216ms
>
> On a 1TB machine, I see
>
> [ 8.406511] kswapd 3 initialised deferred memory in 1116ms
> [ 8.428518] kswapd 1 initialised deferred memory in 1140ms
> [ 8.435977] kswapd 0 initialised deferred memory in 1148ms
> [ 8.437416] kswapd 2 initialised deferred memory in 1148ms
>
> Once booted the machine appears to work as normal. Boot times were measured
> from the time shutdown was called until ssh was available again. In the
> 64G case, the boot time savings are negligible. On the 1TB machine, the
> savings were 16 seconds.

FWIW,

Acked-by: Pekka Enberg <[email protected]>

for the whole series.

- Pekka

2015-04-28 18:38:25

by Nathan Zimmer

Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On an older 8 TB box with lots and lots of cpus, the boot time, as
measured from grub to login prompt, improved from 1484 seconds to
exactly 1000 seconds.

I have time on a 16 TB box tonight and a 12 TB box Thursday and will
hopefully have more numbers then.



On 04/28/2015 11:06 AM, Pekka Enberg wrote:
> On Tue, Apr 28, 2015 at 5:36 PM, Mel Gorman <[email protected]> wrote:
>> Struct page initialisation had been identified as one of the reasons why
>> large machines take a long time to boot. Patches were posted a long time ago
>> to defer initialisation until they were first used. This was rejected on
>> the grounds it should not be necessary to hurt the fast paths. This series
>> reuses much of the work from that time but defers the initialisation of
>> memory to kswapd so that one thread per node initialises memory local to
>> that node.
>>
>> After applying the series and setting the appropriate Kconfig variable I
>> see this in the boot log on a 64G machine
>>
>> [ 7.383764] kswapd 0 initialised deferred memory in 188ms
>> [ 7.404253] kswapd 1 initialised deferred memory in 208ms
>> [ 7.411044] kswapd 3 initialised deferred memory in 216ms
>> [ 7.411551] kswapd 2 initialised deferred memory in 216ms
>>
>> On a 1TB machine, I see
>>
>> [ 8.406511] kswapd 3 initialised deferred memory in 1116ms
>> [ 8.428518] kswapd 1 initialised deferred memory in 1140ms
>> [ 8.435977] kswapd 0 initialised deferred memory in 1148ms
>> [ 8.437416] kswapd 2 initialised deferred memory in 1148ms
>>
>> Once booted the machine appears to work as normal. Boot times were measured
>> from the time shutdown was called until ssh was available again. In the
>> 64G case, the boot time savings are negligible. On the 1TB machine, the
>> savings were 16 seconds.
> FWIW,
>
> Acked-by: Pekka Enberg <[email protected]>
>
> for the whole series.
>
> - Pekka

2015-04-29 01:16:09

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On 04/28/2015 10:36 AM, Mel Gorman wrote:
> [ ... changelogs and timing figures snipped ... ]
> It would be nice if the people that have access to really large machines
> would test this series and report how much boot time is reduced.
>
>

I ran a bootup timing test on a 12-TB 16-socket IvyBridge-EX system.
From grub menu to ssh login, the bootup time was 453s before the patch
and 265s after the patch - a saving of 188s (42%). I used a different OS
environment and config file with this test and so the timing data
weren't comparable with my previous testing data. The kswapd log entries
were

[ 45.973967] kswapd 4 initialised 197655470 pages in 4390ms
[ 45.974214] kswapd 7 initialised 197655470 pages in 4390ms
[ 45.976692] kswapd 15 initialised 197654299 pages in 4390ms
[ 45.993284] kswapd 0 initialised 197131131 pages in 4410ms
[ 46.032735] kswapd 9 initialised 197655470 pages in 4447ms
[ 46.065856] kswapd 8 initialised 197655470 pages in 4481ms
[ 46.066615] kswapd 1 initialised 197622702 pages in 4483ms
[ 46.077995] kswapd 2 initialised 197655470 pages in 4495ms
[ 46.219508] kswapd 13 initialised 197655470 pages in 4633ms
[ 46.224358] kswapd 3 initialised 197655470 pages in 4641ms
[ 46.228441] kswapd 11 initialised 197655470 pages in 4643ms
[ 46.232258] kswapd 12 initialised 197655470 pages in 4647ms
[ 46.239659] kswapd 10 initialised 197655470 pages in 4654ms
[ 46.243402] kswapd 14 initialised 197655470 pages in 4657ms
[ 46.250368] kswapd 5 initialised 197655470 pages in 4666ms
[ 46.254659] kswapd 6 initialised 197655470 pages in 4670ms

Cheers,
Longman

2015-04-29 21:19:04

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set

On Tue, 28 Apr 2015 15:37:04 +0100 Mel Gorman <[email protected]> wrote:

> +/*
> + * Deferred struct page initialisation requires some early init functions that
> + * are removed before kswapd is up and running. The feature depends on memory
> + * hotplug so put the data and code required by deferred initialisation into
> + * the __meminit section where they are preserved.
> + */
> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +#define __defermem_init __meminit
> +#define __defer_init __meminit
> +#else
> +#define __defermem_init
> +#define __defer_init __init
> +#endif

I still don't get it :(

__defermem_init:

if (CONFIG_DEFERRED_STRUCT_PAGE_INIT) {
if (CONFIG_MEMORY_HOTPLUG)
retain
} else {
retain
}

but CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on
CONFIG_MEMORY_HOTPLUG, so this becomes

if (CONFIG_DEFERRED_STRUCT_PAGE_INIT) {
retain
} else {
retain
}

which becomes

retain

so why does __defermem_init exist?



__defer_init:

if (CONFIG_DEFERRED_STRUCT_PAGE_INIT) {
if (CONFIG_MEMORY_HOTPLUG)
retain
} else {
discard
}

becomes

if (CONFIG_DEFERRED_STRUCT_PAGE_INIT) {
retain
} else {
discard
}

this one makes sense, but could be documented much more clearly!


And why does the comment refer to "and data". There is no
__defer_initdata, etc. Just not needed yet?

2015-04-30 08:46:10

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set

On Wed, Apr 29, 2015 at 02:19:01PM -0700, Andrew Morton wrote:
> On Tue, 28 Apr 2015 15:37:04 +0100 Mel Gorman <[email protected]> wrote:
>
> > +/*
> > + * Deferred struct page initialisation requires some early init functions that
> > + * are removed before kswapd is up and running. The feature depends on memory
> > + * hotplug so put the data and code required by deferred initialisation into
> > + * the __meminit section where they are preserved.
> > + */
> > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> > +#define __defermem_init __meminit
> > +#define __defer_init __meminit
> > +#else
> > +#define __defermem_init
> > +#define __defer_init __init
> > +#endif
>
> I still don't get it :(
>

This version was sent out at roughly the same minute as you asked the
last time, so the comment had not been updated yet. I suggested this as
a possible alternative.

/*
* Deferred struct page initialisation requires init functions that are freed
* before kswapd is available. Reuse the memory hotplug section annotation
* to mark the required code.
*
* __defermem_init is code that always exists but is annotated __meminit to
* avoid section warnings.
* __defer_init code gets marked __meminit when deferring struct page
* initialisation but is otherwise in the init section.
*/

Suggestions on better names are welcome.
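
To illustrate the intent, this is roughly how the two annotations
would be applied (function names invented for the example, not taken
from the series):

/* Called from kswapd after the init sections are freed, so it must
 * never be discarded; __meminit only serves to keep modpost quiet
 * when it references other __meminit code. */
static void __defermem_init deferred_init_node(int nid)
{
}

/* Only runs before deferred initialisation completes, so it can be
 * discarded with the rest of .init.text when the feature is off. */
static void __defer_init deferred_init_range(unsigned long start_pfn,
					     unsigned long end_pfn)
{
}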

> __defermem_init:
>
> if (CONFIG_DEFERRED_STRUCT_PAGE_INIT) {
> if (CONFIG_MEMORY_HOTPLUG)
> retain
> } else {
> retain
> }
>
> but CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on
> CONFIG_MEMORY_HOTPLUG, so this becomes
>
> if (CONFIG_DEFERRED_STRUCT_PAGE_INIT) {
> retain
> } else {
> retain
> }
>
> which becomes
>
> retain
>
> so why does __defermem_init exist?
>

It suppresses section warnings. Another possibility is that I get rid of
it entirely and use __refok but I feared that it might hide a real problem
in the future.
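
For reference, the class of warning being suppressed looks like this
(a contrived example, not code from the series):

static void __meminit helper(void)
{
}

/* With CONFIG_MEMORY_HOTPLUG=n, .meminit.text is discarded after
 * boot, so modpost warns about this reference from plain .text: */
void plain_caller(void)
{
	helper();
}

/* No warning: caller and callee are in the same section class. */
void __meminit annotated_caller(void)
{
	helper();
}

A __ref-style annotation on plain_caller() would also silence the
check, but unconditionally, which is the "hide a real problem"
concern.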

--
Mel Gorman
SUSE Labs

2015-04-30 16:10:53

by Daniel J Blueman

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On Wed, Apr 29, 2015 at 2:38 AM, nzimmer <[email protected]> wrote:
> On 04/28/2015 11:06 AM, Pekka Enberg wrote:
>> On Tue, Apr 28, 2015 at 5:36 PM, Mel Gorman <[email protected]> wrote:
>>> [ ... cover letter and timing figures snipped ... ]

> On an older 8 TB box with lots and lots of cpus, the boot time, as
> measured from grub to login prompt, improved from 1484 seconds to
> exactly 1000 seconds.
>
> I have time on a 16 TB box tonight and a 12 TB box Thursday and will
> hopefully have more numbers then.

Neat, and a roughly similar picture here.

On a 7TB, 1728-core NumaConnect system with 108 NUMA nodes, we're
seeing stock 4.0 boot in 7136s. This drops to 2159s, or a 70% reduction
with this patchset. Non-temporal PMD init [1] drops this to 1045s.

Nathan, what do you guys see with the non-temporal PMD patch [1]? Do
add an sfence at the ende label if you manually patch.
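
For anyone who has not seen the technique, here is a minimal
userspace sketch of a non-temporal fill (the actual patch [1] does
this in kernel asm on PMD-sized chunks; this only shows the idea and
the closing sfence):

#include <emmintrin.h>
#include <stddef.h>

/* Zero a 16-byte-aligned buffer with non-temporal stores so the fill
 * bypasses the cache. The sfence orders the streaming stores before
 * any later access to the memory. */
static void nt_zero(void *buf, size_t len)
{
	__m128i zero = _mm_setzero_si128();
	size_t i;

	for (i = 0; i + 16 <= len; i += 16)
		_mm_stream_si128((__m128i *)((char *)buf + i), zero);

	_mm_sfence();
}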

Thanks!
Daniel

[1] https://lkml.org/lkml/2015/4/23/350

2015-04-30 17:12:56

by Nathan Zimmer

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On 04/30/2015 11:10 AM, Daniel J Blueman wrote:
> On Wed, Apr 29, 2015 at 2:38 AM, nzimmer <[email protected]> wrote:
>> On 04/28/2015 11:06 AM, Pekka Enberg wrote:
>>> On Tue, Apr 28, 2015 at 5:36 PM, Mel Gorman <[email protected]> wrote:
>>>> [ ... cover letter and timing figures snipped ... ]
>
>> On an older 8 TB box with lots and lots of cpus, the boot time, as
>> measured from grub to login prompt, improved from 1484 seconds to
>> exactly 1000 seconds.
>>
>> I have time on a 16 TB box tonight and a 12 TB box Thursday and will
>> hopefully have more numbers then.
>
> Neat, and a roughly similar picture here.
>
> On a 7TB, 1728-core NumaConnect system with 108 NUMA nodes, we're
> seeing stock 4.0 boot in 7136s. This drops to 2159s, or a 70%
> reduction with this patchset. Non-temporal PMD init [1] drops this to
> 1045s.
>
> Nathan, what do you guys see with the non-temporal PMD patch [1]? Do
> add an sfence at the ende label if you manually patch.
>

I have not tried the non-temporal patch yet, Daniel.
I will give that a go when I can grab more machine time but that
probably won't be today.

> Thanks!
> Daniel
>
> [1] https://lkml.org/lkml/2015/4/23/350
>

More numbers, including my first set.

My numbers are from grub prompt to login prompt.
All times are in seconds.
The configs are very much like the ones found in sles but with
RCU_FANOUT_LEAF=64 instead of 16. Large core count boxes benefit
from this quite a bit.

Older 8 TB box (128 nodes)
1484s -> 1000s (yes exactly)

32TB box (128 nodes)
4890s -> 1240s

Recent 12 TB box (32 nodes)
598s -> 450s

I am inferring from these numbers and others that memory locality is a
big part of the win.

Out of curiosity, has anyone run any tests post boot?

2015-04-30 17:28:33

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On Thu, Apr 30, 2015 at 12:12:50PM -0500, nzimmer wrote:
>
> Out of curiosity, has anyone run any tests post boot?
>

Some functional tests only to exercise the machine and see if anything
blew up. It looked fine to me at least.

--
Mel Gorman
SUSE Labs

2015-04-30 21:53:49

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 06/13] mm: meminit: Inline some helper functions

On Tue, 28 Apr 2015 15:37:03 +0100 Mel Gorman <[email protected]> wrote:

> early_pfn_in_nid() and meminit_pfn_in_nid() are small functions that are
> unnecessarily visible outside memory initialisation. As well as unnecessary
> visibility, it's unnecessary function call overhead when initialising pages.
> This patch moves the helpers inline.

mm/page_alloc.c: In function 'memmap_init_zone':
mm/page_alloc.c:4287: error: implicit declaration of function 'early_pfn_in_nid'

--- a/mm/page_alloc.c~mm-meminit-inline-some-helper-functions-fix
+++ a/mm/page_alloc.c
@@ -950,8 +950,16 @@ static inline bool __meminit early_pfn_i
{
return meminit_pfn_in_nid(pfn, node, &early_pfnnid_cache);
}
+
+#else
+
+static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
+{
+ return true;
+}
#endif

+
#ifdef CONFIG_CMA
/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
void __init init_cma_reserved_pageblock(struct page *page)


allmodconfig. It's odd that nobody else hit this...

2015-04-30 21:55:34

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 06/13] mm: meminit: Inline some helper functions

On Thu, 30 Apr 2015 14:53:46 -0700 Andrew Morton <[email protected]> wrote:

> allmodconfig. It's odd that nobody else hit this...

err, it's allnoconfig. Not odd.

It would be tiresome to mention Documentation/SubmitChecklist.

2015-05-01 09:20:39

by Mel Gorman

[permalink] [raw]
Subject: [PATCH] mm: page_alloc: pass PFN to __free_pages_bootmem -fix

Stephen Rothwell reported the following

Today's linux-next build (sparc defconfig) failed like this:

mm/bootmem.c: In function 'free_all_bootmem_core':
mm/bootmem.c:237:32: error: 'cur' undeclared (first use in this function)
__free_pages_bootmem(page++, cur++, 0);
^
Caused by commit "mm: page_alloc: pass PFN to __free_pages_bootmem".

He also merged a fix. The only difference in this version is that one
line has been moved so the final diff context is clearer. This is a fix
to the mmotm patch mm-page_alloc-pass-pfn-to-__free_pages_bootmem.patch

Signed-off-by: Mel Gorman <[email protected]>
---
mm/bootmem.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/bootmem.c b/mm/bootmem.c
index daf956bb4782..a23dd1934654 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -172,7 +172,7 @@ void __init free_bootmem_late(unsigned long physaddr, unsigned long size)
static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
{
struct page *page;
- unsigned long *map, start, end, pages, count = 0;
+ unsigned long *map, start, end, pages, cur, count = 0;

if (!bdata->node_bootmem_map)
return 0;
@@ -214,7 +214,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
count += BITS_PER_LONG;
start += BITS_PER_LONG;
} else {
- unsigned long cur = start;
+ cur = start;

start = ALIGN(start + 1, BITS_PER_LONG);
while (vec && cur != start) {
@@ -229,6 +229,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
}
}

+ cur = bdata->node_min_pfn;
page = virt_to_page(bdata->node_bootmem_map);
pages = bdata->node_low_pfn - bdata->node_min_pfn;
pages = bootmem_bootmap_pages(pages);

2015-05-01 09:21:17

by Mel Gorman

[permalink] [raw]
Subject: [PATCH] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set -fix

This is take 2 on describing why these section names exist. If accepted
then it should be considered a fix for the mmotm patch
mm-meminit-initialise-a-subset-of-struct-pages-if-config_deferred_struct_page_init-is-set.patch

Signed-off-by: Mel Gorman <[email protected]>
---
mm/internal.h | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 24314b671db1..85189fce7f61 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -386,10 +386,14 @@ static inline void mminit_verify_zonelist(void)
#endif /* CONFIG_DEBUG_MEMORY_INIT */

/*
- * Deferred struct page initialisation requires some early init functions that
- * are removed before kswapd is up and running. The feature depends on memory
- * hotplug so put the data and code required by deferred initialisation into
- * the __meminit section where they are preserved.
+ * Deferred struct page initialisation requires init functions that are freed
+ * before kswapd is available. Reuse the memory hotplug section annotation
+ * to mark the required code.
+ *
+ * __defermem_init is code that always exists but is annotated __meminit to
+ * avoid section warnings.
+ * __defer_init code gets marked __meminit when deferring struct page
+ * initialisation but is otherwise in the init section.
*/
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
#define __defermem_init __meminit

2015-05-01 09:23:36

by Mel Gorman

[permalink] [raw]
Subject: [PATCH] mm: meminit: Reduce number of times pageblocks are set during struct page init -fix


The patch "mm: meminit: Reduce number of times pageblocks are
set during struct page init" is setting a pageblock before
the page is initialised. This is a fix for the mmotm patch
mm-meminit-reduce-number-of-times-pageblocks-are-set-during-struct-page-init.patch

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 19aac687963c..544edb3b8da2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4497,8 +4497,8 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
if (!(pfn & (pageblock_nr_pages - 1))) {
struct page *page = pfn_to_page(pfn);

- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
__init_single_page(page, pfn, zone, nid);
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
} else {
__init_single_pfn(pfn, zone, nid);
}

2015-05-01 22:02:49

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On 04/28/2015 09:16 PM, Waiman Long wrote:
> On 04/28/2015 10:36 AM, Mel Gorman wrote:
>> [ ... cover letter and timing figures snipped ... ]
>
> I ran a bootup timing test on a 12-TB 16-socket IvyBridge-EX system.
> From grub menu to ssh login, the bootup time was 453s before the patch
> and 265s after the patch - a saving of 188s (42%). I used a different
> OS environment and config file with this test and so the timing data
> weren't comparable with my previous testing data. The kswapd log
> entries were
>
> [ 45.973967] kswapd 4 initialised 197655470 pages in 4390ms
> [ 45.974214] kswapd 7 initialised 197655470 pages in 4390ms
> [ 45.976692] kswapd 15 initialised 197654299 pages in 4390ms
> [ 45.993284] kswapd 0 initialised 197131131 pages in 4410ms
> [ 46.032735] kswapd 9 initialised 197655470 pages in 4447ms
> [ 46.065856] kswapd 8 initialised 197655470 pages in 4481ms
> [ 46.066615] kswapd 1 initialised 197622702 pages in 4483ms
> [ 46.077995] kswapd 2 initialised 197655470 pages in 4495ms
> [ 46.219508] kswapd 13 initialised 197655470 pages in 4633ms
> [ 46.224358] kswapd 3 initialised 197655470 pages in 4641ms
> [ 46.228441] kswapd 11 initialised 197655470 pages in 4643ms
> [ 46.232258] kswapd 12 initialised 197655470 pages in 4647ms
> [ 46.239659] kswapd 10 initialised 197655470 pages in 4654ms
> [ 46.243402] kswapd 14 initialised 197655470 pages in 4657ms
> [ 46.250368] kswapd 5 initialised 197655470 pages in 4666ms
> [ 46.254659] kswapd 6 initialised 197655470 pages in 4670ms
>
> Cheers,
> Longman

Bad news!

I tried your patch on a 24-TB DragonHawk and got an out of memory panic.
The kernel log messages were:
:
[ 80.126186] CPU 474: hi: 186, btch: 31 usd: 0
[ 80.131457] CPU 475: hi: 186, btch: 31 usd: 0
[ 80.136726] CPU 476: hi: 186, btch: 31 usd: 0
[ 80.141997] CPU 477: hi: 186, btch: 31 usd: 0
[ 80.147267] CPU 478: hi: 186, btch: 31 usd: 0
[ 80.152538] CPU 479: hi: 186, btch: 31 usd: 0
[ 80.157813] active_anon:0 inactive_anon:0 isolated_anon:0
[ 80.157813] active_file:0 inactive_file:0 isolated_file:0
[ 80.157813] unevictable:0 dirty:0 writeback:0 unstable:0
[ 80.157813] free:209 slab_reclaimable:7 slab_unreclaimable:42986
[ 80.157813] mapped:0 shmem:0 pagetables:0 bounce:0
[ 80.157813] free_cma:0
[ 80.190428] Node 0 DMA free:568kB min:0kB low:0kB high:0kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB
managed:15896kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB
slab_reclaimable:0kB slab_unreclaimable:14928kB kernel_stack:400kB
pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
pages_scanned:0 all_unreclaimable? yes
[ 80.233475] lowmem_reserve[]: 0 0 0 0
[ 80.237542] Node 0 DMA32 free:20kB min:0kB low:0kB high:0kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1961924kB
managed:1333604kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB
shmem:0kB slab_reclaimable:12kB slab_unreclaimable:101664kB
kernel_stack:50176kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB
writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 80.281456] lowmem_reserve[]: 0 0 0 0
[ 80.285527] Node 0 Normal free:0kB min:0kB low:0kB high:0kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1608515580kB managed:2097148kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4kB
slab_unreclaimable:948kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? yes
[ 80.328958] lowmem_reserve[]: 0 0 0 0
[ 80.333031] Node 1 Normal free:248kB min:0kB low:0kB high:0kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1610612732kB managed:2228220kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:12kB
slab_unreclaimable:46240kB kernel_stack:3232kB pagetables:0kB
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? yes
[ 80.377256] lowmem_reserve[]: 0 0 0 0
[ 80.381325] Node 2 Normal free:0kB min:0kB low:0kB high:0kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:612kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? yes
[ 80.424764] lowmem_reserve[]: 0 0 0 0
[ 80.428842] Node 3 Normal free:0kB min:0kB low:0kB high:0kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:600kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? yes
[ 80.472293] lowmem_reserve[]: 0 0 0 0
[ 80.476360] Node 4 Normal free:0kB min:0kB low:0kB high:0kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:620kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? yes
[ 80.519803] lowmem_reserve[]: 0 0 0 0
[ 80.523875] Node 5 Normal free:0kB min:0kB low:0kB high:0kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:584kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? yes
[ 80.567312] lowmem_reserve[]: 0 0 0 0
[ 80.571379] Node 6 Normal free:0kB min:0kB low:0kB high:0kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:556kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? yes
[ 80.614814] lowmem_reserve[]: 0 0 0 0
[ 80.618881] Node 7 Normal free:0kB min:0kB low:0kB high:0kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:556kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? yes
[ 80.662316] lowmem_reserve[]: 0 0 0 0
[ 80.666390] Node 8 Normal free:0kB min:0kB low:0kB high:0kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:572kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? yes
[ 80.709827] lowmem_reserve[]: 0 0 0 0
[ 80.713898] Node 9 Normal free:0kB min:0kB low:0kB high:0kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:572kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? yes
[ 80.757336] lowmem_reserve[]: 0 0 0 0
[ 80.761407] Node 10 Normal free:0kB min:0kB low:0kB high:0kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:564kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? yes
[ 80.804941] lowmem_reserve[]: 0 0 0 0
[ 80.809015] Node 11 Normal free:0kB min:0kB low:0kB high:0kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:572kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? yes
[ 80.852548] lowmem_reserve[]: 0 0 0 0
[ 80.856620] Node 12 Normal free:0kB min:0kB low:0kB high:0kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:616kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? yes
[ 80.900158] lowmem_reserve[]: 0 0 0 0
[ 80.904236] Node 13 Normal free:0kB min:0kB low:0kB high:0kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:592kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? yes
[ 80.947765] lowmem_reserve[]: 0 0 0 0
[ 80.951847] Node 14 Normal free:0kB min:0kB low:0kB high:0kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:600kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? yes
[ 80.995380] lowmem_reserve[]: 0 0 0 0
[ 80.999448] Node 15 Normal free:0kB min:0kB low:0kB high:0kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:548kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? yes
[ 81.042974] lowmem_reserve[]: 0 0 0 0
[ 81.047044] Node 0 DMA: 132*4kB (U) 5*8kB (U) 0*16kB 0*32kB 0*64kB
0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 568kB
[ 81.059632] Node 0 DMA32: 5*4kB (U) 0*8kB 0*16kB 0*32kB 0*64kB
0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20kB
[ 81.071733] Node 0 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB
0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 81.083443] Node 1 Normal: 52*4kB (U) 5*8kB (U) 0*16kB 0*32kB 0*64kB
0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 248kB
[ 81.096227] Node 2 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB
0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 81.107935] Node 3 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB
0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 81.119643] Node 4 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB
0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 81.131347] Node 5 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB
0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 81.143056] Node 6 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB
0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 81.154767] Node 7 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB
0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 81.166473] Node 8 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB
0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 81.178179] Node 9 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB
0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 81.189893] Node 10 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB
0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 81.201695] Node 11 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB
0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 81.213496] Node 12 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB
0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 81.225324] Node 13 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB
0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 81.237130] Node 14 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB
0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 81.248926] Node 15 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB
0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 81.260726] 0 total pagecache pages
[ 81.264565] 0 pages in swap cache
[ 81.268212] Swap cache stats: add 0, delete 0, find 0/0
[ 81.273962] Free swap = 0kB
[ 81.277125] Total swap = 0kB
[ 81.280341] 6442421132 pages RAM
[ 81.283888] 0 pages HighMem/MovableOnly
[ 81.288109] 6433662383 pages reserved
[ 81.292135] 0 pages hwpoisoned
[ 81.295491] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds
swapents oom_score_adj name
[ 81.305245] Kernel panic - not syncing: Out of memory and no killable
processes...
[ 81.305245]
[ 81.315200] CPU: 240 PID: 1 Comm: swapper/0 Not tainted
4.0.1-pmm-bigsmp #1
[ 81.322856] Hardware name: HP Superdome2 16s x86, BIOS Bundle:
006.000.042 SFW: 015.099.000 04/01/2015
[ 81.333096] 0000000000000000 ffff8800044c79c8 ffffffff8151b0c9
ffff8800044c7a48
[ 81.341262] ffffffff8151ae1e 0000000000000008 ffff8800044c7a58
ffff8800044c79f8
[ 81.349428] ffffffff810785c3 ffffffff81a13480 0000000000000000
ffff8800001001d0
[ 81.357595] Call Trace:
[ 81.360287] [<ffffffff8151b0c9>] dump_stack+0x68/0x77
[ 81.365942] [<ffffffff8151ae1e>] panic+0xb9/0x219
[ 81.371213] [<ffffffff810785c3>] ?
__blocking_notifier_call_chain+0x63/0x80
[ 81.378971] [<ffffffff811384ce>] __out_of_memory+0x34e/0x350
[ 81.385292] [<ffffffff811385ee>] out_of_memory+0x5e/0x90
[ 81.391230] [<ffffffff8113ce9e>] __alloc_pages_slowpath+0x6be/0x740
[ 81.398219] [<ffffffff8113d15c>] __alloc_pages_nodemask+0x23c/0x250
[ 81.405212] [<ffffffff81186346>] kmem_getpages+0x56/0x110
[ 81.411246] [<ffffffff81187f44>] fallback_alloc+0x164/0x200
[ 81.417474] [<ffffffff81187cfd>] ____cache_alloc_node+0x8d/0x170
[ 81.424179] [<ffffffff811887bb>] kmem_cache_alloc_trace+0x17b/0x240
[ 81.431169] [<ffffffff813d5f3a>] init_memory_block+0x3a/0x110
[ 81.437586] [<ffffffff81b5f687>] memory_dev_init+0xd7/0x13d
[ 81.443810] [<ffffffff81b5f2af>] driver_init+0x2f/0x37
[ 81.449556] [<ffffffff81b1599b>] do_basic_setup+0x29/0xd5
[ 81.455597] [<ffffffff81b372c4>] ? sched_init_smp+0x140/0x147
[ 81.462015] [<ffffffff81b15c55>] kernel_init_freeable+0x20e/0x297
[ 81.468815] [<ffffffff81512ea0>] ? rest_init+0x80/0x80
[ 81.474565] [<ffffffff81512ea9>] kernel_init+0x9/0xf0
[ 81.480216] [<ffffffff8151f788>] ret_from_fork+0x58/0x90
[ 81.486156] [<ffffffff81512ea0>] ? rest_init+0x80/0x80
[ 81.492350] ---[ end Kernel panic - not syncing: Out of memory and no
killable processes...
[ 81.492350]

-Longman



2015-05-02 00:09:27

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On 05/01/2015 06:02 PM, Waiman Long wrote:
>
> Bad news!
>
> I tried your patch on a 24-TB DragonHawk and got an out of memory
> panic. The kernel log messages were:
> :
> [ ... OOM panic log and backtrace snipped; identical to the log in
> the message quoted above ... ]
>
> -Longman

I increased the pre-initialized memory per node in update_defer_init()
of mm/page_alloc.c from 2G to 4G. Now I am able to boot the 24-TB
machine without error. The 12-TB machine has 0.75TB/node, while the
24-TB machine has 1.5TB/node. I would suggest pre-initializing
something like 1G per 0.25TB/node so that it scales properly with the
memory size.
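
A minimal sketch of that scaling rule (the helper name and the exact
constants are mine, not the mmotm code):

/* Pre-initialise roughly 1G of struct pages per 0.25TB of node
 * memory, keeping the current 2G as a floor, instead of a fixed
 * 2G per node. */
static unsigned long __init defer_init_pages(unsigned long node_pages)
{
	unsigned long pages_per_gb = (1UL << 30) >> PAGE_SHIFT;
	unsigned long quarter_tb = 256 * pages_per_gb;

	return max(2 * pages_per_gb, (node_pages / quarter_tb) * pages_per_gb);
}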

Before the patch, the boot time from elilo prompt to ssh login was 694s.
After the patch, the boot up time was 346s, a saving of 348s (about 50%).

Cheers,
Longman

2015-05-02 08:52:38

by Daniel J Blueman

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On Sat, May 2, 2015 at 8:09 AM, Waiman Long <[email protected]> wrote:
> On 05/01/2015 06:02 PM, Waiman Long wrote:
>>
>> Bad news!
>>
>> I tried your patch on a 24-TB DragonHawk and got an out of memory
>> panic. The kernel log messages were:
>> :
>> [ 80.126186] CPU 474: hi: 186, btch: 31 usd: 0
>> [ 80.131457] CPU 475: hi: 186, btch: 31 usd: 0
>> [ 80.136726] CPU 476: hi: 186, btch: 31 usd: 0
>> [ 80.141997] CPU 477: hi: 186, btch: 31 usd: 0
>> [ 80.147267] CPU 478: hi: 186, btch: 31 usd: 0
>> [ 80.152538] CPU 479: hi: 186, btch: 31 usd: 0
>> [ 80.157813] active_anon:0 inactive_anon:0 isolated_anon:0
>> [ 80.157813] active_file:0 inactive_file:0 isolated_file:0
>> [ 80.157813] unevictable:0 dirty:0 writeback:0 unstable:0
>> [ 80.157813] free:209 slab_reclaimable:7 slab_unreclaimable:42986
>> [ 80.157813] mapped:0 shmem:0 pagetables:0 bounce:0
>> [ 80.157813] free_cma:0
>> [ 80.190428] Node 0 DMA free:568kB min:0kB low:0kB high:0kB
>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> present:15988kB managed:15896kB mlocked:0kB dirty:0kB writeback:0kB
>> mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:14928kB
>> kernel_stack:400kB pagetables:0kB unstable:0kB bounce:0kB
>> free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
>> [ 80.233475] lowmem_reserve[]: 0 0 0 0
>> [ 80.237542] Node 0 DMA32 free:20kB min:0kB low:0kB high:0kB
>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> present:1961924kB managed:1333604kB mlocked:0kB dirty:0kB
>> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:12kB
>> slab_unreclaimable:101664kB kernel_stack:50176kB pagetables:0kB
>> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
>> pages_scanned:0 all_unreclaimable? yes
>> [ 80.281456] lowmem_reserve[]: 0 0 0 0
>> [ 80.285527] Node 0 Normal free:0kB min:0kB low:0kB high:0kB
>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> present:1608515580kB managed:2097148kB mlocked:0kB dirty:0kB
>> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4kB
>> slab_unreclaimable:948kB kernel_stack:0kB pagetables:0kB
>> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
>> pages_scanned:0 all_unreclaimable? yes
>> [ 80.328958] lowmem_reserve[]: 0 0 0 0
>> [ 80.333031] Node 1 Normal free:248kB min:0kB low:0kB high:0kB
>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> present:1610612732kB managed:2228220kB mlocked:0kB dirty:0kB
>> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:12kB
>> slab_unreclaimable:46240kB kernel_stack:3232kB pagetables:0kB
>> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
>> pages_scanned:0 all_unreclaimable? yes
>> [ 80.377256] lowmem_reserve[]: 0 0 0 0
>> [ 80.381325] Node 2 Normal free:0kB min:0kB low:0kB high:0kB
>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
>> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
>> slab_unreclaimable:612kB kernel_stack:0kB pagetables:0kB
>> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
>> pages_scanned:0 all_unreclaimable? yes
>> [ 80.424764] lowmem_reserve[]: 0 0 0 0
>> [ 80.428842] Node 3 Normal free:0kB min:0kB low:0kB high:0kB
>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
>> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
>> slab_unreclaimable:600kB kernel_stack:0kB pagetables:0kB
>> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
>> pages_scanned:0 all_unreclaimable? yes
>> [ 80.472293] lowmem_reserve[]: 0 0 0 0
>> [ 80.476360] Node 4 Normal free:0kB min:0kB low:0kB high:0kB
>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
>> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
>> slab_unreclaimable:620kB kernel_stack:0kB pagetables:0kB
>> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
>> pages_scanned:0 all_unreclaimable? yes
>> [ 80.519803] lowmem_reserve[]: 0 0 0 0
>> [ 80.523875] Node 5 Normal free:0kB min:0kB low:0kB high:0kB
>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
>> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
>> slab_unreclaimable:584kB kernel_stack:0kB pagetables:0kB
>> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
>> pages_scanned:0 all_unreclaimable? yes
>> [ 80.567312] lowmem_reserve[]: 0 0 0 0
>> [ 80.571379] Node 6 Normal free:0kB min:0kB low:0kB high:0kB
>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
>> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
>> slab_unreclaimable:556kB kernel_stack:0kB pagetables:0kB
>> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
>> pages_scanned:0 all_unreclaimable? yes
>> [ 80.614814] lowmem_reserve[]: 0 0 0 0
>> [ 80.618881] Node 7 Normal free:0kB min:0kB low:0kB high:0kB
>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
>> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
>> slab_unreclaimable:556kB kernel_stack:0kB pagetables:0kB
>> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
>> pages_scanned:0 all_unreclaimable? yes
>> [ 80.662316] lowmem_reserve[]: 0 0 0 0
>> [ 80.666390] Node 8 Normal free:0kB min:0kB low:0kB high:0kB
>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
>> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
>> slab_unreclaimable:572kB kernel_stack:0kB pagetables:0kB
>> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
>> pages_scanned:0 all_unreclaimable? yes
>> [ 80.709827] lowmem_reserve[]: 0 0 0 0
>> [ 80.713898] Node 9 Normal free:0kB min:0kB low:0kB high:0kB
>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
>> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
>> slab_unreclaimable:572kB kernel_stack:0kB pagetables:0kB
>> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
>> pages_scanned:0 all_unreclaimable? yes
>> [ 80.757336] lowmem_reserve[]: 0 0 0 0
>> [ 80.761407] Node 10 Normal free:0kB min:0kB low:0kB high:0kB
>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
>> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
>> slab_unreclaimable:564kB kernel_stack:0kB pagetables:0kB
>> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
>> pages_scanned:0 all_unreclaimable? yes
>> [ 80.804941] lowmem_reserve[]: 0 0 0 0
>> [ 80.809015] Node 11 Normal free:0kB min:0kB low:0kB high:0kB
>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
>> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
>> slab_unreclaimable:572kB kernel_stack:0kB pagetables:0kB
>> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
>> pages_scanned:0 all_unreclaimable? yes
>> [ 80.852548] lowmem_reserve[]: 0 0 0 0
>> [ 80.856620] Node 12 Normal free:0kB min:0kB low:0kB high:0kB
>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
>> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
>> slab_unreclaimable:616kB kernel_stack:0kB pagetables:0kB
>> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
>> pages_scanned:0 all_unreclaimable? yes
>> [ 80.900158] lowmem_reserve[]: 0 0 0 0
>> [ 80.904236] Node 13 Normal free:0kB min:0kB low:0kB high:0kB
>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
>> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
>> slab_unreclaimable:592kB kernel_stack:0kB pagetables:0kB
>> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
>> pages_scanned:0 all_unreclaimable? yes
>> [ 80.947765] lowmem_reserve[]: 0 0 0 0
>> [ 80.951847] Node 14 Normal free:0kB min:0kB low:0kB high:0kB
>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
>> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
>> slab_unreclaimable:600kB kernel_stack:0kB pagetables:0kB
>> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
>> pages_scanned:0 all_unreclaimable? yes
>> [ 80.995380] lowmem_reserve[]: 0 0 0 0
>> [ 80.999448] Node 15 Normal free:0kB min:0kB low:0kB high:0kB
>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB
>> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
>> slab_unreclaimable:548kB kernel_stack:0kB pagetables:0kB
>> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
>> pages_scanned:0 all_unreclaimable? yes
>> [ 81.042974] lowmem_reserve[]: 0 0 0 0
>> [ 81.047044] Node 0 DMA: 132*4kB (U) 5*8kB (U) 0*16kB 0*32kB
>> 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 568kB
>> [ 81.059632] Node 0 DMA32: 5*4kB (U) 0*8kB 0*16kB 0*32kB 0*64kB
>> 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20kB
>> [ 81.071733] Node 0 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB
>> 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>> [ 81.083443] Node 1 Normal: 52*4kB (U) 5*8kB (U) 0*16kB 0*32kB
>> 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 248kB
>> [ 81.096227] Node 2 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB
>> 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>> [ 81.107935] Node 3 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB
>> 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>> [ 81.119643] Node 4 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB
>> 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>> [ 81.131347] Node 5 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB
>> 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>> [ 81.143056] Node 6 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB
>> 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>> [ 81.154767] Node 7 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB
>> 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>> [ 81.166473] Node 8 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB
>> 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>> [ 81.178179] Node 9 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB
>> 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>> [ 81.189893] Node 10 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB
>> 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>> [ 81.201695] Node 11 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB
>> 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>> [ 81.213496] Node 12 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB
>> 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>> [ 81.225324] Node 13 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB
>> 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>> [ 81.237130] Node 14 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB
>> 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>> [ 81.248926] Node 15 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB
>> 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
>> [ 81.260726] 0 total pagecache pages
>> [ 81.264565] 0 pages in swap cache
>> [ 81.268212] Swap cache stats: add 0, delete 0, find 0/0
>> [ 81.273962] Free swap = 0kB
>> [ 81.277125] Total swap = 0kB
>> [ 81.280341] 6442421132 pages RAM
>> [ 81.283888] 0 pages HighMem/MovableOnly
>> [ 81.288109] 6433662383 pages reserved
>> [ 81.292135] 0 pages hwpoisoned
>> [ 81.295491] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds
>> swapents oom_score_adj name
>> [ 81.305245] Kernel panic - not syncing: Out of memory and no
>> killable processes...
>> [ 81.305245]
>> [ 81.315200] CPU: 240 PID: 1 Comm: swapper/0 Not tainted
>> 4.0.1-pmm-bigsmp #1
>> [ 81.322856] Hardware name: HP Superdome2 16s x86, BIOS Bundle:
>> 006.000.042 SFW: 015.099.000 04/01/2015
>> [ 81.333096] 0000000000000000 ffff8800044c79c8 ffffffff8151b0c9
>> ffff8800044c7a48
>> [ 81.341262] ffffffff8151ae1e 0000000000000008 ffff8800044c7a58
>> ffff8800044c79f8
>> [ 81.349428] ffffffff810785c3 ffffffff81a13480 0000000000000000
>> ffff8800001001d0
>> [ 81.357595] Call Trace:
>> [ 81.360287] [<ffffffff8151b0c9>] dump_stack+0x68/0x77
>> [ 81.365942] [<ffffffff8151ae1e>] panic+0xb9/0x219
>> [ 81.371213] [<ffffffff810785c3>] ?
>> __blocking_notifier_call_chain+0x63/0x80
>> [ 81.378971] [<ffffffff811384ce>] __out_of_memory+0x34e/0x350
>> [ 81.385292] [<ffffffff811385ee>] out_of_memory+0x5e/0x90
>> [ 81.391230] [<ffffffff8113ce9e>]
>> __alloc_pages_slowpath+0x6be/0x740
>> [ 81.398219] [<ffffffff8113d15c>]
>> __alloc_pages_nodemask+0x23c/0x250
>> [ 81.405212] [<ffffffff81186346>] kmem_getpages+0x56/0x110
>> [ 81.411246] [<ffffffff81187f44>] fallback_alloc+0x164/0x200
>> [ 81.417474] [<ffffffff81187cfd>] ____cache_alloc_node+0x8d/0x170
>> [ 81.424179] [<ffffffff811887bb>]
>> kmem_cache_alloc_trace+0x17b/0x240
>> [ 81.431169] [<ffffffff813d5f3a>] init_memory_block+0x3a/0x110
>> [ 81.437586] [<ffffffff81b5f687>] memory_dev_init+0xd7/0x13d
>> [ 81.443810] [<ffffffff81b5f2af>] driver_init+0x2f/0x37
>> [ 81.449556] [<ffffffff81b1599b>] do_basic_setup+0x29/0xd5
>> [ 81.455597] [<ffffffff81b372c4>] ? sched_init_smp+0x140/0x147
>> [ 81.462015] [<ffffffff81b15c55>] kernel_init_freeable+0x20e/0x297
>> [ 81.468815] [<ffffffff81512ea0>] ? rest_init+0x80/0x80
>> [ 81.474565] [<ffffffff81512ea9>] kernel_init+0x9/0xf0
>> [ 81.480216] [<ffffffff8151f788>] ret_from_fork+0x58/0x90
>> [ 81.486156] [<ffffffff81512ea0>] ? rest_init+0x80/0x80
>> [ 81.492350] ---[ end Kernel panic - not syncing: Out of memory
>> and no killable processes...
>> [ 81.492350]
>>
>> -Longman
>
> I increased the pre-initialized memory per node in
> update_defer_init() of mm/page_alloc.c from 2G to 4G. Now I am able
> to boot the 24-TB machine without error. The 12-TB has 0.75TB/node,
> while the 24-TB machine has 1.5TB/node. I would suggest something
> like pre-initializing 1G per 0.25TB/node. In this way, it will scale
> properly with the memory size.
>
> Before the patch, the boot time from elilo prompt to ssh login was
> 694s. After the patch, the boot up time was 346s, a saving of 348s
> (about 50%).

I second scaling the up-front init with the zone size. The 7TB system I
was booting has only 32GB per NUMA node, which at 1GB per 0.25TB works
out at 128MB of up-front init per NUMA node; that worked nicely and
booted faster still.

Even booting with 64MB per NUMA node worked fine, so there is adequate
margin for the 8 cores; we would just need to enforce a minimum of, say,
64MB.
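
A minimal standalone sketch of the scaling rule under discussion (not
code from the posted series; early_init_pages() is an illustrative name,
and the 4KiB page size and 128MB floor are assumptions taken from this
thread):

/*
 * Sketch only: scale the up-front per-node initialisation as 1GB per
 * 0.25TB of node memory (a right shift by 8), with a 128MB floor.
 * Values are in 4KiB pages (PAGE_SHIFT == 12 assumed).
 */
static unsigned long early_init_pages(unsigned long node_spanned_pages)
{
        unsigned long floor = 128UL << (20 - 12);       /* 128MB in pages */
        unsigned long scaled = node_spanned_pages >> 8; /* 1GB per 0.25TB */

        return scaled > floor ? scaled : floor;
}

For a 32GB node this returns 32768 pages (128MB), matching the figure
above; for a 1.5TB node it returns 6GB worth of pages.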

Thanks,
Daniel

Subject: RE: [PATCH 0/13] Parallel struct page initialisation v4


> -----Original Message-----
> From: [email protected] [mailto:linux-kernel-
> [email protected]] On Behalf Of Daniel J Blueman
> Sent: Thursday, April 30, 2015 11:10 AM
> Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4
...
> On a 7TB, 1728-core NumaConnect system with 108 NUMA nodes, we're
> seeing stock 4.0 boot in 7136s. This drops to 2159s, or a 70% reduction
> with this patchset. Non-temporal PMD init [1] drops this to 1045s.
>
> Nathan, what do you guys see with the non-temporal PMD patch [1]? Do
> add an sfence at the end label if you patch manually.
>
...
> [1] https://lkml.org/lkml/2015/4/23/350

From that post:
> +loop_64:
> + decq %rcx
> + movnti %rax,(%rdi)
> + movnti %rax,8(%rdi)
> + movnti %rax,16(%rdi)
> + movnti %rax,24(%rdi)
> + movnti %rax,32(%rdi)
> + movnti %rax,40(%rdi)
> + movnti %rax,48(%rdi)
> + movnti %rax,56(%rdi)
> + leaq 64(%rdi),%rdi
> + jnz loop_64

There are some even more efficient instructions available in x86,
depending on the CPU features:
* movnti 8 byte
* movntdq %xmm 16 byte, SSE
* vmovntdq %ymm 32 byte, AVX
* vmovntdq %zmm 64 byte, AVX-512 (forthcoming)

The last will transfer a full cache line at a time.
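
For illustration, a minimal userspace sketch of such non-temporal
stores using the 16-byte SSE2 variant via compiler intrinsics
(clear_nt() is an illustrative name; the kernel patch referenced above
uses movnti in assembly instead):

#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_setzero_si128 */
#include <stddef.h>

/*
 * Zero a 16-byte-aligned buffer with non-temporal 16-byte stores
 * (movntdq), then fence so the stores complete before later use.
 * len must be a multiple of 16.
 */
void clear_nt(void *dst, size_t len)
{
        __m128i zero = _mm_setzero_si128();
        char *p = dst;
        size_t i;

        for (i = 0; i < len; i += 16)
                _mm_stream_si128((__m128i *)(p + i), zero);
        _mm_sfence();
}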

For NVDIMMs, the nd pmem driver is also in need of memcpy functions that
use these non-temporal instructions, both for performance and reliability.
We also need to speed up __clear_page and copy_user_enhanced_string so
userspace accesses through the page cache can keep up.
https://lkml.org/lkml/2015/4/2/453 is one of the threads on that topic.

Some results I've gotten there under different cache attributes
(in terms of 4 KiB IOPS):

16-byte movntdq:
UC write iops=697872 (697.872 K)(0.697872 M)
WB write iops=9745800 (9745.8 K)(9.7458 M)
WC write iops=9801800 (9801.8 K)(9.8018 M)
WT write iops=9812400 (9812.4 K)(9.8124 M)

32-byte vmovntdq:
UC write iops=1274400 (1274.4 K)(1.2744 M)
WB write iops=10259000 (10259 K)(10.259 M)
WC write iops=10286000 (10286 K)(10.286 M)
WT write iops=10294000 (10294 K)(10.294 M)

---
Robert Elliott, HP Server Storage


2015-05-02 16:06:08

by Daniel J Blueman

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On Sat, May 2, 2015 at 4:52 PM, Daniel J Blueman <[email protected]>
wrote:
> On Sat, May 2, 2015 at 8:09 AM, Waiman Long <[email protected]>
> wrote:
>> On 05/01/2015 06:02 PM, Waiman Long wrote:
>>>
>>> Bad news!
>>>
>>> I tried your patch on a 24-TB DragonHawk and got an out of memory
>>> panic. The kernel log messages were:
>>> [ full OOM panic log snipped; it is quoted in full in the message above ]
>>>
>>> -Longman
>>
>> I increased the pre-initialized memory per node in
>> update_defer_init() of mm/page_alloc.c from 2G to 4G. Now I am able
>> to boot the 24-TB machine without error. The 12-TB has 0.75TB/node,
>> while the 24-TB machine has 1.5TB/node. I would suggest something
>> like pre-initializing 1G per 0.25TB/node. In this way, it will scale
>> properly with the memory size.
>>
>> Before the patch, the boot time from elilo prompt to ssh login was
>> 694s. After the patch, the boot up time was 346s, a saving of 348s
>> (about 50%).
>
> I second scaling the up-front init with the zone size. The 7TB system
> I was booting has only 32GB per NUMA node, which at 1GB per 0.25TB
> works out at 128MB of up-front init per NUMA node; that worked nicely
> and booted faster still.
>
> Even booting with 64MB per NUMA node worked fine, so there is
> adequate margin for the 8 cores; we would just need to enforce a
> minimum of, say, 64MB.

Varying the synchronous per-NUMA-node initialisation (with non-temporal
patch, but that just removes a constant from PMD init), from kernel
load to login prompt on this 7TB, 1728-core system takes:
512MB 699.2s
256MB 680.3s
128MB 661.7s
64MB 663.6s
32MB 667.8s

So, in this case 128MB per NUMA node gives more locality than 64MB and
should be a good minimum; it also matches Waiman's scaling suggestion.

Thanks,
Daniel

2015-05-04 08:34:15

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 06/13] mm: meminit: Inline some helper functions

On Tue 28-04-15 15:37:03, Mel Gorman wrote:
> early_pfn_in_nid() and meminit_pfn_in_nid() are small functions that are
> unnecessarily visible outside memory initialisation. As well as unnecessary
> visibility, it's unnecessary function call overhead when initialising pages.
> This patch moves the helpers inline.

This is causing:
CC mm/page_alloc.o
mm/page_alloc.c: In function ‘deferred_init_memmap’:
mm/page_alloc.c:1135:4: error: implicit declaration of function ‘meminit_pfn_in_nid’ [-Werror=implicit-function-declaration]
if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
^

with a randconfig test. CONFIG_NODES_SPAN_OTHER_NODES is not defined.
The full config is attached.

I guess we need something like this:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e0257debce0..a48128d882d8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1044,6 +1044,11 @@ static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
 {
        return true;
 }
+static inline bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node,
+                                       struct mminit_pfnnid_cache *state)
+{
+       return true;
+}
 #endif

--
Michal Hocko
SUSE Labs


Attachments:
(No filename) (1.15 kB)
config-failed (98.52 kB)

2015-05-04 08:38:16

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 06/13] mm: meminit: Inline some helper functions

I have taken this into my mm git tree for now. I guess Andrew will fold
it into the original patch later.

---
From 986279c465b2f513bcbb91ba7010cb2184d1bb7c Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Mon, 4 May 2015 10:35:36 +0200
Subject: [PATCH] mm-meminit-inline-some-helper-functions-fix2.patch
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

mm/page_alloc.c: In function ‘deferred_init_memmap’:
mm/page_alloc.c:1135:4: error: implicit declaration of function ‘meminit_pfn_in_nid’ [-Werror=implicit-function-declaration]
if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
^

because randconfig decided to disable CONFIG_NODES_SPAN_OTHER_NODES.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e0257debce0..a48128d882d8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1044,6 +1044,11 @@ static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
 {
        return true;
 }
+static inline bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node,
+                                       struct mminit_pfnnid_cache *state)
+{
+       return true;
+}
 #endif


--
2.1.4

--
Michal Hocko
SUSE Labs

2015-05-04 21:30:55

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On Fri, 01 May 2015 20:09:21 -0400 Waiman Long <[email protected]> wrote:

> On 05/01/2015 06:02 PM, Waiman Long wrote:
> >
> > Bad news!
> >
> > I tried your patch on a 24-TB DragonHawk and got an out of memory
> > panic. The kernel log messages were:
>
> ...
>
> > [ 81.360287] [<ffffffff8151b0c9>] dump_stack+0x68/0x77
> > [ 81.365942] [<ffffffff8151ae1e>] panic+0xb9/0x219
> > [ 81.371213] [<ffffffff810785c3>] ?
> > __blocking_notifier_call_chain+0x63/0x80
> > [ 81.378971] [<ffffffff811384ce>] __out_of_memory+0x34e/0x350
> > [ 81.385292] [<ffffffff811385ee>] out_of_memory+0x5e/0x90
> > [ 81.391230] [<ffffffff8113ce9e>] __alloc_pages_slowpath+0x6be/0x740
> > [ 81.398219] [<ffffffff8113d15c>] __alloc_pages_nodemask+0x23c/0x250
> > [ 81.405212] [<ffffffff81186346>] kmem_getpages+0x56/0x110
> > [ 81.411246] [<ffffffff81187f44>] fallback_alloc+0x164/0x200
> > [ 81.417474] [<ffffffff81187cfd>] ____cache_alloc_node+0x8d/0x170
> > [ 81.424179] [<ffffffff811887bb>] kmem_cache_alloc_trace+0x17b/0x240
> > [ 81.431169] [<ffffffff813d5f3a>] init_memory_block+0x3a/0x110
> > [ 81.437586] [<ffffffff81b5f687>] memory_dev_init+0xd7/0x13d
> > [ 81.443810] [<ffffffff81b5f2af>] driver_init+0x2f/0x37
> > [ 81.449556] [<ffffffff81b1599b>] do_basic_setup+0x29/0xd5
> > [ 81.455597] [<ffffffff81b372c4>] ? sched_init_smp+0x140/0x147
> > [ 81.462015] [<ffffffff81b15c55>] kernel_init_freeable+0x20e/0x297
> > [ 81.468815] [<ffffffff81512ea0>] ? rest_init+0x80/0x80
> > [ 81.474565] [<ffffffff81512ea9>] kernel_init+0x9/0xf0
> > [ 81.480216] [<ffffffff8151f788>] ret_from_fork+0x58/0x90
> > [ 81.486156] [<ffffffff81512ea0>] ? rest_init+0x80/0x80
> > [ 81.492350] ---[ end Kernel panic - not syncing: Out of memory and
> > no killable processes...
> > [ 81.492350]
> >
> > -Longman
>
> I increased the pre-initialized memory per node in update_defer_init()
> of mm/page_alloc.c from 2G to 4G. Now I am able to boot the 24-TB
> machine without error. The 12-TB has 0.75TB/node, while the 24-TB
> machine has 1.5TB/node. I would suggest something like pre-initializing
> 1G per 0.25TB/node. In this way, it will scale properly with the memory
> size.

We're using more than 2G before we've even completed do_basic_setup()?
Where did it all go?

> Before the patch, the boot time from elilo prompt to ssh login was 694s.
> After the patch, the boot up time was 346s, a saving of 348s (about 50%).

Having to guesstimate the amount of memory which is needed for a
successful boot will be painful. Any number we choose will be wrong
99% of the time.

If the kswapd threads have started, all we need to do is to wait: take
a little nap in the allocator's page==NULL slowpath.

I'm not seeing any reason why we can't start kswapd much earlier -
right at the start of do_basic_setup()?
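
A rough sketch of that idea (not code from the series;
try_alloc_pages() and deferred_init_in_progress() are hypothetical
stand-ins for the real slowpath hooks):

/*
 * Hypothetical sketch of the suggestion above, not kernel code from
 * the series: rather than declaring OOM while deferred struct page
 * initialisation is still running, nap and retry so kswapd can make
 * more memory available.
 */
static struct page *alloc_or_nap(gfp_t gfp_mask, unsigned int order)
{
        struct page *page;

        for (;;) {
                page = try_alloc_pages(gfp_mask, order);
                if (page || !deferred_init_in_progress())
                        return page;    /* success, or a genuine OOM */
                msleep(10);             /* nap while kswapd initialises memory */
        }
}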

2015-05-05 03:32:49

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On 05/04/2015 05:30 PM, Andrew Morton wrote:
> On Fri, 01 May 2015 20:09:21 -0400 Waiman Long<[email protected]> wrote:
>
>> On 05/01/2015 06:02 PM, Waiman Long wrote:
>>> Bad news!
>>>
>>> I tried your patch on a 24-TB DragonHawk and got an out of memory
>>> panic. The kernel log messages were:
>> ...
>>
>>> [ panic backtrace snipped; quoted in full earlier in the thread ]
>>>
>>> -Longman
>> I increased the pre-initialized memory per node in update_defer_init()
>> of mm/page_alloc.c from 2G to 4G. Now I am able to boot the 24-TB
>> machine without error. The 12-TB has 0.75TB/node, while the 24-TB
>> machine has 1.5TB/node. I would suggest something like pre-initializing
>> 1G per 0.25TB/node. In this way, it will scale properly with the memory
>> size.
> We're using more than 2G before we've even completed do_basic_setup()?
> Where did it all go?

I think they may be used in the allocation of the hash tables like:

[ 2.367440] Dentry cache hash table entries: 2147483648 (order: 22,
17179869184 bytes)
[ 11.522768] Inode-cache hash table entries: 2147483648 (order: 22,
17179869184 bytes)
[ 18.598513] Mount-cache hash table entries: 67108864 (order: 17,
536870912 bytes)
[ 18.667485] Mountpoint-cache hash table entries: 67108864 (order: 17,
536870912 bytes)

The sizes of those hash tables scale somewhat linearly with the total
amount of memory available.
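
Those log lines can be sanity-checked: assuming each hash bucket is an
8-byte list head, 2147483648 entries come to 2^34 bytes, an order-22
allocation of 4KiB pages, which matches the log. A small standalone
check:

#include <stdio.h>

int main(void)
{
        unsigned long long entries = 2147483648ULL; /* dentry hash entries from the log */
        unsigned long long bytes = entries * 8;     /* one 8-byte bucket head per entry */
        int order = 0;

        while ((4096ULL << order) < bytes)  /* smallest power-of-two run of 4KiB pages */
                order++;
        printf("%llu bytes, order %d\n", bytes, order); /* 17179869184 bytes, order 22 */
        return 0;
}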

>> Before the patch, the boot time from elilo prompt to ssh login was 694s.
>> After the patch, the boot up time was 346s, a saving of 348s (about 50%).
> Having to guesstimate the amount of memory which is needed for a
> successful boot will be painful. Any number we choose will be wrong
> 99% of the time.
>
> If the kswapd threads have started, all we need to do is to wait: take
> a little nap in the allocator's page==NULL slowpath.
>
> I'm not seeing any reason why we can't start kswapd much earlier -
> right at the start of do_basic_setup()?

I think we can; we just have to change the hash table allocator to do that.

Cheers,
Longman

2015-05-05 10:45:31

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On Mon, May 04, 2015 at 02:30:46PM -0700, Andrew Morton wrote:
> > Before the patch, the boot time from elilo prompt to ssh login was 694s.
> > After the patch, the boot up time was 346s, a saving of 348s (about 50%).
>
> Having to guesstimate the amount of memory which is needed for a
> successful boot will be painful. Any number we choose will be wrong
> 99% of the time.
>
> If the kswapd threads have started, all we need to do is to wait: take
> a little nap in the allocator's page==NULL slowpath.
>
> I'm not seeing any reason why we can't start kswapd much earlier -
> right at the start of do_basic_setup()?

It doesn't even have to be kswapd, it just should be a thread pinned to
a node. The difficulty is that dealing with the system hashes means the
initialisation has to happen before vfs_caches_init_early() when there is
no scheduler. Those allocations could be delayed further but then there is
the possibility that the allocations would not be contiguous and they'd
have to rely on CMA to make the attempt. That potentially alters the
performance of the large system hashes at run time.

We can scale the amount initialised with memory size relatively easily.
This boots on the same 1TB machine I was testing before but that is
hardly a surprise.

---8<---
mm: meminit: Take into account that large system caches scale linearly with memory

Waiman Long reported a 24TB machine triggered an OOM as parallel memory
initialisation deferred too much memory for initialisation. The likely
consumer of this memory was large system hashes that scale with memory
size. This patch initialises at least 2G per node but scales the amount
initialised for larger systems.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 598f78d6544c..f7cc6c9fb909 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -266,15 +266,16 @@ static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
  */
 static inline bool update_defer_init(pg_data_t *pgdat,
                                unsigned long pfn, unsigned long zone_end,
+                               unsigned long max_initialise,
                                unsigned long *nr_initialised)
 {
        /* Always populate low zones for address-constrained allocations */
        if (zone_end < pgdat_end_pfn(pgdat))
                return true;
 
-       /* Initialise at least 2G of the highest zone */
+       /* Initialise at least the requested amount in the highest zone */
        (*nr_initialised)++;
-       if (*nr_initialised > (2UL << (30 - PAGE_SHIFT)) &&
+       if ((*nr_initialised > max_initialise) &&
            (pfn & (PAGES_PER_SECTION - 1)) == 0) {
                pgdat->first_deferred_pfn = pfn;
                return false;
@@ -299,6 +300,7 @@ static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
 
 static inline bool update_defer_init(pg_data_t *pgdat,
                                unsigned long pfn, unsigned long zone_end,
+                               unsigned long max_initialise,
                                unsigned long *nr_initialised)
 {
        return true;
@@ -4457,11 +4459,19 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
        unsigned long end_pfn = start_pfn + size;
        unsigned long pfn;
        struct zone *z;
+       unsigned long max_initialise;
        unsigned long nr_initialised = 0;
 
        if (highest_memmap_pfn < end_pfn - 1)
                highest_memmap_pfn = end_pfn - 1;
 
+       /*
+        * Initialise at least 2G of a node, but also take into account
+        * that two large system hashes can take up an 8th of memory.
+        */
+       max_initialise = max(2UL << (30 - PAGE_SHIFT),
+               (pgdat->node_spanned_pages >> 3));
+
        z = &pgdat->node_zones[zone];
        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                /*
@@ -4475,6 +4485,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
                if (!early_pfn_in_nid(pfn, nid))
                        continue;
                if (!update_defer_init(pgdat, pfn, end_pfn,
+                                       max_initialise,
                                       &nr_initialised))
                        break;
        }

2015-05-05 13:56:02

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On 05/05/2015 06:45 AM, Mel Gorman wrote:
> On Mon, May 04, 2015 at 02:30:46PM -0700, Andrew Morton wrote:
>>> Before the patch, the boot time from elilo prompt to ssh login was 694s.
>>> After the patch, the boot up time was 346s, a saving of 348s (about 50%).
>> Having to guesstimate the amount of memory which is needed for a
>> successful boot will be painful. Any number we choose will be wrong
>> 99% of the time.
>>
>> If the kswapd threads have started, all we need to do is to wait: take
>> a little nap in the allocator's page==NULL slowpath.
>>
>> I'm not seeing any reason why we can't start kswapd much earlier -
>> right at the start of do_basic_setup()?
> It doesn't even have to be kswapd, it just should be a thread pinned to
> a node. The difficulty is that dealing with the system hashes means the
> initialisation has to happen before vfs_caches_init_early() when there is
> no scheduler. Those allocations could be delayed further but then there is
> the possibility that the allocations would not be contiguous and they'd
> have to rely on CMA to make the attempt. That potentially alters the
> performance of the large system hashes at run time.
>
> We can scale the amount initialised with memory size relatively easily.
> This boots on the same 1TB machine I was testing before but that is
> hardly a surprise.
>
> ---8<---
> mm: meminit: Take into account that large system caches scale linearly with memory
>
> Waiman Long reported a 24TB machine triggered an OOM as parallel memory
> initialisation deferred too much memory for initialisation. The likely
> consumer of this memory was large system hashes that scale with memory
> size. This patch initialises at least 2G per node but scales the amount
> initialised for larger systems.
>
> Signed-off-by: Mel Gorman<[email protected]>
> ---
> mm/page_alloc.c | 15 +++++++++++++--
> 1 file changed, 13 insertions(+), 2 deletions(-)
>
> [ patch snipped down to the hunk under discussion ]
> +	/*
> +	 * Initialise at least 2G of a node, but also take into account
> +	 * that two large system hashes can take up an 8th of memory.
> +	 */
> +	max_initialise = max(2UL << (30 - PAGE_SHIFT),
> +		(pgdat->node_spanned_pages >> 3));
> +

I think you may be pre-allocating too much memory here. On the 24-TB
machine, the dentry and inode hash tables were 16G each, so the ratio
is about 32G/24T = 0.13%. I think a shift factor of (>> 8), which is
about 0.39%, should be more than enough. For the 24-TB machine, that
means preallocating 96+4G, which should be even more than the 64+4G in
the modified kernel that I used. At the same time, I think we can also
set the minimum to 1G or even 0.5G for better performance on systems
that have many CPUs but not as much memory per node.
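
For the 1.5TB nodes on the 24-TB machine, the two shift factors work
out as follows (a standalone check of the figures above; the 16-node
count is taken from the OOM log earlier in the thread):

#include <stdio.h>

int main(void)
{
        unsigned long node_gb = 1536;   /* 1.5TB per node on the 24-TB machine */

        /* >> 3 is an 8th of the node, >> 8 is 1GB per 0.25TB */
        printf(">> 3: %lu GB/node (%lu GB over 16 nodes)\n",
               node_gb >> 3, (node_gb >> 3) * 16);     /* 192 GB, 3072 GB */
        printf(">> 8: %lu GB/node (%lu GB over 16 nodes)\n",
               node_gb >> 8, (node_gb >> 8) * 16);     /* 6 GB, 96 GB */
        return 0;
}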

Cheers,
Longman

2015-05-05 16:06:59

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On Tue, May 05, 2015 at 09:55:52AM -0400, Waiman Long wrote:
> On 05/05/2015 06:45 AM, Mel Gorman wrote:
> >On Mon, May 04, 2015 at 02:30:46PM -0700, Andrew Morton wrote:
> >>>Before the patch, the boot time from elilo prompt to ssh login was 694s.
> >>>After the patch, the boot up time was 346s, a saving of 348s (about 50%).
> >>Having to guesstimate the amount of memory which is needed for a
> >>successful boot will be painful. Any number we choose will be wrong
> >>99% of the time.
> >>
> >>If the kswapd threads have started, all we need to do is to wait: take
> >>a little nap in the allocator's page==NULL slowpath.
> >>
> >>I'm not seeing any reason why we can't start kswapd much earlier -
> >>right at the start of do_basic_setup()?
> >It doesn't even have to be kswapd, it just should be a thread pinned to
> >a node. The difficulty is that dealing with the system hashes means the
> >initialisation has to happen before vfs_caches_init_early() when there is
> >no scheduler. Those allocations could be delayed further but then there is
> >the possibility that the allocations would not be contiguous and they'd
> >have to rely on CMA to make the attempt. That potentially alters the
> >performance of the large system hashes at run time.
> >
> >We can scale the amount initialised with memory size relatively easily.
> >This boots on the same 1TB machine I was testing before but that is
> >hardly a surprise.
> >
> >---8<---
> >mm: meminit: Take into account that large system caches scale linearly with memory
> >
> >Waiman Long reported a 24TB machine triggered an OOM as parallel memory
> >initialisation deferred too much memory for initialisation. The likely
> >consumer of this memory was large system hashes that scale with memory
> >size. This patch initialises at least 2G per node but scales the amount
> >initialised for larger systems.
> >
> >Signed-off-by: Mel Gorman<[email protected]>
> >---
> > mm/page_alloc.c | 15 +++++++++++++--
> > 1 file changed, 13 insertions(+), 2 deletions(-)
> >
> > [ patch snipped down to the hunk under discussion ]
> >+	/*
> >+	 * Initialise at least 2G of a node, but also take into account
> >+	 * that two large system hashes can take up an 8th of memory.
> >+	 */
> >+	max_initialise = max(2UL << (30 - PAGE_SHIFT),
> >+		(pgdat->node_spanned_pages >> 3));
> >+
>
> I think you may be pre-allocating too much memory here. On the 24-TB
> machine, the dentry and inode hash tables were 16G each, so the ratio
> is about 32G/24T = 0.13%. I think a shift factor of (>> 8), which is
> about 0.39%, should be more than enough.

I was taking the most pessimistic value possible to match where those
hashes currently get allocated from so that the locality does not change
after the series is applied. Can you try both (>> 3) and (>> 8) and see
whether both work and, if so, what the timing is?

> For the 24-TB machine, that means preallocating 96+4G, which should
> be even more than the 64+4G in the modified kernel that I used. At
> the same time, I think we can also set the minimum to 1G or even 0.5G
> for better performance on systems that have many CPUs but not as much
> memory per node.
>

I think the benefit there is going to be marginal except maybe on machines
where remote accesses are extremely costly.

--
Mel Gorman
SUSE Labs

2015-05-05 16:08:33

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On 05/05/2015 10:31 AM, Mel Gorman wrote:
> On Tue, May 05, 2015 at 09:55:52AM -0400, Waiman Long wrote:
>> On 05/05/2015 06:45 AM, Mel Gorman wrote:
>>> On Mon, May 04, 2015 at 02:30:46PM -0700, Andrew Morton wrote:
>>>>> Before the patch, the boot time from elilo prompt to ssh login was 694s.
>>>>> After the patch, the boot up time was 346s, a saving of 348s (about 50%).
>>>> Having to guesstimate the amount of memory which is needed for a
>>>> successful boot will be painful. Any number we choose will be wrong
>>>> 99% of the time.
>>>>
>>>> If the kswapd threads have started, all we need to do is to wait: take
>>>> a little nap in the allocator's page==NULL slowpath.
>>>>
>>>> I'm not seeing any reason why we can't start kswapd much earlier -
>>>> right at the start of do_basic_setup()?
>>> It doesn't even have to be kswapd, it just should be a thread pinned to
>>> a node. The difficulty is that dealing with the system hashes means the
>>> initialisation has to happen before vfs_caches_init_early() when there is
>>> no scheduler. Those allocations could be delayed further but then there is
>>> the possibility that the allocations would not be contiguous and they'd
>>> have to rely on CMA to make the attempt. That potentially alters the
>>> performance of the large system hashes at run time.
>>>
>>> We can scale the amount initialised with memory size relatively easily.
>>> This boots on the same 1TB machine I was testing before but that is
>>> hardly a surprise.
>>>
>>> ---8<---
>>> mm: meminit: Take into account that large system caches scale linearly with memory
>>>
>>> Waiman Long reported a 24TB machine triggered an OOM as parallel memory
>>> initialisation deferred too much memory for initialisation. The likely
>>> consumer of this memory was large system hashes that scale with memory
>>> size. This patch initialises at least 2G per node but scales the amount
>>> initialised for larger systems.
>>>
>>> Signed-off-by: Mel Gorman<[email protected]>
>>> ---
>>> mm/page_alloc.c | 15 +++++++++++++--
>>> 1 file changed, 13 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 598f78d6544c..f7cc6c9fb909 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -266,15 +266,16 @@ static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
>>> */
>>> static inline bool update_defer_init(pg_data_t *pgdat,
>>> unsigned long pfn, unsigned long zone_end,
>>> + unsigned long max_initialise,
>>> unsigned long *nr_initialised)
>>> {
>>> /* Always populate low zones for address-contrained allocations */
>>> if (zone_end< pgdat_end_pfn(pgdat))
>>> return true;
>>>
>>> - /* Initialise at least 2G of the highest zone */
>>> + /* Initialise at least the requested amount in the highest zone */
>>> (*nr_initialised)++;
>>> - if (*nr_initialised > (2UL << (30 - PAGE_SHIFT)) &&
>>> + if ((*nr_initialised > max_initialise) &&
>>> (pfn & (PAGES_PER_SECTION - 1)) == 0) {
>>> pgdat->first_deferred_pfn = pfn;
>>> return false;
>>> @@ -299,6 +300,7 @@ static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
>>>
>>> static inline bool update_defer_init(pg_data_t *pgdat,
>>> unsigned long pfn, unsigned long zone_end,
>>> + unsigned long max_initialise,
>>> unsigned long *nr_initialised)
>>> {
>>> return true;
>>> @@ -4457,11 +4459,19 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>> unsigned long end_pfn = start_pfn + size;
>>> unsigned long pfn;
>>> struct zone *z;
>>> + unsigned long max_initialise;
>>> unsigned long nr_initialised = 0;
>>>
>>> if (highest_memmap_pfn < end_pfn - 1)
>>> highest_memmap_pfn = end_pfn - 1;
>>>
>>> + /*
>>> + * Initialise at least 2G of a node but also take into account that
>>> + * two large system hashes can take up an 8th of memory.
>>> + */
>>> + max_initialise = min(2UL << (30 - PAGE_SHIFT),
>>> + (pgdat->node_spanned_pages >> 3));
>>> +
>> I think you may be pre-allocating too much memory here. On the 24-TB
>> machine, the size of the dentry and inode hash tables were 16G each.
>> So the ratio is about 32G/24T = 0.13%. I think a shift
>> factor of (>> 8) which is about 0.39% should be more than enough.
> I was taking the most pessimistic value possible to match where those
> hashes currently get allocated from so that the locality does not change
> after the series is applied. Can you try both (>> 3) and (>> 8) and see
> whether both work and, if so, what the timing is?

Sure. I will try both and get you the results, hopefully by tomorrow at
the latest.

Cheers,
Longman
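
For concreteness, a rough comparison of the two shift factors on a 24TB
machine. This is only an illustration: the node count is assumed, and the
kernel code works in pages via node_spanned_pages rather than bytes.

/* Illustrative only: per-node and machine-wide caps under >> 3 and >> 8. */
#include <stdio.h>

int main(void)
{
	unsigned long long total = 24ULL << 40;		/* 24TB machine */
	int nr_nodes = 16;				/* assumed topology */
	unsigned long long node = total / nr_nodes;	/* 1.5TB per node */

	printf("node size: %lluG\n", node >> 30);
	printf(">> 3 cap:  %lluG/node, %lluG machine-wide\n",
	       (node >> 3) >> 30, (total >> 3) >> 30);
	printf(">> 8 cap:  %lluG/node, %lluG machine-wide\n",
	       (node >> 8) >> 30, (total >> 8) >> 30);
	return 0;
}

Under these assumptions (>> 3) keeps roughly 3TB initialised up front while
(>> 8) keeps roughly 96G, which is why (>> 8) sits much closer to the
measured 32G of hash tables.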

2015-05-05 20:03:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On Tue, 5 May 2015 11:45:14 +0100 Mel Gorman <[email protected]> wrote:

> On Mon, May 04, 2015 at 02:30:46PM -0700, Andrew Morton wrote:
> > > Before the patch, the boot time from elilo prompt to ssh login was 694s.
> > > After the patch, the boot up time was 346s, a saving of 348s (about 50%).
> >
> > Having to guesstimate the amount of memory which is needed for a
> > successful boot will be painful. Any number we choose will be wrong
> > 99% of the time.
> >
> > If the kswapd threads have started, all we need to do is to wait: take
> > a little nap in the allocator's page==NULL slowpath.
> >
> > I'm not seeing any reason why we can't start kswapd much earlier -
> > right at the start of do_basic_setup()?
>
> It doesn't even have to be kswapd, it just should be a thread pinned to
> a node. The difficulty is that dealing with the system hashes means the
> initialisation has to happen before vfs_caches_init_early() when there is
> no scheduler.

I bet we can run vfs_caches_init_early() after sched_init(). Might
need a few little fixups.

> Those allocations could be delayed further but then there is
> the possibility that the allocations would not be contiguous and they'd
> have to rely on CMA to make the attempt. That potentially alters the
> performance of the large system hashes at run time.

hm, why. If the kswapd threads are running and busily creating free
pages then alloc_pages(order=10) can detect this situation and stall
for a while, waiting for kswapd to create an order-10 page.

Alternatively, the page allocator can go off and synchronously
initialize some pageframes itself. Keep doing that until the
allocation attempt succeeds.

Such an approach is much more robust than trying to predict how much
memory will be needed.
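
Sketched as code, the retry idea looks something like the following. This
is a sketch of the control flow only; neither helper exists under these
names, and the real slow path has many more concerns:

/*
 * Hypothetical slow-path fallback: when an early allocation fails,
 * initialise another batch of deferred struct pages and retry until
 * the allocation succeeds or nothing is left to initialise.
 */
static struct page *alloc_pages_init_on_demand(gfp_t gfp_mask, unsigned int order)
{
	struct page *page;

	for (;;) {
		page = try_alloc_pages(gfp_mask, order);	/* hypothetical */
		if (page)
			return page;
		/* Do synchronously what kswapd would have done later */
		if (!init_next_deferred_chunk())		/* hypothetical */
			return NULL;
	}
}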

2015-05-05 22:13:38

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On Tue, May 05, 2015 at 01:02:55PM -0700, Andrew Morton wrote:
> On Tue, 5 May 2015 11:45:14 +0100 Mel Gorman <[email protected]> wrote:
>
> > On Mon, May 04, 2015 at 02:30:46PM -0700, Andrew Morton wrote:
> > > > Before the patch, the boot time from elilo prompt to ssh login was 694s.
> > > > After the patch, the boot up time was 346s, a saving of 348s (about 50%).
> > >
> > > Having to guesstimate the amount of memory which is needed for a
> > > successful boot will be painful. Any number we choose will be wrong
> > > 99% of the time.
> > >
> > > If the kswapd threads have started, all we need to do is to wait: take
> > > a little nap in the allocator's page==NULL slowpath.
> > >
> > > I'm not seeing any reason why we can't start kswapd much earlier -
> > > right at the start of do_basic_setup()?
> >
> > It doesn't even have to be kswapd, it just should be a thread pinned to
> > a node. The difficulty is that dealing with the system hashes means the
> > initialisation has to happen before vfs_caches_init_early() when there is
> > no scheduler.
>
> I bet we can run vfs_caches_init_early() after sched_init(). Might
> need a few little fixups.
>

For the large hashes, that would leave the CMA requirement because
allocation sizes can be larger than order-10. Arguably on NUMA, that's
a bad idea anyway because it should have been interleaved but it's not
something this patchset should change.

> > Those allocations could be delayed further but then there is
> > the possibility that the allocations would not be contiguous and they'd
> > have to rely on CMA to make the attempt. That potentially alters the
> > performance of the large system hashes at run time.
>
> hm, why. If the kswapd threads are running and busily creating free
> pages then alloc_pages(order=10) can detect this situation and stall
> for a while, waiting for kswapd to create an order-10 page.
>

In Waiman's case, the OOM happened when kswapd was not necessarily available
but that's an implementation detail. I'll look tomorrow at what is required
to use dedicated threads to parallelise the allocation and synchronously
wait for those threads to complete. It should be possible to create those
earlier than kswapd currently is. It'll take longer to boot but hopefully
not so long that it makes the series pointless.

> Alternatively, the page allocator can go off and synchronously
> initialize some pageframes itself. Keep doing that until the
> allocation attempt succeeds.
>

That was rejected during review of earlier attempts at this feature on
the grounds that it impacted allocator fast paths.

--
Mel Gorman
SUSE Labs

2015-05-05 22:25:52

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On Tue, 5 May 2015 23:13:29 +0100 Mel Gorman <[email protected]> wrote:

> > Alternatively, the page allocator can go off and synchronously
> > initialize some pageframes itself. Keep doing that until the
> > allocation attempt succeeds.
> >
>
> That was rejected during review of earlier attempts at this feature on
> the grounds that it impacted allocator fast paths.

eh? Changes are only needed on the allocation-attempt-failed path,
which is slow-path.

2015-05-06 00:55:51

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On 05/05/2015 10:31 AM, Mel Gorman wrote:
> On Tue, May 05, 2015 at 09:55:52AM -0400, Waiman Long wrote:
>> On 05/05/2015 06:45 AM, Mel Gorman wrote:
>>> On Mon, May 04, 2015 at 02:30:46PM -0700, Andrew Morton wrote:
>>>>> Before the patch, the boot time from elilo prompt to ssh login was 694s.
>>>>> After the patch, the boot up time was 346s, a saving of 348s (about 50%).
>>>> Having to guesstimate the amount of memory which is needed for a
>>>> successful boot will be painful. Any number we choose will be wrong
>>>> 99% of the time.
>>>>
>>>> If the kswapd threads have started, all we need to do is to wait: take
>>>> a little nap in the allocator's page==NULL slowpath.
>>>>
>>>> I'm not seeing any reason why we can't start kswapd much earlier -
>>>> right at the start of do_basic_setup()?
>>> It doesn't even have to be kswapd, it just should be a thread pinned to
>>> a node. The difficulty is that dealing with the system hashes means the
>>> initialisation has to happen before vfs_caches_init_early() when there is
>>> no scheduler. Those allocations could be delayed further but then there is
>>> the possibility that the allocations would not be contiguous and they'd
>>> have to rely on CMA to make the attempt. That potentially alters the
>>> performance of the large system hashes at run time.
>>>
>>> We can scale the amount initialised with memory sizes relatively easily.
>>> This boots on the same 1TB machine I was testing before but that is
>>> hardly a surprise.
>>>
>>> ---8<---
>>> mm: meminit: Take into account that large system caches scale linearly with memory
>>>
>>> Waiman Long reported a 24TB machine triggered an OOM as parallel memory
>>> initialisation deferred too much memory for initialisation. The likely
>>> consumer of this memory was large system hashes that scale with memory
>>> size. This patch initialises at least 2G per node but scales the amount
>>> initialised for larger systems.
>>>
>>> Signed-off-by: Mel Gorman<[email protected]>
>>> ---
>>> mm/page_alloc.c | 15 +++++++++++++--
>>> 1 file changed, 13 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 598f78d6544c..f7cc6c9fb909 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -266,15 +266,16 @@ static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
>>> */
>>> static inline bool update_defer_init(pg_data_t *pgdat,
>>> unsigned long pfn, unsigned long zone_end,
>>> + unsigned long max_initialise,
>>> unsigned long *nr_initialised)
>>> {
>>> /* Always populate low zones for address-constrained allocations */
>>> if (zone_end < pgdat_end_pfn(pgdat))
>>> return true;
>>>
>>> - /* Initialise at least 2G of the highest zone */
>>> + /* Initialise at least the requested amount in the highest zone */
>>> (*nr_initialised)++;
>>> - if (*nr_initialised > (2UL << (30 - PAGE_SHIFT)) &&
>>> + if ((*nr_initialised > max_initialise) &&
>>> (pfn & (PAGES_PER_SECTION - 1)) == 0) {
>>> pgdat->first_deferred_pfn = pfn;
>>> return false;
>>> @@ -299,6 +300,7 @@ static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
>>>
>>> static inline bool update_defer_init(pg_data_t *pgdat,
>>> unsigned long pfn, unsigned long zone_end,
>>> + unsigned long max_initialise,
>>> unsigned long *nr_initialised)
>>> {
>>> return true;
>>> @@ -4457,11 +4459,19 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>> unsigned long end_pfn = start_pfn + size;
>>> unsigned long pfn;
>>> struct zone *z;
>>> + unsigned long max_initialise;
>>> unsigned long nr_initialised = 0;
>>>
>>> if (highest_memmap_pfn < end_pfn - 1)
>>> highest_memmap_pfn = end_pfn - 1;
>>>
>>> + /*
>>> + * Initialise at least 2G of a node but also take into account that
>>> + * two large system hashes can take up an 8th of memory.
>>> + */
>>> + max_initialise = min(2UL << (30 - PAGE_SHIFT),
>>> + (pgdat->node_spanned_pages >> 3));
>>> +

I found an error here. The correct code should be:

max_initialise = max(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));

With min(), any node larger than 16G is still capped at the old 2G, which
is exactly the amount that proved too small. The error made the 24-TB
machine crash again.

Cheers,
Longman

2015-05-06 01:21:28

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On 05/05/2015 04:02 PM, Andrew Morton wrote:
> On Tue, 5 May 2015 11:45:14 +0100 Mel Gorman<[email protected]> wrote:
>
>> On Mon, May 04, 2015 at 02:30:46PM -0700, Andrew Morton wrote:
>>>> Before the patch, the boot time from elilo prompt to ssh login was 694s.
>>>> After the patch, the boot up time was 346s, a saving of 348s (about 50%).
>>> Having to guesstimate the amount of memory which is needed for a
>>> successful boot will be painful. Any number we choose will be wrong
>>> 99% of the time.
>>>
>>> If the kswapd threads have started, all we need to do is to wait: take
>>> a little nap in the allocator's page==NULL slowpath.
>>>
>>> I'm not seeing any reason why we can't start kswapd much earlier -
>>> right at the start of do_basic_setup()?
>> It doesn't even have to be kswapd, it just should be a thread pinned to
>> a node. The difficulty is that dealing with the system hashes means the
>> initialisation has to happen before vfs_caches_init_early() when there is
>> no scheduler.
> I bet we can run vfs_caches_init_early() after sched_init(). Might
> need a few little fixups.
>
>> Those allocations could be delayed further but then there is
>> the possibility that the allocations would not be contiguous and they'd
>> have to rely on CMA to make the attempt. That potentially alters the
>> performance of the large system hashes at run time.
> hm, why. If the kswapd threads are running and busily creating free
> pages then alloc_pages(order=10) can detect this situation and stall
> for a while, waiting for kswapd to create an order-10 page.
>
> Alternatively, the page allocator can go off and synchronously
> initialize some pageframes itself. Keep doing that until the
> allocation attempt succeeds.
>
> Such an approach is much more robust than trying to predict how much
> memory will be needed.
>

Most of those hash tables are allocated before smp_boot. In UP mode, you
can't have another thread initializing memory. So we really need to
preallocate enough for those tables.

Cheers,
Longman

2015-05-06 01:55:29

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On Tue, 05 May 2015 21:21:19 -0400 Waiman Long <[email protected]> wrote:

> On 05/05/2015 04:02 PM, Andrew Morton wrote:
> > On Tue, 5 May 2015 11:45:14 +0100 Mel Gorman<[email protected]> wrote:
> >
> >> On Mon, May 04, 2015 at 02:30:46PM -0700, Andrew Morton wrote:
> >>>> Before the patch, the boot time from elilo prompt to ssh login was 694s.
> >>>> After the patch, the boot up time was 346s, a saving of 348s (about 50%).
> >>> Having to guesstimate the amount of memory which is needed for a
> >>> successful boot will be painful. Any number we choose will be wrong
> >>> 99% of the time.
> >>>
> >>> If the kswapd threads have started, all we need to do is to wait: take
> >>> a little nap in the allocator's page==NULL slowpath.
> >>>
> >>> I'm not seeing any reason why we can't start kswapd much earlier -
> >>> right at the start of do_basic_setup()?
> >> It doesn't even have to be kswapd, it just should be a thread pinned to
> >> a node. The difficulty is that dealing with the system hashes means the
> >> initialisation has to happen before vfs_caches_init_early() when there is
> >> no scheduler.
> > I bet we can run vfs_caches_init_early() after sched_init(). Might
> > need a few little fixups.
> >
> >> Those allocations could be delayed further but then there is
> >> the possibility that the allocations would not be contiguous and they'd
> >> have to rely on CMA to make the attempt. That potentially alters the
> >> performance of the large system hashes at run time.
> > hm, why. If the kswapd threads are running and busily creating free
> > pages then alloc_pages(order=10) can detect this situation and stall
> > for a while, waiting for kswapd to create an order-10 page.
> >
> > Alternatively, the page allocator can go off and synchronously
> > initialize some pageframes itself. Keep doing that until the
> > allocation attempt succeeds.
> >
> > Such an approach is much more robust than trying to predict how much
> > memory will be needed.
> >
>
> Most of those hash tables are allocated before smp_boot. In UP mode, you
> can't have another thread initializing memory. So we really need to
> preallocate enough for those tables.

(copy-paste)

: Alternatively, the page allocator can go off and synchronously
: initialize some pageframes itself. Keep doing that until the
: allocation attempt succeeds.

IOW, the caller of alloc_pages() goes off and does the work which
kswapd would have done later on.

2015-05-06 03:39:15

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On 05/05/2015 11:01 AM, Waiman Long wrote:
> On 05/05/2015 10:31 AM, Mel Gorman wrote:
>> On Tue, May 05, 2015 at 09:55:52AM -0400, Waiman Long wrote:
>>> On 05/05/2015 06:45 AM, Mel Gorman wrote:
>>>> On Mon, May 04, 2015 at 02:30:46PM -0700, Andrew Morton wrote:
>>>>>> Before the patch, the boot time from elilo prompt to ssh login
>>>>>> was 694s.
>>>>>> After the patch, the boot up time was 346s, a saving of 348s
>>>>>> (about 50%).
>>>>> Having to guesstimate the amount of memory which is needed for a
>>>>> successful boot will be painful. Any number we choose will be wrong
>>>>> 99% of the time.
>>>>>
>>>>> If the kswapd threads have started, all we need to do is to wait:
>>>>> take
>>>>> a little nap in the allocator's page==NULL slowpath.
>>>>>
>>>>> I'm not seeing any reason why we can't start kswapd much earlier -
>>>>> right at the start of do_basic_setup()?
>>>> It doesn't even have to be kswapd, it just should be a thread
>>>> pinned to
>>>> a node. The difficulty is that dealing with the system hashes means
>>>> the
>>>> initialisation has to happen before vfs_caches_init_early() when
>>>> there is
>>>> no scheduler. Those allocations could be delayed further but then
>>>> there is
>>>> the possibility that the allocations would not be contiguous and
>>>> they'd
>>>> have to rely on CMA to make the attempt. That potentially alters the
>>>> performance of the large system hashes at run time.
>>>>
>>>> We can scale the amount initialised with memory sizes relatively easily.
>>>> This boots on the same 1TB machine I was testing before but that is
>>>> hardly a surprise.
>>>>
>>>> ---8<---
>>>> mm: meminit: Take into account that large system caches scale
>>>> linearly with memory
>>>>
>>>> Waiman Long reported a 24TB machine triggered an OOM as parallel
>>>> memory
>>>> initialisation deferred too much memory for initialisation. The likely
>>>> consumer of this memory was large system hashes that scale with memory
>>>> size. This patch initialises at least 2G per node but scales the
>>>> amount
>>>> initialised for larger systems.
>>>>
>>>> Signed-off-by: Mel Gorman<[email protected]>
>>>> ---
>>>> mm/page_alloc.c | 15 +++++++++++++--
>>>> 1 file changed, 13 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 598f78d6544c..f7cc6c9fb909 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -266,15 +266,16 @@ static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
>>>> */
>>>> static inline bool update_defer_init(pg_data_t *pgdat,
>>>> unsigned long pfn, unsigned long zone_end,
>>>> + unsigned long max_initialise,
>>>> unsigned long *nr_initialised)
>>>> {
>>>> /* Always populate low zones for address-constrained allocations */
>>>> if (zone_end < pgdat_end_pfn(pgdat))
>>>> return true;
>>>>
>>>> - /* Initialise at least 2G of the highest zone */
>>>> + /* Initialise at least the requested amount in the highest zone */
>>>> (*nr_initialised)++;
>>>> - if (*nr_initialised > (2UL << (30 - PAGE_SHIFT)) &&
>>>> + if ((*nr_initialised > max_initialise) &&
>>>> (pfn & (PAGES_PER_SECTION - 1)) == 0) {
>>>> pgdat->first_deferred_pfn = pfn;
>>>> return false;
>>>> @@ -299,6 +300,7 @@ static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
>>>>
>>>> static inline bool update_defer_init(pg_data_t *pgdat,
>>>> unsigned long pfn, unsigned long zone_end,
>>>> + unsigned long max_initialise,
>>>> unsigned long *nr_initialised)
>>>> {
>>>> return true;
>>>> @@ -4457,11 +4459,19 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>>> unsigned long end_pfn = start_pfn + size;
>>>> unsigned long pfn;
>>>> struct zone *z;
>>>> + unsigned long max_initialise;
>>>> unsigned long nr_initialised = 0;
>>>>
>>>> if (highest_memmap_pfn < end_pfn - 1)
>>>> highest_memmap_pfn = end_pfn - 1;
>>>>
>>>> + /*
>>>> + * Initialise at least 2G of a node but also take into account
>>>> + * that two large system hashes can take up an 8th of memory.
>>>> + */
>>>> + max_initialise = min(2UL << (30 - PAGE_SHIFT),
>>>> + (pgdat->node_spanned_pages >> 3));
>>>> +
>>> I think you may be pre-allocating too much memory here. On the 24-TB
>>> machine, the size of the dentry and inode hash tables were 16G each.
>>> So the ratio is about 32G/24T = 0.13%. I think a shift
>>> factor of (>> 8) which is about 0.39% should be more than enough.
>> I was taking the most pessimistic value possible to match where those
>> hashes currently get allocated from so that the locality does not change
>> after the series is applied. Can you try both (>> 3) and (>> 8) and see
>> whether both work and, if so, what the timing is?
>
> Sure. I will try both and get you the results, hopefully by tomorrow
> at the latest.
>

With the modified patch, both (>>3) and (>>8) worked without any
problem. The bootup times are:

1. Unpatched 4.0 kernel - 694s
2. Patched kernel with 4G/node - 346s
3. Patched kernel with (>>3) - 389s
4. Patched kernel with (>>8) - 353s

Cheers,
Longman

2015-05-06 07:12:54

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On Tue, May 05, 2015 at 03:25:49PM -0700, Andrew Morton wrote:
> On Tue, 5 May 2015 23:13:29 +0100 Mel Gorman <[email protected]> wrote:
>
> > > Alternatively, the page allocator can go off and synchronously
> > > initialize some pageframes itself. Keep doing that until the
> > > allocation attempt succeeds.
> > >
> >
> > That was rejected during review of earlier attempts at this feature on
> > the grounds that it impacted allocator fast paths.
>
> eh? Changes are only needed on the allocation-attempt-failed path,
> which is slow-path.

We'd have to distinguish between falling back to other zones because the
high zone is artificially exhausted and normal ALLOC_BATCH exhaustion. We'd
also have to avoid falling back to remote nodes prematurely. While I have
not tried an implementation, I expected they would need to be in the fast
paths unless I used jump labels to get around it. I'm going to try altering
when we initialise instead so that it happens earlier.
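
For reference, the jump-label idea would look roughly like the sketch
below. It is not code from this series; the key name is invented and the
spelling follows the static-key API of later kernels:

#include <linux/jump_label.h>

/* Enabled during boot; once all deferred pages are initialised the key
 * is patched out, so the fast path is left with a single no-op. */
static DEFINE_STATIC_KEY_TRUE(deferred_pages_key);

static inline bool page_maybe_uninitialised(unsigned long pfn)
{
	if (static_branch_unlikely(&deferred_pages_key))
		return early_page_uninitialised(pfn);
	return false;
}

/* After the last node finishes:
 *	static_branch_disable(&deferred_pages_key);
 */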

--
Mel Gorman
SUSE Labs

2015-05-06 10:22:36

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On Wed, May 06, 2015 at 08:12:46AM +0100, Mel Gorman wrote:
> On Tue, May 05, 2015 at 03:25:49PM -0700, Andrew Morton wrote:
> > On Tue, 5 May 2015 23:13:29 +0100 Mel Gorman <[email protected]> wrote:
> >
> > > > Alternatively, the page allocator can go off and synchronously
> > > > initialize some pageframes itself. Keep doing that until the
> > > > allocation attempt succeeds.
> > > >
> > >
> > > That was rejected during review of earlier attempts at this feature on
> > > the grounds that it impacted allocator fast paths.
> >
> > eh? Changes are only needed on the allocation-attempt-failed path,
> > which is slow-path.
>
> We'd have to distinguish between falling back to other zones because the
> high zone is artificially exhausted and normal ALLOC_BATCH exhaustion. We'd
> also have to avoid falling back to remote nodes prematurely. While I have
> not tried an implementation, I expected they would need to be in the fast
> paths unless I used jump labels to get around it. I'm going to try altering
> when we initialise instead so that it happens earlier.
>

Which looks as follows. Waiman, a test on the 24TB machine would be
appreciated again. This patch should be applied instead of "mm: meminit:
Take into account that large system caches scale linearly with memory"

---8<---
mm: meminit: Finish initialisation of memory before basic setup

Waiman Long reported that 24TB machines hit OOM during basic setup when
struct page initialisation was deferred. One approach is to initialise memory
on demand but it interferes with page allocator paths. This patch creates
dedicated threads to initialise memory before basic setup. It then blocks
on a rw_semaphore until completion as a wait_queue and counter is overkill.
This may be slower to boot but it's simpler overall and also gets rid of a
lot of section mangling which existed so kswapd could do the initialisation.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/gfp.h | 8 ++++++++
init/main.c | 2 ++
mm/internal.h | 24 ------------------------
mm/page_alloc.c | 44 ++++++++++++++++++++++++++++++++++++--------
mm/vmscan.c | 6 ++----
5 files changed, 48 insertions(+), 36 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 51bd1e72a917..28a3128d9e59 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -385,6 +385,14 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
void drain_all_pages(struct zone *zone);
void drain_local_pages(struct zone *zone);

+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+void page_alloc_init_late(void);
+#else
+static inline void page_alloc_init_late(void)
+{
+}
+#endif
+
/*
* gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
* GFP flags are used before interrupts are enabled. Once interrupts are
diff --git a/init/main.c b/init/main.c
index 6f0f1c5ff8cc..9bef5f0c9864 100644
--- a/init/main.c
+++ b/init/main.c
@@ -995,6 +995,8 @@ static noinline void __init kernel_init_freeable(void)
smp_init();
sched_init_smp();

+ page_alloc_init_late();
+
do_basic_setup();

/* Open the /dev/console on the rootfs, this should never fail */
diff --git a/mm/internal.h b/mm/internal.h
index 5c221ad41a29..5a7c7a531720 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -377,30 +377,6 @@ static inline void mminit_verify_zonelist(void)
}
#endif /* CONFIG_DEBUG_MEMORY_INIT */

-/*
- * Deferred struct page initialisation requires init functions that are freed
- * before kswapd is available. Reuse the memory hotplug section annotation
- * to mark the required code.
- *
- * __defermem_init is code that always exists but is annotated __meminit to
- * avoid section warnings.
- * __defer_init code gets marked __meminit when deferring struct page
- * initialistion but is otherwise in the init section.
- */
-#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-#define __defermem_init __meminit
-#define __defer_init __meminit
-
-void deferred_init_memmap(int nid);
-#else
-#define __defermem_init
-#define __defer_init __init
-
-static inline void deferred_init_memmap(int nid)
-{
-}
-#endif
-
/* mminit_validate_memmodel_limits is independent of CONFIG_DEBUG_MEMORY_INIT */
#if defined(CONFIG_SPARSEMEM)
extern void mminit_validate_memmodel_limits(unsigned long *start_pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 598f78d6544c..1cef116727b6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -61,6 +61,7 @@
#include <linux/hugetlb.h>
#include <linux/sched/rt.h>
#include <linux/page_owner.h>
+#include <linux/kthread.h>

#include <asm/sections.h>
#include <asm/tlbflush.h>
@@ -242,7 +243,7 @@ static inline void reset_deferred_meminit(pg_data_t *pgdat)
}

/* Returns true if the struct page for the pfn is uninitialised */
-static inline bool __defermem_init early_page_uninitialised(unsigned long pfn)
+static inline bool __init early_page_uninitialised(unsigned long pfn)
{
int nid = early_pfn_to_nid(pfn);

@@ -972,7 +973,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
local_irq_restore(flags);
}

-static void __defer_init __free_pages_boot_core(struct page *page,
+static void __init __free_pages_boot_core(struct page *page,
unsigned long pfn, unsigned int order)
{
unsigned int nr_pages = 1 << order;
@@ -1039,7 +1040,7 @@ static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
}
#endif

-void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn,
+void __init __free_pages_bootmem(struct page *page, unsigned long pfn,
unsigned int order)
{
if (early_page_uninitialised(pfn))
@@ -1048,7 +1049,7 @@ void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn,
}

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-static void __defermem_init deferred_free_range(struct page *page,
+static void __init deferred_free_range(struct page *page,
unsigned long pfn, int nr_pages)
{
int i;
@@ -1068,20 +1069,30 @@ static void __defermem_init deferred_free_range(struct page *page,
__free_pages_boot_core(page, pfn, 0);
}

+static struct rw_semaphore __initdata pgdat_init_rwsem;
+
/* Initialise remaining memory on a node */
-void __defermem_init deferred_init_memmap(int nid)
+static int __init deferred_init_memmap(void *data)
{
+ pg_data_t *pgdat = (pg_data_t *)data;
+ int nid = pgdat->node_id;
struct mminit_pfnnid_cache nid_init_state = { };
unsigned long start = jiffies;
unsigned long nr_pages = 0;
unsigned long walk_start, walk_end;
int i, zid;
struct zone *zone;
- pg_data_t *pgdat = NODE_DATA(nid);
unsigned long first_init_pfn = pgdat->first_deferred_pfn;
+ const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);

- if (first_init_pfn == ULONG_MAX)
- return;
+ if (first_init_pfn == ULONG_MAX) {
+ up_read(&pgdat_init_rwsem);
+ return 0;
+ }
+
+ /* Bound memory initialisation to a local node if possible */
+ if (!cpumask_empty(cpumask))
+ set_cpus_allowed_ptr(current, cpumask);

/* Sanity check boundaries */
BUG_ON(pgdat->first_deferred_pfn < pgdat->node_start_pfn);
@@ -1175,6 +1186,23 @@ free_range:

pr_info("kswapd %d initialised %lu pages in %ums\n", nid, nr_pages,
jiffies_to_msecs(jiffies - start));
+ up_read(&pgdat_init_rwsem);
+ return 0;
+}
+
+void __init page_alloc_init_late(void)
+{
+ int nid;
+
+ init_rwsem(&pgdat_init_rwsem);
+ for_each_node_state(nid, N_MEMORY) {
+ down_read(&pgdat_init_rwsem);
+ kthread_run(deferred_init_memmap, NODE_DATA(nid), "pgdatinit%d", nid);
+ }
+
+ /* Block until all are initialised */
+ down_write(&pgdat_init_rwsem);
+ up_write(&pgdat_init_rwsem);
}
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4895d26d036..5e8eadd71bac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3348,7 +3348,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
* If there are applications that are active memory-allocators
* (most normal use), this basically shouldn't matter.
*/
-static int __defermem_init kswapd(void *p)
+static int kswapd(void *p)
{
unsigned long order, new_order;
unsigned balanced_order;
@@ -3383,8 +3383,6 @@ static int __defermem_init kswapd(void *p)
tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
set_freezable();

- deferred_init_memmap(pgdat->node_id);
-
order = new_order = 0;
balanced_order = 0;
classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
@@ -3540,7 +3538,7 @@ static int cpu_callback(struct notifier_block *nfb, unsigned long action,
* This kswapd start function will be called by init and node-hot-add.
* On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
*/
-int __defermem_init kswapd_run(int nid)
+int kswapd_run(int nid)
{
pg_data_t *pgdat = NODE_DATA(nid);
int ret = 0;
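
The rw_semaphore doubles as a completion barrier here: page_alloc_init_late()
takes one read hold per node before spawning each thread, every thread drops
its hold on exit, and the final down_write() can only be granted once all
holds are gone. A minimal userspace sketch of the same wait-for-all pattern
(a counting semaphore stands in for the rwsem, since a POSIX rwlock may not
be released by a different thread; all names are illustrative):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define NR_NODES 4

static sem_t done;			/* plays the role of pgdat_init_rwsem */

static void *pgdatinit(void *arg)
{
	long nid = (long)arg;

	/* ... initialise this node's struct pages ... */
	printf("node %ld initialised\n", nid);
	sem_post(&done);		/* analogous to up_read() */
	return NULL;
}

int main(void)
{
	pthread_t t[NR_NODES];

	sem_init(&done, 0, 0);
	for (long i = 0; i < NR_NODES; i++)
		pthread_create(&t[i], NULL, pgdatinit, (void *)i);
	for (int i = 0; i < NR_NODES; i++)
		sem_wait(&done);	/* analogous to the final down_write() */
	printf("all nodes initialised; continue with basic setup\n");
	for (int i = 0; i < NR_NODES; i++)
		pthread_join(t[i], NULL);
	return 0;
}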

2015-05-06 12:05:27

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On Wed, May 06, 2015 at 11:22:20AM +0100, Mel Gorman wrote:
> On Wed, May 06, 2015 at 08:12:46AM +0100, Mel Gorman wrote:
> > On Tue, May 05, 2015 at 03:25:49PM -0700, Andrew Morton wrote:
> > > On Tue, 5 May 2015 23:13:29 +0100 Mel Gorman <[email protected]> wrote:
> > >
> > > > > Alternatively, the page allocator can go off and synchronously
> > > > > initialize some pageframes itself. Keep doing that until the
> > > > > allocation attempt succeeds.
> > > > >
> > > >
> > > > That was rejected during review of earlier attempts at this feature on
> > > > the grounds that it impacted allocator fast paths.
> > >
> > > eh? Changes are only needed on the allocation-attempt-failed path,
> > > which is slow-path.
> >
> > We'd have to distinguish between falling back to other zones because the
> > high zone is artificially exhausted and normal ALLOC_BATCH exhaustion. We'd
> > also have to avoid falling back to remote nodes prematurely. While I have
> > not tried an implementation, I expected they would need to be in the fast
> > paths unless I used jump labels to get around it. I'm going to try altering
> > when we initialise instead so that it happens earlier.
> >
>
> Which looks as follows. Waiman, a test on the 24TB machine would be
> appreciated again. This patch should be applied instead of "mm: meminit:
> Take into account that large system caches scale linearly with memory"
>
> ---8<---
> mm: meminit: Finish initialisation of memory before basic setup
>

*sigh* Eventually build testing found the need for this

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1cef116727b6..052b9ba65b66 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -243,7 +243,7 @@ static inline void reset_deferred_meminit(pg_data_t *pgdat)
}

/* Returns true if the struct page for the pfn is uninitialised */
-static inline bool __init early_page_uninitialised(unsigned long pfn)
+static inline bool __meminit early_page_uninitialised(unsigned long pfn)
{
int nid = early_pfn_to_nid(pfn);

2015-05-06 17:58:49

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On 05/06/2015 06:22 AM, Mel Gorman wrote:
> On Wed, May 06, 2015 at 08:12:46AM +0100, Mel Gorman wrote:
>> On Tue, May 05, 2015 at 03:25:49PM -0700, Andrew Morton wrote:
>>> On Tue, 5 May 2015 23:13:29 +0100 Mel Gorman<[email protected]> wrote:
>>>
>>>>> Alternatively, the page allocator can go off and synchronously
>>>>> initialize some pageframes itself. Keep doing that until the
>>>>> allocation attempt succeeds.
>>>>>
>>>> That was rejected during review of earlier attempts at this feature on
>>>> the grounds that it impacted allocator fast paths.
>>> eh? Changes are only needed on the allocation-attempt-failed path,
>>> which is slow-path.
>> We'd have to distinguish between falling back to other zones because the
>> high zone is artificially exhausted and normal ALLOC_BATCH exhaustion. We'd
>> also have to avoid falling back to remote nodes prematurely. While I have
>> not tried an implementation, I expected they would need to be in the fast
>> paths unless I used jump labels to get around it. I'm going to try altering
>> when we initialise instead so that it happens earlier.
>>
> Which looks as follows. Waiman, a test on the 24TB machine would be
> appreciated again. This patch should be applied instead of "mm: meminit:
> Take into account that large system caches scale linearly with memory"
>
> ---8<---
> mm: meminit: Finish initialisation of memory before basic setup
>
> Waiman Long reported that 24TB machines hit OOM during basic setup when
> struct page initialisation was deferred. One approach is to initialise memory
> on demand but it interferes with page allocator paths. This patch creates
> dedicated threads to initialise memory before basic setup. It then blocks
> on a rw_semaphore until completion as a wait_queue and counter is overkill.
> This may be slower to boot but it's simpler overall and also gets rid of a
> lot of section mangling which existed so kswapd could do the initialisation.
>
> Signed-off-by: Mel Gorman<[email protected]>
>

This patch moves the deferred meminit from kswapd to its own kernel
threads started after smp_init(). However, the hash table allocation was
done earlier than that. It seems like it will still run out of memory in
the 24TB machine that I tested on.

I will certainly try it out, but I doubt it will solve the problem on
its own.

Cheers,
Longman

2015-05-07 02:37:35

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On 05/06/2015 01:58 PM, Waiman Long wrote:
> On 05/06/2015 06:22 AM, Mel Gorman wrote:
>> On Wed, May 06, 2015 at 08:12:46AM +0100, Mel Gorman wrote:
>>> On Tue, May 05, 2015 at 03:25:49PM -0700, Andrew Morton wrote:
>>>> On Tue, 5 May 2015 23:13:29 +0100 Mel Gorman<[email protected]> wrote:
>>>>
>>>>>> Alternatively, the page allocator can go off and synchronously
>>>>>> initialize some pageframes itself. Keep doing that until the
>>>>>> allocation attempt succeeds.
>>>>>>
>>>>> That was rejected during review of earlier attempts at this
>>>>> feature on
>>>>> the grounds that it impacted allocator fast paths.
>>>> eh? Changes are only needed on the allocation-attempt-failed path,
>>>> which is slow-path.
>>> We'd have to distinguish between falling back to other zones because
>>> the
>>> high zone is artificially exhausted and normal ALLOC_BATCH
>>> exhaustion. We'd
>>> also have to avoid falling back to remote nodes prematurely. While I
>>> have
>>> not tried an implementation, I expected they would need to be in the
>>> fast
>>> paths unless I used jump labels to get around it. I'm going to try
>>> altering
>>> when we initialise instead so that it happens earlier.
>>>
>> Which looks as follows. Waiman, a test on the 24TB machine would be
>> appreciated again. This patch should be applied instead of "mm: meminit:
>> Take into account that large system caches scale linearly with memory"
>>
>> ---8<---
>> mm: meminit: Finish initialisation of memory before basic setup
>>
>> Waiman Long reported that 24TB machines hit OOM during basic setup when
>> struct page initialisation was deferred. One approach is to
>> initialise memory
>> on demand but it interferes with page allocator paths. This patch
>> creates
>> dedicated threads to initialise memory before basic setup. It then
>> blocks
>> on a rw_semaphore until completion as a wait_queue and counter is
>> overkill.
>> This may be slower to boot but it's simpler overall and also gets
>> rid of a
>> lot of section mangling which existed so kswapd could do the
>> initialisation.
>>
>> Signed-off-by: Mel Gorman<[email protected]>
>>
>
> This patch moves the deferred meminit from kswapd to its own kernel
> threads started after smp_init(). However, the hash table allocation
> was done earlier than that. It seems like it will still run out of
> memory in the 24TB machine that I tested on.
>
> I will certainly try it out, but I doubt it will solve the problem on
> its own.

It turns out that the two new patches did work on the 24-TB DragonHawk
without the "mm: meminit: Take into account that large system caches
scale linearly with memory" patch. The bootup time was 357s which was
just a few seconds slower than the other bootup times that I sent you
yesterday.

BTW, do you want to change the following log message as kswapd will no
longer be the one doing deferred meminit?

kswapd 0 initialised 396098436 pages in 6024ms

Cheers,
Longman

2015-05-07 07:22:08

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4

On Wed, May 06, 2015 at 10:37:28PM -0400, Waiman Long wrote:
> On 05/06/2015 01:58 PM, Waiman Long wrote:
> >On 05/06/2015 06:22 AM, Mel Gorman wrote:
> >>On Wed, May 06, 2015 at 08:12:46AM +0100, Mel Gorman wrote:
> >>>On Tue, May 05, 2015 at 03:25:49PM -0700, Andrew Morton wrote:
> >>>>On Tue, 5 May 2015 23:13:29 +0100 Mel Gorman<[email protected]> wrote:
> >>>>
> >>>>>>Alternatively, the page allocator can go off and synchronously
> >>>>>>initialize some pageframes itself. Keep doing that until the
> >>>>>>allocation attempt succeeds.
> >>>>>>
> >>>>>That was rejected during review of earlier attempts at
> >>>>>this feature on
> >>>>>the grounds that it impacted allocator fast paths.
> >>>>eh? Changes are only needed on the allocation-attempt-failed path,
> >>>>which is slow-path.
> >>>We'd have to distinguish between falling back to other zones
> >>>because the
> >>>high zone is artificially exhausted and normal ALLOC_BATCH
> >>>exhaustion. We'd
> >>>also have to avoid falling back to remote nodes prematurely.
> >>>While I have
> >>>not tried an implementation, I expected they would need to be
> >>>in the fast
> >>>paths unless I used jump labels to get around it. I'm going to
> >>>try altering
> >>>when we initialise instead so that it happens earlier.
> >>>
> >>Which looks as follows. Waiman, a test on the 24TB machine would be
> >>appreciated again. This patch should be applied instead of "mm: meminit:
> >>Take into account that large system caches scale linearly with memory"
> >>
> >>---8<---
> >>mm: meminit: Finish initialisation of memory before basic setup
> >>
> >>Waiman Long reported that 24TB machines hit OOM during basic setup when
> >>struct page initialisation was deferred. One approach is to
> >>initialise memory
> >>on demand but it interferes with page allocator paths. This
> >>patch creates
> >>dedicated threads to initialise memory before basic setup. It
> >>then blocks
> >>on a rw_semaphore until completion as a wait_queue and counter
> >>is overkill.
> >>This may be slower to boot but it's simpler overall and also
> >>gets rid of a
> >>lot of section mangling which existed so kswapd could do the
> >>initialisation.
> >>
> >>Signed-off-by: Mel Gorman<[email protected]>
> >>
> >
> >This patch moves the deferred meminit from kswapd to its own
> >kernel threads started after smp_init(). However, the hash table
> >allocation was done earlier than that. It seems like it will still
> >run out of memory in the 24TB machine that I tested on.
> >
> >I will certainly try it out, but I doubt it will solve the problem
> >on its own.
>
> It turns out that the two new patches did work on the 24-TB
> DragonHawk without the "mm: meminit: Take into account that large
> system caches scale linearly with memory" patch. The bootup time was
> 357s which was just a few seconds slower than the other bootup times
> that I sent you yesterday.
>

Grand. This is what I expected because the previous failure was not the
hash tables, it was later allocations and the parallel initialisation
was early enough.

> BTW, do you want to change the following log message as kswapd will
> no longer be the one doing deferred meminit?
>
> kswapd 0 initialised 396098436 pages in 6024ms
>

I will.

--
Mel Gorman
SUSE Labs

2015-05-07 07:25:27

by Mel Gorman

[permalink] [raw]
Subject: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

Waiman Long reported that 24TB machines hit OOM during basic setup when
struct page initialisation was deferred. One approach is to initialise memory
on demand but it interferes with page allocator paths. This patch creates
dedicated threads to initialise memory before basic setup. It then blocks
on a rw_semaphore until completion as a wait_queue and counter is overkill.
This may be slower to boot but it's simpler overall and also gets rid of a
section mangling which existed so kswapd could do the initialisation.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/gfp.h | 8 ++++++++
init/main.c | 2 ++
mm/internal.h | 24 ------------------------
mm/page_alloc.c | 46 +++++++++++++++++++++++++++++++++++++---------
mm/vmscan.c | 6 ++----
5 files changed, 49 insertions(+), 37 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 51bd1e72a917..28a3128d9e59 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -385,6 +385,14 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
void drain_all_pages(struct zone *zone);
void drain_local_pages(struct zone *zone);

+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+void page_alloc_init_late(void);
+#else
+static inline void page_alloc_init_late(void)
+{
+}
+#endif
+
/*
* gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
* GFP flags are used before interrupts are enabled. Once interrupts are
diff --git a/init/main.c b/init/main.c
index 6f0f1c5ff8cc..9bef5f0c9864 100644
--- a/init/main.c
+++ b/init/main.c
@@ -995,6 +995,8 @@ static noinline void __init kernel_init_freeable(void)
smp_init();
sched_init_smp();

+ page_alloc_init_late();
+
do_basic_setup();

/* Open the /dev/console on the rootfs, this should never fail */
diff --git a/mm/internal.h b/mm/internal.h
index 5c221ad41a29..5a7c7a531720 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -377,30 +377,6 @@ static inline void mminit_verify_zonelist(void)
}
#endif /* CONFIG_DEBUG_MEMORY_INIT */

-/*
- * Deferred struct page initialisation requires init functions that are freed
- * before kswapd is available. Reuse the memory hotplug section annotation
- * to mark the required code.
- *
- * __defermem_init is code that always exists but is annotated __meminit to
- * avoid section warnings.
- * __defer_init code gets marked __meminit when deferring struct page
- * initialistion but is otherwise in the init section.
- */
-#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-#define __defermem_init __meminit
-#define __defer_init __meminit
-
-void deferred_init_memmap(int nid);
-#else
-#define __defermem_init
-#define __defer_init __init
-
-static inline void deferred_init_memmap(int nid)
-{
-}
-#endif
-
/* mminit_validate_memmodel_limits is independent of CONFIG_DEBUG_MEMORY_INIT */
#if defined(CONFIG_SPARSEMEM)
extern void mminit_validate_memmodel_limits(unsigned long *start_pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 598f78d6544c..7c257e37f2ce 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -61,6 +61,7 @@
#include <linux/hugetlb.h>
#include <linux/sched/rt.h>
#include <linux/page_owner.h>
+#include <linux/kthread.h>

#include <asm/sections.h>
#include <asm/tlbflush.h>
@@ -242,7 +243,7 @@ static inline void reset_deferred_meminit(pg_data_t *pgdat)
}

/* Returns true if the struct page for the pfn is uninitialised */
-static inline bool __defermem_init early_page_uninitialised(unsigned long pfn)
+static inline bool __meminit early_page_uninitialised(unsigned long pfn)
{
int nid = early_pfn_to_nid(pfn);

@@ -972,7 +973,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
local_irq_restore(flags);
}

-static void __defer_init __free_pages_boot_core(struct page *page,
+static void __init __free_pages_boot_core(struct page *page,
unsigned long pfn, unsigned int order)
{
unsigned int nr_pages = 1 << order;
@@ -1039,7 +1040,7 @@ static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
}
#endif

-void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn,
+void __init __free_pages_bootmem(struct page *page, unsigned long pfn,
unsigned int order)
{
if (early_page_uninitialised(pfn))
@@ -1048,7 +1049,7 @@ void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn,
}

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-static void __defermem_init deferred_free_range(struct page *page,
+static void __init deferred_free_range(struct page *page,
unsigned long pfn, int nr_pages)
{
int i;
@@ -1068,20 +1069,30 @@ static void __defermem_init deferred_free_range(struct page *page,
__free_pages_boot_core(page, pfn, 0);
}

+static struct rw_semaphore __initdata pgdat_init_rwsem;
+
/* Initialise remaining memory on a node */
-void __defermem_init deferred_init_memmap(int nid)
+static int __init deferred_init_memmap(void *data)
{
+ pg_data_t *pgdat = (pg_data_t *)data;
+ int nid = pgdat->node_id;
struct mminit_pfnnid_cache nid_init_state = { };
unsigned long start = jiffies;
unsigned long nr_pages = 0;
unsigned long walk_start, walk_end;
int i, zid;
struct zone *zone;
- pg_data_t *pgdat = NODE_DATA(nid);
unsigned long first_init_pfn = pgdat->first_deferred_pfn;
+ const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);

- if (first_init_pfn == ULONG_MAX)
- return;
+ if (first_init_pfn == ULONG_MAX) {
+ up_read(&pgdat_init_rwsem);
+ return 0;
+ }
+
+ /* Bound memory initialisation to a local node if possible */
+ if (!cpumask_empty(cpumask))
+ set_cpus_allowed_ptr(current, cpumask);

/* Sanity check boundaries */
BUG_ON(pgdat->first_deferred_pfn < pgdat->node_start_pfn);
@@ -1173,8 +1184,25 @@ free_range:
/* Sanity check that the next zone really is unpopulated */
WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));

- pr_info("kswapd %d initialised %lu pages in %ums\n", nid, nr_pages,
+ pr_info("node %d initialised, %lu pages in %ums\n", nid, nr_pages,
jiffies_to_msecs(jiffies - start));
+ up_read(&pgdat_init_rwsem);
+ return 0;
+}
+
+void __init page_alloc_init_late(void)
+{
+ int nid;
+
+ init_rwsem(&pgdat_init_rwsem);
+ for_each_node_state(nid, N_MEMORY) {
+ down_read(&pgdat_init_rwsem);
+ kthread_run(deferred_init_memmap, NODE_DATA(nid), "pgdatinit%d", nid);
+ }
+
+ /* Block until all are initialised */
+ down_write(&pgdat_init_rwsem);
+ up_write(&pgdat_init_rwsem);
}
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4895d26d036..5e8eadd71bac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3348,7 +3348,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
* If there are applications that are active memory-allocators
* (most normal use), this basically shouldn't matter.
*/
-static int __defermem_init kswapd(void *p)
+static int kswapd(void *p)
{
unsigned long order, new_order;
unsigned balanced_order;
@@ -3383,8 +3383,6 @@ static int __defermem_init kswapd(void *p)
tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
set_freezable();

- deferred_init_memmap(pgdat->node_id);
-
order = new_order = 0;
balanced_order = 0;
classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
@@ -3540,7 +3538,7 @@ static int cpu_callback(struct notifier_block *nfb, unsigned long action,
* This kswapd start function will be called by init and node-hot-add.
* On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
*/
-int __defermem_init kswapd_run(int nid)
+int kswapd_run(int nid)
{
pg_data_t *pgdat = NODE_DATA(nid);
int ret = 0;
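
The cpumask_of_node()/set_cpus_allowed_ptr() pairing above is what keeps each
init thread on CPUs local to the memory it is initialising. A userspace
sketch of the same binding with pthread affinity (the node-to-CPU mapping is
assumed here; the kernel derives it from the node's cpumask):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to the given CPUs, akin to
 * set_cpus_allowed_ptr(current, cpumask_of_node(nid)). */
static void bind_to_cpus(const int *cpus, int nr)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	for (int i = 0; i < nr; i++)
		CPU_SET(cpus[i], &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *node_init(void *arg)
{
	static const int node0_cpus[] = { 0, 1, 2, 3 };	/* assumed topology */

	(void)arg;
	bind_to_cpus(node0_cpus, 4);
	printf("initialising on node-local CPUs only\n");
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, node_init, NULL);
	pthread_join(t, NULL);
	return 0;
}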

2015-05-07 22:09:35

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

On Thu, 7 May 2015 08:25:18 +0100 Mel Gorman <[email protected]> wrote:

> Waiman Long reported that 24TB machines hit OOM during basic setup when
> struct page initialisation was deferred. One approach is to initialise memory
> on demand but it interferes with page allocator paths. This patch creates
> dedicated threads to initialise memory before basic setup. It then blocks
> on a rw_semaphore until completion as a wait_queue and counter is overkill.
> This may be slower to boot but it's simpler overall and also gets rid of a
> section mangling which existed so kswapd could do the initialisation.

Seems a reasonable compromise. It makes a bit of a mess of the patch
sequencing.

Have some tweaklets:



From: Andrew Morton <[email protected]>
Subject: mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup-fix

include rwsem.h, use DECLARE_RWSEM, fix comment, remove unneeded cast

Cc: Daniel J Blueman <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Nathan Zimmer <[email protected]>
Cc: Scott Norton <[email protected]>
Cc: Waiman Long <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

mm/page_alloc.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff -puN mm/page_alloc.c~mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup-fix mm/page_alloc.c
--- a/mm/page_alloc.c~mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup-fix
+++ a/mm/page_alloc.c
@@ -18,6 +18,7 @@
#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/interrupt.h>
+#include <linux/rwsem.h>
#include <linux/pagemap.h>
#include <linux/jiffies.h>
#include <linux/bootmem.h>
@@ -1075,12 +1076,12 @@ static void __init deferred_free_range(s
__free_pages_boot_core(page, pfn, 0);
}

-static struct rw_semaphore __initdata pgdat_init_rwsem;
+static __initdata DECLARE_RWSEM(pgdat_init_rwsem);

/* Initialise remaining memory on a node */
static int __init deferred_init_memmap(void *data)
{
- pg_data_t *pgdat = (pg_data_t *)data;
+ pg_data_t *pgdat = data;
int nid = pgdat->node_id;
struct mminit_pfnnid_cache nid_init_state = { };
unsigned long start = jiffies;
@@ -1096,7 +1097,7 @@ static int __init deferred_init_memmap(v
return 0;
}

- /* Bound memory initialisation to a local node if possible */
+ /* Bind memory initialisation thread to a local node if possible */
if (!cpumask_empty(cpumask))
set_cpus_allowed_ptr(current, cpumask);

@@ -1200,7 +1201,6 @@ void __init page_alloc_init_late(void)
{
int nid;

- init_rwsem(&pgdat_init_rwsem);
for_each_node_state(nid, N_MEMORY) {
down_read(&pgdat_init_rwsem);
kthread_run(deferred_init_memmap, NODE_DATA(nid), "pgdatinit%d", nid);
_

2015-05-07 22:52:33

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

On Thu, May 07, 2015 at 03:09:32PM -0700, Andrew Morton wrote:
> On Thu, 7 May 2015 08:25:18 +0100 Mel Gorman <[email protected]> wrote:
>
> > Waiman Long reported that 24TB machines hit OOM during basic setup when
> > struct page initialisation was deferred. One approach is to initialise memory
> > on demand but it interferes with page allocator paths. This patch creates
> > dedicated threads to initialise memory before basic setup. It then blocks
> > on a rw_semaphore until completion as a wait_queue and counter is overkill.
> > This may be slower to boot but it's simpler overall and also gets rid of a
> > section mangling which existed so kswapd could do the initialisation.
>
> Seems a reasonable compromise. It makes a bit of a mess of the patch
> sequencing.
>
> Have some tweaklets:
>

The tweaks are perfectly reasonable. As for the patch sequencing, I'm ok
with adding the patch on top if you are because that preserves the testing
history. If you're unhappy, I can shuffle it into a better place and resend
the full series that includes all the fixes so far.

Thanks.

--
Mel Gorman
SUSE Labs

2015-05-07 23:02:38

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

On Thu, 7 May 2015 23:52:26 +0100 Mel Gorman <[email protected]> wrote:

> As for the patch sequencing, I'm ok
> with adding the patch on top if you are because that preserves the testing
> history. If you're unhappy, I can shuffle it into a better place and resend
> the full series that includes all the fixes so far.

We'll survive. Let's only do the reorganization if the patches need rework
for other reasons.

2015-05-13 15:53:32

by Nathan Zimmer

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

I just noticed a hang on my largest box.
I can only reproduce it with large core counts; if I turn down the number
of cpus it doesn't have an issue.

Also, as time goes on, the amount of time required to initialize pages
goes up.


log_uv48_05121052:[ 177.250385] node 0 initialised, 14950072 pages in 544ms
log_uv48_05121052:[ 177.269629] node 1 initialised, 15990505 pages in 564ms
log_uv48_05121052:[ 177.436047] node 215 initialised, 3600110 pages in 724ms
log_uv48_05121052:[ 177.464056] node 102 initialised, 3604205 pages in 756ms
log_uv48_05121052:[ 178.073822] node 30 initialised, 7732972 pages in 1368ms
log_uv48_05121052:[ 178.082888] node 31 initialised, 7728877 pages in 1372ms
log_uv48_05121052:[ 178.080060] node 29 initialised, 7728877 pages in 1376ms
....
log_uv48_05121052:[ 178.217980] node 197 initialised, 7728877 pages in 1504ms
log_uv48_05121052:[ 178.217851] node 196 initialised, 7732972 pages in 1504ms
log_uv48_05121052:[ 178.219992] node 247 initialised, 7726418 pages in 1504ms
log_uv48_05121052:[ 178.325299] node 3 initialised, 15986409 pages in 1624ms
log_uv48_05121052:[ 178.328455] node 2 initialised, 15990505 pages in 1624ms
log_uv48_05121052:[ 178.383371] node 4 initialised, 15990505 pages in 1680ms
...
log_uv48_05121052:[ 178.438401] node 19 initialised, 15986409 pages in 1728ms

I apologize for the tardiness of this report but I have not been able to
get to the largest boxes reliably.
Hopefully I will have more access this week.


On 05/07/2015 05:09 PM, Andrew Morton wrote:
> On Thu, 7 May 2015 08:25:18 +0100 Mel Gorman <[email protected]> wrote:
>
>> Waiman Long reported that 24TB machines hit OOM during basic setup when
>> struct page initialisation was deferred. One approach is to initialise memory
>> on demand but it interferes with page allocator paths. This patch creates
>> dedicated threads to initialise memory before basic setup. It then blocks
>> on a rw_semaphore until completion as a wait_queue and counter is overkill.
>> This may be slower to boot but it's simpler overall and also gets rid of a
>> section mangling which existed so kswapd could do the initialisation.
> Seems a reasonable compromise. It makes a bit of a mess of the patch
> sequencing.
>
> Have some tweaklets:
>
>
>
> From: Andrew Morton <[email protected]>
> Subject: mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup-fix
>
> include rwsem.h, use DECLARE_RWSEM, fix comment, remove unneeded cast
>
> Cc: Daniel J Blueman <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Nathan Zimmer <[email protected]>
> Cc: Scott Norton <[email protected]>
> Cc: Waiman Long <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> mm/page_alloc.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff -puN mm/page_alloc.c~mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup-fix mm/page_alloc.c
> --- a/mm/page_alloc.c~mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup-fix
> +++ a/mm/page_alloc.c
> @@ -18,6 +18,7 @@
> #include <linux/mm.h>
> #include <linux/swap.h>
> #include <linux/interrupt.h>
> +#include <linux/rwsem.h>
> #include <linux/pagemap.h>
> #include <linux/jiffies.h>
> #include <linux/bootmem.h>
> @@ -1075,12 +1076,12 @@ static void __init deferred_free_range(s
> __free_pages_boot_core(page, pfn, 0);
> }
>
> -static struct rw_semaphore __initdata pgdat_init_rwsem;
> +static __initdata DECLARE_RWSEM(pgdat_init_rwsem);
>
> /* Initialise remaining memory on a node */
> static int __init deferred_init_memmap(void *data)
> {
> - pg_data_t *pgdat = (pg_data_t *)data;
> + pg_data_t *pgdat = data;
> int nid = pgdat->node_id;
> struct mminit_pfnnid_cache nid_init_state = { };
> unsigned long start = jiffies;
> @@ -1096,7 +1097,7 @@ static int __init deferred_init_memmap(v
> return 0;
> }
>
> - /* Bound memory initialisation to a local node if possible */
> + /* Bind memory initialisation thread to a local node if possible */
> if (!cpumask_empty(cpumask))
> set_cpus_allowed_ptr(current, cpumask);
>
> @@ -1200,7 +1201,6 @@ void __init page_alloc_init_late(void)
> {
> int nid;
>
> - init_rwsem(&pgdat_init_rwsem);
> for_each_node_state(nid, N_MEMORY) {
> down_read(&pgdat_init_rwsem);
> kthread_run(deferred_init_memmap, NODE_DATA(nid), "pgdatinit%d", nid);
> _
>
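
(As a minimal sketch of the completion pattern the quoted patch relies
on — simplified, and the waiting side is an assumption since the quoted
hunks do not show it: the boot thread takes one read hold on the rwsem
per node before spawning that node's thread, each thread drops its hold
when done, and a writer can only get in once all readers are gone, so a
down_write/up_write pair doubles as a wait-for-all barrier:)

static __initdata DECLARE_RWSEM(pgdat_init_rwsem);

static int __init deferred_init_memmap(void *data)
{
        pg_data_t *pgdat = data;

        /* ... initialise this node's remaining struct pages ... */

        up_read(&pgdat_init_rwsem);     /* signal this node is done */
        return 0;
}

void __init page_alloc_init_late(void)
{
        int nid;

        for_each_node_state(nid, N_MEMORY) {
                down_read(&pgdat_init_rwsem);   /* one hold per node */
                kthread_run(deferred_init_memmap, NODE_DATA(nid),
                            "pgdatinit%d", nid);
        }

        /* Blocks until every per-node thread has called up_read(). */
        down_write(&pgdat_init_rwsem);
        up_write(&pgdat_init_rwsem);
}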

2015-05-13 16:32:23

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

On Wed, May 13, 2015 at 10:53:33AM -0500, nzimmer wrote:
> I just noticed a hang on my largest box.
> I can only reproduce it with large core counts; if I turn down the
> number of cpus it doesn't have an issue.
>

Odd. The number of cores should make little difference as only
one CPU per node should be in use. Does sysrq+t give any indication how
or where it is hanging?

--
Mel Gorman
SUSE Labs

2015-05-14 10:03:19

by Daniel J Blueman

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

On Thu, May 14, 2015 at 12:31 AM, Mel Gorman <[email protected]> wrote:
> On Wed, May 13, 2015 at 10:53:33AM -0500, nzimmer wrote:
>> I just noticed a hang on my largest box.
>> I can only reproduce it with large core counts; if I turn down the
>> number of cpus it doesn't have an issue.
>>
>
> Odd. The number of cores should make little difference as only
> one CPU per node should be in use. Does sysrq+t give any indication how
> or where it is hanging?

I was seeing the same behaviour of 1000ms increasing to 5500ms [1];
this suggests either lock contention or O(n) behaviour.

Nathan, can you check with this ordering of patches from Andrew's cache
[2]? I was getting hangs until I found them all.

I'll follow up with timing data.

Thanks,
Daniel

-- [1]

[ 73.076117] node 2 initialised, 7732961 pages in 1060ms
[ 73.077184] node 38 initialised, 7732961 pages in 1060ms
[ 73.079626] node 146 initialised, 7732961 pages in 1050ms
[ 73.093488] node 62 initialised, 7732961 pages in 1080ms
[ 73.091557] node 3 initialised, 7732962 pages in 1080ms
[ 73.100000] node 186 initialised, 7732961 pages in 1040ms
[ 73.095731] node 4 initialised, 7732961 pages in 1080ms
[ 73.090289] node 50 initialised, 7732961 pages in 1080ms
[ 73.094005] node 158 initialised, 7732961 pages in 1050ms
[ 73.095421] node 159 initialised, 7732962 pages in 1050ms
[ 73.090324] node 52 initialised, 7732961 pages in 1080ms
[ 73.099056] node 5 initialised, 7732962 pages in 1080ms
[ 73.090116] node 160 initialised, 7732961 pages in 1050ms
[ 73.161051] node 157 initialised, 7732962 pages in 1120ms
[ 73.193565] node 161 initialised, 7732962 pages in 1160ms
[ 73.212456] node 26 initialised, 7732961 pages in 1200ms
[ 73.222904] node 0 initialised, 6686488 pages in 1210ms
[ 73.242165] node 140 initialised, 7732961 pages in 1210ms
[ 73.254230] node 156 initialised, 7732961 pages in 1220ms
[ 73.284634] node 1 initialised, 7732962 pages in 1270ms
[ 73.305301] node 141 initialised, 7732962 pages in 1280ms
[ 73.322845] node 28 initialised, 7732961 pages in 1310ms
[ 73.321757] node 142 initialised, 7732961 pages in 1290ms
[ 73.327677] node 138 initialised, 7732961 pages in 1300ms
[ 73.413597] node 176 initialised, 7732961 pages in 1370ms
[ 73.455552] node 139 initialised, 7732962 pages in 1420ms
[ 73.475356] node 143 initialised, 7732962 pages in 1440ms
[ 73.547202] node 32 initialised, 7732961 pages in 1530ms
[ 73.579591] node 104 initialised, 7732961 pages in 1560ms
[ 73.618065] node 174 initialised, 7732961 pages in 1570ms
[ 73.624918] node 178 initialised, 7732961 pages in 1580ms
[ 73.649024] node 175 initialised, 7732962 pages in 1610ms
[ 73.654110] node 105 initialised, 7732962 pages in 1630ms
[ 73.670589] node 106 initialised, 7732961 pages in 1650ms
[ 73.739682] node 102 initialised, 7732961 pages in 1720ms
[ 73.769639] node 86 initialised, 7732961 pages in 1750ms
[ 73.775573] node 44 initialised, 7732961 pages in 1760ms
[ 73.772955] node 177 initialised, 7732962 pages in 1740ms
[ 73.804390] node 34 initialised, 7732961 pages in 1790ms
[ 73.819370] node 30 initialised, 7732961 pages in 1810ms
[ 73.847882] node 98 initialised, 7732961 pages in 1830ms
[ 73.867545] node 33 initialised, 7732962 pages in 1860ms
[ 73.877964] node 107 initialised, 7732962 pages in 1860ms
[ 73.906256] node 103 initialised, 7732962 pages in 1880ms
[ 73.945581] node 100 initialised, 7732961 pages in 1930ms
[ 73.947024] node 96 initialised, 7732961 pages in 1930ms
[ 74.186208] node 116 initialised, 7732961 pages in 2170ms
[ 74.220838] node 68 initialised, 7732961 pages in 2210ms
[ 74.252341] node 46 initialised, 7732961 pages in 2240ms
[ 74.274795] node 118 initialised, 7732961 pages in 2260ms
[ 74.337544] node 14 initialised, 7732961 pages in 2320ms
[ 74.350819] node 22 initialised, 7732961 pages in 2340ms
[ 74.350332] node 69 initialised, 7732962 pages in 2340ms
[ 74.362683] node 211 initialised, 7732962 pages in 2310ms
[ 74.360617] node 70 initialised, 7732961 pages in 2340ms
[ 74.369137] node 66 initialised, 7732961 pages in 2360ms
[ 74.378242] node 115 initialised, 7732962 pages in 2360ms
[ 74.404221] node 213 initialised, 7732962 pages in 2350ms
[ 74.420901] node 210 initialised, 7732961 pages in 2370ms
[ 74.430049] node 35 initialised, 7732962 pages in 2420ms
[ 74.436007] node 48 initialised, 7732961 pages in 2420ms
[ 74.480595] node 71 initialised, 7732962 pages in 2460ms
[ 74.485700] node 67 initialised, 7732962 pages in 2480ms
[ 74.502627] node 31 initialised, 7732962 pages in 2490ms
[ 74.542220] node 16 initialised, 7732961 pages in 2530ms
[ 74.547936] node 128 initialised, 7732961 pages in 2520ms
[ 74.634374] node 214 initialised, 7732961 pages in 2580ms
[ 74.654389] node 88 initialised, 7732961 pages in 2630ms
[ 74.722833] node 117 initialised, 7732962 pages in 2700ms
[ 74.735002] node 148 initialised, 7732961 pages in 2700ms
[ 74.742725] node 12 initialised, 7732961 pages in 2730ms
[ 74.749319] node 194 initialised, 7732961 pages in 2700ms
[ 74.767979] node 24 initialised, 7732961 pages in 2750ms
[ 74.769465] node 114 initialised, 7732961 pages in 2750ms
[ 74.796973] node 134 initialised, 7732961 pages in 2770ms
[ 74.818164] node 15 initialised, 7732962 pages in 2810ms
[ 74.844852] node 18 initialised, 7732961 pages in 2830ms
[ 74.866123] node 110 initialised, 7732961 pages in 2850ms
[ 74.898255] node 215 initialised, 7730688 pages in 2840ms
[ 74.903623] node 136 initialised, 7732961 pages in 2880ms
[ 74.911107] node 144 initialised, 7732961 pages in 2890ms
[ 74.918757] node 212 initialised, 7732961 pages in 2870ms
[ 74.935333] node 182 initialised, 7732961 pages in 2880ms
[ 74.958147] node 42 initialised, 7732961 pages in 2950ms
[ 74.964989] node 108 initialised, 7732961 pages in 2950ms
[ 74.965482] node 112 initialised, 7732961 pages in 2950ms
[ 75.034787] node 184 initialised, 7732961 pages in 2980ms
[ 75.051242] node 45 initialised, 7732962 pages in 3040ms
[ 75.047169] node 152 initialised, 7732961 pages in 3020ms
[ 75.062834] node 179 initialised, 7732962 pages in 3010ms
[ 75.076528] node 145 initialised, 7732962 pages in 3040ms
[ 75.076613] node 25 initialised, 7732962 pages in 3070ms
[ 75.073086] node 164 initialised, 7732961 pages in 3040ms
[ 75.079674] node 149 initialised, 7732962 pages in 3050ms
[ 75.092015] node 113 initialised, 7732962 pages in 3070ms
[ 75.096325] node 80 initialised, 7732961 pages in 3080ms
[ 75.131380] node 92 initialised, 7732961 pages in 3110ms
[ 75.142147] node 10 initialised, 7732961 pages in 3130ms
[ 75.151041] node 51 initialised, 7732962 pages in 3140ms
[ 75.159074] node 130 initialised, 7732961 pages in 3130ms
[ 75.162616] node 166 initialised, 7732961 pages in 3130ms
[ 75.193557] node 82 initialised, 7732961 pages in 3170ms
[ 75.254801] node 84 initialised, 7732961 pages in 3240ms
[ 75.303028] node 64 initialised, 7732961 pages in 3290ms
[ 75.299739] node 49 initialised, 7732962 pages in 3290ms
[ 75.314231] node 21 initialised, 7732962 pages in 3300ms
[ 75.371298] node 53 initialised, 7732962 pages in 3360ms
[ 75.394569] node 95 initialised, 7732962 pages in 3380ms
[ 75.441101] node 23 initialised, 7732962 pages in 3430ms
[ 75.433080] node 19 initialised, 7732962 pages in 3430ms
[ 75.446076] node 173 initialised, 7732962 pages in 3410ms
[ 75.445816] node 99 initialised, 7732962 pages in 3430ms
[ 75.470330] node 87 initialised, 7732962 pages in 3450ms
[ 75.502334] node 8 initialised, 7732961 pages in 3490ms
[ 75.508300] node 206 initialised, 7732961 pages in 3460ms
[ 75.540253] node 132 initialised, 7732961 pages in 3510ms
[ 75.615453] node 183 initialised, 7732962 pages in 3560ms
[ 75.632576] node 78 initialised, 7732961 pages in 3610ms
[ 75.647753] node 85 initialised, 7732962 pages in 3620ms
[ 75.688955] node 90 initialised, 7732961 pages in 3670ms
[ 75.694522] node 200 initialised, 7732961 pages in 3640ms
[ 75.688790] node 43 initialised, 7732962 pages in 3680ms
[ 75.694540] node 94 initialised, 7732961 pages in 3680ms
[ 75.697149] node 29 initialised, 7732962 pages in 3690ms
[ 75.693590] node 111 initialised, 7732962 pages in 3680ms
[ 75.715829] node 56 initialised, 7732961 pages in 3700ms
[ 75.718427] node 97 initialised, 7732962 pages in 3700ms
[ 75.741643] node 147 initialised, 7732962 pages in 3710ms
[ 75.773613] node 170 initialised, 7732961 pages in 3740ms
[ 75.802874] node 208 initialised, 7732961 pages in 3750ms
[ 75.804409] node 58 initialised, 7732961 pages in 3790ms
[ 75.853438] node 126 initialised, 7732961 pages in 3830ms
[ 75.888167] node 167 initialised, 7732962 pages in 3850ms
[ 75.912656] node 172 initialised, 7732961 pages in 3870ms
[ 75.956540] node 93 initialised, 7732962 pages in 3940ms
[ 75.988819] node 127 initialised, 7732962 pages in 3960ms
[ 76.062198] node 201 initialised, 7732962 pages in 4010ms
[ 76.091769] node 47 initialised, 7732962 pages in 4080ms
[ 76.119749] node 162 initialised, 7732961 pages in 4080ms
[ 76.122797] node 6 initialised, 7732961 pages in 4110ms
[ 76.225916] node 153 initialised, 7732962 pages in 4190ms
[ 76.219855] node 81 initialised, 7732962 pages in 4200ms
[ 76.236116] node 150 initialised, 7732961 pages in 4210ms
[ 76.245349] node 180 initialised, 7732961 pages in 4190ms
[ 76.248827] node 17 initialised, 7732962 pages in 4240ms
[ 76.258801] node 13 initialised, 7732962 pages in 4250ms
[ 76.259943] node 122 initialised, 7732961 pages in 4240ms
[ 76.277480] node 196 initialised, 7732961 pages in 4230ms
[ 76.320830] node 41 initialised, 7732962 pages in 4310ms
[ 76.351667] node 129 initialised, 7732962 pages in 4320ms
[ 76.353488] node 202 initialised, 7732961 pages in 4310ms
[ 76.376753] node 165 initialised, 7732962 pages in 4340ms
[ 76.381807] node 124 initialised, 7732961 pages in 4350ms
[ 76.419952] node 171 initialised, 7732962 pages in 4380ms
[ 76.431242] node 168 initialised, 7732961 pages in 4390ms
[ 76.441324] node 89 initialised, 7732962 pages in 4420ms
[ 76.440720] node 155 initialised, 7732962 pages in 4400ms
[ 76.459715] node 120 initialised, 7732961 pages in 4440ms
[ 76.483986] node 205 initialised, 7732962 pages in 4430ms
[ 76.493284] node 151 initialised, 7732962 pages in 4460ms
[ 76.491437] node 60 initialised, 7732961 pages in 4480ms
[ 76.526620] node 74 initialised, 7732961 pages in 4510ms
[ 76.543761] node 131 initialised, 7732962 pages in 4510ms
[ 76.549562] node 39 initialised, 7732962 pages in 4540ms
[ 76.563861] node 11 initialised, 7732962 pages in 4550ms
[ 76.598775] node 54 initialised, 7732961 pages in 4590ms
[ 76.602006] node 123 initialised, 7732962 pages in 4570ms
[ 76.619856] node 76 initialised, 7732961 pages in 4600ms
[ 76.631418] node 198 initialised, 7732961 pages in 4580ms
[ 76.665415] node 188 initialised, 7732961 pages in 4610ms
[ 76.669178] node 63 initialised, 7732962 pages in 4660ms
[ 76.683646] node 101 initialised, 7732962 pages in 4670ms
[ 76.710780] node 192 initialised, 7732961 pages in 4660ms
[ 76.736743] node 121 initialised, 7732962 pages in 4720ms
[ 76.743800] node 199 initialised, 7732962 pages in 4700ms
[ 76.750663] node 20 initialised, 7732961 pages in 4740ms
[ 76.763045] node 135 initialised, 7732962 pages in 4730ms
[ 76.768216] node 137 initialised, 7732962 pages in 4740ms
[ 76.800135] node 181 initialised, 7732962 pages in 4750ms
[ 76.811215] node 27 initialised, 7732962 pages in 4800ms
[ 76.857405] node 125 initialised, 7732962 pages in 4820ms
[ 76.853750] node 163 initialised, 7732962 pages in 4820ms
[ 76.882975] node 59 initialised, 7732962 pages in 4870ms
[ 76.920121] node 9 initialised, 7732962 pages in 4910ms
[ 76.934824] node 189 initialised, 7732962 pages in 4880ms
[ 76.951223] node 154 initialised, 7732961 pages in 4920ms
[ 76.953897] node 203 initialised, 7732962 pages in 4900ms
[ 76.952558] node 75 initialised, 7732962 pages in 4930ms
[ 76.985480] node 119 initialised, 7732962 pages in 4970ms
[ 77.036089] node 195 initialised, 7732962 pages in 4980ms
[ 77.039996] node 55 initialised, 7732962 pages in 5030ms
[ 77.067989] node 109 initialised, 7732962 pages in 5040ms
[ 77.066236] node 7 initialised, 7732962 pages in 5060ms
[ 77.068709] node 65 initialised, 7732962 pages in 5060ms
[ 77.097859] node 79 initialised, 7732962 pages in 5080ms
[ 77.096219] node 169 initialised, 7732962 pages in 5060ms
[ 77.125113] node 83 initialised, 7732962 pages in 5110ms
[ 77.139507] node 37 initialised, 7732962 pages in 5130ms
[ 77.143280] node 77 initialised, 7732962 pages in 5120ms
[ 77.226494] node 73 initialised, 7732962 pages in 5200ms
[ 77.281584] node 190 initialised, 7732961 pages in 5230ms
[ 77.314794] node 204 initialised, 7732961 pages in 5260ms
[ 77.328577] node 72 initialised, 7732961 pages in 5310ms
[ 77.335743] node 36 initialised, 7732961 pages in 5320ms
[ 77.360573] node 40 initialised, 7732961 pages in 5350ms
[ 77.368712] node 207 initialised, 7732962 pages in 5320ms
[ 77.387708] node 91 initialised, 7732962 pages in 5370ms
[ 77.385143] node 57 initialised, 7732962 pages in 5380ms
[ 77.391785] node 191 initialised, 7732962 pages in 5340ms
[ 77.479970] node 185 initialised, 7732962 pages in 5430ms
[ 77.491865] node 61 initialised, 7732962 pages in 5480ms
[ 77.489255] node 133 initialised, 7732962 pages in 5460ms
[ 77.502111] node 197 initialised, 7732962 pages in 5450ms
[ 77.507136] node 193 initialised, 7732962 pages in 5460ms
[ 77.523739] node 209 initialised, 7732962 pages in 5470ms
[ 77.537131] node 187 initialised, 7732962 pages in 5490ms

-- [2]

http://ozlabs.org/~akpm/mmots/broken-out/memblock-introduce-a-for_each_reserved_mem_region-iterator.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-move-page-initialization-into-a-separate-function.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-only-set-page-reserved-in-the-memblock-region.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-page_alloc-pass-pfn-to-__free_pages_bootmem.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-make-__early_pfn_to_nid-smp-safe-and-introduce-meminit_pfn_in_nid.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-inline-some-helper-functions.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-inline-some-helper-functions-fix.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-initialise-a-subset-of-struct-pages-if-config_deferred_struct_page_init-is-set.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-initialise-remaining-struct-pages-in-parallel-with-kswapd.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-minimise-number-of-pfn-page-lookups-during-initialisation.patch
http://ozlabs.org/~akpm/mmots/broken-out/x86-mm-enable-deferred-struct-page-initialisation-on-x86-64.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-free-pages-in-large-chunks-where-possible.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-reduce-number-of-times-pageblocks-are-set-during-struct-page-init.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-remove-mminit_verify_page_links.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-initialise-a-subset-of-struct-pages-if-config_deferred_struct_page_init-is-set-fix.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup-fix.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-reduce-number-of-times-pageblocks-are-set-during-struct-page-init-fix.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-inline-some-helper-functions-fix2.patch

2015-05-14 15:57:22

by Nathan Zimmer

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

Well, I did get in some tests yesterday afternoon, and with some simple
timers found that occasionally a huge amount of time is spent in this
snippet at the top of
static int __init deferred_init_memmap(void *data):

        /* Bind memory initialisation thread to a local node if possible */
        if (!cpumask_empty(cpumask))
                set_cpus_allowed_ptr(current, cpumask);

I am assuming it is getting caught up in set_cpus_allowed_ptr(), not
the cpumask_empty() check.
I have more machine time today and will make sure I have all those patches.
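
(For illustration only: a hypothetical version of the simple timing
described above, reusing the jiffies-based measurement the init code
already does elsewhere. The printk text is invented; nid and cpumask
are the locals already present in deferred_init_memmap().)

        unsigned long t0 = jiffies;

        /* Bind memory initialisation thread to a local node if possible */
        if (!cpumask_empty(cpumask))
                set_cpus_allowed_ptr(current, cpumask);

        pr_info("node %d: set_cpus_allowed_ptr() took %ums\n",
                nid, jiffies_to_msecs(jiffies - t0));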

On 05/14/2015 05:03 AM, Daniel J Blueman wrote:
> On Thu, May 14, 2015 at 12:31 AM, Mel Gorman <[email protected]> wrote:
>> On Wed, May 13, 2015 at 10:53:33AM -0500, nzimmer wrote:
>>> I just noticed a hang on my largest box.
>>> I can only reproduce it with large core counts; if I turn down the
>>> number of cpus it doesn't have an issue.
>>>
>>
>> Odd. The number of cores should make little difference as only
>> one CPU per node should be in use. Does sysrq+t give any indication how
>> or where it is hanging?
>
> I was seeing the same behaviour of 1000ms increasing to 5500ms [1];
> this suggests either lock contention or O(n) behaviour.
>
> Nathan, can you check with this ordering of patches from Andrew's
> cache [2]? I was getting hangs until I found them all.
>
> I'll follow up with timing data.
>
> Thanks,
> Daniel
>
> [Daniel's full per-node timing log [1] and patch list [2] snipped;
> they are quoted in full in his message above.]

2015-05-19 18:31:34

by Nathan Zimmer

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

After double-checking the patches it seems everything is ok.

I had to rerun quite a bit since the machine was reconfigured and I
wanted to be thorough.
My latest timings are quite close to my previously reported numbers.

The hang issue I encountered turned out to be unrelated to these patches,
so that is a separate bundle of fun.



On 05/14/2015 05:03 AM, Daniel J Blueman wrote:
> On Thu, May 14, 2015 at 12:31 AM, Mel Gorman <[email protected]> wrote:
>> On Wed, May 13, 2015 at 10:53:33AM -0500, nzimmer wrote:
>>> I just noticed a hang on my largest box.
>>> I can only reproduce it with large core counts; if I turn down the
>>> number of cpus it doesn't have an issue.
>>>
>>
>> Odd. The number of cores should make little difference as only
>> one CPU per node should be in use. Does sysrq+t give any indication how
>> or where it is hanging?
>
> I was seeing the same behaviour of 1000ms increasing to 5500ms [1];
> this suggests either lock contention or O(n) behaviour.
>
> Nathan, can you check with this ordering of patches from Andrew's
> cache [2]? I was getting hangs until I found them all.
>
> I'll follow up with timing data.
>
> Thanks,
> Daniel
>
> [Daniel's full per-node timing log [1] and patch list [2] snipped;
> they are quoted in full in his message above.]

2015-05-19 19:06:39

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

On Tue, May 19, 2015 at 01:31:28PM -0500, nzimmer wrote:
> After double-checking the patches it seems everything is ok.
>
> I had to rerun quite a bit since the machine was reconfigured and I
> wanted to be thorough.
> My latest timings are quite close to my previously reported numbers.
>
> The hang issue I encountered turned out to be unrelated to these
> patches, so that is a separate bundle of fun.
>

Ok, sorry to hear about the hanging but I'm glad to hear the patches are
not responsible. Thanks for testing and getting back.

--
Mel Gorman
SUSE Labs

2015-05-22 06:30:18

by Daniel J Blueman

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

On Thu, May 14, 2015 at 6:03 PM, Daniel J Blueman
<[email protected]> wrote:
> On Thu, May 14, 2015 at 12:31 AM, Mel Gorman <[email protected]> wrote:
>> On Wed, May 13, 2015 at 10:53:33AM -0500, nzimmer wrote:
>>> I just noticed a hang on my largest box.
>>> I can only reproduce it with large core counts; if I turn down the
>>> number of cpus it doesn't have an issue.
>>>
>>
>> Odd. The number of cores should make little difference as only
>> one CPU per node should be in use. Does sysrq+t give any indication how
>> or where it is hanging?
>
> I was seeing the same behaviour of 1000ms increasing to 5500ms [1];
> this suggests either lock contention or O(n) behaviour.
>
> Nathan, can you check with this ordering of patches from Andrew's
> cache [2]? I was getting hangs until I found them all.
>
> I'll follow up with timing data.

7TB over 216 NUMA nodes, 1728 cores, from kernel 4.0.4 load to login:

1. 2086s with patches 01-19 [1]

2. 2026s adding "Take into account that large system caches scale
linearly with memory", which has:
min(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));

3. 2442s fixing to:
max(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));

4. 2064s adjusting minimum and shift to:
max(512UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));

5. 1934s adjusting minimum and shift to:
max(128UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));

6. 930s #5 with the non-temporal PMD init patch I had earlier proposed
(I'll pursue separately)
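
(To make the variants above concrete: each expression chooses how many
pages a node initialises up front before the rest is deferred — a fixed
amount versus a fraction of the node's spanned pages. A hypothetical
helper for variant #5, assuming 4KiB pages so 128UL << (20 - PAGE_SHIFT)
is 128MiB worth of pages; variant #2's min() capped the early pages at
2GiB, which variant #3 flipped into a floor with max():)

static unsigned long early_pages_for_node(pg_data_t *pgdat)
{
        /* At least 128MiB worth of pages, or 1/256th of the node's
         * spanned pages, whichever is larger. */
        return max(128UL << (20 - PAGE_SHIFT),
                   pgdat->node_spanned_pages >> 8);
}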

The scaling patch isn't in -mm. #5 tests out nicely on a bunch of other
AMD systems, 64GB and up, so: Tested-by: Daniel J Blueman
<[email protected]>.

Fine work, Mel!

Daniel

-- [1]

> http://ozlabs.org/~akpm/mmots/broken-out/memblock-introduce-a-for_each_reserved_mem_region-iterator.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-move-page-initialization-into-a-separate-function.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-only-set-page-reserved-in-the-memblock-region.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-page_alloc-pass-pfn-to-__free_pages_bootmem.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-make-__early_pfn_to_nid-smp-safe-and-introduce-meminit_pfn_in_nid.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-inline-some-helper-functions.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-inline-some-helper-functions-fix.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-initialise-a-subset-of-struct-pages-if-config_deferred_struct_page_init-is-set.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-initialise-remaining-struct-pages-in-parallel-with-kswapd.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-minimise-number-of-pfn-page-lookups-during-initialisation.patch
> http://ozlabs.org/~akpm/mmots/broken-out/x86-mm-enable-deferred-struct-page-initialisation-on-x86-64.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-free-pages-in-large-chunks-where-possible.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-reduce-number-of-times-pageblocks-are-set-during-struct-page-init.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-remove-mminit_verify_page_links.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-initialise-a-subset-of-struct-pages-if-config_deferred_struct_page_init-is-set-fix.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup-fix.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-reduce-number-of-times-pageblocks-are-set-during-struct-page-init-fix.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-inline-some-helper-functions-fix2.patch

2015-05-22 09:33:21

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

On Fri, May 22, 2015 at 02:30:01PM +0800, Daniel J Blueman wrote:
> On Thu, May 14, 2015 at 6:03 PM, Daniel J Blueman
> <[email protected]> wrote:
> >On Thu, May 14, 2015 at 12:31 AM, Mel Gorman <[email protected]> wrote:
> >>On Wed, May 13, 2015 at 10:53:33AM -0500, nzimmer wrote:
> >>> I just noticed a hang on my largest box.
> >>> I can only reproduce it with large core counts; if I turn down the
> >>> number of cpus it doesn't have an issue.
> >>>
> >>
> >>Odd. The core count should make little difference
> >>as only
> >>one CPU per node should be in use. Does sysrq+t give any
> >>indication how
> >>or where it is hanging?
> >
> >I was seeing the same behaviour of 1000ms increasing to 5500ms
> >[1]; this suggests either lock contention or O(n) behaviour.
> >
> >Nathan, can you check with this ordering of patches from Andrew's
> >cache [2]? I was getting hangs until I found them all.
> >
> >I'll follow up with timing data.
>
> 7TB over 216 NUMA nodes, 1728 cores, from kernel 4.0.4 load to login:
>
> 1. 2086s with patches 01-19 [1]
>
> 2. 2026s adding "Take into account that large system caches scale
> linearly with memory", which has:
> min(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));
>
> 3. 2442s fixing to:
> max(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));
>
> 4. 2064s adjusting minimum and shift to:
> max(512UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));
>
> 5. 1934s adjusting minimum and shift to:
> max(128UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));
>
> 6. 930s #5 with the non-temporal PMD init patch I had earlier
> proposed (I'll pursue separately)
>
> The scaling patch isn't in -mm.

That patch was superseded by "mm: meminit: finish
initialisation of struct pages before basic setup" and
"mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup-fix"
so that's ok.

FWIW, I think you should still go ahead with the non-temporal patches because
there is potential benefit there beyond the initialisation. If there
were an arch-optional implementation of a non-temporal clear then it would
also be worth considering whether __GFP_ZERO should use non-temporal stores.
At a greater stretch it would be worth considering whether kswapd freeing
should zero pages to avoid a zero on the allocation side in the general
case, as that would be more generally useful and a stepping stone towards
what the series "Sanitizing freed pages" attempts.

> #5 tests out nicely on a bunch of
> other AMD systems, 64GB and up, so: Tested-by: Daniel J Blueman
> <[email protected]>.
>

Thanks very much Daniel, much appreciated.

--
Mel Gorman
SUSE Labs

2015-05-22 17:14:49

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

On 05/22/2015 05:33 AM, Mel Gorman wrote:
> On Fri, May 22, 2015 at 02:30:01PM +0800, Daniel J Blueman wrote:
>> On Thu, May 14, 2015 at 6:03 PM, Daniel J Blueman
>> <[email protected]> wrote:
>>> On Thu, May 14, 2015 at 12:31 AM, Mel Gorman<[email protected]> wrote:
>>>> On Wed, May 13, 2015 at 10:53:33AM -0500, nzimmer wrote:
>>>>> I just noticed a hang on my largest box.
>>>>> I can only reproduce it with large core counts; if I turn down the
>>>>> number of cpus it doesn't have an issue.
>>>>>
>>>> Odd. The core count should make little difference
>>>> as only
>>>> one CPU per node should be in use. Does sysrq+t give any
>>>> indication how
>>>> or where it is hanging?
>>> I was seeing the same behaviour of 1000ms increasing to 5500ms
>>> [1]; this suggests either lock contention or O(n) behaviour.
>>>
>>> Nathan, can you check with this ordering of patches from Andrew's
>>> cache [2]? I was getting hangs until I found them all.
>>>
>>> I'll follow up with timing data.
>> 7TB over 216 NUMA nodes, 1728 cores, from kernel 4.0.4 load to login:
>>
>> 1. 2086s with patches 01-19 [1]
>>
>> 2. 2026s adding "Take into account that large system caches scale
>> linearly with memory", which has:
>> min(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));
>>
>> 3. 2442s fixing to:
>> max(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));
>>
>> 4. 2064s adjusting minimum and shift to:
>> max(512UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));
>>
>> 5. 1934s adjusting minimum and shift to:
>> max(128UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));
>>
>> 6. 930s #5 with the non-temporal PMD init patch I had earlier
>> proposed (I'll pursue separately)
>>
>> The scaling patch isn't in -mm.
> That patch was superseded by "mm: meminit: finish
> initialisation of struct pages before basic setup" and
> "mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup-fix"
> so that's ok.
>
> FWIW, I think you should still go ahead with the non-temporal patches because
> there is potential benefit there other than the initialisation. If there
> was an arch-optional implementation of a non-temporal clear then it would
> also be worth considering if __GFP_ZERO should use non-temporal stores.
> At a greater stretch it would be worth considering if kswapd freeing should
> zero pages to avoid a zero on the allocation side in the general case as
> it would be more generally useful and a stepping stone towards what the
> series "Sanitizing freed pages" attempts.

I think the non-temporal patch mainly benefits AMD systems. I have tried
the patch on DragonHawk and it actually made it boot up a little
bit slower. I think the Intel-optimized "rep stosb" instruction (used in
memset) is performing well. I had done a similar test on zero-page code
and the performance gain was inconclusive.

Cheers,
Longman

2015-05-22 20:31:58

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH 03/13] mm: meminit: Only set page reserved in the memblock region

On Tue, Apr 28, 2015 at 7:37 AM, Mel Gorman <[email protected]> wrote:
> Currently each page struct is set as reserved upon initialization.
> This patch leaves the reserved bit clear and only sets the reserved bit
> when it is known the memory was allocated by the bootmem allocator. This
> makes it easier to distinguish between uninitialised struct pages and
> reserved struct pages in later patches.

On ia64 my linux-next builds now report a bunch of messages like this:

put_kernel_page: page at 0xe000000005588000 not in reserved memory
put_kernel_page: page at 0xe000000005588000 not in reserved memory
put_kernel_page: page at 0xe000000005580000 not in reserved memory
put_kernel_page: page at 0xe000000005580000 not in reserved memory
put_kernel_page: page at 0xe000000005580000 not in reserved memory
put_kernel_page: page at 0xe000000005580000 not in reserved memory

the two different pages match up with two objects from the loaded kernel
that get mapped by arch/ia64/mm/init.c:setup_gate()

a000000101588000 D __start_gate_section
a000000101580000 D empty_zero_page

Should I look for a place to set the reserved bit on page structures for these
addresses? Or just remove the test and message in put_kernel_page()
[I added a debug "else" clause here - every caller passes in a page that is
not reserved]

if (!PageReserved(page))
	printk(KERN_ERR "put_kernel_page: page at 0x%p not in reserved memory\n",
	       page_address(page));
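
For illustration, the first option could look roughly like the following
(a sketch only, reusing the virt_to_page(ia64_imva(...)) conversions that
setup_gate() already performs for these objects):

	SetPageReserved(virt_to_page(ia64_imva(__start_gate_section)));
	SetPageReserved(virt_to_page(ia64_imva(empty_zero_page)));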

-Tony

2015-05-22 21:44:08

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

On Fri, 2015-05-22 at 13:14 -0400, Waiman Long wrote:
> I think the non-temporal patch mainly benefits AMD systems. I have tried
> the patch on DragonHawk and it actually made it boot up a little
> bit slower. I think the Intel-optimized "rep stosb" instruction (used in
> memset) is performing well. I had done a similar test on zero-page code
> and the performance gain was inconclusive.

FWIW, I did some experiments with similar conclusions a while ago
(inconclusive on Intel hw; maybe it was even the same machine ;)
Now, this was for optimizing clear_hugepage by using movnti, but I never
got to run it on an AMD box.

Thanks,
Davidlohr

2015-05-23 03:49:54

by Daniel J Blueman

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup



--
Daniel J Blueman
Principal Software Engineer, Numascale

On Sat, May 23, 2015 at 1:14 AM, Waiman Long <[email protected]> wrote:
> On 05/22/2015 05:33 AM, Mel Gorman wrote:
>> On Fri, May 22, 2015 at 02:30:01PM +0800, Daniel J Blueman wrote:
>>> On Thu, May 14, 2015 at 6:03 PM, Daniel J Blueman
>>> <[email protected]> wrote:
>>>> On Thu, May 14, 2015 at 12:31 AM, Mel Gorman<[email protected]>
>>>> wrote:
>>>>> On Wed, May 13, 2015 at 10:53:33AM -0500, nzimmer wrote:
>>>>>> I just noticed a hang on my largest box.
>>>>>> I can only reproduce it with large core counts; if I turn down the
>>>>>> number of cpus it doesn't have an issue.
>>>>>>
>>>>> Odd. The core count should make little difference
>>>>> as only
>>>>> one CPU per node should be in use. Does sysrq+t give any
>>>>> indication how
>>>>> or where it is hanging?
>>>> I was seeing the same behaviour of 1000ms increasing to 5500ms
>>>> [1]; this suggests either lock contention or O(n) behaviour.
>>>>
>>>> Nathan, can you check with this ordering of patches from Andrew's
>>>> cache [2]? I was getting hangs until I found them all.
>>>>
>>>> I'll follow up with timing data.
>>> 7TB over 216 NUMA nodes, 1728 cores, from kernel 4.0.4 load to
>>> login:
>>>
>>> 1. 2086s with patches 01-19 [1]
>>>
>>> 2. 2026s adding "Take into account that large system caches scale
>>> linearly with memory", which has:
>>> min(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));
>>>
>>> 3. 2442s fixing to:
>>> max(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));
>>>
>>> 4. 2064s adjusting minimum and shift to:
>>> max(512UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));
>>>
>>> 5. 1934s adjusting minimum and shift to:
>>> max(128UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));
>>>
>>> 6. 930s #5 with the non-temporal PMD init patch I had earlier
>>> proposed (I'll pursue separately)
>>>
>>> The scaling patch isn't in -mm.
>> That patch was superseded by "mm: meminit: finish
>> initialisation of struct pages before basic setup" and
>> "mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup-fix"
>> so that's ok.
>>
>> FWIW, I think you should still go ahead with the non-temporal
>> patches because
>> there is potential benefit there other than the initialisation. If
>> there
>> was an arch-optional implementation of a non-temporal clear then it
>> would
>> also be worth considering if __GFP_ZERO should use non-temporal
>> stores.
>> At a greater stretch it would be worth considering if kswapd freeing
>> should
>> zero pages to avoid a zero on the allocation side in the general
>> case as
>> it would be more generally useful and a stepping stone towards what
>> the
>> series "Sanitizing freed pages" attempts.

Good tip Mel; I'll take a look when time allows and get some data,
though I guess it'll only be a win where the clearing is on a different
node than the allocation.

> I think the non-temporal patch mainly benefits AMD systems. I have
> tried the patch on DragonHawk and it actually made it boot up a
> little bit slower. I think the Intel-optimized "rep stosb"
> instruction (used in memset) is performing well. I had done a similar
> test on zero-page code and the performance gain was inconclusive.

I suspect 'rep stosb' on modern Intel hardware can write whole
cachelines atomically, avoiding the RMW, or that the read part of the
RMW is optimally prefetched. Open-coding it just can't reach the same
level of pipeline saturation that the microcode can.
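
For reference, a minimal sketch of the 'rep stosb' fill being discussed
(x86-64 GCC/Clang inline asm; an illustration, not the kernel's memset):

#include <stddef.h>

/* Zero len bytes at dst using the string-fill instruction. On CPUs with
 * enhanced REP MOVSB/STOSB, microcode can fill whole cachelines per
 * iteration, which is what makes open-coded loops hard to beat. */
static void memzero_rep_stosb(void *dst, size_t len)
{
	asm volatile("rep stosb"
		     : "+D" (dst), "+c" (len)	/* RDI = dest, RCX = count */
		     : "a" (0)			/* AL = byte to store */
		     : "memory");
}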

Daniel

2015-05-26 10:22:27

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 03/13] mm: meminit: Only set page reserved in the memblock region

On Fri, May 22, 2015 at 01:31:55PM -0700, Tony Luck wrote:
> On Tue, Apr 28, 2015 at 7:37 AM, Mel Gorman <[email protected]> wrote:
> > Currently each page struct is set as reserved upon initialization.
> > This patch leaves the reserved bit clear and only sets the reserved bit
> > when it is known the memory was allocated by the bootmem allocator. This
> > makes it easier to distinguish between uninitialised struct pages and
> > reserved struct pages in later patches.
>
> On ia64 my linux-next builds now report a bunch of messages like this:
>
> put_kernel_page: page at 0xe000000005588000 not in reserved memory
> put_kernel_page: page at 0xe000000005588000 not in reserved memory
> put_kernel_page: page at 0xe000000005580000 not in reserved memory
> put_kernel_page: page at 0xe000000005580000 not in reserved memory
> put_kernel_page: page at 0xe000000005580000 not in reserved memory
> put_kernel_page: page at 0xe000000005580000 not in reserved memory
>
> the two different pages match up with two objects from the loaded kernel
> that get mapped by arch/ia64/mm/init.c:setup_gate()
>
> a000000101588000 D __start_gate_section
> a000000101580000 D empty_zero_page
>
> Should I look for a place to set the reserved bit on page structures for these
> addresses?

That would be preferred.

> Or just remove the test and message in put_kernel_page()
> [I added a debug "else" clause here - every caller passes in a page that is
> not reserved]
>
> if (!PageReserved(page))
> 	printk(KERN_ERR "put_kernel_page: page at 0x%p not in reserved memory\n",
> 	       page_address(page));
>

But as it's a debugging check that is ia64-specific, I think either
should be fine.

--
Mel Gorman
SUSE Labs

2015-06-24 22:56:10

by Nathan Zimmer

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

My apologies for taking so long to get back to this.

I think I did locate two potential sources of slowdown.
One is the set_cpus_allowed_ptr as I have noted previously.
However I only notice that on the very largest boxes.
I did cobble together a patch that seems to help.

The other spot I suspect is the zone lock in free_one_page.
I haven't been able to give that much thought as of yet though.
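
For context, the suspected serialisation point looks roughly like this
(paraphrased from mm/page_alloc.c of this era, not quoted exactly): every
thread freeing pages into the same zone contends on a single lock.

static void free_one_page(struct zone *zone, struct page *page,
			  unsigned long pfn, unsigned int order,
			  int migratetype)
{
	spin_lock(&zone->lock);		/* all freeing threads serialise here */
	__free_one_page(page, pfn, zone, order, migratetype);
	spin_unlock(&zone->lock);
}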

Daniel, do you mind seeing if the attached patch helps out?

Thanks,
Nate

On Thu, May 14, 2015 at 06:03:03PM +0800, Daniel J Blueman wrote:
> On Thu, May 14, 2015 at 12:31 AM, Mel Gorman <[email protected]> wrote:
>> On Wed, May 13, 2015 at 10:53:33AM -0500, nzimmer wrote:
>>> I just noticed a hang on my largest box.
>>> I can only reproduce it with large core counts; if I turn down the
>>> number of cpus it doesn't have an issue.
>>>
>>
>> Odd. The core count should make little difference as only
>> one CPU per node should be in use. Does sysrq+t give any indication how
>> or where it is hanging?
>
> I was seeing the same behaviour of 1000ms increasing to 5500ms [1]; this
> suggests either lock contention or O(n) behaviour.
>
> Nathan, can you check with this ordering of patches from Andrew's cache
> [2]? I was getting hangs until I found them all.
>
> I'll follow up with timing data.
>
> Thanks,
> Daniel
>
> -- [1]
>
> [ 73.076117] node 2 initialised, 7732961 pages in 1060ms
> [ 73.077184] node 38 initialised, 7732961 pages in 1060ms
> [ 73.079626] node 146 initialised, 7732961 pages in 1050ms
> [ 73.093488] node 62 initialised, 7732961 pages in 1080ms
> [ 73.091557] node 3 initialised, 7732962 pages in 1080ms
> [ 73.100000] node 186 initialised, 7732961 pages in 1040ms
> [ 73.095731] node 4 initialised, 7732961 pages in 1080ms
> [ 73.090289] node 50 initialised, 7732961 pages in 1080ms
> [ 73.094005] node 158 initialised, 7732961 pages in 1050ms
> [ 73.095421] node 159 initialised, 7732962 pages in 1050ms
> [ 73.090324] node 52 initialised, 7732961 pages in 1080ms
> [ 73.099056] node 5 initialised, 7732962 pages in 1080ms
> [ 73.090116] node 160 initialised, 7732961 pages in 1050ms
> [ 73.161051] node 157 initialised, 7732962 pages in 1120ms
> [ 73.193565] node 161 initialised, 7732962 pages in 1160ms
> [ 73.212456] node 26 initialised, 7732961 pages in 1200ms
> [ 73.222904] node 0 initialised, 6686488 pages in 1210ms
> [ 73.242165] node 140 initialised, 7732961 pages in 1210ms
> [ 73.254230] node 156 initialised, 7732961 pages in 1220ms
> [ 73.284634] node 1 initialised, 7732962 pages in 1270ms
> [ 73.305301] node 141 initialised, 7732962 pages in 1280ms
> [ 73.322845] node 28 initialised, 7732961 pages in 1310ms
> [ 73.321757] node 142 initialised, 7732961 pages in 1290ms
> [ 73.327677] node 138 initialised, 7732961 pages in 1300ms
> [ 73.413597] node 176 initialised, 7732961 pages in 1370ms
> [ 73.455552] node 139 initialised, 7732962 pages in 1420ms
> [ 73.475356] node 143 initialised, 7732962 pages in 1440ms
> [ 73.547202] node 32 initialised, 7732961 pages in 1530ms
> [ 73.579591] node 104 initialised, 7732961 pages in 1560ms
> [ 73.618065] node 174 initialised, 7732961 pages in 1570ms
> [ 73.624918] node 178 initialised, 7732961 pages in 1580ms
> [ 73.649024] node 175 initialised, 7732962 pages in 1610ms
> [ 73.654110] node 105 initialised, 7732962 pages in 1630ms
> [ 73.670589] node 106 initialised, 7732961 pages in 1650ms
> [ 73.739682] node 102 initialised, 7732961 pages in 1720ms
> [ 73.769639] node 86 initialised, 7732961 pages in 1750ms
> [ 73.775573] node 44 initialised, 7732961 pages in 1760ms
> [ 73.772955] node 177 initialised, 7732962 pages in 1740ms
> [ 73.804390] node 34 initialised, 7732961 pages in 1790ms
> [ 73.819370] node 30 initialised, 7732961 pages in 1810ms
> [ 73.847882] node 98 initialised, 7732961 pages in 1830ms
> [ 73.867545] node 33 initialised, 7732962 pages in 1860ms
> [ 73.877964] node 107 initialised, 7732962 pages in 1860ms
> [ 73.906256] node 103 initialised, 7732962 pages in 1880ms
> [ 73.945581] node 100 initialised, 7732961 pages in 1930ms
> [ 73.947024] node 96 initialised, 7732961 pages in 1930ms
> [ 74.186208] node 116 initialised, 7732961 pages in 2170ms
> [ 74.220838] node 68 initialised, 7732961 pages in 2210ms
> [ 74.252341] node 46 initialised, 7732961 pages in 2240ms
> [ 74.274795] node 118 initialised, 7732961 pages in 2260ms
> [ 74.337544] node 14 initialised, 7732961 pages in 2320ms
> [ 74.350819] node 22 initialised, 7732961 pages in 2340ms
> [ 74.350332] node 69 initialised, 7732962 pages in 2340ms
> [ 74.362683] node 211 initialised, 7732962 pages in 2310ms
> [ 74.360617] node 70 initialised, 7732961 pages in 2340ms
> [ 74.369137] node 66 initialised, 7732961 pages in 2360ms
> [ 74.378242] node 115 initialised, 7732962 pages in 2360ms
> [ 74.404221] node 213 initialised, 7732962 pages in 2350ms
> [ 74.420901] node 210 initialised, 7732961 pages in 2370ms
> [ 74.430049] node 35 initialised, 7732962 pages in 2420ms
> [ 74.436007] node 48 initialised, 7732961 pages in 2420ms
> [ 74.480595] node 71 initialised, 7732962 pages in 2460ms
> [ 74.485700] node 67 initialised, 7732962 pages in 2480ms
> [ 74.502627] node 31 initialised, 7732962 pages in 2490ms
> [ 74.542220] node 16 initialised, 7732961 pages in 2530ms
> [ 74.547936] node 128 initialised, 7732961 pages in 2520ms
> [ 74.634374] node 214 initialised, 7732961 pages in 2580ms
> [ 74.654389] node 88 initialised, 7732961 pages in 2630ms
> [ 74.722833] node 117 initialised, 7732962 pages in 2700ms
> [ 74.735002] node 148 initialised, 7732961 pages in 2700ms
> [ 74.742725] node 12 initialised, 7732961 pages in 2730ms
> [ 74.749319] node 194 initialised, 7732961 pages in 2700ms
> [ 74.767979] node 24 initialised, 7732961 pages in 2750ms
> [ 74.769465] node 114 initialised, 7732961 pages in 2750ms
> [ 74.796973] node 134 initialised, 7732961 pages in 2770ms
> [ 74.818164] node 15 initialised, 7732962 pages in 2810ms
> [ 74.844852] node 18 initialised, 7732961 pages in 2830ms
> [ 74.866123] node 110 initialised, 7732961 pages in 2850ms
> [ 74.898255] node 215 initialised, 7730688 pages in 2840ms
> [ 74.903623] node 136 initialised, 7732961 pages in 2880ms
> [ 74.911107] node 144 initialised, 7732961 pages in 2890ms
> [ 74.918757] node 212 initialised, 7732961 pages in 2870ms
> [ 74.935333] node 182 initialised, 7732961 pages in 2880ms
> [ 74.958147] node 42 initialised, 7732961 pages in 2950ms
> [ 74.964989] node 108 initialised, 7732961 pages in 2950ms
> [ 74.965482] node 112 initialised, 7732961 pages in 2950ms
> [ 75.034787] node 184 initialised, 7732961 pages in 2980ms
> [ 75.051242] node 45 initialised, 7732962 pages in 3040ms
> [ 75.047169] node 152 initialised, 7732961 pages in 3020ms
> [ 75.062834] node 179 initialised, 7732962 pages in 3010ms
> [ 75.076528] node 145 initialised, 7732962 pages in 3040ms
> [ 75.076613] node 25 initialised, 7732962 pages in 3070ms
> [ 75.073086] node 164 initialised, 7732961 pages in 3040ms
> [ 75.079674] node 149 initialised, 7732962 pages in 3050ms
> [ 75.092015] node 113 initialised, 7732962 pages in 3070ms
> [ 75.096325] node 80 initialised, 7732961 pages in 3080ms
> [ 75.131380] node 92 initialised, 7732961 pages in 3110ms
> [ 75.142147] node 10 initialised, 7732961 pages in 3130ms
> [ 75.151041] node 51 initialised, 7732962 pages in 3140ms
> [ 75.159074] node 130 initialised, 7732961 pages in 3130ms
> [ 75.162616] node 166 initialised, 7732961 pages in 3130ms
> [ 75.193557] node 82 initialised, 7732961 pages in 3170ms
> [ 75.254801] node 84 initialised, 7732961 pages in 3240ms
> [ 75.303028] node 64 initialised, 7732961 pages in 3290ms
> [ 75.299739] node 49 initialised, 7732962 pages in 3290ms
> [ 75.314231] node 21 initialised, 7732962 pages in 3300ms
> [ 75.371298] node 53 initialised, 7732962 pages in 3360ms
> [ 75.394569] node 95 initialised, 7732962 pages in 3380ms
> [ 75.441101] node 23 initialised, 7732962 pages in 3430ms
> [ 75.433080] node 19 initialised, 7732962 pages in 3430ms
> [ 75.446076] node 173 initialised, 7732962 pages in 3410ms
> [ 75.445816] node 99 initialised, 7732962 pages in 3430ms
> [ 75.470330] node 87 initialised, 7732962 pages in 3450ms
> [ 75.502334] node 8 initialised, 7732961 pages in 3490ms
> [ 75.508300] node 206 initialised, 7732961 pages in 3460ms
> [ 75.540253] node 132 initialised, 7732961 pages in 3510ms
> [ 75.615453] node 183 initialised, 7732962 pages in 3560ms
> [ 75.632576] node 78 initialised, 7732961 pages in 3610ms
> [ 75.647753] node 85 initialised, 7732962 pages in 3620ms
> [ 75.688955] node 90 initialised, 7732961 pages in 3670ms
> [ 75.694522] node 200 initialised, 7732961 pages in 3640ms
> [ 75.688790] node 43 initialised, 7732962 pages in 3680ms
> [ 75.694540] node 94 initialised, 7732961 pages in 3680ms
> [ 75.697149] node 29 initialised, 7732962 pages in 3690ms
> [ 75.693590] node 111 initialised, 7732962 pages in 3680ms
> [ 75.715829] node 56 initialised, 7732961 pages in 3700ms
> [ 75.718427] node 97 initialised, 7732962 pages in 3700ms
> [ 75.741643] node 147 initialised, 7732962 pages in 3710ms
> [ 75.773613] node 170 initialised, 7732961 pages in 3740ms
> [ 75.802874] node 208 initialised, 7732961 pages in 3750ms
> [ 75.804409] node 58 initialised, 7732961 pages in 3790ms
> [ 75.853438] node 126 initialised, 7732961 pages in 3830ms
> [ 75.888167] node 167 initialised, 7732962 pages in 3850ms
> [ 75.912656] node 172 initialised, 7732961 pages in 3870ms
> [ 75.956540] node 93 initialised, 7732962 pages in 3940ms
> [ 75.988819] node 127 initialised, 7732962 pages in 3960ms
> [ 76.062198] node 201 initialised, 7732962 pages in 4010ms
> [ 76.091769] node 47 initialised, 7732962 pages in 4080ms
> [ 76.119749] node 162 initialised, 7732961 pages in 4080ms
> [ 76.122797] node 6 initialised, 7732961 pages in 4110ms
> [ 76.225916] node 153 initialised, 7732962 pages in 4190ms
> [ 76.219855] node 81 initialised, 7732962 pages in 4200ms
> [ 76.236116] node 150 initialised, 7732961 pages in 4210ms
> [ 76.245349] node 180 initialised, 7732961 pages in 4190ms
> [ 76.248827] node 17 initialised, 7732962 pages in 4240ms
> [ 76.258801] node 13 initialised, 7732962 pages in 4250ms
> [ 76.259943] node 122 initialised, 7732961 pages in 4240ms
> [ 76.277480] node 196 initialised, 7732961 pages in 4230ms
> [ 76.320830] node 41 initialised, 7732962 pages in 4310ms
> [ 76.351667] node 129 initialised, 7732962 pages in 4320ms
> [ 76.353488] node 202 initialised, 7732961 pages in 4310ms
> [ 76.376753] node 165 initialised, 7732962 pages in 4340ms
> [ 76.381807] node 124 initialised, 7732961 pages in 4350ms
> [ 76.419952] node 171 initialised, 7732962 pages in 4380ms
> [ 76.431242] node 168 initialised, 7732961 pages in 4390ms
> [ 76.441324] node 89 initialised, 7732962 pages in 4420ms
> [ 76.440720] node 155 initialised, 7732962 pages in 4400ms
> [ 76.459715] node 120 initialised, 7732961 pages in 4440ms
> [ 76.483986] node 205 initialised, 7732962 pages in 4430ms
> [ 76.493284] node 151 initialised, 7732962 pages in 4460ms
> [ 76.491437] node 60 initialised, 7732961 pages in 4480ms
> [ 76.526620] node 74 initialised, 7732961 pages in 4510ms
> [ 76.543761] node 131 initialised, 7732962 pages in 4510ms
> [ 76.549562] node 39 initialised, 7732962 pages in 4540ms
> [ 76.563861] node 11 initialised, 7732962 pages in 4550ms
> [ 76.598775] node 54 initialised, 7732961 pages in 4590ms
> [ 76.602006] node 123 initialised, 7732962 pages in 4570ms
> [ 76.619856] node 76 initialised, 7732961 pages in 4600ms
> [ 76.631418] node 198 initialised, 7732961 pages in 4580ms
> [ 76.665415] node 188 initialised, 7732961 pages in 4610ms
> [ 76.669178] node 63 initialised, 7732962 pages in 4660ms
> [ 76.683646] node 101 initialised, 7732962 pages in 4670ms
> [ 76.710780] node 192 initialised, 7732961 pages in 4660ms
> [ 76.736743] node 121 initialised, 7732962 pages in 4720ms
> [ 76.743800] node 199 initialised, 7732962 pages in 4700ms
> [ 76.750663] node 20 initialised, 7732961 pages in 4740ms
> [ 76.763045] node 135 initialised, 7732962 pages in 4730ms
> [ 76.768216] node 137 initialised, 7732962 pages in 4740ms
> [ 76.800135] node 181 initialised, 7732962 pages in 4750ms
> [ 76.811215] node 27 initialised, 7732962 pages in 4800ms
> [ 76.857405] node 125 initialised, 7732962 pages in 4820ms
> [ 76.853750] node 163 initialised, 7732962 pages in 4820ms
> [ 76.882975] node 59 initialised, 7732962 pages in 4870ms
> [ 76.920121] node 9 initialised, 7732962 pages in 4910ms
> [ 76.934824] node 189 initialised, 7732962 pages in 4880ms
> [ 76.951223] node 154 initialised, 7732961 pages in 4920ms
> [ 76.953897] node 203 initialised, 7732962 pages in 4900ms
> [ 76.952558] node 75 initialised, 7732962 pages in 4930ms
> [ 76.985480] node 119 initialised, 7732962 pages in 4970ms
> [ 77.036089] node 195 initialised, 7732962 pages in 4980ms
> [ 77.039996] node 55 initialised, 7732962 pages in 5030ms
> [ 77.067989] node 109 initialised, 7732962 pages in 5040ms
> [ 77.066236] node 7 initialised, 7732962 pages in 5060ms
> [ 77.068709] node 65 initialised, 7732962 pages in 5060ms
> [ 77.097859] node 79 initialised, 7732962 pages in 5080ms
> [ 77.096219] node 169 initialised, 7732962 pages in 5060ms
> [ 77.125113] node 83 initialised, 7732962 pages in 5110ms
> [ 77.139507] node 37 initialised, 7732962 pages in 5130ms
> [ 77.143280] node 77 initialised, 7732962 pages in 5120ms
> [ 77.226494] node 73 initialised, 7732962 pages in 5200ms
> [ 77.281584] node 190 initialised, 7732961 pages in 5230ms
> [ 77.314794] node 204 initialised, 7732961 pages in 5260ms
> [ 77.328577] node 72 initialised, 7732961 pages in 5310ms
> [ 77.335743] node 36 initialised, 7732961 pages in 5320ms
> [ 77.360573] node 40 initialised, 7732961 pages in 5350ms
> [ 77.368712] node 207 initialised, 7732962 pages in 5320ms
> [ 77.387708] node 91 initialised, 7732962 pages in 5370ms
> [ 77.385143] node 57 initialised, 7732962 pages in 5380ms
> [ 77.391785] node 191 initialised, 7732962 pages in 5340ms
> [ 77.479970] node 185 initialised, 7732962 pages in 5430ms
> [ 77.491865] node 61 initialised, 7732962 pages in 5480ms
> [ 77.489255] node 133 initialised, 7732962 pages in 5460ms
> [ 77.502111] node 197 initialised, 7732962 pages in 5450ms
> [ 77.507136] node 193 initialised, 7732962 pages in 5460ms
> [ 77.523739] node 209 initialised, 7732962 pages in 5470ms
> [ 77.537131] node 187 initialised, 7732962 pages in 5490ms
>
> -- [2]
>
> http://ozlabs.org/~akpm/mmots/broken-out/memblock-introduce-a-for_each_reserved_mem_region-iterator.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-move-page-initialization-into-a-separate-function.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-only-set-page-reserved-in-the-memblock-region.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-page_alloc-pass-pfn-to-__free_pages_bootmem.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-make-__early_pfn_to_nid-smp-safe-and-introduce-meminit_pfn_in_nid.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-inline-some-helper-functions.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-inline-some-helper-functions-fix.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-initialise-a-subset-of-struct-pages-if-config_deferred_struct_page_init-is-set.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-initialise-remaining-struct-pages-in-parallel-with-kswapd.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-minimise-number-of-pfn-page-lookups-during-initialisation.patch
> http://ozlabs.org/~akpm/mmots/broken-out/x86-mm-enable-deferred-struct-page-initialisation-on-x86-64.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-free-pages-in-large-chunks-where-possible.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-reduce-number-of-times-pageblocks-are-set-during-struct-page-init.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-remove-mminit_verify_page_links.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-initialise-a-subset-of-struct-pages-if-config_deferred_struct_page_init-is-set-fix.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup-fix.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-reduce-number-of-times-pageblocks-are-set-during-struct-page-init-fix.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-inline-some-helper-functions-fix2.patch
>


Attachments:
0001-Avoid-the-contention-in-set_cpus_allowed.patch (2.23 kB)

2015-06-25 20:49:07

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

On Wed, Jun 24, 2015 at 05:50:28PM -0500, Nathan Zimmer wrote:
> My apologies for taking so long to get back to this.
>
> I think I did locate two potential sources of slowdown.
> One is the set_cpus_allowed_ptr as I have noted previously.
> However I only notice that on the very largest boxes.
> I did cobble together a patch that seems to help.
>

If you are using kthread_create_on_node(), is it even necessary to call
set_cpus_allowed_ptr() at all?

--
Mel Gorman
SUSE Labs

2015-06-25 20:57:56

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

On Thu, Jun 25, 2015 at 09:48:55PM +0100, Mel Gorman wrote:
> On Wed, Jun 24, 2015 at 05:50:28PM -0500, Nathan Zimmer wrote:
> > My apologies for taking so long to get back to this.
> >
> > I think I did locate two potential sources of slowdown.
> > One is the set_cpus_allowed_ptr as I have noted previously.
> > However I only notice that on the very largest boxes.
> > I did cobble together a patch that seems to help.
> >
>
> If you are using kthread_create_on_node(), is it even necessary to call
> set_cpus_allowed_ptr() at all?
>

That aside, are you aware of any failure with this series as it currently
stands in Andrew's tree that this patch is meant to address? It seems
like a nice follow-on that would boot faster on very large machines but
if it's addressing a regression then it's very important as the series
cannot be merged with known critical failures.

--
Mel Gorman
SUSE Labs

2015-06-25 21:41:13

by Nathan Zimmer

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

On Thu, Jun 25, 2015 at 09:48:55PM +0100, Mel Gorman wrote:
> On Wed, Jun 24, 2015 at 05:50:28PM -0500, Nathan Zimmer wrote:
> > My apologies for taking so long to get back to this.
> >
> > I think I did locate two potential sources of slowdown.
> > One is the set_cpus_allowed_ptr as I have noted previously.
> > However I only notice that on the very largest boxes.
> > I did cobble together a patch that seems to help.
> >
>
> If you are using kthread_create_on_node(), is it even necessary to call
> set_cpus_allowed_ptr() at all?
>

Yup, kthread_create_on_node() unconditionally calls
set_cpus_allowed_ptr(task, cpu_all_mask);
it does this to avoid inheriting kthreadd's properties.

Not being familiar with the scheduling code, I assumed I had missed something.
However, it sounds like it should respect the choice.

2015-06-25 21:37:52

by Nathan Zimmer

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

On Thu, Jun 25, 2015 at 09:57:44PM +0100, Mel Gorman wrote:
> On Thu, Jun 25, 2015 at 09:48:55PM +0100, Mel Gorman wrote:
> > On Wed, Jun 24, 2015 at 05:50:28PM -0500, Nathan Zimmer wrote:
> > > My apologies for taking so long to get back to this.
> > >
> > > I think I did locate two potential sources of slowdown.
> > > One is the set_cpus_allowed_ptr as I have noted previously.
> > > However I only notice that on the very largest boxes.
> > > I did cobble together a patch that seems to help.
> > >
> >
> > If you are using kthread_create_on_node(), is it even necessary to call
> > set_cpus_allowed_ptr() at all?
> >
>
> That aside, are you aware of any failure with this series as it currently
> stands in Andrew's tree that this patch is meant to address? It seems
> like a nice follow-on that would boot faster on very large machines but
> if it's addressing a regression then it's very important as the series
> cannot be merged with known critical failures.
>

Nope, I haven't recorded any failures without it.
I just get concerned when I see scaling issues that something COULD go wrong.


Nate

2015-06-25 21:44:30

by Nathan Zimmer

[permalink] [raw]
Subject: [RFC] kthread_create_on_node is failing to honor the node choice

In kthread_create_on_node we set_cpus_allowed to cpu_all_mask
regardless of which node is requested.
This seems incorrect.

Signed-off-by: Nathan Zimmer <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Nishanth Aravamudan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: [email protected]

---
kernel/kthread.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 10e489c..d885d66 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -318,7 +318,10 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
* The kernel thread should not inherit these properties.
*/
sched_setscheduler_nocheck(task, SCHED_NORMAL, &param);
- set_cpus_allowed_ptr(task, cpu_all_mask);
+ if (node == -1)
+ set_cpus_allowed_ptr(task, cpu_all_mask);
+ else
+ set_cpus_allowed_ptr(task, cpumask_of_node(node));
}
kfree(create);
return task;
--
1.8.2.1

2015-06-26 01:03:55

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [RFC] kthread_create_on_node is failing to honor the node choice

On 06/26/2015 05:44 AM, Nathan Zimmer wrote:
> In kthread_create_on_node we set_cpus_allowed to cpu_all_mask
> regardless of which node is requested.
> This seems incorrect.
>
> Signed-off-by: Nathan Zimmer <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Nishanth Aravamudan <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Lai Jiangshan <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: [email protected]
>
> ---
> kernel/kthread.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 10e489c..d885d66 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -318,7 +318,10 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
> * The kernel thread should not inherit these properties.
> */
> sched_setscheduler_nocheck(task, SCHED_NORMAL, &param);
> - set_cpus_allowed_ptr(task, cpu_all_mask);
> + if (node == -1)
> + set_cpus_allowed_ptr(task, cpu_all_mask);
> + else
> + set_cpus_allowed_ptr(task, cpumask_of_node(node));


cpumask_of_node(node) is bad here. It contains only online cpus.

> }
> kfree(create);
> return task;
>
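
A minimal sketch of a guard that would address this (an illustration only,
not a patch posted in this thread): since cpumask_of_node() reflects only
online CPUs, fall back to cpu_all_mask when the node has none.

	/* in place of the set_cpus_allowed_ptr() hunk above */
	const struct cpumask *mask = (node == NUMA_NO_NODE) ?
				cpu_all_mask : cpumask_of_node(node);

	if (cpumask_empty(mask))
		mask = cpu_all_mask;	/* node has no online CPUs */
	set_cpus_allowed_ptr(task, mask);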

2015-06-26 10:16:12

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

On Wed, Jun 24, 2015 at 05:50:28PM -0500, Nathan Zimmer wrote:
> From e18aa6158a60c2134b4eef93c856f3b5b250b122 Mon Sep 17 00:00:00 2001
> From: Nathan Zimmer <[email protected]>
> Date: Thu, 11 Jun 2015 10:47:39 -0500
> Subject: [RFC] Avoid the contention in set_cpus_allowed
>
> Noticing some scaling issues at larger box sizes (64 nodes+) I found that in some
> cases we are spending significant amounts of time in set_cpus_allowed_ptr.
>
> My assumption is that it is getting stuck on migration.
> So if we create the thread on the target node and restrict cpus before we start
> the thread then we don't have to suffer migration.
>
> Cc: Mel Gorman <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Scott Norton <[email protected]>
> Cc: Daniel J Blueman <[email protected]>
> Signed-off-by: Nathan Zimmer <[email protected]>
>

I asked yesterday if set_cpus_allowed_ptr() was required and I made a
mistake because it is. The node parameter for kthread_create_on_node()
controls where it gets created but not how it is scheduled after that.
Sorry for the noise. The patch makes sense to me now; let's see if it
helps Daniel.


--
Mel Gorman
SUSE Labs

2015-07-06 17:46:01

by Daniel J Blueman

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

Hi Nate,

On Wed, Jun 24, 2015 at 11:50 PM, Nathan Zimmer <[email protected]> wrote:
> My apologies for taking so long to get back to this.
>
> I think I did locate two potential sources of slowdown.
> One is the set_cpus_allowed_ptr as I have noted previously.
> However I only notice that on the very largest boxes.
> I did cobble together a patch that seems to help.
>
> The other spot I suspect is the zone lock in free_one_page.
> I haven't been able to give that much thought as of yet though.
>
> Daniel, do you mind seeing if the attached patch helps out?

Just got back from travel, so apologies for the delays.

The patch doesn't mitigate the increasing initialisation time; summing
the per-node times for an accurate measure, there was a total of
171.48s before the patch and 175.23s after. I double-checked and got
similar data.

Thanks,
Daniel

2015-07-09 17:49:14

by Nathan Zimmer

[permalink] [raw]
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

Interesting; I found a small improvement in total clock time through that
area.
I tweaked page_alloc_init_late() to have a timer, like
deferred_init_memmap(), and this patch showed a small improvement.

OK, thanks for your help.


On 07/06/2015 12:45 PM, Daniel J Blueman wrote:
> Hi Nate,
>
> On Wed, Jun 24, 2015 at 11:50 PM, Nathan Zimmer <[email protected]> wrote:
>> My apologies for taking so long to get back to this.
>>
>> I think I did locate two potential sources of slowdown.
>> One is the set_cpus_allowed_ptr as I have noted previously.
>> However I only notice that on the very largest boxes.
>> I did cobble together a patch that seems to help.
>>
>> The other spot I suspect is the zone lock in free_one_page.
>> I haven't been able to give that much thought as of yet though.
>>
>> Daniel, do you mind seeing if the attached patch helps out?
>
> Just got back from travel, so apologies for the delays.
>
> The patch doesn't mitigate the increasing initialisation time; summing
> the per-node times for an accurate measure, there was a total of
> 171.48s before the patch and 175.23s after. I double-checked and got
> similar data.
>
> Thanks,
> Daniel
>

2015-07-09 22:13:08

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC] kthread_create_on_node is failing to honor the node choice

On Thu, 25 Jun 2015 16:44:13 -0500 Nathan Zimmer <[email protected]> wrote:

> In kthread_create_on_node we set_cpus_allowed to cpu_all_mask
> regardless of which node is requested.
> This seems incorrect.

The `node' arg to kthread_create_on_node() refers to which node the
task_struct and thread_info are allocated from. It doesn't refer to
the CPUs upon which the thread is executed. See
kthread_create_info.node and that gruesome task_struct.pref_node_fork
thing.

The kthread_create_on_node() kerneldoc explains this, in a confused
way. It needs a s/-1/NUMA_NO_NODE/.

I'm a bit surprised that kthread_create_on_node() futzes with the new
thread's policy and cpumask after it has been created. Wouldn't it be
simpler/faster to have the thread itself set these things while it's
starting up?


As to whether kthread_create_on_node() should bind the thread to that
node's CPUs: unclear.
drivers/thermal/intel_powerclamp.c:start_power_clamp() understands how
kthread_create_on_node() works. I guess the code is OK as-is, but the
documentation could be improved. Perfunctory effort:

--- a/kernel/kthread.c~a
+++ a/kernel/kthread.c
@@ -246,7 +246,7 @@ static void create_kthread(struct kthrea
* kthread_create_on_node - create a kthread.
* @threadfn: the function to run until signal_pending(current).
* @data: data ptr for @threadfn.
- * @node: memory node number.
+ * @node: task and thread structures for the thread are allocated on this node
* @namefmt: printf-style name for the thread.
*
* Description: This helper function creates and names a kernel
@@ -254,7 +254,7 @@ static void create_kthread(struct kthrea
* it. See also kthread_run().
*
* If thread is going to be bound on a particular cpu, give its node
- * in @node, to get NUMA affinity for kthread stack, or else give -1.
+ * in @node, to get NUMA affinity for kthread stack, or else give NUMA_NO_NODE.
* When woken, the thread will run @threadfn() with @data as its
* argument. @threadfn() can either call do_exit() directly if it is a
* standalone thread for which no one will call kthread_stop(), or
_
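
To make the semantics concrete, here is a minimal usage sketch
(my_threadfn, my_data and nid are placeholders, not names from this
thread): @node only places the task's allocations, so a caller that also
wants the thread to run on that node must bind it explicitly.

#include <linux/kthread.h>
#include <linux/cpumask.h>
#include <linux/err.h>
#include <linux/sched.h>

static struct task_struct *start_pernode_thread(int (*my_threadfn)(void *),
						void *my_data, int nid)
{
	struct task_struct *tsk;

	/* nid places the task_struct and stack on that node... */
	tsk = kthread_create_on_node(my_threadfn, my_data, nid,
				     "pernode/%d", nid);
	if (IS_ERR(tsk))
		return tsk;

	/* ...but execution must be bound separately. */
	kthread_bind(tsk, cpumask_first(cpumask_of_node(nid)));
	wake_up_process(tsk);
	return tsk;
}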


2015-07-10 14:27:02

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC] kthread_create_on_node is failing to honor the node choice

On Thu, Jul 09, 2015 at 03:12:59PM -0700, Andrew Morton wrote:
> On Thu, 25 Jun 2015 16:44:13 -0500 Nathan Zimmer <[email protected]> wrote:
>
> > In kthread_create_on_node we set_cpus_allowed to cpu_all_mask
> > regardless of which node is requested.
> > This seems incorrect.
>
> The `node' arg to kthread_create_on_node() refers to which node the
> task_struct and thread_info are allocated from. It doesn't refer to
> the CPUs upon which the thread is executed. See
> kthread_create_info.node and that gruesome task_struct.pref_node_fork
> thing.
>

That's the initial mistake I made when reviewing Nathan's first patch.

> The kthread_create_on_node() kerneldoc explains this, in a confused
> way. It needs a s/-1/NUMA_NO_NODE/.
>
> I'm a bit surprised that kthread_create_on_node() futzes with the new
> thread's policy and cpumask after it has been created. Wouldn't it be
> simpler/faster to have the thread itself set these things while it's
> starting up?
>

Yeah, which is what Nathan's first patch did and what I initially got
wrong. It creates a thread, sets the mask, then wakes it up, which
looks correct.

Your documentation patch looks good to me, I would not have fallen into
the trap if it had been applied.

--
Mel Gorman
SUSE Labs

2015-07-10 17:34:55

by Nathan Zimmer

[permalink] [raw]
Subject: Re: [RFC] kthread_create_on_node is failing to honor the node choice


On Thu, Jul 09, 2015 at 03:12:59PM -0700, Andrew Morton wrote:
> On Thu, 25 Jun 2015 16:44:13 -0500 Nathan Zimmer <[email protected]> wrote:
>
> > In kthread_create_on_node we set_cpus_allowed to cpu_all_mask
> > regardless of which node is requested.
> > This seems incorrect.
>
> The `node' arg to kthread_create_on_node() refers to which node the
> task_struct and thread_info are allocated from. It doesn't refer to
> the CPUs upon which the thread is executed. See
> kthread_create_info.node and that gruesome task_struct.pref_node_fork
> thing.
>
> The kthread_create_on_node() kerneldoc explains this, in a confused
> way. It needs a s/-1/NUMA_NO_NODE/.


I suspect we should update the kthread_create macro to use NUMA_NO_NODE as well.


diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 13d5520..3e6773e 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -11,7 +11,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
const char namefmt[], ...);

#define kthread_create(threadfn, data, namefmt, arg...) \
- kthread_create_on_node(threadfn, data, -1, namefmt, ##arg)
+ kthread_create_on_node(threadfn, data, NUMA_NO_NODE, namefmt, ##arg)


struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),

2015-07-14 15:54:14

by Dave Hansen

[permalink] [raw]
Subject: 4.2-rc2: hitting "file-max limit 8192 reached"

My laptop has been behaving strangely with 4.2-rc2. Once I log in to my
X session, I start getting all kinds of strange errors from applications
and see this in my dmesg:

VFS: file-max limit 8192 reached

Could this be from CONFIG_DEFERRED_STRUCT_PAGE_INIT=y? files_init()
seems to be sizing files_stat.max_files from memory sizes.

vfs_caches_init() uses nr_free_pages() to figure out what the "current
kernel size" is in early boot. *But* since we have not freed most of
our memory, nr_free_pages() is low and makes us calculate the reserve as
if the kernel were huge.

Adding some printk's confirms this. Broken kernel:

vfs_caches_init() mempages: 4026972
vfs_caches_init() reserve: 4021629
vfs_caches_init() mempages (after reserve minus): 5343
files_init() n: 2137
files_init() files_stat.max_files: 8192

Working kernel:

vfs_caches_init() mempages: 4026972
vfs_caches_init() reserve: 375
vfs_caches_init() mempages2: 4026597
files_init() n: 1610638
files_init() files_stat.max_files: 1610638

Do we have an alternative to call instead of nr_free_pages() in
vfs_caches_init()?

I guess we could save off 'nr_initialized' in memmap_init_zone() and
then use "nr_initialized - nr_free_pages()", but that seems a bit hackish.

2015-07-14 16:16:00

by Andrew Morton

[permalink] [raw]
Subject: Re: 4.2-rc2: hitting "file-max limit 8192 reached"

On Tue, 14 Jul 2015 08:54:11 -0700 Dave Hansen <[email protected]> wrote:

> My laptop has been behaving strangely with 4.2-rc2. Once I log in to my
> X session, I start getting all kinds of strange errors from applications
> and see this in my dmesg:
>
> VFS: file-max limit 8192 reached
>
> Could this be from CONFIG_DEFERRED_STRUCT_PAGE_INIT=y? files_init()
> seems to be sizing files_stat.max_files from memory sizes.

argh.

> vfs_caches_init() uses nr_free_pages() to figure out what the "current
> kernel size" is in early boot. *But* since we have not freed most of
> our memory, nr_free_pages() is low and makes us calculate the reserve as
> if the kernel were huge.
>
> Adding some printk's confirms this. Broken kernel:
>
> vfs_caches_init() mempages: 4026972
> vfs_caches_init() reserve: 4021629
> vfs_caches_init() mempages (after reserve minus): 5343
> files_init() n: 2137
> files_init() files_stat.max_files: 8192
>
> Working kernel:
>
> vfs_caches_init() mempages: 4026972
> vfs_caches_init() reserve: 375
> vfs_caches_init() mempages2: 4026597
> files_init() n: 1610638
> files_init() files_stat.max_files: 1610638
>
> Do we have an alternative to call instead of nr_free_pages() in
> vfs_caches_init()?
>
> I guess we could save off 'nr_initialized' in memmap_init_zone() and
> then use "nr_initialized - nr_free_pages()", but that seems a bit hackish.

There are a lot of things that might be affected this way. Callers of
nr_free_buffer_pages(), nr_free_pagecache_pages(), etc.

If we'd fully used the memory hotplug infrastructure then everything
would work - all those knobs which are sized off free memory would get
themselves resized as more memory comes online. But quite a few
things have been missed.

2015-07-15 10:45:34

by Mel Gorman

[permalink] [raw]
Subject: Re: 4.2-rc2: hitting "file-max limit 8192 reached"

On Tue, Jul 14, 2015 at 08:54:11AM -0700, Dave Hansen wrote:
> My laptop has been behaving strangely with 4.2-rc2. Once I log in to my
> X session, I start getting all kinds of strange errors from applications
> and see this in my dmesg:
>
> VFS: file-max limit 8192 reached
>
> Could this be from CONFIG_DEFERRED_STRUCT_PAGE_INIT=y? files_init()
> seems to be sizing files_stat.max_files from memory sizes.
>

Yep.

I'm very sick at the moment and running a temperature so this needs double
checking. Medication is helping but I'm nowhere near 100%.

Andrew mentioned nr_free_buffer_pages and nr_free_pagecache_pages.
They are both live calculations that walk through zonelists with return
values based on zone->managed_pages. They are not affected by deferred
memory initialisation which leaves managed_pages alone.

AFAICS, the key problem is to watch for initialisations that are based on
free memory. It appears that only file_table.c cares and the calculation
of limits can be done after deferred memory initialisation, like this:

---8<---
fs, file table: Reinit files_stat.max_files after deferred memory initialisation

Dave Hansen reported the following;

My laptop has been behaving strangely with 4.2-rc2. Once I log
in to my X session, I start getting all kinds of strange errors
from applications and see this in my dmesg:

VFS: file-max limit 8192 reached

The problem is that the file-max is calculated before memory is fully
initialised and miscalculates how much memory the kernel is using. This
patch recalculates file-max after deferred memory initialisation. Note
that using memory hotplug infrastructure would not have avoided this
problem as the value is not recalculated after memory hot-add.

4.1: files_stat.max_files = 6582781
4.2-rc2: files_stat.max_files = 8192
4.2-rc2 patched: files_stat.max_files = 6562467

Small differences with the patch applied and 4.1 but not enough to matter.

Signed-off-by: Mel Gorman <[email protected]>
---
fs/dcache.c | 13 +++----------
fs/file_table.c | 24 +++++++++++++++---------
include/linux/fs.h | 5 +++--
init/main.c | 2 +-
mm/page_alloc.c | 3 +++
5 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 5c8ea15e73a5..9b5fe503f6cb 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3442,22 +3442,15 @@ void __init vfs_caches_init_early(void)
inode_init_early();
}

-void __init vfs_caches_init(unsigned long mempages)
+void __init vfs_caches_init(void)
{
- unsigned long reserve;
-
- /* Base hash sizes on available memory, with a reserve equal to
- 150% of current kernel size */
-
- reserve = min((mempages - nr_free_pages()) * 3/2, mempages - 1);
- mempages -= reserve;
-
names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);

dcache_init();
inode_init();
- files_init(mempages);
+ files_init();
+ files_maxfiles_init();
mnt_init();
bdev_cache_init();
chrdev_init();
diff --git a/fs/file_table.c b/fs/file_table.c
index 7f9d407c7595..ad17e05ebf95 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -25,6 +25,7 @@
#include <linux/hardirq.h>
#include <linux/task_work.h>
#include <linux/ima.h>
+#include <linux/swap.h>

#include <linux/atomic.h>

@@ -308,19 +309,24 @@ void put_filp(struct file *file)
}
}

-void __init files_init(unsigned long mempages)
+void __init files_init(void)
{
- unsigned long n;
-
filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+ percpu_counter_init(&nr_files, 0, GFP_KERNEL);
+}

- /*
- * One file with associated inode and dcache is very roughly 1K.
- * Per default don't use more than 10% of our memory for files.
- */
+/*
+ * One file with associated inode and dcache is very roughly 1K. Per default
+ * do not use more than 10% of our memory for files.
+ */
+void __init files_maxfiles_init(void)
+{
+ unsigned long n;
+ unsigned long memreserve = (totalram_pages - nr_free_pages()) * 3/2;
+
+ memreserve = min(memreserve, totalram_pages - 1);
+ n = ((totalram_pages - memreserve) * (PAGE_SIZE / 1024)) / 10;

- n = (mempages * (PAGE_SIZE / 1024)) / 10;
files_stat.max_files = max_t(unsigned long, n, NR_FILE);
- percpu_counter_init(&nr_files, 0, GFP_KERNEL);
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a0653e560c26..e6ceaae3a50e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -55,7 +55,8 @@ struct vm_fault;

extern void __init inode_init(void);
extern void __init inode_init_early(void);
-extern void __init files_init(unsigned long);
+extern void __init files_init(void);
+extern void __init files_maxfiles_init(void);

extern struct files_stat_struct files_stat;
extern unsigned long get_max_files(void);
@@ -2235,7 +2236,7 @@ extern int ioctl_preallocate(struct file *filp, void __user *argp);

/* fs/dcache.c */
extern void __init vfs_caches_init_early(void);
-extern void __init vfs_caches_init(unsigned long);
+extern void __init vfs_caches_init(void);

extern struct kmem_cache *names_cachep;

diff --git a/init/main.c b/init/main.c
index c5d5626289ce..56506553d4d8 100644
--- a/init/main.c
+++ b/init/main.c
@@ -656,7 +656,7 @@ asmlinkage __visible void __init start_kernel(void)
key_init();
security_init();
dbg_late_init();
- vfs_caches_init(totalram_pages);
+ vfs_caches_init();
signals_init();
/* rootfs populating might need page-writeback */
page_writeback_init();
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a69e78c396a0..94e2599830c2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1203,6 +1203,9 @@ void __init page_alloc_init_late(void)

/* Block until all are initialised */
wait_for_completion(&pgdat_init_all_done_comp);
+
+ /* Reinit limits that are based on free pages after the kernel is up */
+ files_maxfiles_init();
}
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */

2015-07-30 23:57:16

by Alex Ng (LIS)

[permalink] [raw]
Subject: Re: [PATCH 03/13] mm: meminit: Only set page reserved in the memblock region

Hi,

While testing builds containing this change (commit id: 92923ca3aacef63c92dc297a75ad0c6dfe4eab37), I've observed that memory fails to come online in the hotplug case. When attempting to bring the hot-added pages online (via udev rules or by manually writing to sysfs), it fails with -EBUSY because the hot-added pages no longer have the Reserved flag set.

The reserved bit is being checked in memory_block_action() in drivers/base/memory.c:

switch (action) {
case MEM_ONLINE:
	if (!pages_correctly_reserved(start_pfn))
		return -EBUSY;

Should the reserved bit be set in some other place? Or should the above test be removed from the sysfs memory device?

--
Alex Ng