2015-04-23 10:33:25

by Mel Gorman

Subject: [PATCH 0/13] Parallel struct page initialisation v3

The big change here is an adjustment to the topology_init path. It had caused
soft lockups on Waiman's machine and Daniel Blueman had reported that it was
an expensive function.

Changelog since v2
o Reduce overhead of topology_init
o Remove boot-time kernel parameter to enable/disable
o Enable on UMA

Changelog since v1
o Always initialise low zones
o Typo corrections
o Rename parallel mem init to parallel struct page init
o Rebase to 4.0

Struct page initialisation had been identified as one of the reasons why
large machines take a long time to boot. Patches were posted a long time ago
to defer initialisation until the pages were first used. This was rejected on
the grounds that it should not be necessary to hurt the fast paths. This series
reuses much of the work from that time but defers the initialisation of
memory to kswapd so that one thread per node initialises memory local to
that node.
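
To illustrate the overall flow (a rough sketch, not a diff from this series),
each per-node kswapd thread initialises that node's deferred pages before it
enters its normal balancing loop, via the hook added in patch 8:

    /* sketch: one kswapd per node does the deferred initialisation */
    static int kswapd(void *p)
    {
            pg_data_t *pgdat = (pg_data_t *)p;

            /* initialise this node's deferred struct pages first */
            deferred_init_memmap(pgdat->node_id);

            /* ... the usual kswapd balancing loop follows ... */
            return 0;
    }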

After applying the series and setting the appropriate Kconfig variable I
see this in the boot log on a 64G machine

[ 7.383764] kswapd 0 initialised deferred memory in 188ms
[ 7.404253] kswapd 1 initialised deferred memory in 208ms
[ 7.411044] kswapd 3 initialised deferred memory in 216ms
[ 7.411551] kswapd 2 initialised deferred memory in 216ms

On a 1TB machine, I see

[ 8.406511] kswapd 3 initialised deferred memory in 1116ms
[ 8.428518] kswapd 1 initialised deferred memory in 1140ms
[ 8.435977] kswapd 0 initialised deferred memory in 1148ms
[ 8.437416] kswapd 2 initialised deferred memory in 1148ms

Once booted the machine appears to work as normal. Boot times were measured
from the time shutdown was called until ssh was available again. In the
64G case, the boot time savings are negligible. On the 1TB machine, the
savings were 16 seconds.

It would be nice if people who have access to really large machines could
test this series and report how much boot time is reduced.

arch/ia64/mm/numa.c | 19 +--
arch/x86/Kconfig | 1 +
drivers/base/node.c | 11 +-
include/linux/memblock.h | 18 +++
include/linux/mm.h | 8 +-
include/linux/mmzone.h | 23 ++-
mm/Kconfig | 18 +++
mm/bootmem.c | 8 +-
mm/internal.h | 23 ++-
mm/memblock.c | 34 ++++-
mm/mm_init.c | 9 +-
mm/nobootmem.c | 7 +-
mm/page_alloc.c | 379 ++++++++++++++++++++++++++++++++++++++++-------
mm/vmscan.c | 6 +-
14 files changed, 462 insertions(+), 102 deletions(-)

--
2.3.5


2015-04-23 10:33:26

by Mel Gorman

Subject: [PATCH 01/13] memblock: Introduce a for_each_reserved_mem_region iterator.

From: Robin Holt <[email protected]>

As part of initializing struct pages in 2MiB chunks, we noticed that at the
end of free_all_bootmem() there was nothing that forced the reserved/allocated
4KiB pages to be initialized.

This helper function will be used for that expansion.
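
For illustration, a caller walks the reserved regions like this (this is how
a later patch in the series uses the iterator when marking reserved pages):

    phys_addr_t start, end;
    u64 i;

    /* walk every reserved memblock region */
    for_each_reserved_mem_region(i, &start, &end)
            reserve_bootmem_region(start, end);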

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nate Zimmer <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/memblock.h | 18 ++++++++++++++++++
mm/memblock.c | 32 ++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index e8cc45307f8f..3075e7673c54 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -93,6 +93,9 @@ void __next_mem_range_rev(u64 *idx, int nid, struct memblock_type *type_a,
struct memblock_type *type_b, phys_addr_t *out_start,
phys_addr_t *out_end, int *out_nid);

+void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
+ phys_addr_t *out_end);
+
/**
* for_each_mem_range - iterate through memblock areas from type_a and not
* included in type_b. Or just type_a if type_b is NULL.
@@ -132,6 +135,21 @@ void __next_mem_range_rev(u64 *idx, int nid, struct memblock_type *type_a,
__next_mem_range_rev(&i, nid, type_a, type_b, \
p_start, p_end, p_nid))

+/**
+ * for_each_reserved_mem_region - iterate over all reserved memblock areas
+ * @i: u64 used as loop variable
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over reserved areas of memblock. Available as soon as memblock
+ * is initialized.
+ */
+#define for_each_reserved_mem_region(i, p_start, p_end) \
+ for (i = 0UL, \
+ __next_reserved_mem_region(&i, p_start, p_end); \
+ i != (u64)ULLONG_MAX; \
+ __next_reserved_mem_region(&i, p_start, p_end))
+
#ifdef CONFIG_MOVABLE_NODE
static inline bool memblock_is_hotpluggable(struct memblock_region *m)
{
diff --git a/mm/memblock.c b/mm/memblock.c
index 252b77bdf65e..e0cc2d174f74 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -765,6 +765,38 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
}

/**
+ * __next_reserved_mem_region - next function for for_each_reserved_region()
+ * @idx: pointer to u64 loop variable
+ * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL
+ * @out_end: ptr to phys_addr_t for end address of the region, can be %NULL
+ *
+ * Iterate over all reserved memory regions.
+ */
+void __init_memblock __next_reserved_mem_region(u64 *idx,
+ phys_addr_t *out_start,
+ phys_addr_t *out_end)
+{
+ struct memblock_type *rsv = &memblock.reserved;
+
+ if (*idx >= 0 && *idx < rsv->cnt) {
+ struct memblock_region *r = &rsv->regions[*idx];
+ phys_addr_t base = r->base;
+ phys_addr_t size = r->size;
+
+ if (out_start)
+ *out_start = base;
+ if (out_end)
+ *out_end = base + size - 1;
+
+ *idx += 1;
+ return;
+ }
+
+ /* signal end of iteration */
+ *idx = ULLONG_MAX;
+}
+
+/**
* __next__mem_range - next function for for_each_free_mem_range() etc.
* @idx: pointer to u64 loop variable
* @nid: node selector, %NUMA_NO_NODE for all nodes
--
2.3.5

2015-04-23 10:36:04

by Mel Gorman

Subject: [PATCH 02/13] mm: meminit: Move page initialization into a separate function.

From: Robin Holt <[email protected]>

Currently, memmap_init_zone() has all the smarts for initializing a single
page. A subset of this is required for parallel page initialisation and so
this patch breaks up the monolithic function in preparation.

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nathan Zimmer <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 79 +++++++++++++++++++++++++++++++++------------------------
1 file changed, 46 insertions(+), 33 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 40e29429e7b0..fd7a6d09062d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -778,6 +778,51 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
return 0;
}

+static void __meminit __init_single_page(struct page *page, unsigned long pfn,
+ unsigned long zone, int nid)
+{
+ struct zone *z = &NODE_DATA(nid)->node_zones[zone];
+
+ set_page_links(page, zone, nid, pfn);
+ mminit_verify_page_links(page, zone, nid, pfn);
+ init_page_count(page);
+ page_mapcount_reset(page);
+ page_cpupid_reset_last(page);
+ SetPageReserved(page);
+
+ /*
+ * Mark the block movable so that blocks are reserved for
+ * movable at startup. This will force kernel allocations
+ * to reserve their blocks rather than leaking throughout
+ * the address space during boot when many long-lived
+ * kernel allocations are made. Later some blocks near
+ * the start are marked MIGRATE_RESERVE by
+ * setup_zone_migrate_reserve()
+ *
+ * bitmap is created for zone's valid pfn range. but memmap
+ * can be created for invalid pages (for alignment)
+ * check here not to call set_pageblock_migratetype() against
+ * pfn out of zone.
+ */
+ if ((z->zone_start_pfn <= pfn)
+ && (pfn < zone_end_pfn(z))
+ && !(pfn & (pageblock_nr_pages - 1)))
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+
+ INIT_LIST_HEAD(&page->lru);
+#ifdef WANT_PAGE_VIRTUAL
+ /* The shift won't overflow because ZONE_NORMAL is below 4G. */
+ if (!is_highmem_idx(zone))
+ set_page_address(page, __va(pfn << PAGE_SHIFT));
+#endif
+}
+
+static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone,
+ int nid)
+{
+ return __init_single_page(pfn_to_page(pfn), pfn, zone, nid);
+}
+
static bool free_pages_prepare(struct page *page, unsigned int order)
{
bool compound = PageCompound(page);
@@ -4124,7 +4169,6 @@ static void setup_zone_migrate_reserve(struct zone *zone)
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
unsigned long start_pfn, enum memmap_context context)
{
- struct page *page;
unsigned long end_pfn = start_pfn + size;
unsigned long pfn;
struct zone *z;
@@ -4145,38 +4189,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
if (!early_pfn_in_nid(pfn, nid))
continue;
}
- page = pfn_to_page(pfn);
- set_page_links(page, zone, nid, pfn);
- mminit_verify_page_links(page, zone, nid, pfn);
- init_page_count(page);
- page_mapcount_reset(page);
- page_cpupid_reset_last(page);
- SetPageReserved(page);
- /*
- * Mark the block movable so that blocks are reserved for
- * movable at startup. This will force kernel allocations
- * to reserve their blocks rather than leaking throughout
- * the address space during boot when many long-lived
- * kernel allocations are made. Later some blocks near
- * the start are marked MIGRATE_RESERVE by
- * setup_zone_migrate_reserve()
- *
- * bitmap is created for zone's valid pfn range. but memmap
- * can be created for invalid pages (for alignment)
- * check here not to call set_pageblock_migratetype() against
- * pfn out of zone.
- */
- if ((z->zone_start_pfn <= pfn)
- && (pfn < zone_end_pfn(z))
- && !(pfn & (pageblock_nr_pages - 1)))
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-
- INIT_LIST_HEAD(&page->lru);
-#ifdef WANT_PAGE_VIRTUAL
- /* The shift won't overflow because ZONE_NORMAL is below 4G. */
- if (!is_highmem_idx(zone))
- set_page_address(page, __va(pfn << PAGE_SHIFT));
-#endif
+ __init_single_pfn(pfn, zone, nid);
}
}

--
2.3.5

2015-04-23 10:36:00

by Mel Gorman

Subject: [PATCH 03/13] mm: meminit: Only set page reserved in the memblock region

From: Nathan Zimmer <[email protected]>

Currently each struct page is marked reserved as it is initialised. This patch
changes the approach so that pages start with the reserved bit clear and the
bit is then set only for pages that lie within a reserved memblock region.
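
Conceptually the boot path becomes (an illustrative sketch, not part of the
diff below):

    /* struct pages are now initialised with the reserved bit clear ... */
    __init_single_page(page, pfn, zone, nid);

    /* ... and only pages inside memblock.reserved are marked reserved */
    for_each_reserved_mem_region(i, &start, &end)
            reserve_bootmem_region(start, end);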

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nathan Zimmer <[email protected]>
---
include/linux/mm.h | 2 ++
mm/nobootmem.c | 3 +++
mm/page_alloc.c | 11 ++++++++++-
3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 47a93928b90f..b6f82a31028a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1711,6 +1711,8 @@ extern void free_highmem_page(struct page *page);
extern void adjust_managed_page_count(struct page *page, long count);
extern void mem_init_print_info(const char *str);

+extern void reserve_bootmem_region(unsigned long start, unsigned long end);
+
/* Free the reserved page into the buddy system, so it gets managed. */
static inline void __free_reserved_page(struct page *page)
{
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 90b50468333e..396f9e450dc1 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -121,6 +121,9 @@ static unsigned long __init free_low_memory_core_early(void)

memblock_clear_hotplug(0, -1);

+ for_each_reserved_mem_region(i, &start, &end)
+ reserve_bootmem_region(start, end);
+
for_each_free_mem_range(i, NUMA_NO_NODE, &start, &end, NULL)
count += __free_memory_core(start, end);

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd7a6d09062d..2abb3b861e70 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -788,7 +788,6 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);
- SetPageReserved(page);

/*
* Mark the block movable so that blocks are reserved for
@@ -823,6 +822,16 @@ static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone,
return __init_single_page(pfn_to_page(pfn), pfn, zone, nid);
}

+void reserve_bootmem_region(unsigned long start, unsigned long end)
+{
+ unsigned long start_pfn = PFN_DOWN(start);
+ unsigned long end_pfn = PFN_UP(end);
+
+ for (; start_pfn < end_pfn; start_pfn++)
+ if (pfn_valid(start_pfn))
+ SetPageReserved(pfn_to_page(start_pfn));
+}
+
static bool free_pages_prepare(struct page *page, unsigned int order)
{
bool compound = PageCompound(page);
--
2.3.5

2015-04-23 10:35:57

by Mel Gorman

Subject: [PATCH 04/13] mm: page_alloc: Pass PFN to __free_pages_bootmem

__free_pages_bootmem prepares a page for release to the buddy allocator and
assumes that the struct page is initialised. Parallel initialisation of struct
pages defers that initialisation, so __free_pages_bootmem can be called for
struct pages that have not been initialised yet and cannot be mapped back to
their PFN. This patch passes the PFN to __free_pages_bootmem with no other
functional change.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/bootmem.c | 8 ++++----
mm/internal.h | 3 ++-
mm/memblock.c | 2 +-
mm/nobootmem.c | 4 ++--
mm/page_alloc.c | 3 ++-
5 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/bootmem.c b/mm/bootmem.c
index 477be696511d..daf956bb4782 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -164,7 +164,7 @@ void __init free_bootmem_late(unsigned long physaddr, unsigned long size)
end = PFN_DOWN(physaddr + size);

for (; cursor < end; cursor++) {
- __free_pages_bootmem(pfn_to_page(cursor), 0);
+ __free_pages_bootmem(pfn_to_page(cursor), cursor, 0);
totalram_pages++;
}
}
@@ -210,7 +210,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
if (IS_ALIGNED(start, BITS_PER_LONG) && vec == ~0UL) {
int order = ilog2(BITS_PER_LONG);

- __free_pages_bootmem(pfn_to_page(start), order);
+ __free_pages_bootmem(pfn_to_page(start), start, order);
count += BITS_PER_LONG;
start += BITS_PER_LONG;
} else {
@@ -220,7 +220,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
while (vec && cur != start) {
if (vec & 1) {
page = pfn_to_page(cur);
- __free_pages_bootmem(page, 0);
+ __free_pages_bootmem(page, cur, 0);
count++;
}
vec >>= 1;
@@ -234,7 +234,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
pages = bootmem_bootmap_pages(pages);
count += pages;
while (pages--)
- __free_pages_bootmem(page++, 0);
+ __free_pages_bootmem(page++, cur++, 0);

bdebug("nid=%td released=%lx\n", bdata - bootmem_node_data, count);

diff --git a/mm/internal.h b/mm/internal.h
index a96da5b0029d..76b605139c7a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -155,7 +155,8 @@ __find_buddy_index(unsigned long page_idx, unsigned int order)
}

extern int __isolate_free_page(struct page *page, unsigned int order);
-extern void __free_pages_bootmem(struct page *page, unsigned int order);
+extern void __free_pages_bootmem(struct page *page, unsigned long pfn,
+ unsigned int order);
extern void prep_compound_page(struct page *page, unsigned long order);
#ifdef CONFIG_MEMORY_FAILURE
extern bool is_free_buddy_page(struct page *page);
diff --git a/mm/memblock.c b/mm/memblock.c
index e0cc2d174f74..f3e97d8eeb5c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1334,7 +1334,7 @@ void __init __memblock_free_late(phys_addr_t base, phys_addr_t size)
end = PFN_DOWN(base + size);

for (; cursor < end; cursor++) {
- __free_pages_bootmem(pfn_to_page(cursor), 0);
+ __free_pages_bootmem(pfn_to_page(cursor), cursor, 0);
totalram_pages++;
}
}
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 396f9e450dc1..bae652713ee5 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -77,7 +77,7 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size)
end = PFN_DOWN(addr + size);

for (; cursor < end; cursor++) {
- __free_pages_bootmem(pfn_to_page(cursor), 0);
+ __free_pages_bootmem(pfn_to_page(cursor), cursor, 0);
totalram_pages++;
}
}
@@ -92,7 +92,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end)
while (start + (1UL << order) > end)
order--;

- __free_pages_bootmem(pfn_to_page(start), order);
+ __free_pages_bootmem(pfn_to_page(start), start, order);

start += (1UL << order);
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2abb3b861e70..0a0e0f280d87 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -886,7 +886,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
local_irq_restore(flags);
}

-void __init __free_pages_bootmem(struct page *page, unsigned int order)
+void __init __free_pages_bootmem(struct page *page, unsigned long pfn,
+ unsigned int order)
{
unsigned int nr_pages = 1 << order;
struct page *p = page;
--
2.3.5

2015-04-23 10:35:53

by Mel Gorman

Subject: [PATCH 05/13] mm: meminit: Make __early_pfn_to_nid SMP-safe and introduce meminit_pfn_in_nid

The generic and arch-specific implementations of __early_pfn_to_nid() use
static variables to cache recent lookups. Without the cache, boot times are
much higher due to the excessive memblock lookups, but the cache assumes that
memory initialisation is single-threaded. Parallel initialisation of struct
pages will break that assumption, so this patch makes __early_pfn_to_nid()
SMP-safe by requiring the caller to cache recent search information.
early_pfn_to_nid() keeps the same interface but is only safe to use early in
boot due to its use of a global cache. meminit_pfn_in_nid() is an SMP-safe
version for which callers must maintain their own state.
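
As a rough sketch of the new interface (start_pfn, end_pfn and nid are assumed
to be in scope; the deferred init code added later in the series uses it this
way):

    struct mminit_pfnnid_cache nid_init_state = { };
    unsigned long pfn;

    for (pfn = start_pfn; pfn < end_pfn; pfn++) {
            /* each caller passes its own cache, so no shared static state */
            if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state))
                    continue;

            /* ... initialise the struct page for this pfn ... */
    }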

Signed-off-by: Mel Gorman <[email protected]>
---
arch/ia64/mm/numa.c | 19 +++++++------------
include/linux/mm.h | 6 ++++--
include/linux/mmzone.h | 16 +++++++++++++++-
mm/page_alloc.c | 40 +++++++++++++++++++++++++---------------
4 files changed, 51 insertions(+), 30 deletions(-)

diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c
index ea21d4cad540..aa19b7ac8222 100644
--- a/arch/ia64/mm/numa.c
+++ b/arch/ia64/mm/numa.c
@@ -58,27 +58,22 @@ paddr_to_nid(unsigned long paddr)
* SPARSEMEM to allocate the SPARSEMEM sectionmap on the NUMA node where
* the section resides.
*/
-int __meminit __early_pfn_to_nid(unsigned long pfn)
+int __meminit __early_pfn_to_nid(unsigned long pfn,
+ struct mminit_pfnnid_cache *state)
{
int i, section = pfn >> PFN_SECTION_SHIFT, ssec, esec;
- /*
- * NOTE: The following SMP-unsafe globals are only used early in boot
- * when the kernel is running single-threaded.
- */
- static int __meminitdata last_ssec, last_esec;
- static int __meminitdata last_nid;

- if (section >= last_ssec && section < last_esec)
- return last_nid;
+ if (section >= state->last_start && section < state->last_end)
+ return state->last_nid;

for (i = 0; i < num_node_memblks; i++) {
ssec = node_memblk[i].start_paddr >> PA_SECTION_SHIFT;
esec = (node_memblk[i].start_paddr + node_memblk[i].size +
((1L << PA_SECTION_SHIFT) - 1)) >> PA_SECTION_SHIFT;
if (section >= ssec && section < esec) {
- last_ssec = ssec;
- last_esec = esec;
- last_nid = node_memblk[i].nid;
+ state->last_start = ssec;
+ state->last_end = esec;
+ state->last_nid = node_memblk[i].nid;
return node_memblk[i].nid;
}
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b6f82a31028a..a8a8b161fd65 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1802,7 +1802,8 @@ extern void sparse_memory_present_with_active_regions(int nid);

#if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
!defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
-static inline int __early_pfn_to_nid(unsigned long pfn)
+static inline int __early_pfn_to_nid(unsigned long pfn,
+ struct mminit_pfnnid_cache *state)
{
return 0;
}
@@ -1810,7 +1811,8 @@ static inline int __early_pfn_to_nid(unsigned long pfn)
/* please see mm/page_alloc.c */
extern int __meminit early_pfn_to_nid(unsigned long pfn);
/* there is a per-arch backend function. */
-extern int __meminit __early_pfn_to_nid(unsigned long pfn);
+extern int __meminit __early_pfn_to_nid(unsigned long pfn,
+ struct mminit_pfnnid_cache *state);
#endif

extern void set_dma_reserve(unsigned long new_dma_reserve);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2782df47101e..a67b33e52dfe 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1216,10 +1216,24 @@ void sparse_init(void);
#define sparse_index_init(_sec, _nid) do {} while (0)
#endif /* CONFIG_SPARSEMEM */

+/*
+ * During memory init memblocks map pfns to nids. The search is expensive and
+ * this caches recent lookups. The implementation of __early_pfn_to_nid
+ * may treat start/end as pfns or sections.
+ */
+struct mminit_pfnnid_cache {
+ unsigned long last_start;
+ unsigned long last_end;
+ int last_nid;
+};
+
#ifdef CONFIG_NODES_SPAN_OTHER_NODES
bool early_pfn_in_nid(unsigned long pfn, int nid);
+bool meminit_pfn_in_nid(unsigned long pfn, int node,
+ struct mminit_pfnnid_cache *state);
#else
-#define early_pfn_in_nid(pfn, nid) (1)
+#define early_pfn_in_nid(pfn, nid) (1)
+#define meminit_pfn_in_nid(pfn, nid, state) (1)
#endif

#ifndef early_pfn_valid
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0a0e0f280d87..f556ed63b964 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4457,39 +4457,41 @@ int __meminit init_currently_empty_zone(struct zone *zone,

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
+
/*
* Required by SPARSEMEM. Given a PFN, return what node the PFN is on.
*/
-int __meminit __early_pfn_to_nid(unsigned long pfn)
+int __meminit __early_pfn_to_nid(unsigned long pfn,
+ struct mminit_pfnnid_cache *state)
{
unsigned long start_pfn, end_pfn;
int nid;
- /*
- * NOTE: The following SMP-unsafe globals are only used early in boot
- * when the kernel is running single-threaded.
- */
- static unsigned long __meminitdata last_start_pfn, last_end_pfn;
- static int __meminitdata last_nid;

- if (last_start_pfn <= pfn && pfn < last_end_pfn)
- return last_nid;
+ if (state->last_start <= pfn && pfn < state->last_end)
+ return state->last_nid;

nid = memblock_search_pfn_nid(pfn, &start_pfn, &end_pfn);
if (nid != -1) {
- last_start_pfn = start_pfn;
- last_end_pfn = end_pfn;
- last_nid = nid;
+ state->last_start = start_pfn;
+ state->last_end = end_pfn;
+ state->last_nid = nid;
}

return nid;
}
#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */

+struct __meminitdata mminit_pfnnid_cache global_init_state;
+
+/* Only safe to use early in boot when initialisation is single-threaded */
int __meminit early_pfn_to_nid(unsigned long pfn)
{
int nid;

- nid = __early_pfn_to_nid(pfn);
+ /* The system will behave unpredictably otherwise */
+ BUG_ON(system_state != SYSTEM_BOOTING);
+
+ nid = __early_pfn_to_nid(pfn, &global_init_state);
if (nid >= 0)
return nid;
/* just returns 0 */
@@ -4497,15 +4499,23 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
}

#ifdef CONFIG_NODES_SPAN_OTHER_NODES
-bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
+bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node,
+ struct mminit_pfnnid_cache *state)
{
int nid;

- nid = __early_pfn_to_nid(pfn);
+ nid = __early_pfn_to_nid(pfn, state);
if (nid >= 0 && nid != node)
return false;
return true;
}
+
+/* Only safe to use early in boot when initialisation is single-threaded */
+bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
+{
+ return meminit_pfn_in_nid(pfn, node, &global_init_state);
+}
+
#endif

/**
--
2.3.5

2015-04-23 10:35:50

by Mel Gorman

Subject: [PATCH 06/13] mm: meminit: Inline some helper functions

early_pfn_in_nid() and meminit_pfn_in_nid() are small functions that are
unnecessarily visible outside memory initialisation. As well as the unnecessary
visibility, being out of line adds function call overhead when initialising
pages. This patch moves the helpers inline.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 9 ------
mm/page_alloc.c | 75 +++++++++++++++++++++++++-------------------------
2 files changed, 38 insertions(+), 46 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a67b33e52dfe..e3d8a2bd8d78 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1227,15 +1227,6 @@ struct mminit_pfnnid_cache {
int last_nid;
};

-#ifdef CONFIG_NODES_SPAN_OTHER_NODES
-bool early_pfn_in_nid(unsigned long pfn, int nid);
-bool meminit_pfn_in_nid(unsigned long pfn, int node,
- struct mminit_pfnnid_cache *state);
-#else
-#define early_pfn_in_nid(pfn, nid) (1)
-#define meminit_pfn_in_nid(pfn, nid, state) (1)
-#endif
-
#ifndef early_pfn_valid
#define early_pfn_valid(pfn) (1)
#endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f556ed63b964..8b4659aa0bc2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -907,6 +907,44 @@ void __init __free_pages_bootmem(struct page *page, unsigned long pfn,
__free_pages(page, order);
}

+#if defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) || \
+ defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
+/* Only safe to use early in boot when initialisation is single-threaded */
+struct __meminitdata mminit_pfnnid_cache global_init_state;
+int __meminit early_pfn_to_nid(unsigned long pfn)
+{
+ int nid;
+
+ /* The system will behave unpredictably otherwise */
+ BUG_ON(system_state != SYSTEM_BOOTING);
+
+ nid = __early_pfn_to_nid(pfn, &global_init_state);
+ if (nid >= 0)
+ return nid;
+ /* just returns 0 */
+ return 0;
+}
+#endif
+
+#ifdef CONFIG_NODES_SPAN_OTHER_NODES
+static inline bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node,
+ struct mminit_pfnnid_cache *state)
+{
+ int nid;
+
+ nid = __early_pfn_to_nid(pfn, state);
+ if (nid >= 0 && nid != node)
+ return false;
+ return true;
+}
+
+/* Only safe to use early in boot when initialisation is single-threaded */
+static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
+{
+ return meminit_pfn_in_nid(pfn, node, &global_init_state);
+}
+#endif
+
#ifdef CONFIG_CMA
/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
void __init init_cma_reserved_pageblock(struct page *page)
@@ -4481,43 +4519,6 @@ int __meminit __early_pfn_to_nid(unsigned long pfn,
}
#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */

-struct __meminitdata mminit_pfnnid_cache global_init_state;
-
-/* Only safe to use early in boot when initialisation is single-threaded */
-int __meminit early_pfn_to_nid(unsigned long pfn)
-{
- int nid;
-
- /* The system will behave unpredictably otherwise */
- BUG_ON(system_state != SYSTEM_BOOTING);
-
- nid = __early_pfn_to_nid(pfn, &global_init_state);
- if (nid >= 0)
- return nid;
- /* just returns 0 */
- return 0;
-}
-
-#ifdef CONFIG_NODES_SPAN_OTHER_NODES
-bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node,
- struct mminit_pfnnid_cache *state)
-{
- int nid;
-
- nid = __early_pfn_to_nid(pfn, state);
- if (nid >= 0 && nid != node)
- return false;
- return true;
-}
-
-/* Only safe to use early in boot when initialisation is single-threaded */
-bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
-{
- return meminit_pfn_in_nid(pfn, node, &global_init_state);
-}
-
-#endif
-
/**
* free_bootmem_with_active_regions - Call memblock_free_early_nid for each active range
* @nid: The node to free memory on. If MAX_NUMNODES, all nodes are freed.
--
2.3.5

2015-04-23 10:35:22

by Mel Gorman

Subject: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set

This patch initialises all low memory struct pages and 2G of the highest zone
on each node during memory initialisation if CONFIG_DEFERRED_STRUCT_PAGE_INIT
is set. That config option cannot be set yet but will be made available by a
later patch. Parallel initialisation of struct pages depends on some features
from memory hotplug and it is necessary to alter some section annotations.
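
For reference, the "2G" threshold below is counted in pages; assuming 4K pages
it works out as

    2UL << (30 - PAGE_SHIFT) = 2UL << 18 = 524288 pages = 2GB per node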

Signed-off-by: Mel Gorman <[email protected]>
---
drivers/base/node.c | 11 +++++--
include/linux/mmzone.h | 8 ++++++
mm/Kconfig | 18 ++++++++++++
mm/internal.h | 8 ++++++
mm/page_alloc.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++--
5 files changed, 117 insertions(+), 6 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 36fabe43cd44..d03e976b4431 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -361,12 +361,16 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
#define page_initialized(page) (page->lru.next)

-static int get_nid_for_pfn(unsigned long pfn)
+static int get_nid_for_pfn(struct pglist_data *pgdat, unsigned long pfn)
{
struct page *page;

if (!pfn_valid_within(pfn))
return -1;
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+ if (pgdat && pfn >= pgdat->first_deferred_pfn)
+ return early_pfn_to_nid(pfn);
+#endif
page = pfn_to_page(pfn);
if (!page_initialized(page))
return -1;
@@ -378,6 +382,7 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
{
int ret;
unsigned long pfn, sect_start_pfn, sect_end_pfn;
+ struct pglist_data *pgdat = NODE_DATA(nid);

if (!mem_blk)
return -EFAULT;
@@ -390,7 +395,7 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
int page_nid;

- page_nid = get_nid_for_pfn(pfn);
+ page_nid = get_nid_for_pfn(pgdat, pfn);
if (page_nid < 0)
continue;
if (page_nid != nid)
@@ -429,7 +434,7 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
int nid;

- nid = get_nid_for_pfn(pfn);
+ nid = get_nid_for_pfn(NULL, pfn);
if (nid < 0)
continue;
if (!node_online(nid))
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e3d8a2bd8d78..4882c53b70b5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -762,6 +762,14 @@ typedef struct pglist_data {
/* Number of pages migrated during the rate limiting time interval */
unsigned long numabalancing_migrate_nr_pages;
#endif
+
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+ /*
+ * If memory initialisation on large machines is deferred then this
+ * is the first PFN that needs to be initialised.
+ */
+ unsigned long first_deferred_pfn;
+#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
} pg_data_t;

#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
diff --git a/mm/Kconfig b/mm/Kconfig
index a03131b6ba8e..3e40cb64e226 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -629,3 +629,21 @@ config MAX_STACK_SIZE_MB
changed to a smaller value in which case that is used.

A sane initial value is 80 MB.
+
+# For architectures that support deferred memory initialisation
+config ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
+ bool
+
+config DEFERRED_STRUCT_PAGE_INIT
+ bool "Defer initialisation of struct pages to kswapd"
+ default n
+ depends on ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
+ depends on MEMORY_HOTPLUG
+ help
+ Ordinarily all struct pages are initialised during early boot in a
+ single thread. On very large machines this can take a considerable
+ amount of time. If this option is set, large machines will bring up
+ a subset of memmap at boot and then initialise the rest in parallel
+ when kswapd starts. This has a potential performance impact on
processes running early in the lifetime of the system until kswapd
+ finishes the initialisation.
diff --git a/mm/internal.h b/mm/internal.h
index 76b605139c7a..4a73f74846bd 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -385,6 +385,14 @@ static inline void mminit_verify_zonelist(void)
}
#endif /* CONFIG_DEBUG_MEMORY_INIT */

+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+#define __defermem_init __meminit
+#define __defer_init __meminit
+#else
+#define __defermem_init
+#define __defer_init __init
+#endif
+
/* mminit_validate_memmodel_limits is independent of CONFIG_DEBUG_MEMORY_INIT */
#if defined(CONFIG_SPARSEMEM)
extern void mminit_validate_memmodel_limits(unsigned long *start_pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8b4659aa0bc2..c7c2d20c8bb5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -235,6 +235,64 @@ EXPORT_SYMBOL(nr_online_nodes);

int page_group_by_mobility_disabled __read_mostly;

+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+static inline void reset_deferred_meminit(pg_data_t *pgdat)
+{
+ pgdat->first_deferred_pfn = ULONG_MAX;
+}
+
+/* Returns true if the struct page for the pfn is uninitialised */
+static inline bool __defermem_init early_page_uninitialised(unsigned long pfn)
+{
+ int nid = early_pfn_to_nid(pfn);
+
+ if (pfn >= NODE_DATA(nid)->first_deferred_pfn)
+ return true;
+
+ return false;
+}
+
+/*
+ * Returns false when the remaining initialisation should be deferred until
+ * later in the boot cycle when it can be parallelised.
+ */
+static inline bool update_defer_init(pg_data_t *pgdat,
+ unsigned long pfn, unsigned long zone_end,
+ unsigned long *nr_initialised)
+{
+ /* Always populate low zones for address-constrained allocations */
+ if (zone_end < pgdat_end_pfn(pgdat))
+ return true;
+
+ /* Initialise at least 2G of the highest zone */
+ (*nr_initialised)++;
+ if (*nr_initialised > (2UL << (30 - PAGE_SHIFT)) &&
+ (pfn & (PAGES_PER_SECTION - 1)) == 0) {
+ pgdat->first_deferred_pfn = pfn;
+ return false;
+ }
+
+ return true;
+}
+#else
+static inline void reset_deferred_meminit(pg_data_t *pgdat)
+{
+}
+
+static inline bool early_page_uninitialised(unsigned long pfn)
+{
+ return false;
+}
+
+static inline bool update_defer_init(pg_data_t *pgdat,
+ unsigned long pfn, unsigned long zone_end,
+ unsigned long *nr_initialised)
+{
+ return true;
+}
+#endif
+
+
void set_pageblock_migratetype(struct page *page, int migratetype)
{
if (unlikely(page_group_by_mobility_disabled &&
@@ -886,8 +944,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
local_irq_restore(flags);
}

-void __init __free_pages_bootmem(struct page *page, unsigned long pfn,
- unsigned int order)
+static void __defer_init __free_pages_boot_core(struct page *page,
+ unsigned long pfn, unsigned int order)
{
unsigned int nr_pages = 1 << order;
struct page *p = page;
@@ -945,6 +1003,14 @@ static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
}
#endif

+void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn,
+ unsigned int order)
+{
+ if (early_page_uninitialised(pfn))
+ return;
+ return __free_pages_boot_core(page, pfn, order);
+}
+
#ifdef CONFIG_CMA
/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
void __init init_cma_reserved_pageblock(struct page *page)
@@ -4217,14 +4283,16 @@ static void setup_zone_migrate_reserve(struct zone *zone)
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
unsigned long start_pfn, enum memmap_context context)
{
+ pg_data_t *pgdat = NODE_DATA(nid);
unsigned long end_pfn = start_pfn + size;
unsigned long pfn;
struct zone *z;
+ unsigned long nr_initialised = 0;

if (highest_memmap_pfn < end_pfn - 1)
highest_memmap_pfn = end_pfn - 1;

- z = &NODE_DATA(nid)->node_zones[zone];
+ z = &pgdat->node_zones[zone];
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
/*
* There can be holes in boot-time mem_map[]s
@@ -4236,6 +4304,9 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
continue;
if (!early_pfn_in_nid(pfn, nid))
continue;
+ if (!update_defer_init(pgdat, pfn, end_pfn,
+ &nr_initialised))
+ break;
}
__init_single_pfn(pfn, zone, nid);
}
@@ -5037,6 +5108,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
/* pg_data_t should be reset to zero when it's allocated */
WARN_ON(pgdat->nr_zones || pgdat->classzone_idx);

+ reset_deferred_meminit(pgdat);
pgdat->node_id = nid;
pgdat->node_start_pfn = node_start_pfn;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
--
2.3.5

2015-04-23 10:35:05

by Mel Gorman

Subject: [PATCH 08/13] mm: meminit: Initialise remaining struct pages in parallel with kswapd

Only a subset of struct pages are initialised at the moment. When this patch
is applied, kswapd initialises the remaining struct pages in parallel. This
should boot faster by spreading the work to multiple CPUs and initialising
data that is local to the CPU. The user-visible effect on large machines is
that free memory will appear to rapidly increase early in the lifetime of the
system until kswapd reports that all memory is initialised in the kernel log.
Once initialised there should be no other user-visible effects.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/internal.h | 6 +++
mm/mm_init.c | 1 +
mm/page_alloc.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
mm/vmscan.c | 6 ++-
4 files changed, 123 insertions(+), 6 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 4a73f74846bd..2c4057140bec 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -388,9 +388,15 @@ static inline void mminit_verify_zonelist(void)
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
#define __defermem_init __meminit
#define __defer_init __meminit
+
+void deferred_init_memmap(int nid);
#else
#define __defermem_init
#define __defer_init __init
+
+static inline void deferred_init_memmap(int nid)
+{
+}
#endif

/* mminit_validate_memmodel_limits is independent of CONFIG_DEBUG_MEMORY_INIT */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 5f420f7fafa1..28fbf87b20aa 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -11,6 +11,7 @@
#include <linux/export.h>
#include <linux/memory.h>
#include <linux/notifier.h>
+#include <linux/sched.h>
#include "internal.h"

#ifdef CONFIG_DEBUG_MEMORY_INIT
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c7c2d20c8bb5..f2db3d7aa6cb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -252,6 +252,14 @@ static inline bool __defermem_init early_page_uninitialised(unsigned long pfn)
return false;
}

+static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
+{
+ if (pfn >= NODE_DATA(nid)->first_deferred_pfn)
+ return true;
+
+ return false;
+}
+
/*
* Returns false when the remaining initialisation should be deferred until
* later in the boot cycle when it can be parallelised.
@@ -284,6 +292,11 @@ static inline bool early_page_uninitialised(unsigned long pfn)
return false;
}

+static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
+{
+ return false;
+}
+
static inline bool update_defer_init(pg_data_t *pgdat,
unsigned long pfn, unsigned long zone_end,
unsigned long *nr_initialised)
@@ -880,14 +893,45 @@ static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone,
return __init_single_page(pfn_to_page(pfn), pfn, zone, nid);
}

-void reserve_bootmem_region(unsigned long start, unsigned long end)
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+static void init_reserved_page(unsigned long pfn)
+{
+ pg_data_t *pgdat;
+ int nid, zid;
+
+ if (!early_page_uninitialised(pfn))
+ return;
+
+ nid = early_pfn_to_nid(pfn);
+ pgdat = NODE_DATA(nid);
+
+ for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ struct zone *zone = &pgdat->node_zones[zid];
+
+ if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone))
+ break;
+ }
+ __init_single_pfn(pfn, zid, nid);
+}
+#else
+static inline void init_reserved_page(unsigned long pfn)
+{
+}
+#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+
+void __meminit reserve_bootmem_region(unsigned long start, unsigned long end)
{
unsigned long start_pfn = PFN_DOWN(start);
unsigned long end_pfn = PFN_UP(end);

- for (; start_pfn < end_pfn; start_pfn++)
- if (pfn_valid(start_pfn))
- SetPageReserved(pfn_to_page(start_pfn));
+ for (; start_pfn < end_pfn; start_pfn++) {
+ if (pfn_valid(start_pfn)) {
+ struct page *page = pfn_to_page(start_pfn);
+
+ init_reserved_page(start_pfn);
+ SetPageReserved(page);
+ }
+ }
}

static bool free_pages_prepare(struct page *page, unsigned int order)
@@ -1011,6 +1055,67 @@ void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn,
return __free_pages_boot_core(page, pfn, order);
}

+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+/* Initialise remaining memory on a node */
+void __defermem_init deferred_init_memmap(int nid)
+{
+ unsigned long start = jiffies;
+ struct mminit_pfnnid_cache nid_init_state = { };
+
+ pg_data_t *pgdat = NODE_DATA(nid);
+ int zid;
+ unsigned long first_init_pfn = pgdat->first_deferred_pfn;
+
+ if (first_init_pfn == ULONG_MAX)
+ return;
+
+ /* Sanity check boundaries */
+ BUG_ON(pgdat->first_deferred_pfn < pgdat->node_start_pfn);
+ BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
+ pgdat->first_deferred_pfn = ULONG_MAX;
+
+ for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ struct zone *zone = pgdat->node_zones + zid;
+ unsigned long walk_start, walk_end;
+ int i;
+
+ for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) {
+ unsigned long pfn, end_pfn;
+
+ end_pfn = min(walk_end, zone_end_pfn(zone));
+ pfn = first_init_pfn;
+ if (pfn < walk_start)
+ pfn = walk_start;
+ if (pfn < zone->zone_start_pfn)
+ pfn = zone->zone_start_pfn;
+
+ for (; pfn < end_pfn; pfn++) {
+ struct page *page;
+
+ if (!pfn_valid(pfn))
+ continue;
+
+ if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state))
+ continue;
+
+ page = pfn_to_page(pfn);
+
+ if (page->flags) {
+ VM_BUG_ON(page_zone(page) != zone);
+ continue;
+ }
+
+ __init_single_page(page, pfn, zid, nid);
+ __free_pages_boot_core(page, pfn, 0);
+ cond_resched();
+ }
+ first_init_pfn = max(end_pfn, first_init_pfn);
+ }
+ }
+
+ pr_info("kswapd %d initialised deferred memory in %ums\n", nid,
+ jiffies_to_msecs(jiffies - start));
+}
+#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+
#ifdef CONFIG_CMA
/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
void __init init_cma_reserved_pageblock(struct page *page)
@@ -4221,6 +4326,9 @@ static void setup_zone_migrate_reserve(struct zone *zone)
zone->nr_migrate_reserve_block = reserve;

for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+ if (!early_page_nid_uninitialised(pfn, zone_to_nid(zone)))
+ return;
+
if (!pfn_valid(pfn))
continue;
page = pfn_to_page(pfn);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd71bac..c4895d26d036 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3348,7 +3348,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
* If there are applications that are active memory-allocators
* (most normal use), this basically shouldn't matter.
*/
-static int kswapd(void *p)
+static int __defermem_init kswapd(void *p)
{
unsigned long order, new_order;
unsigned balanced_order;
@@ -3383,6 +3383,8 @@ static int kswapd(void *p)
tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
set_freezable();

+ deferred_init_memmap(pgdat->node_id);
+
order = new_order = 0;
balanced_order = 0;
classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
@@ -3538,7 +3540,7 @@ static int cpu_callback(struct notifier_block *nfb, unsigned long action,
* This kswapd start function will be called by init and node-hot-add.
* On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
*/
-int kswapd_run(int nid)
+int __defermem_init kswapd_run(int nid)
{
pg_data_t *pgdat = NODE_DATA(nid);
int ret = 0;
--
2.3.5

2015-04-23 10:33:33

by Mel Gorman

Subject: [PATCH 09/13] mm: meminit: Minimise number of pfn->page lookups during initialisation

Deferred struct page initialisation is using pfn_to_page() on every PFN
unnecessarily. This patch minimises the number of lookups and scheduler
checks.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f2db3d7aa6cb..11125634e375 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1081,6 +1081,7 @@ void __defermem_init deferred_init_memmap(int nid)

for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) {
unsigned long pfn, end_pfn;
+ struct page *page = NULL;

end_pfn = min(walk_end, zone_end_pfn(zone));
pfn = first_init_pfn;
@@ -1090,13 +1091,32 @@ void __defermem_init deferred_init_memmap(int nid)
pfn = zone->zone_start_pfn;

for (; pfn < end_pfn; pfn++) {
- struct page *page;
-
- if (!pfn_valid(pfn))
+ if (!pfn_valid_within(pfn))
continue;

- if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state))
+ /*
+ * Ensure pfn_valid is checked every
+ * MAX_ORDER_NR_PAGES for memory holes
+ */
+ if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
+ if (!pfn_valid(pfn)) {
+ page = NULL;
+ continue;
+ }
+ }
+
+ if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
+ page = NULL;
continue;
+ }
+
+ /* Minimise pfn page lookups and scheduler checks */
+ if (page && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0) {
+ page++;
+ } else {
+ page = pfn_to_page(pfn);
+ cond_resched();
+ }

if (page->flags) {
VM_BUG_ON(page_zone(page) != zone);
@@ -1105,7 +1125,6 @@ void __defermem_init deferred_init_memmap(int nid)

__init_single_page(page, pfn, zid, nid);
__free_pages_boot_core(page, pfn, 0);
- cond_resched();
}
first_init_pfn = max(end_pfn, first_init_pfn);
}
--
2.3.5

2015-04-23 10:34:04

by Mel Gorman

Subject: [PATCH 10/13] x86: mm: Enable deferred struct page initialisation on x86-64

Subject says it all. Other architectures may enable on a case-by-case
basis after auditing early_pfn_to_nid and testing.

Signed-off-by: Mel Gorman <[email protected]>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca55187..1beff8a8fbc9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -18,6 +18,7 @@ config X86_64
select X86_DEV_DMA_OPS
select ARCH_USE_CMPXCHG_LOCKREF
select HAVE_LIVEPATCH
+ select ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT

### Arch settings
config X86
--
2.3.5

2015-04-23 10:34:01

by Mel Gorman

Subject: [PATCH 11/13] mm: meminit: Free pages in large chunks where possible

Parallel struct page initialisation frees pages to the buddy allocator one at
a time. This patch frees pages as single large pages where possible.
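
For context, assuming the common x86-64 configuration of MAX_ORDER == 11 with
4K pages, a single batched free covers

    MAX_ORDER_NR_PAGES = 1 << (MAX_ORDER - 1) = 1024 pages = 4MB

so an aligned, fully-initialised block is handed to the buddy allocator with
one __free_pages_boot_core() call instead of 1024 order-0 calls.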

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 41 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 11125634e375..73077dc63f0c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1056,6 +1056,20 @@ void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn,
}

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+void __defermem_init deferred_free_range(struct page *page, unsigned long pfn,
+ int nr_pages)
+{
+ int i;
+
+ if (nr_pages == MAX_ORDER_NR_PAGES && (pfn & (MAX_ORDER_NR_PAGES-1)) == 0) {
+ __free_pages_boot_core(page, pfn, MAX_ORDER-1);
+ return;
+ }
+
+ for (i = 0; i < nr_pages; i++, page++, pfn++)
+ __free_pages_boot_core(page, pfn, 0);
+}
+
/* Initialise remaining memory on a node */
void __defermem_init deferred_init_memmap(int nid)
{
@@ -1082,6 +1096,9 @@ void __defermem_init deferred_init_memmap(int nid)
for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) {
unsigned long pfn, end_pfn;
struct page *page = NULL;
+ struct page *free_base_page = NULL;
+ unsigned long free_base_pfn = 0;
+ int nr_to_free = 0;

end_pfn = min(walk_end, zone_end_pfn(zone));
pfn = first_init_pfn;
@@ -1092,7 +1109,7 @@ void __defermem_init deferred_init_memmap(int nid)

for (; pfn < end_pfn; pfn++) {
if (!pfn_valid_within(pfn))
- continue;
+ goto free_range;

/*
* Ensure pfn_valid is checked every
@@ -1101,30 +1118,49 @@ void __defermem_init deferred_init_memmap(int nid)
if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
if (!pfn_valid(pfn)) {
page = NULL;
- continue;
+ goto free_range;
}
}

if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
page = NULL;
- continue;
+ goto free_range;
}

/* Minimise pfn page lookups and scheduler checks */
if (page && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0) {
page++;
} else {
+ deferred_free_range(free_base_page,
+ free_base_pfn, nr_to_free);
+ free_base_page = NULL;
+ free_base_pfn = nr_to_free = 0;
+
page = pfn_to_page(pfn);
cond_resched();
}

if (page->flags) {
VM_BUG_ON(page_zone(page) != zone);
- continue;
+ goto free_range;
}

__init_single_page(page, pfn, zid, nid);
- __free_pages_boot_core(page, pfn, 0);
+ if (!free_base_page) {
+ free_base_page = page;
+ free_base_pfn = pfn;
+ nr_to_free = 0;
+ }
+ nr_to_free++;
+
+ /* Where possible, batch up pages for a single free */
+ continue;
+free_range:
+ /* Free the current block of pages to allocator */
+ if (free_base_page)
+ deferred_free_range(free_base_page, free_base_pfn, nr_to_free);
+ free_base_page = NULL;
+ free_base_pfn = nr_to_free = 0;
}
first_init_pfn = max(end_pfn, first_init_pfn);
}
--
2.3.5

2015-04-23 10:34:00

by Mel Gorman

Subject: [PATCH 12/13] mm: meminit: Reduce number of times pageblocks are set during struct page init

During parallel struct page initialisation, ranges are checked for every
PFN unnecessarily, which increases boot times. This patch alters when the
ranges are checked.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 45 +++++++++++++++++++++++----------------------
1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 73077dc63f0c..576b03bc9057 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -852,33 +852,12 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
{
- struct zone *z = &NODE_DATA(nid)->node_zones[zone];
-
set_page_links(page, zone, nid, pfn);
mminit_verify_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);

- /*
- * Mark the block movable so that blocks are reserved for
- * movable at startup. This will force kernel allocations
- * to reserve their blocks rather than leaking throughout
- * the address space during boot when many long-lived
- * kernel allocations are made. Later some blocks near
- * the start are marked MIGRATE_RESERVE by
- * setup_zone_migrate_reserve()
- *
- * bitmap is created for zone's valid pfn range. but memmap
- * can be created for invalid pages (for alignment)
- * check here not to call set_pageblock_migratetype() against
- * pfn out of zone.
- */
- if ((z->zone_start_pfn <= pfn)
- && (pfn < zone_end_pfn(z))
- && !(pfn & (pageblock_nr_pages - 1)))
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-
INIT_LIST_HEAD(&page->lru);
#ifdef WANT_PAGE_VIRTUAL
/* The shift won't overflow because ZONE_NORMAL is below 4G. */
@@ -1062,6 +1041,7 @@ void __defermem_init deferred_free_range(struct page *page, unsigned long pfn,
int i;

if (nr_pages == MAX_ORDER_NR_PAGES && (pfn & (MAX_ORDER_NR_PAGES-1)) == 0) {
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
__free_pages_boot_core(page, pfn, MAX_ORDER-1);
return;
}
@@ -4471,7 +4451,28 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
&nr_initialised))
break;
}
- __init_single_pfn(pfn, zone, nid);
+
+ /*
+ * Mark the block movable so that blocks are reserved for
+ * movable at startup. This will force kernel allocations
+ * to reserve their blocks rather than leaking throughout
+ * the address space during boot when many long-lived
+ * kernel allocations are made. Later some blocks near
+ * the start are marked MIGRATE_RESERVE by
+ * setup_zone_migrate_reserve()
+ *
+ * bitmap is created for zone's valid pfn range. but memmap
+ * can be created for invalid pages (for alignment)
+ * check here not to call set_pageblock_migratetype() against
+ * pfn out of zone.
+ */
+ if (!(pfn & (pageblock_nr_pages - 1))) {
+ struct page *page = pfn_to_page(pfn);
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+ __init_single_page(page, pfn, zone, nid);
+ } else {
+ __init_single_pfn(pfn, zone, nid);
+ }
}
}

--
2.3.5

2015-04-23 10:33:34

by Mel Gorman

Subject: [PATCH 13/13] mm: meminit: Remove mminit_verify_page_links

mminit_verify_page_links() is an extremely paranoid check that was introduced
when memory initialisation was being heavily reworked. Profiles indicated
that up to 10% of parallel memory initialisation was spent on checking
this for every page. The cost could be reduced but in practice this check
only found problems very early during the initialisation rewrite and has
found nothing since. This patch removes an expensive unnecessary check.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/internal.h | 8 --------
mm/mm_init.c | 8 --------
mm/page_alloc.c | 1 -
3 files changed, 17 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 2c4057140bec..c73ad248f8f4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -360,10 +360,7 @@ do { \
} while (0)

extern void mminit_verify_pageflags_layout(void);
-extern void mminit_verify_page_links(struct page *page,
- enum zone_type zone, unsigned long nid, unsigned long pfn);
extern void mminit_verify_zonelist(void);
-
#else

static inline void mminit_dprintk(enum mminit_level level,
@@ -375,11 +372,6 @@ static inline void mminit_verify_pageflags_layout(void)
{
}

-static inline void mminit_verify_page_links(struct page *page,
- enum zone_type zone, unsigned long nid, unsigned long pfn)
-{
-}
-
static inline void mminit_verify_zonelist(void)
{
}
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 28fbf87b20aa..fdadf918de76 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -131,14 +131,6 @@ void __init mminit_verify_pageflags_layout(void)
BUG_ON(or_mask != add_mask);
}

-void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone,
- unsigned long nid, unsigned long pfn)
-{
- BUG_ON(page_to_nid(page) != nid);
- BUG_ON(page_zonenum(page) != zone);
- BUG_ON(page_to_pfn(page) != pfn);
-}
-
static __init int set_mminit_loglevel(char *str)
{
get_option(&str, &mminit_loglevel);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 576b03bc9057..739b1840de2c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -853,7 +853,6 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
{
set_page_links(page, zone, nid, pfn);
- mminit_verify_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);
--
2.3.5

2015-04-23 15:54:18

by Daniel J Blueman

Subject: Re: [PATCH 0/13] Parallel struct page initialisation v3

On Thu, Apr 23, 2015 at 6:33 PM, Mel Gorman <[email protected]> wrote:
> The big change here is an adjustment to the topology_init path. It had
> caused soft lockups on Waiman's machine and Daniel Blueman had reported
> that it was an expensive function.
>
> Changelog since v2
> o Reduce overhead of topology_init
> o Remove boot-time kernel parameter to enable/disable
> o Enable on UMA
>
> Changelog since v1
> o Always initialise low zones
> o Typo corrections
> o Rename parallel mem init to parallel struct page init
> o Rebase to 4.0
[]

Splendid work! On this 256c setup, topology_init now takes 185ms.

This brings the kernel boot time down to 324s [1]. It turns out that one
memset is responsible for most of the time spent setting up the PUDs and
PMDs; adapting memset to use non-temporal writes [3] avoids generating RMW
cycles, bringing boot time down to 186s [2].

If this is a possibility, I can split this patch and map other architectures'
memset_nocache to memset, or change the callsite as preferred; comments
welcome.

Thanks,
Daniel

[1] https://resources.numascale.com/telemetry/defermem/h8qgl-defer2.txt
[2]
https://resources.numascale.com/telemetry/defermem/h8qgl-defer2-nontemporal.txt

-- [3]

From f822139736cab8434302693c635fa146b465273c Mon Sep 17 00:00:00 2001
From: Daniel J Blueman <[email protected]>
Date: Thu, 23 Apr 2015 23:26:27 +0800
Subject: [RFC] Speedup PMD setup

Using non-temporal writes prevents read-modify-write cycles,
which are much slower over large topologies.

Adapt the existing memset() function into a _nocache variant and use
when setting up PMDs during early boot to reduce boot time.

Signed-off-by: Daniel J Blueman <[email protected]>
---
arch/x86/include/asm/string_64.h | 3 ++
arch/x86/lib/memset_64.S | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++
mm/memblock.c | 2 +-
3 files changed, 94 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index e466119..1ef28d0 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -55,6 +55,8 @@ extern void *memcpy(void *to, const void *from, size_t len);
#define __HAVE_ARCH_MEMSET
void *memset(void *s, int c, size_t n);
void *__memset(void *s, int c, size_t n);
+void *memset_nocache(void *s, int c, size_t n);
+void *__memset_nocache(void *s, int c, size_t n);

#define __HAVE_ARCH_MEMMOVE
void *memmove(void *dest, const void *src, size_t count);
@@ -77,6 +79,7 @@ int strcmp(const char *cs, const char *ct);
#define memcpy(dst, src, len) __memcpy(dst, src, len)
#define memmove(dst, src, len) __memmove(dst, src, len)
#define memset(s, c, n) __memset(s, c, n)
+#define memset_nocache(s, c, n) __memset_nocache(s, c, n)
#endif

#endif /* __KERNEL__ */
diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index 6f44935..fb46f78 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -137,6 +137,96 @@ ENTRY(__memset)
ENDPROC(memset)
ENDPROC(__memset)

+/*
+ * memset_nocache - set a memory block to zero. This function uses
+ * non-temporal writes in the fastpath
+ *
+ * rdi destination
+ * rsi value (char)
+ * rdx count (bytes)
+ *
+ * rax original destination
+ */
+
+ENTRY(memset_nocache)
+ENTRY(__memset_nocache)
+ CFI_STARTPROC
+ movq %rdi,%r10
+
+ /* expand byte value */
+ movzbl %sil,%ecx
+ movabs $0x0101010101010101,%rax
+ imulq %rcx,%rax
+
+ /* align dst */
+ movl %edi,%r9d
+ andl $7,%r9d
+ jnz bad_alignment
+ CFI_REMEMBER_STATE
+after_bad_alignment:
+
+ movq %rdx,%rcx
+ shrq $6,%rcx
+ jz handle_tail
+
+ .p2align 4
+loop_64:
+ decq %rcx
+ movnti %rax,(%rdi)
+ movnti %rax,8(%rdi)
+ movnti %rax,16(%rdi)
+ movnti %rax,24(%rdi)
+ movnti %rax,32(%rdi)
+ movnti %rax,40(%rdi)
+ movnti %rax,48(%rdi)
+ movnti %rax,56(%rdi)
+ leaq 64(%rdi),%rdi
+ jnz loop_64
+
+ /* Handle tail in loops; the loops should be faster than hard
+ to predict jump tables */
+ .p2align 4
+handle_tail:
+ movl %edx,%ecx
+ andl $63&(~7),%ecx
+ jz handle_7
+ shrl $3,%ecx
+ .p2align 4
+loop_8:
+ decl %ecx
+ movnti %rax,(%rdi)
+ leaq 8(%rdi),%rdi
+ jnz loop_8
+
+handle_7:
+ andl $7,%edx
+ jz ende
+ .p2align 4
+loop_1:
+ decl %edx
+ movb %al,(%rdi)
+ leaq 1(%rdi),%rdi
+ jnz loop_1
+
+ende:
+ movq %r10,%rax
+ ret
+
+ CFI_RESTORE_STATE
+bad_alignment:
+ cmpq $7,%rdx
+ jbe handle_7
+ movnti %rax,(%rdi) /* unaligned store */
+ movq $8,%r8
+ subq %r9,%r8
+ addq %r8,%rdi
+ subq %r8,%rdx
+ jmp after_bad_alignment
+final:
+ CFI_ENDPROC
+ENDPROC(memset_nocache)
+ENDPROC(__memset_nocache)
+
/* Some CPUs support enhanced REP MOVSB/STOSB feature.
* It is recommended to use this when possible.
*
diff --git a/mm/memblock.c b/mm/memblock.c
index f3e97d8..df434d2 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1212,7 +1212,7 @@ again:
done:
memblock_reserve(alloc, size);
ptr = phys_to_virt(alloc);
- memset(ptr, 0, size);
+ memset_nocache(ptr, 0, size);

/*
* The min_count is set to 0 so that bootmem allocated blocks

2015-04-23 15:56:15

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set

On Thu, Apr 23, 2015 at 11:33:10AM +0100, Mel Gorman wrote:
> This patch initalises all low memory struct pages and 2G of the highest zone
> on each node during memory initialisation if CONFIG_DEFERRED_STRUCT_PAGE_INIT
> is set. That config option cannot be set but will be available in a later
> patch. Parallel initialisation of struct page depends on some features
> from memory hotplug and it is necessary to alter alter section annotations.
>
> Signed-off-by: Mel Gorman <[email protected]>

I belatedly noticed that this causes section warnings. It'll be harmless
for testing but the next (hopefully last) version will have this on top

diff --git a/drivers/base/node.c b/drivers/base/node.c
index d03e976b4431..97ab2c4dd39e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -361,14 +361,14 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
#define page_initialized(page) (page->lru.next)

-static int get_nid_for_pfn(struct pglist_data *pgdat, unsigned long pfn)
+static int __init_refok get_nid_for_pfn(unsigned long pfn)
{
struct page *page;

if (!pfn_valid_within(pfn))
return -1;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
- if (pgdat && pfn >= pgdat->first_deferred_pfn)
+ if (system_state == SYSTEM_BOOTING)
return early_pfn_to_nid(pfn);
#endif
page = pfn_to_page(pfn);
@@ -382,7 +382,6 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
{
int ret;
unsigned long pfn, sect_start_pfn, sect_end_pfn;
- struct pglist_data *pgdat = NODE_DATA(nid);

if (!mem_blk)
return -EFAULT;
@@ -395,7 +394,7 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
int page_nid;

- page_nid = get_nid_for_pfn(pgdat, pfn);
+ page_nid = get_nid_for_pfn(pfn);
if (page_nid < 0)
continue;
if (page_nid != nid)
@@ -434,7 +433,7 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
int nid;

- nid = get_nid_for_pfn(NULL, pfn);
+ nid = get_nid_for_pfn(pfn);
if (nid < 0)
continue;
if (!node_online(nid))

2015-04-23 16:30:48

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v3

On Thu, Apr 23, 2015 at 11:53:57PM +0800, Daniel J Blueman wrote:
> On Thu, Apr 23, 2015 at 6:33 PM, Mel Gorman <[email protected]> wrote:
> >The big change here is an adjustment to the topology_init path
> >that caused
> >soft lockups on Waiman and Daniel Blue had reported it was an
> >expensive
> >function.
> >
> >Changelog since v2
> >o Reduce overhead of topology_init
> >o Remove boot-time kernel parameter to enable/disable
> >o Enable on UMA
> >
> >Changelog since v1
> >o Always initialise low zones
> >o Typo corrections
> >o Rename parallel mem init to parallel struct page init
> >o Rebase to 4.0
> []
>
> Splendid work! On this 256c setup, topology_init now takes 185ms.
>
> This brings the kernel boot time down to 324s [1].

Good stuff. Am I correct in thinking that the vanilla kernel takes 732s?

> It turns out that
> one memset is responsible for most of the time spent setting up the
> PUDs and PMDs; adapting memset to use non-temporal writes [3]
> avoids generating RMW cycles, bringing boot time down to 186s [2].
>
> If this is a possibility, I can split this patch and map other
> arch's memset_nocache to memset, or change the callsite as
> preferred; comments welcome.
>

In general, I see no problem with the patch and it would be useful
going in before or after this series. I would suggest you split this into
three patches. The first would be an asm-generic alias of memset_nocache
to memset with documentation saying it's optional for an architecture to
implement. The second would be your implementation for x86 that needs to
go to the x86 maintainers. The third would then be the memblock.c change.
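
For the first patch, a minimal sketch of such a fallback (the
__HAVE_ARCH_MEMSET_NOCACHE guard name is illustrative; nothing with that
name exists today):

/* Illustrative fallback; the guard name is invented for this sketch */
#ifndef __HAVE_ARCH_MEMSET_NOCACHE
#define memset_nocache(s, c, n) memset((s), (c), (n))
#endif

An architecture with an optimised variant would define the guard next to
its own memset_nocache(); everything else transparently falls back to
the cached memset.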

Thanks.

--
Mel Gorman
SUSE Labs

2015-04-24 19:48:17

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v3

On 04/23/2015 11:53 AM, Daniel J Blueman wrote:
> On Thu, Apr 23, 2015 at 6:33 PM, Mel Gorman <[email protected]> wrote:
>> The big change here is an adjustment to the topology_init path that
>> caused
>> soft lockups on Waiman and Daniel Blue had reported it was an expensive
>> function.
>>
>> Changelog since v2
>> o Reduce overhead of topology_init
>> o Remove boot-time kernel parameter to enable/disable
>> o Enable on UMA
>>
>> Changelog since v1
>> o Always initialise low zones
>> o Typo corrections
>> o Rename parallel mem init to parallel struct page init
>> o Rebase to 4.0
> []
>
> Splendid work! On this 256c setup, topology_init now takes 185ms.
>
> This brings the kernel boot time down to 324s [1]. It turns out that
> one memset is responsible for most of the time spent setting up the PUDs
> and PMDs; adapting memset to use non-temporal writes [3] avoids
> generating RMW cycles, bringing boot time down to 186s [2].
>
> If this is a possibility, I can split this patch and map other arch's
> memset_nocache to memset, or change the callsite as preferred;
> comments welcome.
>
> Thanks,
> Daniel
>
> [1] https://resources.numascale.com/telemetry/defermem/h8qgl-defer2.txt
> [2]
> https://resources.numascale.com/telemetry/defermem/h8qgl-defer2-nontemporal.txt
>
> -- [3]
>
> From f822139736cab8434302693c635fa146b465273c Mon Sep 17 00:00:00 2001
> From: Daniel J Blueman <[email protected]>
> Date: Thu, 23 Apr 2015 23:26:27 +0800
> Subject: [RFC] Speedup PMD setup
>
> Using non-temporal writes prevents read-modify-write cycles,
> which are much slower over large topologies.
>
> Adapt the existing memset() function into a _nocache variant and use
> when setting up PMDs during early boot to reduce boot time.
>
> Signed-off-by: Daniel J Blueman <[email protected]>
> ---
> arch/x86/include/asm/string_64.h | 3 ++
> arch/x86/lib/memset_64.S | 90 ++++++++++++++++++++++++++++++++++++++++
> mm/memblock.c | 2 +-
> 3 files changed, 94 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
> index e466119..1ef28d0 100644
> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -55,6 +55,8 @@ extern void *memcpy(void *to, const void *from, size_t len);
> #define __HAVE_ARCH_MEMSET
> void *memset(void *s, int c, size_t n);
> void *__memset(void *s, int c, size_t n);
> +void *memset_nocache(void *s, int c, size_t n);
> +void *__memset_nocache(void *s, int c, size_t n);
>
> #define __HAVE_ARCH_MEMMOVE
> void *memmove(void *dest, const void *src, size_t count);
> @@ -77,6 +79,7 @@ int strcmp(const char *cs, const char *ct);
> #define memcpy(dst, src, len) __memcpy(dst, src, len)
> #define memmove(dst, src, len) __memmove(dst, src, len)
> #define memset(s, c, n) __memset(s, c, n)
> +#define memset_nocache(s, c, n) __memset_nocache(s, c, n)
> #endif
>
> #endif /* __KERNEL__ */
> diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
> index 6f44935..fb46f78 100644
> --- a/arch/x86/lib/memset_64.S
> +++ b/arch/x86/lib/memset_64.S
> @@ -137,6 +137,96 @@ ENTRY(__memset)
> ENDPROC(memset)
> ENDPROC(__memset)
>
> +/*
> + * memset_nocache - set a memory block to zero. This function uses
> + * non-temporal writes in the fastpath
> + *
> + * rdi destination
> + * rsi value (char)
> + * rdx count (bytes)
> + *
> + * rax original destination
> + */
> +
> +ENTRY(memset_nocache)
> +ENTRY(__memset_nocache)
> + CFI_STARTPROC
> + movq %rdi,%r10
> +
> + /* expand byte value */
> + movzbl %sil,%ecx
> + movabs $0x0101010101010101,%rax
> + imulq %rcx,%rax
> +
> + /* align dst */
> + movl %edi,%r9d
> + andl $7,%r9d
> + jnz bad_alignment
> + CFI_REMEMBER_STATE
> +after_bad_alignment:
> +
> + movq %rdx,%rcx
> + shrq $6,%rcx
> + jz handle_tail
> +
> + .p2align 4
> +loop_64:
> + decq %rcx
> + movnti %rax,(%rdi)
> + movnti %rax,8(%rdi)
> + movnti %rax,16(%rdi)
> + movnti %rax,24(%rdi)
> + movnti %rax,32(%rdi)
> + movnti %rax,40(%rdi)
> + movnti %rax,48(%rdi)
> + movnti %rax,56(%rdi)
> + leaq 64(%rdi),%rdi
> + jnz loop_64
> +
> +

Your version of memset_nocache differs from memset only in its use of
the movnti instruction. You may consider using preprocessor macros so
that a single copy of the source generates the two different versions of
executable code. That will make the new code much easier to maintain.

For example,

#include ...

#define MOVQ movnti
#define memset memset_nocache
#define __memset __memset_nocache

#include "memset_64.S"

Of course, you need to replace the target movq instructions in
memset_64.S with MOVQ and define

#ifndef MOVQ
#define MOVQ movq
#endif

You also need a conditional compilation macro to disable the
alternative-instruction patching in memset_64.S.
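
Putting that together, the wrapper file could look roughly like this
(the file name arch/x86/lib/memset_nocache_64.S and the
SKIP_MEMSET_ALTERNATIVES guard are illustrative, not from this thread):

/* Illustrative wrapper: arch/x86/lib/memset_nocache_64.S */
#define MOVQ movnti
#define memset memset_nocache
#define __memset __memset_nocache
#define SKIP_MEMSET_ALTERNATIVES /* illustrative guard */
#include "memset_64.S"

memset_64.S itself would then use MOVQ for its stores, default MOVQ to
movq under #ifndef as above, and wrap the REP STOS variants and their
altinstruction entries in the matching #ifndef so the non-temporal
version is not patched back to REP STOS at boot.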

Cheers,
Longman

2015-04-27 22:43:31

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 03/13] mm: meminit: Only set page reserved in the memblock region

On Thu, 23 Apr 2015 11:33:06 +0100 Mel Gorman <[email protected]> wrote:

> From: Nathan Zimmer <[email protected]>
>
> Currently we when we initialze each page struct is set as reserved upon
> initialization.

Hard to parse. I changed it to "Currently each page struct is set as
reserved upon initialization".

> This changes to starting with the reserved bit clear and
> then only setting the bit in the reserved region.

For what reason?

A code comment over reserve_bootmem_region() would be a good way
to answer that.
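
A comment along these lines would do it (the wording below is an
illustration, not text from the series):

/*
 * With deferred struct page initialisation the bulk of struct pages
 * now start life with the reserved bit clear, so set PageReserved
 * here only on the pages backing memblock reserved regions, which are
 * not handed to the buddy allocator at boot.
 */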

2015-04-27 22:43:38

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 05/13] mm: meminit: Make __early_pfn_to_nid SMP-safe and introduce meminit_pfn_in_nid

On Thu, 23 Apr 2015 11:33:08 +0100 Mel Gorman <[email protected]> wrote:

> __early_pfn_to_nid() in the generic and arch-specific implementations
> use static variables to cache recent lookups. Without the cache
> boot times are much higher due to the excessive memblock lookups but
> it assumes that memory initialisation is single-threaded. Parallel
> initialisation of struct pages will break that assumption so this patch
> makes __early_pfn_to_nid() SMP-safe by requiring the caller to cache
> recent search information. early_pfn_to_nid() keeps the same interface
> but is only safe to use early in boot due to the use of a global static
> variable. meminit_pfn_in_nid() is an SMP-safe version that callers must
> maintain their own state for.

Seems a bit awkward.

> +struct __meminitdata mminit_pfnnid_cache global_init_state;
> +
> +/* Only safe to use early in boot when initialisation is single-threaded */
> int __meminit early_pfn_to_nid(unsigned long pfn)
> {
> int nid;
>
> - nid = __early_pfn_to_nid(pfn);
> + /* The system will behave unpredictably otherwise */
> + BUG_ON(system_state != SYSTEM_BOOTING);

Because of this.

Providing a cache per cpu:

struct __meminitdata mminit_pfnnid_cache global_init_state[NR_CPUS];

would be simpler?


Also, `global_init_state' is a poor name for a kernel-wide symbol.

2015-04-27 22:43:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set

On Thu, 23 Apr 2015 11:33:10 +0100 Mel Gorman <[email protected]> wrote:

> This patch initalises all low memory struct pages and 2G of the highest zone
> on each node during memory initialisation if CONFIG_DEFERRED_STRUCT_PAGE_INIT
> is set. That config option cannot be set but will be available in a later
> patch. Parallel initialisation of struct page depends on some features
> from memory hotplug and it is necessary to alter alter section annotations.
>
> ...
>
> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +#define __defermem_init __meminit
> +#define __defer_init __meminit
> +#else
> +#define __defermem_init
> +#define __defer_init __init
> +#endif

Could we get some comments describing these? What they do, when and
where they should be used. I have a suspicion that the naming isn't
good, but I didn't spend a lot of time reverse-engineering the
intent...

2015-04-27 22:43:55

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 08/13] mm: meminit: Initialise remaining struct pages in parallel with kswapd

On Thu, 23 Apr 2015 11:33:11 +0100 Mel Gorman <[email protected]> wrote:

> Only a subset of struct pages are initialised at the moment. When this patch
> is applied kswapd initialise the remaining struct pages in parallel. This
> should boot faster by spreading the work to multiple CPUs and initialising
> data that is local to the CPU. The user-visible effect on large machines
> is that free memory will appear to rapidly increase early in the lifetime
> of the system until kswapd reports that all memory is initialised in the
> kernel log. Once initialised there should be no other user-visibile effects.
>
> ...
>
> + pr_info("kswapd %d initialised deferred memory in %ums\n", nid,
> + jiffies_to_msecs(jiffies - start));

It might be nice to tell people how much deferred memory kswapd
initialised.
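
Something like this perhaps (nr_pages is an illustrative counter the
init loop would have to maintain; the posted patch only reports the
elapsed time):

/* nr_pages: illustrative, counts pages initialised by this kswapd */
pr_info("kswapd %d initialised %lu pages in %ums\n", nid,
	nr_pages, jiffies_to_msecs(jiffies - start));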

2015-04-27 22:44:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 11/13] mm: meminit: Free pages in large chunks where possible

On Thu, 23 Apr 2015 11:33:14 +0100 Mel Gorman <[email protected]> wrote:

> Parallel struct page frees pages one at a time. Try free pages as single
> large pages where possible.
>
> ...
>
> void __defermem_init deferred_init_memmap(int nid)

This function is gruesome in an 80-col display. Even the code comments
wrap, which is nuts. Maybe hoist the contents of the outermost loop
into a separate function, called for each zone?

2015-04-27 22:46:36

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm: meminit: Move page initialization into a separate function.

On Thu, 23 Apr 2015 11:33:05 +0100 Mel Gorman <[email protected]> wrote:

> From: Robin Holt <[email protected]>

: <[email protected]>: host cuda-allmx.sgi.com[192.48.157.12] said: 550 cuda_nsu 5.1.1
: <[email protected]>: Recipient address rejected: User unknown in virtual alias
: table (in reply to RCPT TO command)

Has Robin moved, or is SGI mail busted?

2015-04-28 08:28:45

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm: meminit: Move page initialization into a separate function.

On Mon, Apr 27, 2015 at 03:46:33PM -0700, Andrew Morton wrote:
> On Thu, 23 Apr 2015 11:33:05 +0100 Mel Gorman <[email protected]> wrote:
>
> > From: Robin Holt <[email protected]>
>
> : <[email protected]>: host cuda-allmx.sgi.com[192.48.157.12] said: 550 cuda_nsu 5.1.1
> : <[email protected]>: Recipient address rejected: User unknown in virtual alias
> : table (in reply to RCPT TO command)
>
> Has Robin moved, or is SGI mail busted?

Robin has moved and I do not have an updated address for him. The
address used in the patches was the one he posted the patches with.

--
Mel Gorman
SUSE Labs

2015-04-28 09:37:58

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 05/13] mm: meminit: Make __early_pfn_to_nid SMP-safe and introduce meminit_pfn_in_nid

On Mon, Apr 27, 2015 at 03:43:33PM -0700, Andrew Morton wrote:
> On Thu, 23 Apr 2015 11:33:08 +0100 Mel Gorman <[email protected]> wrote:
>
> > __early_pfn_to_nid() in the generic and arch-specific implementations
> > use static variables to cache recent lookups. Without the cache
> > boot times are much higher due to the excessive memblock lookups but
> > it assumes that memory initialisation is single-threaded. Parallel
> > initialisation of struct pages will break that assumption so this patch
> > makes __early_pfn_to_nid() SMP-safe by requiring the caller to cache
> > recent search information. early_pfn_to_nid() keeps the same interface
> > but is only safe to use early in boot due to the use of a global static
> > variable. meminit_pfn_in_nid() is an SMP-safe version that callers must
> > maintain their own state for.
>
> Seems a bit awkward.
>

I'm afraid I don't understand which part you mean.

> > +struct __meminitdata mminit_pfnnid_cache global_init_state;
> > +
> > +/* Only safe to use early in boot when initialisation is single-threaded */
> > int __meminit early_pfn_to_nid(unsigned long pfn)
> > {
> > int nid;
> >
> > - nid = __early_pfn_to_nid(pfn);
> > + /* The system will behave unpredictably otherwise */
> > + BUG_ON(system_state != SYSTEM_BOOTING);
>
> Because of this.
>
> Providing a cache per cpu:
>
> struct __meminitdata mminit_pfnnid_cache global_init_state[NR_CPUS];
>
> would be simpler?
>

It would be simpler in terms of implementation but it's wasteful. We
only need a small number of these caches early in boot. NR_CPUS is
potentially very large.
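
For reference, the caller-maintained form in the series looks roughly
like this (a paraphrase; start_pfn, end_pfn and nid are assumed to come
from the surrounding deferred-init code):

/* Sketch: each caller keeps the cache on its own stack, no shared state */
struct mminit_pfnnid_cache nid_init_state = { };
unsigned long pfn;

for (pfn = start_pfn; pfn < end_pfn; pfn++) {
	if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state))
		continue;
	/* ... initialise the struct page for this pfn ... */
}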

>
> Also, `global_init_state' is a poor name for a kernel-wide symbol.

You're right. It's not really global, it's just the one that is used if
the caller does not track their own state. It should have been static and
I renamed it to early_pfnnid_cache.

--
Mel Gorman
SUSE Labs

2015-04-28 09:53:30

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set

On Mon, Apr 27, 2015 at 03:43:44PM -0700, Andrew Morton wrote:
> On Thu, 23 Apr 2015 11:33:10 +0100 Mel Gorman <[email protected]> wrote:
>
> > This patch initalises all low memory struct pages and 2G of the highest zone
> > on each node during memory initialisation if CONFIG_DEFERRED_STRUCT_PAGE_INIT
> > is set. That config option cannot be set but will be available in a later
> > patch. Parallel initialisation of struct page depends on some features
> > from memory hotplug and it is necessary to alter alter section annotations.
> >
> > ...
> >
> > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> > +#define __defermem_init __meminit
> > +#define __defer_init __meminit
> > +#else
> > +#define __defermem_init
> > +#define __defer_init __init
> > +#endif
>
> Could we get some comments describing these? What they do, when and
> where they should be used. I have a suspicion that the naming isn't
> good, but I didn't spend a lot of time reverse-engineering the
> intent...
>

Of course. The next version will have

+/*
+ * Deferred struct page initialisation requires some early init functions that
+ * are removed before kswapd is up and running. The feature depends on memory
+ * hotplug so put the data and code required by deferred initialisation into
+ * the __meminit section where they are preserved.
+ */

--
Mel Gorman
SUSE Labs

2015-04-28 11:38:25

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 11/13] mm: meminit: Free pages in large chunks where possible

On Mon, Apr 27, 2015 at 03:43:56PM -0700, Andrew Morton wrote:
> On Thu, 23 Apr 2015 11:33:14 +0100 Mel Gorman <[email protected]> wrote:
>
> > Parallel struct page frees pages one at a time. Try free pages as single
> > large pages where possible.
> >
> > ...
> >
> > void __defermem_init deferred_init_memmap(int nid)
>
> This function is gruesome in an 80-col display. Even the code comments
> wrap, which is nuts. Maybe hoist the contents of the outermost loop
> into a separate function, called for each zone?

I can do better than that because only the highest zone is deferred
in this version and the loop is no longer necessary. I should post a V4
before the end of my day that addresses your feedback. It caused a lot of
conflicts and it'll be easier to replace the full series than to manage
incremental fixes.

Thanks Andrew.

--
Mel Gorman
SUSE Labs

2015-04-28 13:41:41

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set

On Tue, 28 Apr 2015 10:53:23 +0100 Mel Gorman <[email protected]> wrote:

> > > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> > > +#define __defermem_init __meminit
> > > +#define __defer_init __meminit
> > > +#else
> > > +#define __defermem_init
> > > +#define __defer_init __init
> > > +#endif
> >
> > Could we get some comments describing these? What they do, when and
> > where they should be used. I have a suspicion that the naming isn't
> > good, but I didn't spend a lot of time reverse-engineering the
> > intent...
> >
>
> Of course. The next version will have
>
> +/*
> + * Deferred struct page initialisation requires some early init functions that
> + * are removed before kswapd is up and running. The feature depends on memory
> + * hotplug so put the data and code required by deferred initialisation into
> + * the __meminit section where they are preserved.
> + */

I'm still not getting it even a little bit :( You say "data and code",
so I'd expect to see

#define __defer_meminitdata __meminitdata
#define __defer_meminit __meminit

But the patch doesn't mention the data segment at all.

The patch uses both __defermem_init and __defer_init to tag functions
(ie: text) and I can't work out why.

2015-04-28 14:56:38

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set

On Tue, Apr 28, 2015 at 06:48:10AM -0700, Andrew Morton wrote:
> On Tue, 28 Apr 2015 10:53:23 +0100 Mel Gorman <[email protected]> wrote:
>
> > > > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> > > > +#define __defermem_init __meminit
> > > > +#define __defer_init __meminit
> > > > +#else
> > > > +#define __defermem_init
> > > > +#define __defer_init __init
> > > > +#endif
> > >
> > > Could we get some comments describing these? What they do, when and
> > > where they should be used. I have a suspicion that the naming isn't
> > > good, but I didn't spend a lot of time reverse-engineering the
> > > intent...
> > >
> >
> > Of course. The next version will have
> >
> > +/*
> > + * Deferred struct page initialisation requires some early init functions that
> > + * are removed before kswapd is up and running. The feature depends on memory
> > + * hotplug so put the data and code required by deferred initialisation into
> > + * the __meminit section where they are preserved.
> > + */
>
> I'm still not getting it even a little bit :( You say "data and code",
> so I'd expect to see
>
> #define __defer_meminitdata __meminitdata
> #define __defer_meminit __meminit
>
> But the patch doesn't mention the data segment at all.
>

Take 2. Suggestions on different names are welcome because they are poor.

/*
* Deferred struct page initialisation requires init functions that are freed
* before kswapd is available. Reuse the memory hotplug section annotation
* to mark the required code.
*
* __defermem_init is code that always exists but is annotated __meminit to
* avoid section warnings.
* __defer_init code gets marked __meminit when deferring struct page
* initialisation but is otherwise in the init section.
*/


--
Mel Gorman
SUSE Labs

2015-04-28 16:02:55

by Nathan Zimmer

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm: meminit: Move page initialization into a separate function.

This is the one I have, but I haven't had a chance to talk with him in a
long time.
[email protected]

On 04/28/2015 03:28 AM, Mel Gorman wrote:
> On Mon, Apr 27, 2015 at 03:46:33PM -0700, Andrew Morton wrote:
>> On Thu, 23 Apr 2015 11:33:05 +0100 Mel Gorman <[email protected]> wrote:
>>
>>> From: Robin Holt <[email protected]>
>> : <[email protected]>: host cuda-allmx.sgi.com[192.48.157.12] said: 550 cuda_nsu 5.1.1
>> : <[email protected]>: Recipient address rejected: User unknown in virtual alias
>> : table (in reply to RCPT TO command)
>>
>> Has Robin moved, or is SGI mail busted?
> Robin has moved and I do not have an updated address for him. The
> address used in the patches was the one he posted the patches with.
>

2015-04-28 22:41:04

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm: meminit: Move page initialization into a separate function.

On Tue, 28 Apr 2015 09:28:31 +0100 Mel Gorman <[email protected]> wrote:

> On Mon, Apr 27, 2015 at 03:46:33PM -0700, Andrew Morton wrote:
> > On Thu, 23 Apr 2015 11:33:05 +0100 Mel Gorman <[email protected]> wrote:
> >
> > > From: Robin Holt <[email protected]>
> >
> > : <[email protected]>: host cuda-allmx.sgi.com[192.48.157.12] said: 550 cuda_nsu 5.1.1
> > : <[email protected]>: Recipient address rejected: User unknown in virtual alias
> > : table (in reply to RCPT TO command)
> >
> > Has Robin moved, or is SGI mail busted?
>
> Robin has moved and I do not have an updated address for him. The
> address used in the patches was the one he posted the patches with.
>

As Nathan mentioned,

z:/usr/src/git26> git log | grep "Robin Holt"
Cc: Robin Holt <[email protected]>
Acked-by: Robin Holt <[email protected]>
Cc: Robin Holt <[email protected]>
Cc: Robin Holt <[email protected]>
Cc: Robin Holt <[email protected]>

2015-04-28 23:05:13

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm: meminit: Move page initialization into a separate function.

On Tue, Apr 28, 2015 at 03:41:00PM -0700, Andrew Morton wrote:
> On Tue, 28 Apr 2015 09:28:31 +0100 Mel Gorman <[email protected]> wrote:
>
> > On Mon, Apr 27, 2015 at 03:46:33PM -0700, Andrew Morton wrote:
> > > On Thu, 23 Apr 2015 11:33:05 +0100 Mel Gorman <[email protected]> wrote:
> > >
> > > > From: Robin Holt <[email protected]>
> > >
> > > : <[email protected]>: host cuda-allmx.sgi.com[192.48.157.12] said: 550 cuda_nsu 5.1.1
> > > : <[email protected]>: Recipient address rejected: User unknown in virtual alias
> > > : table (in reply to RCPT TO command)
> > >
> > > Has Robin moved, or is SGI mail busted?
> >
> > Robin has moved and I do not have an updated address for him. The
> > address used in the patches was the one he posted the patches with.
> >
>
> As Nathan mentioned,
>
> z:/usr/src/git26> git log | grep "Robin Holt"
> Cc: Robin Holt <[email protected]>
> Acked-by: Robin Holt <[email protected]>
> Cc: Robin Holt <[email protected]>
> Cc: Robin Holt <[email protected]>
> Cc: Robin Holt <[email protected]>

I can update the address if Robin wishes (cc'd). I was preserving the
address that was used to actually sign off the patches as that was the
history.

--
Mel Gorman
SUSE Labs

2015-04-29 01:31:55

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v3

On 04/23/2015 11:53 AM, Daniel J Blueman wrote:
> On Thu, Apr 23, 2015 at 6:33 PM, Mel Gorman <[email protected]> wrote:
>> The big change here is an adjustment to the topology_init path that
>> caused
>> soft lockups on Waiman and Daniel Blue had reported it was an expensive
>> function.
>>
>> Changelog since v2
>> o Reduce overhead of topology_init
>> o Remove boot-time kernel parameter to enable/disable
>> o Enable on UMA
>>
>> Changelog since v1
>> o Always initialise low zones
>> o Typo corrections
>> o Rename parallel mem init to parallel struct page init
>> o Rebase to 4.0
> []
>
> Splendid work! On this 256c setup, topology_init now takes 185ms.
>
> This brings the kernel boot time down to 324s [1]. It turns out that
> one memset is responsible for most of the time spent setting up the PUDs
> and PMDs; adapting memset to use non-temporal writes [3] avoids
> generating RMW cycles, bringing boot time down to 186s [2].
>
> If this is a possibility, I can split this patch and map other arch's
> memset_nocache to memset, or change the callsite as preferred;
> comments welcome.
>
> Thanks,
> Daniel
>
> [1] https://resources.numascale.com/telemetry/defermem/h8qgl-defer2.txt
> [2]
> https://resources.numascale.com/telemetry/defermem/h8qgl-defer2-nontemporal.txt
>
> -- [3]
>
> From f822139736cab8434302693c635fa146b465273c Mon Sep 17 00:00:00 2001
> From: Daniel J Blueman <[email protected]>
> Date: Thu, 23 Apr 2015 23:26:27 +0800
> Subject: [RFC] Speedup PMD setup
>
> Using non-temporal writes prevents read-modify-write cycles,
> which are much slower over large topologies.
>
> Adapt the existing memset() function into a _nocache variant and use
> when setting up PMDs during early boot to reduce boot time.
>
> Signed-off-by: Daniel J Blueman <[email protected]>
> ---
> arch/x86/include/asm/string_64.h | 3 ++
> arch/x86/lib/memset_64.S | 90 ++++++++++++++++++++++++++++++++++++++++
> mm/memblock.c | 2 +-
> 3 files changed, 94 insertions(+), 1 deletion(-)
>

I tried your patch on my 12-TB IvyBridge-EX test machine and the bootup
time increased from 265s to 289s (24s increase). I think my IvyBridge-EX
box was using the optimized memset_c_e (rep stosb) code which turned out
to perform better than the non-temporal move in your code. I think that
may be due to the temporal moves that need to be done at the beginning
and end of the memory range.

I had tried to replace clear_page() with non-temporal moves. I generally
got a few percentage points of improvement compared with the
optimized clear_page_c() and clear_page_c_e() code. That is not a lot.

Anyway, I think the AMD box that you used wasn't setting the
X86_FEATURE_REP_GOOD or X86_FEATURE_ERMS bits, resulting in poor memset
performance. If such a feature is supported on the AMD CPU (albeit in a
different way), you may consider sending in a patch to set those feature
bits. Alternatively, you will need to duplicate the alternative
instruction patching in your memset_nocache() to make sure that it can
use the optimized code where appropriate.
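
For what it's worth, X86_FEATURE_REP_GOOD is a synthetic flag set from
the CPU setup code with set_cpu_cap(); roughly like this (the family
check is only a sketch, deciding when rep stos is actually fast on this
hardware is the real question):

/* e.g. in init_amd()-style setup code; only if rep stos is known fast */
if (c->x86 >= 0x10)
	set_cpu_cap(c, X86_FEATURE_REP_GOOD);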

Cheers,
Longman