2015-04-22 17:08:05

by Mel Gorman

Subject: [RFC PATCH 0/14] Parallel struct page initialisation v5r4

Changelog since v1
o Always initialise low zones
o Typo corrections
o Rename parallel mem init to parallel struct page init
o Rebase to 4.0

Struct page initialisation had been identified as one of the reasons why
large machines take a long time to boot. Patches were posted a long time
ago that deferred the initialisation of struct pages until they were first
used. That approach was rejected on the grounds that it should not be
necessary to hurt the fast paths. This series reuses much of the work from
that time but defers the initialisation of memory to kswapd so that one
thread per node initialises memory local to that node.

After applying the series and setting the appropriate Kconfig variable, I
see this in the boot log on a 64G machine:

[ 7.383764] kswapd 0 initialised deferred memory in 188ms
[ 7.404253] kswapd 1 initialised deferred memory in 208ms
[ 7.411044] kswapd 3 initialised deferred memory in 216ms
[ 7.411551] kswapd 2 initialised deferred memory in 216ms

On a 1TB machine, I see

[ 11.913324] kswapd 0 initialised deferred memory in 1168ms
[ 12.220011] kswapd 2 initialised deferred memory in 1476ms
[ 12.245369] kswapd 3 initialised deferred memory in 1500ms
[ 12.271680] kswapd 1 initialised deferred memory in 1528ms

Once booted the machine appears to work as normal. Boot times were measured
from the time shutdown was called until ssh was available again. In the
64G case, the boot time savings are negligible. On the 1TB machine, the
savings were 16 seconds.

It would be nice if people who have access to really large machines would
test this series and report back on whether the complexity is justified.
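
For anyone testing, the Kconfig options involved are the ones introduced by
patch 10 of this series; a minimal .config sketch (assuming the option's
dependencies are already satisfied) would be:

	CONFIG_DEFERRED_STRUCT_PAGE_INIT=y
	CONFIG_DEFERRED_STRUCT_PAGE_INIT_DEFAULT_ENABLED=y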

Patches are against 4.0.

Documentation/kernel-parameters.txt | 6 +
arch/ia64/mm/numa.c | 19 +-
arch/x86/Kconfig | 1 +
include/linux/memblock.h | 18 ++
include/linux/mm.h | 8 +-
include/linux/mmzone.h | 45 ++--
init/main.c | 1 +
mm/Kconfig | 28 +++
mm/bootmem.c | 8 +-
mm/internal.h | 23 +-
mm/memblock.c | 34 ++-
mm/mm_init.c | 9 +-
mm/nobootmem.c | 7 +-
mm/page_alloc.c | 408 +++++++++++++++++++++++++++++++-----
mm/vmscan.c | 6 +-
15 files changed, 514 insertions(+), 107 deletions(-)

--
2.1.2


2015-04-22 17:08:03

by Mel Gorman

Subject: [PATCH 01/13] memblock: Introduce a for_each_reserved_mem_region iterator.

From: Robin Holt <[email protected]>

As part of initializing struct pages in 2MiB chunks, we noticed that
at the end of free_all_bootmem() there was nothing that forced the
reserved/allocated 4KiB pages to be initialized.

This helper function will be used for that expansion.
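
As a minimal usage sketch (the walker function name below is hypothetical;
patch 3 uses the iterator in the same way to mark reserved pages):

static void __init walk_reserved_regions(void)
{
	phys_addr_t start, end;
	u64 i;

	/* Walks memblock.reserved; available as soon as memblock is up */
	for_each_reserved_mem_region(i, &start, &end)
		pr_debug("reserved region: %pa..%pa\n", &start, &end);
}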

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nate Zimmer <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/memblock.h | 18 ++++++++++++++++++
mm/memblock.c | 32 ++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index e8cc45307f8f..3075e7673c54 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -93,6 +93,9 @@ void __next_mem_range_rev(u64 *idx, int nid, struct memblock_type *type_a,
struct memblock_type *type_b, phys_addr_t *out_start,
phys_addr_t *out_end, int *out_nid);

+void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
+ phys_addr_t *out_end);
+
/**
* for_each_mem_range - iterate through memblock areas from type_a and not
* included in type_b. Or just type_a if type_b is NULL.
@@ -132,6 +135,21 @@ void __next_mem_range_rev(u64 *idx, int nid, struct memblock_type *type_a,
__next_mem_range_rev(&i, nid, type_a, type_b, \
p_start, p_end, p_nid))

+/**
+ * for_each_reserved_mem_region - iterate over all reserved memblock areas
+ * @i: u64 used as loop variable
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over reserved areas of memblock. Available as soon as memblock
+ * is initialized.
+ */
+#define for_each_reserved_mem_region(i, p_start, p_end) \
+ for (i = 0UL, \
+ __next_reserved_mem_region(&i, p_start, p_end); \
+ i != (u64)ULLONG_MAX; \
+ __next_reserved_mem_region(&i, p_start, p_end))
+
#ifdef CONFIG_MOVABLE_NODE
static inline bool memblock_is_hotpluggable(struct memblock_region *m)
{
diff --git a/mm/memblock.c b/mm/memblock.c
index 252b77bdf65e..e0cc2d174f74 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -765,6 +765,38 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
}

/**
+ * __next_reserved_mem_region - next function for for_each_reserved_region()
+ * @idx: pointer to u64 loop variable
+ * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL
+ * @out_end: ptr to phys_addr_t for end address of the region, can be %NULL
+ *
+ * Iterate over all reserved memory regions.
+ */
+void __init_memblock __next_reserved_mem_region(u64 *idx,
+ phys_addr_t *out_start,
+ phys_addr_t *out_end)
+{
+ struct memblock_type *rsv = &memblock.reserved;
+
+ if (*idx >= 0 && *idx < rsv->cnt) {
+ struct memblock_region *r = &rsv->regions[*idx];
+ phys_addr_t base = r->base;
+ phys_addr_t size = r->size;
+
+ if (out_start)
+ *out_start = base;
+ if (out_end)
+ *out_end = base + size - 1;
+
+ *idx += 1;
+ return;
+ }
+
+ /* signal end of iteration */
+ *idx = ULLONG_MAX;
+}
+
+/**
* __next__mem_range - next function for for_each_free_mem_range() etc.
* @idx: pointer to u64 loop variable
* @nid: node selector, %NUMA_NO_NODE for all nodes
--
2.1.2

2015-04-22 17:13:12

by Mel Gorman

Subject: [PATCH 02/13] mm: meminit: Move page initialization into a separate function.

From: Robin Holt <[email protected]>

Currently, memmap_init_zone() has all the smarts for initializing a single
page. A subset of this is required for parallel page initialisation and so
this patch breaks up the monolithic function in preparation.

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nathan Zimmer <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 79 +++++++++++++++++++++++++++++++++------------------------
1 file changed, 46 insertions(+), 33 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 40e29429e7b0..fd7a6d09062d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -778,6 +778,51 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
return 0;
}

+static void __meminit __init_single_page(struct page *page, unsigned long pfn,
+ unsigned long zone, int nid)
+{
+ struct zone *z = &NODE_DATA(nid)->node_zones[zone];
+
+ set_page_links(page, zone, nid, pfn);
+ mminit_verify_page_links(page, zone, nid, pfn);
+ init_page_count(page);
+ page_mapcount_reset(page);
+ page_cpupid_reset_last(page);
+ SetPageReserved(page);
+
+ /*
+ * Mark the block movable so that blocks are reserved for
+ * movable at startup. This will force kernel allocations
+ * to reserve their blocks rather than leaking throughout
+ * the address space during boot when many long-lived
+ * kernel allocations are made. Later some blocks near
+ * the start are marked MIGRATE_RESERVE by
+ * setup_zone_migrate_reserve()
+ *
+ * bitmap is created for zone's valid pfn range. but memmap
+ * can be created for invalid pages (for alignment)
+ * check here not to call set_pageblock_migratetype() against
+ * pfn out of zone.
+ */
+ if ((z->zone_start_pfn <= pfn)
+ && (pfn < zone_end_pfn(z))
+ && !(pfn & (pageblock_nr_pages - 1)))
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+
+ INIT_LIST_HEAD(&page->lru);
+#ifdef WANT_PAGE_VIRTUAL
+ /* The shift won't overflow because ZONE_NORMAL is below 4G. */
+ if (!is_highmem_idx(zone))
+ set_page_address(page, __va(pfn << PAGE_SHIFT));
+#endif
+}
+
+static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone,
+ int nid)
+{
+ return __init_single_page(pfn_to_page(pfn), pfn, zone, nid);
+}
+
static bool free_pages_prepare(struct page *page, unsigned int order)
{
bool compound = PageCompound(page);
@@ -4124,7 +4169,6 @@ static void setup_zone_migrate_reserve(struct zone *zone)
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
unsigned long start_pfn, enum memmap_context context)
{
- struct page *page;
unsigned long end_pfn = start_pfn + size;
unsigned long pfn;
struct zone *z;
@@ -4145,38 +4189,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
if (!early_pfn_in_nid(pfn, nid))
continue;
}
- page = pfn_to_page(pfn);
- set_page_links(page, zone, nid, pfn);
- mminit_verify_page_links(page, zone, nid, pfn);
- init_page_count(page);
- page_mapcount_reset(page);
- page_cpupid_reset_last(page);
- SetPageReserved(page);
- /*
- * Mark the block movable so that blocks are reserved for
- * movable at startup. This will force kernel allocations
- * to reserve their blocks rather than leaking throughout
- * the address space during boot when many long-lived
- * kernel allocations are made. Later some blocks near
- * the start are marked MIGRATE_RESERVE by
- * setup_zone_migrate_reserve()
- *
- * bitmap is created for zone's valid pfn range. but memmap
- * can be created for invalid pages (for alignment)
- * check here not to call set_pageblock_migratetype() against
- * pfn out of zone.
- */
- if ((z->zone_start_pfn <= pfn)
- && (pfn < zone_end_pfn(z))
- && !(pfn & (pageblock_nr_pages - 1)))
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-
- INIT_LIST_HEAD(&page->lru);
-#ifdef WANT_PAGE_VIRTUAL
- /* The shift won't overflow because ZONE_NORMAL is below 4G. */
- if (!is_highmem_idx(zone))
- set_page_address(page, __va(pfn << PAGE_SHIFT));
-#endif
+ __init_single_pfn(pfn, zone, nid);
}
}

--
2.1.2

2015-04-22 17:12:53

by Mel Gorman

Subject: [PATCH 03/13] mm: meminit: Only set page reserved in the memblock region

From: Nathan Zimmer <[email protected]>

Currently each struct page is set as reserved when it is initialised.
This patch changes that to start with the reserved bit clear and to set
the bit only for pages that lie within a reserved memblock region.

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nathan Zimmer <[email protected]>
---
include/linux/mm.h | 2 ++
mm/nobootmem.c | 3 +++
mm/page_alloc.c | 11 ++++++++++-
3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 47a93928b90f..b6f82a31028a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1711,6 +1711,8 @@ extern void free_highmem_page(struct page *page);
extern void adjust_managed_page_count(struct page *page, long count);
extern void mem_init_print_info(const char *str);

+extern void reserve_bootmem_region(unsigned long start, unsigned long end);
+
/* Free the reserved page into the buddy system, so it gets managed. */
static inline void __free_reserved_page(struct page *page)
{
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 90b50468333e..396f9e450dc1 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -121,6 +121,9 @@ static unsigned long __init free_low_memory_core_early(void)

memblock_clear_hotplug(0, -1);

+ for_each_reserved_mem_region(i, &start, &end)
+ reserve_bootmem_region(start, end);
+
for_each_free_mem_range(i, NUMA_NO_NODE, &start, &end, NULL)
count += __free_memory_core(start, end);

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd7a6d09062d..2abb3b861e70 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -788,7 +788,6 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);
- SetPageReserved(page);

/*
* Mark the block movable so that blocks are reserved for
@@ -823,6 +822,16 @@ static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone,
return __init_single_page(pfn_to_page(pfn), pfn, zone, nid);
}

+void reserve_bootmem_region(unsigned long start, unsigned long end)
+{
+ unsigned long start_pfn = PFN_DOWN(start);
+ unsigned long end_pfn = PFN_UP(end);
+
+ for (; start_pfn < end_pfn; start_pfn++)
+ if (pfn_valid(start_pfn))
+ SetPageReserved(pfn_to_page(start_pfn));
+}
+
static bool free_pages_prepare(struct page *page, unsigned int order)
{
bool compound = PageCompound(page);
--
2.1.2

2015-04-22 17:08:07

by Mel Gorman

Subject: [PATCH 04/13] mm: page_alloc: Pass PFN to __free_pages_bootmem

__free_pages_bootmem prepares a page for release to the buddy allocator
and assumes that the struct page is initialised. Parallel initialisation
of struct pages defers that initialisation, so __free_pages_bootmem can be
called for struct pages that are not yet initialised and cannot be mapped
back to a PFN. This patch passes the PFN to __free_pages_bootmem with no
other functional change.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/bootmem.c | 8 ++++----
mm/internal.h | 3 ++-
mm/memblock.c | 2 +-
mm/nobootmem.c | 4 ++--
mm/page_alloc.c | 3 ++-
5 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/bootmem.c b/mm/bootmem.c
index 477be696511d..daf956bb4782 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -164,7 +164,7 @@ void __init free_bootmem_late(unsigned long physaddr, unsigned long size)
end = PFN_DOWN(physaddr + size);

for (; cursor < end; cursor++) {
- __free_pages_bootmem(pfn_to_page(cursor), 0);
+ __free_pages_bootmem(pfn_to_page(cursor), cursor, 0);
totalram_pages++;
}
}
@@ -210,7 +210,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
if (IS_ALIGNED(start, BITS_PER_LONG) && vec == ~0UL) {
int order = ilog2(BITS_PER_LONG);

- __free_pages_bootmem(pfn_to_page(start), order);
+ __free_pages_bootmem(pfn_to_page(start), start, order);
count += BITS_PER_LONG;
start += BITS_PER_LONG;
} else {
@@ -220,7 +220,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
while (vec && cur != start) {
if (vec & 1) {
page = pfn_to_page(cur);
- __free_pages_bootmem(page, 0);
+ __free_pages_bootmem(page, cur, 0);
count++;
}
vec >>= 1;
@@ -234,7 +234,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
pages = bootmem_bootmap_pages(pages);
count += pages;
while (pages--)
- __free_pages_bootmem(page++, 0);
+ __free_pages_bootmem(page++, cur++, 0);

bdebug("nid=%td released=%lx\n", bdata - bootmem_node_data, count);

diff --git a/mm/internal.h b/mm/internal.h
index a96da5b0029d..76b605139c7a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -155,7 +155,8 @@ __find_buddy_index(unsigned long page_idx, unsigned int order)
}

extern int __isolate_free_page(struct page *page, unsigned int order);
-extern void __free_pages_bootmem(struct page *page, unsigned int order);
+extern void __free_pages_bootmem(struct page *page, unsigned long pfn,
+ unsigned int order);
extern void prep_compound_page(struct page *page, unsigned long order);
#ifdef CONFIG_MEMORY_FAILURE
extern bool is_free_buddy_page(struct page *page);
diff --git a/mm/memblock.c b/mm/memblock.c
index e0cc2d174f74..f3e97d8eeb5c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1334,7 +1334,7 @@ void __init __memblock_free_late(phys_addr_t base, phys_addr_t size)
end = PFN_DOWN(base + size);

for (; cursor < end; cursor++) {
- __free_pages_bootmem(pfn_to_page(cursor), 0);
+ __free_pages_bootmem(pfn_to_page(cursor), cursor, 0);
totalram_pages++;
}
}
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 396f9e450dc1..bae652713ee5 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -77,7 +77,7 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size)
end = PFN_DOWN(addr + size);

for (; cursor < end; cursor++) {
- __free_pages_bootmem(pfn_to_page(cursor), 0);
+ __free_pages_bootmem(pfn_to_page(cursor), cursor, 0);
totalram_pages++;
}
}
@@ -92,7 +92,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end)
while (start + (1UL << order) > end)
order--;

- __free_pages_bootmem(pfn_to_page(start), order);
+ __free_pages_bootmem(pfn_to_page(start), start, order);

start += (1UL << order);
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2abb3b861e70..0a0e0f280d87 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -886,7 +886,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
local_irq_restore(flags);
}

-void __init __free_pages_bootmem(struct page *page, unsigned int order)
+void __init __free_pages_bootmem(struct page *page, unsigned long pfn,
+ unsigned int order)
{
unsigned int nr_pages = 1 << order;
struct page *p = page;
--
2.1.2

2015-04-22 17:12:18

by Mel Gorman

Subject: [PATCH 05/13] mm: meminit: Make __early_pfn_to_nid SMP-safe and introduce meminit_pfn_in_nid

The generic and arch-specific implementations of __early_pfn_to_nid()
use static variables to cache recent lookups. Without the cache, boot
times are much higher due to the excessive memblock lookups, but the
cache assumes that memory initialisation is single-threaded. Parallel
initialisation of struct pages will break that assumption, so this patch
makes __early_pfn_to_nid() SMP-safe by requiring the caller to maintain
the cache of recent search information. early_pfn_to_nid() keeps the same
interface but is only safe to use early in boot due to its use of a global
static variable. meminit_pfn_in_nid() is an SMP-safe version of
early_pfn_in_nid() for which callers must maintain their own state.
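
A sketch of the new calling convention (the loop body here is illustrative
only): each caller keeps its own lookup cache, typically on the stack, so
no static state is shared between threads:

	struct mminit_pfnnid_cache nid_state = { };
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		/* SMP-safe: the pfn->nid cache is private to this caller */
		if (!meminit_pfn_in_nid(pfn, nid, &nid_state))
			continue;
		/* ... initialise the struct page for this node ... */
	}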

Signed-off-by: Mel Gorman <[email protected]>
---
arch/ia64/mm/numa.c | 19 +++++++------------
include/linux/mm.h | 8 ++++++--
include/linux/mmzone.h | 16 +++++++++++++++-
mm/page_alloc.c | 40 +++++++++++++++++++++++++---------------
4 files changed, 53 insertions(+), 30 deletions(-)

diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c
index ea21d4cad540..aa19b7ac8222 100644
--- a/arch/ia64/mm/numa.c
+++ b/arch/ia64/mm/numa.c
@@ -58,27 +58,22 @@ paddr_to_nid(unsigned long paddr)
* SPARSEMEM to allocate the SPARSEMEM sectionmap on the NUMA node where
* the section resides.
*/
-int __meminit __early_pfn_to_nid(unsigned long pfn)
+int __meminit __early_pfn_to_nid(unsigned long pfn,
+ struct mminit_pfnnid_cache *state)
{
int i, section = pfn >> PFN_SECTION_SHIFT, ssec, esec;
- /*
- * NOTE: The following SMP-unsafe globals are only used early in boot
- * when the kernel is running single-threaded.
- */
- static int __meminitdata last_ssec, last_esec;
- static int __meminitdata last_nid;

- if (section >= last_ssec && section < last_esec)
- return last_nid;
+ if (section >= state->last_start && section < state->last_end)
+ return state->last_nid;

for (i = 0; i < num_node_memblks; i++) {
ssec = node_memblk[i].start_paddr >> PA_SECTION_SHIFT;
esec = (node_memblk[i].start_paddr + node_memblk[i].size +
((1L << PA_SECTION_SHIFT) - 1)) >> PA_SECTION_SHIFT;
if (section >= ssec && section < esec) {
- last_ssec = ssec;
- last_esec = esec;
- last_nid = node_memblk[i].nid;
+ state->last_start = ssec;
+ state->last_end = esec;
+ state->last_nid = node_memblk[i].nid;
return node_memblk[i].nid;
}
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b6f82a31028a..3a4c9f72c080 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1802,7 +1802,8 @@ extern void sparse_memory_present_with_active_regions(int nid);

#if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
!defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
-static inline int __early_pfn_to_nid(unsigned long pfn)
+static inline int __early_pfn_to_nid(unsigned long pfn,
+ struct mminit_pfnnid_cache *state)
{
return 0;
}
@@ -1810,7 +1811,10 @@ static inline int __early_pfn_to_nid(unsigned long pfn)
/* please see mm/page_alloc.c */
extern int __meminit early_pfn_to_nid(unsigned long pfn);
/* there is a per-arch backend function. */
-extern int __meminit __early_pfn_to_nid(unsigned long pfn);
+extern int __meminit __early_pfn_to_nid(unsigned long pfn,
+ struct mminit_pfnnid_cache *state);
+bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node,
+ struct mminit_pfnnid_cache *state);
#endif

extern void set_dma_reserve(unsigned long new_dma_reserve);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2782df47101e..a67b33e52dfe 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1216,10 +1216,24 @@ void sparse_init(void);
#define sparse_index_init(_sec, _nid) do {} while (0)
#endif /* CONFIG_SPARSEMEM */

+/*
+ * During memory init memblocks map pfns to nids. The search is expensive and
+ * this caches recent lookups. The implementation of __early_pfn_to_nid
+ * may treat start/end as pfns or sections.
+ */
+struct mminit_pfnnid_cache {
+ unsigned long last_start;
+ unsigned long last_end;
+ int last_nid;
+};
+
#ifdef CONFIG_NODES_SPAN_OTHER_NODES
bool early_pfn_in_nid(unsigned long pfn, int nid);
+bool meminit_pfn_in_nid(unsigned long pfn, int node,
+ struct mminit_pfnnid_cache *state);
#else
-#define early_pfn_in_nid(pfn, nid) (1)
+#define early_pfn_in_nid(pfn, nid) (1)
+#define meminit_pfn_in_nid(pfn, nid, state) (1)
#endif

#ifndef early_pfn_valid
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0a0e0f280d87..f556ed63b964 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4457,39 +4457,41 @@ int __meminit init_currently_empty_zone(struct zone *zone,

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
+
/*
* Required by SPARSEMEM. Given a PFN, return what node the PFN is on.
*/
-int __meminit __early_pfn_to_nid(unsigned long pfn)
+int __meminit __early_pfn_to_nid(unsigned long pfn,
+ struct mminit_pfnnid_cache *state)
{
unsigned long start_pfn, end_pfn;
int nid;
- /*
- * NOTE: The following SMP-unsafe globals are only used early in boot
- * when the kernel is running single-threaded.
- */
- static unsigned long __meminitdata last_start_pfn, last_end_pfn;
- static int __meminitdata last_nid;

- if (last_start_pfn <= pfn && pfn < last_end_pfn)
- return last_nid;
+ if (state->last_start <= pfn && pfn < state->last_end)
+ return state->last_nid;

nid = memblock_search_pfn_nid(pfn, &start_pfn, &end_pfn);
if (nid != -1) {
- last_start_pfn = start_pfn;
- last_end_pfn = end_pfn;
- last_nid = nid;
+ state->last_start = start_pfn;
+ state->last_end = end_pfn;
+ state->last_nid = nid;
}

return nid;
}
#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */

+struct __meminitdata mminit_pfnnid_cache global_init_state;
+
+/* Only safe to use early in boot when initialisation is single-threaded */
int __meminit early_pfn_to_nid(unsigned long pfn)
{
int nid;

- nid = __early_pfn_to_nid(pfn);
+ /* The system will behave unpredictably otherwise */
+ BUG_ON(system_state != SYSTEM_BOOTING);
+
+ nid = __early_pfn_to_nid(pfn, &global_init_state);
if (nid >= 0)
return nid;
/* just returns 0 */
@@ -4497,15 +4499,23 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
}

#ifdef CONFIG_NODES_SPAN_OTHER_NODES
-bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
+bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node,
+ struct mminit_pfnnid_cache *state)
{
int nid;

- nid = __early_pfn_to_nid(pfn);
+ nid = __early_pfn_to_nid(pfn, state);
if (nid >= 0 && nid != node)
return false;
return true;
}
+
+/* Only safe to use early in boot when initialisation is single-threaded */
+bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
+{
+ return meminit_pfn_in_nid(pfn, node, &global_init_state);
+}
+
#endif

/**
--
2.1.2

2015-04-22 17:11:50

by Mel Gorman

Subject: [PATCH 06/13] mm: meminit: Inline some helper functions

early_pfn_in_nid() and meminit_pfn_in_nid() are small functions that are
unnecessarily visible outside memory initialisation. Besides the needless
visibility, the out-of-line calls add function call overhead when
initialising pages. This patch moves the helpers inline.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mm.h | 2 --
include/linux/mmzone.h | 17 -----------
mm/page_alloc.c | 80 +++++++++++++++++++++++++++-----------------------
3 files changed, 43 insertions(+), 56 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3a4c9f72c080..a8a8b161fd65 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1813,8 +1813,6 @@ extern int __meminit early_pfn_to_nid(unsigned long pfn);
/* there is a per-arch backend function. */
extern int __meminit __early_pfn_to_nid(unsigned long pfn,
struct mminit_pfnnid_cache *state);
-bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node,
- struct mminit_pfnnid_cache *state);
#endif

extern void set_dma_reserve(unsigned long new_dma_reserve);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a67b33e52dfe..f78ca65a9884 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1036,14 +1036,6 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
#include <asm/sparsemem.h>
#endif

-#if !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && \
- !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
-static inline unsigned long early_pfn_to_nid(unsigned long pfn)
-{
- return 0;
-}
-#endif
-
#ifdef CONFIG_FLATMEM
#define pfn_to_nid(pfn) (0)
#endif
@@ -1227,15 +1219,6 @@ struct mminit_pfnnid_cache {
int last_nid;
};

-#ifdef CONFIG_NODES_SPAN_OTHER_NODES
-bool early_pfn_in_nid(unsigned long pfn, int nid);
-bool meminit_pfn_in_nid(unsigned long pfn, int node,
- struct mminit_pfnnid_cache *state);
-#else
-#define early_pfn_in_nid(pfn, nid) (1)
-#define meminit_pfn_in_nid(pfn, nid, state) (1)
-#endif
-
#ifndef early_pfn_valid
#define early_pfn_valid(pfn) (1)
#endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f556ed63b964..b148c9921740 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -907,6 +907,49 @@ void __init __free_pages_bootmem(struct page *page, unsigned long pfn,
__free_pages(page, order);
}

+#if !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && \
+ !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
+static inline unsigned long early_pfn_to_nid(unsigned long pfn)
+{
+ return 0;
+}
+#else
+/* Only safe to use early in boot when initialisation is single-threaded */
+struct __meminitdata mminit_pfnnid_cache global_init_state;
+int __meminit early_pfn_to_nid(unsigned long pfn)
+{
+ int nid;
+
+ /* The system will behave unpredictably otherwise */
+ BUG_ON(system_state != SYSTEM_BOOTING);
+
+ nid = __early_pfn_to_nid(pfn, &global_init_state);
+ if (nid >= 0)
+ return nid;
+ /* just returns 0 */
+ return 0;
+}
+#endif
+
+#ifdef CONFIG_NODES_SPAN_OTHER_NODES
+static inline bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node,
+ struct mminit_pfnnid_cache *state)
+{
+ int nid;
+
+ nid = __early_pfn_to_nid(pfn, state);
+ if (nid >= 0 && nid != node)
+ return false;
+ return true;
+}
+
+/* Only safe to use early in boot when initialisation is single-threaded */
+static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
+{
+ return meminit_pfn_in_nid(pfn, node, &global_init_state);
+}
+#endif
+
#ifdef CONFIG_CMA
/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
void __init init_cma_reserved_pageblock(struct page *page)
@@ -4481,43 +4524,6 @@ int __meminit __early_pfn_to_nid(unsigned long pfn,
}
#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */

-struct __meminitdata mminit_pfnnid_cache global_init_state;
-
-/* Only safe to use early in boot when initialisation is single-threaded */
-int __meminit early_pfn_to_nid(unsigned long pfn)
-{
- int nid;
-
- /* The system will behave unpredictably otherwise */
- BUG_ON(system_state != SYSTEM_BOOTING);
-
- nid = __early_pfn_to_nid(pfn, &global_init_state);
- if (nid >= 0)
- return nid;
- /* just returns 0 */
- return 0;
-}
-
-#ifdef CONFIG_NODES_SPAN_OTHER_NODES
-bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node,
- struct mminit_pfnnid_cache *state)
-{
- int nid;
-
- nid = __early_pfn_to_nid(pfn, state);
- if (nid >= 0 && nid != node)
- return false;
- return true;
-}
-
-/* Only safe to use early in boot when initialisation is single-threaded */
-bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
-{
- return meminit_pfn_in_nid(pfn, node, &global_init_state);
-}
-
-#endif
-
/**
* free_bootmem_with_active_regions - Call memblock_free_early_nid for each active range
* @nid: The node to free memory on. If MAX_NUMNODES, all nodes are freed.
--
2.1.2

2015-04-22 17:10:48

by Mel Gorman

Subject: [PATCH 07/13] mm: meminit: Only initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set

This patch initialises all low memory struct pages and 2G of the highest
zone on each node during memory initialisation if
CONFIG_DEFERRED_STRUCT_PAGE_INIT is set. That config option cannot be set
yet but will be made available in a later patch. Parallel initialisation
of struct pages depends on some features from memory hotplug, and it is
necessary to alter section annotations.
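
For scale, the "2G of the highest zone" limit used by update_defer_init()
below works out as follows, assuming 4KiB pages (PAGE_SHIFT == 12); the
variable name is purely illustrative:

	/* 2G expressed in pages */
	unsigned long max_initialised = 2UL << (30 - PAGE_SHIFT);
	/* = 2UL << 18 = 524288 pages, i.e. 2GiB initialised per node at boot */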

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 8 +++++
mm/internal.h | 8 +++++
mm/page_alloc.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 94 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f78ca65a9884..821f5000dec9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -762,6 +762,14 @@ typedef struct pglist_data {
/* Number of pages migrated during the rate limiting time interval */
unsigned long numabalancing_migrate_nr_pages;
#endif
+
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+ /*
+ * If memory initialisation on large machines is deferred then this
+ * is the first PFN that needs to be initialised.
+ */
+ unsigned long first_deferred_pfn;
+#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
} pg_data_t;

#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
diff --git a/mm/internal.h b/mm/internal.h
index 76b605139c7a..4a73f74846bd 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -385,6 +385,14 @@ static inline void mminit_verify_zonelist(void)
}
#endif /* CONFIG_DEBUG_MEMORY_INIT */

+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+#define __defermem_init __meminit
+#define __defer_init __meminit
+#else
+#define __defermem_init
+#define __defer_init __init
+#endif
+
/* mminit_validate_memmodel_limits is independent of CONFIG_DEBUG_MEMORY_INIT */
#if defined(CONFIG_SPARSEMEM)
extern void mminit_validate_memmodel_limits(unsigned long *start_pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b148c9921740..2d649e0a1f9e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -235,6 +235,67 @@ EXPORT_SYMBOL(nr_online_nodes);

int page_group_by_mobility_disabled __read_mostly;

+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+static inline void reset_deferred_meminit(pg_data_t *pgdat)
+{
+ pgdat->first_deferred_pfn = ULONG_MAX;
+}
+
+/* Returns true if the struct page for the pfn is uninitialised */
+static inline bool __defermem_init early_page_uninitialised(unsigned long pfn)
+{
+ int nid = early_pfn_to_nid(pfn);
+
+ if (pfn >= NODE_DATA(nid)->first_deferred_pfn)
+ return true;
+
+ return false;
+}
+
+/*
+ * Returns false when the remaining initialisation should be deferred until
+ * later in the boot cycle when it can be parallelised.
+ */
+static inline bool update_defer_init(pg_data_t *pgdat,
+ unsigned long pfn, unsigned long zone_end,
+ unsigned long *nr_initialised)
+{
+ if (!deferred_mem_init_enabled)
+ return true;
+
+ /* Always populate low zones for address-constrained allocations */
+ if (zone_end < pgdat_end_pfn(pgdat))
+ return true;
+
+ /* Initialise at least 2G of the highest zone */
+ (*nr_initialised)++;
+ if (*nr_initialised > (2UL << (30 - PAGE_SHIFT)) &&
+ (pfn & (PAGES_PER_SECTION - 1)) == 0) {
+ pgdat->first_deferred_pfn = pfn;
+ return false;
+ }
+
+ return true;
+}
+#else
+static inline void reset_deferred_meminit(pg_data_t *pgdat)
+{
+}
+
+static inline bool early_page_uninitialised(unsigned long pfn)
+{
+ return false;
+}
+
+static inline bool update_defer_init(pg_data_t *pgdat,
+ unsigned long pfn, unsigned long zone_end,
+ unsigned long *nr_initialised)
+{
+ return true;
+}
+#endif
+
+
void set_pageblock_migratetype(struct page *page, int migratetype)
{
if (unlikely(page_group_by_mobility_disabled &&
@@ -886,8 +947,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
local_irq_restore(flags);
}

-void __init __free_pages_bootmem(struct page *page, unsigned long pfn,
- unsigned int order)
+static void __defer_init __free_pages_boot_core(struct page *page,
+ unsigned long pfn, unsigned int order)
{
unsigned int nr_pages = 1 << order;
struct page *p = page;
@@ -950,6 +1011,14 @@ static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
}
#endif

+void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn,
+ unsigned int order)
+{
+ if (early_page_uninitialised(pfn))
+ return;
+ return __free_pages_boot_core(page, pfn, order);
+}
+
#ifdef CONFIG_CMA
/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
void __init init_cma_reserved_pageblock(struct page *page)
@@ -4222,14 +4291,16 @@ static void setup_zone_migrate_reserve(struct zone *zone)
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
unsigned long start_pfn, enum memmap_context context)
{
+ pg_data_t *pgdat = NODE_DATA(nid);
unsigned long end_pfn = start_pfn + size;
unsigned long pfn;
struct zone *z;
+ unsigned long nr_initialised = 0;

if (highest_memmap_pfn < end_pfn - 1)
highest_memmap_pfn = end_pfn - 1;

- z = &NODE_DATA(nid)->node_zones[zone];
+ z = &pgdat->node_zones[zone];
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
/*
* There can be holes in boot-time mem_map[]s
@@ -4241,6 +4312,9 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
continue;
if (!early_pfn_in_nid(pfn, nid))
continue;
+ if (!update_defer_init(pgdat, pfn, end_pfn,
+ &nr_initialised))
+ break;
}
__init_single_pfn(pfn, zone, nid);
}
@@ -5042,6 +5116,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
/* pg_data_t should be reset to zero when it's allocated */
WARN_ON(pgdat->nr_zones || pgdat->classzone_idx);

+ reset_deferred_meminit(pgdat);
pgdat->node_id = nid;
pgdat->node_start_pfn = node_start_pfn;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
--
2.1.2

2015-04-22 17:10:11

by Mel Gorman

Subject: [PATCH 08/13] mm: meminit: Initialise remaining struct pages in parallel with kswapd

Only a subset of struct pages are initialised at the moment. When this
patch is applied, kswapd initialises the remaining struct pages in
parallel. This should boot faster by spreading the work to multiple CPUs
and initialising data that is local to the CPU. The user-visible effect on
large machines is that free memory will appear to increase rapidly early
in the lifetime of the system until kswapd reports in the kernel log that
all memory is initialised. Once initialised there should be no other
user-visible effects.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/internal.h | 6 +++
mm/mm_init.c | 1 +
mm/page_alloc.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
mm/vmscan.c | 6 ++-
4 files changed, 123 insertions(+), 6 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 4a73f74846bd..2c4057140bec 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -388,9 +388,15 @@ static inline void mminit_verify_zonelist(void)
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
#define __defermem_init __meminit
#define __defer_init __meminit
+
+void deferred_init_memmap(int nid);
#else
#define __defermem_init
#define __defer_init __init
+
+static inline void deferred_init_memmap(int nid)
+{
+}
#endif

/* mminit_validate_memmodel_limits is independent of CONFIG_DEBUG_MEMORY_INIT */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 5f420f7fafa1..28fbf87b20aa 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -11,6 +11,7 @@
#include <linux/export.h>
#include <linux/memory.h>
#include <linux/notifier.h>
+#include <linux/sched.h>
#include "internal.h"

#ifdef CONFIG_DEBUG_MEMORY_INIT
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2d649e0a1f9e..dfe63a3c3816 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -252,6 +252,14 @@ static inline bool __defermem_init early_page_uninitialised(unsigned long pfn)
return false;
}

+static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
+{
+ if (pfn >= NODE_DATA(nid)->first_deferred_pfn)
+ return true;
+
+ return false;
+}
+
/*
* Returns false when the remaining initialisation should be deferred until
* later in the boot cycle when it can be parallelised.
@@ -287,6 +295,11 @@ static inline bool early_page_uninitialised(unsigned long pfn)
return false;
}

+static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
+{
+ return false;
+}
+
static inline bool update_defer_init(pg_data_t *pgdat,
unsigned long pfn, unsigned long zone_end,
unsigned long *nr_initialised)
@@ -883,14 +896,45 @@ static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone,
return __init_single_page(pfn_to_page(pfn), pfn, zone, nid);
}

-void reserve_bootmem_region(unsigned long start, unsigned long end)
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+static void init_reserved_page(unsigned long pfn)
+{
+ pg_data_t *pgdat;
+ int nid, zid;
+
+ if (!early_page_uninitialised(pfn))
+ return;
+
+ nid = early_pfn_to_nid(pfn);
+ pgdat = NODE_DATA(nid);
+
+ for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ struct zone *zone = &pgdat->node_zones[zid];
+
+ if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone))
+ break;
+ }
+ __init_single_pfn(pfn, zid, nid);
+}
+#else
+static inline void init_reserved_page(unsigned long pfn)
+{
+}
+#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+
+void __meminit reserve_bootmem_region(unsigned long start, unsigned long end)
{
unsigned long start_pfn = PFN_DOWN(start);
unsigned long end_pfn = PFN_UP(end);

- for (; start_pfn < end_pfn; start_pfn++)
- if (pfn_valid(start_pfn))
- SetPageReserved(pfn_to_page(start_pfn));
+ for (; start_pfn < end_pfn; start_pfn++) {
+ if (pfn_valid(start_pfn)) {
+ struct page *page = pfn_to_page(start_pfn);
+
+ init_reserved_page(start_pfn);
+ SetPageReserved(page);
+ }
+ }
}

static bool free_pages_prepare(struct page *page, unsigned int order)
@@ -1019,6 +1063,67 @@ void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn,
return __free_pages_boot_core(page, pfn, order);
}

+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+/* Initialise remaining memory on a node */
+void __defermem_init deferred_init_memmap(int nid)
+{
+ unsigned long start = jiffies;
+ struct mminit_pfnnid_cache nid_init_state = { };
+
+ pg_data_t *pgdat = NODE_DATA(nid);
+ int zid;
+ unsigned long first_init_pfn = pgdat->first_deferred_pfn;
+
+ if (first_init_pfn == ULONG_MAX)
+ return;
+
+ /* Sanity check boundaries */
+ BUG_ON(pgdat->first_deferred_pfn < pgdat->node_start_pfn);
+ BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
+ pgdat->first_deferred_pfn = ULONG_MAX;
+
+ for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ struct zone *zone = pgdat->node_zones + zid;
+ unsigned long walk_start, walk_end;
+ int i;
+
+ for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) {
+ unsigned long pfn, end_pfn;
+
+ end_pfn = min(walk_end, zone_end_pfn(zone));
+ pfn = first_init_pfn;
+ if (pfn < walk_start)
+ pfn = walk_start;
+ if (pfn < zone->zone_start_pfn)
+ pfn = zone->zone_start_pfn;
+
+ for (; pfn < end_pfn; pfn++) {
+ struct page *page;
+
+ if (!pfn_valid(pfn))
+ continue;
+
+ if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state))
+ continue;
+
+ if (page->flags) {
+ VM_BUG_ON(page_zone(page) != zone);
+ continue;
+ }
+
+ __init_single_page(page, pfn, zid, nid);
+ __free_pages_boot_core(page, pfn, 0);
+ cond_resched();
+ }
+ first_init_pfn = max(end_pfn, first_init_pfn);
+ }
+ }
+
+ pr_info("kswapd %d initialised deferred memory in %ums\n", nid,
+ jiffies_to_msecs(jiffies - start));
+}
+#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+
#ifdef CONFIG_CMA
/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
void __init init_cma_reserved_pageblock(struct page *page)
@@ -4229,6 +4334,9 @@ static void setup_zone_migrate_reserve(struct zone *zone)
zone->nr_migrate_reserve_block = reserve;

for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+ if (!early_page_nid_uninitialised(pfn, zone_to_nid(zone)))
+ return;
+
if (!pfn_valid(pfn))
continue;
page = pfn_to_page(pfn);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd71bac..c4895d26d036 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3348,7 +3348,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
* If there are applications that are active memory-allocators
* (most normal use), this basically shouldn't matter.
*/
-static int kswapd(void *p)
+static int __defermem_init kswapd(void *p)
{
unsigned long order, new_order;
unsigned balanced_order;
@@ -3383,6 +3383,8 @@ static int kswapd(void *p)
tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
set_freezable();

+ deferred_init_memmap(pgdat->node_id);
+
order = new_order = 0;
balanced_order = 0;
classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
@@ -3538,7 +3540,7 @@ static int cpu_callback(struct notifier_block *nfb, unsigned long action,
* This kswapd start function will be called by init and node-hot-add.
* On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
*/
-int kswapd_run(int nid)
+int __defermem_init kswapd_run(int nid)
{
pg_data_t *pgdat = NODE_DATA(nid);
int ret = 0;
--
2.1.2

2015-04-22 17:08:11

by Mel Gorman

Subject: [PATCH 09/13] mm: meminit: Minimise number of pfn->page lookups during initialisation

Deferred struct page initialisation is using pfn_to_page() on every PFN
unnecessarily. This patch minimises the number of lookups and scheduler
checks.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dfe63a3c3816..839e4c73ce6d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1089,6 +1089,7 @@ void __defermem_init deferred_init_memmap(int nid)

for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) {
unsigned long pfn, end_pfn;
+ struct page *page = NULL;

end_pfn = min(walk_end, zone_end_pfn(zone));
pfn = first_init_pfn;
@@ -1098,13 +1099,32 @@ void __defermem_init deferred_init_memmap(int nid)
pfn = zone->zone_start_pfn;

for (; pfn < end_pfn; pfn++) {
- struct page *page;
-
- if (!pfn_valid(pfn))
+ if (!pfn_valid_within(pfn))
continue;

- if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state))
+ /*
+ * Ensure pfn_valid is checked every
+ * MAX_ORDER_NR_PAGES for memory holes
+ */
+ if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
+ if (!pfn_valid(pfn)) {
+ page = NULL;
+ continue;
+ }
+ }
+
+ if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
+ page = NULL;
continue;
+ }
+
+ /* Minimise pfn page lookups and scheduler checks */
+ if (page && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0) {
+ page++;
+ } else {
+ page = pfn_to_page(pfn);
+ cond_resched();
+ }

if (page->flags) {
VM_BUG_ON(page_zone(page) != zone);
@@ -1113,7 +1133,6 @@ void __defermem_init deferred_init_memmap(int nid)

__init_single_page(page, pfn, zid, nid);
__free_pages_boot_core(page, pfn, 0);
- cond_resched();
}
first_init_pfn = max(end_pfn, first_init_pfn);
}
--
2.1.2

2015-04-22 17:09:32

by Mel Gorman

Subject: [PATCH 10/13] x86: mm: Enable deferred struct page initialisation on x86-64

This patch adds the Kconfig logic to enable deferred struct page
initialisation on x86-64 if NUMA is enabled. Other architectures may
enable it on a case-by-case basis after auditing early_pfn_to_nid and
testing.
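
As a usage note, the early_param added below accepts "enable" or "disable",
so the feature could be turned off from the kernel command line with
something like:

	defer_meminit=disable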

Signed-off-by: Mel Gorman <[email protected]>
---
Documentation/kernel-parameters.txt | 6 ++++++
arch/x86/Kconfig | 1 +
include/linux/mmzone.h | 14 ++++++++++++++
init/main.c | 1 +
mm/Kconfig | 28 ++++++++++++++++++++++++++++
mm/page_alloc.c | 21 +++++++++++++++++++++
6 files changed, 71 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index bfcb1a62a7b4..e7c6f7486214 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -807,6 +807,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.

debug_objects [KNL] Enable object debugging

+ defer_meminit= [KNL,X86] Enable or disable deferred struct page init.
+ Large machines may take a long time to initialise
+ memory management structures. If enabled then a
+ subset of struct pages is initialised and kswapd
+ initialises the rest in parallel.
+
no_debug_objects
[KNL] Disable object debugging

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca55187..d15d74a052d5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -32,6 +32,7 @@ config X86
select HAVE_UNSTABLE_SCHED_CLOCK
select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
select ARCH_SUPPORTS_INT128 if X86_64
+ select ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT if X86_64 && NUMA
select HAVE_IDE
select HAVE_OPROFILE
select HAVE_PCSPKR_PLATFORM
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 821f5000dec9..8ac074db364f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -822,6 +822,20 @@ static inline struct zone *lruvec_zone(struct lruvec *lruvec)
#endif
}

+
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+extern bool deferred_mem_init_enabled;
+static inline void setup_deferred_meminit(void)
+{
+ if (IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT_DEFAULT_ENABLED))
+ deferred_mem_init_enabled = true;
+}
+#else
+static inline void setup_deferred_meminit(void)
+{
+}
+#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+
#ifdef CONFIG_HAVE_MEMORY_PRESENT
void memory_present(int nid, unsigned long start, unsigned long end);
#else
diff --git a/init/main.c b/init/main.c
index 6f0f1c5ff8cc..f339d37a43e8 100644
--- a/init/main.c
+++ b/init/main.c
@@ -506,6 +506,7 @@ asmlinkage __visible void __init start_kernel(void)
boot_init_stack_canary();

cgroup_init_early();
+ setup_deferred_meminit();

local_irq_disable();
early_boot_irqs_disabled = true;
diff --git a/mm/Kconfig b/mm/Kconfig
index a03131b6ba8e..87a4535e0df4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -629,3 +629,31 @@ config MAX_STACK_SIZE_MB
changed to a smaller value in which case that is used.

A sane initial value is 80 MB.
+
+# For architectures that support deferred memory initialisation
+config ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
+ bool
+
+config DEFERRED_STRUCT_PAGE_INIT
+ bool "Defer initialisation of struct pages to kswapd"
+ default n
+ depends on ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
+ depends on MEMORY_HOTPLUG
+ help
+ Ordinarily all struct pages are initialised during early boot in a
+ single thread. On very large machines this can take a considerable
+ amount of time. If this option is set, large machines will bring up
+ a subset of memmap at boot and then initialise the rest in parallel
+ when kswapd starts. This has a potential performance impact on
+ processes running early in the lifetime of the system until kswapd
+ finishes the initialisation.
+
+config DEFERRED_STRUCT_PAGE_INIT_DEFAULT_ENABLED
+ bool "Automatically enable deferred struct page initialisation"
+ default y
+ depends on DEFERRED_STRUCT_PAGE_INIT
+ help
+ If set, struct page initialisation will be deferred by default on
+ large memory configurations. If DEFERRED_STRUCT_PAGE_INIT is set
+ then it is a reasonable default to enable this too. Users may need
+ to disable this if allocating huge pages from the command line.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 839e4c73ce6d..6b2f6c21b70f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -236,6 +236,8 @@ EXPORT_SYMBOL(nr_online_nodes);
int page_group_by_mobility_disabled __read_mostly;

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+bool __meminitdata deferred_mem_init_enabled;
+
static inline void reset_deferred_meminit(pg_data_t *pgdat)
{
pgdat->first_deferred_pfn = ULONG_MAX;
@@ -285,6 +287,25 @@ static inline bool update_defer_init(pg_data_t *pgdat,

return true;
}
+
+static int __init setup_deferred_mem_init(char *str)
+{
+ if (!str)
+ return -1;
+
+ if (!strcmp(str, "enable")) {
+ deferred_mem_init_enabled = true;
+ } else if (!strcmp(str, "disable")) {
+ deferred_mem_init_enabled = false;
+ } else {
+ pr_warn("Unable to parse deferred_mem_init=\n");
+ return -1;
+ }
+
+ return 0;
+}
+
+early_param("defer_meminit", setup_deferred_mem_init);
#else
static inline void reset_deferred_meminit(pg_data_t *pgdat)
{
--
2.1.2

2015-04-22 17:09:16

by Mel Gorman

Subject: [PATCH 11/13] mm: meminit: Free pages in large chunks where possible

Parallel struct page initialisation frees pages one at a time. This patch
tries to free aligned blocks as single high-order pages where possible.
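
For scale, assuming the common MAX_ORDER of 11 with 4KiB pages (other
configurations differ), the large-chunk path below frees:

	unsigned long chunk = MAX_ORDER_NR_PAGES;  /* 1UL << (MAX_ORDER - 1) */
	/* With MAX_ORDER == 11 and 4KiB pages: 1024 pages, i.e. 4MiB per free */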

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 41 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6b2f6c21b70f..8d3fd13a09c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1085,6 +1085,20 @@ void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn,
}

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+void __defermem_init deferred_free_range(struct page *page, unsigned long pfn,
+ int nr_pages)
+{
+ int i;
+
+ if (nr_pages == MAX_ORDER_NR_PAGES && (pfn & (MAX_ORDER_NR_PAGES-1)) == 0) {
+ __free_pages_boot_core(page, pfn, MAX_ORDER-1);
+ return;
+ }
+
+ for (i = 0; i < nr_pages; i++, page++, pfn++)
+ __free_pages_boot_core(page, pfn, 0);
+}
+
/* Initialise remaining memory on a node */
void __defermem_init deferred_init_memmap(int nid)
{
@@ -1111,6 +1125,9 @@ void __defermem_init deferred_init_memmap(int nid)
for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) {
unsigned long pfn, end_pfn;
struct page *page = NULL;
+ struct page *free_base_page = NULL;
+ unsigned long free_base_pfn = 0;
+ int nr_to_free = 0;

end_pfn = min(walk_end, zone_end_pfn(zone));
pfn = first_init_pfn;
@@ -1121,7 +1138,7 @@ void __defermem_init deferred_init_memmap(int nid)

for (; pfn < end_pfn; pfn++) {
if (!pfn_valid_within(pfn))
- continue;
+ goto free_range;

/*
* Ensure pfn_valid is checked every
@@ -1130,30 +1147,49 @@ void __defermem_init deferred_init_memmap(int nid)
if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
if (!pfn_valid(pfn)) {
page = NULL;
- continue;
+ goto free_range;
}
}

if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
page = NULL;
- continue;
+ goto free_range;
}

/* Minimise pfn page lookups and scheduler checks */
if (page && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0) {
page++;
} else {
+ deferred_free_range(free_base_page,
+ free_base_pfn, nr_to_free);
+ free_base_page = NULL;
+ free_base_pfn = nr_to_free = 0;
+
page = pfn_to_page(pfn);
cond_resched();
}

if (page->flags) {
VM_BUG_ON(page_zone(page) != zone);
- continue;
+ goto free_range;
}

__init_single_page(page, pfn, zid, nid);
- __free_pages_boot_core(page, pfn, 0);
+ if (!free_base_page) {
+ free_base_page = page;
+ free_base_pfn = pfn;
+ nr_to_free = 0;
+ }
+ nr_to_free++;
+
+ /* Where possible, batch up pages for a single free */
+ continue;
+free_range:
+ /* Free the current block of pages to allocator */
+ if (free_base_page)
+ deferred_free_range(free_base_page, free_base_pfn, nr_to_free);
+ free_base_page = NULL;
+ free_base_pfn = nr_to_free = 0;
}
first_init_pfn = max(end_pfn, first_init_pfn);
}
--
2.1.2

2015-04-22 17:08:35

by Mel Gorman

Subject: [PATCH 12/13] mm: meminit: Reduce number of times pageblocks are set during struct page init

During parallel struct page initialisation, the zone range is checked for
every PFN unnecessarily, which increases boot times. The pageblock
migratetype only needs to be set once per pageblock, so this patch alters
when the ranges are checked and sets the migratetype only for
pageblock-aligned PFNs.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 45 +++++++++++++++++++++++----------------------
1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8d3fd13a09c9..945d56667b61 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -876,33 +876,12 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
{
- struct zone *z = &NODE_DATA(nid)->node_zones[zone];
-
set_page_links(page, zone, nid, pfn);
mminit_verify_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);

- /*
- * Mark the block movable so that blocks are reserved for
- * movable at startup. This will force kernel allocations
- * to reserve their blocks rather than leaking throughout
- * the address space during boot when many long-lived
- * kernel allocations are made. Later some blocks near
- * the start are marked MIGRATE_RESERVE by
- * setup_zone_migrate_reserve()
- *
- * bitmap is created for zone's valid pfn range. but memmap
- * can be created for invalid pages (for alignment)
- * check here not to call set_pageblock_migratetype() against
- * pfn out of zone.
- */
- if ((z->zone_start_pfn <= pfn)
- && (pfn < zone_end_pfn(z))
- && !(pfn & (pageblock_nr_pages - 1)))
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-
INIT_LIST_HEAD(&page->lru);
#ifdef WANT_PAGE_VIRTUAL
/* The shift won't overflow because ZONE_NORMAL is below 4G. */
@@ -1091,6 +1070,7 @@ void __defermem_init deferred_free_range(struct page *page, unsigned long pfn,
int i;

if (nr_pages == MAX_ORDER_NR_PAGES && (pfn & (MAX_ORDER_NR_PAGES-1)) == 0) {
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
__free_pages_boot_core(page, pfn, MAX_ORDER-1);
return;
}
@@ -4500,7 +4480,28 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
&nr_initialised))
break;
}
- __init_single_pfn(pfn, zone, nid);
+
+ /*
+ * Mark the block movable so that blocks are reserved for
+ * movable at startup. This will force kernel allocations
+ * to reserve their blocks rather than leaking throughout
+ * the address space during boot when many long-lived
+ * kernel allocations are made. Later some blocks near
+ * the start are marked MIGRATE_RESERVE by
+ * setup_zone_migrate_reserve()
+ *
+ * bitmap is created for zone's valid pfn range. but memmap
+ * can be created for invalid pages (for alignment)
+ * check here not to call set_pageblock_migratetype() against
+ * pfn out of zone.
+ */
+ if (!(pfn & (pageblock_nr_pages - 1))) {
+ struct page *page = pfn_to_page(pfn);
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+ __init_single_page(page, pfn, zone, nid);
+ } else {
+ __init_single_pfn(pfn, zone, nid);
+ }
}
}

--
2.1.2

2015-04-22 17:08:14

by Mel Gorman

Subject: [PATCH 13/13] mm: meminit: Remove mminit_verify_page_links

mminit_verify_page_links() is an extremely paranoid check that was introduced
when memory initialisation was being heavily reworked. Profiles indicated
that up to 10% of parallel memory initialisation was spent on checking
this for every page. The cost could be reduced but in practice this check
only found problems very early during the initialisation rewrite and has
found nothing since. This patch removes an expensive unnecessary check.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/internal.h | 8 --------
mm/mm_init.c | 8 --------
mm/page_alloc.c | 1 -
3 files changed, 17 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 2c4057140bec..c73ad248f8f4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -360,10 +360,7 @@ do { \
} while (0)

extern void mminit_verify_pageflags_layout(void);
-extern void mminit_verify_page_links(struct page *page,
- enum zone_type zone, unsigned long nid, unsigned long pfn);
extern void mminit_verify_zonelist(void);
-
#else

static inline void mminit_dprintk(enum mminit_level level,
@@ -375,11 +372,6 @@ static inline void mminit_verify_pageflags_layout(void)
{
}

-static inline void mminit_verify_page_links(struct page *page,
- enum zone_type zone, unsigned long nid, unsigned long pfn)
-{
-}
-
static inline void mminit_verify_zonelist(void)
{
}
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 28fbf87b20aa..fdadf918de76 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -131,14 +131,6 @@ void __init mminit_verify_pageflags_layout(void)
BUG_ON(or_mask != add_mask);
}

-void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone,
- unsigned long nid, unsigned long pfn)
-{
- BUG_ON(page_to_nid(page) != nid);
- BUG_ON(page_zonenum(page) != zone);
- BUG_ON(page_to_pfn(page) != pfn);
-}
-
static __init int set_mminit_loglevel(char *str)
{
get_option(&str, &mminit_loglevel);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 945d56667b61..cb84d67ae9ba 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -877,7 +877,6 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
{
set_page_links(page, zone, nid, pfn);
- mminit_verify_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);
--
2.1.2

2015-04-22 17:11:15

by Mel Gorman

Subject: Re: [RFC PATCH 0/14] Parallel struct page initialisation v5r4

The 0/14 is a mistake. There are only 13 patches.

--
Mel Gorman
SUSE Labs

2015-04-22 23:45:05

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86: mm: Enable deferred struct page initialisation on x86-64

On Wed, 22 Apr 2015 18:07:50 +0100 Mel Gorman <[email protected]> wrote:

> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -32,6 +32,7 @@ config X86
> select HAVE_UNSTABLE_SCHED_CLOCK
> select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
> select ARCH_SUPPORTS_INT128 if X86_64
> + select ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT if X86_64 && NUMA

Put this in the "config X86_64" section and skip the "X86_64 &&"?
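
A minimal sketch of that placement, assuming the usual layout of the X86_64
block in arch/x86/Kconfig (illustrative only, not the updated patch):

config X86_64
	def_bool y
	depends on 64BIT
	select ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT if NUMA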

Can we omit the whole defer_meminit= thing and permanently enable the
feature? That's simpler, provides better test coverage and is, we
hope, faster.

And can this be used on non-NUMA? Presumably that won't speed things
up any if we're bandwidth limited but again it's simpler and provides
better coverage.

2015-04-23 09:23:35

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86: mm: Enable deferred struct page initialisation on x86-64

On Wed, Apr 22, 2015 at 04:45:00PM -0700, Andrew Morton wrote:
> On Wed, 22 Apr 2015 18:07:50 +0100 Mel Gorman <[email protected]> wrote:
>
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -32,6 +32,7 @@ config X86
> > select HAVE_UNSTABLE_SCHED_CLOCK
> > select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
> > select ARCH_SUPPORTS_INT128 if X86_64
> > + select ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT if X86_64 && NUMA
>
> Put this in the "config X86_64" section and skip the "X86_64 &&"?
>

Done.

> Can we omit the whole defer_meminit= thing and permanently enable the
> feature? That's simpler, provides better test coverage and is, we
> hope, faster.
>

Yes. The intent was to have a workaround in case there were any failures like
Waiman's vmalloc failures in an earlier version, but those are bugs that
should be fixed.

> And can this be used on non-NUMA? Presumably that won't speed things
> up any if we're bandwidth limited but again it's simpler and provides
> better coverage.

Nothing prevents it. There is less opportunity for parallelism but
improving coverage is desirable.

--
Mel Gorman
SUSE Labs

2015-04-24 14:36:00

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86: mm: Enable deferred struct page initialisation on x86-64

On 04/23/2015 05:23 AM, Mel Gorman wrote:
> On Wed, Apr 22, 2015 at 04:45:00PM -0700, Andrew Morton wrote:
>> On Wed, 22 Apr 2015 18:07:50 +0100 Mel Gorman<[email protected]> wrote:
>>
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -32,6 +32,7 @@ config X86
>>> select HAVE_UNSTABLE_SCHED_CLOCK
>>> select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
>>> select ARCH_SUPPORTS_INT128 if X86_64
>>> + select ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT if X86_64 && NUMA
>> Put this in the "config X86_64" section and skip the "X86_64 &&"?
>>
> Done.
>
>> Can we omit the whole defer_meminit= thing and permanently enable the
>> feature? That's simpler, provides better test coverage and is, we
>> hope, faster.
>>
> Yes. The intent was to have a workaround if there were any failures like
> Waiman's vmalloc failures in an earlier version but they are bugs that
> should be fixed.
>
>> And can this be used on non-NUMA? Presumably that won't speed things
>> up any if we're bandwidth limited but again it's simpler and provides
>> better coverage.
> Nothing prevents it. There is less opportunity for parallelism but
> improving coverage is desirable.
>

Memory access latency for remote node memory can be more than double that
of local memory. Bandwidth can also be much lower depending on what kind of
interconnect is between the 2 nodes. So it is better to do it in a
NUMA-aware way. Within a NUMA node, however, we could split the memory
initialization across 2 or more local CPUs if the memory size is big enough.
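
For illustration, the kind of intra-node split being suggested might look
roughly like the sketch below; every name here is hypothetical and this is
not part of the posted series, which uses one kswapd thread per node:

/*
 * Sketch only: divide one node's deferred pfn range between nr_threads
 * local CPUs. init_pfn_range_on_local_cpu() is a hypothetical helper.
 */
static void split_node_meminit(unsigned long start_pfn,
			       unsigned long end_pfn, int nr_threads)
{
	unsigned long chunk = (end_pfn - start_pfn) / nr_threads;
	int i;

	for (i = 0; i < nr_threads; i++) {
		unsigned long cs = start_pfn + i * chunk;
		unsigned long ce = (i == nr_threads - 1) ?
				   end_pfn : cs + chunk;

		/* Each chunk would be handed to its own local thread. */
		init_pfn_range_on_local_cpu(cs, ce);
	}
}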

Cheers,
Longman

2015-04-24 15:20:13

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86: mm: Enable deferred struct page initialisation on x86-64

On Fri, Apr 24, 2015 at 10:35:49AM -0400, Waiman Long wrote:
> On 04/23/2015 05:23 AM, Mel Gorman wrote:
> >On Wed, Apr 22, 2015 at 04:45:00PM -0700, Andrew Morton wrote:
> >>On Wed, 22 Apr 2015 18:07:50 +0100 Mel Gorman<[email protected]> wrote:
> >>
> >>>--- a/arch/x86/Kconfig
> >>>+++ b/arch/x86/Kconfig
> >>>@@ -32,6 +32,7 @@ config X86
> >>> select HAVE_UNSTABLE_SCHED_CLOCK
> >>> select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
> >>> select ARCH_SUPPORTS_INT128 if X86_64
> >>>+ select ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT if X86_64 && NUMA
> >>Put this in the "config X86_64" section and skip the "X86_64 &&"?
> >>
> >Done.
> >
> >>Can we omit the whole defer_meminit= thing and permanently enable the
> >>feature? That's simpler, provides better test coverage and is, we
> >>hope, faster.
> >>
> >Yes. The intent was to have a workaround if there were any failures like
> >Waiman's vmalloc failures in an earlier version but they are bugs that
> >should be fixed.
> >
> >>And can this be used on non-NUMA? Presumably that won't speed things
> >>up any if we're bandwidth limited but again it's simpler and provides
> >>better coverage.
> >Nothing prevents it. There is less opportunity for parallelism but
> >improving coverage is desirable.
> >
>
> Memory access latency can be more than double for local vs. remote
> node memory. Bandwidth can also be much lower depending on what kind
> of interconnect is between the 2 nodes. So it is better to do it in
> a NUMA-aware way.

I do not believe that is what he was asking. He was asking if we could
defer memory initialisation even when there is only one node. It does not
gain much in terms of boot times but it improves testing coverage.

> Within a NUMA node, however, we can split the
> memory initialization to 2 or more local CPUs if the memory size is
> big enough.
>

I considered it but discarded the idea. It'd be more complex to set up, and
the two CPUs could simply end up contending on the same memory bus as
well as contending on zone->lock.

--
Mel Gorman
SUSE Labs

2015-04-24 19:04:34

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86: mm: Enable deferred struct page initialisation on x86-64

On 04/24/2015 11:20 AM, Mel Gorman wrote:
> On Fri, Apr 24, 2015 at 10:35:49AM -0400, Waiman Long wrote:
>> On 04/23/2015 05:23 AM, Mel Gorman wrote:
>>> On Wed, Apr 22, 2015 at 04:45:00PM -0700, Andrew Morton wrote:
>>>> On Wed, 22 Apr 2015 18:07:50 +0100 Mel Gorman<[email protected]> wrote:
>>>>
>>>>> --- a/arch/x86/Kconfig
>>>>> +++ b/arch/x86/Kconfig
>>>>> @@ -32,6 +32,7 @@ config X86
>>>>> select HAVE_UNSTABLE_SCHED_CLOCK
>>>>> select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
>>>>> select ARCH_SUPPORTS_INT128 if X86_64
>>>>> + select ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT if X86_64 && NUMA
>>>> Put this in the "config X86_64" section and skip the "X86_64 &&"?
>>>>
>>> Done.
>>>
>>>> Can we omit the whole defer_meminit= thing and permanently enable the
>>>> feature? That's simpler, provides better test coverage and is, we
>>>> hope, faster.
>>>>
>>> Yes. The intent was to have a workaround if there were any failures like
>>> Waiman's vmalloc failures in an earlier version but they are bugs that
>>> should be fixed.
>>>
>>>> And can this be used on non-NUMA? Presumably that won't speed things
>>>> up any if we're bandwidth limited but again it's simpler and provides
>>>> better coverage.
>>> Nothing prevents it. There is less opportunity for parallelism but
>>> improving coverage is desirable.
>>>
>> Memory access latency can be more than double for local vs. remote
>> node memory. Bandwidth can also be much lower depending on what kind
>> of interconnect is between the 2 nodes. So it is better to do it in
>> a NUMA-aware way.
> I do not believe that is what he was asking. He was asking if we could
> defer memory initialisation even when there is only one node. It does not
> gain much in terms of boot times but it improves testing coverage.

Thanks for the clarification.

>> Within a NUMA node, however, we can split the
>> memory initialization to 2 or more local CPUs if the memory size is
>> big enough.
>>
> I considered it but discarded the idea. It'd be more complex to setup and
> the two CPUs could simply end up contending on the same memory bus as
> well as contending on zone->lock.
>

I don't think we need that now. However, we may have to consider it one day,
when even a single node can have TBs of memory, unless we move to a page
size larger than 4k.

Cheers,
Longman

2015-04-25 17:29:07

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86: mm: Enable deferred struct page initialisation on x86-64

On Fri, Apr 24, 2015 at 03:04:27PM -0400, Waiman Long wrote:
> >>Within a NUMA node, however, we can split the
> >>memory initialization to 2 or more local CPUs if the memory size is
> >>big enough.
> >>
> >I considered it but discarded the idea. It'd be more complex to setup and
> >the two CPUs could simply end up contending on the same memory bus as
> >well as contending on zone->lock.
> >
>
> I don't think we need that now. However, we may have to consider
> this when one day even a single node can have TBs of memory unless
> we move to a page size larger than 4k.
>

We'll cross that bridge when we come to it. I suspect there is more room
for improvement in the initialisation that would be worth trying before
resorting to more threads. With more threads there is a risk of memory bus
contention, and a high risk that things actually get worse due to contention
on zone->lock when freeing the pages.

In the meantime, do you mind updating the before/after figures for your
test machine with this series please?

--
Mel Gorman
SUSE Labs

2015-04-27 20:07:47

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86: mm: Enable deferred struct page initialisation on x86-64

On 04/25/2015 01:28 PM, Mel Gorman wrote:
> On Fri, Apr 24, 2015 at 03:04:27PM -0400, Waiman Long wrote:
>>>> Within a NUMA node, however, we can split the
>>>> memory initialization to 2 or more local CPUs if the memory size is
>>>> big enough.
>>>>
>>> I considered it but discarded the idea. It'd be more complex to setup and
>>> the two CPUs could simply end up contending on the same memory bus as
>>> well as contending on zone->lock.
>>>
>> I don't think we need that now. However, we may have to consider
>> this when one day even a single node can have TBs of memory unless
>> we move to a page size larger than 4k.
>>
> We'll cross that bridge when we come to it. I suspect there is more room
> for improvement in the initialisation that would be worth trying before
> resorting to more threads. With more threads there is a risk that we hit
> memory bus contention and a high risk that it actually is worse due to
> contending on zone->lock when freeing the pages.
>
> In the meantime, do you mind updating the before/after figures for your
> test machine with this series please?
>

I will test the latest patch once I get my hands on a 12TB machine.

Cheers,
Longman