2018-03-13 18:27:21

by Pavel Tatashin

[permalink] [raw]
Subject: [v6 0/2] initialize pages on demand during boot

Change log:
v5 - v6
- Fixed issues found by Andrew Morton: replaced cond_resched() with
touch_nmi_watchdog(), instead of simply deleting it.
- Removed useless pgdata_resize_lock_irq(), as regular
pgdata_resize_lock() does exactly what is needed.
- Included fixes to comments by Andrew from
mm-initialize-pages-on-demand-during-boot-v5-fix.patch.

v4 - v5
- Fix issue reported by Vlasimil Babka:
> I've noticed that this function first disables the
> on-demand initialization, and then runs the kthreads.
> Doesn't that leave a window where allocations can fail? The
> chances are probably small, but I think it would be better
> to avoid it completely, rare failures suck.
>
> Fixing that probably means rethinking the whole
> synchronization more dramatically though :/
- Introduced a new patch that uses node resize lock to synchronize
on-demand deferred page initialization, and regular deferred page
initialization.

v3 - v4
- Fix !CONFIG_NUMA issue.

v2 - v3
Andrew Morton's comments:
- Moved read of pgdat->first_deferred_pfn into
deferred_zone_grow_lock, thus got rid of READ_ONCE()/WRITE_ONCE()
- Replaced spin_lock() with spin_lock_irqsave() in
deferred_grow_zone
- Updated comments for deferred_zone_grow_lock
- Updated comment before deferred_grow_zone() explaining return
value, and also noinline specifier.
- Fixed comment before _deferred_grow_zone().

v1 - v2
Added Tested-by: Masayoshi Mizuma

This change helps for three reasons:

1. Insufficient amount of reserved memory due to arguments provided by
user. User may request some buffers, increased hash tables sizes etc.
Currently, machine panics during boot if it can't allocate memory due
to insufficient amount of reserved memory. With this change, it will
be able to grow zone before deferred pages are initialized.

One observed example is described in the linked discussion [1] Mel
Gorman writes:

"
Yasuaki Ishimatsu reported a premature OOM when trace_buf_size=100m was
specified on a machine with many CPUs. The kernel tried to allocate 38.4GB
but only 16GB was available due to deferred memory initialisation.
"

The allocations in the above scenario happen per-cpu in smp_init(),
and before deferred pages are initialized. So, there is no way to
predict how much memory we should put aside to boot successfully with
deferred page initialization feature compiled in.

2. The second reason is future proof. The kernel memory requirements
may change, and we do not want to constantly update
reset_deferred_meminit() to satisfy the new requirements. In addition,
this function is currently in common code, but potentially would need
to be split into arch specific variants, as more arches will start
taking advantage of deferred page initialization feature.

3. On demand initialization of reserved pages guarantees that we will
initialize only as many pages early in boot using only one thread as
needed, the rest are going to be efficiently initialized in parallel.

[1] https://www.spinics.net/lists/linux-mm/msg139087.html

Pavel Tatashin (2):
mm: disable interrupts while initializing deferred pages
mm: initialize pages on demand during boot

include/linux/memblock.h | 10 --
include/linux/memory_hotplug.h | 53 ++++++-----
include/linux/mmzone.h | 5 +-
mm/memblock.c | 23 -----
mm/page_alloc.c | 202 +++++++++++++++++++++++++++++++----------
5 files changed, 186 insertions(+), 107 deletions(-)

--
2.16.2



2018-03-13 18:26:03

by Pavel Tatashin

[permalink] [raw]
Subject: [v6 2/2] mm: initialize pages on demand during boot

Deferred page initialization allows the boot cpu to initialize a small
subset of the system's pages early in boot, with other cpus doing the rest
later on.

It is, however, problematic to know how many pages the kernel needs during
boot. Different modules and kernel parameters may change the requirement,
so the boot cpu either initializes too many pages or runs out of memory.

To fix that, initialize early pages on demand. This ensures the kernel
does the minimum amount of work to initialize pages during boot and leaves
the rest to be divided in the multithreaded initialization path
(deferred_init_memmap).

The on-demand code is permanently disabled using static branching once
deferred pages are initialized. After the static branch is changed to
false, the overhead is up-to two branch-always instructions if the zone
watermark check fails or if rmqueue fails.

Sergey Senozhatsky noticed that while deferred pages currently make sense
only on NUMA machines (we start one thread per latency node), CONFIG_NUMA
is not a requirement for CONFIG_DEFERRED_STRUCT_PAGE_INIT, so that is also
must be addressed in the patch.

Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Daniel Jordan <[email protected]>
Reviewed-by: Steven Sistare <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Tested-by: Masayoshi Mizuma <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
include/linux/memblock.h | 10 ---
mm/memblock.c | 23 ------
mm/page_alloc.c | 183 +++++++++++++++++++++++++++++++++++++----------
3 files changed, 144 insertions(+), 72 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 8be5077efb5f..6c305afd95ab 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -417,21 +417,11 @@ static inline void early_memtest(phys_addr_t start, phys_addr_t end)
{
}
#endif
-
-extern unsigned long memblock_reserved_memory_within(phys_addr_t start_addr,
- phys_addr_t end_addr);
#else
static inline phys_addr_t memblock_alloc(phys_addr_t size, phys_addr_t align)
{
return 0;
}
-
-static inline unsigned long memblock_reserved_memory_within(phys_addr_t start_addr,
- phys_addr_t end_addr)
-{
- return 0;
-}
-
#endif /* CONFIG_HAVE_MEMBLOCK */

#endif /* __KERNEL__ */
diff --git a/mm/memblock.c b/mm/memblock.c
index b6ba6b7adadc..80a12c64b203 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1778,29 +1778,6 @@ static void __init_memblock memblock_dump(struct memblock_type *type)
}
}

-extern unsigned long __init_memblock
-memblock_reserved_memory_within(phys_addr_t start_addr, phys_addr_t end_addr)
-{
- struct memblock_region *rgn;
- unsigned long size = 0;
- int idx;
-
- for_each_memblock_type(idx, (&memblock.reserved), rgn) {
- phys_addr_t start, end;
-
- if (rgn->base + rgn->size < start_addr)
- continue;
- if (rgn->base > end_addr)
- continue;
-
- start = rgn->base;
- end = start + rgn->size;
- size += end - start;
- }
-
- return size;
-}
-
void __init_memblock __memblock_dump_all(void)
{
pr_info("MEMBLOCK configuration:\n");
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cada509e2176..529e2dce7d16 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -292,40 +292,6 @@ EXPORT_SYMBOL(nr_online_nodes);
int page_group_by_mobility_disabled __read_mostly;

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-
-/*
- * Determine how many pages need to be initialized during early boot
- * (non-deferred initialization).
- * The value of first_deferred_pfn will be set later, once non-deferred pages
- * are initialized, but for now set it ULONG_MAX.
- */
-static inline void reset_deferred_meminit(pg_data_t *pgdat)
-{
- phys_addr_t start_addr, end_addr;
- unsigned long max_pgcnt;
- unsigned long reserved;
-
- /*
- * Initialise at least 2G of a node but also take into account that
- * two large system hashes that can take up 1GB for 0.25TB/node.
- */
- max_pgcnt = max(2UL << (30 - PAGE_SHIFT),
- (pgdat->node_spanned_pages >> 8));
-
- /*
- * Compensate the all the memblock reservations (e.g. crash kernel)
- * from the initial estimation to make sure we will initialize enough
- * memory to boot.
- */
- start_addr = PFN_PHYS(pgdat->node_start_pfn);
- end_addr = PFN_PHYS(pgdat->node_start_pfn + max_pgcnt);
- reserved = memblock_reserved_memory_within(start_addr, end_addr);
- max_pgcnt += PHYS_PFN(reserved);
-
- pgdat->static_init_pgcnt = min(max_pgcnt, pgdat->node_spanned_pages);
- pgdat->first_deferred_pfn = ULONG_MAX;
-}
-
/* Returns true if the struct page for the pfn is uninitialised */
static inline bool __meminit early_page_uninitialised(unsigned long pfn)
{
@@ -361,10 +327,6 @@ static inline bool update_defer_init(pg_data_t *pgdat,
return true;
}
#else
-static inline void reset_deferred_meminit(pg_data_t *pgdat)
-{
-}
-
static inline bool early_page_uninitialised(unsigned long pfn)
{
return false;
@@ -1611,6 +1573,117 @@ static int __init deferred_init_memmap(void *data)
pgdat_init_report_one_done();
return 0;
}
+
+/*
+ * During boot we initialize deferred pages on-demand, as needed, but once
+ * page_alloc_init_late() has finished, the deferred pages are all initialized,
+ * and we can permanently disable that path.
+ */
+static DEFINE_STATIC_KEY_TRUE(deferred_pages);
+
+/*
+ * If this zone has deferred pages, try to grow it by initializing enough
+ * deferred pages to satisfy the allocation specified by order, rounded up to
+ * the nearest PAGES_PER_SECTION boundary. So we're adding memory in increments
+ * of SECTION_SIZE bytes by initializing struct pages in increments of
+ * PAGES_PER_SECTION * sizeof(struct page) bytes.
+ *
+ * Return true when zone was grown, otherwise return false. We return true even
+ * when we grow less than requested, to let the caller decide if there are
+ * enough pages to satisfy the allocation.
+ *
+ * Note: We use noinline because this function is needed only during boot, and
+ * it is called from a __ref function _deferred_grow_zone. This way we are
+ * making sure that it is not inlined into permanent text section.
+ */
+static noinline bool __init
+deferred_grow_zone(struct zone *zone, unsigned int order)
+{
+ int zid = zone_idx(zone);
+ int nid = zone_to_nid(zone);
+ pg_data_t *pgdat = NODE_DATA(nid);
+ unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
+ unsigned long nr_pages = 0;
+ unsigned long first_init_pfn, spfn, epfn, t, flags;
+ unsigned long first_deferred_pfn = pgdat->first_deferred_pfn;
+ phys_addr_t spa, epa;
+ u64 i;
+
+ /* Only the last zone may have deferred pages */
+ if (zone_end_pfn(zone) != pgdat_end_pfn(pgdat))
+ return false;
+
+ pgdat_resize_lock(pgdat, &flags);
+
+ /*
+ * If deferred pages have been initialized while we were waiting for
+ * the lock, return true, as the zone was grown. The caller will retry
+ * this zone. We won't return to this function since the caller also
+ * has this static branch.
+ */
+ if (!static_branch_unlikely(&deferred_pages)) {
+ pgdat_resize_unlock(pgdat, &flags);
+ return true;
+ }
+
+ /*
+ * If someone grew this zone while we were waiting for spinlock, return
+ * true, as there might be enough pages already.
+ */
+ if (first_deferred_pfn != pgdat->first_deferred_pfn) {
+ pgdat_resize_unlock(pgdat, &flags);
+ return true;
+ }
+
+ first_init_pfn = max(zone->zone_start_pfn, first_deferred_pfn);
+
+ if (first_init_pfn >= pgdat_end_pfn(pgdat)) {
+ pgdat_resize_unlock(pgdat, &flags);
+ return false;
+ }
+
+ for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
+ spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
+ epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
+
+ while (spfn < epfn && nr_pages < nr_pages_needed) {
+ t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
+ first_deferred_pfn = min(t, epfn);
+ nr_pages += deferred_init_pages(nid, zid, spfn,
+ first_deferred_pfn);
+ spfn = first_deferred_pfn;
+ }
+
+ if (nr_pages >= nr_pages_needed)
+ break;
+ }
+
+ for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
+ spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
+ epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
+ deferred_free_pages(nid, zid, spfn, epfn);
+
+ if (first_deferred_pfn == epfn)
+ break;
+ }
+ pgdat->first_deferred_pfn = first_deferred_pfn;
+ pgdat_resize_unlock(pgdat, &flags);
+
+ return nr_pages > 0;
+}
+
+/*
+ * deferred_grow_zone() is __init, but it is called from
+ * get_page_from_freelist() during early boot until deferred_pages permanently
+ * disables this call. This is why we have refdata wrapper to avoid warning,
+ * and to ensure that the function body gets unloaded.
+ */
+static bool __ref
+_deferred_grow_zone(struct zone *zone, unsigned int order)
+{
+ return deferred_grow_zone(zone, order);
+}
+
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */

void __init page_alloc_init_late(void)
@@ -1629,6 +1702,12 @@ void __init page_alloc_init_late(void)
/* Block until all are initialised */
wait_for_completion(&pgdat_init_all_done_comp);

+ /*
+ * We initialized the rest of the deferred pages. Permanently disable
+ * on-demand struct page initialization.
+ */
+ static_branch_disable(&deferred_pages);
+
/* Reinit limits that are based on free pages after the kernel is up */
files_maxfiles_init();
#endif
@@ -3206,6 +3285,16 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
ac_classzone_idx(ac), alloc_flags)) {
int ret;

+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+ /*
+ * Watermark failed for this zone, but see if we can
+ * grow this zone if it contains deferred pages.
+ */
+ if (static_branch_unlikely(&deferred_pages)) {
+ if (_deferred_grow_zone(zone, order))
+ goto try_this_zone;
+ }
+#endif
/* Checked here to keep the fast path fast */
BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
if (alloc_flags & ALLOC_NO_WATERMARKS)
@@ -3247,6 +3336,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
reserve_highatomic_pageblock(page, zone, order);

return page;
+ } else {
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+ /* Try again if zone has deferred pages */
+ if (static_branch_unlikely(&deferred_pages)) {
+ if (_deferred_grow_zone(zone, order))
+ goto try_this_zone;
+ }
+#endif
}
}

@@ -6259,7 +6356,15 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,

alloc_node_mem_map(pgdat);

- reset_deferred_meminit(pgdat);
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+ /*
+ * We start only with one section of pages, more pages are added as
+ * needed until the rest of deferred pages are initialized.
+ */
+ pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
+ pgdat->node_spanned_pages);
+ pgdat->first_deferred_pfn = ULONG_MAX;
+#endif
free_area_init_core(pgdat);
}

--
2.16.2


2018-03-13 18:27:50

by Pavel Tatashin

[permalink] [raw]
Subject: [v6 1/2] mm: disable interrupts while initializing deferred pages

Vlastimil Babka reported about a window issue during which when deferred
pages are initialized, and the current version of on-demand initialization
is finished, allocations may fail. While this is highly unlikely scenario,
since this kind of allocation request must be large, and must come from
interrupt handler, we still want to cover it.

We solve this by initializing deferred pages with interrupts disabled, and
holding node_size_lock spin lock while pages in the node are being
initialized. The on-demand deferred page initialization that comes later
will use the same lock, and thus synchronize with deferred_init_memmap().

It is unlikely for threads that initialize deferred pages to be
interrupted. They run soon after smp_init(), but before modules are
initialized, and long before user space programs. This is why there is no
adverse effect of having these threads running with interrupts disabled.

Signed-off-by: Pavel Tatashin <[email protected]>
---
include/linux/memory_hotplug.h | 53 ++++++++++++++++++++++--------------------
include/linux/mmzone.h | 5 ++--
mm/page_alloc.c | 19 ++++++++-------
3 files changed, 42 insertions(+), 35 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index aba5f86eb038..2b0265265c28 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -51,24 +51,6 @@ enum {
MMOP_ONLINE_MOVABLE,
};

-/*
- * pgdat resizing functions
- */
-static inline
-void pgdat_resize_lock(struct pglist_data *pgdat, unsigned long *flags)
-{
- spin_lock_irqsave(&pgdat->node_size_lock, *flags);
-}
-static inline
-void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned long *flags)
-{
- spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
-}
-static inline
-void pgdat_resize_init(struct pglist_data *pgdat)
-{
- spin_lock_init(&pgdat->node_size_lock);
-}
/*
* Zone resizing functions
*
@@ -246,13 +228,6 @@ extern void clear_zone_contiguous(struct zone *zone);
___page; \
})

-/*
- * Stub functions for when hotplug is off
- */
-static inline void pgdat_resize_lock(struct pglist_data *p, unsigned long *f) {}
-static inline void pgdat_resize_unlock(struct pglist_data *p, unsigned long *f) {}
-static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
-
static inline unsigned zone_span_seqbegin(struct zone *zone)
{
return 0;
@@ -293,6 +268,34 @@ static inline bool movable_node_is_enabled(void)
}
#endif /* ! CONFIG_MEMORY_HOTPLUG */

+#if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT)
+/*
+ * pgdat resizing functions
+ */
+static inline
+void pgdat_resize_lock(struct pglist_data *pgdat, unsigned long *flags)
+{
+ spin_lock_irqsave(&pgdat->node_size_lock, *flags);
+}
+static inline
+void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned long *flags)
+{
+ spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
+}
+static inline
+void pgdat_resize_init(struct pglist_data *pgdat)
+{
+ spin_lock_init(&pgdat->node_size_lock);
+}
+#else /* !(CONFIG_MEMORY_HOTPLUG || CONFIG_DEFERRED_STRUCT_PAGE_INIT) */
+/*
+ * Stub functions for when hotplug is off
+ */
+static inline void pgdat_resize_lock(struct pglist_data *p, unsigned long *f) {}
+static inline void pgdat_resize_unlock(struct pglist_data *p, unsigned long *f) {}
+static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
+#endif /* !(CONFIG_MEMORY_HOTPLUG || CONFIG_DEFERRED_STRUCT_PAGE_INIT) */
+
#ifdef CONFIG_MEMORY_HOTREMOVE

extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7522a6987595..d14168da66a7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -633,14 +633,15 @@ typedef struct pglist_data {
#ifndef CONFIG_NO_BOOTMEM
struct bootmem_data *bdata;
#endif
-#ifdef CONFIG_MEMORY_HOTPLUG
+#if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT)
/*
* Must be held any time you expect node_start_pfn, node_present_pages
* or node_spanned_pages stay constant. Holding this will also
* guarantee that any pfn_valid() stays that way.
*
* pgdat_resize_lock() and pgdat_resize_unlock() are provided to
- * manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG.
+ * manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG
+ * or CONFIG_DEFERRED_STRUCT_PAGE_INIT.
*
* Nests above zone->lock and zone->span_seqlock
*/
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3d974cb2a1a1..cada509e2176 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1506,7 +1506,7 @@ static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
} else if (!(pfn & nr_pgmask)) {
deferred_free_range(pfn - nr_free, nr_free);
nr_free = 1;
- cond_resched();
+ touch_nmi_watchdog();
} else {
nr_free++;
}
@@ -1535,7 +1535,7 @@ static unsigned long __init deferred_init_pages(int nid, int zid,
continue;
} else if (!page || !(pfn & nr_pgmask)) {
page = pfn_to_page(pfn);
- cond_resched();
+ touch_nmi_watchdog();
} else {
page++;
}
@@ -1552,23 +1552,25 @@ static int __init deferred_init_memmap(void *data)
int nid = pgdat->node_id;
unsigned long start = jiffies;
unsigned long nr_pages = 0;
- unsigned long spfn, epfn;
+ unsigned long spfn, epfn, first_init_pfn, flags;
phys_addr_t spa, epa;
int zid;
struct zone *zone;
- unsigned long first_init_pfn = pgdat->first_deferred_pfn;
const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
u64 i;

+ /* Bind memory initialisation thread to a local node if possible */
+ if (!cpumask_empty(cpumask))
+ set_cpus_allowed_ptr(current, cpumask);
+
+ pgdat_resize_lock(pgdat, &flags);
+ first_init_pfn = pgdat->first_deferred_pfn;
if (first_init_pfn == ULONG_MAX) {
+ pgdat_resize_unlock(pgdat, &flags);
pgdat_init_report_one_done();
return 0;
}

- /* Bind memory initialisation thread to a local node if possible */
- if (!cpumask_empty(cpumask))
- set_cpus_allowed_ptr(current, cpumask);
-
/* Sanity check boundaries */
BUG_ON(pgdat->first_deferred_pfn < pgdat->node_start_pfn);
BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
@@ -1598,6 +1600,7 @@ static int __init deferred_init_memmap(void *data)
epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
deferred_free_pages(nid, zid, spfn, epfn);
}
+ pgdat_resize_unlock(pgdat, &flags);

/* Sanity check that the next zone really is unpopulated */
WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
--
2.16.2