Change log:
v4 - v5
- Fix issue reported by Vlastimil Babka:
> I've noticed that this function first disables the
> on-demand initialization, and then runs the kthreads.
> Doesn't that leave a window where allocations can fail? The
> chances are probably small, but I think it would be better
> to avoid it completely, rare failures suck.
>
> Fixing that probably means rethinking the whole
> synchronization more dramatically though :/
- Introduced a new patch that uses the node resize lock to synchronize
on-demand deferred page initialization with regular deferred page
initialization.
v3 - v4
- Fix !CONFIG_NUMA issue.
v2 - v3
Andrew Morton's comments:
- Moved read of pgdat->first_deferred_pfn into
deferred_zone_grow_lock, thus got rid of READ_ONCE()/WRITE_ONCE()
- Replaced spin_lock() with spin_lock_irqsave() in
deferred_grow_zone
- Updated comments for deferred_zone_grow_lock
- Updated comment before deferred_grow_zone() explaining return
value, and also noinline specifier.
- Fixed comment before _deferred_grow_zone().
v1 - v2
Added Tested-by: Masayoshi Mizuma
This change helps for three reasons:
1. The amount of reserved memory may be insufficient for the arguments
provided by the user. The user may request extra buffers, increased hash
table sizes, etc. Currently, the machine panics during boot if it cannot
allocate memory because too little was reserved. With this change, it
can grow the zone before deferred pages are initialized.
One observed example is described in the linked discussion [1], where
Mel Gorman writes:
"
Yasuaki Ishimatsu reported a premature OOM when trace_buf_size=100m was
specified on a machine with many CPUs. The kernel tried to allocate 38.4GB
but only 16GB was available due to deferred memory initialisation.
"
The allocations in the above scenario happen per-cpu in smp_init(),
before deferred pages are initialized. So, there is no way to predict
how much memory we should put aside to boot successfully with the
deferred page initialization feature compiled in.
2. Future-proofing. The kernel's memory requirements may change, and we
do not want to keep updating reset_deferred_meminit() to satisfy the new
requirements. In addition, this function is currently in common code,
but would potentially need to be split into arch-specific variants as
more architectures start taking advantage of the deferred page
initialization feature.
3. On-demand initialization guarantees that the single boot thread
initializes only as many pages early in boot as are actually needed;
the rest are initialized efficiently in parallel.
[1] https://www.spinics.net/lists/linux-mm/msg139087.html
Pavel Tatashin (2):
mm: disable interrupts while initializing deferred pages
mm: initialize pages on demand during boot
include/linux/memblock.h | 10 --
include/linux/memory_hotplug.h | 73 ++++++++++-----
include/linux/mmzone.h | 5 +-
mm/memblock.c | 23 -----
mm/page_alloc.c | 205 +++++++++++++++++++++++++++++++----------
5 files changed, 206 insertions(+), 110 deletions(-)
--
2.16.2
Deferred page initialization allows the boot cpu to initialize a small
subset of the system's pages early in boot, with other cpus doing the rest
later on.
It is, however, problematic to know how many pages the kernel needs during
boot. Different modules and kernel parameters may change the requirement,
so the boot cpu either initializes too many pages or runs out of memory.
To fix that, initialize early pages on demand. This ensures the kernel
does the minimum amount of work to initialize pages during boot and leaves
the rest to be divided in the multithreaded initialization path
(deferred_init_memmap).
The on-demand code is permanently disabled using static branching once
deferred pages are initialized. After the static branch is changed to
false, the overhead is up to two branch-always instructions if the zone
watermark check fails or if rmqueue fails.
Sergey Senozhatsky noticed that while deferred pages currently make sense
only on NUMA machines (we start one thread per latency node), CONFIG_NUMA
is not a requirement for CONFIG_DEFERRED_STRUCT_PAGE_INIT, so that case
must also be addressed in the patch.
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Daniel Jordan <[email protected]>
Reviewed-by: Steven Sistare <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Tested-by: Masayoshi Mizuma <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
include/linux/memblock.h | 10 ---
mm/memblock.c | 23 ------
mm/page_alloc.c | 183 +++++++++++++++++++++++++++++++++++++----------
3 files changed, 144 insertions(+), 72 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 8be5077efb5f..6c305afd95ab 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -417,21 +417,11 @@ static inline void early_memtest(phys_addr_t start, phys_addr_t end)
{
}
#endif
-
-extern unsigned long memblock_reserved_memory_within(phys_addr_t start_addr,
- phys_addr_t end_addr);
#else
static inline phys_addr_t memblock_alloc(phys_addr_t size, phys_addr_t align)
{
return 0;
}
-
-static inline unsigned long memblock_reserved_memory_within(phys_addr_t start_addr,
- phys_addr_t end_addr)
-{
- return 0;
-}
-
#endif /* CONFIG_HAVE_MEMBLOCK */
#endif /* __KERNEL__ */
diff --git a/mm/memblock.c b/mm/memblock.c
index 5a9ca2a1751b..4120e9f536f7 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1778,29 +1778,6 @@ static void __init_memblock memblock_dump(struct memblock_type *type)
}
}
-extern unsigned long __init_memblock
-memblock_reserved_memory_within(phys_addr_t start_addr, phys_addr_t end_addr)
-{
- struct memblock_region *rgn;
- unsigned long size = 0;
- int idx;
-
- for_each_memblock_type(idx, (&memblock.reserved), rgn) {
- phys_addr_t start, end;
-
- if (rgn->base + rgn->size < start_addr)
- continue;
- if (rgn->base > end_addr)
- continue;
-
- start = rgn->base;
- end = start + rgn->size;
- size += end - start;
- }
-
- return size;
-}
-
void __init_memblock __memblock_dump_all(void)
{
pr_info("MEMBLOCK configuration:\n");
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 55b98df0be39..7a0f609a09b7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -292,40 +292,6 @@ EXPORT_SYMBOL(nr_online_nodes);
int page_group_by_mobility_disabled __read_mostly;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-
-/*
- * Determine how many pages need to be initialized during early boot
- * (non-deferred initialization).
- * The value of first_deferred_pfn will be set later, once non-deferred pages
- * are initialized, but for now set it ULONG_MAX.
- */
-static inline void reset_deferred_meminit(pg_data_t *pgdat)
-{
- phys_addr_t start_addr, end_addr;
- unsigned long max_pgcnt;
- unsigned long reserved;
-
- /*
- * Initialise at least 2G of a node but also take into account that
- * two large system hashes that can take up 1GB for 0.25TB/node.
- */
- max_pgcnt = max(2UL << (30 - PAGE_SHIFT),
- (pgdat->node_spanned_pages >> 8));
-
- /*
- * Compensate the all the memblock reservations (e.g. crash kernel)
- * from the initial estimation to make sure we will initialize enough
- * memory to boot.
- */
- start_addr = PFN_PHYS(pgdat->node_start_pfn);
- end_addr = PFN_PHYS(pgdat->node_start_pfn + max_pgcnt);
- reserved = memblock_reserved_memory_within(start_addr, end_addr);
- max_pgcnt += PHYS_PFN(reserved);
-
- pgdat->static_init_pgcnt = min(max_pgcnt, pgdat->node_spanned_pages);
- pgdat->first_deferred_pfn = ULONG_MAX;
-}
-
/* Returns true if the struct page for the pfn is uninitialised */
static inline bool __meminit early_page_uninitialised(unsigned long pfn)
{
@@ -361,10 +327,6 @@ static inline bool update_defer_init(pg_data_t *pgdat,
return true;
}
#else
-static inline void reset_deferred_meminit(pg_data_t *pgdat)
-{
-}
-
static inline bool early_page_uninitialised(unsigned long pfn)
{
return false;
@@ -1608,6 +1570,117 @@ static int __init deferred_init_memmap(void *data)
pgdat_init_report_one_done();
return 0;
}
+
+/*
+ * During boot we initialize deferred pages on-demand, as needed, but once
+ * page_alloc_init_late() has finished, the deferred pages are all initialized,
+ * and we can permanently disable this path.
+ */
+static DEFINE_STATIC_KEY_TRUE(deferred_pages);
+
+/*
+ * If this zone has deferred pages, try to grow it by initializing enough
+ * deferred pages to satisfy the allocation specified by order, rounded up to
+ * the nearest PAGES_PER_SECTION boundary. So we're adding memory in increments
+ * of SECTION_SIZE bytes by initializing struct pages in increments of
+ * PAGES_PER_SECTION * sizeof(struct page) bytes.
+ *
+ * Return true when zone was grown, otherwise return false. We return true even
+ * when we grow less than requested, let the caller decide if there are enough
+ * pages to satisfy allocation.
+ *
+ * Note: We use noinline because this function is needed only during boot, and
+ * it is called from a __ref function _deferred_grow_zone. This way we are
+ * making sure that it is not inlined into permanent text section.
+ */
+static noinline bool __init
+deferred_grow_zone(struct zone *zone, unsigned int order)
+{
+ int zid = zone_idx(zone);
+ int nid = zone_to_nid(zone);
+ pg_data_t *pgdat = NODE_DATA(nid);
+ unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
+ unsigned long nr_pages = 0;
+ unsigned long first_init_pfn, spfn, epfn, t, flags;
+ unsigned long first_deferred_pfn = pgdat->first_deferred_pfn;
+ phys_addr_t spa, epa;
+ u64 i;
+
+ /* Only the last zone may have deferred pages */
+ if (zone_end_pfn(zone) != pgdat_end_pfn(pgdat))
+ return false;
+
+ pgdat_resize_lock_irq(pgdat, &flags);
+
+ /*
+ * If deferred pages have been initialized while we were waiting for
+ * the lock, return true, as the zone was grown. The caller will retry
+ * this zone. We won't return to this function again, since the caller
+ * also checks this static branch.
+ */
+ if (!static_branch_unlikely(&deferred_pages)) {
+ pgdat_resize_unlock_irq(pgdat, &flags);
+ return true;
+ }
+
+ /*
+ * If someone grew this zone while we were waiting for spinlock, return
+ * true, as there might be enough pages already.
+ */
+ if (first_deferred_pfn != pgdat->first_deferred_pfn) {
+ pgdat_resize_unlock_irq(pgdat, &flags);
+ return true;
+ }
+
+ first_init_pfn = max(zone->zone_start_pfn, first_deferred_pfn);
+
+ if (first_init_pfn >= pgdat_end_pfn(pgdat)) {
+ pgdat_resize_unlock_irq(pgdat, &flags);
+ return false;
+ }
+
+ for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
+ spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
+ epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
+
+ while (spfn < epfn && nr_pages < nr_pages_needed) {
+ t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
+ first_deferred_pfn = min(t, epfn);
+ nr_pages += deferred_init_pages(nid, zid, spfn,
+ first_deferred_pfn);
+ spfn = first_deferred_pfn;
+ }
+
+ if (nr_pages >= nr_pages_needed)
+ break;
+ }
+
+ for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
+ spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
+ epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
+ deferred_free_pages(nid, zid, spfn, epfn);
+
+ if (first_deferred_pfn == epfn)
+ break;
+ }
+ pgdat->first_deferred_pfn = first_deferred_pfn;
+ pgdat_resize_unlock_irq(pgdat, &flags);
+
+ return nr_pages > 0;
+}
+
+/*
+ * deferred_grow_zone() is __init, but it is called from
+ * get_page_from_freelist() during early boot until deferred_pages permanently
+ * disables this call. This is why we have refdata wrapper to avoid warning,
+ * and to ensure that the function body gets unloaded.
+ */
+static bool __ref
+_deferred_grow_zone(struct zone *zone, unsigned int order)
+{
+ return deferred_grow_zone(zone, order);
+}
+
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
void __init page_alloc_init_late(void)
@@ -1626,6 +1699,12 @@ void __init page_alloc_init_late(void)
/* Block until all are initialised */
wait_for_completion(&pgdat_init_all_done_comp);
+ /*
+ * We initialized the rest of deferred pages, permanently
+ * disable on-demand struct page initialization.
+ */
+ static_branch_disable(&deferred_pages);
+
/* Reinit limits that are based on free pages after the kernel is up */
files_maxfiles_init();
#endif
@@ -3203,6 +3282,16 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
ac_classzone_idx(ac), alloc_flags)) {
int ret;
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+ /*
+ * Watermark failed for this zone, but see if we can
+ * grow this zone if it contains deferred pages.
+ */
+ if (static_branch_unlikely(&deferred_pages)) {
+ if (_deferred_grow_zone(zone, order))
+ goto try_this_zone;
+ }
+#endif
/* Checked here to keep the fast path fast */
BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
if (alloc_flags & ALLOC_NO_WATERMARKS)
@@ -3244,6 +3333,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
reserve_highatomic_pageblock(page, zone, order);
return page;
+ } else {
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+ /* Try again if zone has deferred pages */
+ if (static_branch_unlikely(&deferred_pages)) {
+ if (_deferred_grow_zone(zone, order))
+ goto try_this_zone;
+ }
+#endif
}
}
@@ -6251,7 +6348,15 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
alloc_node_mem_map(pgdat);
- reset_deferred_meminit(pgdat);
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+ /*
+ * We start only with one section of pages, more pages are added as
+ * needed until the rest of deferred pages are initialized.
+ */
+ pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
+ pgdat->node_spanned_pages);
+ pgdat->first_deferred_pfn = ULONG_MAX;
+#endif
free_area_init_core(pgdat);
}
--
2.16.2
Vlastimil Babka reported a window during which allocations may fail: the
on-demand initialization path has already been disabled, but the kthreads
initializing the deferred pages have not yet finished. While this is a
highly unlikely scenario, since this kind of allocation request must be
large and must come from an interrupt handler, we still want to cover it.
We solve this by initializing deferred pages with interrupts disabled, and
holding node_size_lock spin lock while pages in the node are being
initialized. The on-demand deferred page initialization that comes later
will use the same lock, and thus synchronize with deferred_init_memmap().
It is unlikely for the threads that initialize deferred pages to be
interrupted. They run soon after smp_init(), but before modules are
initialized, and long before user space programs start. This is why
running these threads with interrupts disabled has no adverse effect.
Signed-off-by: Pavel Tatashin <[email protected]>
---
include/linux/memory_hotplug.h | 73 +++++++++++++++++++++++++++---------------
include/linux/mmzone.h | 5 +--
mm/page_alloc.c | 22 ++++++-------
3 files changed, 62 insertions(+), 38 deletions(-)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index aba5f86eb038..02bfb04823c4 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -51,24 +51,6 @@ enum {
MMOP_ONLINE_MOVABLE,
};
-/*
- * pgdat resizing functions
- */
-static inline
-void pgdat_resize_lock(struct pglist_data *pgdat, unsigned long *flags)
-{
- spin_lock_irqsave(&pgdat->node_size_lock, *flags);
-}
-static inline
-void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned long *flags)
-{
- spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
-}
-static inline
-void pgdat_resize_init(struct pglist_data *pgdat)
-{
- spin_lock_init(&pgdat->node_size_lock);
-}
/*
* Zone resizing functions
*
@@ -246,13 +228,6 @@ extern void clear_zone_contiguous(struct zone *zone);
___page; \
})
-/*
- * Stub functions for when hotplug is off
- */
-static inline void pgdat_resize_lock(struct pglist_data *p, unsigned long *f) {}
-static inline void pgdat_resize_unlock(struct pglist_data *p, unsigned long *f) {}
-static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
-
static inline unsigned zone_span_seqbegin(struct zone *zone)
{
return 0;
@@ -293,6 +268,54 @@ static inline bool movable_node_is_enabled(void)
}
#endif /* ! CONFIG_MEMORY_HOTPLUG */
+#if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT)
+/*
+ * pgdat resizing functions
+ */
+static inline
+void pgdat_resize_lock(struct pglist_data *pgdat, unsigned long *flags)
+{
+ spin_lock_irqsave(&pgdat->node_size_lock, *flags);
+}
+static inline
+void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned long *flags)
+{
+ spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
+}
+static inline
+void pgdat_resize_init(struct pglist_data *pgdat)
+{
+ spin_lock_init(&pgdat->node_size_lock);
+}
+
+/* Disable interrupts and save previous IRQ state in flags before locking */
+static inline
+void pgdat_resize_lock_irq(struct pglist_data *pgdat, unsigned long *flags)
+{
+ unsigned long tmp_flags;
+
+ local_irq_save(*flags);
+ local_irq_disable();
+ pgdat_resize_lock(pgdat, &tmp_flags);
+}
+
+static inline
+void pgdat_resize_unlock_irq(struct pglist_data *pgdat, unsigned long *flags)
+{
+ pgdat_resize_unlock(pgdat, flags);
+}
+
+#else /* !(CONFIG_MEMORY_HOTPLUG || CONFIG_DEFERRED_STRUCT_PAGE_INIT) */
+/*
+ * Stub functions for when hotplug is off
+ */
+static inline void pgdat_resize_lock(struct pglist_data *p, unsigned long *f) {}
+static inline void pgdat_resize_unlock(struct pglist_data *p, unsigned long *f) {}
+static inline void pgdat_resize_lock_irq(struct pglist_data *p, unsigned long *f) {}
+static inline void pgdat_resize_unlock_irq(struct pglist_data *p, unsigned long *f) {}
+static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
+#endif /* !(CONFIG_MEMORY_HOTPLUG || CONFIG_DEFERRED_STRUCT_PAGE_INIT) */
+
#ifdef CONFIG_MEMORY_HOTREMOVE
extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7522a6987595..d14168da66a7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -633,14 +633,15 @@ typedef struct pglist_data {
#ifndef CONFIG_NO_BOOTMEM
struct bootmem_data *bdata;
#endif
-#ifdef CONFIG_MEMORY_HOTPLUG
+#if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT)
/*
* Must be held any time you expect node_start_pfn, node_present_pages
* or node_spanned_pages stay constant. Holding this will also
* guarantee that any pfn_valid() stays that way.
*
* pgdat_resize_lock() and pgdat_resize_unlock() are provided to
- * manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG.
+ * manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG
+ * or CONFIG_DEFERRED_STRUCT_PAGE_INIT.
*
* Nests above zone->lock and zone->span_seqlock
*/
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cb416723538f..55b98df0be39 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1506,7 +1506,6 @@ static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
} else if (!(pfn & nr_pgmask)) {
deferred_free_range(pfn - nr_free, nr_free);
nr_free = 1;
- cond_resched();
} else {
nr_free++;
}
@@ -1533,12 +1532,10 @@ static unsigned long __init deferred_init_pages(int nid, int zid,
if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) {
page = NULL;
continue;
- } else if (!page || !(pfn & nr_pgmask)) {
+ } else if (!page || !(pfn & nr_pgmask))
page = pfn_to_page(pfn);
- cond_resched();
- } else {
+ else
page++;
- }
__init_single_page(page, pfn, zid, nid, true);
nr_pages++;
}
@@ -1552,23 +1549,25 @@ static int __init deferred_init_memmap(void *data)
int nid = pgdat->node_id;
unsigned long start = jiffies;
unsigned long nr_pages = 0;
- unsigned long spfn, epfn;
+ unsigned long spfn, epfn, first_init_pfn, flags;
phys_addr_t spa, epa;
int zid;
struct zone *zone;
- unsigned long first_init_pfn = pgdat->first_deferred_pfn;
const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
u64 i;
+ /* Bind memory initialisation thread to a local node if possible */
+ if (!cpumask_empty(cpumask))
+ set_cpus_allowed_ptr(current, cpumask);
+
+ pgdat_resize_lock_irq(pgdat, &flags);
+ first_init_pfn = pgdat->first_deferred_pfn;
if (first_init_pfn == ULONG_MAX) {
+ pgdat_resize_unlock_irq(pgdat, &flags);
pgdat_init_report_one_done();
return 0;
}
- /* Bind memory initialisation thread to a local node if possible */
- if (!cpumask_empty(cpumask))
- set_cpus_allowed_ptr(current, cpumask);
-
/* Sanity check boundaries */
BUG_ON(pgdat->first_deferred_pfn < pgdat->node_start_pfn);
BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
@@ -1598,6 +1597,7 @@ static int __init deferred_init_memmap(void *data)
epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
deferred_free_pages(nid, zid, spfn, epfn);
}
+ pgdat_resize_unlock_irq(pgdat, &flags);
/* Sanity check that the next zone really is unpopulated */
WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
--
2.16.2
On Fri, 9 Mar 2018 17:08:06 -0500 Pavel Tatashin <[email protected]> wrote:
> Vlastimil Babka reported about a window issue during which when deferred
> pages are initialized, and the current version of on-demand initialization
> is finished, allocations may fail. While this is highly unlikely scenario,
> since this kind of allocation request must be large, and must come from
> interrupt handler, we still want to cover it.
>
> We solve this by initializing deferred pages with interrupts disabled, and
> holding node_size_lock spin lock while pages in the node are being
> initialized. The on-demand deferred page initialization that comes later
> will use the same lock, and thus synchronize with deferred_init_memmap().
>
> It is unlikely for threads that initialize deferred pages to be
> interrupted. They run soon after smp_init(), but before modules are
> initialized, and long before user space programs. This is why there is no
> adverse effect of having these threads running with interrupts disabled.
>
> ...
>
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
>
> +#if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT)
> +/*
> + * pgdat resizing functions
> + */
> +static inline
> +void pgdat_resize_lock(struct pglist_data *pgdat, unsigned long *flags)
> +{
> + spin_lock_irqsave(&pgdat->node_size_lock, *flags);
> +}
> +static inline
> +void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned long *flags)
> +{
> + spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
> +}
> +static inline
> +void pgdat_resize_init(struct pglist_data *pgdat)
> +{
> + spin_lock_init(&pgdat->node_size_lock);
> +}
> +
> +/* Disable interrupts and save previous IRQ state in flags before locking */
> +static inline
> +void pgdat_resize_lock_irq(struct pglist_data *pgdat, unsigned long *flags)
> +{
> + unsigned long tmp_flags;
> +
> + local_irq_save(*flags);
> + local_irq_disable();
> + pgdat_resize_lock(pgdat, &tmp_flags);
> +}
As far as I can tell, this ugly-looking thing is identical to
pgdat_resize_lock().
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1506,7 +1506,6 @@ static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
> } else if (!(pfn & nr_pgmask)) {
> deferred_free_range(pfn - nr_free, nr_free);
> nr_free = 1;
> - cond_resched();
> } else {
> nr_free++;
And how can we simply remove these cond_resched()s? I assume this is
being done because interrupts are now disabled? But those were there
for a reason, weren't they?
Hi Andrew,
> > +/* Disable interrupts and save previous IRQ state in flags before locking */
> > +static inline
> > +void pgdat_resize_lock_irq(struct pglist_data *pgdat, unsigned long *flags)
> > +{
> > + unsigned long tmp_flags;
> > +
> > + local_irq_save(*flags);
> > + local_irq_disable();
> > + pgdat_resize_lock(pgdat, &tmp_flags);
> > +}
>
> As far as I can tell, this ugly-looking thing is identical to
> pgdat_resize_lock().
I will get rid of it, and use pgdat_resize_lock(). My confusion was that I
thought that local_irq_save() only saves the IRQ flags but does not
disable interrupts.
>
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1506,7 +1506,6 @@ static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
> > } else if (!(pfn & nr_pgmask)) {
> > deferred_free_range(pfn - nr_free, nr_free);
> > nr_free = 1;
> > - cond_resched();
> > } else {
> > nr_free++;
>
> And how can we simply remove these cond_resched()s? I assume this is
> being done because interrupts are now disabled? But those were there
> for a reason, weren't they?
We must remove the cond_resched() calls because we can no longer sleep.
They were added to fight NMI timeouts, so I will replace them with
touch_nmi_watchdog() in a follow-up fix.
Thank you for your review,
Pavel
On Tue, 13 Mar 2018 12:04:30 -0400 Pavel Tatashin <[email protected]> wrote:
> >
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1506,7 +1506,6 @@ static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
> > > } else if (!(pfn & nr_pgmask)) {
> > > deferred_free_range(pfn - nr_free, nr_free);
> > > nr_free = 1;
> > > - cond_resched();
> > > } else {
> > > nr_free++;
> >
> > And how can we simply remove these cond_resched()s? I assume this is
> > being done because interrupts are now disabled? But those were there
> > for a reason, weren't they?
>
> We must remove cond_resched() because we can't sleep anymore. They were
> added to fight NMI timeouts, so I will replace them with
> touch_nmi_watchdog() in a follow-up fix.
This makes no sense. Any code section where we can add cond_resched()
was never subject to NMI timeouts because that code cannot be running with
disabled interrupts.
> >
> > We must remove cond_resched() because we can't sleep anymore. They were
> > added to fight NMI timeouts, so I will replace them with
> > touch_nmi_watchdog() in a follow-up fix.
>
> This makes no sense. Any code section where we can add cond_resched()
> was never subject to NMI timeouts because that code cannot be running with
> disabled interrupts.
>
Hi Andrew,
I was talking about this patch:
9b6e63cbf85b89b2dbffa4955dbf2df8250e5375
mm, page_alloc: add scheduling point to memmap_init_zone
Which adds cond_resched() to memmap_init_zone() to avoid NMI timeouts.
memmap_init_zone() is used both early in boot, when non-deferred struct
pages are initialized, and later, during memory hotplug.
As I understand it, the latter case could cause the timeout on
non-preemptible kernels.
My understanding is that the same logic applied here when the
cond_resched()s were added.
Please correct me if I am wrong.
Thank you,
Pavel
On Tue, 13 Mar 2018 15:45:46 -0400 Pavel Tatashin <[email protected]> wrote:
> > >
> > > We must remove cond_resched() because we can't sleep anymore. They were
> > > added to fight NMI timeouts, so I will replace them with
> > > touch_nmi_watchdog() in a follow-up fix.
> >
> > This makes no sense. Any code section where we can add cond_resched()
> > was never subject to NMI timeouts because that code cannot be running with
> > disabled interrupts.
> >
>
> Hi Andrew,
>
> I was talking about this patch:
>
> 9b6e63cbf85b89b2dbffa4955dbf2df8250e5375
> mm, page_alloc: add scheduling point to memmap_init_zone
>
> Which adds cond_resched() to memmap_init_zone() to avoid NMI timeouts.
>
> memmap_init_zone() is used both, early in boot, when non-deferred struct
> pages are initialized, but also may be used later, during memory hotplug.
>
> As I understand, the later case could cause the timeout on non-preemptible
> kernels.
>
> My understanding, is that the same logic was used here when cond_resched()s
> were added.
>
> Please correct me if I am wrong.
Yes, the message is a bit confusing and the terminology is perhaps
vague. And it's been a while since I played with this stuff, so from
(dated) memory:
Soft lockup: kernel has run for too long without rescheduling
Hard lockup: kernel has run for too long with interrupts disabled
Both of these are detected by the NMI watchdog handler.
9b6e63cbf85b89b2d fixes a soft lockup by adding a manual rescheduling
point. Replacing that with touch_nmi_watchdog() won't work (I think).
Presumably calling touch_softlockup_watchdog() will "work", in that it
suppresses the warning. But it won't fix the thing which the warning
is actually warning about: starvation of the CPU scheduler. That's
what the cond_resched() does.
I'm not sure what to suggest, really. Your changelog isn't the best:
"Vlastimil Babka reported about a window issue during which when
deferred pages are initialized, and the current version of on-demand
initialization is finished, allocations may fail". Well... where is
this mysterious window? Without such detail it's hard for others to
suggest alternative approaches.
> Soft lockup: kernel has run for too long without rescheduling
> Hard lockup: kernel has run for too long with interrupts disabled
>
> Both of these are detected by the NMI watchdog handler.
>
> 9b6e63cbf85b89b2d fixes a soft lockup by adding a manual rescheduling
> point. Replacing that with touch_nmi_watchdog() won't work (I think).
> Presumably calling touch_softlockup_watchdog() will "work", in that it
> suppresses the warning. But it won't fix the thing which the warning
> is actually warning about: starvation of the CPU scheduler. That's
> what the cond_resched() does.
But, unlike memmap_init_zone(), which can be used after boot, here we do
not worry about the kernel running for too long, because we are still
booting and no user programs are running yet.
So, it is acceptable to have a long uninterruptible span, as long as we
are making useful progress. BTW, the boot CPU still has interrupts
enabled during this span.
Comment in: include/linux/nmi.h, states:
* If the architecture supports the NMI watchdog, touch_nmi_watchdog()
* may be used to reset the timeout - for code which intentionally
* disables interrupts for a long time. This call is stateless.
Which is exactly what we are trying to do here, now that these threads
run with interrupts disabled.
Before, they were running with interrupts enabled, and cond_resched()
was enough to satisfy the soft lockup detector.
>
> I'm not sure what to suggest, really. Your changelog isn't the best:
> "Vlastimil Babka reported about a window issue during which when
> deferred pages are initialized, and the current version of on-demand
> initialization is finished, allocations may fail". Well... where is
> ths mysterious window? Without such detail it's hard for others to
> suggest alternative approaches.
Here is hopefully a better description of the problem:
Currently, during boot we preinitialize some number of struct pages to satisfy all boot allocations, even those allocations that happen while we initialize the rest of the deferred pages in page_alloc_init_late(). The problem is that we do not know how much memory the kernel will need, and it also depends on various options.
So, with this work, we change this behavior to initialize struct pages on demand, only when allocations happen.
During boot, when we try to allocate memory, the on-demand struct page initialization code takes care of it. But, once the deferred pages are being initialized in:
page_alloc_init_late()
for_each_node_state(nid, N_MEMORY)
kthread_run(deferred_init_memmap())
we cannot use on-demand initialization, as these threads resize the pgdat.
This whole mechanism exists to take care of that window.
My first version of on-demand deferred page initialization would simply fail to allocate memory during this period of time. But this new version waits for the threads to finish initializing deferred memory, and then successfully performs the allocation, because the allocating context waits on the pgdat resize lock.
Thank you,
Pavel
On Tue, 13 Mar 2018 16:43:47 -0400 Pavel Tatashin <[email protected]> wrote:
>
> > Soft lockup: kernel has run for too long without rescheduling
> > Hard lockup: kernel has run for too long with interrupts disabled
> >
> > Both of these are detected by the NMI watchdog handler.
> >
> > 9b6e63cbf85b89b2d fixes a soft lockup by adding a manual rescheduling
> > point. Replacing that with touch_nmi_watchdog() won't work (I think).
> > Presumably calling touch_softlockup_watchdog() will "work", in that it
> > suppresses the warning. But it won't fix the thing which the warning
> > is actually warning about: starvation of the CPU scheduler. That's
> > what the cond_resched() does.
>
> But, unlike memmap_init_zone(), which can be used after boot, here we do
> not worry about kernel running for too long. This is because we are
> booting, and no user programs are running.
>
> So, it is acceptable to have a long uninterruptible span, as long
> as we are making useful progress. BTW, the boot CPU still has
> interrupts enabled during this span.
>
> Comment in: include/linux/nmi.h, states:
>
> * If the architecture supports the NMI watchdog, touch_nmi_watchdog()
> * may be used to reset the timeout - for code which intentionally
> * disables interrupts for a long time. This call is stateless.
>
> Which is exactly what we are trying to do here, now that these threads
> run with interrupts disabled.
>
> Before, they were running with interrupts enabled, and
> cond_resched() was enough to satisfy the soft lockup watchdog.
hm, maybe. But I'm not sure that touch_nmi_watchdog() will hold off a
soft lockup warning. Maybe it will.
And please let's get the above thoughts into the changelog.
> >
> > I'm not sure what to suggest, really. Your changelog isn't the best:
> > "Vlastimil Babka reported about a window issue during which when
> > deferred pages are initialized, and the current version of on-demand
> > initialization is finished, allocations may fail". Well... where is
> > this mysterious window? Without such detail it's hard for others to
> > suggest alternative approaches.
>
> Here is hopefully a better description of the problem:
>
> Currently, during boot we preinitialize some number of struct pages to satisfy all boot allocations, even those that happen while we initialize the rest of the deferred pages in page_alloc_init_late(). The problem is that we do not know how much memory the kernel will need, and it also depends on various boot options.
>
> So, with this work, we change this behavior to initialize struct pages on-demand, only when allocations happen.
>
> During boot, when we try to allocate memory, the on-demand struct page initialization code takes care of it. But, once the deferred pages start being initialized in:
>
> page_alloc_init_late()
>   for_each_node_state(nid, N_MEMORY)
>     kthread_run(deferred_init_memmap())
>
> we cannot use on-demand initialization, as these threads resize pgdat.
>
> This patch takes care of that window.
>
> My first version of on-demand deferred page initialization would simply fail to allocate memory during this period. This new version instead waits for the threads to finish initializing deferred memory, and then successfully performs the allocation, because even an interrupt handler will wait on the pgdat resize lock.
OK, thanks. Please also add to changelog.
> hm, maybe. But I'm not sure that touch_nmi_watchdog() will hold off a
> soft lockup warning. Maybe it will.
It should:
static inline void touch_nmi_watchdog(void)
{
	arch_touch_nmi_watchdog();
	touch_softlockup_watchdog();
}
>
> And please let's get the above thoughts into the changelog.
OK
>
>> >
>> > I'm not sure what to suggest, really. Your changelog isn't the best:
>> > "Vlastimil Babka reported about a window issue during which when
>> > deferred pages are initialized, and the current version of on-demand
>> > initialization is finished, allocations may fail". Well... where is
>> > this mysterious window? Without such detail it's hard for others to
>> > suggest alternative approaches.
>>
>> Here is hopefully a better description of the problem:
>>
>> Currently, during boot we preinitialize some number of struct pages to satisfy all boot allocations, even those that happen while we initialize the rest of the deferred pages in page_alloc_init_late(). The problem is that we do not know how much memory the kernel will need, and it also depends on various boot options.
>>
>> So, with this work, we change this behavior to initialize struct pages on-demand, only when allocations happen.
>>
>> During boot, when we try to allocate memory, the on-demand struct page initialization code takes care of it. But, once the deferred pages start being initialized in:
>>
>> page_alloc_init_late()
>>   for_each_node_state(nid, N_MEMORY)
>>     kthread_run(deferred_init_memmap())
>>
>> we cannot use on-demand initialization, as these threads resize pgdat.
>>
>> This patch takes care of that window.
>>
>> My first version of on-demand deferred page initialization would simply fail to allocate memory during this period. This new version instead waits for the threads to finish initializing deferred memory, and then successfully performs the allocation, because even an interrupt handler will wait on the pgdat resize lock.
>
> OK, thanks. Please also add to changelog.
OK, I will send an updated patch, with changelog changes.