2020-04-03 13:36:31

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH v3 0/3] initialize deferred pages with interrupts enabled

Keep interrupts enabled during deferred page initialization in order to
make code more modular and allow jiffies to update.

Original approach, and discussion can be found here:
https://lore.kernel.org/linux-mm/[email protected]

Changelog
v3:
- Splitted cond_resched() change into a separate patch as suggested by
David Hildenbrand

v2:
- Addressed comments Daniel Jordan. Replaced touch_nmi_watchdog() to cond_resched().
Added reviewed-by's and acked-by's.

v1:
https://lore.kernel.org/linux-mm/[email protected]

Daniel Jordan (1):
mm: call touch_nmi_watchdog() on max order boundaries in deferred init

Pavel Tatashin (2):
mm: initialize deferred pages with interrupts enabled
mm: call cond_resched() from deferred_init_memmap()

include/linux/mmzone.h | 2 ++
mm/page_alloc.c | 27 +++++++++++----------------
2 files changed, 13 insertions(+), 16 deletions(-)

--
2.17.1


2020-04-03 13:37:12

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH v3 1/3] mm: call touch_nmi_watchdog() on max order boundaries in deferred init

From: Daniel Jordan <[email protected]>

deferred_init_memmap() disables interrupts the entire time, so it calls
touch_nmi_watchdog() periodically to avoid soft lockup splats. Soon it
will run with interrupts enabled, at which point cond_resched() should
be used instead.

deferred_grow_zone() makes the same watchdog calls through code shared
with deferred init but will continue to run with interrupts disabled, so
it can't call cond_resched().

Pull the watchdog calls up to these two places to allow the first to be
changed later, independently of the second. The frequency reduces from
twice per pageblock (init and free) to once per max order block.

Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
Cc: [email protected] # 4.17+

Signed-off-by: Daniel Jordan <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>

---
mm/page_alloc.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..e8ff6a176164 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1639,7 +1639,6 @@ static void __init deferred_free_pages(unsigned long pfn,
} else if (!(pfn & nr_pgmask)) {
deferred_free_range(pfn - nr_free, nr_free);
nr_free = 1;
- touch_nmi_watchdog();
} else {
nr_free++;
}
@@ -1669,7 +1668,6 @@ static unsigned long __init deferred_init_pages(struct zone *zone,
continue;
} else if (!page || !(pfn & nr_pgmask)) {
page = pfn_to_page(pfn);
- touch_nmi_watchdog();
} else {
page++;
}
@@ -1809,8 +1807,10 @@ static int __init deferred_init_memmap(void *data)
* that we can avoid introducing any issues with the buddy
* allocator.
*/
- while (spfn < epfn)
+ while (spfn < epfn) {
nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+ touch_nmi_watchdog();
+ }
zone_empty:
pgdat_resize_unlock(pgdat, &flags);

@@ -1894,6 +1894,7 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
first_deferred_pfn = spfn;

nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+ touch_nmi_watchdog();

/* We should only stop along section boundaries */
if ((first_deferred_pfn ^ spfn) < PAGES_PER_SECTION)
--
2.17.1

2020-04-03 14:03:18

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH v3 3/3] mm: call cond_resched() from deferred_init_memmap()

Now that deferred pages are initialized with interrupts enabled we can
replace touch_nmi_watchdog() with cond_resched(), as it was before
3a2d7fa8a3d5.

For now, we cannot do the same in deferred_grow_zone() as it is still
initializes pages with interrupts disabled.

This change fixes RCU problem described:
linux-mm/[email protected]

[ 60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 60.475000] rcu: 1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
[ 60.475000] rcu: (detected by 0, t=60002 jiffies, g=-1199, q=1)
[ 60.475000] Sending NMI from CPU 0 to CPUs 1:
[ 1.760091] NMI backtrace for cpu 1
[ 1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
[ 1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
[ 1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
[ 1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
[ 1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
[ 1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
[ 1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
[ 1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
[ 1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
[ 1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
[ 1.760091] FS: 0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
[ 1.760091] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
[ 1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1.760091] Call Trace:
[ 1.760091] deferred_init_pages+0x8f/0xbf
[ 1.760091] deferred_init_memmap+0x184/0x29d
[ 1.760091] ? deferred_free_pages.isra.97+0xba/0xba
[ 1.760091] kthread+0x112/0x130
[ 1.760091] ? kthread_flush_work_fn+0x10/0x10
[ 1.760091] ret_from_fork+0x35/0x40
[ 89.123011] node 0 initialised, 1055935372 pages in 88650ms

Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
Cc: [email protected] # 4.17+

Reported-by: Yiqian Wei <[email protected]>
Tested-by: David Hildenbrand <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a60f2427eb0..445f74358997 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1816,7 +1816,7 @@ static int __init deferred_init_memmap(void *data)
*/
while (spfn < epfn) {
nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
- touch_nmi_watchdog();
+ sched_clock();
}
zone_empty:
/* Sanity check that the next zone really is unpopulated */
--
2.17.1

2020-04-03 14:04:24

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH v3 2/3] mm: initialize deferred pages with interrupts enabled

Initializing struct pages is a long task and keeping interrupts disabled
for the duration of this operation introduces a number of problems.

1. jiffies are not updated for long period of time, and thus incorrect time
is reported. See proposed solution and discussion here:
lkml/[email protected]
2. It prevents farther improving deferred page initialization by allowing
intra-node multi-threading.

We are keeping interrupts disabled to solve a rather theoretical problem
that was never observed in real world (See 3a2d7fa8a3d5).

Lets keep interrupts enabled. In case we ever encounter a scenario where
an interrupt thread wants to allocate large amount of memory this early in
boot we can deal with that by growing zone (see deferred_grow_zone()) by
the needed amount before starting deferred_init_memmap() threads.

Before:
[ 1.232459] node 0 initialised, 12058412 pages in 1ms

After:
[ 1.632580] node 0 initialised, 12051227 pages in 436ms

Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
Cc: [email protected] # 4.17+

Reported-by: Shile Zhang <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Daniel Jordan <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
include/linux/mmzone.h | 2 ++
mm/page_alloc.c | 20 +++++++-------------
2 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 462f6873905a..c5bdf55da034 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -721,6 +721,8 @@ typedef struct pglist_data {
/*
* Must be held any time you expect node_start_pfn,
* node_present_pages, node_spanned_pages or nr_zones to stay constant.
+ * Also synchronizes pgdat->first_deferred_pfn during deferred page
+ * init.
*
* pgdat_resize_lock() and pgdat_resize_unlock() are provided to
* manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e8ff6a176164..4a60f2427eb0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1790,6 +1790,13 @@ static int __init deferred_init_memmap(void *data)
BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
pgdat->first_deferred_pfn = ULONG_MAX;

+ /*
+ * Once we unlock here, the zone cannot be grown anymore, thus if an
+ * interrupt thread must allocate this early in boot, zone must be
+ * pre-grown prior to start of deferred page initialization.
+ */
+ pgdat_resize_unlock(pgdat, &flags);
+
/* Only the highest zone is deferred so find it */
for (zid = 0; zid < MAX_NR_ZONES; zid++) {
zone = pgdat->node_zones + zid;
@@ -1812,8 +1819,6 @@ static int __init deferred_init_memmap(void *data)
touch_nmi_watchdog();
}
zone_empty:
- pgdat_resize_unlock(pgdat, &flags);
-
/* Sanity check that the next zone really is unpopulated */
WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));

@@ -1855,17 +1860,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)

pgdat_resize_lock(pgdat, &flags);

- /*
- * If deferred pages have been initialized while we were waiting for
- * the lock, return true, as the zone was grown. The caller will retry
- * this zone. We won't return to this function since the caller also
- * has this static branch.
- */
- if (!static_branch_unlikely(&deferred_pages)) {
- pgdat_resize_unlock(pgdat, &flags);
- return true;
- }
-
/*
* If someone grew this zone while we were waiting for spinlock, return
* true, as there might be enough pages already.
--
2.17.1

2020-04-03 14:05:01

by Daniel Jordan

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] mm: call cond_resched() from deferred_init_memmap()

On Fri, Apr 03, 2020 at 09:35:49AM -0400, Pavel Tatashin wrote:
> Now that deferred pages are initialized with interrupts enabled we can
> replace touch_nmi_watchdog() with cond_resched(), as it was before
> 3a2d7fa8a3d5.
...
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4a60f2427eb0..445f74358997 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1816,7 +1816,7 @@ static int __init deferred_init_memmap(void *data)
> */
> while (spfn < epfn) {
> nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> - touch_nmi_watchdog();
> + sched_clock();

I think you meant cond_resched()?

With that,
Reviewed-by: Daniel Jordan <[email protected]>

2020-04-03 15:43:59

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] mm: call cond_resched() from deferred_init_memmap()

> I think you meant cond_resched()?
>
> With that,
> Reviewed-by: Daniel Jordan <[email protected]>

Thank you Of course, I will re-submit quickly!

Pasha