Two fixes for misleading stall messages / soft lockups with huge nodes /
zones during boot without CONFIG_PREEMPT.
David Hildenbrand (2):
mm/page_alloc: fix RCU stalls during deferred page initialization
mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()
mm/page_alloc.c | 2 ++
1 file changed, 2 insertions(+)
--
2.25.1
Without CONFIG_PREEMPT, it can happen that we get soft lockups detected,
e.g., while booting up.
[ 105.608900] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
[ 105.608933] Modules linked in:
[ 105.608933] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4
[ 105.608933] Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
[ 105.608933] RIP: 0010:__pageblock_pfn_to_page+0x134/0x1c0
[ 105.608933] Code: 85 c0 74 71 4a 8b 04 d0 48 85 c0 74 68 48 01 c1 74 63 f6 01 04 74 5e 48 c1 e7 06 4c 8b 05 cc 991
[ 105.608933] RSP: 0000:ffffb6d94000fe60 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
[ 105.608933] RAX: fffff81953250000 RBX: 000000000a4c9600 RCX: ffff8fe9ff7c1990
[ 105.608933] RDX: ffff8fe9ff7dab80 RSI: 000000000a4c95ff RDI: 0000000293250000
[ 105.608933] RBP: ffff8fe9ff7dab80 R08: fffff816c0000000 R09: 0000000000000008
[ 105.608933] R10: 0000000000000014 R11: 0000000000000014 R12: 0000000000000000
[ 105.608933] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 105.608933] FS: 0000000000000000(0000) GS:ffff8fe1ff400000(0000) knlGS:0000000000000000
[ 105.608933] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 105.608933] CR2: 000000000f613000 CR3: 00000088cf20a000 CR4: 00000000000006f0
[ 105.608933] Call Trace:
[ 105.608933] set_zone_contiguous+0x56/0x70
[ 105.608933] page_alloc_init_late+0x166/0x176
[ 105.608933] kernel_init_freeable+0xfa/0x255
[ 105.608933] ? rest_init+0xaa/0xaa
[ 105.608933] kernel_init+0xa/0x106
[ 105.608933] ret_from_fork+0x35/0x40
The issue becomes visible when having a lot of memory (e.g., 4TB)
assigned to a single NUMA node - a system that can easily be created
using QEMU. Inside VMs on a hypervisor with quite some memory
overcommit, this is fairly easy to trigger.
Cc: Andrew Morton <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Shile Zhang <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/page_alloc.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 084cabffc90d..cc4f07d52939 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1607,6 +1607,7 @@ void set_zone_contiguous(struct zone *zone)
if (!__pageblock_pfn_to_page(block_start_pfn,
block_end_pfn, zone))
return;
+ cond_resched();
}
/* We confirm that there is no hole */
--
2.25.1
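For a sense of scale, here is a rough sketch of how many iterations the loop in set_zone_contiguous() would run for a zone of the size mentioned in the commit message. The 4 KiB page size and 512 pages per pageblock are assumptions (common on x86-64, not stated in the patch), "4TB" is read as 4 TiB, and pageblock_steps() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of pageblock-sized steps needed to walk a
 * zone of zone_bytes bytes. page_size and pageblock_pages are parameters
 * because their real values depend on architecture and kernel config.
 */
unsigned long long pageblock_steps(unsigned long long zone_bytes,
				   unsigned long long page_size,
				   unsigned long long pageblock_pages)
{
	assert(page_size > 0 && pageblock_pages > 0);

	unsigned long long pages = zone_bytes / page_size;

	/* Round up: a partial trailing block still costs one iteration. */
	return (pages + pageblock_pages - 1) / pageblock_pages;
}
```

Under those assumptions, pageblock_steps(4ULL << 40, 4096, 512) gives 2097152, i.e. roughly two million loop iterations without a scheduling point before this patch.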
With CONFIG_DEFERRED_STRUCT_PAGE_INIT and without CONFIG_PREEMPT, it can
happen that we get RCU stalls detected when booting up.
[ 60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 60.475000] rcu: 1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
[ 60.475000] rcu: (detected by 0, t=60002 jiffies, g=-1199, q=1)
[ 60.475000] Sending NMI from CPU 0 to CPUs 1:
[ 1.760091] NMI backtrace for cpu 1
[ 1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
[ 1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
[ 1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
[ 1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
[ 1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
[ 1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
[ 1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
[ 1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
[ 1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
[ 1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
[ 1.760091] FS: 0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
[ 1.760091] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
[ 1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1.760091] Call Trace:
[ 1.760091] deferred_init_pages+0x8f/0xbf
[ 1.760091] deferred_init_memmap+0x184/0x29d
[ 1.760091] ? deferred_free_pages.isra.97+0xba/0xba
[ 1.760091] kthread+0x112/0x130
[ 1.760091] ? kthread_flush_work_fn+0x10/0x10
[ 1.760091] ret_from_fork+0x35/0x40
[ 89.123011] node 0 initialised, 1055935372 pages in 88650ms
The issue becomes visible when having a lot of memory (e.g., 4TB)
assigned to a single NUMA node - a system that can easily be created
using QEMU. Inside VMs on a hypervisor with quite some memory
overcommit, this is fairly easy to trigger.
Adding the cond_resched() makes RCU happy.
Reported-by: Yiqian Wei <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Shile Zhang <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/page_alloc.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ca1453204e66..084cabffc90d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1877,6 +1877,7 @@ static int __init deferred_init_memmap(void *data)
prev_nr_pages = nr_pages;
pgdat->first_deferred_pfn = spfn;
pgdat_resize_unlock(pgdat, &flags);
+ cond_resched();
goto again;
}
}
--
2.25.1
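The pattern in this hunk (do a bounded chunk of work under the lock, drop the lock, offer the CPU, repeat) can be illustrated with a userspace analogue. This is only a sketch: struct node_state and init_in_chunks() are made-up names, a pthread mutex stands in for pgdat_resize_lock(), and sched_yield() stands in for cond_resched(); the kernel primitives differ in detail.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical stand-in for per-node state; not the kernel's pg_data_t. */
struct node_state {
	pthread_mutex_t lock;
	unsigned long next;	/* first item not yet initialised */
	unsigned long end;	/* one past the last item */
	unsigned long yields;	/* how often the CPU was offered to others */
};

/*
 * Initialise [next, end) in chunks of at most chunk items. The lock is
 * held only while one chunk is processed; between chunks it is dropped
 * and sched_yield() marks the voluntary scheduling point.
 * Returns the number of items processed.
 */
unsigned long init_in_chunks(struct node_state *ns, unsigned long chunk)
{
	unsigned long done = 0;

	assert(chunk > 0);

	for (;;) {
		unsigned long n;
		int finished;

		pthread_mutex_lock(&ns->lock);
		n = ns->end > ns->next ? ns->end - ns->next : 0;
		if (n > chunk)
			n = chunk;
		ns->next += n;		/* "initialise" n items */
		done += n;
		finished = ns->next >= ns->end;
		pthread_mutex_unlock(&ns->lock);

		if (finished)
			break;

		sched_yield();		/* analogue of cond_resched() */
		ns->yields++;
	}
	return done;
}
```

The yield sits outside the locked region on purpose, mirroring the patch, where cond_resched() is called only after pgdat_resize_unlock().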
On Wed, Apr 1, 2020 at 6:42 AM David Hildenbrand <[email protected]> wrote:
>
> Without CONFIG_PREEMPT, it can happen that we get soft lockups detected,
> e.g., while booting up.
>
> [ 105.608900] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
> [ 105.608933] Modules linked in:
> [ 105.608933] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4
> [ 105.608933] Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
> [ 105.608933] RIP: 0010:__pageblock_pfn_to_page+0x134/0x1c0
> [ 105.608933] Code: 85 c0 74 71 4a 8b 04 d0 48 85 c0 74 68 48 01 c1 74 63 f6 01 04 74 5e 48 c1 e7 06 4c 8b 05 cc 991
> [ 105.608933] RSP: 0000:ffffb6d94000fe60 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
> [ 105.608933] RAX: fffff81953250000 RBX: 000000000a4c9600 RCX: ffff8fe9ff7c1990
> [ 105.608933] RDX: ffff8fe9ff7dab80 RSI: 000000000a4c95ff RDI: 0000000293250000
> [ 105.608933] RBP: ffff8fe9ff7dab80 R08: fffff816c0000000 R09: 0000000000000008
> [ 105.608933] R10: 0000000000000014 R11: 0000000000000014 R12: 0000000000000000
> [ 105.608933] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 105.608933] FS: 0000000000000000(0000) GS:ffff8fe1ff400000(0000) knlGS:0000000000000000
> [ 105.608933] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 105.608933] CR2: 000000000f613000 CR3: 00000088cf20a000 CR4: 00000000000006f0
> [ 105.608933] Call Trace:
> [ 105.608933] set_zone_contiguous+0x56/0x70
> [ 105.608933] page_alloc_init_late+0x166/0x176
> [ 105.608933] kernel_init_freeable+0xfa/0x255
> [ 105.608933] ? rest_init+0xaa/0xaa
> [ 105.608933] kernel_init+0xa/0x106
> [ 105.608933] ret_from_fork+0x35/0x40
>
> The issue becomes visible when having a lot of memory (e.g., 4TB)
> assigned to a single NUMA node - a system that can easily be created
> using QEMU. Inside VMs on a hypervisor with quite some memory
> overcommit, this is fairly easy to trigger.
>
> Cc: Andrew Morton <[email protected]>
> Cc: Kirill Tkhai <[email protected]>
> Cc: Shile Zhang <[email protected]>
> Cc: Pavel Tatashin <[email protected]>
> Cc: Daniel Jordan <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Baoquan He <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Pavel Tatashin <[email protected]>
On Wed, Apr 1, 2020 at 6:42 AM David Hildenbrand <[email protected]> wrote:
>
> With CONFIG_DEFERRED_STRUCT_PAGE_INIT and without CONFIG_PREEMPT, it can
> happen that we get RCU stalls detected when booting up.
>
> [ 60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 60.475000] rcu: 1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
> [ 60.475000] rcu: (detected by 0, t=60002 jiffies, g=-1199, q=1)
> [ 60.475000] Sending NMI from CPU 0 to CPUs 1:
> [ 1.760091] NMI backtrace for cpu 1
> [ 1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
> [ 1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
> [ 1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
> [ 1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
> [ 1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
> [ 1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
> [ 1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
> [ 1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
> [ 1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
> [ 1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
> [ 1.760091] FS: 0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
> [ 1.760091] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
> [ 1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 1.760091] Call Trace:
> [ 1.760091] deferred_init_pages+0x8f/0xbf
> [ 1.760091] deferred_init_memmap+0x184/0x29d
> [ 1.760091] ? deferred_free_pages.isra.97+0xba/0xba
> [ 1.760091] kthread+0x112/0x130
> [ 1.760091] ? kthread_flush_work_fn+0x10/0x10
> [ 1.760091] ret_from_fork+0x35/0x40
> [ 89.123011] node 0 initialised, 1055935372 pages in 88650ms
>
> The issue becomes visible when having a lot of memory (e.g., 4TB)
> assigned to a single NUMA node - a system that can easily be created
> using QEMU. Inside VMs on a hypervisor with quite some memory
> overcommit, this is fairly easy to trigger.
>
> Adding the cond_resched() makes RCU happy.
>
> Reported-by: Yiqian Wei <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Kirill Tkhai <[email protected]>
> Cc: Shile Zhang <[email protected]>
> Cc: Pavel Tatashin <[email protected]>
> Cc: Daniel Jordan <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Baoquan He <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Pavel Tatashin <[email protected]>
On 04/01/20 at 12:41pm, David Hildenbrand wrote:
> With CONFIG_DEFERRED_STRUCT_PAGE_INIT and without CONFIG_PREEMPT, it can
> happen that we get RCU stalls detected when booting up.
>
> [ 60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 60.475000] rcu: 1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
> [ 60.475000] rcu: (detected by 0, t=60002 jiffies, g=-1199, q=1)
> [ 60.475000] Sending NMI from CPU 0 to CPUs 1:
> [ 1.760091] NMI backtrace for cpu 1
> [ 1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
> [ 1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
> [ 1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
> [ 1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
> [ 1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
> [ 1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
> [ 1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
> [ 1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
> [ 1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
> [ 1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
> [ 1.760091] FS: 0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
> [ 1.760091] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
> [ 1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 1.760091] Call Trace:
> [ 1.760091] deferred_init_pages+0x8f/0xbf
> [ 1.760091] deferred_init_memmap+0x184/0x29d
> [ 1.760091] ? deferred_free_pages.isra.97+0xba/0xba
> [ 1.760091] kthread+0x112/0x130
> [ 1.760091] ? kthread_flush_work_fn+0x10/0x10
> [ 1.760091] ret_from_fork+0x35/0x40
> [ 89.123011] node 0 initialised, 1055935372 pages in 88650ms
>
> The issue becomes visible when having a lot of memory (e.g., 4TB)
> assigned to a single NUMA node - a system that can easily be created
> using QEMU. Inside VMs on a hypervisor with quite some memory
> overcommit, this is fairly easy to trigger.
>
> Adding the cond_resched() makes RCU happy.
>
> Reported-by: Yiqian Wei <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Kirill Tkhai <[email protected]>
> Cc: Shile Zhang <[email protected]>
> Cc: Pavel Tatashin <[email protected]>
> Cc: Daniel Jordan <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Baoquan He <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/page_alloc.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ca1453204e66..084cabffc90d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1877,6 +1877,7 @@ static int __init deferred_init_memmap(void *data)
> prev_nr_pages = nr_pages;
> pgdat->first_deferred_pfn = spfn;
> pgdat_resize_unlock(pgdat, &flags);
> + cond_resched();
> goto again;
Reviewed-by: Baoquan He <[email protected]>
On 01.04.20 12:41, David Hildenbrand wrote:
> Two fixes for misleading stall messages / soft lockups with huge nodes /
> zones during boot without CONFIG_PREEMPT.
>
> David Hildenbrand (2):
> mm/page_alloc: fix RCU stalls during deferred page initialization
> mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()
>
> mm/page_alloc.c | 2 ++
> 1 file changed, 2 insertions(+)
>
Patch #1 requires "[PATCH v3] mm: fix tick timer stall during deferred
page init"
https://lkml.kernel.org/r/[email protected]
--
Thanks,
David / dhildenb
> With CONFIG_DEFERRED_STRUCT_PAGE_INIT and without CONFIG_PREEMPT, it can
> happen that we get RCU stalls detected when booting up.
>
> [ 60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 60.475000] rcu: 1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
> [ 60.475000] rcu: (detected by 0, t=60002 jiffies, g=-1199, q=1)
> [ 60.475000] Sending NMI from CPU 0 to CPUs 1:
> [ 1.760091] NMI backtrace for cpu 1
> [ 1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
> [ 1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
> [ 1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
> [ 1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
> [ 1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
> [ 1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
> [ 1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
> [ 1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
> [ 1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
> [ 1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
> [ 1.760091] FS: 0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
> [ 1.760091] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
> [ 1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 1.760091] Call Trace:
> [ 1.760091] deferred_init_pages+0x8f/0xbf
> [ 1.760091] deferred_init_memmap+0x184/0x29d
> [ 1.760091] ? deferred_free_pages.isra.97+0xba/0xba
> [ 1.760091] kthread+0x112/0x130
> [ 1.760091] ? kthread_flush_work_fn+0x10/0x10
> [ 1.760091] ret_from_fork+0x35/0x40
> [ 89.123011] node 0 initialised, 1055935372 pages in 88650ms
>
> The issue becomes visible when having a lot of memory (e.g., 4TB)
> assigned to a single NUMA node - a system that can easily be created
> using QEMU. Inside VMs on a hypervisor with quite some memory
> overcommit, this is fairly easy to trigger.
>
> Adding the cond_resched() makes RCU happy.
>
> Reported-by: Yiqian Wei <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Kirill Tkhai <[email protected]>
> Cc: Shile Zhang <[email protected]>
> Cc: Pavel Tatashin <[email protected]>
> Cc: Daniel Jordan <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Baoquan He <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/page_alloc.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ca1453204e66..084cabffc90d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1877,6 +1877,7 @@ static int __init deferred_init_memmap(void *data)
> prev_nr_pages = nr_pages;
> pgdat->first_deferred_pfn = spfn;
> pgdat_resize_unlock(pgdat, &flags);
> + cond_resched();
> goto again;
> }
> }
> --
Reviewed-by: Pankaj Gupta <[email protected]>
On 04/01/20 at 12:41pm, David Hildenbrand wrote:
> Without CONFIG_PREEMPT, it can happen that we get soft lockups detected,
> e.g., while booting up.
>
> [ 105.608900] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
> [ 105.608933] Modules linked in:
> [ 105.608933] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4
> [ 105.608933] Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
> [ 105.608933] RIP: 0010:__pageblock_pfn_to_page+0x134/0x1c0
> [ 105.608933] Code: 85 c0 74 71 4a 8b 04 d0 48 85 c0 74 68 48 01 c1 74 63 f6 01 04 74 5e 48 c1 e7 06 4c 8b 05 cc 991
> [ 105.608933] RSP: 0000:ffffb6d94000fe60 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
> [ 105.608933] RAX: fffff81953250000 RBX: 000000000a4c9600 RCX: ffff8fe9ff7c1990
> [ 105.608933] RDX: ffff8fe9ff7dab80 RSI: 000000000a4c95ff RDI: 0000000293250000
> [ 105.608933] RBP: ffff8fe9ff7dab80 R08: fffff816c0000000 R09: 0000000000000008
> [ 105.608933] R10: 0000000000000014 R11: 0000000000000014 R12: 0000000000000000
> [ 105.608933] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 105.608933] FS: 0000000000000000(0000) GS:ffff8fe1ff400000(0000) knlGS:0000000000000000
> [ 105.608933] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 105.608933] CR2: 000000000f613000 CR3: 00000088cf20a000 CR4: 00000000000006f0
> [ 105.608933] Call Trace:
> [ 105.608933] set_zone_contiguous+0x56/0x70
> [ 105.608933] page_alloc_init_late+0x166/0x176
> [ 105.608933] kernel_init_freeable+0xfa/0x255
> [ 105.608933] ? rest_init+0xaa/0xaa
> [ 105.608933] kernel_init+0xa/0x106
> [ 105.608933] ret_from_fork+0x35/0x40
>
> The issue becomes visible when having a lot of memory (e.g., 4TB)
> assigned to a single NUMA node - a system that can easily be created
> using QEMU. Inside VMs on a hypervisor with quite some memory
> overcommit, this is fairly easy to trigger.
>
> Cc: Andrew Morton <[email protected]>
> Cc: Kirill Tkhai <[email protected]>
> Cc: Shile Zhang <[email protected]>
> Cc: Pavel Tatashin <[email protected]>
> Cc: Daniel Jordan <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Baoquan He <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/page_alloc.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 084cabffc90d..cc4f07d52939 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1607,6 +1607,7 @@ void set_zone_contiguous(struct zone *zone)
> if (!__pageblock_pfn_to_page(block_start_pfn,
> block_end_pfn, zone))
> return;
> + cond_resched();
> }
Reviewed-by: Baoquan He <[email protected]>
> Without CONFIG_PREEMPT, it can happen that we get soft lockups detected,
> e.g., while booting up.
>
> [ 105.608900] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
> [ 105.608933] Modules linked in:
> [ 105.608933] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4
> [ 105.608933] Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
> [ 105.608933] RIP: 0010:__pageblock_pfn_to_page+0x134/0x1c0
> [ 105.608933] Code: 85 c0 74 71 4a 8b 04 d0 48 85 c0 74 68 48 01 c1 74 63 f6 01 04 74 5e 48 c1 e7 06 4c 8b 05 cc 991
> [ 105.608933] RSP: 0000:ffffb6d94000fe60 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
> [ 105.608933] RAX: fffff81953250000 RBX: 000000000a4c9600 RCX: ffff8fe9ff7c1990
> [ 105.608933] RDX: ffff8fe9ff7dab80 RSI: 000000000a4c95ff RDI: 0000000293250000
> [ 105.608933] RBP: ffff8fe9ff7dab80 R08: fffff816c0000000 R09: 0000000000000008
> [ 105.608933] R10: 0000000000000014 R11: 0000000000000014 R12: 0000000000000000
> [ 105.608933] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 105.608933] FS: 0000000000000000(0000) GS:ffff8fe1ff400000(0000) knlGS:0000000000000000
> [ 105.608933] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 105.608933] CR2: 000000000f613000 CR3: 00000088cf20a000 CR4: 00000000000006f0
> [ 105.608933] Call Trace:
> [ 105.608933] set_zone_contiguous+0x56/0x70
> [ 105.608933] page_alloc_init_late+0x166/0x176
> [ 105.608933] kernel_init_freeable+0xfa/0x255
> [ 105.608933] ? rest_init+0xaa/0xaa
> [ 105.608933] kernel_init+0xa/0x106
> [ 105.608933] ret_from_fork+0x35/0x40
>
> The issue becomes visible when having a lot of memory (e.g., 4TB)
> assigned to a single NUMA node - a system that can easily be created
> using QEMU. Inside VMs on a hypervisor with quite some memory
> overcommit, this is fairly easy to trigger.
>
> Cc: Andrew Morton <[email protected]>
> Cc: Kirill Tkhai <[email protected]>
> Cc: Shile Zhang <[email protected]>
> Cc: Pavel Tatashin <[email protected]>
> Cc: Daniel Jordan <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Baoquan He <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/page_alloc.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 084cabffc90d..cc4f07d52939 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1607,6 +1607,7 @@ void set_zone_contiguous(struct zone *zone)
> if (!__pageblock_pfn_to_page(block_start_pfn,
> block_end_pfn, zone))
> return;
> + cond_resched();
> }
>
> /* We confirm that there is no hole */
> --
Reviewed-by: Pankaj Gupta <[email protected]>
On 2020/4/1 18:41, David Hildenbrand wrote:
> Without CONFIG_PREEMPT, it can happen that we get soft lockups detected,
> e.g., while booting up.
>
> [ 105.608900] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
> [ 105.608933] Modules linked in:
> [ 105.608933] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4
> [ 105.608933] Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
> [ 105.608933] RIP: 0010:__pageblock_pfn_to_page+0x134/0x1c0
> [ 105.608933] Code: 85 c0 74 71 4a 8b 04 d0 48 85 c0 74 68 48 01 c1 74 63 f6 01 04 74 5e 48 c1 e7 06 4c 8b 05 cc 991
> [ 105.608933] RSP: 0000:ffffb6d94000fe60 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
> [ 105.608933] RAX: fffff81953250000 RBX: 000000000a4c9600 RCX: ffff8fe9ff7c1990
> [ 105.608933] RDX: ffff8fe9ff7dab80 RSI: 000000000a4c95ff RDI: 0000000293250000
> [ 105.608933] RBP: ffff8fe9ff7dab80 R08: fffff816c0000000 R09: 0000000000000008
> [ 105.608933] R10: 0000000000000014 R11: 0000000000000014 R12: 0000000000000000
> [ 105.608933] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 105.608933] FS: 0000000000000000(0000) GS:ffff8fe1ff400000(0000) knlGS:0000000000000000
> [ 105.608933] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 105.608933] CR2: 000000000f613000 CR3: 00000088cf20a000 CR4: 00000000000006f0
> [ 105.608933] Call Trace:
> [ 105.608933] set_zone_contiguous+0x56/0x70
> [ 105.608933] page_alloc_init_late+0x166/0x176
> [ 105.608933] kernel_init_freeable+0xfa/0x255
> [ 105.608933] ? rest_init+0xaa/0xaa
> [ 105.608933] kernel_init+0xa/0x106
> [ 105.608933] ret_from_fork+0x35/0x40
>
> The issue becomes visible when having a lot of memory (e.g., 4TB)
> assigned to a single NUMA node - a system that can easily be created
> using QEMU. Inside VMs on a hypervisor with quite some memory
> overcommit, this is fairly easy to trigger.
>
> Cc: Andrew Morton <[email protected]>
> Cc: Kirill Tkhai <[email protected]>
> Cc: Shile Zhang <[email protected]>
> Cc: Pavel Tatashin <[email protected]>
> Cc: Daniel Jordan <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Baoquan He <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/page_alloc.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 084cabffc90d..cc4f07d52939 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1607,6 +1607,7 @@ void set_zone_contiguous(struct zone *zone)
> if (!__pageblock_pfn_to_page(block_start_pfn,
> block_end_pfn, zone))
> return;
> + cond_resched();
> }
>
> /* We confirm that there is no hole */
Reviewed-by: Shile Zhang <[email protected]>
On 2020/4/1 18:41, David Hildenbrand wrote:
> With CONFIG_DEFERRED_STRUCT_PAGE_INIT and without CONFIG_PREEMPT, it can
> happen that we get RCU stalls detected when booting up.
>
> [ 60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 60.475000] rcu: 1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
> [ 60.475000] rcu: (detected by 0, t=60002 jiffies, g=-1199, q=1)
> [ 60.475000] Sending NMI from CPU 0 to CPUs 1:
> [ 1.760091] NMI backtrace for cpu 1
> [ 1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
> [ 1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
> [ 1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
> [ 1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
> [ 1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
> [ 1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
> [ 1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
> [ 1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
> [ 1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
> [ 1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
> [ 1.760091] FS: 0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
> [ 1.760091] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
> [ 1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 1.760091] Call Trace:
> [ 1.760091] deferred_init_pages+0x8f/0xbf
> [ 1.760091] deferred_init_memmap+0x184/0x29d
> [ 1.760091] ? deferred_free_pages.isra.97+0xba/0xba
> [ 1.760091] kthread+0x112/0x130
> [ 1.760091] ? kthread_flush_work_fn+0x10/0x10
> [ 1.760091] ret_from_fork+0x35/0x40
> [ 89.123011] node 0 initialised, 1055935372 pages in 88650ms
>
> The issue becomes visible when having a lot of memory (e.g., 4TB)
> assigned to a single NUMA node - a system that can easily be created
> using QEMU. Inside VMs on a hypervisor with quite some memory
> overcommit, this is fairly easy to trigger.
>
> Adding the cond_resched() makes RCU happy.
>
> Reported-by: Yiqian Wei <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Kirill Tkhai <[email protected]>
> Cc: Shile Zhang <[email protected]>
> Cc: Pavel Tatashin <[email protected]>
> Cc: Daniel Jordan <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Baoquan He <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/page_alloc.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ca1453204e66..084cabffc90d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1877,6 +1877,7 @@ static int __init deferred_init_memmap(void *data)
> prev_nr_pages = nr_pages;
> pgdat->first_deferred_pfn = spfn;
> pgdat_resize_unlock(pgdat, &flags);
> + cond_resched();
> goto again;
> }
> }
Reviewed-by: Shile Zhang <[email protected]>
> On 01.04.20 12:41, David Hildenbrand wrote:
> > Two fixes for misleading stall messages / soft lockups with huge nodes /
> > zones during boot without CONFIG_PREEMPT.
> >
> > David Hildenbrand (2):
> > mm/page_alloc: fix RCU stalls during deferred page initialization
> > mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()
> >
> > mm/page_alloc.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
>
> Patch #1 requires "[PATCH v3] mm: fix tick timer stall during deferred
> page init"
>
> https://lkml.kernel.org/r/[email protected]
Thanks! Took me some time to figure it out.
Pankaj
On Wed, Apr 01, 2020 at 04:31:51PM +0200, Pankaj Gupta wrote:
> > On 01.04.20 12:41, David Hildenbrand wrote:
> > > Two fixes for misleading stall messages / soft lockups with huge nodes /
> > > zones during boot without CONFIG_PREEMPT.
> > >
> > > David Hildenbrand (2):
> > > mm/page_alloc: fix RCU stalls during deferred page initialization
> > > mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()
> > >
> > > mm/page_alloc.c | 2 ++
> > > 1 file changed, 2 insertions(+)
> > >
> >
> > Patch #1 requires "[PATCH v3] mm: fix tick timer stall during deferred
> > page init"
> >
> > https://lkml.kernel.org/r/[email protected]
>
> Thanks! Took me some time to figure it out.
FYI, I'm planning to post an alternate version of that fix, hopefully today if
all goes well with my testing.
On Wed 01-04-20 12:41:56, David Hildenbrand wrote:
> Without CONFIG_PREEMPT, it can happen that we get soft lockups detected,
> e.g., while booting up.
>
> [ 105.608900] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
> [ 105.608933] Modules linked in:
> [ 105.608933] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4
> [ 105.608933] Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
> [ 105.608933] RIP: 0010:__pageblock_pfn_to_page+0x134/0x1c0
> [ 105.608933] Code: 85 c0 74 71 4a 8b 04 d0 48 85 c0 74 68 48 01 c1 74 63 f6 01 04 74 5e 48 c1 e7 06 4c 8b 05 cc 991
> [ 105.608933] RSP: 0000:ffffb6d94000fe60 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13
> [ 105.608933] RAX: fffff81953250000 RBX: 000000000a4c9600 RCX: ffff8fe9ff7c1990
> [ 105.608933] RDX: ffff8fe9ff7dab80 RSI: 000000000a4c95ff RDI: 0000000293250000
> [ 105.608933] RBP: ffff8fe9ff7dab80 R08: fffff816c0000000 R09: 0000000000000008
> [ 105.608933] R10: 0000000000000014 R11: 0000000000000014 R12: 0000000000000000
> [ 105.608933] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 105.608933] FS: 0000000000000000(0000) GS:ffff8fe1ff400000(0000) knlGS:0000000000000000
> [ 105.608933] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 105.608933] CR2: 000000000f613000 CR3: 00000088cf20a000 CR4: 00000000000006f0
> [ 105.608933] Call Trace:
> [ 105.608933] set_zone_contiguous+0x56/0x70
> [ 105.608933] page_alloc_init_late+0x166/0x176
> [ 105.608933] kernel_init_freeable+0xfa/0x255
> [ 105.608933] ? rest_init+0xaa/0xaa
> [ 105.608933] kernel_init+0xa/0x106
> [ 105.608933] ret_from_fork+0x35/0x40
>
> The issue becomes visible when having a lot of memory (e.g., 4TB)
> assigned to a single NUMA node - a system that can easily be created
> using QEMU. Inside VMs on a hypervisor with quite some memory
> overcommit, this is fairly easy to trigger.
>
> Cc: Andrew Morton <[email protected]>
> Cc: Kirill Tkhai <[email protected]>
> Cc: Shile Zhang <[email protected]>
> Cc: Pavel Tatashin <[email protected]>
> Cc: Daniel Jordan <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Baoquan He <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
> ---
> mm/page_alloc.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 084cabffc90d..cc4f07d52939 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1607,6 +1607,7 @@ void set_zone_contiguous(struct zone *zone)
> if (!__pageblock_pfn_to_page(block_start_pfn,
> block_end_pfn, zone))
> return;
> + cond_resched();
> }
>
> /* We confirm that there is no hole */
> --
> 2.25.1
--
Michal Hocko
SUSE Labs
On Wed 01-04-20 12:41:55, David Hildenbrand wrote:
> With CONFIG_DEFERRED_STRUCT_PAGE_INIT and without CONFIG_PREEMPT, it can
> happen that we get RCU stalls detected when booting up.
>
> [ 60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 60.475000] rcu: 1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
> [ 60.475000] rcu: (detected by 0, t=60002 jiffies, g=-1199, q=1)
> [ 60.475000] Sending NMI from CPU 0 to CPUs 1:
> [ 1.760091] NMI backtrace for cpu 1
> [ 1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
> [ 1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
> [ 1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
> [ 1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
> [ 1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
> [ 1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
> [ 1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
> [ 1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
> [ 1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
> [ 1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
> [ 1.760091] FS: 0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
> [ 1.760091] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
> [ 1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 1.760091] Call Trace:
> [ 1.760091] deferred_init_pages+0x8f/0xbf
> [ 1.760091] deferred_init_memmap+0x184/0x29d
> [ 1.760091] ? deferred_free_pages.isra.97+0xba/0xba
> [ 1.760091] kthread+0x112/0x130
> [ 1.760091] ? kthread_flush_work_fn+0x10/0x10
> [ 1.760091] ret_from_fork+0x35/0x40
> [ 89.123011] node 0 initialised, 1055935372 pages in 88650ms
>
> The issue becomes visible when having a lot of memory (e.g., 4TB)
> assigned to a single NUMA node - a system that can easily be created
> using QEMU. Inside VMs on a hypervisor with quite some memory
> overcommit, this is fairly easy to trigger.
>
> Adding the cond_resched() makes RCU happy.
I believe the patch you depend on is the wrong way to go, so please let's
wait until that settles down. But your cond_resched() makes perfect
sense. Just have it called once every $FOO pages - e.g. hotplug does it
once per section. This is not bound to SPARSEMEM, so you would have to use
a different constant, but something along those lines would work.
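
To illustrate the "once every $FOO pages" idea, here is a minimal userspace
sketch, not kernel code. PAGES_PER_RESCHED, cond_resched_stub(),
init_one_page_stub() and init_pfn_range() are hypothetical names made up for
this example; a real patch would call the kernel's cond_resched() and would
have to pick a suitable constant.

```c
#include <assert.h>

/* Hypothetical batch size: pages processed between two yield points. */
#define PAGES_PER_RESCHED 32768UL

/* Counts how often the loop offered to yield the CPU. */
static unsigned long resched_calls;

/* Stand-in for the kernel's cond_resched(); here it only counts calls. */
static void cond_resched_stub(void)
{
	resched_calls++;
}

/* Stand-in for the per-page initialization work. */
static void init_one_page_stub(unsigned long pfn)
{
	(void)pfn;
}

/*
 * Walk [start_pfn, end_pfn) and offer to reschedule once per
 * PAGES_PER_RESCHED pages instead of once per page or never.
 * Returns the number of pages processed.
 */
static unsigned long init_pfn_range(unsigned long start_pfn,
				    unsigned long end_pfn)
{
	unsigned long pfn, nr_pages = 0;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		init_one_page_stub(pfn);
		nr_pages++;
		if (nr_pages % PAGES_PER_RESCHED == 0)
			cond_resched_stub();
	}
	return nr_pages;
}
```

The point of the batching is that the yield check stays cheap relative to the
per-page work while still bounding how long the loop runs without a
scheduling point.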
>
> Reported-by: Yiqian Wei <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Kirill Tkhai <[email protected]>
> Cc: Shile Zhang <[email protected]>
> Cc: Pavel Tatashin <[email protected]>
> Cc: Daniel Jordan <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Cc: Baoquan He <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/page_alloc.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ca1453204e66..084cabffc90d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1877,6 +1877,7 @@ static int __init deferred_init_memmap(void *data)
> prev_nr_pages = nr_pages;
> pgdat->first_deferred_pfn = spfn;
> pgdat_resize_unlock(pgdat, &flags);
> + cond_resched();
> goto again;
> }
> }
> --
> 2.25.1
--
Michal Hocko
SUSE Labs
On 01.04.20 17:45, Michal Hocko wrote:
> On Wed 01-04-20 12:41:55, David Hildenbrand wrote:
>> With CONFIG_DEFERRED_STRUCT_PAGE_INIT and without CONFIG_PREEMPT, it can
>> happen that we get RCU stalls detected when booting up.
>>
>> [ 60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
>> [ 60.475000] rcu: 1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
>> [ 60.475000] rcu: (detected by 0, t=60002 jiffies, g=-1199, q=1)
>> [ 60.475000] Sending NMI from CPU 0 to CPUs 1:
>> [ 1.760091] NMI backtrace for cpu 1
>> [ 1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
>> [ 1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
>> [ 1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
>> [ 1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
>> [ 1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
>> [ 1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
>> [ 1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
>> [ 1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
>> [ 1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
>> [ 1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
>> [ 1.760091] FS: 0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
>> [ 1.760091] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
>> [ 1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 1.760091] Call Trace:
>> [ 1.760091] deferred_init_pages+0x8f/0xbf
>> [ 1.760091] deferred_init_memmap+0x184/0x29d
>> [ 1.760091] ? deferred_free_pages.isra.97+0xba/0xba
>> [ 1.760091] kthread+0x112/0x130
>> [ 1.760091] ? kthread_flush_work_fn+0x10/0x10
>> [ 1.760091] ret_from_fork+0x35/0x40
>> [ 89.123011] node 0 initialised, 1055935372 pages in 88650ms
>>
>> The issue becomes visible when having a lot of memory (e.g., 4TB)
>> assigned to a single NUMA node - a system that can easily be created
>> using QEMU. Inside VMs on a hypervisor with quite some memory
>> overcommit, this is fairly easy to trigger.
>>
>> Adding the cond_resched() makes RCU happy.
>
> I believe the patch you depend on is the wrong way to go, so please let's
> wait until that settles down.
I saw an RB as a reply and thought this would get picked up fairly soon.
But sure, let's see what that will look like. Thanks
--
Thanks,
David / dhildenb
On 01.04.20 16:45, Daniel Jordan wrote:
> On Wed, Apr 01, 2020 at 04:31:51PM +0200, Pankaj Gupta wrote:
>>> On 01.04.20 12:41, David Hildenbrand wrote:
>>>> Two fixes for misleading stall messages / soft lockups with huge nodes /
>>>> zones during boot without CONFIG_PREEMPT.
>>>>
>>>> David Hildenbrand (2):
>>>> mm/page_alloc: fix RCU stalls during deferred page initialization
>>>> mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()
>>>>
>>>> mm/page_alloc.c | 2 ++
>>>> 1 file changed, 2 insertions(+)
>>>>
>>>
>>> Patch #1 requires "[PATCH v3] mm: fix tick timer stall during deferred
>>> page init"
>>>
>>> https://lkml.kernel.org/r/[email protected]
>>
>> Thanks! Took me some time to figure it out.
>
> FYI, I'm planning to post an alternate version of that fix, hopefully today if
> all goes well with my testing.
>
Cool, please CC me :)
--
Thanks,
David / dhildenb
On Wed, Apr 01, 2020 at 05:54:40PM +0200, David Hildenbrand wrote:
> On 01.04.20 16:45, Daniel Jordan wrote:
> > On Wed, Apr 01, 2020 at 04:31:51PM +0200, Pankaj Gupta wrote:
> >>> On 01.04.20 12:41, David Hildenbrand wrote:
> >>>> Two fixes for misleading stall messages / soft lockups with huge nodes /
> >>>> zones during boot without CONFIG_PREEMPT.
> >>>>
> >>>> David Hildenbrand (2):
> >>>> mm/page_alloc: fix RCU stalls during deferred page initialization
> >>>> mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()
> >>>>
> >>>> mm/page_alloc.c | 2 ++
> >>>> 1 file changed, 2 insertions(+)
> >>>>
> >>>
> >>> Patch #1 requires "[PATCH v3] mm: fix tick timer stall during deferred
> >>> page init"
> >>>
> >>> https://lkml.kernel.org/r/[email protected]
> >>
> >> Thanks! Took me some time to figure it out.
> >
> > FYI, I'm planning to post an alternate version of that fix, hopefully today if
> > all goes well with my testing.
> >
>
> Cool, please CC me :)
Sure, in fact you already were! :)
On Wed, 1 Apr 2020 10:45:29 -0400 Daniel Jordan <[email protected]> wrote:
> On Wed, Apr 01, 2020 at 04:31:51PM +0200, Pankaj Gupta wrote:
> > > On 01.04.20 12:41, David Hildenbrand wrote:
> > > > Two fixes for misleading stall messages / soft lockups with huge nodes /
> > > > zones during boot without CONFIG_PREEMPT.
> > > >
> > > > David Hildenbrand (2):
> > > > mm/page_alloc: fix RCU stalls during deferred page initialization
> > > > mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()
> > > >
> > > > mm/page_alloc.c | 2 ++
> > > > 1 file changed, 2 insertions(+)
> > > >
> > >
> > > Patch #1 requires "[PATCH v3] mm: fix tick timer stall during deferred
> > > page init"
> > >
> > > https://lkml.kernel.org/r/[email protected]
> >
> > Thanks! Took me some time to figure it out.
>
> FYI, I'm planning to post an alternate version of that fix, hopefully today if
> all goes well with my testing.
I assume you'll redo this two-patch series to apply on top of this
forthcoming patch?
> On 01.04.2020 at 20:06, Andrew Morton <[email protected]> wrote:
>
> On Wed, 1 Apr 2020 10:45:29 -0400 Daniel Jordan <[email protected]> wrote:
>
>> On Wed, Apr 01, 2020 at 04:31:51PM +0200, Pankaj Gupta wrote:
>>>>> On 01.04.20 12:41, David Hildenbrand wrote:
>>>>>> Two fixes for misleading stall messages / soft lockups with huge nodes /
>>>>>> zones during boot without CONFIG_PREEMPT.
>>>>>>
>>>>>> David Hildenbrand (2):
>>>>>> mm/page_alloc: fix RCU stalls during deferred page initialization
>>>>>> mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()
>>>>>>
>>>>>> mm/page_alloc.c | 2 ++
>>>>>> 1 file changed, 2 insertions(+)
>>>>>>
>>>>>
>>>>> Patch #1 requires "[PATCH v3] mm: fix tick timer stall during deferred
>>>>> page init"
>>>>>
>>>>> https://lkml.kernel.org/r/[email protected]
>>>
>>> Thanks! Took me some time to figure it out.
>>
>> FYI, I'm planning to post an alternate version of that fix, hopefully today if
>> all goes well with my testing.
>
> I assume you'll redo this two-patch series to apply on top of this
> forthcoming patch?
>
Yes, will wait until the old one in -next has been replaced by a revised one. Thanks!