2015-11-02 08:06:22

by kernel test robot

[permalink] [raw]
Subject: [lkp] [fs] df4c0e36f1: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]

FYI, we noticed the below changes on

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
commit df4c0e36f1b1782b0611a77c52cc240e5c4752dd ("fs: dcache: manually unpoison dname after allocation to shut up kasan's reports")


The commit fixed a KASan bug, but introduced or revealed a soft lockup
issue as follow.


[ 116.546697] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
[ 116.546697] irq event stamp: 3018750
[ 116.546697] hardirqs last enabled at (3018749): [<ffffffff8224f36c>] restore_args+0x0/0x30
[ 116.546697] hardirqs last disabled at (3018750): [<ffffffff8224f65d>] apic_timer_interrupt+0x6d/0x80
[ 116.546697] softirqs last enabled at (3018748): [<ffffffff810de898>] __do_softirq+0x417/0x65b
[ 116.546697] softirqs last disabled at (3018743): [<ffffffff810ded30>] irq_exit+0x55/0xc3
[ 116.546697] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.19.0-05249-gdf4c0e3 #1
[ 116.546697] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[ 116.546697] task: ffff88002f3f0000 ti: ffff88002f3f8000 task.ti: ffff88002f3f8000
[ 116.546697] RIP: 0010:[<ffffffff8125fdf9>] [<ffffffff8125fdf9>] __asan_load4+0x12/0x91
[ 116.546697] RSP: 0000:ffff88002f3fbdc8 EFLAGS: 00000282
[ 116.546697] RAX: ffff7fffffffffff RBX: 0000000000000000 RCX: dffffc0000000000
[ 116.546697] RDX: fffffbfff0763087 RSI: ffffffff83b18b10 RDI: ffffffff83b1803c
[ 116.546697] RBP: ffff88002f3fbdf8 R08: 0000000000000001 R09: fffffbfff0763074
[ 116.546697] R10: ffffffff83b183a7 R11: fffffbfff0763074 R12: ffff880038d29003
[ 116.546697] R13: ffffed00071a5200 R14: ffffffff818cfadb R15: ffff88002f3fbdf8
[ 116.546697] FS: 0000000000000000(0000) GS:ffff88002f600000(0000) knlGS:0000000000000000
[ 116.546697] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 116.546697] CR2: 0000000000000000 CR3: 0000000002793000 CR4: 00000000000006f0
[ 116.546697] Stack:
[ 116.546697] ffffffff83b18d5c ffffffff83b18b08 ffffffff83b18878 ffffffff83b18b10
[ 116.546697] ffffffff83b18868 0000000000000000 ffff88002f3fbe48 ffffffff8223c4d3
[ 116.546697] 0000000000000000 c4400d6283b18d40 0000000000000202 0000000000000d98
[ 116.546697] Call Trace:
[ 116.546697] [<ffffffff8223c4d3>] insert_augmented+0x52/0xf5
[ 116.546697] [<ffffffff82b7b4ef>] rbtree_test_init+0x151/0x205
[ 116.546697] [<ffffffff82b7b39e>] ? glob_init+0x11c/0x11c
[ 116.546697] [<ffffffff82b222dc>] do_one_initcall+0x105/0x197
[ 116.546697] [<ffffffff82b22501>] kernel_init_freeable+0x193/0x22e
[ 116.546697] [<ffffffff8222ba4b>] ? rest_init+0x14d/0x14d
[ 116.546697] [<ffffffff8222ba59>] kernel_init+0xe/0xdf
[ 116.546697] [<ffffffff8224e5fc>] ret_from_fork+0x7c/0xb0
[ 116.546697] [<ffffffff8222ba4b>] ? rest_init+0x14d/0x14d
[ 116.546697] Code: 38 00 74 13 48 8b 4d 08 ba 01 00 00 00 be 02 00 00 00 e8 e0 09 00 00 c9 c3 55 48 b8 ff ff ff ff ff 7f ff ff 48 89 e5 48 83 ec 30 <48> 39 c7 77 23 48 8b 45 08 48 89 7d d8 48 8d 7d d8 48 c7 45 e8
[ 116.546697] Kernel panic - not syncing: softlockup: hung tasks
[ 116.546697] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G L 3.19.0-05249-gdf4c0e3 #1
[ 116.546697] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[ 116.546697] 0000000000000000 0000000000000000 0000000000000000 ffff88002f603d48
[ 116.546697] ffffffff8223bcf7 fffffbfff074360b ffffffff8263d4e2 ffff88002f603dc8
[ 116.546697] ffffffff82234300 ffff88002f603dc8 ffffffff00000008 ffff88002f603dd8
[ 116.546697] Call Trace:
[ 116.546697] <IRQ> [<ffffffff8223bcf7>] dump_stack+0x4c/0x65
[ 116.546697] [<ffffffff82234300>] panic+0xed/0x231
[ 116.546697] [<ffffffff811adf9d>] watchdog_timer_fn+0x1d2/0x1f8
[ 116.546697] [<ffffffff811addcb>] ? __touch_watchdog+0x1c/0x1c
[ 116.546697] [<ffffffff8117083e>] __run_hrtimer+0x283/0x6a2
[ 116.546697] [<ffffffff81171aaf>] hrtimer_run_queues+0x126/0x145
[ 116.546697] [<ffffffff8116ff2f>] run_local_timers+0xe/0x1a
[ 116.546697] [<ffffffff8116ff62>] update_process_times+0x27/0x54
[ 116.546697] [<ffffffff81181e76>] tick_nohz_handler+0x101/0x164
[ 116.546697] [<ffffffff81040dbc>] local_apic_timer_interrupt+0x78/0x80
[ 116.546697] [<ffffffff8225144c>] smp_apic_timer_interrupt+0x57/0x69
[ 116.546697] [<ffffffff8224f662>] apic_timer_interrupt+0x72/0x80
[ 116.546697] <EOI> [<ffffffff8125fdf9>] ? __asan_load4+0x12/0x91
[ 116.546697] [<ffffffff8223c4d3>] insert_augmented+0x52/0xf5
[ 116.546697] [<ffffffff82b7b4ef>] rbtree_test_init+0x151/0x205
[ 116.546697] [<ffffffff82b7b39e>] ? glob_init+0x11c/0x11c
[ 116.546697] [<ffffffff82b222dc>] do_one_initcall+0x105/0x197
[ 116.546697] [<ffffffff82b22501>] kernel_init_freeable+0x193/0x22e
[ 116.546697] [<ffffffff8222ba4b>] ? rest_init+0x14d/0x14d
[ 116.546697] [<ffffffff8222ba59>] kernel_init+0xe/0xdf
[ 116.546697] [<ffffffff8224e5fc>] ret_from_fork+0x7c/0xb0
[ 116.546697] [<ffffffff8222ba4b>] ? rest_init+0x14d/0x14d
[ 116.546697] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)


Thanks,
Ying Huang


Attachments:
(No filename) (4.98 kB)
config-3.19.0-05249-gdf4c0e3 (85.69 kB)
dmesg.xz (9.50 kB)
Download all attachments

2015-11-02 09:17:47

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: [lkp] [fs] df4c0e36f1: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]

Hi Ying,

Does testing happen with CONFIG_KASAN=y? If not, this is probably a
flake caused by something else.


On Mon, Nov 2, 2015 at 9:06 AM, kernel test robot
<[email protected]> wrote:
> FYI, we noticed the below changes on
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> commit df4c0e36f1b1782b0611a77c52cc240e5c4752dd ("fs: dcache: manually unpoison dname after allocation to shut up kasan's reports")
>
>
> The commit fixed a KASan bug, but introduced or revealed a soft lockup
> issue as follow.
>
>
> [ 116.546697] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
> [ 116.546697] irq event stamp: 3018750
> [ 116.546697] hardirqs last enabled at (3018749): [<ffffffff8224f36c>] restore_args+0x0/0x30
> [ 116.546697] hardirqs last disabled at (3018750): [<ffffffff8224f65d>] apic_timer_interrupt+0x6d/0x80
> [ 116.546697] softirqs last enabled at (3018748): [<ffffffff810de898>] __do_softirq+0x417/0x65b
> [ 116.546697] softirqs last disabled at (3018743): [<ffffffff810ded30>] irq_exit+0x55/0xc3
> [ 116.546697] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.19.0-05249-gdf4c0e3 #1
> [ 116.546697] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> [ 116.546697] task: ffff88002f3f0000 ti: ffff88002f3f8000 task.ti: ffff88002f3f8000
> [ 116.546697] RIP: 0010:[<ffffffff8125fdf9>] [<ffffffff8125fdf9>] __asan_load4+0x12/0x91
> [ 116.546697] RSP: 0000:ffff88002f3fbdc8 EFLAGS: 00000282
> [ 116.546697] RAX: ffff7fffffffffff RBX: 0000000000000000 RCX: dffffc0000000000
> [ 116.546697] RDX: fffffbfff0763087 RSI: ffffffff83b18b10 RDI: ffffffff83b1803c
> [ 116.546697] RBP: ffff88002f3fbdf8 R08: 0000000000000001 R09: fffffbfff0763074
> [ 116.546697] R10: ffffffff83b183a7 R11: fffffbfff0763074 R12: ffff880038d29003
> [ 116.546697] R13: ffffed00071a5200 R14: ffffffff818cfadb R15: ffff88002f3fbdf8
> [ 116.546697] FS: 0000000000000000(0000) GS:ffff88002f600000(0000) knlGS:0000000000000000
> [ 116.546697] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 116.546697] CR2: 0000000000000000 CR3: 0000000002793000 CR4: 00000000000006f0
> [ 116.546697] Stack:
> [ 116.546697] ffffffff83b18d5c ffffffff83b18b08 ffffffff83b18878 ffffffff83b18b10
> [ 116.546697] ffffffff83b18868 0000000000000000 ffff88002f3fbe48 ffffffff8223c4d3
> [ 116.546697] 0000000000000000 c4400d6283b18d40 0000000000000202 0000000000000d98
> [ 116.546697] Call Trace:
> [ 116.546697] [<ffffffff8223c4d3>] insert_augmented+0x52/0xf5
> [ 116.546697] [<ffffffff82b7b4ef>] rbtree_test_init+0x151/0x205
> [ 116.546697] [<ffffffff82b7b39e>] ? glob_init+0x11c/0x11c
> [ 116.546697] [<ffffffff82b222dc>] do_one_initcall+0x105/0x197
> [ 116.546697] [<ffffffff82b22501>] kernel_init_freeable+0x193/0x22e
> [ 116.546697] [<ffffffff8222ba4b>] ? rest_init+0x14d/0x14d
> [ 116.546697] [<ffffffff8222ba59>] kernel_init+0xe/0xdf
> [ 116.546697] [<ffffffff8224e5fc>] ret_from_fork+0x7c/0xb0
> [ 116.546697] [<ffffffff8222ba4b>] ? rest_init+0x14d/0x14d
> [ 116.546697] Code: 38 00 74 13 48 8b 4d 08 ba 01 00 00 00 be 02 00 00 00 e8 e0 09 00 00 c9 c3 55 48 b8 ff ff ff ff ff 7f ff ff 48 89 e5 48 83 ec 30 <48> 39 c7 77 23 48 8b 45 08 48 89 7d d8 48 8d 7d d8 48 c7 45 e8
> [ 116.546697] Kernel panic - not syncing: softlockup: hung tasks
> [ 116.546697] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G L 3.19.0-05249-gdf4c0e3 #1
> [ 116.546697] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> [ 116.546697] 0000000000000000 0000000000000000 0000000000000000 ffff88002f603d48
> [ 116.546697] ffffffff8223bcf7 fffffbfff074360b ffffffff8263d4e2 ffff88002f603dc8
> [ 116.546697] ffffffff82234300 ffff88002f603dc8 ffffffff00000008 ffff88002f603dd8
> [ 116.546697] Call Trace:
> [ 116.546697] <IRQ> [<ffffffff8223bcf7>] dump_stack+0x4c/0x65
> [ 116.546697] [<ffffffff82234300>] panic+0xed/0x231
> [ 116.546697] [<ffffffff811adf9d>] watchdog_timer_fn+0x1d2/0x1f8
> [ 116.546697] [<ffffffff811addcb>] ? __touch_watchdog+0x1c/0x1c
> [ 116.546697] [<ffffffff8117083e>] __run_hrtimer+0x283/0x6a2
> [ 116.546697] [<ffffffff81171aaf>] hrtimer_run_queues+0x126/0x145
> [ 116.546697] [<ffffffff8116ff2f>] run_local_timers+0xe/0x1a
> [ 116.546697] [<ffffffff8116ff62>] update_process_times+0x27/0x54
> [ 116.546697] [<ffffffff81181e76>] tick_nohz_handler+0x101/0x164
> [ 116.546697] [<ffffffff81040dbc>] local_apic_timer_interrupt+0x78/0x80
> [ 116.546697] [<ffffffff8225144c>] smp_apic_timer_interrupt+0x57/0x69
> [ 116.546697] [<ffffffff8224f662>] apic_timer_interrupt+0x72/0x80
> [ 116.546697] <EOI> [<ffffffff8125fdf9>] ? __asan_load4+0x12/0x91
> [ 116.546697] [<ffffffff8223c4d3>] insert_augmented+0x52/0xf5
> [ 116.546697] [<ffffffff82b7b4ef>] rbtree_test_init+0x151/0x205
> [ 116.546697] [<ffffffff82b7b39e>] ? glob_init+0x11c/0x11c
> [ 116.546697] [<ffffffff82b222dc>] do_one_initcall+0x105/0x197
> [ 116.546697] [<ffffffff82b22501>] kernel_init_freeable+0x193/0x22e
> [ 116.546697] [<ffffffff8222ba4b>] ? rest_init+0x14d/0x14d
> [ 116.546697] [<ffffffff8222ba59>] kernel_init+0xe/0xdf
> [ 116.546697] [<ffffffff8224e5fc>] ret_from_fork+0x7c/0xb0
> [ 116.546697] [<ffffffff8222ba4b>] ? rest_init+0x14d/0x14d
> [ 116.546697] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
>
>
> Thanks,
> Ying Huang

2015-11-02 09:48:11

by Andrey Ryabinin

[permalink] [raw]
Subject: Re: [lkp] [fs] df4c0e36f1: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]

2015-11-02 11:06 GMT+03:00 kernel test robot <[email protected]>:
> FYI, we noticed the below changes on
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> commit df4c0e36f1b1782b0611a77c52cc240e5c4752dd ("fs: dcache: manually unpoison dname after allocation to shut up kasan's reports")
>
>
> The commit fixed a KASan bug, but introduced or revealed a soft lockup
> issue as follow.
>

I think that this issue is caused by several factors.
1. Non-preempt kernel + heavy debug options (CONFIG_PREEMPT=n
CONFIG_GCOV_PROFILE_ALL=y CONFIG_KASAN=y)
2. I suspect that your host is overloaded, thus KVM guest runs too slow.

And the major factor here is number 2.

In your dmesg:
[ 67.891156] rbtree testing -> 570841 cycles
[ 88.609636] augmented rbtree testing
[ 116.546697] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s!
[swapper/0:1]


I've tried to reproduce this, and got this:
[ 0.693574] rbtree testing -> 15513 cycles
570841/15513 = 36x times faster.

[ 1.159450] augmented rbtree testing -> 23675 cycles
[ 1.864996]
It took less than a second, meanwhile in your case it didn't finish in
22 seconds.

This makes me think that your host is overloaded and the problem is on
your side.

>
> [ 116.546697] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
> [ 116.546697] irq event stamp: 3018750
> [ 116.546697] hardirqs last enabled at (3018749): [<ffffffff8224f36c>] restore_args+0x0/0x30
> [ 116.546697] hardirqs last disabled at (3018750): [<ffffffff8224f65d>] apic_timer_interrupt+0x6d/0x80
> [ 116.546697] softirqs last enabled at (3018748): [<ffffffff810de898>] __do_softirq+0x417/0x65b
> [ 116.546697] softirqs last disabled at (3018743): [<ffffffff810ded30>] irq_exit+0x55/0xc3
> [ 116.546697] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.19.0-05249-gdf4c0e3 #1
> [ 116.546697] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> [ 116.546697] task: ffff88002f3f0000 ti: ffff88002f3f8000 task.ti: ffff88002f3f8000
> [ 116.546697] RIP: 0010:[<ffffffff8125fdf9>] [<ffffffff8125fdf9>] __asan_load4+0x12/0x91
> [ 116.546697] RSP: 0000:ffff88002f3fbdc8 EFLAGS: 00000282
> [ 116.546697] RAX: ffff7fffffffffff RBX: 0000000000000000 RCX: dffffc0000000000
> [ 116.546697] RDX: fffffbfff0763087 RSI: ffffffff83b18b10 RDI: ffffffff83b1803c
> [ 116.546697] RBP: ffff88002f3fbdf8 R08: 0000000000000001 R09: fffffbfff0763074
> [ 116.546697] R10: ffffffff83b183a7 R11: fffffbfff0763074 R12: ffff880038d29003
> [ 116.546697] R13: ffffed00071a5200 R14: ffffffff818cfadb R15: ffff88002f3fbdf8
> [ 116.546697] FS: 0000000000000000(0000) GS:ffff88002f600000(0000) knlGS:0000000000000000
> [ 116.546697] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 116.546697] CR2: 0000000000000000 CR3: 0000000002793000 CR4: 00000000000006f0
> [ 116.546697] Stack:
> [ 116.546697] ffffffff83b18d5c ffffffff83b18b08 ffffffff83b18878 ffffffff83b18b10
> [ 116.546697] ffffffff83b18868 0000000000000000 ffff88002f3fbe48 ffffffff8223c4d3
> [ 116.546697] 0000000000000000 c4400d6283b18d40 0000000000000202 0000000000000d98
> [ 116.546697] Call Trace:
> [ 116.546697] [<ffffffff8223c4d3>] insert_augmented+0x52/0xf5
> [ 116.546697] [<ffffffff82b7b4ef>] rbtree_test_init+0x151/0x205
> [ 116.546697] [<ffffffff82b7b39e>] ? glob_init+0x11c/0x11c
> [ 116.546697] [<ffffffff82b222dc>] do_one_initcall+0x105/0x197
> [ 116.546697] [<ffffffff82b22501>] kernel_init_freeable+0x193/0x22e
> [ 116.546697] [<ffffffff8222ba4b>] ? rest_init+0x14d/0x14d
> [ 116.546697] [<ffffffff8222ba59>] kernel_init+0xe/0xdf
> [ 116.546697] [<ffffffff8224e5fc>] ret_from_fork+0x7c/0xb0
> [ 116.546697] [<ffffffff8222ba4b>] ? rest_init+0x14d/0x14d
> [ 116.546697] Code: 38 00 74 13 48 8b 4d 08 ba 01 00 00 00 be 02 00 00 00 e8 e0 09 00 00 c9 c3 55 48 b8 ff ff ff ff ff 7f ff ff 48 89 e5 48 83 ec 30 <48> 39 c7 77 23 48 8b 45 08 48 89 7d d8 48 8d 7d d8 48 c7 45 e8
> [ 116.546697] Kernel panic - not syncing: softlockup: hung tasks
> [ 116.546697] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G L 3.19.0-05249-gdf4c0e3 #1
> [ 116.546697] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> [ 116.546697] 0000000000000000 0000000000000000 0000000000000000 ffff88002f603d48
> [ 116.546697] ffffffff8223bcf7 fffffbfff074360b ffffffff8263d4e2 ffff88002f603dc8
> [ 116.546697] ffffffff82234300 ffff88002f603dc8 ffffffff00000008 ffff88002f603dd8
> [ 116.546697] Call Trace:
> [ 116.546697] <IRQ> [<ffffffff8223bcf7>] dump_stack+0x4c/0x65
> [ 116.546697] [<ffffffff82234300>] panic+0xed/0x231
> [ 116.546697] [<ffffffff811adf9d>] watchdog_timer_fn+0x1d2/0x1f8
> [ 116.546697] [<ffffffff811addcb>] ? __touch_watchdog+0x1c/0x1c
> [ 116.546697] [<ffffffff8117083e>] __run_hrtimer+0x283/0x6a2
> [ 116.546697] [<ffffffff81171aaf>] hrtimer_run_queues+0x126/0x145
> [ 116.546697] [<ffffffff8116ff2f>] run_local_timers+0xe/0x1a
> [ 116.546697] [<ffffffff8116ff62>] update_process_times+0x27/0x54
> [ 116.546697] [<ffffffff81181e76>] tick_nohz_handler+0x101/0x164
> [ 116.546697] [<ffffffff81040dbc>] local_apic_timer_interrupt+0x78/0x80
> [ 116.546697] [<ffffffff8225144c>] smp_apic_timer_interrupt+0x57/0x69
> [ 116.546697] [<ffffffff8224f662>] apic_timer_interrupt+0x72/0x80
> [ 116.546697] <EOI> [<ffffffff8125fdf9>] ? __asan_load4+0x12/0x91
> [ 116.546697] [<ffffffff8223c4d3>] insert_augmented+0x52/0xf5
> [ 116.546697] [<ffffffff82b7b4ef>] rbtree_test_init+0x151/0x205
> [ 116.546697] [<ffffffff82b7b39e>] ? glob_init+0x11c/0x11c
> [ 116.546697] [<ffffffff82b222dc>] do_one_initcall+0x105/0x197
> [ 116.546697] [<ffffffff82b22501>] kernel_init_freeable+0x193/0x22e
> [ 116.546697] [<ffffffff8222ba4b>] ? rest_init+0x14d/0x14d
> [ 116.546697] [<ffffffff8222ba59>] kernel_init+0xe/0xdf
> [ 116.546697] [<ffffffff8224e5fc>] ret_from_fork+0x7c/0xb0
> [ 116.546697] [<ffffffff8222ba4b>] ? rest_init+0x14d/0x14d
> [ 116.546697] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
>
>
> Thanks,
> Ying Huang

2015-11-02 17:39:46

by Dave Hansen

[permalink] [raw]
Subject: Re: [lkp] [fs] df4c0e36f1: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]

On 11/02/2015 01:32 AM, Andrey Ryabinin wrote:
> And the major factor here is number 2.
>
> In your dmesg:
> [ 67.891156] rbtree testing -> 570841 cycles
> [ 88.609636] augmented rbtree testing
> [ 116.546697] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s!
> [swapper/0:1]
>
>
> I've tried to reproduce this, and got this:
> [ 0.693574] rbtree testing -> 15513 cycles
> 570841/15513 = 36x times faster.
>
> [ 1.159450] augmented rbtree testing -> 23675 cycles
> [ 1.864996]
> It took less than a second, meanwhile in your case it didn't finish in
> 22 seconds.
>
> This makes me think that your host is overloaded and the problem is on
> your side.

It's probably just a matter of putting some cond_resched()s in the test
code.

2015-11-02 19:34:09

by Andrey Ryabinin

[permalink] [raw]
Subject: Re: [lkp] [fs] df4c0e36f1: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]

2015-11-02 20:39 GMT+03:00 Dave Hansen <[email protected]>:
> On 11/02/2015 01:32 AM, Andrey Ryabinin wrote:
>> And the major factor here is number 2.
>>
>> In your dmesg:
>> [ 67.891156] rbtree testing -> 570841 cycles
>> [ 88.609636] augmented rbtree testing
>> [ 116.546697] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s!
>> [swapper/0:1]
>>
>>
>> I've tried to reproduce this, and got this:
>> [ 0.693574] rbtree testing -> 15513 cycles
>> 570841/15513 = 36x times faster.
>>
>> [ 1.159450] augmented rbtree testing -> 23675 cycles
>> [ 1.864996]
>> It took less than a second, meanwhile in your case it didn't finish in
>> 22 seconds.
>>
>> This makes me think that your host is overloaded and the problem is on
>> your side.
>
> It's probably just a matter of putting some cond_resched()s in the test
> code.

Yes, but is it worthwhile? It's very likely that lockup will just
trigger in another place.

2015-11-02 20:07:59

by Dave Hansen

[permalink] [raw]
Subject: Re: [lkp] [fs] df4c0e36f1: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]

On 11/02/2015 11:34 AM, Andrey Ryabinin wrote:
>>> >>
>>> >> [ 1.159450] augmented rbtree testing -> 23675 cycles
>>> >> [ 1.864996]
>>> >> It took less than a second, meanwhile in your case it didn't finish in
>>> >> 22 seconds.
>>> >>
>>> >> This makes me think that your host is overloaded and the problem is on
>>> >> your side.
>> >
>> > It's probably just a matter of putting some cond_resched()s in the test
>> > code.
> Yes, but is it worthwhile? It's very likely that lockup will just
> trigger in another place.

I'm guessing that the lockup here was because the tests were running for
too long. If we cond_resched() in there often enough, the kernel won't
detect a softlockup at all. It won't shift somewhere else.

2015-11-02 21:29:39

by Andrey Ryabinin

[permalink] [raw]
Subject: Re: [lkp] [fs] df4c0e36f1: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]

2015-11-02 23:07 GMT+03:00 Dave Hansen <[email protected]>:
> On 11/02/2015 11:34 AM, Andrey Ryabinin wrote:
>>>> >>
>>>> >> [ 1.159450] augmented rbtree testing -> 23675 cycles
>>>> >> [ 1.864996]
>>>> >> It took less than a second, meanwhile in your case it didn't finish in
>>>> >> 22 seconds.
>>>> >>
>>>> >> This makes me think that your host is overloaded and the problem is on
>>>> >> your side.
>>> >
>>> > It's probably just a matter of putting some cond_resched()s in the test
>>> > code.
>> Yes, but is it worthwhile? It's very likely that lockup will just
>> trigger in another place.
>
> I'm guessing that the lockup here was because the tests were running for
> too long. If we cond_resched() in there often enough, the kernel won't
> detect a softlockup at all.

Sure, but why are these tests running so long?
In my setup it takes less than a second to finish these tests.
On the same kernel version and config of course. Although I might have more
powerful hardware it doesn't explain such huge difference.
So these tests are actually fast tests. I guess that the host is
overloaded and KVM guest runs
so slow that even these simple tests start triggering softlockup.

> It won't shift somewhere else.

That's not what I mean. Sure, the cond_resched() in rbtree_test_init()
will fix this particular softlockup.
But if even such normally fast tests now are running too long, then a
lot of other kernel code, which normally
runs fast, likely becomes too slow on Ying's setup and will trigger
another softlockup.
rbtree_test_init() is just the first such place.
In that case, sticking cond_resched()s across the whole kernel is not
a solution.