2020-04-23 16:42:56

by John Stultz

[permalink] [raw]
Subject: BUG: Invalid wait context with 5.7-rc2?

Hey Folks,

Recently, I've seen some occasional hangs earlyish in boot on my
HiKey960 board with 5.7-rc1/rc2. The kernel isn't totally wedged as I
will see some kernel messages (firmware loading failures, etc) much
later if I leave it. But oddly sysrq doesn't respond.

Figuring it must be some sort of deadlock, I added LOCKDEP and a bunch
of other debug options and started booting in a loop. So far I've not
been able to trigger the original problem, but I do see the following
every boot:

Curious if this was already on anyone's radar?

thanks
-john

[ 2.111212] =============================
[ 2.115256] [ BUG: Invalid wait context ]
[ 2.119303] 5.7.0-rc2-00070-g4160a2ecd371 #247 Not tainted
[ 2.124842] -----------------------------
[ 2.128886] swapper/5/0 is trying to lock:
[ 2.133019] ffffff8219c33258 (&pool->lock){..-.}-{3:3}, at:
__queue_work+0x108/0x7c8
[ 2.140856] other info that might help us debug this:
[ 2.145954] context-{2:2}
[ 2.148593] 1 lock held by swapper/5/0:
[ 2.152461] #0: ffffffc011d749c8 (rcu_read_lock){....}-{1:3}, at:
__queue_work+0x48/0x7c8
[ 2.160814] stack backtrace:
[ 2.163719] CPU: 5 PID: 0 Comm: swapper/5 Not tainted
5.7.0-rc2-00070-g4160a2ecd371 #247
[ 2.171891] Hardware name: HiKey960 (DT)
[ 2.175847] Call trace:
[ 2.178315] dump_backtrace+0x0/0x1a8
[ 2.182009] show_stack+0x18/0x28
[ 2.185354] dump_stack+0xdc/0x148
[ 2.188786] __lock_acquire+0x5c0/0x1568
[ 2.192744] lock_acquire+0x100/0x378
[ 2.196441] _raw_spin_lock+0x64/0x108
[ 2.200225] __queue_work+0x108/0x7c8
[ 2.203918] queue_work_on+0xd0/0xf0
[ 2.207526] timers_update_nohz+0x28/0x38
[ 2.211572] tick_setup_sched_timer+0x110/0x178
[ 2.216144] hrtimer_run_queues+0x114/0x170
[ 2.220365] run_local_timers+0x30/0x70
[ 2.224234] update_process_times+0x28/0x58
[ 2.228457] tick_periodic+0x48/0x148
[ 2.232151] tick_handle_periodic+0x28/0xc8
[ 2.236375] arch_timer_handler_phys+0x2c/0x50
[ 2.238026] VFS: Disk quotas dquot_6.6.0
[ 2.240862] handle_percpu_devid_irq+0xe0/0x460
[ 2.240866] generic_handle_irq+0x30/0x48
[ 2.240868] __handle_domain_irq+0x88/0xf8
[ 2.240873] gic_handle_irq+0x5c/0xb0
[ 2.240877] el1_irq+0xf4/0x1c0
[ 2.240882] arch_cpu_idle+0x2c/0x230
[ 2.244918] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[ 2.249406] default_idle_call+0x20/0x44
[ 2.249409] do_idle+0x1ec/0x2d0
[ 2.249411] cpu_startup_entry+0x24/0x48
[ 2.249416] secondary_start_kernel+0x160/0x210


2020-04-23 17:13:58

by Paul E. McKenney

[permalink] [raw]
Subject: Re: BUG: Invalid wait context with 5.7-rc2?

On Thu, Apr 23, 2020 at 09:40:47AM -0700, John Stultz wrote:
> Hey Folks,
>
> Recently, I've seen some occasional hangs earlyish in boot on my
> HiKey960 board with 5.7-rc1/rc2. The kernel isn't totally wedged as I
> will see some kernel messages (firmware loading failures, etc) much
> later if I leave it. But oddly sysrq doesn't respond.
>
> Figuring it must be some sort of deadlock, I added LOCKDEP and a bunch
> of other debug options and started booting in a loop. So far I've not
> been able to trigger the original problem, but I do see the following
> every boot:
>
> Curious if this was already on anyone's radar?

Looks like __queue_work() is attempting to acquire a normal spinlock_t
with interrupts disabled, which -rt doesn't like much. And I believe
that lockdep has been upgraded to check for this.

Adding Sebastian for his thoughts.

Thanx, Paul

> thanks
> -john
>
> [ 2.111212] =============================
> [ 2.115256] [ BUG: Invalid wait context ]
> [ 2.119303] 5.7.0-rc2-00070-g4160a2ecd371 #247 Not tainted
> [ 2.124842] -----------------------------
> [ 2.128886] swapper/5/0 is trying to lock:
> [ 2.133019] ffffff8219c33258 (&pool->lock){..-.}-{3:3}, at:
> __queue_work+0x108/0x7c8
> [ 2.140856] other info that might help us debug this:
> [ 2.145954] context-{2:2}
> [ 2.148593] 1 lock held by swapper/5/0:
> [ 2.152461] #0: ffffffc011d749c8 (rcu_read_lock){....}-{1:3}, at:
> __queue_work+0x48/0x7c8
> [ 2.160814] stack backtrace:
> [ 2.163719] CPU: 5 PID: 0 Comm: swapper/5 Not tainted
> 5.7.0-rc2-00070-g4160a2ecd371 #247
> [ 2.171891] Hardware name: HiKey960 (DT)
> [ 2.175847] Call trace:
> [ 2.178315] dump_backtrace+0x0/0x1a8
> [ 2.182009] show_stack+0x18/0x28
> [ 2.185354] dump_stack+0xdc/0x148
> [ 2.188786] __lock_acquire+0x5c0/0x1568
> [ 2.192744] lock_acquire+0x100/0x378
> [ 2.196441] _raw_spin_lock+0x64/0x108
> [ 2.200225] __queue_work+0x108/0x7c8
> [ 2.203918] queue_work_on+0xd0/0xf0
> [ 2.207526] timers_update_nohz+0x28/0x38
> [ 2.211572] tick_setup_sched_timer+0x110/0x178
> [ 2.216144] hrtimer_run_queues+0x114/0x170
> [ 2.220365] run_local_timers+0x30/0x70
> [ 2.224234] update_process_times+0x28/0x58
> [ 2.228457] tick_periodic+0x48/0x148
> [ 2.232151] tick_handle_periodic+0x28/0xc8
> [ 2.236375] arch_timer_handler_phys+0x2c/0x50
> [ 2.238026] VFS: Disk quotas dquot_6.6.0
> [ 2.240862] handle_percpu_devid_irq+0xe0/0x460
> [ 2.240866] generic_handle_irq+0x30/0x48
> [ 2.240868] __handle_domain_irq+0x88/0xf8
> [ 2.240873] gic_handle_irq+0x5c/0xb0
> [ 2.240877] el1_irq+0xf4/0x1c0
> [ 2.240882] arch_cpu_idle+0x2c/0x230
> [ 2.244918] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
> [ 2.249406] default_idle_call+0x20/0x44
> [ 2.249409] do_idle+0x1ec/0x2d0
> [ 2.249411] cpu_startup_entry+0x24/0x48
> [ 2.249416] secondary_start_kernel+0x160/0x210

2020-04-23 17:23:58

by Valentin Schneider

[permalink] [raw]
Subject: Re: BUG: Invalid wait context with 5.7-rc2?


Hi John,

On 23/04/20 17:40, John Stultz wrote:
> Hey Folks,
>
> Recently, I've seen some occasional hangs earlyish in boot on my
> HiKey960 board with 5.7-rc1/rc2. The kernel isn't totally wedged as I
> will see some kernel messages (firmware loading failures, etc) much
> later if I leave it. But oddly sysrq doesn't respond.
>
> Figuring it must be some sort of deadlock, I added LOCKDEP and a bunch
> of other debug options and started booting in a loop. So far I've not
> been able to trigger the original problem, but I do see the following
> every boot:
>

Interestingly I can't seem to reproduce that one with the latest master
(5.7.0-rc2-00115-g8c2e9790f196). Is that with some of the extra h960
patches?

I do get this however:

[ 3.626638] INFO: trying to register non-static key.
[ 3.626639] the code is fine but needs lockdep annotation.
[ 3.626640] turning off the locking correctness validator.
[ 3.626644] CPU: 7 PID: 51 Comm: kworker/7:1 Not tainted 5.7.0-rc2-00115-g8c2e9790f196 #116
[ 3.626646] Hardware name: HiKey960 (DT)
[ 3.626656] Workqueue: events deferred_probe_work_func
[ 3.632476] sd 0:0:0:0: [sda] Optimal transfer size 8192 bytes not a multiple of physical block size (16384 bytes)
[ 3.640220] Call trace:
[ 3.640225] dump_backtrace+0x0/0x1b8
[ 3.640227] show_stack+0x20/0x30
[ 3.640230] dump_stack+0xec/0x158
[ 3.640234] register_lock_class+0x598/0x5c0
[ 3.640235] __lock_acquire+0x80/0x16c0
[ 3.640236] lock_acquire+0xf4/0x4a0
[ 3.640241] _raw_spin_lock_irqsave+0x70/0xa8
[ 3.640245] uart_add_one_port+0x388/0x4b8
[ 3.640248] pl011_register_port+0x70/0xf0
[ 3.640250] pl011_probe+0x184/0x1b8
[ 3.640254] amba_probe+0xdc/0x180
[ 3.640256] really_probe+0xe0/0x338
[ 3.640257] driver_probe_device+0x60/0xf8
[ 3.640259] __device_attach_driver+0x8c/0xd0
[ 3.640260] bus_for_each_drv+0x84/0xd8
[ 3.640261] __device_attach+0xe4/0x140
[ 3.640263] device_initial_probe+0x1c/0x28
[ 3.640265] bus_probe_device+0xa4/0xb0
[ 3.640266] deferred_probe_work_func+0x7c/0xb8
[ 3.640269] process_one_work+0x2c0/0x768
[ 3.640271] worker_thread+0x4c/0x498
[ 3.640272] kthread+0x14c/0x158
[ 3.640275] ret_from_fork+0x10/0x1c

2020-04-23 17:50:42

by Valentin Schneider

[permalink] [raw]
Subject: Re: BUG: Invalid wait context with 5.7-rc2?


On 23/04/20 18:29, John Stultz wrote:
> Oof. Way to twist the knife :) I'm probably to blame for that
> deferred_probe_work_func issue. I'll take a look at it.
>

Sorry :(

> thanks
> -john

2020-04-23 18:01:06

by John Stultz

[permalink] [raw]
Subject: Re: BUG: Invalid wait context with 5.7-rc2?

On Thu, Apr 23, 2020 at 10:29 AM John Stultz <[email protected]> wrote:
>
> On Thu, Apr 23, 2020 at 10:19 AM Valentin Schneider
> <[email protected]> wrote:
> > On 23/04/20 17:40, John Stultz wrote:
> > > Hey Folks,
> > >
> > > Recently, I've seen some occasional hangs earlyish in boot on my
> > > HiKey960 board with 5.7-rc1/rc2. The kernel isn't totally wedged as I
> > > will see some kernel messages (firmware loading failures, etc) much
> > > later if I leave it. But oddly sysrq doesn't respond.
> > >
> > > Figuring it must be some sort of deadlock, I added LOCKDEP and a bunch
> > > of other debug options and started booting in a loop. So far I've not
> > > been able to trigger the original problem, but I do see the following
> > > every boot:
> > >
> >
> > Interestingly I can't seem to reproduce that one with the latest master
> > (5.7.0-rc2-00115-g8c2e9790f196). Is that with some of the extra h960
> > patches?
>
> There are additional patches for hikey960, but nothing from the
> tracelog (which looked pretty generic).
> But I'll pull everything out to confirm and re-check against
> linus/master in case there's a recent fix.

Yep. Still seeing it against linus/master with nothing else:
[ 2.091225] clocksource: Switched to clocksource arch_sys_counter
[ 2.091406]
[ 2.098896] =============================
[ 2.102942] [ BUG: Invalid wait context ]
[ 2.106991] 5.7.0-rc2-00069-g18bf34080c4c #250 Not tainted
[ 2.112528] -----------------------------
[ 2.116574] swapper/1/0 is trying to lock:
[ 2.120710] ffffff82194af258 (&pool->lock){..-.}-{3:3}, at:
__queue_work+0x108/0x7c8
[ 2.128550] other info that might help us debug this:
[ 2.133649] context-{2:2}
[ 2.136290] 1 lock held by swapper/1/0:
[ 2.140159] #0: ffffffc011d049c8 (rcu_read_lock){....}-{1:3}, at:
__queue_work+0x48/0x7c8
[ 2.148513] stack backtrace:
[ 2.151421] CPU: 1 PID: 0 Comm: swapper/1 Not tainted
5.7.0-rc2-00069-g18bf34080c4c #250
[ 2.159593] Hardware name: HiKey960 (DT)
[ 2.163554] Call trace:
[ 2.166023] dump_backtrace+0x0/0x1a8
[ 2.169719] show_stack+0x18/0x28
[ 2.173066] dump_stack+0xdc/0x148
[ 2.176500] __lock_acquire+0x5c0/0x1568
[ 2.180458] lock_acquire+0x100/0x378
[ 2.184159] _raw_spin_lock+0x64/0x108
[ 2.187942] __queue_work+0x108/0x7c8
[ 2.191637] queue_work_on+0xd0/0xf0
[ 2.195248] timers_update_nohz+0x28/0x38
[ 2.199296] tick_setup_sched_timer+0x110/0x178
[ 2.203870] hrtimer_run_queues+0x114/0x170
[ 2.208092] run_local_timers+0x30/0x70
[ 2.211963] update_process_times+0x28/0x58
[ 2.216189] tick_periodic+0x48/0x148
[ 2.219884] tick_handle_periodic+0x28/0xc8
[ 2.224112] arch_timer_handler_phys+0x2c/0x50
[ 2.225533] VFS: Disk quotas dquot_6.6.0
[ 2.228600] handle_percpu_devid_irq+0xe0/0x460
[ 2.228605] generic_handle_irq+0x30/0x48
[ 2.228608] __handle_domain_irq+0x88/0xf8
[ 2.228613] gic_handle_irq+0x5c/0xb0
[ 2.228619] el1_irq+0xf4/0x1c0
[ 2.232654] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[ 2.237146] arch_cpu_idle+0x2c/0x230
[ 2.237151] default_idle_call+0x20/0x44
[ 2.237156] do_idle+0x1ec/0x2d0
[ 2.237159] cpu_startup_entry+0x24/0x48
[ 2.241701] pnp: PnP ACPI: disabled
[ 2.245338] secondary_start_kernel+0x160/0x210

2020-04-23 19:37:24

by John Stultz

[permalink] [raw]
Subject: Re: BUG: Invalid wait context with 5.7-rc2?

On Thu, Apr 23, 2020 at 10:19 AM Valentin Schneider
<[email protected]> wrote:
> On 23/04/20 17:40, John Stultz wrote:
> > Hey Folks,
> >
> > Recently, I've seen some occasional hangs earlyish in boot on my
> > HiKey960 board with 5.7-rc1/rc2. The kernel isn't totally wedged as I
> > will see some kernel messages (firmware loading failures, etc) much
> > later if I leave it. But oddly sysrq doesn't respond.
> >
> > Figuring it must be some sort of deadlock, I added LOCKDEP and a bunch
> > of other debug options and started booting in a loop. So far I've not
> > been able to trigger the original problem, but I do see the following
> > every boot:
> >
>
> Interestingly I can't seem to reproduce that one with the latest master
> (5.7.0-rc2-00115-g8c2e9790f196). Is that with some of the extra h960
> patches?

There are additional patches for hikey960, but nothing from the
tracelog (which looked pretty generic).
But I'll pull everything out to confirm and re-check against
linus/master in case there's a recent fix.

> I do get this however:
>
> [ 3.626638] INFO: trying to register non-static key.
> [ 3.626639] the code is fine but needs lockdep annotation.
> [ 3.626640] turning off the locking correctness validator.
> [ 3.626644] CPU: 7 PID: 51 Comm: kworker/7:1 Not tainted 5.7.0-rc2-00115-g8c2e9790f196 #116
> [ 3.626646] Hardware name: HiKey960 (DT)
> [ 3.626656] Workqueue: events deferred_probe_work_func
> [ 3.632476] sd 0:0:0:0: [sda] Optimal transfer size 8192 bytes not a multiple of physical block size (16384 bytes)
> [ 3.640220] Call trace:
> [ 3.640225] dump_backtrace+0x0/0x1b8
> [ 3.640227] show_stack+0x20/0x30
> [ 3.640230] dump_stack+0xec/0x158
> [ 3.640234] register_lock_class+0x598/0x5c0
> [ 3.640235] __lock_acquire+0x80/0x16c0
> [ 3.640236] lock_acquire+0xf4/0x4a0
> [ 3.640241] _raw_spin_lock_irqsave+0x70/0xa8
> [ 3.640245] uart_add_one_port+0x388/0x4b8
> [ 3.640248] pl011_register_port+0x70/0xf0
> [ 3.640250] pl011_probe+0x184/0x1b8
> [ 3.640254] amba_probe+0xdc/0x180
> [ 3.640256] really_probe+0xe0/0x338
> [ 3.640257] driver_probe_device+0x60/0xf8
> [ 3.640259] __device_attach_driver+0x8c/0xd0
> [ 3.640260] bus_for_each_drv+0x84/0xd8
> [ 3.640261] __device_attach+0xe4/0x140
> [ 3.640263] device_initial_probe+0x1c/0x28
> [ 3.640265] bus_probe_device+0xa4/0xb0
> [ 3.640266] deferred_probe_work_func+0x7c/0xb8
> [ 3.640269] process_one_work+0x2c0/0x768
> [ 3.640271] worker_thread+0x4c/0x498
> [ 3.640272] kthread+0x14c/0x158
> [ 3.640275] ret_from_fork+0x10/0x1c

Oof. Way to twist the knife :) I'm probably to blame for that
deferred_probe_work_func issue. I'll take a look at it.

thanks
-john

2020-04-23 19:39:16

by John Stultz

[permalink] [raw]
Subject: Re: BUG: Invalid wait context with 5.7-rc2?

On Thu, Apr 23, 2020 at 11:49 AM Sebastian Andrzej Siewior
<[email protected]> wrote:
>
> On 2020-04-23 10:10:29 [-0700], Paul E. McKenney wrote:
> > > Curious if this was already on anyone's radar?
> >
> > Looks like __queue_work() is attempting to acquire a normal spinlock_t
> > with interrupts disabled, which -rt doesn't like much. And I believe
> > that lockdep has been upgraded to check for this.
> >
> > Adding Sebastian for his thoughts.
>
> This is known. There is more than just workqueue in this regard. Please
> disable CONFIG_PROVE_RAW_LOCK_NESTING until this is fixed (see Kconfig
> help). Except for workqueue, there is also printk which triggers the
> same warning and isn't as easy to fix as workqueue.

Ok. Appreciate it! Thanks so much for the context! I'll stop worrying
about that one and keep digging on my own issue.
-john

Subject: Re: BUG: Invalid wait context with 5.7-rc2?

On 2020-04-23 10:10:29 [-0700], Paul E. McKenney wrote:
> > Curious if this was already on anyone's radar?
>
> Looks like __queue_work() is attempting to acquire a normal spinlock_t
> with interrupts disabled, which -rt doesn't like much. And I believe
> that lockdep has been upgraded to check for this.
>
> Adding Sebastian for his thoughts.

This is known. There is more than just workqueue in this regard. Please
disable CONFIG_PROVE_RAW_LOCK_NESTING until this is fixed (see Kconfig
help). Except for workqueue, there is also printk which triggers the
same warning and isn't as easy to fix as workqueue.

> Thanx, Paul

Sebastian