2022-11-08 03:05:38

by Francesco Ruggeri

Subject: nvme: hung task in blk_mq_freeze_queue_wait

We are running into hung tasks in blk_mq_freeze_queue_wait, sometimes
in nvme_reset_work and sometimes in nvme_scan_work.
In some cases we also see a warning in blk_mq_unfreeze_queue
(WARN_ON_ONCE(q->mq_freeze_depth < 0)):
WARNING: CPU: 10 PID: 152 at block/blk-mq.c:199 blk_mq_unfreeze_queue+0x26/0x5a
The hung tasks are preceded by two instances of nvme_timeout racing in
nvme_dev_disable after a device becomes unresponsive.
The first nvme_dev_disable blocks in wait_for_completion_io_timeout, and
a while later a second instance blocks in mutex_lock(&dev->shutdown_lock).
When the timeout expires and the first instance releases the mutex,
the two instances start racing.
We have seen the hung tasks on 4.19, but from a quick look at the
latest 6.1 code the issue is probably present there as well.
There seem to be two different scenarios in which unbalanced
blk_freeze_queue_start/blk_mq_unfreeze_queue calls could be the cause.
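
To make the failure mode concrete, here is a simplified user-space model
of the freeze/unfreeze bookkeeping (only a sketch: the real code lives in
block/blk-mq.c and drains a percpu refcount on q->q_usage_counter, and the
struct and function names below are made up for illustration). An unfreeze
without a matching freeze drives the depth negative, which is what the
WARN_ON_ONCE catches; a freeze whose unfreeze never runs leaves the queue
frozen, so any later blk_mq_freeze_queue_wait() waits forever:

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct queue {
        int mq_freeze_depth;    /* models q->mq_freeze_depth */
        bool frozen;            /* models a fully drained q_usage_counter */
};

static void freeze_queue_start(struct queue *q)
{
        if (q->mq_freeze_depth++ == 0)
                q->frozen = true;       /* new I/O is now held off */
}

static void unfreeze_queue(struct queue *q)
{
        q->mq_freeze_depth--;
        /* stands in for WARN_ON_ONCE(q->mq_freeze_depth < 0) */
        assert(q->mq_freeze_depth >= 0);
        if (q->mq_freeze_depth == 0)
                q->frozen = false;      /* queue usable again */
}

int main(void)
{
        struct queue q = { 0, false };

        /* Scenario 2 below: a freeze whose unfreeze never runs. */
        freeze_queue_start(&q);
        /* q.frozen stays true, so a later freeze_queue_wait would hang. */
        printf("still frozen: %d\n", q.frozen);

        /* Scenario 1 below: one matched unfreeze, then an extra one. */
        unfreeze_queue(&q);     /* balances the freeze above */
        unfreeze_queue(&q);     /* depth -1: the assert fires here, just as
                                   the WARN_ON_ONCE does in the kernel */
        return 0;
}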

1) In this case we have an unfreeze without a corresponding freeze
preceding it.

TIMEOUT 1                 TIMEOUT 2                 RESET WORK 1            RESET WORK 2

state = NVME_CTRL_LIVE
nvme_timeout
nvme_dev_disable
mutex_lock
nvme_start_freeze
blk_freeze_queue_start
nvme_disable_io_queues
wait_for_completion_io_timeout
                          nvme_timeout
                          nvme_dev_disable
                          mutex_lock
mutex_unlock
nvme_reset_ctrl
state = NVME_CTRL_RESETTING
queue_work(nvme_reset_work)
                                                    nvme_reset_work
                                                    state = NVME_CTRL_CONNECTING
                          (state != NVME_CTRL_LIVE and
                           state != NVME_CTRL_RESETTING)
                          skips nvme_start_freeze
                          mutex_unlock
                          nvme_reset_ctrl
                          state = NVME_CTRL_RESETTING
                          queue_work(nvme_reset_work)
                                                    nvme_unfreeze (matches
                                                      nvme_start_freeze in
                                                      TIMEOUT 1)
                                                                            nvme_reset_work
                                                                            nvme_unfreeze (no
                                                                              match in TIMEOUT 2)
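
The skipped freeze above comes from the state check at the top of
nvme_dev_disable. Roughly (paraphrasing the 4.19 code from memory, not
quoting it verbatim):

        mutex_lock(&dev->shutdown_lock);
        ...
        /*
         * TIMEOUT 2 only reaches this point after RESET WORK 1 has already
         * moved the controller to NVME_CTRL_CONNECTING, so it skips the
         * freeze...
         */
        if (dev->ctrl.state == NVME_CTRL_LIVE ||
            dev->ctrl.state == NVME_CTRL_RESETTING)
                nvme_start_freeze(&dev->ctrl);
        ...
        mutex_unlock(&dev->shutdown_lock);

...but the reset it then schedules (RESET WORK 2) still calls nvme_unfreeze,
which has no freeze of its own to match, so q->mq_freeze_depth goes negative
and the WARN_ON_ONCE in blk_mq_unfreeze_queue fires.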


2) In this case a freeze has no corresponding unfreeze following it.
TIMEOUT 2 cannot schedule nvme_reset_work because TIMEOUT 1's is already
scheduled but not yet running.

TIMEOUT 1                 TIMEOUT 2                 RESET WORK 1            RESET WORK 2

state = NVME_CTRL_LIVE
nvme_timeout
nvme_dev_disable
mutex_lock
nvme_start_freeze
blk_freeze_queue_start
nvme_disable_io_queues
wait_for_completion_io_timeout
                          nvme_timeout
                          nvme_dev_disable
                          mutex_lock
mutex_unlock
nvme_reset_ctrl
state = NVME_CTRL_RESETTING
queue_work(nvme_reset_work)
                          (state == NVME_CTRL_LIVE or
                           state == NVME_CTRL_RESETTING)
                          nvme_start_freeze
                          blk_freeze_queue_start
                          mutex_unlock
                          nvme_reset_ctrl
                          state = NVME_CTRL_RESETTING
                          queue_work(nvme_reset_work)
                            fails because nvme_reset_work
                            is not running yet
                                                    nvme_reset_work
                                                    nvme_unfreeze (matches
                                                      nvme_start_freeze in
                                                      TIMEOUT 1)
                                                                            It never gets
                                                                              scheduled.
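
The lost reset in this scenario comes down to queue_work semantics:
queue_work() returns false when the work item is already pending, so the
second nvme_reset_ctrl call is effectively dropped. Roughly (again a
paraphrase from memory, not verbatim code):

        /* in nvme_reset_ctrl() */
        if (!queue_work(nvme_reset_wq, &ctrl->reset_work))
                return -EBUSY;  /* TIMEOUT 2 ends up here: reset_work is
                                 * still pending from TIMEOUT 1, so no second
                                 * nvme_reset_work (and no second
                                 * nvme_unfreeze) will ever run */

Nothing ever decrements the freeze depth taken by TIMEOUT 2, so the next
blk_mq_freeze_queue_wait hangs, which is where our backtraces point.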


Following is one such backtrace.

[ 4390.119745] INFO: task kworker/u80:2:8279 blocked for more than 300 seconds.
[ 4390.471456] Tainted: P O 4.19.142.Ar-29007847.buytenhb7335070 #1
[ 4390.832568] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4391.193663] kworker/u80:2 D 0 8279 2 0x80000000
[ 4391.526603] Workqueue: nvme-reset-wq nvme_reset_work
[ 4391.853272] Call Trace:
[ 4392.149696] __schedule+0x75b/0x7c2
[ 4392.458627] schedule+0x78/0x8c
[ 4392.763399] blk_mq_freeze_queue_wait+0x8b/0xb6
[ 4393.084849] ? wait_woken+0x92/0x92
[ 4393.393791] nvme_wait_freeze+0x39/0x4e
[ 4393.706900] nvme_reset_work+0x1464/0x1665
[ 4394.023142] ? dequeue_entity+0x694/0x6a3
[ 4394.338340] process_one_work+0x1c2/0x30f
[ 4394.511642] worker_thread+0x1e9/0x2cc
[ 4394.556626] ? rescuer_thread+0x2b7/0x2b7
[ 4394.604738] kthread+0x15d/0x165
[ 4394.643460] ? kthread_park+0x98/0x98
[ 4394.687396] ret_from_fork+0x1f/0x30




2022-11-08 17:26:46

by Keith Busch

Subject: Re: nvme: hung task in blk_mq_freeze_queue_wait

On Mon, Nov 07, 2022 at 06:56:02PM -0800, Francesco Ruggeri wrote:
> [ 4390.119745] INFO: task kworker/u80:2:8279 blocked for more than 300 seconds.
> [ 4390.471456] Tainted: P O 4.19.142.Ar-29007847.buytenhb7335070 #1

That is an old kernel; I think it'd be worth trying something newer to
confirm whether this is already fixed. Specifically, it looks like you
have multiple namespaces timing out IO nearly simultaneously, causing
mismatched handling. That kind of thing was fixed after the following
commit (plus some prior dependencies):

commit d6135c3a1ec0cddda7b8b8e1b5b4abeeafd98289
Author: Keith Busch <[email protected]>
Date: Tue May 14 14:46:09 2019 -0600

nvme-pci: Sync queues on reset
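
If I remember the change correctly, the relevant part is that
nvme_reset_work() now calls nvme_sync_queues() before reinitializing the
controller, which flushes any timeout work still in flight for every
namespace queue so a late nvme_timeout can't race the reset path. Roughly
along these lines (from memory, so please check the actual commit):

        void nvme_sync_queues(struct nvme_ctrl *ctrl)
        {
                struct nvme_ns *ns;

                /* flush pending timeout work on every namespace queue */
                down_read(&ctrl->namespaces_rwsem);
                list_for_each_entry(ns, &ctrl->namespaces, list)
                        blk_sync_queue(ns->queue);
                up_read(&ctrl->namespaces_rwsem);
        }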

2022-11-08 17:41:44

by Francesco Ruggeri

Subject: Re: nvme: hung task in blk_mq_freeze_queue_wait

On Tue, Nov 8, 2022 at 8:51 AM Keith Busch <[email protected]> wrote:
> That is an old kernel; I think it'd be worth trying something newer to
> confirm whether this is already fixed. Specifically, it looks like you
> have multiple namespaces timing out IO nearly simultaneously, causing
> mismatched handling. That kind of thing was fixed after the following
> commit (plus some prior dependencies):
>
> commit d6135c3a1ec0cddda7b8b8e1b5b4abeeafd98289
> Author: Keith Busch <[email protected]>
> Date: Tue May 14 14:46:09 2019 -0600
>
> nvme-pci: Sync queues on reset


On Tue, Nov 8, 2022 at 3:29 AM Hillf Danton <[email protected]> wrote:
>
> Given blk_mq_freeze_queue_wait() in the call trace, another explanation
> may be percpu refcount [1], which can be tested with the diff attached.
>
> Only for thoughts.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/block/blk-mq.c?h=next-20221104#n228
>
> Hillf

Thanks Keith, Hillf, I will look into that.

Francesco