2019-10-01 17:47:45

by Steven Rostedt

[permalink] [raw]
Subject: Lockup on USB and block devices

Not sure who to blame, but my server locked up when upgraded (accessing
volume group information), and echoing in "w" into sysrq-trigger showed
a bit of information.

First, looking at my dmesg I see that my usb-storage is hung up, for
whatever reason. Thus, this could be the source of all issues.


[5434447.145737] INFO: task usb-storage:32246 blocked for more than 120 seconds.
[5434447.145740] Not tainted 5.2.4-custom #4

(BTW, I was upgrading to my 5.2.17 kernel when this happened)

[5434447.145741] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5434447.145743] usb-storage D 0 32246 2 0x80004000
[5434447.145745] Call Trace:
[5434447.145749] ? __schedule+0x1e8/0x600
[5434447.145752] schedule+0x29/0x90
[5434447.145755] schedule_timeout+0x208/0x300
[5434447.145765] ? usb_hcd_submit_urb+0xbe/0xb90 [usbcore]
[5434447.145773] ? usb_alloc_urb+0x23/0x70 [usbcore]
[5434447.145782] ? usb_sg_init+0x92/0x2b0 [usbcore]
[5434447.145791] ? usb_hcd_submit_urb+0xbe/0xb90 [usbcore]
[5434447.145795] ? __switch_to_asm+0x34/0x70
[5434447.145798] wait_for_completion+0x111/0x180
[5434447.145800] ? wake_up_q+0x70/0x70
[5434447.145809] usb_sg_wait+0xfa/0x150 [usbcore]
[5434447.145814] usb_stor_bulk_transfer_sglist.part.4+0x64/0xb0 [usb_storage]
[5434447.145818] usb_stor_bulk_srb+0x49/0x80 [usb_storage]
[5434447.145821] usb_stor_Bulk_transport+0x167/0x3e0 [usb_storage]
[5434447.145824] ? schedule+0x29/0x90
[5434447.145828] ? usb_stor_disconnect+0xb0/0xb0 [usb_storage]
[5434447.145832] usb_stor_invoke_transport+0x3a/0x4e0 [usb_storage]
[5434447.145835] ? wait_for_completion_interruptible+0x12d/0x1d0
[5434447.145837] ? wake_up_q+0x70/0x70
[5434447.145841] usb_stor_control_thread+0x1c5/0x270 [usb_storage]
[5434447.145845] kthread+0x116/0x130
[5434447.145847] ? kthread_create_worker_on_cpu+0x70/0x70
[5434447.145851] ret_from_fork+0x35/0x40


But then this messed up with this I presume:

[5434567.980526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5434567.980529] pool D 0 32364 1 0x00000000
[5434567.980532] Call Trace:
[5434567.980543] ? __schedule+0x1e8/0x600
[5434567.980547] schedule+0x29/0x90
[5434567.980558] scsi_block_when_processing_errors+0xb7/0xc0 [scsi_mod]
[5434567.980564] ? finish_wait+0x80/0x80
[5434567.980568] sd_open+0x49/0x170 [sd_mod]
[5434567.980574] __blkdev_get+0x45a/0x510
[5434567.980577] ? bd_acquire+0xb0/0xb0
[5434567.980579] blkdev_get+0x108/0x300
[5434567.980582] ? bd_acquire+0xb0/0xb0
[5434567.980588] do_dentry_open+0x13a/0x370
[5434567.980593] path_openat+0x2c6/0x1450
[5434567.980598] ? part_stat_show+0x427/0x450
[5434567.980603] do_filp_open+0x93/0x100
[5434567.980608] ? __check_object_size+0x15d/0x189
[5434567.980612] do_sys_open+0x186/0x220
[5434567.980616] do_syscall_64+0x4e/0x100
[5434567.980621] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[5434567.980625] RIP: 0033:0x7f8b4b034d0e
[5434567.980632] Code: Bad RIP value.
[5434567.980633] RSP: 002b:00007f8b49a389d0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
[5434567.980636] RAX: ffffffffffffffda RBX: 00007f8b3c035490 RCX: 00007f8b4b034d0e
[5434567.980638] RDX: 0000000000000800 RSI: 000055d82f508d30 RDI: 00000000ffffff9c
[5434567.980639] RBP: 0000000000000000 R08: 0000000000000000 R09: 000055d82f3eff80
[5434567.980641] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f8b49a38bd0
[5434567.980642] R13: 00007f8b49a38af8 R14: 0000000000000000 R15: 00007f8b3c035490


and all access to __blkdev_get() blocks on this volume too:


[5437427.042327] vgs D 0 32353 1 0x80004004
[5437427.042334] Call Trace:
[5437427.042341] ? __schedule+0x1e8/0x600
[5437427.042347] schedule+0x29/0x90
[5437427.042354] schedule_timeout+0x208/0x300
[5437427.042363] ? __kfifo_from_user_r+0xb0/0xb0
[5437427.042369] ? __percpu_ref_switch_mode+0xcd/0x170
[5437427.042375] ? __wake_up_common_lock+0x89/0xc0
[5437427.042381] wait_for_completion+0x111/0x180
[5437427.042387] ? wake_up_q+0x70/0x70
[5437427.042395] exit_aio+0xdc/0xe3
[5437427.042404] mmput+0x28/0x120
[5437427.042411] do_exit+0x28b/0xb90
[5437427.042420] ? __switch_to+0x3a6/0x460
[5437427.042425] ? _cond_resched+0x15/0x30
[5437427.042428] ? mutex_lock+0xe/0x30
[5437427.042430] ? aio_read_events+0x28a/0x320
[5437427.042434] do_group_exit+0x3a/0xa0
[5437427.042438] get_signal+0x14b/0x840
[5437427.042444] do_signal+0x36/0x660
[5437427.042449] ? __hrtimer_init+0xb0/0xb0
[5437427.042452] ? do_io_getevents+0x7c/0xc0
[5437427.042456] exit_to_usermode_loop+0x71/0xe0
[5437427.042459] do_syscall_64+0xf2/0x100
[5437427.042463] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[5437427.042465] RIP: 0033:0x7fc12f303f59
[5437427.042470] Code: Bad RIP value.
[5437427.042472] RSP: 002b:00007fffffad42e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d0
[5437427.042474] RAX: fffffffffffffffc RBX: 00007fc12efa2700 RCX: 00007fc12f303f59
[5437427.042475] RDX: 0000000000000040 RSI: 0000000000000001 RDI: 00007fc12f974000
[5437427.042477] RBP: 00007fc12f974000 R08: 0000000000000000 R09: 00000002fffffe30
[5437427.042478] R10: 00007fffffad4370 R11: 0000000000000246 R12: 0000000000000001
[5437427.042479] R13: 0000000000000002 R14: 0000000000000040 R15: 00007fffffad4370


Full dmesg is attached.

-- Steve


Attachments:
(No filename) (5.30 kB)
dmesg.xz (4.24 kB)
Download all attachments

2019-10-01 18:33:27

by Alan Stern

[permalink] [raw]
Subject: Re: Lockup on USB and block devices

On Tue, 1 Oct 2019, Steven Rostedt wrote:

> Not sure who to blame, but my server locked up when upgraded (accessing
> volume group information), and echoing in "w" into sysrq-trigger showed
> a bit of information.
>
> First, looking at my dmesg I see that my usb-storage is hung up, for
> whatever reason. Thus, this could be the source of all issues.
>
>
> [5434447.145737] INFO: task usb-storage:32246 blocked for more than 120 seconds.
> [5434447.145740] Not tainted 5.2.4-custom #4
>
> (BTW, I was upgrading to my 5.2.17 kernel when this happened)
>
> [5434447.145741] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [5434447.145743] usb-storage D 0 32246 2 0x80004000
> [5434447.145745] Call Trace:
> [5434447.145749] ? __schedule+0x1e8/0x600
> [5434447.145752] schedule+0x29/0x90
> [5434447.145755] schedule_timeout+0x208/0x300
> [5434447.145765] ? usb_hcd_submit_urb+0xbe/0xb90 [usbcore]
> [5434447.145773] ? usb_alloc_urb+0x23/0x70 [usbcore]
> [5434447.145782] ? usb_sg_init+0x92/0x2b0 [usbcore]
> [5434447.145791] ? usb_hcd_submit_urb+0xbe/0xb90 [usbcore]
> [5434447.145795] ? __switch_to_asm+0x34/0x70
> [5434447.145798] wait_for_completion+0x111/0x180
> [5434447.145800] ? wake_up_q+0x70/0x70
> [5434447.145809] usb_sg_wait+0xfa/0x150 [usbcore]
> [5434447.145814] usb_stor_bulk_transfer_sglist.part.4+0x64/0xb0 [usb_storage]
> [5434447.145818] usb_stor_bulk_srb+0x49/0x80 [usb_storage]
> [5434447.145821] usb_stor_Bulk_transport+0x167/0x3e0 [usb_storage]
> [5434447.145824] ? schedule+0x29/0x90
> [5434447.145828] ? usb_stor_disconnect+0xb0/0xb0 [usb_storage]
> [5434447.145832] usb_stor_invoke_transport+0x3a/0x4e0 [usb_storage]
> [5434447.145835] ? wait_for_completion_interruptible+0x12d/0x1d0
> [5434447.145837] ? wake_up_q+0x70/0x70
> [5434447.145841] usb_stor_control_thread+0x1c5/0x270 [usb_storage]
> [5434447.145845] kthread+0x116/0x130
> [5434447.145847] ? kthread_create_worker_on_cpu+0x70/0x70
> [5434447.145851] ret_from_fork+0x35/0x40

It looks like a problem with your xHCI USB host controller. Normally a
usb-storage transfer would be aborted after 90 seconds. But if the
host controller (or its driver) isn't working right, and the abort
never completes, you end up with a situation like this -- usb-storage
and the higher SCSI and block layers waiting indefinitely for an event
that won't occur.

In theory, unplugging the USB mass-storage device would unblock
everything (unless xhci-hcd is too badly wedged). But of course, doing
that is likely to mess up whatever you were working on.

Alan Stern

2019-10-14 10:59:22

by Oliver Neukum

[permalink] [raw]
Subject: Re: Lockup on USB and block devices

Am Dienstag, den 01.10.2019, 13:44 -0400 schrieb Steven Rostedt:
> Not sure who to blame, but my server locked up when upgraded (accessing
> volume group information), and echoing in "w" into sysrq-trigger showed
> a bit of information.
>
> First, looking at my dmesg I see that my usb-storage is hung up, for
> whatever reason. Thus, this could be the source of all issues.

The reset of the device is hung. You may be hanging on dev_mutex.
Is this repeatable?

Regards
Oliver

2019-10-14 12:03:20

by Oliver Neukum

[permalink] [raw]
Subject: Re: Lockup on USB and block devices

Am Dienstag, den 01.10.2019, 14:31 -0400 schrieb Alan Stern:
> It looks like a problem with your xHCI USB host controller. Normally a
> usb-storage transfer would be aborted after 90 seconds. But if the
> host controller (or its driver) isn't working right, and the abort
> never completes, you end up with a situation like this -- usb-storage
> and the higher SCSI and block layers waiting indefinitely for an event
> that won't occur.

Hi,

that would imply that usb_unlink_urb() is malfunctioning
without taking down the whole bus. Is that likely?

Regards
Oliver

2019-10-14 15:14:13

by Alan Stern

[permalink] [raw]
Subject: Re: Lockup on USB and block devices

On Mon, 14 Oct 2019, Oliver Neukum wrote:

> Am Dienstag, den 01.10.2019, 14:31 -0400 schrieb Alan Stern:
> > It looks like a problem with your xHCI USB host controller. Normally a
> > usb-storage transfer would be aborted after 90 seconds. But if the
> > host controller (or its driver) isn't working right, and the abort
> > never completes, you end up with a situation like this -- usb-storage
> > and the higher SCSI and block layers waiting indefinitely for an event
> > that won't occur.
>
> Hi,
>
> that would imply that usb_unlink_urb() is malfunctioning
> without taking down the whole bus. Is that likely?

Mathias is a better person to ask than me. But yes, as far as I know,
something like that can happen.

Alan Stern