2022-08-19 12:21:17

by Hans de Goede

Subject: 6.0-rc1 regression block (blk_mq) / RCU task stuck errors + block-io hang

Hi All,

I've been dogfooding 6.0-rc1 on my main workstation and I have hit
a pretty serious bug, serious enough for me to go back to 5.19.

My dmesg shows various blk_mq (RCU?) related lockdep splats,
followed by some tasks getting stuck on disk I/O. E.g. "sync"
is guaranteed to hang, and other tasks hang too.
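
A quick way to check for the hang from user space is to time a
sync() call. Here is a minimal sketch of that idea (just an
illustration, and the 30 second timeout is an arbitrary choice):

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Run sync() in a child and report whether it returns within a
 * timeout.  On the affected kernels sync blocks indefinitely once a
 * request queue has been left frozen, so instead of "ok" this prints
 * the "possible hang" message.
 */
int main(void)
{
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {             /* child: may block forever */
        sync();
        _exit(0);
    }
    for (int i = 0; i < 30; i++) {  /* parent: poll for up to 30s */
        if (waitpid(pid, NULL, WNOHANG) == pid) {
            printf("ok: sync returned\n");
            return 0;
        }
        sleep(1);
    }
    printf("possible hang: sync did not return within 30s\n");
    return 2;
}

On a good kernel this prints "ok" almost immediately; when the bug
triggers, the child never comes back.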

This seems to be mainly the case on "sd" disks (both SATA
and USB), whereas my main NVMe drive seems fine, which has
probably saved me from worse issues...

Here are 4 hung-task reports from my last boot, where
I had to turn off the machine by holding the power button
down for 4 seconds.

Aug 19 12:25:46 shalem kernel: INFO: task kworker/0:3:305 blocked for more than 122 seconds.
Aug 19 12:25:46 shalem kernel: Tainted: G W OE ------- --- 6.0.0-0.rc1.20220817git3cc40a443a04.14.fc38.x86_64 #1
Aug 19 12:25:46 shalem kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 19 12:25:46 shalem kernel: task:kworker/0:3 state:D stack:13008 pid: 305 ppid: 2 flags:0x00004000
Aug 19 12:25:46 shalem kernel: Workqueue: events ata_scsi_dev_rescan
Aug 19 12:25:46 shalem kernel: Call Trace:
Aug 19 12:25:46 shalem kernel: <TASK>
Aug 19 12:25:46 shalem kernel: __schedule+0x47f/0x16d0
Aug 19 12:25:46 shalem kernel: ? mark_held_locks+0x50/0x80
Aug 19 12:25:46 shalem kernel: ? _raw_spin_unlock_irq+0x24/0x50
Aug 19 12:25:46 shalem kernel: schedule+0x5d/0xe0
Aug 19 12:25:46 shalem kernel: blk_mq_freeze_queue_wait+0x57/0x90
Aug 19 12:25:46 shalem kernel: ? prepare_to_wait_exclusive+0xd0/0xd0
Aug 19 12:25:46 shalem kernel: disk_clear_zone_settings+0x19/0xa0
Aug 19 12:25:46 shalem kernel: sd_revalidate_disk.isra.0+0x1660/0x2c60
Aug 19 12:25:46 shalem kernel: ? slab_free_freelist_hook.constprop.0+0x11b/0x160
Aug 19 12:25:46 shalem kernel: ? scsi_rescan_device+0x2c/0x90
Aug 19 12:25:46 shalem kernel: scsi_rescan_device+0x77/0x90
Aug 19 12:25:46 shalem kernel: ata_scsi_dev_rescan+0x9d/0x110
Aug 19 12:25:46 shalem kernel: process_one_work+0x2a0/0x600
Aug 19 12:25:46 shalem kernel: worker_thread+0x4f/0x3a0
Aug 19 12:25:46 shalem kernel: ? process_one_work+0x600/0x600
Aug 19 12:25:46 shalem kernel: kthread+0xf5/0x120
Aug 19 12:25:46 shalem kernel: ? kthread_complete_and_exit+0x20/0x20
Aug 19 12:25:46 shalem kernel: ret_from_fork+0x22/0x30
Aug 19 12:25:46 shalem kernel: </TASK>
Aug 19 12:25:46 shalem kernel: INFO: task smartd:2557 blocked for more than 122 seconds.
Aug 19 12:25:46 shalem kernel: Tainted: G W OE ------- --- 6.0.0-0.rc1.20220817git3cc40a443a04.14.fc38.x86_64 #1
Aug 19 12:25:46 shalem kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 19 12:25:46 shalem kernel: task:smartd state:D stack:13000 pid: 2557 ppid: 1 flags:0x00000002
Aug 19 12:25:46 shalem kernel: Call Trace:
Aug 19 12:25:46 shalem kernel: <TASK>
Aug 19 12:25:46 shalem kernel: __schedule+0x47f/0x16d0
Aug 19 12:25:46 shalem kernel: ? lock_release+0x14f/0x460
Aug 19 12:25:46 shalem kernel: ? _raw_spin_unlock_irqrestore+0x30/0x60
Aug 19 12:25:46 shalem kernel: ? _raw_spin_unlock_irqrestore+0x40/0x60
Aug 19 12:25:46 shalem kernel: schedule+0x5d/0xe0
Aug 19 12:25:46 shalem kernel: blk_queue_enter+0xe5/0x190
Aug 19 12:25:46 shalem kernel: ? prepare_to_wait_exclusive+0xd0/0xd0
Aug 19 12:25:46 shalem kernel: blk_mq_alloc_request+0x50/0xb0
Aug 19 12:25:46 shalem kernel: scsi_alloc_request+0xb/0x30
Aug 19 12:25:46 shalem kernel: sg_io+0x6b/0x390
Aug 19 12:25:46 shalem kernel: ? _copy_from_user+0x83/0xa0
Aug 19 12:25:46 shalem kernel: ? get_sg_io_hdr+0x26/0xb0
Aug 19 12:25:46 shalem kernel: scsi_ioctl+0x44c/0x960
Aug 19 12:25:46 shalem kernel: ? scsi_ioctl_block_when_processing_errors+0x45/0x50
Aug 19 12:25:46 shalem kernel: blkdev_ioctl+0x143/0x2c0
Aug 19 12:25:46 shalem kernel: __x64_sys_ioctl+0x90/0xd0
Aug 19 12:25:46 shalem kernel: do_syscall_64+0x5b/0x80
Aug 19 12:25:46 shalem kernel: ? do_syscall_64+0x67/0x80
Aug 19 12:25:46 shalem kernel: ? do_syscall_64+0x67/0x80
Aug 19 12:25:46 shalem kernel: ? do_syscall_64+0x67/0x80
Aug 19 12:25:46 shalem kernel: ? lockdep_hardirqs_on+0x7d/0x100
Aug 19 12:25:46 shalem kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Aug 19 12:25:46 shalem kernel: RIP: 0033:0x7f9641245cdf
Aug 19 12:25:46 shalem kernel: RSP: 002b:00007ffc73d257d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Aug 19 12:25:46 shalem kernel: RAX: ffffffffffffffda RBX: 00007ffc73d25ad0 RCX: 00007f9641245cdf
Aug 19 12:25:46 shalem kernel: RDX: 00007ffc73d25840 RSI: 0000000000002285 RDI: 0000000000000003
Aug 19 12:25:46 shalem kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Aug 19 12:25:46 shalem kernel: R10: 000000000000003c R11: 0000000000000246 R12: 00007ffc73d25c50
Aug 19 12:25:46 shalem kernel: R13: 0000000000000001 R14: 0000000000000000 R15: 000055fbff0abd10
Aug 19 12:25:46 shalem kernel: </TASK>
Aug 19 12:25:46 shalem kernel: INFO: task pool-udisksd:9404 blocked for more than 122 seconds.
Aug 19 12:25:46 shalem kernel: Tainted: G W OE ------- --- 6.0.0-0.rc1.20220817git3cc40a443a04.14.fc38.x86_64 #1
Aug 19 12:25:46 shalem kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 19 12:25:46 shalem kernel: task:pool-udisksd state:D stack:14624 pid: 9404 ppid: 1 flags:0x00000002
Aug 19 12:25:46 shalem kernel: Call Trace:
Aug 19 12:25:46 shalem kernel: <TASK>
Aug 19 12:25:46 shalem kernel: __schedule+0x47f/0x16d0
Aug 19 12:25:46 shalem kernel: ? lock_release+0x14f/0x460
Aug 19 12:25:46 shalem kernel: ? _raw_spin_unlock_irqrestore+0x30/0x60
Aug 19 12:25:46 shalem kernel: ? _raw_spin_unlock_irqrestore+0x40/0x60
Aug 19 12:25:46 shalem kernel: schedule+0x5d/0xe0
Aug 19 12:25:46 shalem kernel: blk_queue_enter+0xe5/0x190
Aug 19 12:25:46 shalem kernel: ? prepare_to_wait_exclusive+0xd0/0xd0
Aug 19 12:25:46 shalem kernel: blk_mq_alloc_request+0x50/0xb0
Aug 19 12:25:46 shalem kernel: scsi_alloc_request+0xb/0x30
Aug 19 12:25:46 shalem kernel: sg_io+0x6b/0x390
Aug 19 12:25:46 shalem kernel: ? _copy_from_user+0x83/0xa0
Aug 19 12:25:46 shalem kernel: ? get_sg_io_hdr+0x26/0xb0
Aug 19 12:25:46 shalem kernel: scsi_ioctl+0x44c/0x960
Aug 19 12:25:46 shalem kernel: ? scsi_ioctl_block_when_processing_errors+0x45/0x50
Aug 19 12:25:46 shalem kernel: blkdev_ioctl+0x143/0x2c0
Aug 19 12:25:46 shalem kernel: __x64_sys_ioctl+0x90/0xd0
Aug 19 12:25:46 shalem kernel: do_syscall_64+0x5b/0x80
Aug 19 12:25:46 shalem kernel: ? do_syscall_64+0x67/0x80
Aug 19 12:25:46 shalem kernel: ? lockdep_hardirqs_on+0x7d/0x100
Aug 19 12:25:46 shalem kernel: ? do_syscall_64+0x67/0x80
Aug 19 12:25:46 shalem kernel: ? ret_from_fork+0x15/0x30
Aug 19 12:25:46 shalem kernel: ? lockdep_hardirqs_on+0x7d/0x100
Aug 19 12:25:46 shalem kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Aug 19 12:25:46 shalem kernel: RIP: 0033:0x7f0c77065cdf
Aug 19 12:25:46 shalem kernel: RSP: 002b:00007f0c75dfd7f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Aug 19 12:25:46 shalem kernel: RAX: ffffffffffffffda RBX: 00007f0c75dfda00 RCX: 00007f0c77065cdf
Aug 19 12:25:46 shalem kernel: RDX: 00007f0c75dfd860 RSI: 0000000000002285 RDI: 0000000000000011
Aug 19 12:25:46 shalem kernel: RBP: 0000000000001388 R08: 00007f0c75dfd9e0 R09: 00007f0c75dfe550
Aug 19 12:25:46 shalem kernel: R10: 00007f0c75dfd960 R11: 0000000000000246 R12: 00007f0c75dfdb70
Aug 19 12:25:46 shalem kernel: R13: 0000000000000000 R14: 0000000000000011 R15: 00007f0c75dfd9e0
Aug 19 12:25:46 shalem kernel: </TASK>
Aug 19 12:25:46 shalem kernel:
Showing all locks held in the system:
Aug 19 12:25:46 shalem kernel: 1 lock held by rcu_tasks_kthre/11:
Aug 19 12:25:46 shalem kernel: #0: ffffffff86168820 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2b/0x3e0
Aug 19 12:25:46 shalem kernel: 1 lock held by rcu_tasks_rude_/12:
Aug 19 12:25:46 shalem kernel: #0: ffffffff86168560 (rcu_tasks_rude.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2b/0x3e0
Aug 19 12:25:46 shalem kernel: 1 lock held by rcu_tasks_trace/13:
Aug 19 12:25:46 shalem kernel: #0: ffffffff86168260 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2b/0x3e0
Aug 19 12:25:46 shalem kernel: 1 lock held by khungtaskd/100:
Aug 19 12:25:46 shalem kernel: #0: ffffffff86169320 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x16b
Aug 19 12:25:46 shalem kernel: 4 locks held by kworker/0:3/305:
Aug 19 12:25:46 shalem kernel: #0: ffff8ac180051748 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x20b/0x600
Aug 19 12:25:46 shalem kernel: #1: ffffa78dc4007e78 ((work_completion)(&ap->scsi_rescan_task)){+.+.}-{0:0}, at: process_one_work+0x20b/0x600
Aug 19 12:25:46 shalem kernel: #2: ffff8ac18ba840a8 (&ap->scsi_scan_mutex){+.+.}-{3:3}, at: ata_scsi_dev_rescan+0x32/0x110
Aug 19 12:25:46 shalem kernel: #3: ffff8ac1909b5390 (&dev->mutex){....}-{3:3}, at: scsi_rescan_device+0x24/0x90
Aug 19 12:25:46 shalem kernel: 1 lock held by LCDd/2542:
Aug 19 12:25:46 shalem kernel: #0: ffffffff8635a578 (minors_rwsem){++++}-{3:3}, at: hidraw_write+0x1e/0x50
Aug 19 12:25:46 shalem kernel:
Aug 19 12:25:46 shalem kernel: =============================================


Aug 19 12:27:49 shalem kernel: INFO: task kworker/0:3:305 blocked for more than 245 seconds.
Aug 19 12:27:49 shalem kernel: Tainted: G W OE ------- --- 6.0.0-0.rc1.20220817git3cc40a443a04.14.fc38.x86_64 #1
Aug 19 12:27:49 shalem kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 19 12:27:49 shalem kernel: task:kworker/0:3 state:D stack:13008 pid: 305 ppid: 2 flags:0x00004000
Aug 19 12:27:49 shalem kernel: Workqueue: events ata_scsi_dev_rescan
Aug 19 12:27:49 shalem kernel: Call Trace:
Aug 19 12:27:49 shalem kernel: <TASK>
Aug 19 12:27:49 shalem kernel: __schedule+0x47f/0x16d0
Aug 19 12:27:49 shalem kernel: ? mark_held_locks+0x50/0x80
Aug 19 12:27:49 shalem kernel: ? _raw_spin_unlock_irq+0x24/0x50
Aug 19 12:27:49 shalem kernel: schedule+0x5d/0xe0
Aug 19 12:27:49 shalem kernel: blk_mq_freeze_queue_wait+0x57/0x90
Aug 19 12:27:49 shalem kernel: ? prepare_to_wait_exclusive+0xd0/0xd0
Aug 19 12:27:49 shalem kernel: disk_clear_zone_settings+0x19/0xa0
Aug 19 12:27:49 shalem kernel: sd_revalidate_disk.isra.0+0x1660/0x2c60
Aug 19 12:27:49 shalem kernel: ? slab_free_freelist_hook.constprop.0+0x11b/0x160
Aug 19 12:27:49 shalem kernel: ? scsi_rescan_device+0x2c/0x90
Aug 19 12:27:49 shalem kernel: scsi_rescan_device+0x77/0x90
Aug 19 12:27:49 shalem kernel: ata_scsi_dev_rescan+0x9d/0x110
Aug 19 12:27:49 shalem kernel: process_one_work+0x2a0/0x600
Aug 19 12:27:49 shalem kernel: worker_thread+0x4f/0x3a0
Aug 19 12:27:49 shalem kernel: ? process_one_work+0x600/0x600
Aug 19 12:27:49 shalem kernel: kthread+0xf5/0x120
Aug 19 12:27:49 shalem kernel: ? kthread_complete_and_exit+0x20/0x20
Aug 19 12:27:49 shalem kernel: ret_from_fork+0x22/0x30
Aug 19 12:27:49 shalem kernel: </TASK>
Aug 19 12:27:49 shalem kernel: INFO: task smartd:2557 blocked for more than 245 seconds.
Aug 19 12:27:49 shalem kernel: Tainted: G W OE ------- --- 6.0.0-0.rc1.20220817git3cc40a443a04.14.fc38.x86_64 #1
Aug 19 12:27:49 shalem kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 19 12:27:49 shalem kernel: task:smartd state:D stack:13000 pid: 2557 ppid: 1 flags:0x00000002
Aug 19 12:27:49 shalem kernel: Call Trace:
Aug 19 12:27:49 shalem kernel: <TASK>
Aug 19 12:27:49 shalem kernel: __schedule+0x47f/0x16d0
Aug 19 12:27:49 shalem kernel: ? lock_release+0x14f/0x460
Aug 19 12:27:49 shalem kernel: ? _raw_spin_unlock_irqrestore+0x30/0x60
Aug 19 12:27:49 shalem kernel: ? _raw_spin_unlock_irqrestore+0x40/0x60
Aug 19 12:27:49 shalem kernel: schedule+0x5d/0xe0
Aug 19 12:27:49 shalem kernel: blk_queue_enter+0xe5/0x190
Aug 19 12:27:49 shalem kernel: ? prepare_to_wait_exclusive+0xd0/0xd0
Aug 19 12:27:49 shalem kernel: blk_mq_alloc_request+0x50/0xb0
Aug 19 12:27:49 shalem kernel: scsi_alloc_request+0xb/0x30
Aug 19 12:27:49 shalem kernel: sg_io+0x6b/0x390
Aug 19 12:27:49 shalem kernel: ? _copy_from_user+0x83/0xa0
Aug 19 12:27:49 shalem kernel: ? get_sg_io_hdr+0x26/0xb0
Aug 19 12:27:49 shalem kernel: scsi_ioctl+0x44c/0x960
Aug 19 12:27:49 shalem kernel: ? scsi_ioctl_block_when_processing_errors+0x45/0x50
Aug 19 12:27:49 shalem kernel: blkdev_ioctl+0x143/0x2c0
Aug 19 12:27:49 shalem kernel: __x64_sys_ioctl+0x90/0xd0
Aug 19 12:27:49 shalem kernel: do_syscall_64+0x5b/0x80
Aug 19 12:27:49 shalem kernel: ? do_syscall_64+0x67/0x80
Aug 19 12:27:49 shalem kernel: ? do_syscall_64+0x67/0x80
Aug 19 12:27:49 shalem kernel: ? do_syscall_64+0x67/0x80
Aug 19 12:27:49 shalem kernel: ? lockdep_hardirqs_on+0x7d/0x100
Aug 19 12:27:49 shalem kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Aug 19 12:27:49 shalem kernel: RIP: 0033:0x7f9641245cdf
Aug 19 12:27:49 shalem kernel: RSP: 002b:00007ffc73d257d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Aug 19 12:27:49 shalem kernel: RAX: ffffffffffffffda RBX: 00007ffc73d25ad0 RCX: 00007f9641245cdf
Aug 19 12:27:49 shalem kernel: RDX: 00007ffc73d25840 RSI: 0000000000002285 RDI: 0000000000000003
Aug 19 12:27:49 shalem kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Aug 19 12:27:49 shalem kernel: R10: 000000000000003c R11: 0000000000000246 R12: 00007ffc73d25c50
Aug 19 12:27:49 shalem kernel: R13: 0000000000000001 R14: 0000000000000000 R15: 000055fbff0abd10
Aug 19 12:27:49 shalem kernel: </TASK>
Aug 19 12:27:49 shalem kernel: INFO: task pool-udisksd:9404 blocked for more than 245 seconds.
Aug 19 12:27:49 shalem kernel: Tainted: G W OE ------- --- 6.0.0-0.rc1.20220817git3cc40a443a04.14.fc38.x86_64 #1
Aug 19 12:27:49 shalem kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 19 12:27:49 shalem kernel: task:pool-udisksd state:D stack:14624 pid: 9404 ppid: 1 flags:0x00000002
Aug 19 12:27:49 shalem kernel: Call Trace:
Aug 19 12:27:49 shalem kernel: <TASK>
Aug 19 12:27:49 shalem kernel: __schedule+0x47f/0x16d0
Aug 19 12:27:49 shalem kernel: ? lock_release+0x14f/0x460
Aug 19 12:27:49 shalem kernel: ? _raw_spin_unlock_irqrestore+0x30/0x60
Aug 19 12:27:49 shalem kernel: ? _raw_spin_unlock_irqrestore+0x40/0x60
Aug 19 12:27:49 shalem kernel: schedule+0x5d/0xe0
Aug 19 12:27:49 shalem kernel: blk_queue_enter+0xe5/0x190
Aug 19 12:27:49 shalem kernel: ? prepare_to_wait_exclusive+0xd0/0xd0
Aug 19 12:27:49 shalem kernel: blk_mq_alloc_request+0x50/0xb0
Aug 19 12:27:49 shalem kernel: scsi_alloc_request+0xb/0x30
Aug 19 12:27:49 shalem kernel: sg_io+0x6b/0x390
Aug 19 12:27:49 shalem kernel: ? _copy_from_user+0x83/0xa0
Aug 19 12:27:49 shalem kernel: ? get_sg_io_hdr+0x26/0xb0
Aug 19 12:27:49 shalem kernel: scsi_ioctl+0x44c/0x960
Aug 19 12:27:49 shalem kernel: ? scsi_ioctl_block_when_processing_errors+0x45/0x50
Aug 19 12:27:49 shalem kernel: blkdev_ioctl+0x143/0x2c0
Aug 19 12:27:49 shalem kernel: __x64_sys_ioctl+0x90/0xd0
Aug 19 12:27:49 shalem kernel: do_syscall_64+0x5b/0x80
Aug 19 12:27:49 shalem kernel: ? do_syscall_64+0x67/0x80
Aug 19 12:27:49 shalem kernel: ? lockdep_hardirqs_on+0x7d/0x100
Aug 19 12:27:49 shalem kernel: ? do_syscall_64+0x67/0x80
Aug 19 12:27:49 shalem kernel: ? ret_from_fork+0x15/0x30
Aug 19 12:27:49 shalem kernel: ? lockdep_hardirqs_on+0x7d/0x100
Aug 19 12:27:49 shalem kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Aug 19 12:27:49 shalem kernel: RIP: 0033:0x7f0c77065cdf
Aug 19 12:27:49 shalem kernel: RSP: 002b:00007f0c75dfd7f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Aug 19 12:27:49 shalem kernel: RAX: ffffffffffffffda RBX: 00007f0c75dfda00 RCX: 00007f0c77065cdf
Aug 19 12:27:49 shalem kernel: RDX: 00007f0c75dfd860 RSI: 0000000000002285 RDI: 0000000000000011
Aug 19 12:27:49 shalem kernel: RBP: 0000000000001388 R08: 00007f0c75dfd9e0 R09: 00007f0c75dfe550
Aug 19 12:27:49 shalem kernel: R10: 00007f0c75dfd960 R11: 0000000000000246 R12: 00007f0c75dfdb70
Aug 19 12:27:49 shalem kernel: R13: 0000000000000000 R14: 0000000000000011 R15: 00007f0c75dfd9e0
Aug 19 12:27:49 shalem kernel: </TASK>
Aug 19 12:27:49 shalem kernel:
Showing all locks held in the system:
Aug 19 12:27:49 shalem kernel: 1 lock held by rcu_tasks_kthre/11:
Aug 19 12:27:49 shalem kernel: #0: ffffffff86168820 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2b/0x3e0
Aug 19 12:27:49 shalem kernel: 1 lock held by rcu_tasks_rude_/12:
Aug 19 12:27:49 shalem kernel: #0: ffffffff86168560 (rcu_tasks_rude.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2b/0x3e0
Aug 19 12:27:49 shalem kernel: 1 lock held by rcu_tasks_trace/13:
Aug 19 12:27:49 shalem kernel: #0: ffffffff86168260 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2b/0x3e0
Aug 19 12:27:49 shalem kernel: 1 lock held by khungtaskd/100:
Aug 19 12:27:49 shalem kernel: #0: ffffffff86169320 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x16b
Aug 19 12:27:49 shalem kernel: 4 locks held by kworker/0:3/305:
Aug 19 12:27:49 shalem kernel: #0: ffff8ac180051748 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x20b/0x600
Aug 19 12:27:49 shalem kernel: #1: ffffa78dc4007e78 ((work_completion)(&ap->scsi_rescan_task)){+.+.}-{0:0}, at: process_one_work+0x20b/0x600
Aug 19 12:27:49 shalem kernel: #2: ffff8ac18ba840a8 (&ap->scsi_scan_mutex){+.+.}-{3:3}, at: ata_scsi_dev_rescan+0x32/0x110
Aug 19 12:27:49 shalem kernel: #3: ffff8ac1909b5390 (&dev->mutex){....}-{3:3}, at: scsi_rescan_device+0x24/0x90
Aug 19 12:27:49 shalem kernel: 1 lock held by systemd-journal/2096:
Aug 19 12:27:49 shalem kernel: 1 lock held by LCDd/2542:
Aug 19 12:27:49 shalem kernel: #0: ffffffff8635a578 (minors_rwsem){++++}-{3:3}, at: hidraw_write+0x1e/0x50
Aug 19 12:27:49 shalem kernel:
Aug 19 12:27:49 shalem kernel: =============================================


Aug 19 12:29:51 shalem kernel: INFO: task kworker/0:3:305 blocked for more than 368 seconds.
Aug 19 12:29:51 shalem kernel: Tainted: G W OE ------- --- 6.0.0-0.rc1.20220817git3cc40a443a04.14.fc38.x86_64 #1
Aug 19 12:29:51 shalem kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 19 12:29:51 shalem kernel: task:kworker/0:3 state:D stack:13008 pid: 305 ppid: 2 flags:0x00004000
Aug 19 12:29:51 shalem kernel: Workqueue: events ata_scsi_dev_rescan
Aug 19 12:29:51 shalem kernel: Call Trace:
Aug 19 12:29:51 shalem kernel: <TASK>
Aug 19 12:29:51 shalem kernel: __schedule+0x47f/0x16d0
Aug 19 12:29:51 shalem kernel: ? mark_held_locks+0x50/0x80
Aug 19 12:29:51 shalem kernel: ? _raw_spin_unlock_irq+0x24/0x50
Aug 19 12:29:51 shalem kernel: schedule+0x5d/0xe0
Aug 19 12:29:51 shalem kernel: blk_mq_freeze_queue_wait+0x57/0x90
Aug 19 12:29:51 shalem kernel: ? prepare_to_wait_exclusive+0xd0/0xd0
Aug 19 12:29:51 shalem kernel: disk_clear_zone_settings+0x19/0xa0
Aug 19 12:29:51 shalem kernel: sd_revalidate_disk.isra.0+0x1660/0x2c60
Aug 19 12:29:51 shalem kernel: ? slab_free_freelist_hook.constprop.0+0x11b/0x160
Aug 19 12:29:51 shalem kernel: ? scsi_rescan_device+0x2c/0x90
Aug 19 12:29:51 shalem kernel: scsi_rescan_device+0x77/0x90
Aug 19 12:29:51 shalem kernel: ata_scsi_dev_rescan+0x9d/0x110
Aug 19 12:29:51 shalem kernel: process_one_work+0x2a0/0x600
Aug 19 12:29:51 shalem kernel: worker_thread+0x4f/0x3a0
Aug 19 12:29:51 shalem kernel: ? process_one_work+0x600/0x600
Aug 19 12:29:51 shalem kernel: kthread+0xf5/0x120
Aug 19 12:29:51 shalem kernel: ? kthread_complete_and_exit+0x20/0x20
Aug 19 12:29:51 shalem kernel: ret_from_fork+0x22/0x30
Aug 19 12:29:51 shalem kernel: </TASK>
Aug 19 12:29:51 shalem kernel: INFO: task smartd:2557 blocked for more than 368 seconds.
Aug 19 12:29:51 shalem kernel: Tainted: G W OE ------- --- 6.0.0-0.rc1.20220817git3cc40a443a04.14.fc38.x86_64 #1
Aug 19 12:29:51 shalem kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 19 12:29:51 shalem kernel: task:smartd state:D stack:13000 pid: 2557 ppid: 1 flags:0x00000002
Aug 19 12:29:51 shalem kernel: Call Trace:
Aug 19 12:29:51 shalem kernel: <TASK>
Aug 19 12:29:51 shalem kernel: __schedule+0x47f/0x16d0
Aug 19 12:29:51 shalem kernel: ? lock_release+0x14f/0x460
Aug 19 12:29:51 shalem kernel: ? _raw_spin_unlock_irqrestore+0x30/0x60
Aug 19 12:29:51 shalem kernel: ? _raw_spin_unlock_irqrestore+0x40/0x60
Aug 19 12:29:51 shalem kernel: schedule+0x5d/0xe0
Aug 19 12:29:51 shalem kernel: blk_queue_enter+0xe5/0x190
Aug 19 12:29:51 shalem kernel: ? prepare_to_wait_exclusive+0xd0/0xd0
Aug 19 12:29:51 shalem kernel: blk_mq_alloc_request+0x50/0xb0
Aug 19 12:29:51 shalem kernel: scsi_alloc_request+0xb/0x30
Aug 19 12:29:51 shalem kernel: sg_io+0x6b/0x390
Aug 19 12:29:51 shalem kernel: ? _copy_from_user+0x83/0xa0
Aug 19 12:29:51 shalem kernel: ? get_sg_io_hdr+0x26/0xb0
Aug 19 12:29:51 shalem kernel: scsi_ioctl+0x44c/0x960
Aug 19 12:29:51 shalem kernel: ? scsi_ioctl_block_when_processing_errors+0x45/0x50
Aug 19 12:29:51 shalem kernel: blkdev_ioctl+0x143/0x2c0
Aug 19 12:29:51 shalem kernel: __x64_sys_ioctl+0x90/0xd0
Aug 19 12:29:51 shalem kernel: do_syscall_64+0x5b/0x80
Aug 19 12:29:51 shalem kernel: ? do_syscall_64+0x67/0x80
Aug 19 12:29:51 shalem kernel: ? do_syscall_64+0x67/0x80
Aug 19 12:29:51 shalem kernel: ? do_syscall_64+0x67/0x80
Aug 19 12:29:51 shalem kernel: ? lockdep_hardirqs_on+0x7d/0x100
Aug 19 12:29:51 shalem kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Aug 19 12:29:51 shalem kernel: RIP: 0033:0x7f9641245cdf
Aug 19 12:29:51 shalem kernel: RSP: 002b:00007ffc73d257d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Aug 19 12:29:51 shalem kernel: RAX: ffffffffffffffda RBX: 00007ffc73d25ad0 RCX: 00007f9641245cdf
Aug 19 12:29:51 shalem kernel: RDX: 00007ffc73d25840 RSI: 0000000000002285 RDI: 0000000000000003
Aug 19 12:29:51 shalem kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Aug 19 12:29:51 shalem kernel: R10: 000000000000003c R11: 0000000000000246 R12: 00007ffc73d25c50
Aug 19 12:29:51 shalem kernel: R13: 0000000000000001 R14: 0000000000000000 R15: 000055fbff0abd10
Aug 19 12:29:51 shalem kernel: </TASK>
Aug 19 12:29:51 shalem kernel: INFO: task pool-udisksd:9404 blocked for more than 368 seconds.
Aug 19 12:29:51 shalem kernel: Tainted: G W OE ------- --- 6.0.0-0.rc1.20220817git3cc40a443a04.14.fc38.x86_64 #1
Aug 19 12:29:51 shalem kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 19 12:29:51 shalem kernel: task:pool-udisksd state:D stack:14624 pid: 9404 ppid: 1 flags:0x00000002
Aug 19 12:29:51 shalem kernel: Call Trace:
Aug 19 12:29:51 shalem kernel: <TASK>
Aug 19 12:29:51 shalem kernel: __schedule+0x47f/0x16d0
Aug 19 12:29:51 shalem kernel: ? lock_release+0x14f/0x460
Aug 19 12:29:51 shalem kernel: ? _raw_spin_unlock_irqrestore+0x30/0x60
Aug 19 12:29:51 shalem kernel: ? _raw_spin_unlock_irqrestore+0x40/0x60
Aug 19 12:29:51 shalem kernel: schedule+0x5d/0xe0
Aug 19 12:29:51 shalem kernel: blk_queue_enter+0xe5/0x190
Aug 19 12:29:51 shalem kernel: ? prepare_to_wait_exclusive+0xd0/0xd0
Aug 19 12:29:51 shalem kernel: blk_mq_alloc_request+0x50/0xb0
Aug 19 12:29:51 shalem kernel: scsi_alloc_request+0xb/0x30
Aug 19 12:29:51 shalem kernel: sg_io+0x6b/0x390
Aug 19 12:29:51 shalem kernel: ? _copy_from_user+0x83/0xa0
Aug 19 12:29:51 shalem kernel: ? get_sg_io_hdr+0x26/0xb0
Aug 19 12:29:51 shalem kernel: scsi_ioctl+0x44c/0x960
Aug 19 12:29:51 shalem kernel: ? scsi_ioctl_block_when_processing_errors+0x45/0x50
Aug 19 12:29:51 shalem kernel: blkdev_ioctl+0x143/0x2c0
Aug 19 12:29:51 shalem kernel: __x64_sys_ioctl+0x90/0xd0
Aug 19 12:29:51 shalem kernel: do_syscall_64+0x5b/0x80
Aug 19 12:29:51 shalem kernel: ? do_syscall_64+0x67/0x80
Aug 19 12:29:51 shalem kernel: ? lockdep_hardirqs_on+0x7d/0x100
Aug 19 12:29:51 shalem kernel: ? do_syscall_64+0x67/0x80
Aug 19 12:29:51 shalem kernel: ? ret_from_fork+0x15/0x30
Aug 19 12:29:51 shalem kernel: ? lockdep_hardirqs_on+0x7d/0x100
Aug 19 12:29:51 shalem kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Aug 19 12:29:51 shalem kernel: RIP: 0033:0x7f0c77065cdf
Aug 19 12:29:51 shalem kernel: RSP: 002b:00007f0c75dfd7f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Aug 19 12:29:51 shalem kernel: RAX: ffffffffffffffda RBX: 00007f0c75dfda00 RCX: 00007f0c77065cdf
Aug 19 12:29:51 shalem kernel: RDX: 00007f0c75dfd860 RSI: 0000000000002285 RDI: 0000000000000011
Aug 19 12:29:51 shalem kernel: RBP: 0000000000001388 R08: 00007f0c75dfd9e0 R09: 00007f0c75dfe550
Aug 19 12:29:51 shalem kernel: R10: 00007f0c75dfd960 R11: 0000000000000246 R12: 00007f0c75dfdb70
Aug 19 12:29:51 shalem kernel: R13: 0000000000000000 R14: 0000000000000011 R15: 00007f0c75dfd9e0
Aug 19 12:29:51 shalem kernel: </TASK>
Aug 19 12:29:51 shalem kernel:
Showing all locks held in the system:
Aug 19 12:29:51 shalem kernel: 1 lock held by rcu_tasks_kthre/11:
Aug 19 12:29:51 shalem kernel: #0: ffffffff86168820 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2b/0x3e0
Aug 19 12:29:51 shalem kernel: 1 lock held by rcu_tasks_rude_/12:
Aug 19 12:29:51 shalem kernel: #0: ffffffff86168560 (rcu_tasks_rude.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2b/0x3e0
Aug 19 12:29:51 shalem kernel: 1 lock held by rcu_tasks_trace/13:
Aug 19 12:29:51 shalem kernel: #0: ffffffff86168260 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2b/0x3e0
Aug 19 12:29:51 shalem kernel: 1 lock held by khungtaskd/100:
Aug 19 12:29:51 shalem kernel: #0: ffffffff86169320 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x16b
Aug 19 12:29:51 shalem kernel: 4 locks held by kworker/0:3/305:
Aug 19 12:29:51 shalem kernel: #0: ffff8ac180051748 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x20b/0x600
Aug 19 12:29:51 shalem kernel: #1: ffffa78dc4007e78 ((work_completion)(&ap->scsi_rescan_task)){+.+.}-{0:0}, at: process_one_work+0x20b/0x600
Aug 19 12:29:51 shalem kernel: #2: ffff8ac18ba840a8 (&ap->scsi_scan_mutex){+.+.}-{3:3}, at: ata_scsi_dev_rescan+0x32/0x110
Aug 19 12:29:51 shalem kernel: #3: ffff8ac1909b5390 (&dev->mutex){....}-{3:3}, at: scsi_rescan_device+0x24/0x90
Aug 19 12:29:51 shalem kernel: 1 lock held by LCDd/2542:
Aug 19 12:29:51 shalem kernel: #0: ffffffff8635a578 (minors_rwsem){++++}-{3:3}, at: hidraw_write+0x1e/0x50
Aug 19 12:29:51 shalem kernel:
Aug 19 12:29:51 shalem kernel: =============================================


Aug 19 12:31:54 shalem kernel: INFO: task kworker/0:3:305 blocked for more than 491 seconds.
Aug 19 12:31:54 shalem kernel: Tainted: G W OE ------- --- 6.0.0-0.rc1.20220817git3cc40a443a04.14.fc38.x86_64 #1
Aug 19 12:31:54 shalem kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 19 12:31:54 shalem kernel: task:kworker/0:3 state:D stack:13008 pid: 305 ppid: 2 flags:0x00004000
Aug 19 12:31:54 shalem kernel: Workqueue: events ata_scsi_dev_rescan
Aug 19 12:31:54 shalem kernel: Call Trace:
Aug 19 12:31:54 shalem kernel: <TASK>
Aug 19 12:31:54 shalem kernel: __schedule+0x47f/0x16d0
Aug 19 12:31:54 shalem kernel: ? mark_held_locks+0x50/0x80
Aug 19 12:31:54 shalem kernel: ? _raw_spin_unlock_irq+0x24/0x50
Aug 19 12:31:54 shalem kernel: schedule+0x5d/0xe0
Aug 19 12:31:54 shalem kernel: blk_mq_freeze_queue_wait+0x57/0x90
Aug 19 12:31:54 shalem kernel: ? prepare_to_wait_exclusive+0xd0/0xd0
Aug 19 12:31:54 shalem kernel: disk_clear_zone_settings+0x19/0xa0
Aug 19 12:31:54 shalem kernel: sd_revalidate_disk.isra.0+0x1660/0x2c60
Aug 19 12:31:54 shalem kernel: ? slab_free_freelist_hook.constprop.0+0x11b/0x160
Aug 19 12:31:54 shalem kernel: ? scsi_rescan_device+0x2c/0x90
Aug 19 12:31:54 shalem kernel: scsi_rescan_device+0x77/0x90
Aug 19 12:31:54 shalem kernel: ata_scsi_dev_rescan+0x9d/0x110
Aug 19 12:31:54 shalem kernel: process_one_work+0x2a0/0x600
Aug 19 12:31:54 shalem kernel: worker_thread+0x4f/0x3a0
Aug 19 12:31:54 shalem kernel: ? process_one_work+0x600/0x600
Aug 19 12:31:54 shalem kernel: kthread+0xf5/0x120
Aug 19 12:31:54 shalem kernel: ? kthread_complete_and_exit+0x20/0x20
Aug 19 12:31:54 shalem kernel: ret_from_fork+0x22/0x30
Aug 19 12:31:54 shalem kernel: </TASK>
Aug 19 12:31:54 shalem kernel:
Showing all locks held in the system:
Aug 19 12:31:54 shalem kernel: 1 lock held by rcu_tasks_kthre/11:
Aug 19 12:31:54 shalem kernel: #0: ffffffff86168820 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2b/0x3e0
Aug 19 12:31:54 shalem kernel: 1 lock held by rcu_tasks_rude_/12:
Aug 19 12:31:54 shalem kernel: #0: ffffffff86168560 (rcu_tasks_rude.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2b/0x3e0
Aug 19 12:31:54 shalem kernel: 1 lock held by rcu_tasks_trace/13:
Aug 19 12:31:54 shalem kernel: #0: ffffffff86168260 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2b/0x3e0
Aug 19 12:31:54 shalem kernel: 1 lock held by khungtaskd/100:
Aug 19 12:31:54 shalem kernel: #0: ffffffff86169320 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x16b
Aug 19 12:31:54 shalem kernel: 4 locks held by kworker/0:3/305:
Aug 19 12:31:54 shalem kernel: #0: ffff8ac180051748 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x20b/0x600
Aug 19 12:31:54 shalem kernel: #1: ffffa78dc4007e78 ((work_completion)(&ap->scsi_rescan_task)){+.+.}-{0:0}, at: process_one_work+0x20b/0x600
Aug 19 12:31:54 shalem kernel: #2: ffff8ac18ba840a8 (&ap->scsi_scan_mutex){+.+.}-{3:3}, at: ata_scsi_dev_rescan+0x32/0x110
Aug 19 12:31:54 shalem kernel: #3: ffff8ac1909b5390 (&dev->mutex){....}-{3:3}, at: scsi_rescan_device+0x24/0x90
Aug 19 12:31:54 shalem kernel: 1 lock held by gfx/688:
Aug 19 12:31:54 shalem kernel: #0: ffffffff86169320 (rcu_read_lock){....}-{1:2}, at: raw_spin_rq_lock_nested+0x24/0x80
Aug 19 12:31:54 shalem kernel: 1 lock held by LCDd/2542:
Aug 19 12:31:54 shalem kernel: #0: ffffffff8635a578 (minors_rwsem){++++}-{3:3}, at: hidraw_write+0x1e/0x50
Aug 19 12:31:54 shalem kernel: 2 locks held by kworker/u64:119/8383:
Aug 19 12:31:54 shalem kernel: 1 lock held by WRRende~ckend#1/9586:
Aug 19 12:31:54 shalem kernel:
Aug 19 12:31:54 shalem kernel: =============================================

Sorry for not being able to write a better bug report, but I don't
have the time to dive deeper into this. I hope this report is enough
to give someone a clue about what is going on.

Regards,

Hans


2022-08-19 15:19:53

by Bart Van Assche

Subject: Re: 6.0-rc1 regression block (blk_mq) / RCU task stuck errors + block-io hang

On 8/19/22 05:01, Hans de Goede wrote:
> I've been dogfooding 6.0-rc1 on my main workstation and I have hit
> a pretty serious bug, serious enough for me to go back to 5.19.
>
> My dmesg shows various blk_mq (RCU?) related lockdep splats,
> followed by some tasks getting stuck on disk I/O. E.g. "sync"
> is guaranteed to hang, and other tasks hang too.
>
> This seems to be mainly the case on "sd" disks (both SATA
> and USB), whereas my main NVMe drive seems fine, which has
> probably saved me from worse issues...
>
> Here are 4 hung-task reports from my last boot, where
> I had to turn off the machine by holding the power button
> down for 4 seconds.
>
> [ ... ]
>
> Sorry for not being able to write a better bug report, but I don't
> have the time to dive deeper into this. I hope this report is enough
> to give someone a clue about what is going on.

Thank you for the detailed report. I think it contains enough
information to root-cause this issue, something that was not
possible before.

Please help verify whether this patch fixes the issue: "[PATCH]
scsi: sd: Revert "Rework asynchronous resume support""
(https://lore.kernel.org/linux-scsi/[email protected]/).
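
For anyone not familiar with the blk-mq freeze machinery: the two
stack patterns above (one task stuck in blk_mq_freeze_queue_wait(),
the others in blk_queue_enter()) are the signature of a frozen queue
whose in-flight requests never drain. Below is a reduced user-space
model of that pairing, purely my own sketch; the kernel actually uses
a percpu refcount, q_usage_counter, plus wait queues rather than a
mutex/condvar:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int usage_counter;   /* cf. q_usage_counter */
static bool frozen;         /* cf. q->mq_freeze_depth != 0 */

/* cf. blk_queue_enter(): new requests wait while the queue is frozen */
static void queue_enter(void)
{
    pthread_mutex_lock(&lock);
    while (frozen)
        pthread_cond_wait(&cond, &lock);
    usage_counter++;
    pthread_mutex_unlock(&lock);
}

/* cf. blk_queue_exit(): runs when a request completes */
static void queue_exit(void)
{
    pthread_mutex_lock(&lock);
    usage_counter--;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
}

/*
 * cf. blk_mq_freeze_queue(): block new entries, then wait for the
 * usage counter to drain.  If an already-entered request never
 * completes (queue_exit() never runs), this waits forever, and every
 * later queue_enter() caller piles up behind the frozen flag; that is
 * exactly the blk_mq_freeze_queue_wait() vs. blk_queue_enter() split
 * in the hung-task reports.
 */
static void freeze_queue(void)
{
    pthread_mutex_lock(&lock);
    frozen = true;
    while (usage_counter > 0)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    queue_enter();   /* a request enters ... */
    queue_exit();    /* ... and completes */
    freeze_queue();  /* drains immediately: the counter is zero */
    printf("froze cleanly; a request stuck in flight here would hang forever\n");
    return 0;
}

That matches the traces: the rescan worker froze the sd queue and is
waiting for it to drain, while smartd and udisksd queue up behind it
in blk_queue_enter().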

Thanks,

Bart.

2022-08-20 15:55:40

by Hans de Goede

Subject: Re: 6.0-rc1 regression block (blk_mq) / RCU task stuck errors + block-io hang

Hi Bart,

On 8/19/22 16:49, Bart Van Assche wrote:
> On 8/19/22 05:01, Hans de Goede wrote:
>> I've been dogfooding 6.0-rc1 on my main workstation and I have hit
>> a pretty serious bug, serious enough for me to go back to 5.19.
>>
>> My dmesg shows various blk_mq (RCU?) related lockdep splats,
>> followed by some tasks getting stuck on disk I/O. E.g. "sync"
>> is guaranteed to hang, and other tasks hang too.
>>
>> This seems to be mainly the case on "sd" disks (both SATA
>> and USB), whereas my main NVMe drive seems fine, which has
>> probably saved me from worse issues...
>>
>> Here are 4 hung-task reports from my last boot, where
>> I had to turn off the machine by holding the power button
>> down for 4 seconds.
>>
>> [ ... ]
>>
>> Sorry for not being able to write a better bug report, but I don't
>> have the time to dive deeper into this. I hope this report is enough
>> to give someone a clue about what is going on.
>
> Thank you for the detailed report. I think it contains enough information to root-cause this issue, something that was not possible before.
>
> Please help verify whether this patch fixes the issue: "[PATCH] scsi: sd: Revert "Rework asynchronous resume support"" (https://lore.kernel.org/linux-scsi/[email protected]/).

Thanks, that is very useful. I'm running 6.0-rc1 with this patch
added now, and so far I have not seen the problem recur.

I was also seeing 6.0 suspend/resume issues on 2 laptops with
SATA disks (rather than NVMe), which I had not yet gotten around
to collecting logs from / reporting. I'm happy to report that
those suspend/resume issues are also fixed by this patch.

I'll reply to the patch with my Tested-by.

Regards,

Hans