When scsi disks went wrong frequently, and with serial console
attached, tasks may be blocked in the following flow for more than 10s:
[ 557.369580] <<EOE>> [<ffffffff81309336>] blkcg_print_blkgs+0x76/0xf0 ----》 wait for blkg->q->queue_lock
[ 557.369581] [<ffffffff8130f236>] cfqg_print_rwstat_recursive+0x36/0x40
[ 557.369583] [<ffffffff81109393>] cgroup_seqfile_show+0x73/0x80
[ 557.369584] [<ffffffff81222b57>] ? seq_buf_alloc+0x17/0x40
[ 557.369585] [<ffffffff8122305a>] seq_read+0x10a/0x3b0
[ 557.369586] [<ffffffff811fe9be>] vfs_read+0x9e/0x170
[ 557.369587] [<ffffffff811ff58f>] SyS_read+0x7f/0xe0
[ 557.369588] [<ffffffff81697ac9>] system_call_fastpath+0x16/0x1b
That's because the serial is very slow, and the other task hold the
q->queue_lock for very long time, waiting for the serial finishing
writing.
PID: 319 TASK: ffff881ffb09edd0 CPU: 7 COMMAND: "kworker/u113:1"
...
#4 [ffff881ffb0b7540] delay_tsc at ffffffff81326724
#5 [ffff881ffb0b7548] __const_udelay at ffffffff81326678
#6 [ffff881ffb0b7558] wait_for_xmitr at ffffffff814056e0
#7 [ffff881ffb0b7580] serial8250_console_putchar at ffffffff814058ac
#8 [ffff881ffb0b75a0] uart_console_write at ffffffff8140035a
#9 [ffff881ffb0b75d0] serial8250_console_write at ffffffff814057fe
#10 [ffff881ffb0b7618] call_console_drivers.constprop.17 at ffffffff81087011
#11 [ffff881ffb0b7640] console_unlock at ffffffff810889e9
#12 [ffff881ffb0b7680] vprintk_emit at ffffffff81088df4
#13 [ffff881ffb0b76f0] dev_vprintk_emit at ffffffff81428e72
#14 [ffff881ffb0b77a8] dev_printk_emit at ffffffff81428eee
#15 [ffff881ffb0b7808] __dev_printk at ffffffff8142937e
#16 [ffff881ffb0b7818] dev_printk at ffffffff8142942d
#17 [ffff881ffb0b7888] sdev_prefix_printk at ffffffff81463771
#18 [ffff881ffb0b7918] scsi_prep_state_check at ffffffff814598e4
#19 [ffff881ffb0b7928] scsi_prep_fn at ffffffff8145992d
#20 [ffff881ffb0b7960] blk_peek_request at ffffffff812f0826
#21 [ffff881ffb0b7988] scsi_request_fn at ffffffff8145b588
#22 [ffff881ffb0b79f0] __blk_run_queue at ffffffff812ebd63
#23 [ffff881ffb0b7a08] blk_queue_bio at ffffffff812f1013 -----》acquired q->queue_lock and wait for console_write to finish
#24 [ffff881ffb0b7a50] generic_make_request at ffffffff812ef209
#25 [ffff881ffb0b7a98] submit_bio at ffffffff812ef351
#26 [ffff881ffb0b7af0] xfs_submit_ioend_bio at ffffffffa0146a63 [xfs]
#27 [ffff881ffb0b7b00] xfs_submit_ioend at ffffffffa0146b31 [xfs]
#28 [ffff881ffb0b7b40] xfs_vm_writepages at ffffffffa0146e18 [xfs]
#29 [ffff881ffb0b7bb8] do_writepages at ffffffff8118da6e
commit e480af09c49736848f749a43dff2c902104f6691 avoided the watchdog
trigger, but could not avoid tasks being blocked for long time.
This patch replacing the sdev_printk with async printk_deferred can
avoid task blocking because of the slow serial and unstable disks in
such senario.
Signed-off-by: Wen Yang <[email protected]>
Signed-off-by: Jiang Biao <[email protected]>
---
drivers/scsi/scsi_lib.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index a86df9c..2bbe34c 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1297,8 +1297,10 @@ static int scsi_setup_cmnd(struct scsi_device *sdev, struct request *req)
* commands. The device must be brought online
* before trying any recovery commands.
*/
- sdev_printk(KERN_ERR, sdev,
- "rejecting I/O to offline device\n");
+ printk_deferred(KERN_ERR
+ "%s %s: rejecting I/O to offline device\n",
+ dev_driver_string(&sdev->sdev_gendev),
+ dev_name(&sdev->sdev_gendev));
ret = BLKPREP_KILL;
break;
case SDEV_DEL:
@@ -1306,8 +1308,10 @@ static int scsi_setup_cmnd(struct scsi_device *sdev, struct request *req)
* If the device is fully deleted, we refuse to
* process any commands as well.
*/
- sdev_printk(KERN_ERR, sdev,
- "rejecting I/O to dead device\n");
+ printk_deferred(KERN_ERR
+ "%s %s: rejecting I/O to dead device\n",
+ dev_driver_string(&sdev->sdev_gendev),
+ dev_name(&sdev->sdev_gendev));
ret = BLKPREP_KILL;
break;
case SDEV_BLOCK:
@@ -1798,8 +1802,10 @@ static void scsi_request_fn(struct request_queue *q)
break;
if (unlikely(!scsi_device_online(sdev))) {
- sdev_printk(KERN_ERR, sdev,
- "rejecting I/O to offline device\n");
+ printk_deferred(KERN_ERR
+ "%s %s: rejecting I/O to offline device\n",
+ dev_driver_string(&sdev->sdev_gendev),
+ dev_name(&sdev->sdev_gendev));
scsi_kill_request(req, q);
continue;
}
--
1.8.3.1
On Thu, 2018-03-08 at 16:50 +0800, Wen Yang wrote:
> This patch replacing the sdev_printk with async printk_deferred can
> avoid task blocking because of the slow serial and unstable disks in
> such senario.
Although I'm not enthusiast about this approach, I haven't been able to come
up with a better solution and the patch itself looks fine to me.
Bart.
On Thu, 2018-03-08 at 16:50 +0800, Wen Yang wrote:
> When scsi disks went wrong frequently, and with serial console
> attached, tasks may be blocked in the following flow for more than 10s: [ ... ]
From https://bugzilla.kernel.org/show_bug.cgi?id=199003: "Hm,
printk_deferred is a bit dangerous; it moves console_unlock() to
IRQ. [ ... ] So I'd say that those two approaches
printk_deferred + touch_nmi_watchdog
combined can do quite some harm. One thing for sure - they don't really fix
any problems."
Does that mean that this patch should be dropped?
Thanks,
Bart.