Hi,
I found two hang problems between the iscsid service and the iscsi kernel module, and I can still reproduce one
of them reliably on the latest kernel, so I believe the problems really exist.
It took me a long time to track them down due to my lack of iSCSI knowledge, and I have not
found a good way to solve either of them.
Please help take a look at them. Thanks.
=========
Problem 1:
***************
[What it looks like]
***************
First, we connect to about 10 remote LUNs with the iscsid service, using at least two different sessions.
When a network error occurs, the sessions can go into the failed state. If we then log in and log
out repeatedly, the iscsid service can end up in the D (uninterruptible sleep) state.
A colleague of mine reported this problem in an earlier email, including a long call trace,
but it got barely any feedback.
(https://lkml.org/lkml/2017/6/19/330)
**************
[Why it happens]
**************
In the latest kernel, the asynchronous part of sd_probe() is executed
in scsi_sd_probe_domain, and sd_remove() waits until all the
work in scsi_sd_probe_domain has finished. When we use iSCSI-based
remote storage and the network is broken, the following deadlock
can happen.
1. An iSCSI session login is in progress and calls sd_probe() to
probe a remote LUN. The synchronous part has finished, and the
asynchronous part is scheduled in scsi_sd_probe_domain; it will
submit I/O (SCSI commands) to obtain device information. When the
network is broken, the session goes into the ISCSI_SESSION_FAILED
state, and the I/O is retried until the session becomes
ISCSI_SESSION_FREE. As a result, the work in scsi_sd_probe_domain
hangs.
2. Meanwhile, the iscsi kernel module detects the network ping
timeout and raises an ISCSI_KEVENT_CONN_ERROR event. iscsid in
user space handles this event by triggering
ISCSI_UEVENT_DESTROY_SESSION. Destroying the session is
synchronous, and when it calls sd_remove() to remove the LUN,
it waits until all the work in scsi_sd_probe_domain finishes. As
a result, it hangs, and iscsid in user space goes into the D state,
where it is not killable and cannot handle any other
events.
****************
[How to reproduce]
****************
With the script below, I can always reproduce it on the latest kernel.
# create network errors
tc qdisc add dev eth1 root netem loss 60%
while true
do
    iscsiadm -m node -T xxxxxx --login
    sleep 5
    iscsiadm -m node -T xxxxxx --logout &
    iscsiadm -m node -T yyyyyy --login &
done
xxxxxx and yyyyyy are two different target names.
Connect to about 10 remote LUNs and run the script for about half an hour to reproduce the problem.
*******************
[How I avoid it for now]
*******************
To avoid this problem, I simply removed scsi_sd_probe_domain and called sd_probe_async() synchronously in sd_probe(),
so sd_remove() no longer needs to wait for the domain.
@@ -2986,7 +2986,40 @@ static int sd_probe(struct device *dev)
get_device(&sdkp->dev); /* prevent release before async_schedule */
- async_schedule_domain(sd_probe_async, sdkp, &scsi_sd_probe_domain);
+ sd_probe_async((void *)sdkp, 0);
I know this is not a good fix, so could you please give some advice about it?
=========
Problem 2:
***************
[What it looks like]
***************
When removing a SCSI device while a network error is happening, __blk_drain_queue() can hang forever.
# cat /proc/19160/stack
[<ffffffff8005886d>] msleep+0x1d/0x30
[<ffffffff80201a84>] __blk_drain_queue+0xe4/0x160
[<ffffffff80202766>] blk_cleanup_queue+0x106/0x2e0
[<ffffffffa000fb02>] __scsi_remove_device+0x52/0xc0 [scsi_mod]
[<ffffffffa000fb9b>] scsi_remove_device+0x2b/0x40 [scsi_mod]
[<ffffffffa000fbc0>] sdev_store_delete_callback+0x10/0x20 [scsi_mod]
[<ffffffff801a4e75>] sysfs_schedule_callback_work+0x15/0x80
[<ffffffff80062d69>] process_one_work+0x169/0x340
[<ffffffff800667e3>] worker_thread+0x183/0x490
[<ffffffff8006a526>] kthread+0x96/0xa0
[<ffffffff8041ebb4>] kernel_thread_helper+0x4/0x10
[<ffffffffffffffff>] 0xffffffffffffffff
The request queue of this device was stopped. So the following check will be true forever:
__blk_run_queue()
{
if (unlikely(blk_queue_stopped(q)))
return;
__blk_run_queue_uncond(q);
}
So __blk_run_queue_uncond() is never called, and the process hangs.
**************
[Why it happens]
**************
When the network error happened, the iscsi kernel module detected the ping timeout and
tried to recover the session. At this point the queue was stopped; in other words,
the session was blocked.
iscsi_start_session_recovery(session, conn, flag);
|-> iscsi_block_session(session->cls_session);
|-> blk_stop_queue(q)
The session should be unblocked once it recovers or the recovery times out.
But it was not unblocked properly, because scsi_remove_device() deleted the device
first and then called __blk_drain_queue().
__scsi_remove_device()
|-> device_del(dev)
|-> blk_cleanup_queue()
|-> scsi_request_fn()
|-> __blk_drain_queue()
At this time, the device was no longer on the children list of its parent device. So when
__iscsi_unblock_session() tried to unblock the parent device and its children, the removed
device could not be unblocked, and its queue stayed stopped forever.
__iscsi_unblock_session()
|-> scsi_target_unblock()
|-> device_for_each_child()
****************
[How to reproduce]
****************
Unfortunately I cannot reproduce it on the latest kernel.
The script below helps to reproduce it, but not reliably.
# create network error
tc qdisc add dev eth1 root netem loss 60%
# restart iscsid and rescan scsi bus again and again
while [ 1 ]
do
systemctl restart iscsid
rescan-scsi-bus (http://manpages.ubuntu.com/manpages/trusty/man8/rescan-scsi-bus.8.html)
done
**************
[How I resolve it]
**************
For now, I work around this problem by checking the QUEUE_FLAG_DYING flag in __blk_run_queue().
blk_cleanup_queue() sets QUEUE_FLAG_DYING and then calls __blk_drain_queue().
At this point, __scsi_remove_device() should already have set the scsi_device state to SDEV_DEL.
So if the queue is dying, we call __blk_run_queue_uncond() regardless of whether the queue is stopped,
and scsi_request_fn() then kills the remaining requests.
---
void __blk_run_queue(struct request_queue *q)
{
- if (unlikely(blk_queue_stopped(q)))
+ if (unlikely(blk_queue_stopped(q)) && unlikely(!blk_queue_dying(q)))
return;
__blk_run_queue_uncond(q);
--
Thanks
On Mon, 2017-08-14 at 11:23 +0000, Tangchen (UVP) wrote:
> Problem 2:
>
> ***************
> [What it looks like]
> ***************
> When remove a scsi device, and the network error happens, __blk_drain_queue() could hang forever.
>
> # cat /proc/19160/stack
> [<ffffffff8005886d>] msleep+0x1d/0x30
> [<ffffffff80201a84>] __blk_drain_queue+0xe4/0x160
> [<ffffffff80202766>] blk_cleanup_queue+0x106/0x2e0
> [<ffffffffa000fb02>] __scsi_remove_device+0x52/0xc0 [scsi_mod]
> [<ffffffffa000fb9b>] scsi_remove_device+0x2b/0x40 [scsi_mod]
> [<ffffffffa000fbc0>] sdev_store_delete_callback+0x10/0x20 [scsi_mod]
> [<ffffffff801a4e75>] sysfs_schedule_callback_work+0x15/0x80
> [<ffffffff80062d69>] process_one_work+0x169/0x340
> [<ffffffff800667e3>] worker_thread+0x183/0x490
> [<ffffffff8006a526>] kthread+0x96/0xa0
> [<ffffffff8041ebb4>] kernel_thread_helper+0x4/0x10
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> The request queue of this device was stopped. So the following check will be true forever:
> __blk_run_queue()
> {
> if (unlikely(blk_queue_stopped(q)))
> return;
>
> __blk_run_queue_uncond(q);
> }
>
> So __blk_run_queue_uncond() will never be called, and the process hang.
>
> [ ... ]
>
> ****************
> [How to reproduce]
> ****************
> Unfortunately I cannot reproduce it in the latest kernel.
> The script below will help to reproduce, but not very often.
>
> # create network error
> tc qdisc add dev eth1 root netem loss 60%
>
> # restart iscsid and rescan scsi bus again and again
> while [ 1 ]
> do
> systemctl restart iscsid
> rescan-scsi-bus (http://manpages.ubuntu.com/manpages/trusty/man8/rescan-scsi-bus.8.html)
> done
This should have been fixed by commit 36e3cf273977 ("scsi: Avoid that SCSI
queues get stuck"). The first mainline kernel that includes this commit is
kernel v4.11.
> void __blk_run_queue(struct request_queue *q)
> {
> - if (unlikely(blk_queue_stopped(q)))
> + if (unlikely(blk_queue_stopped(q)) && unlikely(!blk_queue_dying(q)))
> return;
>
> __blk_run_queue_uncond(q);
Are you aware that the single queue block layer is on its way out and will
be removed sooner or later? Please focus your testing on scsi-mq.
Regarding the above patch: it is wrong because it will cause lockups during
path removal for other block drivers. Please drop this patch.
Bart.
Hi, Bart,
Thank you very much for the quick response.
But I'm not using mq; I ran into these two problems on a non-mq system.
The patch you pointed out is a fix for mq, so I don't think it can resolve this problem.
IIUC, mq is for SSDs? I'm not using SSDs, so mq is disabled.
On Tue, 2017-08-15 at 02:16 +0000, Tangchen (UVP) wrote:
> But I'm not using mq, and I run into these two problems in a non-mq system.
> The patch you pointed out is fix for mq, so I don't think it can resolve this problem.
>
> IIUC, mq is for SSD ? I'm not using ssd, so mq is disabled.
Hello Tangchen,
Please post replies below the original e-mail instead of above - that is the
reply style used on all Linux-related mailing lists I know of. From
https://en.wikipedia.org/wiki/Posting_style:
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?
Regarding your question: sorry but I quoted the wrong commit in my previous
e-mail. The commit I should have referred to is 255ee9320e5d ("scsi: Make
__scsi_remove_device go straight from BLOCKED to DEL"). That patch not only
affects scsi-mq but also the single-queue code in the SCSI core.
blk-mq/scsi-mq was introduced for SSDs but is not only intended for SSDs.
The plan is to remove the blk-sq/scsi-sq code once the blk-mq/scsi-mq code
works at least as fast as the single queue code for all supported devices.
That includes hard disks.
Bart.
> On Tue, 2017-08-15 at 02:16 +0000, Tangchen (UVP) wrote:
> > But I'm not using mq, and I run into these two problems in a non-mq system.
> > The patch you pointed out is fix for mq, so I don't think it can resolve this
> problem.
> >
> > IIUC, mq is for SSD ? I'm not using ssd, so mq is disabled.
>
> Hello Tangchen,
>
> Please post replies below the original e-mail instead of above - that is the reply
> style used on all Linux-related mailing lists I know of. From
> https://en.wikipedia.org/wiki/Posting_style:
>
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?
> A: Top-posting.
> Q: What is the most annoying thing in e-mail?
Hi Bart,
Thanks for the reply. I will post replies below the original e-mail from now on. :)
>
> Regarding your question: sorry but I quoted the wrong commit in my previous
> e-mail. The commit I should have referred to is 255ee9320e5d ("scsi: Make
> __scsi_remove_device go straight from BLOCKED to DEL"). That patch not only
> affects scsi-mq but also the single-queue code in the SCSI core.
OK, I'll try this one. Thx.
>
> blk-mq/scsi-mq was introduced for SSDs but is not only intended for SSDs.
> The plan is to remove the blk-sq/scsi-sq code once the blk-mq/scsi-mq code
> works at least as fast as the single queue code for all supported devices.
> That includes hard disks.
OK, thanks for telling me.
>
> Bart.