2023-10-16 02:04:20

by Wenchao Hao

[permalink] [raw]
Subject: [PATCH v3 0/4] SCSI: Fix issues between removing device and error handle

I am testing SCSI error handle with my previous scsi_debug error
injection patches, and found some issues when removing device and
error handler happened together.

These issues are triggered because devices in removing would be skipped
when calling shost_for_each_device().

Three issues are found:
1. statistic info printed at beginning of scsi_error_handler is wrong
2. device reset is not triggered
3. IO requeued to request_queue would be hang after error handle

V3:
- Update patch description
- Update comments of functions added

V2:
- Fix IO hang by run all devices' queue after error handler
- Do not modify shost_for_each_device() directly but add a new
helper to iterate devices but do not skip devices in removing

Wenchao Hao (4):
scsi: core: Add new helper to iterate all devices of host
scsi: scsi_error: Fix wrong statistic when print error info
scsi: scsi_error: Fix device reset is not triggered
scsi: scsi_core: Fix IO hang when device removing

drivers/scsi/scsi.c | 46 ++++++++++++++++++++++++++------------
drivers/scsi/scsi_error.c | 4 ++--
drivers/scsi/scsi_lib.c | 2 +-
include/scsi/scsi_device.h | 25 ++++++++++++++++++---
4 files changed, 57 insertions(+), 20 deletions(-)

--
2.32.0


2023-10-16 02:04:27

by Wenchao Hao

[permalink] [raw]
Subject: [PATCH v3 2/4] scsi: scsi_error: Fix wrong statistic when print error info

shost_for_each_device() would skip devices which is in progress of
removing, so commands of these devices would be ignored in
scsi_eh_prt_fail_stats().

Fix this issue by using shost_for_each_device_include_deleted()
to iterate devices in scsi_eh_prt_fail_stats().

Signed-off-by: Wenchao Hao <[email protected]>
---
drivers/scsi/scsi_error.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index c67cdcdc3ba8..2550f8cd182a 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -407,7 +407,7 @@ static inline void scsi_eh_prt_fail_stats(struct Scsi_Host *shost,
int cmd_cancel = 0;
int devices_failed = 0;

- shost_for_each_device(sdev, shost) {
+ shost_for_each_device_include_deleted(sdev, shost) {
list_for_each_entry(scmd, work_q, eh_entry) {
if (scmd->device == sdev) {
++total_failures;
--
2.32.0

2023-10-16 02:04:32

by Wenchao Hao

[permalink] [raw]
Subject: [PATCH v3 4/4] scsi: scsi_core: Fix IO hang when device removing

shost_for_each_device() would skip devices which is in progress of
removing, so scsi_run_queue() for these devices would be skipped in
scsi_run_host_queues() after blocking hosts' IO.

IO hang would be caused if return true when state is SDEV_CANCEL with
following order:

T1: T2:scsi_error_handler
__scsi_remove_device()
scsi_device_set_state(sdev, SDEV_CANCEL)
...
sd_remove()
del_gendisk()
blk_mq_freeze_queue_wait()
scsi_eh_flush_done_q()
scsi_queue_insert(scmd,...)

scsi_queue_insert() would not kick device's queue since commit
8b566edbdbfb ("scsi: core: Only kick the requeue list if necessary")

After scsi_unjam_host(), the scsi error handler would call
scsi_run_host_queues() to trigger run queue for devices, while it
would not run queue for devices which is in progress of removing
because shost_for_each_device() would skip them.

So the requests added to these queues would not be handled any more,
and the removing device process would hang too.

Fix this issue by using shost_for_each_device_include_deleted() in
scsi_run_host_queues() to trigger a run queue for devices in removing.

Signed-off-by: Wenchao Hao <[email protected]>
---
drivers/scsi/scsi_lib.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 195ca80667d0..40f407ffd26f 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -466,7 +466,7 @@ void scsi_run_host_queues(struct Scsi_Host *shost)
{
struct scsi_device *sdev;

- shost_for_each_device(sdev, shost)
+ shost_for_each_device_include_deleted(sdev, shost)
scsi_run_queue(sdev->request_queue);
}

--
2.32.0

2023-10-16 02:04:34

by Wenchao Hao

[permalink] [raw]
Subject: [PATCH v3 1/4] scsi: core: Add new helper to iterate all devices of host

shost_for_each_device() would skip devices which is in SDEV_CANCEL or
SDEV_DEL state, for some scenarios, we donot want to skip these devices,
so add a new macro shost_for_each_device_include_deleted() to handle it.

Following changes are introduced:

1. Rework scsi_device_get(), add new helper __scsi_device_get() which
determine if skip deleted scsi_device by parameter "skip_deleted".
2. Add new parameter "skip_deleted" to __scsi_iterate_devices() which
is used when calling __scsi_device_get()
3. Update shost_for_each_device() to call __scsi_iterate_devices() with
"skip_deleted" true
4. Add new macro shost_for_each_device_include_deleted() which call
__scsi_iterate_devices() with "skip_deleted" false

Signed-off-by: Wenchao Hao <[email protected]>
---
drivers/scsi/scsi.c | 46 ++++++++++++++++++++++++++------------
include/scsi/scsi_device.h | 25 ++++++++++++++++++---
2 files changed, 54 insertions(+), 17 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index d1c0ba3ef1f5..a9d695841250 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -704,20 +704,18 @@ int scsi_cdl_enable(struct scsi_device *sdev, bool enable)
return 0;
}

-/**
- * scsi_device_get - get an additional reference to a scsi_device
+/*
+ * __scsi_device_get - get an additional reference to a scsi_device
* @sdev: device to get a reference to
- *
- * Description: Gets a reference to the scsi_device and increments the use count
- * of the underlying LLDD module. You must hold host_lock of the
- * parent Scsi_Host or already have a reference when calling this.
- *
- * This will fail if a device is deleted or cancelled, or when the LLD module
- * is in the process of being unloaded.
+ * @skip_deleted: when true, would return failed if device is deleted
*/
-int scsi_device_get(struct scsi_device *sdev)
+static int __scsi_device_get(struct scsi_device *sdev, bool skip_deleted)
{
- if (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL)
+ /*
+ * if skip_deleted is true and device is in removing, return failed
+ */
+ if (skip_deleted &&
+ (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL))
goto fail;
if (!try_module_get(sdev->host->hostt->module))
goto fail;
@@ -730,6 +728,22 @@ int scsi_device_get(struct scsi_device *sdev)
fail:
return -ENXIO;
}
+
+/**
+ * scsi_device_get - get an additional reference to a scsi_device
+ * @sdev: device to get a reference to
+ *
+ * Description: Gets a reference to the scsi_device and increments the use count
+ * of the underlying LLDD module. You must hold host_lock of the
+ * parent Scsi_Host or already have a reference when calling this.
+ *
+ * This will fail if a device is deleted or cancelled, or when the LLD module
+ * is in the process of being unloaded.
+ */
+int scsi_device_get(struct scsi_device *sdev)
+{
+ return __scsi_device_get(sdev, 0);
+}
EXPORT_SYMBOL(scsi_device_get);

/**
@@ -749,9 +763,13 @@ void scsi_device_put(struct scsi_device *sdev)
}
EXPORT_SYMBOL(scsi_device_put);

-/* helper for shost_for_each_device, see that for documentation */
+/**
+ * helper for shost_for_each_device, see that for documentation
+ * @skip_deleted: if true, sdev in progress of removing would be skipped
+ */
struct scsi_device *__scsi_iterate_devices(struct Scsi_Host *shost,
- struct scsi_device *prev)
+ struct scsi_device *prev,
+ bool skip_deleted)
{
struct list_head *list = (prev ? &prev->siblings : &shost->__devices);
struct scsi_device *next = NULL;
@@ -761,7 +779,7 @@ struct scsi_device *__scsi_iterate_devices(struct Scsi_Host *shost,
while (list->next != &shost->__devices) {
next = list_entry(list->next, struct scsi_device, siblings);
/* skip devices that we can't get a reference to */
- if (!scsi_device_get(next))
+ if (!__scsi_device_get(next, skip_deleted))
break;
next = NULL;
list = list->next;
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index b9230b6add04..ed02755bbc42 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -390,7 +390,8 @@ extern void __starget_for_each_device(struct scsi_target *, void *,

/* only exposed to implement shost_for_each_device */
extern struct scsi_device *__scsi_iterate_devices(struct Scsi_Host *,
- struct scsi_device *);
+ struct scsi_device *,
+ bool);

/**
* shost_for_each_device - iterate over all devices of a host
@@ -400,11 +401,29 @@ extern struct scsi_device *__scsi_iterate_devices(struct Scsi_Host *,
* Iterator that returns each device attached to @shost. This loop
* takes a reference on each device and releases it at the end. If
* you break out of the loop, you must call scsi_device_put(sdev).
+ *
+ * Note: this macro would skip sdev which is in progress of removing
*/
#define shost_for_each_device(sdev, shost) \
- for ((sdev) = __scsi_iterate_devices((shost), NULL); \
+ for ((sdev) = __scsi_iterate_devices((shost), NULL, 1); \
+ (sdev); \
+ (sdev) = __scsi_iterate_devices((shost), (sdev), 1))
+
+/*
+ * shost_for_each_device_include_deleted- iterate over all devices of a host
+ * @sdev: the &struct scsi_device to use as a cursor
+ * @shost: the &struct scsi_host to iterate over
+ *
+ * Iterator that returns each device attached to @shost. This loop
+ * takes a reference on each device and releases it at the end. If
+ * you break out of the loop, you must call scsi_device_put(sdev).
+ *
+ * Note: this macro would include sdev which is in progress of removing
+ */
+#define shost_for_each_device_include_deleted(sdev, shost) \
+ for ((sdev) = __scsi_iterate_devices((shost), NULL, 0); \
(sdev); \
- (sdev) = __scsi_iterate_devices((shost), (sdev)))
+ (sdev) = __scsi_iterate_devices((shost), (sdev), 0))

/**
* __shost_for_each_device - iterate over all devices of a host (UNLOCKED)
--
2.32.0

2023-10-16 02:04:42

by Wenchao Hao

[permalink] [raw]
Subject: [PATCH v3 3/4] scsi: scsi_error: Fix device reset is not triggered

shost_for_each_device() would skip devices which is in progress of
removing, so scsi_try_bus_device_reset() for these devices would be
skipped in scsi_eh_bus_device_reset() with following order:

T1: T2:scsi_error_handle
__scsi_remove_device
scsi_device_set_state(sdev, SDEV_DEL)
// would skip device with SDEV_DEL state
shost_for_each_device()
scsi_try_bus_device_reset
flush all commands
...
releasing and free scsi_device

Some drivers like smartpqi only implement eh_device_reset_handler,
if device reset is skipped, the commands which had been sent to
firmware or devices hardware are not cleared. The error handle
would flush all these commands in scsi_unjam_host().

When the commands are finished by hardware, use after free issue is
triggered.

Fix this issue by using shost_for_each_device_include_deleted()
to iterate devices in scsi_eh_bus_device_reset().

Signed-off-by: Wenchao Hao <[email protected]>
---
drivers/scsi/scsi_error.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 2550f8cd182a..57e3cc556549 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1568,7 +1568,7 @@ static int scsi_eh_bus_device_reset(struct Scsi_Host *shost,
struct scsi_device *sdev;
enum scsi_disposition rtn;

- shost_for_each_device(sdev, shost) {
+ shost_for_each_device_include_deleted(sdev, shost) {
if (scsi_host_eh_past_deadline(shost)) {
SCSI_LOG_ERROR_RECOVERY(3,
sdev_printk(KERN_INFO, sdev,
--
2.32.0

2023-10-17 15:33:10

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] scsi: core: Add new helper to iterate all devices of host

Hi Wenchao,

kernel test robot noticed the following build warnings:

[auto build test WARNING on mkp-scsi/for-next]
[also build test WARNING on jejb-scsi/for-next linus/master v6.6-rc6 next-20231017]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Wenchao-Hao/scsi-core-Add-new-helper-to-iterate-all-devices-of-host/20231017-140049
base: https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next
patch link: https://lore.kernel.org/r/20231016020314.1269636-2-haowenchao2%40huawei.com
patch subject: [PATCH v3 1/4] scsi: core: Add new helper to iterate all devices of host
config: m68k-allyesconfig (https://download.01.org/0day-ci/archive/20231017/[email protected]/config)
compiler: m68k-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231017/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

>> drivers/scsi/scsi.c:767: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* helper for shost_for_each_device, see that for documentation


vim +767 drivers/scsi/scsi.c

765
766 /**
> 767 * helper for shost_for_each_device, see that for documentation
768 * @skip_deleted: if true, sdev in progress of removing would be skipped
769 */
770 struct scsi_device *__scsi_iterate_devices(struct Scsi_Host *shost,
771 struct scsi_device *prev,
772 bool skip_deleted)
773 {
774 struct list_head *list = (prev ? &prev->siblings : &shost->__devices);
775 struct scsi_device *next = NULL;
776 unsigned long flags;
777
778 spin_lock_irqsave(shost->host_lock, flags);
779 while (list->next != &shost->__devices) {
780 next = list_entry(list->next, struct scsi_device, siblings);
781 /* skip devices that we can't get a reference to */
782 if (!__scsi_device_get(next, skip_deleted))
783 break;
784 next = NULL;
785 list = list->next;
786 }
787 spin_unlock_irqrestore(shost->host_lock, flags);
788
789 if (prev)
790 scsi_device_put(prev);
791 return next;
792 }
793 EXPORT_SYMBOL(__scsi_iterate_devices);
794

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-10-17 17:00:59

by Wenchao Hao

[permalink] [raw]
Subject: Re: [PATCH v3 0/4] SCSI: Fix issues between removing device and error handle

On Mon, Oct 16, 2023 at 10:04 AM Wenchao Hao <[email protected]> wrote:
>
> I am testing SCSI error handle with my previous scsi_debug error
> injection patches, and found some issues when removing device and
> error handler happened together.
>
> These issues are triggered because devices in removing would be skipped
> when calling shost_for_each_device().

Friendly ping...

>
> Three issues are found:
> 1. statistic info printed at beginning of scsi_error_handler is wrong
> 2. device reset is not triggered
> 3. IO requeued to request_queue would be hang after error handle
>
> V3:
> - Update patch description
> - Update comments of functions added
>
> V2:
> - Fix IO hang by run all devices' queue after error handler
> - Do not modify shost_for_each_device() directly but add a new
> helper to iterate devices but do not skip devices in removing
>
> Wenchao Hao (4):
> scsi: core: Add new helper to iterate all devices of host
> scsi: scsi_error: Fix wrong statistic when print error info
> scsi: scsi_error: Fix device reset is not triggered
> scsi: scsi_core: Fix IO hang when device removing
>
> drivers/scsi/scsi.c | 46 ++++++++++++++++++++++++++------------
> drivers/scsi/scsi_error.c | 4 ++--
> drivers/scsi/scsi_lib.c | 2 +-
> include/scsi/scsi_device.h | 25 ++++++++++++++++++---
> 4 files changed, 57 insertions(+), 20 deletions(-)
>
> --
> 2.32.0
>

2023-10-17 21:42:09

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH v3 0/4] SCSI: Fix issues between removing device and error handle


On 10/17/23 10:00, Wenchao Hao wrote:
> On Mon, Oct 16, 2023 at 10:04 AM Wenchao Hao <[email protected]> wrote:
>>
>> I am testing SCSI error handle with my previous scsi_debug error
>> injection patches, and found some issues when removing device and
>> error handler happened together.
>>
>> These issues are triggered because devices in removing would be skipped
>> when calling shost_for_each_device().
>
> Friendly ping...

The patch series was posted on October 15, 7 PM PDT and the ping has
been posted on October 17, 10 AM PDT. That's less than two days after
the patch series was posted. Isn't that way too soon to post a "ping"?

Thanks,

Bart.

2023-10-18 13:51:38

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH v3 0/4] SCSI: Fix issues between removing device and error handle

On 10/17/23 18:37, Wenchao Hao wrote:
> The previous version was posted on 2023/9/28 but not reviewed, so I
> ping soon after repost.

Since a repost counts as a ping, I think posting a ping soon after
reposting is considered aggressive.

Bart.

2023-10-18 18:03:31

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH v3 0/4] SCSI: Fix issues between removing device and error handle

On 10/18/23 07:40, Wenchao Hao wrote:
> On Wed, Oct 18, 2023 at 9:51 PM Bart Van Assche <[email protected]> wrote:
>>
>> On 10/17/23 18:37, Wenchao Hao wrote:
>>> The previous version was posted on 2023/9/28 but not reviewed, so I
>>> ping soon after repost.
>>
>> Since a repost counts as a ping, I think posting a ping soon after
>> reposting is considered aggressive.
>
> I didn't mean that, then how long is appropriate to post a ping?

It depends on how busy the reviewers are. I wait at least one week
before sending out a ping or reposting a patch series.

Bart.

2023-11-15 16:26:11

by Wenchao Hao

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] scsi: scsi_core: Fix IO hang when device removing

On 11/15/23 5:47 AM, Mike Christie wrote:
> On 11/14/23 3:23 PM, Mike Christie wrote:
>> On 10/15/23 9:03 PM, Wenchao Hao wrote:
>>> shost_for_each_device() would skip devices which is in progress of
>>> removing, so scsi_run_queue() for these devices would be skipped in
>>> scsi_run_host_queues() after blocking hosts' IO.
>>>
>>> IO hang would be caused if return true when state is SDEV_CANCEL with
>>> following order:
>>>
>>> T1: T2:scsi_error_handler
>>> __scsi_remove_device()
>>> scsi_device_set_state(sdev, SDEV_CANCEL)
>>> ...
>>> sd_remove()
>>> del_gendisk()
>>> blk_mq_freeze_queue_wait()
>>> scsi_eh_flush_done_q()
>>> scsi_queue_insert(scmd,...)
>>>
>>> scsi_queue_insert() would not kick device's queue since commit
>>> 8b566edbdbfb ("scsi: core: Only kick the requeue list if necessary")
>>>
>>> After scsi_unjam_host(), the scsi error handler would call
>>> scsi_run_host_queues() to trigger run queue for devices, while it
>>> would not run queue for devices which is in progress of removing
>>> because shost_for_each_device() would skip them.
>>>
>>> So the requests added to these queues would not be handled any more,
>>> and the removing device process would hang too.
>>>
>>> Fix this issue by using shost_for_each_device_include_deleted() in
>>> scsi_run_host_queues() to trigger a run queue for devices in removing.
>>>
>>> Signed-off-by: Wenchao Hao <[email protected]>
>>> ---
>>> drivers/scsi/scsi_lib.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>>> index 195ca80667d0..40f407ffd26f 100644
>>> --- a/drivers/scsi/scsi_lib.c
>>> +++ b/drivers/scsi/scsi_lib.c
>>> @@ -466,7 +466,7 @@ void scsi_run_host_queues(struct Scsi_Host *shost)
>>> {
>>> struct scsi_device *sdev;
>>>
>>> - shost_for_each_device(sdev, shost)
>>> + shost_for_each_device_include_deleted(sdev, shost)
>>> scsi_run_queue(sdev->request_queue);
>>
>> What happens if there were no commands for the device that
>> was destroyed and we race with this code and device deletion?
>>
>> So thread1 has set the device state tp SDEV_DEL and has finished
>> blk_mq_destroy_queue because there were no commands running.
>>
>> The above eh thread, then is calling:
>>
>> scsi_run_queue -> blk_mq_kick_requeue_list
>>
>> and that queues the requeue work.
>>
>> blk_mq_destroy_queue had done blk_mq_cancel_work_sync but
>> blk_mq_kick_requeue_list just added it back on the kblockd_workqueue.
>>
>> When __scsi_iterate_devices does scsi_device_put it would call
>> scsi_device_dev_release and call blk_put_queue which frees the
>> request_queue while it's requeue work might still be queued on
>> kblockd_workqueue.
>>

Hi Mike, thank you for the review.

Sorry I did not take the above flow into consideration and it's a bug
should be fixed in next version.

>
> Oh yeah, for your other lun/target reset patches were you trying to
> do something where you have a list for each scsi_device or a list of
> scsi_devices that needed error handler work? If so, maybe break that
> part out and use it here first.
>

The lun/target reset changes are not general for all drivers in my
design, so it should not work here.

> You can then just loop over the list of devices that needed work and
> start those above.

What about introduce a new flag "recovery" for each scsi_device to mark
if there is error command happened on it, the new flag is set in
scsi_eh_scmd_add() and cleared after error handle finished.

Since clear is always after scsi_error_handle() is waked up and no more
scsi_eh_scmd_add() would be called after scsi_error_handle() is waked
up, we do not need lock between set and clear this flag.

This change can help me to fix the issue you described above too.

Here is a brief changes:

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index c67cdcdc3ba8..36af294c2cef 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -310,6 +310,8 @@ void scsi_eh_scmd_add(struct scsi_cmnd *scmd)
if (shost->eh_deadline != -1 && !shost->last_reset)
shost->last_reset = jiffies;

+ scmd->device->recovery = 1;
+
scsi_eh_reset(scmd);
list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q);
spin_unlock_irqrestore(shost->host_lock, flags);
@@ -2149,7 +2151,7 @@ static void scsi_restart_operations(struct Scsi_Host *shost)
* now that error recovery is done, we will need to ensure that these
* requests are started.
*/
- scsi_run_host_queues(shost);
+ scsi_run_host_recovery_queues(shost);

/*
* if eh is active and host_eh_scheduled is pending we need to re-run
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index cf3864f72093..0bf4423b6b9a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -470,6 +470,17 @@ void scsi_run_host_queues(struct Scsi_Host *shost)
scsi_run_queue(sdev->request_queue);
}

+void scsi_run_host_recovery_queues(struct Scsi_Host *shost)
+{
+ struct scsi_device *sdev;
+
+ shost_for_each_device_include_deleted(sdev, shost)
+ if (sdev->recovery) {
+ scsi_run_queue(sdev->request_queue);
+ sdev->recovery = 0;
+ }
+}
+
static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
{
if (!blk_rq_is_passthrough(scsi_cmd_to_rq(cmd))) {
diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
index 3f0dfb97db6b..3aba8ddd0101 100644
--- a/drivers/scsi/scsi_priv.h
+++ b/drivers/scsi/scsi_priv.h
@@ -107,6 +107,7 @@ extern void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd);
extern void scsi_queue_insert(struct scsi_cmnd *cmd, int reason);
extern void scsi_io_completion(struct scsi_cmnd *, unsigned int);
extern void scsi_run_host_queues(struct Scsi_Host *shost);
+extern void scsi_run_host_recovery_queues(struct Scsi_Host *shost);
extern void scsi_requeue_run_queue(struct work_struct *work);
extern void scsi_start_queue(struct scsi_device *sdev);
extern int scsi_mq_setup_tags(struct Scsi_Host *shost);
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 10480eb582b2..b730ceab9996 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -239,6 +239,7 @@ struct scsi_device {

unsigned cdl_supported:1; /* Command duration limits supported */
unsigned cdl_enable:1; /* Enable/disable Command duration limits */
+ unsigned recovery; /* Mark it error command happened */

unsigned int queue_stopped; /* request queue is quiesced */


2024-03-07 14:36:28

by Wenchao Hao

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] scsi: scsi_core: Fix IO hang when device removing

On 2023/10/16 10:03, Wenchao Hao wrote:
> shost_for_each_device() would skip devices which is in progress of
> removing, so scsi_run_queue() for these devices would be skipped in
> scsi_run_host_queues() after blocking hosts' IO.
>
> IO hang would be caused if return true when state is SDEV_CANCEL with
> following order:
>
> T1: T2:scsi_error_handler
> __scsi_remove_device()
> scsi_device_set_state(sdev, SDEV_CANCEL)
> ...
> sd_remove()
> del_gendisk()
> blk_mq_freeze_queue_wait()
> scsi_eh_flush_done_q()
> scsi_queue_insert(scmd,...)
>
> scsi_queue_insert() would not kick device's queue since commit
> 8b566edbdbfb ("scsi: core: Only kick the requeue list if necessary")
>
> After scsi_unjam_host(), the scsi error handler would call
> scsi_run_host_queues() to trigger run queue for devices, while it
> would not run queue for devices which is in progress of removing
> because shost_for_each_device() would skip them.
>
> So the requests added to these queues would not be handled any more,
> and the removing device process would hang too.
>
> Fix this issue by using shost_for_each_device_include_deleted() in
> scsi_run_host_queues() to trigger a run queue for devices in removing.
>

This issue is fixed by commit '6df0e077d76bd (scsi: core: Kick the requeue
list after inserting when flushing)', so do not need any more.

> Signed-off-by: Wenchao Hao <[email protected]>
> ---
> drivers/scsi/scsi_lib.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 195ca80667d0..40f407ffd26f 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -466,7 +466,7 @@ void scsi_run_host_queues(struct Scsi_Host *shost)
> {
> struct scsi_device *sdev;
>
> - shost_for_each_device(sdev, shost)
> + shost_for_each_device_include_deleted(sdev, shost)
> scsi_run_queue(sdev->request_queue);
> }
>